Benchmarking organizations with the Dimensions API

This Python notebook shows how to use the Dimensions Analytics API to run several benchmarking analyses of organizations using publications data.

Outline

Quick yet effective benchmarking calculations via built-in API aggregate indicators
Building more complex quality benchmarking indicators

[2]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Feb 21, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[3]:

!pip install dimcli -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
import pandas as pd

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. Quick benchmarking using the API

Benchmarking is reasonably straightforward when what you want to compare is publication volume or one of the aggregate indicators available in the Dimensions API (see https://docs.dimensions.ai/dsl/examples.html#indicators-aggregations).
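The three queries that follow differ only in the indicator name, so they can be generated from one template. The sketch below is a minimal illustration: `build_indicator_query` and `INDICATORS` are hypothetical names introduced here, not part of Dimcli, and the indicator list is limited to the ones used in this notebook.

```python
# Indicator names taken from the queries used in this notebook.
INDICATORS = ("altmetric_median", "citations_total", "recent_citations_total")


def build_indicator_query(indicator: str, facet: str = "research_orgs[name]") -> str:
    """Return a DSL query string aggregating `indicator` over `facet`.

    Hypothetical helper for illustration; it only formats text.
    """
    if indicator not in INDICATORS:
        raise ValueError(f"unexpected indicator: {indicator!r}")
    return f"search publications\nreturn {facet} aggregate {indicator}"


for name in INDICATORS:
    print(build_indicator_query(name))
    print("--")
```

Each string could then be passed to `dsl.query(...).as_dataframe()`, the same call pattern used later in this notebook.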

[4]:

%%dsldf
search publications
return research_orgs[name] aggregate altmetric_median

Returned Research_orgs: 20
Time: 21.14s

[4]:

	altmetric_median	count	id	name
0	5.0	546592	grid.38142.3c	Harvard University
1	3.0	484017	grid.26999.3d	University of Tokyo
2	4.0	342764	grid.17063.33	University of Toronto
3	3.0	320966	grid.214458.e	University of Michigan
4	3.0	310485	grid.258799.8	Kyoto University
5	4.0	302094	grid.168010.e	Stanford University
6	4.0	297558	grid.34477.33	University of Washington
7	3.0	297094	grid.19006.3e	University of California, Los Angeles
8	5.0	289280	grid.4991.5	University of Oxford
9	4.0	285143	grid.21107.35	Johns Hopkins University
10	4.0	282170	grid.5335.0	University of Cambridge
11	2.0	280405	grid.11899.38	University of São Paulo
12	4.0	271170	grid.25879.31	University of Pennsylvania
13	4.0	266337	grid.83440.3b	University College London
14	3.0	265592	grid.136593.b	Osaka University
15	3.0	250749	grid.69566.3a	Tohoku University
16	3.0	244713	grid.5386.8	Cornell University
17	4.0	242749	grid.47840.3f	University of California, Berkeley
18	3.0	239283	grid.17635.36	University of Minnesota
19	4.0	236142	grid.21729.3f	Columbia University

[5]:

%%dsldf
search publications
return research_orgs[name] aggregate citations_total

Returned Research_orgs: 20
Time: 6.63s

[5]:

	citations_total	count	id	name
0	28836616.0	546592	grid.38142.3c	Harvard University
1	8545148.0	484017	grid.26999.3d	University of Tokyo
2	11040840.0	342764	grid.17063.33	University of Toronto
3	11710248.0	320966	grid.214458.e	University of Michigan
4	5928948.0	310485	grid.258799.8	Kyoto University
5	14738599.0	302094	grid.168010.e	Stanford University
6	12585381.0	297558	grid.34477.33	University of Washington
7	11710928.0	297094	grid.19006.3e	University of California, Los Angeles
8	10879614.0	289280	grid.4991.5	University of Oxford
9	12084053.0	285143	grid.21107.35	Johns Hopkins University
10	10814051.0	282170	grid.5335.0	University of Cambridge
11	4105653.0	280405	grid.11899.38	University of São Paulo
12	10450691.0	271170	grid.25879.31	University of Pennsylvania
13	9614297.0	266337	grid.83440.3b	University College London
14	4653874.0	265592	grid.136593.b	Osaka University
15	3694359.0	250749	grid.69566.3a	Tohoku University
16	9370701.0	244713	grid.5386.8	Cornell University
17	11806056.0	242749	grid.47840.3f	University of California, Berkeley
18	8360048.0	239283	grid.17635.36	University of Minnesota
19	9400497.0	236142	grid.21729.3f	Columbia University

[6]:

%%dsldf
search publications
return research_orgs[name] aggregate recent_citations_total

Returned Research_orgs: 20
Time: 6.54s

[6]:

	count	id	name	recent_citations_total
0	546592	grid.38142.3c	Harvard University	5562378.0
1	484017	grid.26999.3d	University of Tokyo	1471000.0
2	342764	grid.17063.33	University of Toronto	2380994.0
3	320966	grid.214458.e	University of Michigan	2370219.0
4	310485	grid.258799.8	Kyoto University	1006685.0
5	302094	grid.168010.e	Stanford University	2985116.0
6	297558	grid.34477.33	University of Washington	2411827.0
7	297094	grid.19006.3e	University of California, Los Angeles	2137101.0
8	289280	grid.4991.5	University of Oxford	2504619.0
9	285143	grid.21107.35	Johns Hopkins University	2352686.0
10	282170	grid.5335.0	University of Cambridge	2110364.0
11	280405	grid.11899.38	University of São Paulo	1124894.0
12	271170	grid.25879.31	University of Pennsylvania	2049126.0
13	266337	grid.83440.3b	University College London	2197569.0
14	265592	grid.136593.b	Osaka University	727151.0
15	250749	grid.69566.3a	Tohoku University	644246.0
16	244713	grid.5386.8	Cornell University	1809884.0
17	242749	grid.47840.3f	University of California, Berkeley	2057506.0
18	239283	grid.17635.36	University of Minnesota	1519539.0
19	236142	grid.21729.3f	Columbia University	1754780.0

Aside: Recent Citations

[7]:

%%dsldf
search publications
return year aggregate recent_citations_total

Returned Year: 20
Time: 4.06s

[7]:

	count	id	recent_citations_total
0	6503486	2020	18375337.0
1	6391947	2021	4632716.0
2	5792555	2019	22470145.0
3	5369555	2018	23030935.0
4	5044596	2017	21362603.0
5	4598245	2016	19046830.0
6	4395107	2015	17010283.0
7	4244049	2014	15057104.0
8	4046162	2013	13475978.0
9	3762532	2012	11970228.0
10	3667073	2011	10958039.0
11	3430544	2010	9915351.0
12	3144460	2009	8991871.0
13	2937393	2008	7853718.0
14	2915691	2007	7198101.0
15	2610760	2006	6579372.0
16	2410569	2005	5985630.0
17	2246194	2004	5335870.0
18	2037978	2003	4730168.0
19	1892417	2002	4234096.0

[8]:

dsl_last_results.sort_values(by='id').plot(x='id', y='recent_citations_total', figsize=(20,10))

[8]:

<AxesSubplot:xlabel='id'>

[Figure: line plot of recent_citations_total by publication year (id)]

[9]:

recent_citations = dsl_last_results

[10]:

recent_citations['recent_ratio'] = recent_citations['recent_citations_total']/recent_citations['count']
recent_citations['year'] = recent_citations['id']

[11]:

recent_citations.sort_values(by='year').\
    plot(x='year',y='recent_ratio', figsize=(20,10))

[11]:

<AxesSubplot:xlabel='year'>

[Figure: line plot of recent_ratio (recent citations per publication) by year]
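The ratio plotted here divides recent citations by the number of publications in each year, which makes years of different size comparable. As a quick check, the sketch below recomputes it for four rows copied from the output of cell [7]; the variable name `sample` is introduced only for this illustration.

```python
import pandas as pd

# Four rows copied from the output of cell [7] above.
sample = pd.DataFrame({
    "id": [2021, 2020, 2019, 2018],
    "count": [6391947, 6503486, 5792555, 5369555],
    "recent_citations_total": [4632716.0, 18375337.0, 22470145.0, 23030935.0],
})

# Recent citations per publication, as in cell [10].
sample["recent_ratio"] = sample["recent_citations_total"] / sample["count"]

print(sample.sort_values("id")[["id", "recent_ratio"]].round(2))
```

In these rows the most recent year has the lowest ratio, presumably because its publications have had the least time to be cited.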

End Aside:

2. Calculating more complex ‘Quality’ benchmarking indicators: number of articles in the top X percent of research in their category

Step 1. Retrieve the total volume of publications by category (focusing on Fields of Research)

[12]:

%%dsldf

search publications
where year=2018
return category_for limit 1000

Returned Category_for: 176
Time: 1.02s

[12]:

	count	id	name
0	1168442	2211	11 Medical and Health Sciences
1	610238	2209	09 Engineering
2	447354	3053	1103 Clinical Sciences
3	335403	2206	06 Biological Sciences
4	332128	2208	08 Information and Computing Sciences
...	...	...	...
171	187	3528	1899 Other Law and Legal Studies
172	144	3491	1799 Other Psychology and Cognitive Sciences
173	72	3567	1999 Other Studies In Creative Arts and Writing
174	62	3240	1299 Other Built Environment and Design
175	21	3223	1204 Engineering Design

176 rows × 3 columns

Step 1.2. Filter for the top-level (2-digit) codes, i.e. level == 2

[13]:

result = dsl.query("""
      search publications
      where year=2018
      return category_for limit 1000

""").as_dataframe()

Returned Category_for: 176
Time: 0.84s

[14]:

result['level'] = result.name.apply(lambda n: len(n.split(' ')[0]))

[15]:

result

[15]:

	count	id	name	level
0	1168442	2211	11 Medical and Health Sciences	2
1	610238	2209	09 Engineering	2
2	447354	3053	1103 Clinical Sciences	4
3	335403	2206	06 Biological Sciences	2
4	332128	2208	08 Information and Computing Sciences	2
...	...	...	...	...
171	187	3528	1899 Other Law and Legal Studies	4
172	144	3491	1799 Other Psychology and Cognitive Sciences	4
173	72	3567	1999 Other Studies In Creative Arts and Writing	4
174	62	3240	1299 Other Built Environment and Design	4
175	21	3223	1204 Engineering Design	4

176 rows × 4 columns

[16]:

result[result['level']==2]

[16]:

	count	id	name	level
0	1168442	2211	11 Medical and Health Sciences	2
1	610238	2209	09 Engineering	2
3	335403	2206	06 Biological Sciences	2
4	332128	2208	08 Information and Computing Sciences	2
5	304680	2203	03 Chemical Sciences	2
7	224973	2202	02 Physical Sciences	2
8	201573	2201	01 Mathematical Sciences	2
12	161476	2217	17 Psychology and Cognitive Sciences	2
13	151455	2216	16 Studies in Human Society	2
18	98630	2215	15 Commerce, Management, Tourism and Services	2
20	95061	2210	10 Technology	2
21	94318	2220	20 Language, Communication and Culture	2
24	88929	2213	13 Education	2
25	86868	2204	04 Earth Sciences	2
26	85471	2214	14 Economics	2
27	80461	2221	21 History and Archaeology	2
32	71522	2205	05 Environmental Sciences	2
35	67805	2207	07 Agricultural and Veterinary Sciences	2
41	56606	2222	22 Philosophy and Religious Studies	2
48	43353	2218	18 Law and Legal Studies	2
74	26972	2212	12 Built Environment and Design	2
84	20301	2219	19 Studies in Creative Arts and Writing	2
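The level derivation above relies on each category name starting with its numeric code: 2-digit codes are the top-level fields, 4-digit codes are their subdivisions. A minimal sketch of the same idea using vectorised string methods, on four names copied from the output above (`names`, `codes`, `levels` and `two_digit` are illustrative variable names):

```python
import pandas as pd

names = pd.Series([
    "11 Medical and Health Sciences",
    "1103 Clinical Sciences",
    "09 Engineering",
    "1204 Engineering Design",
])

# The leading code is everything before the first space;
# its length (2 or 4 digits) is what the notebook calls the level.
codes = names.str.split(" ", n=1).str[0]
levels = codes.str.len()

two_digit = names[levels == 2]
print(two_digit.tolist())
```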

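Step 2. Calculate 1% of the total number of records by category. This will be used to retrieve the 1% boundary record.

What is the boundary record? It is the publication sitting at the 1% position once a category's publications are ordered by the chosen indicator; its indicator value becomes the threshold for the top 1%.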
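To make the idea of a boundary record concrete, here is a small sketch on synthetic data. The values are randomly generated for illustration only and are not Dimensions data; the sketch also assumes the ordering is highest-first.

```python
import random

random.seed(0)

# Synthetic indicator values for 1,000 hypothetical publications.
fcr_values = [round(random.expovariate(1.0), 2) for _ in range(1000)]

ranked = sorted(fcr_values, reverse=True)  # highest value first
cutoff = int(len(ranked) * 0.01)           # 1% of the records, truncated
boundary_value = ranked[cutoff]            # the record found after skipping `cutoff` rows

# Publications at or above the boundary value form (roughly) the top 1%.
n_top = sum(v >= boundary_value for v in ranked)
print(cutoff, boundary_value, n_top)
```

Because ties and the boundary record itself are included, the number of records at or above the boundary is slightly larger than the cutoff.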

[17]:

result['cutoff'] = (result['count'] * .01).astype('int')

[18]:

result[result['level']==2]

[18]:

	count	id	name	level	cutoff
0	1168442	2211	11 Medical and Health Sciences	2	11684
1	610238	2209	09 Engineering	2	6102
3	335403	2206	06 Biological Sciences	2	3354
4	332128	2208	08 Information and Computing Sciences	2	3321
5	304680	2203	03 Chemical Sciences	2	3046
7	224973	2202	02 Physical Sciences	2	2249
8	201573	2201	01 Mathematical Sciences	2	2015
12	161476	2217	17 Psychology and Cognitive Sciences	2	1614
13	151455	2216	16 Studies in Human Society	2	1514
18	98630	2215	15 Commerce, Management, Tourism and Services	2	986
20	95061	2210	10 Technology	2	950
21	94318	2220	20 Language, Communication and Culture	2	943
24	88929	2213	13 Education	2	889
25	86868	2204	04 Earth Sciences	2	868
26	85471	2214	14 Economics	2	854
27	80461	2221	21 History and Archaeology	2	804
32	71522	2205	05 Environmental Sciences	2	715
35	67805	2207	07 Agricultural and Veterinary Sciences	2	678
41	56606	2222	22 Philosophy and Religious Studies	2	566
48	43353	2218	18 Law and Legal Studies	2	433
74	26972	2212	12 Built Environment and Design	2	269
84	20301	2219	19 Studies in Creative Arts and Writing	2	203

Step 3. Use the cutoff value to get the indicator value for the 1% boundary

Note: the query below combines three DSL clauses:

‘sort by’: return results ordered by field_citation_ratio
‘limit’: only the first result returned is needed
‘skip’: jump straight to the boundary record

Second note: this strategy won’t work when the boundary record lies beyond position 50,000…
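The combination of clauses described above, together with the 50,000 limit, can be sketched as a small query builder. `boundary_query` and `MAX_SKIP` are hypothetical names introduced for this illustration; the function only formats text and does not call the API.

```python
MAX_SKIP = 50_000  # limit mentioned in the note above


def boundary_query(category_id, cutoff: int, year: int = 2018) -> str:
    """Build the query that returns the boundary record for one category.

    Hypothetical helper: raises early instead of sending a query
    whose skip value is beyond the limit.
    """
    if cutoff > MAX_SKIP:
        raise ValueError(f"cutoff {cutoff} exceeds the skip limit of {MAX_SKIP}")
    return (
        "search publications\n"
        f'where category_for.id = "{category_id}"\n'
        f"and year = {year}\n"
        "return publications[field_citation_ratio]\n"
        "sort by field_citation_ratio\n"
        "limit 1\n"
        f"skip {cutoff}"
    )


# Category id and cutoff taken from the Step 2 table.
print(boundary_query(2211, 11684))
```

With the 1% cutoffs computed in Step 2, the largest skip value is 11684, so the limit is not reached in this notebook.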

[19]:

dfl = []

# iterate over the 2-digit categories only
for _, row in result[result['level']==2].iterrows():

    # use a separate name so the `result` dataframe is not overwritten
    boundary = dsl.query(f"""

           search publications
           where category_for.id = "{row['id']}"
           and year = 2018
           return publications[field_citation_ratio]
               sort by field_citation_ratio
               limit 1
               skip {row['cutoff']}

      """).as_dataframe()

    # label the boundary value with its category
    boundary['name'] = row['name']
    boundary['id'] = row['id']
    dfl.append(boundary)

Returned Publications: 1 (total = 1168442)
Time: 9.13s
Returned Publications: 1 (total = 610238)
Time: 4.71s
Returned Publications: 1 (total = 335403)
Time: 2.55s
Returned Publications: 1 (total = 332128)
Time: 2.55s
Returned Publications: 1 (total = 304680)
Time: 2.17s
Returned Publications: 1 (total = 224973)
Time: 2.28s
Returned Publications: 1 (total = 201573)
Time: 2.07s
Returned Publications: 1 (total = 161476)
Time: 1.58s
Returned Publications: 1 (total = 151455)
Time: 1.67s
Returned Publications: 1 (total = 98630)
Time: 1.27s
Returned Publications: 1 (total = 95061)
Time: 0.91s
Returned Publications: 1 (total = 94318)
Time: 1.18s
Returned Publications: 1 (total = 88929)
Time: 1.07s
Returned Publications: 1 (total = 86868)
Time: 1.03s
Returned Publications: 1 (total = 85471)
Time: 1.12s
Returned Publications: 1 (total = 80461)
Time: 1.27s
Returned Publications: 1 (total = 71522)
Time: 0.92s
Returned Publications: 1 (total = 67805)
Time: 1.14s
Returned Publications: 1 (total = 56606)
Time: 1.06s
Returned Publications: 1 (total = 43353)
Time: 0.86s
Returned Publications: 1 (total = 26972)
Time: 0.75s
Returned Publications: 1 (total = 20301)
Time: 0.82s

[20]:

cutoffs = pd.concat(dfl)

[21]:

cutoffs

[21]:

field_citation_ratio	name	id
28.41	11 Medical and Health Sciences	2211
21.35	09 Engineering	2209
20.52	06 Biological Sciences	2206
35.44	08 Information and Computing Sciences	2208
20.51	03 Chemical Sciences	2203
24.72	02 Physical Sciences	2202
27.12	01 Mathematical Sciences	2201
24.56	17 Psychology and Cognitive Sciences	2217
27.91	16 Studies in Human Society	2216
32.01	15 Commerce, Management, Tourism and Services	2215
25.02	10 Technology	2210
30.45	20 Language, Communication and Culture	2220
25.34	13 Education	2213
16.52	04 Earth Sciences	2204
33.18	14 Economics	2214
28.80	21 History and Archaeology	2221
20.46	05 Environmental Sciences	2205
15.42	07 Agricultural and Veterinary Sciences	2207
27.68	22 Philosophy and Religious Studies	2222
27.52	18 Law and Legal Studies	2218
16.68	12 Built Environment and Design	2212
27.55	19 Studies in Creative Arts and Writing	2219

We can only filter on integers in the DSL, so we truncate the values to whole numbers

[22]:

cutoffs.field_citation_ratio =  cutoffs.field_citation_ratio.astype('int')

[23]:

cutoffs

[23]:

field_citation_ratio	name	id
28	11 Medical and Health Sciences	2211
21	09 Engineering	2209
20	06 Biological Sciences	2206
35	08 Information and Computing Sciences	2208
20	03 Chemical Sciences	2203
24	02 Physical Sciences	2202
27	01 Mathematical Sciences	2201
24	17 Psychology and Cognitive Sciences	2217
27	16 Studies in Human Society	2216
32	15 Commerce, Management, Tourism and Services	2215
25	10 Technology	2210
30	20 Language, Communication and Culture	2220
25	13 Education	2213
16	04 Earth Sciences	2204
33	14 Economics	2214
28	21 History and Archaeology	2221
20	05 Environmental Sciences	2205
15	07 Agricultural and Veterinary Sciences	2207
27	22 Philosophy and Religious Studies	2222
27	18 Law and Legal Studies	2218
16	12 Built Environment and Design	2212
27	19 Studies in Creative Arts and Writing	2219
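Note that `astype('int')` truncates rather than rounds up. The sketch below compares truncation with rounding up on four boundary values copied from the `cutoffs` output above; the choice between the two is a judgment call, not something the notebook prescribes.

```python
import math

import pandas as pd

# Boundary values copied from the cutoffs table above.
fcr = pd.Series([28.41, 21.35, 35.44, 16.52])

truncated = fcr.astype("int")      # what the notebook does: drops the decimals
rounded_up = fcr.apply(math.ceil)  # alternative: next whole number

print(pd.DataFrame({"fcr": fcr, "truncated": truncated, "rounded_up": rounded_up}))
```

Filtering with `>=` on the truncated value admits slightly more than 1% of publications, while the rounded-up value would admit slightly fewer.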

Step 4. Now get the number of publications by organisation, filtered by category, that have a field_citation_ratio >= the boundary score

[24]:

dfl = []

for r in cutoffs.iterrows():

  result = dsl.query(f"""

     search publications
     where
         year=2018
         and category_for.id = "{r[1]['id']}"
         and field_citation_ratio >= {int(r[1]['field_citation_ratio'])}
    return research_orgs limit 1000

  """).as_dataframe()

  result['for_name'] = r[1]['name']
  result['for_id'] = r[1]['id']
  dfl.append(result)

Returned Research_orgs: 1000
Time: 1.21s
Returned Research_orgs: 1000
Time: 1.09s
Returned Research_orgs: 1000
Time: 3.14s
Returned Research_orgs: 1000
Time: 1.30s
Returned Research_orgs: 1000
Time: 0.96s
Returned Research_orgs: 1000
Time: 1.13s
Returned Research_orgs: 1000
Time: 1.06s
Returned Research_orgs: 1000
Time: 1.16s
Returned Research_orgs: 1000
Time: 1.06s
Returned Research_orgs: 927
Time: 1.13s
Returned Research_orgs: 915
Time: 1.04s
Returned Research_orgs: 704
Time: 0.83s
Returned Research_orgs: 863
Time: 1.03s
Returned Research_orgs: 1000
Time: 1.01s
Returned Research_orgs: 903
Time: 0.96s
Returned Research_orgs: 896
Time: 1.10s
Returned Research_orgs: 1000
Time: 1.26s
Returned Research_orgs: 1000
Time: 1.07s
Returned Research_orgs: 476
Time: 0.81s
Returned Research_orgs: 495
Time: 0.71s
Returned Research_orgs: 369
Time: 0.75s
Returned Research_orgs: 210
Time: 0.77s

Note: the DSL can only filter on integers, hence the int() conversion in the query above.

[25]:

top_insts = pd.concat(dfl)

Step 5. Rank the results

[26]:

top_insts['rank'] = top_insts.groupby('for_name')['count'].rank(ascending=False)

[27]:

top_insts[top_insts['name']=='University of Melbourne'][['for_name','rank']]

[27]:

	for_name	rank
21	11 Medical and Health Sciences	22.0
103	09 Engineering	107.0
25	06 Biological Sciences	26.0
99	08 Information and Computing Sciences	105.0
161	03 Chemical Sciences	170.5
142	02 Physical Sciences	150.0
45	01 Mathematical Sciences	48.5
9	17 Psychology and Cognitive Sciences	11.5
32	16 Studies in Human Society	36.0
66	15 Commerce, Management, Tourism and Services	88.0
35	20 Language, Communication and Culture	46.0
17	13 Education	22.5
196	04 Earth Sciences	230.0
83	14 Economics	110.0
263	21 History and Archaeology	579.5
30	05 Environmental Sciences	34.5
22	07 Agricultural and Veterinary Sciences	26.0
133	22 Philosophy and Religious Studies	304.0
23	18 Law and Legal Studies	37.0
20	12 Built Environment and Design	32.5
0	19 Studies in Creative Arts and Writing	1.0
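The fractional ranks in this output (for example 170.5) come from ties: by default pandas assigns tied rows the average of the positions they occupy. A minimal sketch on made-up data (the organisations and counts below are invented for illustration):

```python
import pandas as pd

# Invented counts with ties inside each category.
toy = pd.DataFrame({
    "for_name": ["A", "A", "A", "A", "B", "B"],
    "name": ["Org 1", "Org 2", "Org 3", "Org 4", "Org 1", "Org 2"],
    "count": [10, 7, 7, 3, 5, 5],
})

# Default tie handling (method="average"), as used in the notebook.
toy["rank"] = toy.groupby("for_name")["count"].rank(ascending=False)

# Alternative: tied rows all receive the best position of the tie.
toy["rank_min"] = toy.groupby("for_name")["count"].rank(ascending=False, method="min")

print(toy)
```

Using `method="min"` gives whole-number ranks, at the cost of leaving gaps after each group of ties.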

We should probably control for volume though…

Step 6. Get the total paper counts for each organisation

[28]:

dfl = []

for r in cutoffs.iterrows():

  result = dsl.query(f"""

     search publications
     where
         year=2018
         and category_for.id = "{r[1]['id']}"
    return research_orgs limit 1000

  """).as_dataframe()

  result['for_name'] = r[1]['name']
  result['for_id'] = r[1]['id']
  dfl.append(result)

Returned Research_orgs: 1000
Time: 1.47s
Returned Research_orgs: 1000
Time: 1.12s
Returned Research_orgs: 1000
Time: 1.14s
Returned Research_orgs: 1000
Time: 1.09s
Returned Research_orgs: 1000
Time: 0.97s
Returned Research_orgs: 1000
Time: 1.17s
Returned Research_orgs: 1000
Time: 1.29s
Returned Research_orgs: 1000
Time: 1.11s
Returned Research_orgs: 1000
Time: 1.00s
Returned Research_orgs: 1000
Time: 1.02s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 1.03s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 1.12s
Returned Research_orgs: 1000
Time: 1.14s
Returned Research_orgs: 1000
Time: 1.15s
Returned Research_orgs: 1000
Time: 1.15s
Returned Research_orgs: 1000
Time: 1.10s
Returned Research_orgs: 1000
Time: 0.97s
Returned Research_orgs: 1000
Time: 0.96s

[29]:

all_publications = pd.concat(dfl)[['id','for_id','count']]

[30]:

top_insts_all = all_publications.rename(columns={'count':'count all'}).merge(top_insts, on =['id','for_id'])

[31]:

top_insts_all[['for_name','name','count','count all']]

[31]:

	for_name	name	count	count all
0	11 Medical and Health Sciences	Harvard University	845	16932
1	11 Medical and Health Sciences	University of Toronto	392	10281
2	11 Medical and Health Sciences	Johns Hopkins University	391	10120
3	11 Medical and Health Sciences	University of California, San Francisco	365	7850
4	11 Medical and Health Sciences	Mayo Clinic	321	7659
...	...	...	...	...
12220	19 Studies in Creative Arts and Writing	University of Bamberg	1	3
12221	19 Studies in Creative Arts and Writing	National University of Quilmes	1	2
12222	19 Studies in Creative Arts and Writing	Czech University of Life Sciences Prague	1	2
12223	19 Studies in Creative Arts and Writing	University Hospitals of Cleveland	1	2
12224	19 Studies in Creative Arts and Writing	Grinnell College	1	2

12225 rows × 4 columns

Step 7. Calculate the percentage of each organisation's papers that fall in the top 1% of global publications (in 2018)

[32]:

top_insts_all['percentage top 1'] = (100 * top_insts_all['count']/top_insts_all['count all']).round(2)

[33]:

top_insts_all['percent rank'] = top_insts_all.groupby('for_name')['percentage top 1'].rank(ascending=False)

Now the results are going to look a little strange…

[34]:

top_insts_all[top_insts_all['name']=='University of Cambridge'][['for_name','percent rank']]

[34]:

	for_name	percent rank
66	11 Medical and Health Sciences	41.0
840	09 Engineering	138.0
1498	06 Biological Sciences	100.0
2294	08 Information and Computing Sciences	475.5
2875	03 Chemical Sciences	93.5
3512	02 Physical Sciences	278.5
4277	01 Mathematical Sciences	117.0
4921	17 Psychology and Cognitive Sciences	52.5
5584	16 Studies in Human Society	236.5
6165	15 Commerce, Management, Tourism and Services	150.0
6864	10 Technology	304.0
7312	20 Language, Communication and Culture	398.5
7785	13 Education	377.0
8302	04 Earth Sciences	346.5
8940	14 Economics	310.5
9467	21 History and Archaeology	305.0
10013	05 Environmental Sciences	123.0
10741	07 Agricultural and Veterinary Sciences	124.5
11176	22 Philosophy and Religious Studies	311.5
11503	18 Law and Legal Studies	196.0
11823	12 Built Environment and Design	181.0

[35]:

top_insts_all[top_insts_all['for_name']=='11 Medical and Health Sciences'][['name','percent rank']]

[35]:

	name	percent rank
0	Harvard University	61.5
1	University of Toronto	236.5
2	Johns Hopkins University	220.0
3	University of California, San Francisco	93.5
4	Mayo Clinic	152.0
...	...	...
780	University of Bath	425.0
781	Kuopio University Hospital	250.5
782	Marqués de Valdecilla University Hospital	299.0
783	Policlinico San Matteo Fondazione	114.5
784	Centre Hospitalier Universitaire de Caen	351.5

785 rows × 2 columns

Smaller institutions are favoured too much: a single highly cited paper out of a handful yields a very high percentage…
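The effect is easy to see with two rows copied from the output of cell [31]: an organisation with one top paper out of two scores far higher than one with 845 out of 16,932. The dataframe name `rows` is introduced only for this sketch.

```python
import pandas as pd

# Two rows copied from the output of cell [31] above.
rows = pd.DataFrame({
    "name": ["Harvard University", "Grinnell College"],
    "for_name": [
        "11 Medical and Health Sciences",
        "19 Studies in Creative Arts and Writing",
    ],
    "count": [845, 1],
    "count all": [16932, 2],
})

# Same formula as cell [32].
rows["percentage top 1"] = (100 * rows["count"] / rows["count all"]).round(2)

print(rows[["name", "count", "count all", "percentage top 1"]])
```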

Need to control for size…

[36]:

reference_institutions = top_insts_all[['id','name','for_id','count all']].\
     rename(columns={
            'id':'reference id',
            'name':'reference name',
           'count all':'reference count all'
           })

[37]:

relative_ranking = reference_institutions.merge(top_insts_all, on='for_id')

[38]:

relative_ranking[relative_ranking['reference name']=='University of Melbourne']

[38]:

	reference id	reference name	for_id	reference count all	id	count all	city_name	count	country_name	latitude	linkout	longitude	name	state_name	types	acronym	for_name	rank	percentage top 1	percent rank
15700	grid.1008.9	University of Melbourne	2211	5457	grid.38142.3c	16932	Cambridge	845	United States	42.377052	[http://www.harvard.edu/]	-71.116650	Harvard University	Massachusetts	[Education]	NaN	11 Medical and Health Sciences	1.0	4.99	61.5
15701	grid.1008.9	University of Melbourne	2211	5457	grid.17063.33	10281	Toronto	392	Canada	43.661667	[http://www.utoronto.ca/]	-79.395000	University of Toronto	Ontario	[Education]	NaN	11 Medical and Health Sciences	2.0	3.81	236.5
15702	grid.1008.9	University of Melbourne	2211	5457	grid.21107.35	10120	Baltimore	391	United States	39.328888	[https://www.jhu.edu/]	-76.620280	Johns Hopkins University	Maryland	[Education]	JHU	11 Medical and Health Sciences	3.0	3.86	220.0
15703	grid.1008.9	University of Melbourne	2211	5457	grid.266102.1	7850	San Francisco	365	United States	37.762800	[https://www.ucsf.edu/]	-122.457670	University of California, San Francisco	California	[Education]	UCSF	11 Medical and Health Sciences	6.0	4.65	93.5
15704	grid.1008.9	University of Melbourne	2211	5457	grid.66875.3a	7659	Rochester	321	United States	44.024070	[http://www.mayoclinic.org/patient-visitor-gui...	-92.466310	Mayo Clinic	Minnesota	[Healthcare]	NaN	11 Medical and Health Sciences	10.0	4.19	152.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7350128	grid.1008.9	University of Melbourne	2219	85	grid.7359.8	3	Bamberg	1	Germany	49.893845	[https://www.uni-bamberg.de/]	10.886044	University of Bamberg	NaN	[Education]	NaN	19 Studies in Creative Arts and Writing	130.0	33.33	8.0
7350129	grid.1008.9	University of Melbourne	2219	85	grid.11560.33	2	Bernal	1	Argentina	-34.706670	[http://www.unq.edu.ar/english/sections/158-unq/]	-58.277500	National University of Quilmes	NaN	[Education]	UNQ	19 Studies in Creative Arts and Writing	130.0	50.00	2.5
7350130	grid.1008.9	University of Melbourne	2219	85	grid.15866.3c	2	Prague	1	Czechia	50.131460	[http://www.czu.cz/en/]	14.373258	Czech University of Life Sciences Prague	NaN	[Education]	CULS	19 Studies in Creative Arts and Writing	130.0	50.00	2.5
7350131	grid.1008.9	University of Melbourne	2219	85	grid.241104.2	2	Cleveland	1	United States	41.506096	[http://www.uhhospitals.org/]	-81.604820	University Hospitals of Cleveland	Ohio	[Healthcare]	NaN	19 Studies in Creative Arts and Writing	130.0	50.00	2.5
7350132	grid.1008.9	University of Melbourne	2219	85	grid.256592.f	2	Grinnell	1	United States	41.749737	[http://www.grinnell.edu/]	-92.719505	Grinnell College	Iowa	[Education]	NaN	19 Studies in Creative Arts and Writing	130.0	50.00	2.5

11685 rows × 20 columns

[39]:

filtered_relative_ranking = relative_ranking[relative_ranking[
                                      'reference count all'] <= relative_ranking['count all']
                                      ].copy()

[40]:

filtered_relative_ranking['filtered percent rank'] = filtered_relative_ranking.\
                                                   groupby(['reference id','for_name'])['percentage top 1'].\
                                                   rank(ascending=False)

[41]:

inst = 'University of Melbourne'

filtered_relative_ranking[

                          (filtered_relative_ranking['reference name'] == inst) &
                          (filtered_relative_ranking['name'] == inst)

                         ][['id', 'for_id', 'name','for_name','filtered percent rank']]

[41]:

	id	for_id	name	for_name	filtered percent rank
15720	grid.1008.9	2211	University of Melbourne	11 Medical and Health Sciences	15.5
738709	grid.1008.9	2209	University of Melbourne	09 Engineering	40.0
1131875	grid.1008.9	2206	University of Melbourne	06 Biological Sciences	13.0
1643602	grid.1008.9	2208	University of Melbourne	08 Information and Computing Sciences	45.0
2140491	grid.1008.9	2203	University of Melbourne	03 Chemical Sciences	87.5
2627450	grid.1008.9	2202	University of Melbourne	02 Physical Sciences	49.0
3094798	grid.1008.9	2201	University of Melbourne	01 Mathematical Sciences	18.0
3454060	grid.1008.9	2217	University of Melbourne	17 Psychology and Cognitive Sciences	4.0
3909944	grid.1008.9	2216	University of Melbourne	16 Studies in Human Society	4.0
4225053	grid.1008.9	2215	University of Melbourne	15 Commerce, Management, Tourism and Services	9.0
4915245	grid.1008.9	2220	University of Melbourne	20 Language, Communication and Culture	3.0
5111347	grid.1008.9	2213	University of Melbourne	13 Education	8.0
5429203	grid.1008.9	2204	University of Melbourne	04 Earth Sciences	64.0
5817193	grid.1008.9	2214	University of Melbourne	14 Economics	13.0
6111107	grid.1008.9	2221	University of Melbourne	21 History and Archaeology	22.0
6352914	grid.1008.9	2205	University of Melbourne	05 Environmental Sciences	6.0
6797399	grid.1008.9	2207	University of Melbourne	07 Agricultural and Veterinary Sciences	9.0
7091867	grid.1008.9	2222	University of Melbourne	22 Philosophy and Religious Studies	13.0
7194418	grid.1008.9	2218	University of Melbourne	18 Law and Legal Studies	4.0
7281134	grid.1008.9	2212	University of Melbourne	12 Built Environment and Design	7.0
7349969	grid.1008.9	2219	University of Melbourne	19 Studies in Creative Arts and Writing	1.0
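The size control in cells [36] to [40] pairs every organisation (the reference) with every organisation in the same category, keeps only peers that publish at least as much as the reference, and then ranks within each reference group. A toy sketch of that pattern on invented data (four made-up organisations in a single category, grouped by `for_id` for brevity):

```python
import pandas as pd

# Invented organisations in one category.
orgs = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "name": ["Large", "Medium", "Small", "Tiny"],
    "for_id": [1, 1, 1, 1],
    "count all": [1000, 500, 100, 2],
    "percentage top 1": [3.0, 4.0, 2.0, 50.0],
})

reference = orgs[["id", "name", "for_id", "count all"]].rename(columns={
    "id": "reference id",
    "name": "reference name",
    "count all": "reference count all",
})

# Pair every reference with every organisation in the same category.
pairs = reference.merge(orgs, on="for_id")

# Keep only peers at least as large as the reference.
peers = pairs[pairs["reference count all"] <= pairs["count all"]].copy()

# Rank inside each reference group.
peers["filtered percent rank"] = (
    peers.groupby(["reference id", "for_id"])["percentage top 1"].rank(ascending=False)
)

# Each organisation's own rank among its equal-or-larger peers.
own = peers[peers["reference id"] == peers["id"]].set_index("name")["filtered percent rank"]
print(own)
```

In this toy data "Medium" ranks first among its equal-or-larger peers, because "Tiny" and its 50% score are excluded from Medium's comparison group.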

Final step. Show the institutions of most interest: those ranked within five places above or below the reference institution

[42]:

rank_cutoffs = filtered_relative_ranking[

                          (filtered_relative_ranking['reference name'] == filtered_relative_ranking['name'] )

                         ][['id', 'for_id', 'filtered percent rank']].\
                         rename(columns={'id':'reference id',
                                         'filtered percent rank':'reference filtered percent rank'})

[43]:

filtered_relative_ranking_final = rank_cutoffs.merge(filtered_relative_ranking, on=['reference id','for_id'])

[44]:

filtered_relative_ranking_final['rank_difference'] = filtered_relative_ranking_final['filtered percent rank'] - filtered_relative_ranking_final['reference filtered percent rank']

[45]:

inst = 'Monash University'
forname = '11 Medical and Health Sciences'

filtered_relative_ranking_final[

                                 (filtered_relative_ranking_final['rank_difference'].between(-5, 5)) &
                                 (filtered_relative_ranking_final['reference name'] == inst) &
                                 (filtered_relative_ranking_final['for_name'] == forname)

                                 ][['name','filtered percent rank']].sort_values(by='filtered percent rank')

[45]:

	name	filtered percent rank
515	University of Michigan	24.5
524	Karolinska Institute	24.5
523	Emory University	26.0
528	University of Pittsburgh	27.0
521	University of Sydney	28.0
538	Monash University	29.0
533	University of British Columbia	30.0
520	University of São Paulo	31.0
534	Shanghai Jiao Tong University	32.0
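The selection above keeps rows whose rank differs from the reference institution's own rank by at most five places in either direction. A minimal sketch of that window on invented ranks (`Ref` and `Org A` to `Org D` are made-up names):

```python
import pandas as pd

# Invented ranks for one reference institution ("Ref", own rank 29).
ranked = pd.DataFrame({
    "reference name": ["Ref"] * 5,
    "name": ["Org A", "Org B", "Ref", "Org C", "Org D"],
    "filtered percent rank": [2.0, 25.0, 29.0, 33.0, 60.0],
    "reference filtered percent rank": [29.0] * 5,
})

ranked["rank_difference"] = (
    ranked["filtered percent rank"] - ranked["reference filtered percent rank"]
)

# `between` is inclusive at both ends by default.
neighbours = ranked[ranked["rank_difference"].between(-5, 5)]
print(neighbours[["name", "filtered percent rank", "rank_difference"]])
```

Negative differences are institutions ranked just ahead of the reference, positive ones just behind it.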

[46]:

filtered_relative_ranking_final

[46]:

	reference id	for_id	reference filtered percent rank	reference name	reference count all	id	count all	city_name	count	country_name	...	name	state_name	types	acronym	for_name	rank	percentage top 1	percent rank	filtered percent rank	rank_difference
0	grid.38142.3c	2211	1.0	Harvard University	16932	grid.38142.3c	16932	Cambridge	845	United States	...	Harvard University	Massachusetts	[Education]	NaN	11 Medical and Health Sciences	1.0	4.99	61.5	1.0	0.0
1	grid.17063.33	2211	2.0	University of Toronto	10281	grid.38142.3c	16932	Cambridge	845	United States	...	Harvard University	Massachusetts	[Education]	NaN	11 Medical and Health Sciences	1.0	4.99	61.5	1.0	-1.0
2	grid.17063.33	2211	2.0	University of Toronto	10281	grid.17063.33	10281	Toronto	392	Canada	...	University of Toronto	Ontario	[Education]	NaN	11 Medical and Health Sciences	2.0	3.81	236.5	2.0	0.0
3	grid.21107.35	2211	2.0	Johns Hopkins University	10120	grid.38142.3c	16932	Cambridge	845	United States	...	Harvard University	Massachusetts	[Education]	NaN	11 Medical and Health Sciences	1.0	4.99	61.5	1.0	-1.0
4	grid.21107.35	2211	2.0	Johns Hopkins University	10120	grid.17063.33	10281	Toronto	392	Canada	...	University of Toronto	Ontario	[Education]	NaN	11 Medical and Health Sciences	2.0	3.81	236.5	3.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3721588	grid.256592.f	2219	2.5	Grinnell College	2	grid.7359.8	3	Bamberg	1	Germany	...	University of Bamberg	NaN	[Education]	NaN	19 Studies in Creative Arts and Writing	130.0	33.33	8.0	8.0	5.5
3721589	grid.256592.f	2219	2.5	Grinnell College	2	grid.11560.33	2	Bernal	1	Argentina	...	National University of Quilmes	NaN	[Education]	UNQ	19 Studies in Creative Arts and Writing	130.0	50.00	2.5	2.5	0.0
3721590	grid.256592.f	2219	2.5	Grinnell College	2	grid.15866.3c	2	Prague	1	Czechia	...	Czech University of Life Sciences Prague	NaN	[Education]	CULS	19 Studies in Creative Arts and Writing	130.0	50.00	2.5	2.5	0.0
3721591	grid.256592.f	2219	2.5	Grinnell College	2	grid.241104.2	2	Cleveland	1	United States	...	University Hospitals of Cleveland	Ohio	[Healthcare]	NaN	19 Studies in Creative Arts and Writing	130.0	50.00	2.5	2.5	0.0
3721592	grid.256592.f	2219	2.5	Grinnell College	2	grid.256592.f	2	Grinnell	1	United States	...	Grinnell College	Iowa	[Education]	NaN	19 Studies in Creative Arts and Writing	130.0	50.00	2.5	2.5	0.0

3721593 rows × 23 columns

Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.