
Benchmarking organizations with the Dimensions API

This Python notebook shows how to use the Dimensions Analytics API to benchmark organizations using publications data.

Outline

  1. Quick yet effective benchmarking calculations via built-in API aggregate indicators

  2. Building more complex quality benchmarking indicators

[2]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Feb 21, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[3]:
!pip install dimcli -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
import pandas as pd

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. Quick benchmarking using the API

Benchmarking is reasonably straightforward if what you want to compare is publication volume or one of the aggregate indicators available in the Dimensions API (see https://docs.dimensions.ai/dsl/examples.html#indicators-aggregations).

[4]:
%%dsldf
search publications
return research_orgs[name] aggregate altmetric_median
Returned Research_orgs: 20
Time: 21.14s
[4]:
altmetric_median count id name
0 5.0 546592 grid.38142.3c Harvard University
1 3.0 484017 grid.26999.3d University of Tokyo
2 4.0 342764 grid.17063.33 University of Toronto
3 3.0 320966 grid.214458.e University of Michigan
4 3.0 310485 grid.258799.8 Kyoto University
5 4.0 302094 grid.168010.e Stanford University
6 4.0 297558 grid.34477.33 University of Washington
7 3.0 297094 grid.19006.3e University of California, Los Angeles
8 5.0 289280 grid.4991.5 University of Oxford
9 4.0 285143 grid.21107.35 Johns Hopkins University
10 4.0 282170 grid.5335.0 University of Cambridge
11 2.0 280405 grid.11899.38 University of São Paulo
12 4.0 271170 grid.25879.31 University of Pennsylvania
13 4.0 266337 grid.83440.3b University College London
14 3.0 265592 grid.136593.b Osaka University
15 3.0 250749 grid.69566.3a Tohoku University
16 3.0 244713 grid.5386.8 Cornell University
17 4.0 242749 grid.47840.3f University of California, Berkeley
18 3.0 239283 grid.17635.36 University of Minnesota
19 4.0 236142 grid.21729.3f Columbia University
[5]:
%%dsldf
search publications
return research_orgs[name] aggregate citations_total
Returned Research_orgs: 20
Time: 6.63s
[5]:
citations_total count id name
0 28836616.0 546592 grid.38142.3c Harvard University
1 8545148.0 484017 grid.26999.3d University of Tokyo
2 11040840.0 342764 grid.17063.33 University of Toronto
3 11710248.0 320966 grid.214458.e University of Michigan
4 5928948.0 310485 grid.258799.8 Kyoto University
5 14738599.0 302094 grid.168010.e Stanford University
6 12585381.0 297558 grid.34477.33 University of Washington
7 11710928.0 297094 grid.19006.3e University of California, Los Angeles
8 10879614.0 289280 grid.4991.5 University of Oxford
9 12084053.0 285143 grid.21107.35 Johns Hopkins University
10 10814051.0 282170 grid.5335.0 University of Cambridge
11 4105653.0 280405 grid.11899.38 University of São Paulo
12 10450691.0 271170 grid.25879.31 University of Pennsylvania
13 9614297.0 266337 grid.83440.3b University College London
14 4653874.0 265592 grid.136593.b Osaka University
15 3694359.0 250749 grid.69566.3a Tohoku University
16 9370701.0 244713 grid.5386.8 Cornell University
17 11806056.0 242749 grid.47840.3f University of California, Berkeley
18 8360048.0 239283 grid.17635.36 University of Minnesota
19 9400497.0 236142 grid.21729.3f Columbia University
[6]:
%%dsldf
search publications
return research_orgs[name] aggregate recent_citations_total
Returned Research_orgs: 20
Time: 6.54s
[6]:
count id name recent_citations_total
0 546592 grid.38142.3c Harvard University 5562378.0
1 484017 grid.26999.3d University of Tokyo 1471000.0
2 342764 grid.17063.33 University of Toronto 2380994.0
3 320966 grid.214458.e University of Michigan 2370219.0
4 310485 grid.258799.8 Kyoto University 1006685.0
5 302094 grid.168010.e Stanford University 2985116.0
6 297558 grid.34477.33 University of Washington 2411827.0
7 297094 grid.19006.3e University of California, Los Angeles 2137101.0
8 289280 grid.4991.5 University of Oxford 2504619.0
9 285143 grid.21107.35 Johns Hopkins University 2352686.0
10 282170 grid.5335.0 University of Cambridge 2110364.0
11 280405 grid.11899.38 University of São Paulo 1124894.0
12 271170 grid.25879.31 University of Pennsylvania 2049126.0
13 266337 grid.83440.3b University College London 2197569.0
14 265592 grid.136593.b Osaka University 727151.0
15 250749 grid.69566.3a Tohoku University 644246.0
16 244713 grid.5386.8 Cornell University 1809884.0
17 242749 grid.47840.3f University of California, Berkeley 2057506.0
18 239283 grid.17635.36 University of Minnesota 1519539.0
19 236142 grid.21729.3f Columbia University 1754780.0
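
The totals above grow with publication volume, so larger organizations tend to lead regardless of per-paper impact. As a rough sketch (not part of the original notebook), the count and citations_total columns can be combined into a citations-per-publication figure. The three rows are copied from the output above; the column name citations_per_pub is an illustrative choice.

```python
import pandas as pd

# Three rows copied from the citations_total output above.
orgs = pd.DataFrame({
    "name": ["Harvard University", "University of Tokyo", "University of Toronto"],
    "count": [546592, 484017, 342764],
    "citations_total": [28836616.0, 8545148.0, 11040840.0],
})

# Hypothetical derived column: mean citations per publication.
orgs["citations_per_pub"] = (orgs["citations_total"] / orgs["count"]).round(2)

# Re-rank by the size-normalised figure instead of the raw total.
ranked = orgs.sort_values("citations_per_pub", ascending=False)
print(ranked[["name", "citations_per_pub"]])
```

In this small sample the University of Toronto moves ahead of the University of Tokyo once volume is taken into account, even though Tokyo has more publications.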

Aside: Recent Citations

[7]:
%%dsldf
search publications
return year aggregate recent_citations_total
Returned Year: 20
Time: 4.06s
[7]:
count id recent_citations_total
0 6503486 2020 18375337.0
1 6391947 2021 4632716.0
2 5792555 2019 22470145.0
3 5369555 2018 23030935.0
4 5044596 2017 21362603.0
5 4598245 2016 19046830.0
6 4395107 2015 17010283.0
7 4244049 2014 15057104.0
8 4046162 2013 13475978.0
9 3762532 2012 11970228.0
10 3667073 2011 10958039.0
11 3430544 2010 9915351.0
12 3144460 2009 8991871.0
13 2937393 2008 7853718.0
14 2915691 2007 7198101.0
15 2610760 2006 6579372.0
16 2410569 2005 5985630.0
17 2246194 2004 5335870.0
18 2037978 2003 4730168.0
19 1892417 2002 4234096.0
[8]:
dsl_last_results.sort_values(by='id').plot(x='id', y='recent_citations_total', figsize=(20,10))
[8]:
<AxesSubplot:xlabel='id'>
[Figure: line plot of recent_citations_total by publication year]
[9]:
recent_citations = dsl_last_results
[10]:
recent_citations['recent_ratio'] = recent_citations['recent_citations_total']/recent_citations['count']
recent_citations['year'] = recent_citations['id']
[11]:
recent_citations.sort_values(by='year').\
    plot(x='year',y='recent_ratio', figsize=(20,10))
[11]:
<AxesSubplot:xlabel='year'>
[Figure: line plot of recent_ratio (recent citations per publication) by year]

End Aside:

2. Calculating more complex ‘Quality’ Benchmarking indicators: Number of articles in the top X percent of research in their category

Step 1. Retrieve the total volume of publications by category (focusing on Fields of Research).

[12]:
%%dsldf

search publications
where year=2018
return category_for limit 1000
Returned Category_for: 176
Time: 1.02s
[12]:
count id name
0 1168442 2211 11 Medical and Health Sciences
1 610238 2209 09 Engineering
2 447354 3053 1103 Clinical Sciences
3 335403 2206 06 Biological Sciences
4 332128 2208 08 Information and Computing Sciences
... ... ... ...
171 187 3528 1899 Other Law and Legal Studies
172 144 3491 1799 Other Psychology and Cognitive Sciences
173 72 3567 1999 Other Studies In Creative Arts and Writing
174 62 3240 1299 Other Built Environment and Design
175 21 3223 1204 Engineering Design

176 rows × 3 columns

Step 1.2. We need to filter for the level 2 (two-digit) codes

[13]:
result = dsl.query("""
      search publications
      where year=2018
      return category_for limit 1000

""").as_dataframe()
Returned Category_for: 176
Time: 0.84s
[14]:
result['level'] = result.name.apply(lambda n: len(n.split(' ')[0]))
[15]:
result
[15]:
count id name level
0 1168442 2211 11 Medical and Health Sciences 2
1 610238 2209 09 Engineering 2
2 447354 3053 1103 Clinical Sciences 4
3 335403 2206 06 Biological Sciences 2
4 332128 2208 08 Information and Computing Sciences 2
... ... ... ... ...
171 187 3528 1899 Other Law and Legal Studies 4
172 144 3491 1799 Other Psychology and Cognitive Sciences 4
173 72 3567 1999 Other Studies In Creative Arts and Writing 4
174 62 3240 1299 Other Built Environment and Design 4
175 21 3223 1204 Engineering Design 4

176 rows × 4 columns

[16]:
result[result['level']==2]
[16]:
count id name level
0 1168442 2211 11 Medical and Health Sciences 2
1 610238 2209 09 Engineering 2
3 335403 2206 06 Biological Sciences 2
4 332128 2208 08 Information and Computing Sciences 2
5 304680 2203 03 Chemical Sciences 2
7 224973 2202 02 Physical Sciences 2
8 201573 2201 01 Mathematical Sciences 2
12 161476 2217 17 Psychology and Cognitive Sciences 2
13 151455 2216 16 Studies in Human Society 2
18 98630 2215 15 Commerce, Management, Tourism and Services 2
20 95061 2210 10 Technology 2
21 94318 2220 20 Language, Communication and Culture 2
24 88929 2213 13 Education 2
25 86868 2204 04 Earth Sciences 2
26 85471 2214 14 Economics 2
27 80461 2221 21 History and Archaeology 2
32 71522 2205 05 Environmental Sciences 2
35 67805 2207 07 Agricultural and Veterinary Sciences 2
41 56606 2222 22 Philosophy and Religious Studies 2
48 43353 2218 18 Law and Legal Studies 2
74 26972 2212 12 Built Environment and Design 2
84 20301 2219 19 Studies in Creative Arts and Writing 2
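
The level column above is simply the number of digits in the code that prefixes each category name. A minimal standalone sketch of that rule, using names taken from the output above (the helper name for_level is hypothetical):

```python
def for_level(name: str) -> int:
    """Return the number of digits in the leading FoR code of a category name."""
    return len(name.split(" ")[0])


names = [
    "11 Medical and Health Sciences",
    "09 Engineering",
    "1103 Clinical Sciences",
    "1204 Engineering Design",
]

# Keep only the categories whose code has two digits.
two_digit = [n for n in names if for_level(n) == 2]
print(two_digit)
```

This relies on every name starting with its code followed by a space, which holds for the rows shown here.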

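Step 2. Calculate 1% of the total number of records in each category. This value is used to retrieve the 1% boundary record.

What is the boundary record? It is the publication sitting at the 1% position when a category's publications are sorted by field_citation_ratio from highest to lowest; its score marks the threshold for the top 1%.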

[17]:
result['cutoff'] = (result['count'] * .01).astype('int')
[18]:
result[result['level']==2]
[18]:
count id name level cutoff
0 1168442 2211 11 Medical and Health Sciences 2 11684
1 610238 2209 09 Engineering 2 6102
3 335403 2206 06 Biological Sciences 2 3354
4 332128 2208 08 Information and Computing Sciences 2 3321
5 304680 2203 03 Chemical Sciences 2 3046
7 224973 2202 02 Physical Sciences 2 2249
8 201573 2201 01 Mathematical Sciences 2 2015
12 161476 2217 17 Psychology and Cognitive Sciences 2 1614
13 151455 2216 16 Studies in Human Society 2 1514
18 98630 2215 15 Commerce, Management, Tourism and Services 2 986
20 95061 2210 10 Technology 2 950
21 94318 2220 20 Language, Communication and Culture 2 943
24 88929 2213 13 Education 2 889
25 86868 2204 04 Earth Sciences 2 868
26 85471 2214 14 Economics 2 854
27 80461 2221 21 History and Archaeology 2 804
32 71522 2205 05 Environmental Sciences 2 715
35 67805 2207 07 Agricultural and Veterinary Sciences 2 678
41 56606 2222 22 Philosophy and Religious Studies 2 566
48 43353 2218 18 Law and Legal Studies 2 433
74 26972 2212 12 Built Environment and Design 2 269
84 20301 2219 19 Studies in Creative Arts and Writing 2 203
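
To make the idea of a boundary record concrete, here is a small offline sketch. The 200 scores are invented for illustration; only the arithmetic (1% of the count, then the record at that position in a highest-first ordering) mirrors the steps in this notebook.

```python
# Hypothetical field_citation_ratio values for a category of 200 publications.
fcr_values = [0.25 * i for i in range(200)]

# 1% of the number of records, truncated to an integer as in the cell above.
cutoff = int(len(fcr_values) * 0.01)

# Highest scores first, then jump to position `cutoff`.
# This is the record that skipping `cutoff` results and taking one would return.
ranked = sorted(fcr_values, reverse=True)
boundary_record = ranked[cutoff]

print(cutoff, boundary_record)
```

Every publication scoring at or above boundary_record belongs to (approximately) the top 1% of this toy category.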

Step 3. Use the cutoff value to get the indicator value for the 1% boundary

Note: here we combine ‘sort by’, ‘limit’ and ‘skip’:

  • ‘sort by’: return results in order of field_citation_ratio (highest first)

  • ‘limit’: we are only interested in the first result returned

  • ‘skip’: we skip ahead to the boundary record

Second note: this strategy won’t work when the boundary record lies beyond position 50,000.
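
The query used in the next cell can be sketched as a small helper that also checks the 50,000 limit mentioned in the note. The function name boundary_query and the constant MAX_SKIP are hypothetical, and the helper only builds the query text; it does not call the API.

```python
MAX_SKIP = 50_000  # limit taken from the note above


def boundary_query(category_id: str, year: int, cutoff: int) -> str:
    """Build the DSL query that returns the single record at position `cutoff`."""
    if cutoff > MAX_SKIP:
        raise ValueError(f"cutoff {cutoff} exceeds the {MAX_SKIP} skip limit")
    return (
        "search publications "
        f'where category_for.id = "{category_id}" and year = {year} '
        "return publications[field_citation_ratio] "
        "sort by field_citation_ratio "
        f"limit 1 skip {cutoff}"
    )


# Medical and Health Sciences: id 2211, cutoff 11684 (from the table above).
print(boundary_query("2211", 2018, 11684))
```

Failing early in this way is one option; another would be to narrow the query (for example by using more specific categories) so that the cutoff stays within the limit.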

[19]:
dfl = []

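# Select the two-digit categories up front and avoid reassigning `result`
# inside the loop, so that this cell can be re-run safely.
level2 = result[result['level']==2]

for _, row in level2.iterrows():

    # Sort by FCR (highest first), skip to the boundary record and return it alone
    boundary = dsl.query(f"""

           search publications
           where category_for.id = "{row['id']}"
           and year = 2018
           return publications[field_citation_ratio]
               sort by field_citation_ratio
               limit 1
               skip {row['cutoff']}

      """).as_dataframe()

    boundary['name'] = row['name']
    boundary['id'] = row['id']
    dfl.append(boundary)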
Returned Publications: 1 (total = 1168442)
Time: 9.13s
Returned Publications: 1 (total = 610238)
Time: 4.71s
Returned Publications: 1 (total = 335403)
Time: 2.55s
Returned Publications: 1 (total = 332128)
Time: 2.55s
Returned Publications: 1 (total = 304680)
Time: 2.17s
Returned Publications: 1 (total = 224973)
Time: 2.28s
Returned Publications: 1 (total = 201573)
Time: 2.07s
Returned Publications: 1 (total = 161476)
Time: 1.58s
Returned Publications: 1 (total = 151455)
Time: 1.67s
Returned Publications: 1 (total = 98630)
Time: 1.27s
Returned Publications: 1 (total = 95061)
Time: 0.91s
Returned Publications: 1 (total = 94318)
Time: 1.18s
Returned Publications: 1 (total = 88929)
Time: 1.07s
Returned Publications: 1 (total = 86868)
Time: 1.03s
Returned Publications: 1 (total = 85471)
Time: 1.12s
Returned Publications: 1 (total = 80461)
Time: 1.27s
Returned Publications: 1 (total = 71522)
Time: 0.92s
Returned Publications: 1 (total = 67805)
Time: 1.14s
Returned Publications: 1 (total = 56606)
Time: 1.06s
Returned Publications: 1 (total = 43353)
Time: 0.86s
Returned Publications: 1 (total = 26972)
Time: 0.75s
Returned Publications: 1 (total = 20301)
Time: 0.82s
[20]:
cutoffs = pd.concat(dfl)
[21]:
cutoffs
[21]:
field_citation_ratio name id
0 28.41 11 Medical and Health Sciences 2211
0 21.35 09 Engineering 2209
0 20.52 06 Biological Sciences 2206
0 35.44 08 Information and Computing Sciences 2208
0 20.51 03 Chemical Sciences 2203
0 24.72 02 Physical Sciences 2202
0 27.12 01 Mathematical Sciences 2201
0 24.56 17 Psychology and Cognitive Sciences 2217
0 27.91 16 Studies in Human Society 2216
0 32.01 15 Commerce, Management, Tourism and Services 2215
0 25.02 10 Technology 2210
0 30.45 20 Language, Communication and Culture 2220
0 25.34 13 Education 2213
0 16.52 04 Earth Sciences 2204
0 33.18 14 Economics 2214
0 28.80 21 History and Archaeology 2221
0 20.46 05 Environmental Sciences 2205
0 15.42 07 Agricultural and Veterinary Sciences 2207
0 27.68 22 Philosophy and Religious Studies 2222
0 27.52 18 Law and Legal Studies 2218
0 16.68 12 Built Environment and Design 2212
0 27.55 19 Studies in Creative Arts and Writing 2219

We can only filter on integers in the DSL, so we truncate the values to whole numbers (astype('int') rounds down, e.g. 28.41 becomes 28)

[22]:
cutoffs.field_citation_ratio =  cutoffs.field_citation_ratio.astype('int')
[23]:
cutoffs
[23]:
field_citation_ratio name id
0 28 11 Medical and Health Sciences 2211
0 21 09 Engineering 2209
0 20 06 Biological Sciences 2206
0 35 08 Information and Computing Sciences 2208
0 20 03 Chemical Sciences 2203
0 24 02 Physical Sciences 2202
0 27 01 Mathematical Sciences 2201
0 24 17 Psychology and Cognitive Sciences 2217
0 27 16 Studies in Human Society 2216
0 32 15 Commerce, Management, Tourism and Services 2215
0 25 10 Technology 2210
0 30 20 Language, Communication and Culture 2220
0 25 13 Education 2213
0 16 04 Earth Sciences 2204
0 33 14 Economics 2214
0 28 21 History and Archaeology 2221
0 20 05 Environmental Sciences 2205
0 15 07 Agricultural and Veterinary Sciences 2207
0 27 22 Philosophy and Religious Studies 2222
0 27 18 Law and Legal Studies 2218
0 16 12 Built Environment and Design 2212
0 27 19 Studies in Creative Arts and Writing 2219
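
Truncating the threshold makes the filter slightly more generous than the exact 1% boundary. The sketch below contrasts truncation with rounding up for one boundary value from the table above; the list of scores is invented for illustration.

```python
import math

boundary_fcr = 28.41  # Medical and Health Sciences boundary from the table above

truncated = int(boundary_fcr)         # same effect as astype('int') for positive values
rounded_up = math.ceil(boundary_fcr)  # a stricter alternative

# Hypothetical FCR scores close to the boundary.
scores = [27.9, 28.0, 28.3, 28.41, 28.9, 29.0, 31.2]

kept_truncated = [s for s in scores if s >= truncated]
kept_rounded_up = [s for s in scores if s >= rounded_up]

print(truncated, len(kept_truncated))
print(rounded_up, len(kept_rounded_up))
```

With truncation, papers scoring between 28 and 28.41 are counted as top 1% although they fall just below the boundary; rounding up would instead exclude some papers that are above it. Neither is exact, so the choice depends on which error is preferable.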

Step 4. Now get the number of publications by organisation, filtered by category, that have a field_citation_ratio at or above the boundary score

[24]:
dfl = []

for r in cutoffs.iterrows():

  result = dsl.query(f"""

     search publications
     where
         year=2018
         and category_for.id = "{r[1]['id']}"
         and field_citation_ratio >= {int(r[1]['field_citation_ratio'])}
    return research_orgs limit 1000

  """).as_dataframe()

  result['for_name'] = r[1]['name']
  result['for_id'] = r[1]['id']
  dfl.append(result)


Returned Research_orgs: 1000
Time: 1.21s
Returned Research_orgs: 1000
Time: 1.09s
Returned Research_orgs: 1000
Time: 3.14s
Returned Research_orgs: 1000
Time: 1.30s
Returned Research_orgs: 1000
Time: 0.96s
Returned Research_orgs: 1000
Time: 1.13s
Returned Research_orgs: 1000
Time: 1.06s
Returned Research_orgs: 1000
Time: 1.16s
Returned Research_orgs: 1000
Time: 1.06s
Returned Research_orgs: 927
Time: 1.13s
Returned Research_orgs: 915
Time: 1.04s
Returned Research_orgs: 704
Time: 0.83s
Returned Research_orgs: 863
Time: 1.03s
Returned Research_orgs: 1000
Time: 1.01s
Returned Research_orgs: 903
Time: 0.96s
Returned Research_orgs: 896
Time: 1.10s
Returned Research_orgs: 1000
Time: 1.26s
Returned Research_orgs: 1000
Time: 1.07s
Returned Research_orgs: 476
Time: 0.81s
Returned Research_orgs: 495
Time: 0.71s
Returned Research_orgs: 369
Time: 0.75s
Returned Research_orgs: 210
Time: 0.77s

Note: the threshold is cast to an integer in the query because we can only filter on integers.

[25]:
top_insts = pd.concat(dfl)

Step 5. Rank the results

[26]:
top_insts['rank'] = top_insts.groupby('for_name')['count'].rank(ascending=False)
[27]:
top_insts[top_insts['name']=='University of Melbourne'][['for_name','rank']]
[27]:
for_name rank
21 11 Medical and Health Sciences 22.0
103 09 Engineering 107.0
25 06 Biological Sciences 26.0
99 08 Information and Computing Sciences 105.0
161 03 Chemical Sciences 170.5
142 02 Physical Sciences 150.0
45 01 Mathematical Sciences 48.5
9 17 Psychology and Cognitive Sciences 11.5
32 16 Studies in Human Society 36.0
66 15 Commerce, Management, Tourism and Services 88.0
35 20 Language, Communication and Culture 46.0
17 13 Education 22.5
196 04 Earth Sciences 230.0
83 14 Economics 110.0
263 21 History and Archaeology 579.5
30 05 Environmental Sciences 34.5
22 07 Agricultural and Veterinary Sciences 26.0
133 22 Philosophy and Religious Studies 304.0
23 18 Law and Legal Studies 37.0
20 12 Built Environment and Design 32.5
0 19 Studies in Creative Arts and Writing 1.0
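
Fractional ranks such as 170.5 in the output above come from ties: by default, pandas assigns tied values the average of the positions they occupy. A toy example with invented counts and placeholder names:

```python
import pandas as pd

# Hypothetical counts of top-1% papers; categories and organisations are placeholders.
toy = pd.DataFrame({
    "for_name": ["A", "A", "A", "A", "B", "B"],
    "name": ["Org 1", "Org 2", "Org 3", "Org 4", "Org 1", "Org 2"],
    "count": [10, 7, 7, 3, 4, 9],
})

# Same call as in the cell above: rank within each category, highest count first.
toy["rank"] = toy.groupby("for_name")["count"].rank(ascending=False)
print(toy)
```

Org 2 and Org 3 share positions 2 and 3 in category A, so both receive rank 2.5. Passing method='min' to rank would give both rank 2 instead, if whole-number ranks are preferred.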

We should probably control for Volume though…

Step 6. Get the total paper counts for each organisation

[28]:
dfl = []

for r in cutoffs.iterrows():

  result = dsl.query(f"""

     search publications
     where
         year=2018
         and category_for.id = "{r[1]['id']}"
    return research_orgs limit 1000

  """).as_dataframe()

  result['for_name'] = r[1]['name']
  result['for_id'] = r[1]['id']
  dfl.append(result)


Returned Research_orgs: 1000
Time: 1.47s
Returned Research_orgs: 1000
Time: 1.12s
Returned Research_orgs: 1000
Time: 1.14s
Returned Research_orgs: 1000
Time: 1.09s
Returned Research_orgs: 1000
Time: 0.97s
Returned Research_orgs: 1000
Time: 1.17s
Returned Research_orgs: 1000
Time: 1.29s
Returned Research_orgs: 1000
Time: 1.11s
Returned Research_orgs: 1000
Time: 1.00s
Returned Research_orgs: 1000
Time: 1.02s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 1.03s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 0.98s
Returned Research_orgs: 1000
Time: 1.12s
Returned Research_orgs: 1000
Time: 1.14s
Returned Research_orgs: 1000
Time: 1.15s
Returned Research_orgs: 1000
Time: 1.15s
Returned Research_orgs: 1000
Time: 1.10s
Returned Research_orgs: 1000
Time: 0.97s
Returned Research_orgs: 1000
Time: 0.96s
[29]:
all_publications = pd.concat(dfl)[['id','for_id','count']]
[30]:
top_insts_all = all_publications.rename(columns={'count':'count all'}).merge(top_insts, on =['id','for_id'])
[31]:
top_insts_all[['for_name','name','count','count all']]
[31]:
for_name name count count all
0 11 Medical and Health Sciences Harvard University 845 16932
1 11 Medical and Health Sciences University of Toronto 392 10281
2 11 Medical and Health Sciences Johns Hopkins University 391 10120
3 11 Medical and Health Sciences University of California, San Francisco 365 7850
4 11 Medical and Health Sciences Mayo Clinic 321 7659
... ... ... ... ...
12220 19 Studies in Creative Arts and Writing University of Bamberg 1 3
12221 19 Studies in Creative Arts and Writing National University of Quilmes 1 2
12222 19 Studies in Creative Arts and Writing Czech University of Life Sciences Prague 1 2
12223 19 Studies in Creative Arts and Writing University Hospitals of Cleveland 1 2
12224 19 Studies in Creative Arts and Writing Grinnell College 1 2

12225 rows × 4 columns
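
Note that merge performs an inner join by default, so only organisations present in both tables survive. Here the Step 4 queries returned 18,758 rows in total, while the merged table has 12,225; the rest had no match among the (at most) 1000 organisations returned per category in Step 6. A sketch with invented ids showing how such drop-outs can be audited:

```python
import pandas as pd

# Hypothetical organisation ids and counts for a single category.
all_pubs = pd.DataFrame({
    "id": ["grid.a", "grid.b", "grid.c"],
    "for_id": [2211, 2211, 2211],
    "count all": [1000, 400, 50],
})
top_pubs = pd.DataFrame({
    "id": ["grid.a", "grid.b", "grid.d"],
    "for_id": [2211, 2211, 2211],
    "count": [40, 8, 1],
})

# Inner join (the pandas default): grid.c and grid.d are dropped.
merged = all_pubs.merge(top_pubs, on=["id", "for_id"])

# Outer join with an indicator column shows which side each row came from.
audit = all_pubs.merge(top_pubs, on=["id", "for_id"], how="outer", indicator=True)

print(merged)
print(audit["_merge"].value_counts())
```

Rows marked right_only in the audit correspond to organisations with top-1% papers whose total output was not retrieved.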

Step 7. Calculate the percentage of each organisation's papers that fall in the top 1% of global publications (in 2018)

[32]:
top_insts_all['percentage top 1'] = (100 * top_insts_all['count']/top_insts_all['count all']).round(2)
[33]:
top_insts_all['percent rank'] = top_insts_all.groupby('for_name')['percentage top 1'].rank(ascending=False)

Now the results are going to look a little strange…

[34]:
top_insts_all[top_insts_all['name']=='University of Cambridge'][['for_name','percent rank']]
[34]:
for_name percent rank
66 11 Medical and Health Sciences 41.0
840 09 Engineering 138.0
1498 06 Biological Sciences 100.0
2294 08 Information and Computing Sciences 475.5
2875 03 Chemical Sciences 93.5
3512 02 Physical Sciences 278.5
4277 01 Mathematical Sciences 117.0
4921 17 Psychology and Cognitive Sciences 52.5
5584 16 Studies in Human Society 236.5
6165 15 Commerce, Management, Tourism and Services 150.0
6864 10 Technology 304.0
7312 20 Language, Communication and Culture 398.5
7785 13 Education 377.0
8302 04 Earth Sciences 346.5
8940 14 Economics 310.5
9467 21 History and Archaeology 305.0
10013 05 Environmental Sciences 123.0
10741 07 Agricultural and Veterinary Sciences 124.5
11176 22 Philosophy and Religious Studies 311.5
11503 18 Law and Legal Studies 196.0
11823 12 Built Environment and Design 181.0
[35]:
top_insts_all[top_insts_all['for_name']=='11 Medical and Health Sciences'][['name','percent rank']]
[35]:
name percent rank
0 Harvard University 61.5
1 University of Toronto 236.5
2 Johns Hopkins University 220.0
3 University of California, San Francisco 93.5
4 Mayo Clinic 152.0
... ... ...
780 University of Bath 425.0
781 Kuopio University Hospital 250.5
782 Marqués de Valdecilla University Hospital 299.0
783 Policlinico San Matteo Fondazione 114.5
784 Centre Hospitalier Universitaire de Caen 351.5

785 rows × 2 columns

Smaller institutions are being favoured too much…

We need to control for size…
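
The effect is easy to see with a few rows copied from the merged table shown earlier (the first two are from Medical and Health Sciences, the last two from Studies in Creative Arts and Writing): an organisation with one top-1% paper out of two has a share of 50%.

```python
# (name, top-1% papers, all papers), copied from the merged table above.
rows = [
    ("Harvard University", 845, 16932),
    ("University of Toronto", 392, 10281),
    ("National University of Quilmes", 1, 2),
    ("University of Bamberg", 1, 3),
]

# Same formula as 'percentage top 1' in Step 7.
shares = {name: round(100 * top / total, 2) for name, top, total in rows}

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share}%")
```

A percentage based on two or three papers says little about an organisation, which is why the ranking needs to take size into account.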

[36]:
reference_institutions = top_insts_all[['id','name','for_id','count all']].\
     rename(columns={
            'id':'reference id',
            'name':'reference name',
           'count all':'reference count all'
           })
[37]:
relative_ranking = reference_institutions.merge(top_insts_all, on='for_id')
[38]:
relative_ranking[relative_ranking['reference name']=='University of Melbourne']
[38]:
reference id reference name for_id reference count all id count all city_name count country_name latitude linkout longitude name state_name types acronym for_name rank percentage top 1 percent rank
15700 grid.1008.9 University of Melbourne 2211 5457 grid.38142.3c 16932 Cambridge 845 United States 42.377052 [http://www.harvard.edu/] -71.116650 Harvard University Massachusetts [Education] NaN 11 Medical and Health Sciences 1.0 4.99 61.5
15701 grid.1008.9 University of Melbourne 2211 5457 grid.17063.33 10281 Toronto 392 Canada 43.661667 [http://www.utoronto.ca/] -79.395000 University of Toronto Ontario [Education] NaN 11 Medical and Health Sciences 2.0 3.81 236.5
15702 grid.1008.9 University of Melbourne 2211 5457 grid.21107.35 10120 Baltimore 391 United States 39.328888 [https://www.jhu.edu/] -76.620280 Johns Hopkins University Maryland [Education] JHU 11 Medical and Health Sciences 3.0 3.86 220.0
15703 grid.1008.9 University of Melbourne 2211 5457 grid.266102.1 7850 San Francisco 365 United States 37.762800 [https://www.ucsf.edu/] -122.457670 University of California, San Francisco California [Education] UCSF 11 Medical and Health Sciences 6.0 4.65 93.5
15704 grid.1008.9 University of Melbourne 2211 5457 grid.66875.3a 7659 Rochester 321 United States 44.024070 [http://www.mayoclinic.org/patient-visitor-gui... -92.466310 Mayo Clinic Minnesota [Healthcare] NaN 11 Medical and Health Sciences 10.0 4.19 152.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7350128 grid.1008.9 University of Melbourne 2219 85 grid.7359.8 3 Bamberg 1 Germany 49.893845 [https://www.uni-bamberg.de/] 10.886044 University of Bamberg NaN [Education] NaN 19 Studies in Creative Arts and Writing 130.0 33.33 8.0
7350129 grid.1008.9 University of Melbourne 2219 85 grid.11560.33 2 Bernal 1 Argentina -34.706670 [http://www.unq.edu.ar/english/sections/158-unq/] -58.277500 National University of Quilmes NaN [Education] UNQ 19 Studies in Creative Arts and Writing 130.0 50.00 2.5
7350130 grid.1008.9 University of Melbourne 2219 85 grid.15866.3c 2 Prague 1 Czechia 50.131460 [http://www.czu.cz/en/] 14.373258 Czech University of Life Sciences Prague NaN [Education] CULS 19 Studies in Creative Arts and Writing 130.0 50.00 2.5
7350131 grid.1008.9 University of Melbourne 2219 85 grid.241104.2 2 Cleveland 1 United States 41.506096 [http://www.uhhospitals.org/] -81.604820 University Hospitals of Cleveland Ohio [Healthcare] NaN 19 Studies in Creative Arts and Writing 130.0 50.00 2.5
7350132 grid.1008.9 University of Melbourne 2219 85 grid.256592.f 2 Grinnell 1 United States 41.749737 [http://www.grinnell.edu/] -92.719505 Grinnell College Iowa [Education] NaN 19 Studies in Creative Arts and Writing 130.0 50.00 2.5

11685 rows × 20 columns

[39]:
filtered_relative_ranking = relative_ranking[relative_ranking[
                                      'reference count all'] <= relative_ranking['count all']
                                      ].copy()
[40]:
filtered_relative_ranking['filtered percent rank'] = filtered_relative_ranking.\
                                                   groupby(['reference id','for_name'])['percentage top 1'].\
                                                   rank(ascending=False)
[41]:
inst = 'University of Melbourne'

filtered_relative_ranking[

                          (filtered_relative_ranking['reference name'] == inst) &
                          (filtered_relative_ranking['name'] == inst)

                         ][['id', 'for_id', 'name','for_name','filtered percent rank']]
[41]:
id for_id name for_name filtered percent rank
15720 grid.1008.9 2211 University of Melbourne 11 Medical and Health Sciences 15.5
738709 grid.1008.9 2209 University of Melbourne 09 Engineering 40.0
1131875 grid.1008.9 2206 University of Melbourne 06 Biological Sciences 13.0
1643602 grid.1008.9 2208 University of Melbourne 08 Information and Computing Sciences 45.0
2140491 grid.1008.9 2203 University of Melbourne 03 Chemical Sciences 87.5
2627450 grid.1008.9 2202 University of Melbourne 02 Physical Sciences 49.0
3094798 grid.1008.9 2201 University of Melbourne 01 Mathematical Sciences 18.0
3454060 grid.1008.9 2217 University of Melbourne 17 Psychology and Cognitive Sciences 4.0
3909944 grid.1008.9 2216 University of Melbourne 16 Studies in Human Society 4.0
4225053 grid.1008.9 2215 University of Melbourne 15 Commerce, Management, Tourism and Services 9.0
4915245 grid.1008.9 2220 University of Melbourne 20 Language, Communication and Culture 3.0
5111347 grid.1008.9 2213 University of Melbourne 13 Education 8.0
5429203 grid.1008.9 2204 University of Melbourne 04 Earth Sciences 64.0
5817193 grid.1008.9 2214 University of Melbourne 14 Economics 13.0
6111107 grid.1008.9 2221 University of Melbourne 21 History and Archaeology 22.0
6352914 grid.1008.9 2205 University of Melbourne 05 Environmental Sciences 6.0
6797399 grid.1008.9 2207 University of Melbourne 07 Agricultural and Veterinary Sciences 9.0
7091867 grid.1008.9 2222 University of Melbourne 22 Philosophy and Religious Studies 13.0
7194418 grid.1008.9 2218 University of Melbourne 18 Law and Legal Studies 4.0
7281134 grid.1008.9 2212 University of Melbourne 12 Built Environment and Design 7.0
7349969 grid.1008.9 2219 University of Melbourne 19 Studies in Creative Arts and Writing 1.0
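
The size control above pairs every organisation (the reference) with every other organisation in the same category, keeps only peers that publish at least as much as the reference, and ranks within that group. A toy version with three invented organisations:

```python
import pandas as pd

# Hypothetical organisations within a single category.
orgs = pd.DataFrame({
    "id": ["big", "mid", "small"],
    "for_id": [2211, 2211, 2211],
    "count all": [1000, 400, 2],
    "percentage top 1": [6.0, 5.0, 50.0],
})

reference = orgs[["id", "for_id", "count all"]].rename(
    columns={"id": "reference id", "count all": "reference count all"}
)

# Every (reference, peer) pair within the category.
pairs = reference.merge(orgs, on="for_id")

# Keep only peers at least as large as the reference, then rank within each group.
peers = pairs[pairs["reference count all"] <= pairs["count all"]].copy()
peers["filtered percent rank"] = peers.groupby(["reference id", "for_id"])[
    "percentage top 1"
].rank(ascending=False)

# Each organisation's rank among its own peer group.
own_rank = peers[peers["reference id"] == peers["id"]].set_index("id")[
    "filtered percent rank"
]
print(own_rank)
```

The large organisation is no longer outranked by the two-paper one, because the latter is not in its peer group. The pairing grows with the square of the number of organisations per category, which is why the real table above runs to millions of rows.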

Final step. Show me the institutions that I should be most interested in (those ranked within five places above or below)

[42]:
rank_cutoffs = filtered_relative_ranking[

                          (filtered_relative_ranking['reference name'] == filtered_relative_ranking['name'] )

                         ][['id', 'for_id', 'filtered percent rank']].\
                         rename(columns={'id':'reference id',
                                         'filtered percent rank':'reference filtered percent rank'})
[43]:
filtered_relative_ranking_final = rank_cutoffs.merge(filtered_relative_ranking, on=['reference id','for_id'])
[44]:
filtered_relative_ranking_final['rank_difference'] = filtered_relative_ranking_final['filtered percent rank'] - filtered_relative_ranking_final['reference filtered percent rank']
[45]:
inst = 'Monash University'
forname = '11 Medical and Health Sciences'

filtered_relative_ranking_final[

                                 (filtered_relative_ranking_final['rank_difference'].between(-5, 5)) &
                                 (filtered_relative_ranking_final['reference name'] == inst) &
                                 (filtered_relative_ranking_final['for_name'] == forname)

                                 ][['name','filtered percent rank']].sort_values(by='filtered percent rank')



[45]:
name filtered percent rank
515 University of Michigan 24.5
524 Karolinska Institute 24.5
523 Emory University 26.0
528 University of Pittsburgh 27.0
521 University of Sydney 28.0
538 Monash University 29.0
533 University of British Columbia 30.0
520 University of São Paulo 31.0
534 Shanghai Jiao Tong University 32.0
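
The selection above keeps peers whose rank differs from the reference institution's own rank by at most five places in either direction. The same window, sketched in plain Python with invented names and ranks:

```python
# Hypothetical filtered percent ranks within one reference institution's peer group.
peer_ranks = {"Org A": 20.0, "Org B": 24.5, "Org C": 29.0, "Org D": 33.0, "Org E": 41.0}
reference = "Org C"
window = 5

# Negative differences are peers ranked above the reference, positive ones below.
neighbours = {
    name: rank - peer_ranks[reference]
    for name, rank in peer_ranks.items()
    if abs(rank - peer_ranks[reference]) <= window
}
print(neighbours)
```

Like between(-5, 5), the comparison is inclusive at both ends, and the reference itself appears with a difference of zero.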
[46]:
filtered_relative_ranking_final
[46]:
reference id for_id reference filtered percent rank reference name reference count all id count all city_name count country_name ... name state_name types acronym for_name rank percentage top 1 percent rank filtered percent rank rank_difference
0 grid.38142.3c 2211 1.0 Harvard University 16932 grid.38142.3c 16932 Cambridge 845 United States ... Harvard University Massachusetts [Education] NaN 11 Medical and Health Sciences 1.0 4.99 61.5 1.0 0.0
1 grid.17063.33 2211 2.0 University of Toronto 10281 grid.38142.3c 16932 Cambridge 845 United States ... Harvard University Massachusetts [Education] NaN 11 Medical and Health Sciences 1.0 4.99 61.5 1.0 -1.0
2 grid.17063.33 2211 2.0 University of Toronto 10281 grid.17063.33 10281 Toronto 392 Canada ... University of Toronto Ontario [Education] NaN 11 Medical and Health Sciences 2.0 3.81 236.5 2.0 0.0
3 grid.21107.35 2211 2.0 Johns Hopkins University 10120 grid.38142.3c 16932 Cambridge 845 United States ... Harvard University Massachusetts [Education] NaN 11 Medical and Health Sciences 1.0 4.99 61.5 1.0 -1.0
4 grid.21107.35 2211 2.0 Johns Hopkins University 10120 grid.17063.33 10281 Toronto 392 Canada ... University of Toronto Ontario [Education] NaN 11 Medical and Health Sciences 2.0 3.81 236.5 3.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3721588 grid.256592.f 2219 2.5 Grinnell College 2 grid.7359.8 3 Bamberg 1 Germany ... University of Bamberg NaN [Education] NaN 19 Studies in Creative Arts and Writing 130.0 33.33 8.0 8.0 5.5
3721589 grid.256592.f 2219 2.5 Grinnell College 2 grid.11560.33 2 Bernal 1 Argentina ... National University of Quilmes NaN [Education] UNQ 19 Studies in Creative Arts and Writing 130.0 50.00 2.5 2.5 0.0
3721590 grid.256592.f 2219 2.5 Grinnell College 2 grid.15866.3c 2 Prague 1 Czechia ... Czech University of Life Sciences Prague NaN [Education] CULS 19 Studies in Creative Arts and Writing 130.0 50.00 2.5 2.5 0.0
3721591 grid.256592.f 2219 2.5 Grinnell College 2 grid.241104.2 2 Cleveland 1 United States ... University Hospitals of Cleveland Ohio [Healthcare] NaN 19 Studies in Creative Arts and Writing 130.0 50.00 2.5 2.5 0.0
3721592 grid.256592.f 2219 2.5 Grinnell College 2 grid.256592.f 2 Grinnell 1 United States ... Grinnell College Iowa [Education] NaN 19 Studies in Creative Arts and Writing 130.0 50.00 2.5 2.5 0.0

3721593 rows × 23 columns



Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
