General Publication Statistics about a Research Organization¶
This notebook shows how to extract basic indicators about a research organization programmatically, using the Dimensions Analytics API and Jupyter Notebooks.
[15]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[1]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    # on Colab, ask for the API key interactively
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    # elsewhere, an empty key makes dimcli fall back to the local dsl.ini credentials
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
Choose a Research Organization¶
For the purpose of this exercise, we are going to use grid.168010.e (Stanford University). Feel free to change the parameters below, e.g. by choosing another GRID organization.
[2]:
GRIDID = "grid.168010.e" #@param {type:"string"}
def grids_url(grids):
    "Generate a link to the Dimensions webapp for one or more GRID IDs"
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join(grids)
from IPython.display import display, HTML
display(HTML('---<br /><a href="{}">Preview {} Dimensions ⧉</a>'.format(grids_url([GRIDID]), GRIDID)))
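Because `grids_url` accepts a list, the same helper can build a preview link for several organizations at once. The sketch below repeats the function so that it runs on its own; the second GRID ID (grid.48336.3a, the National Cancer Institute) is taken from the funders table further down and is used purely as an illustration.

```python
def grids_url(grids):
    """Generate a link to the Dimensions webapp for one or more GRID IDs."""
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    # each additional ID is appended as another or_facet_research_org parameter
    return root + "&or_facet_research_org=".join(grids)

# illustrative pair of organizations; replace with the GRID IDs you care about
url = grids_url(["grid.168010.e", "grid.48336.3a"])
print(url)
```

The resulting link repeats the `or_facet_research_org` parameter once per organization.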
Publications output by year¶
[3]:
tot = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return publications limit 1""", verbose=False).count_total
print(f"{GRIDID} has a total of {tot} publications in Dimensions")
grid.168010.e has a total of 300560 publications in Dimensions
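If you want to count publications for more than one organization, one option is to build the query string with a list filter. The sketch below only constructs the string and does not call the API. `org_publications_query` is a hypothetical helper (not part of Dimcli), and it assumes the DSL accepts a JSON-style list after the `in` operator.

```python
import json

def org_publications_query(grid_ids, limit=1):
    """Hypothetical helper: build a DSL query matching any of the given GRID IDs."""
    # json.dumps renders the Python list as a double-quoted, comma-separated list
    ids = json.dumps(list(grid_ids))
    return (
        f"search publications where research_orgs.id in {ids} "
        f"return publications limit {limit}"
    )

q = org_publications_query(["grid.168010.e", "grid.48336.3a"])
print(q)
```

The string could then be passed to `dsl.query(q)` in the same way as the single-organization query above.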
[4]:
df = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return year limit 100""").as_dataframe()
df.rename(columns={"id": "year"}, inplace=True)
#
px.bar(df, x="year", y="count",
title=f"Publications from {GRIDID} - by year")
Returned Year: 100
Time: 0.52s
Publications most cited in last 2 years¶
[5]:
data = dslquery(f"""search publications where research_orgs.id="{GRIDID}"
return publications[doi+title+recent_citations+category_for+journal]
sort by recent_citations limit 100""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'recent_citations', 'journal.title']]
Returned Publications: 100 (total = 300560)
Time: 1.04s
[5]:
| | title | doi | recent_citations | journal.title |
---|---|---|---|---|
0 | ImageNet Large Scale Visual Recognition Challenge | 10.1007/s11263-015-0816-y | 7132 | International Journal of Computer Vision |
1 | DADA2: High resolution sample inference from I... | 10.1038/nmeth.3869 | 5971 | Nature Methods |
2 | The Elements of Statistical Learning, Data Min... | 10.1007/978-0-387-84858-7 | 4626 | NaN |
3 | phyloseq: An R Package for Reproducible Intera... | 10.1371/journal.pone.0061217 | 4305 | PLOS ONE |
4 | Reproducible, interactive, scalable and extens... | 10.1038/s41587-019-0209-9 | 3801 | Nature Biotechnology |
5 | Regularization Paths for Generalized Linear Mo... | 10.18637/jss.v033.i01 | 3351 | Journal of Statistical Software |
6 | Regularization and variable selection via the ... | 10.1111/j.1467-9868.2005.00503.x | 3147 | Journal of the Royal Statistical Society Serie... |
7 | Dermatologist-level classification of skin can... | 10.1038/nature21056 | 3047 | Nature |
8 | Review of Particle Physics* | 10.1103/physrevd.98.030001 | 2988 | Physical Review D |
9 | Combining theory and experiment in electrocata... | 10.1126/science.aad4998 | 2923 | Science |
Publications most cited - all time¶
[6]:
data = dslquery(f"""search publications
where research_orgs.id="{GRIDID}"
return publications[doi+title+times_cited+category_for+journal]
sort by times_cited limit 1000""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'times_cited', 'journal.title']]
Returned Publications: 1000 (total = 300560)
Time: 2.78s
[6]:
| | title | doi | times_cited | journal.title |
---|---|---|---|---|
0 | Compressed Sensing | 10.1109/tit.2006.871582 | 17803 | IEEE Transactions on Information Theory |
1 | Initial sequencing and analysis of the human g... | 10.1038/35057062 | 17426 | Nature |
2 | The american rheumatism association 1987 revis... | 10.1002/art.1780310302 | 16489 | Arthritis & Rheumatism |
3 | Building Theories from Case Study Research | 10.5465/amr.1989.4308385 | 16294 | Academy of Management Review |
4 | ImageNet Large Scale Visual Recognition Challenge | 10.1007/s11263-015-0816-y | 14916 | International Journal of Computer Vision |
5 | The Elements of Statistical Learning, Data Min... | 10.1007/978-0-387-84858-7 | 14465 | NaN |
6 | Cluster analysis and display of genome-wide ex... | 10.1073/pnas.95.25.14863 | 12815 | Proceedings of the National Academy of Science... |
7 | Atomic Force Microscope | 10.1103/physrevlett.56.930 | 11565 | Physical Review Letters |
8 | The 1982 revised criteria for the classificati... | 10.1002/art.1780251101 | 11421 | Arthritis & Rheumatism |
9 | Molecular portraits of human breast tumours | 10.1038/35021093 | 11132 | Nature |
Publications most cited: which research areas?¶
[7]:
data = dslquery(f"""search publications
where research_orgs.id="{GRIDID}"
return publications[doi+title+times_cited+category_for+journal]
sort by times_cited limit 1000""")
Returned Publications: 1000 (total = 300560)
Time: 1.88s
Most publications have one or more associated Fields of Research (FOR) categories, which are represented in the JSON like this:
{'category_for': [{'id': '3292', 'name': '1402 Applied Economics'},
                  {'id': '3177', 'name': '1117 Public Health and Health Services'}]
}
However, since some publications may not have an associated FOR category, the resulting JSON may in some cases lack the category_for key. Since we want to import the data into pandas, we need to ensure the key is always there, holding an empty list when no category is available.
[8]:
# normalize_key (imported from dimcli.utils above) takes: field name / json list / value to add when the field is not found
normalize_key("category_for", data.publications, [])
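To make the effect of this step concrete, here is a minimal pure-Python sketch of the same idea. `ensure_key` is a hypothetical stand-in written for this example, not Dimcli's implementation, and the second record (with its placeholder DOI) is made up to show the missing-key case.

```python
def ensure_key(records, key, default):
    """Hypothetical stand-in: add `key` to every record that lacks it."""
    for record in records:
        if key not in record:
            # copy list defaults so that records never share the same list object
            record[key] = list(default) if isinstance(default, list) else default
    return records

publications = [
    {"doi": "10.1109/tit.2006.871582",
     "category_for": [{"id": "2208", "name": "08 Information and Computing Sciences"}]},
    {"doi": "10.0000/placeholder"},  # made-up record without a category_for key
]

ensure_key(publications, "category_for", [])
print([len(p["category_for"]) for p in publications])
```

After this step every record has a `category_for` list, so a flattening call such as `pd.json_normalize(..., record_path='category_for')` no longer fails on a missing key.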
[9]:
df = pd.json_normalize(data.publications, record_path='category_for', meta=['doi', 'title', 'times_cited', ], errors='ignore' )
df.head()
[9]:
| | id | name | doi | title | times_cited |
---|---|---|---|---|---|
0 | 2208 | 08 Information and Computing Sciences | 10.1109/tit.2006.871582 | Compressed Sensing | 17803 |
1 | 2209 | 09 Engineering | 10.1109/tit.2006.871582 | Compressed Sensing | 17803 |
2 | 2210 | 10 Technology | 10.1109/tit.2006.871582 | Compressed Sensing | 17803 |
3 | 2867 | 0906 Electrical and Electronic Engineering | 10.1109/tit.2006.871582 | Compressed Sensing | 17803 |
4 | 3001 | 1005 Communications Technologies | 10.1109/tit.2006.871582 | Compressed Sensing | 17803 |
[10]:
px.scatter(df,
x="times_cited", y="name",
marginal_x="histogram",
marginal_y="histogram",
hover_data=["doi", "title"],
height=600,
title=f"Publications from {GRIDID} - Research Areas VS Citations")
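The flattened table lists the same paper under both two-digit codes (e.g. 08 Information and Computing Sciences) and four-digit codes (e.g. 0906 Electrical and Electronic Engineering), so the chart mixes broad and narrow categories. A minimal sketch of one way to keep only the broad level, assuming the leading digits of `name` encode the level; the rows are copied from the output above and `is_top_level` is a hypothetical helper.

```python
import re

# rows copied from the flattened table above (one row per publication/category pair)
rows = [
    {"name": "08 Information and Computing Sciences", "doi": "10.1109/tit.2006.871582", "times_cited": 17803},
    {"name": "09 Engineering", "doi": "10.1109/tit.2006.871582", "times_cited": 17803},
    {"name": "10 Technology", "doi": "10.1109/tit.2006.871582", "times_cited": 17803},
    {"name": "0906 Electrical and Electronic Engineering", "doi": "10.1109/tit.2006.871582", "times_cited": 17803},
    {"name": "1005 Communications Technologies", "doi": "10.1109/tit.2006.871582", "times_cited": 17803},
]

def is_top_level(name):
    """Hypothetical helper: True when the name starts with exactly two digits and a space."""
    return re.match(r"\d{2} ", name) is not None

top_level = [r for r in rows if is_top_level(r["name"])]
for r in top_level:
    print(r["name"])
```

With a DataFrame, the equivalent filter would be a boolean mask on the `name` column before plotting.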
Publications most cited: which journals?¶
[11]:
data = dslquery(f"""search publications
where research_orgs.id="{GRIDID}"
return publications[doi+title+times_cited+category_for+journal]
sort by times_cited limit 1000""")
df = data.as_dataframe()
#
px.scatter(df,
x="times_cited", y="journal.title",
marginal_x="histogram",
marginal_y="histogram",
height=600,
title=f"Publications from {GRIDID} - Journals VS Citations")
Returned Publications: 1000 (total = 300560)
Time: 1.71s
Top Funders (by aggregated funding amount)¶
[12]:
fundersdata = dsl.query(f"""search grants
where research_orgs.id="{GRIDID}"
return funders aggregate funding
sort by funding""")
df = fundersdata.as_dataframe()
df.head(10)
Returned Funders: 20
Time: 0.65s
[12]:
| | acronym | city_name | count | country_name | funding | id | latitude | linkout | longitude | name | state_name | types |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NCI | Bethesda | 849 | United States | 1.169651e+09 | grid.48336.3a | 39.004326 | [http://www.cancer.gov/] | -77.101190 | National Cancer Institute | Maryland | [Government] |
1 | NIGMS | Bethesda | 804 | United States | 1.123588e+09 | grid.280785.0 | 38.997833 | [http://www.nigms.nih.gov/Pages/default.aspx] | -77.099380 | National Institute of General Medical Sciences | Maryland | [Facility] |
2 | NIAID | Bethesda | 597 | United States | 8.251580e+08 | grid.419681.3 | 39.066647 | [http://www.niaid.nih.gov/Pages/default.aspx] | -77.111830 | National Institute of Allergy and Infectious D... | Maryland | [Facility] |
3 | NHLBI | Bethesda | 649 | United States | 8.117882e+08 | grid.279885.9 | 39.004280 | [http://www.nhlbi.nih.gov/] | -77.100945 | National Heart Lung and Blood Institute | Maryland | [Facility] |
4 | NSF MPS | Arlington | 1319 | United States | 7.034854e+08 | grid.457875.c | 38.880566 | [http://www.nsf.gov/dir/index.jsp?org=MPS] | -77.110990 | Directorate for Mathematical & Physical Sciences | Virginia | [Government] |
5 | NINDS | Bethesda | 564 | United States | 6.840811e+08 | grid.416870.c | 39.003826 | [http://www.ninds.nih.gov/] | -77.101180 | National Institute of Neurological Disorders a... | Maryland | [Facility] |
6 | NHGRI | Bethesda | 159 | United States | 6.532875e+08 | grid.280128.1 | 38.996967 | [https://www.genome.gov/] | -77.096930 | National Human Genome Research Institute | Maryland | [Facility] |
7 | NIMH | Bethesda | 539 | United States | 5.899221e+08 | grid.416868.5 | 39.003693 | [https://www.nimh.nih.gov/index.shtml] | -77.104570 | National Institute of Mental Health | Maryland | [Facility] |
8 | EPSRC | Swindon | 91 | United Kingdom | 4.753546e+08 | grid.421091.f | 51.567093 | [https://www.epsrc.ac.uk/] | -1.784602 | Engineering and Physical Sciences Research Cou... | England | [Government] |
9 | NSF EHR | Arlington | 134 | United States | 4.538210e+08 | grid.457799.1 | 38.880580 | [http://www.nsf.gov/dir/index.jsp?org=EHR] | -77.111000 | Directorate for Education & Human Resources | Virginia | [Government] |
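Since this query searches grants, the `count` column is the number of grants per funder, and `funding` is the aggregated amount. Dividing one by the other gives a rough average grant size, which ranks funders quite differently from the total. A minimal sketch using three rows copied from the table above; the `avg_per_grant` field is introduced here for illustration.

```python
# three rows copied from the funders table above
funders = [
    {"acronym": "NCI", "count": 849, "funding": 1.169651e+09},
    {"acronym": "NSF MPS", "count": 1319, "funding": 7.034854e+08},
    {"acronym": "NHGRI", "count": 159, "funding": 6.532875e+08},
]

for f in funders:
    # average amount per grant, in whatever currency the API returned
    f["avg_per_grant"] = f["funding"] / f["count"]

ranked = sorted(funders, key=lambda f: f["avg_per_grant"], reverse=True)
for f in ranked:
    print(f"{f['acronym']:<8} {f['avg_per_grant']:>14,.0f}")
```

On a DataFrame the same column could be computed as `df["funding"] / df["count"]`.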
Top funders split by country of the funder¶
[13]:
px.bar(df,
       x="name", y="funding",
       color="country_name",
       title=f"Funding for {GRIDID} - by funder and country")
Correlation between No of Grants VS Funding¶
[14]:
# "count" is the number of grants per funder, since the query above searches grants
px.scatter(df,
           x="funding", y="count",
           color="name",
           height=600,
           title=f"Funding for {GRIDID} - Grants VS Aggregated Funding Amount")
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.