
General Publication Statistics about a Research Organization

This notebook shows how to extract basic indicators about a research organization programmatically, using the Dimensions Analytics API and Jupyter Notebooks.

[15]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
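
The login cell above leaves KEY empty outside Colab and, as the log shows, ends up using the credentials in a local dsl.ini file. One alternative, sketched here under assumptions, is to read the key from an environment variable. The variable name DIMENSIONS_API_KEY and the helper resolve_api_key are hypothetical and not part of Dimcli.

```python
import os

def resolve_api_key(env_var="DIMENSIONS_API_KEY", default=""):
    """Return the API key stored in an environment variable, or `default`.

    The variable name is a hypothetical convention, not something Dimcli defines.
    """
    return os.environ.get(env_var, default).strip()

KEY = resolve_api_key()
# An empty KEY matches the non-Colab branch of the cell above, which
# relied on the credentials stored in the dsl.ini file.
print("Key found in environment" if KEY else "No key in environment")
```

The resulting KEY could then be passed to dimcli.login(key=KEY, endpoint=ENDPOINT) exactly as in the cell above.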

Choose a Research Organization

For the purpose of this exercise, we are going to use grid.168010.e (Stanford University). Feel free to change the parameters below, e.g. by choosing another GRID organization.

[2]:
GRIDID = "grid.168010.e" #@param {type:"string"}

def grids_url(grids):
    "Generate a link to the Dimensions webapp for one or more GRID IDs"
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join(grids)

from IPython.display import display, HTML
display(HTML('---<br /><a href="{}">Preview {} Dimensions &#x29c9;</a>'.format(grids_url([GRIDID]), GRIDID)))
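
To illustrate what the link helper produces when several organizations are passed, here is a minimal standalone sketch. It is a variant of the function above, not a replacement: the name grids_url_encoded is hypothetical, and it uses the standard library's urlencode to assemble the repeated query parameter. The two GRID IDs are taken from elsewhere in this notebook purely as sample values.

```python
from urllib.parse import urlencode

def grids_url_encoded(grids):
    """Build a Dimensions webapp link filtering on one or more GRID IDs."""
    root = "https://app.dimensions.ai/discover/publication"
    # One or_facet_research_org parameter per organization, as in grids_url
    params = [("or_facet_research_org", grid_id) for grid_id in grids]
    return f"{root}?{urlencode(params)}"

url = grids_url_encoded(["grid.168010.e", "grid.48336.3a"])
print(url)
```

Letting urlencode build the query string also escapes any characters that are not URL-safe, which plain string joining would not do.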

Publications output by year

[3]:
tot = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return publications limit 1""", verbose=False).count_total
print(f"{GRIDID} has a total of {tot} publications in Dimensions")
grid.168010.e has a total of 300560 publications in Dimensions
[4]:
df = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return year limit 100""").as_dataframe()
df.rename(columns={"id": "year"}, inplace=True)
#
px.bar(df, x="year", y="count",
       title=f"Publications from {GRIDID} - by year")
Returned Year: 100
Time: 0.52s

Publications most cited in last 2 years

[5]:
data = dslquery(f"""search publications where research_orgs.id="{GRIDID}"
        return publications[doi+title+recent_citations+category_for+journal]
        sort by recent_citations limit 100""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'recent_citations', 'journal.title']]
Returned Publications: 100 (total = 300560)
Time: 1.04s
[5]:
title doi recent_citations journal.title
0 ImageNet Large Scale Visual Recognition Challenge 10.1007/s11263-015-0816-y 7132 International Journal of Computer Vision
1 DADA2: High resolution sample inference from I... 10.1038/nmeth.3869 5971 Nature Methods
2 The Elements of Statistical Learning, Data Min... 10.1007/978-0-387-84858-7 4626 NaN
3 phyloseq: An R Package for Reproducible Intera... 10.1371/journal.pone.0061217 4305 PLOS ONE
4 Reproducible, interactive, scalable and extens... 10.1038/s41587-019-0209-9 3801 Nature Biotechnology
5 Regularization Paths for Generalized Linear Mo... 10.18637/jss.v033.i01 3351 Journal of Statistical Software
6 Regularization and variable selection via the ... 10.1111/j.1467-9868.2005.00503.x 3147 Journal of the Royal Statistical Society Serie...
7 Dermatologist-level classification of skin can... 10.1038/nature21056 3047 Nature
8 Review of Particle Physics* 10.1103/physrevd.98.030001 2988 Physical Review D
9 Combining theory and experiment in electrocata... 10.1126/science.aad4998 2923 Science

Publications most cited - all time

[6]:
data = dslquery(f"""search publications
                where research_orgs.id="{GRIDID}"
                return publications[doi+title+times_cited+category_for+journal]
                sort by times_cited limit 1000""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'times_cited', 'journal.title']]
Returned Publications: 1000 (total = 300560)
Time: 2.78s
[6]:
title doi times_cited journal.title
0 Compressed Sensing 10.1109/tit.2006.871582 17803 IEEE Transactions on Information Theory
1 Initial sequencing and analysis of the human g... 10.1038/35057062 17426 Nature
2 The american rheumatism association 1987 revis... 10.1002/art.1780310302 16489 Arthritis & Rheumatism
3 Building Theories from Case Study Research 10.5465/amr.1989.4308385 16294 Academy of Management Review
4 ImageNet Large Scale Visual Recognition Challenge 10.1007/s11263-015-0816-y 14916 International Journal of Computer Vision
5 The Elements of Statistical Learning, Data Min... 10.1007/978-0-387-84858-7 14465 NaN
6 Cluster analysis and display of genome-wide ex... 10.1073/pnas.95.25.14863 12815 Proceedings of the National Academy of Science...
7 Atomic Force Microscope 10.1103/physrevlett.56.930 11565 Physical Review Letters
8 The 1982 revised criteria for the classificati... 10.1002/art.1780251101 11421 Arthritis & Rheumatism
9 Molecular portraits of human breast tumours 10.1038/35021093 11132 Nature
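
Two publications appear in both the recent and the all-time tables, so their figures can be combined into a rough "share of citations received in the last two years". The sketch below copies the numbers from the tables above and identifies each publication by DOI. The helper recent_share is hypothetical and only illustrates the arithmetic.

```python
# doi -> (recent_citations, times_cited), copied from the two tables above
citations = {
    "10.1007/s11263-015-0816-y": (7132, 14916),
    "10.1007/978-0-387-84858-7": (4626, 14465),
}

def recent_share(recent, total):
    """Fraction of all citations that fall in the recent window."""
    return recent / total if total else 0.0

shares = {doi: recent_share(recent, total)
          for doi, (recent, total) in citations.items()}
for doi, share in shares.items():
    print(f"{doi}: {share:.1%} of citations are recent")
```

With a full dataframe, the same ratio could be computed column-wise if both recent_citations and times_cited are requested in one query.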

Publications most cited: which research areas?

[7]:
data = dslquery(f"""search publications
                    where research_orgs.id="{GRIDID}"
                    return publications[doi+title+times_cited+category_for+journal]
                    sort by times_cited limit 1000""")
Returned Publications: 1000 (total = 300560)
Time: 1.88s

Most publications have one or more associated Fields of Research (FOR) categories, which are represented in the JSON like this:

{'category_for': [{'id': '3292', 'name': '1402 Applied Economics'},
                  {'id': '3177', 'name': '1117 Public Health and Health Services'}]}

However, since some publications may not have an associated FOR category, the resulting JSON may in some cases lack the category_for key. Since we want to import the data into pandas, we need to ensure the key is always there and holds an empty list when no category is available.

[8]:
# dimcli.shortcuts.normalize_key takes: field name / json list / value to add when the field is not found
normalize_key("category_for", data.publications, [])
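
For readers curious what this kind of normalization amounts to, here is a minimal plain-Python sketch. It is not Dimcli's implementation: ensure_key is a hypothetical helper, and the second sample record uses a made-up placeholder DOI to stand in for a publication without categories.

```python
import copy

def ensure_key(key, records, default):
    """Add `key` with a copy of `default` to every dict that lacks it."""
    for record in records:
        if key not in record:
            # Copy so that records do not share one mutable default
            record[key] = copy.deepcopy(default)
    return records

sample = [
    {"doi": "10.1109/tit.2006.871582",
     "category_for": [{"id": "2208", "name": "08 Information and Computing Sciences"}]},
    {"doi": "10.0000/example"},  # hypothetical record with no category_for key
]
ensure_key("category_for", sample, [])
print([len(record["category_for"]) for record in sample])
```

Copying the default matters for lists: without it, every patched record would point at the same list object.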
[9]:
df = pd.json_normalize(data.publications, record_path='category_for', meta=['doi', 'title', 'times_cited', ], errors='ignore' )
df.head()
[9]:
id name doi title times_cited
0 2208 08 Information and Computing Sciences 10.1109/tit.2006.871582 Compressed Sensing 17803
1 2209 09 Engineering 10.1109/tit.2006.871582 Compressed Sensing 17803
2 2210 10 Technology 10.1109/tit.2006.871582 Compressed Sensing 17803
3 2867 0906 Electrical and Electronic Engineering 10.1109/tit.2006.871582 Compressed Sensing 17803
4 3001 1005 Communications Technologies 10.1109/tit.2006.871582 Compressed Sensing 17803
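
The flattening shown above (one row per category, with the publication fields repeated on each row) can be pictured in plain Python. This is only a sketch of the idea, not how pandas implements json_normalize. The function flatten_records is hypothetical, and the sample record reuses values from the output above.

```python
def flatten_records(records, record_path, meta):
    """Emit one row per nested item, copying the `meta` fields onto each row."""
    rows = []
    for record in records:
        for item in record.get(record_path, []):
            row = dict(item)
            for field in meta:
                row[field] = record.get(field)
            rows.append(row)
    return rows

publication = {
    "doi": "10.1109/tit.2006.871582",
    "title": "Compressed Sensing",
    "times_cited": 17803,
    "category_for": [
        {"id": "2208", "name": "08 Information and Computing Sciences"},
        {"id": "2209", "name": "09 Engineering"},
    ],
}
rows = flatten_records([publication], "category_for", ["doi", "title", "times_cited"])
for row in rows:
    print(row)
```

In this sketch, a publication whose category list is empty contributes no rows at all, which is why a publication with several categories appears several times in the scatter plot while one without categories does not appear.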
[10]:
px.scatter(df,
           x="times_cited", y="name",
           marginal_x="histogram",
           marginal_y="histogram",
           hover_data=["doi", "title"],
           height=600,
           title=f"Publications from {GRIDID} - Research Areas VS Citations")

Publications most cited: which journals?

[11]:
data = dslquery(f"""search publications
                    where research_orgs.id="{GRIDID}"
                    return publications[doi+title+times_cited+category_for+journal]
                    sort by times_cited limit 1000""")

df = data.as_dataframe()
#
px.scatter(df,
           x="times_cited", y="journal.title",
           marginal_x="histogram",
           marginal_y="histogram",
           height=600,
           title=f"Publications from {GRIDID} - Journals VS Citations")
Returned Publications: 1000 (total = 300560)
Time: 1.71s

Top Funders (by aggregated funding amount)

[12]:
fundersdata = dsl.query(f"""search grants
                        where research_orgs.id="{GRIDID}"
                        return funders aggregate funding
                        sort by funding""")
df = fundersdata.as_dataframe()
df.head(10)
Returned Funders: 20
Time: 0.65s
[12]:
acronym city_name count country_name funding id latitude linkout longitude name state_name types
0 NCI Bethesda 849 United States 1.169651e+09 grid.48336.3a 39.004326 [http://www.cancer.gov/] -77.101190 National Cancer Institute Maryland [Government]
1 NIGMS Bethesda 804 United States 1.123588e+09 grid.280785.0 38.997833 [http://www.nigms.nih.gov/Pages/default.aspx] -77.099380 National Institute of General Medical Sciences Maryland [Facility]
2 NIAID Bethesda 597 United States 8.251580e+08 grid.419681.3 39.066647 [http://www.niaid.nih.gov/Pages/default.aspx] -77.111830 National Institute of Allergy and Infectious D... Maryland [Facility]
3 NHLBI Bethesda 649 United States 8.117882e+08 grid.279885.9 39.004280 [http://www.nhlbi.nih.gov/] -77.100945 National Heart Lung and Blood Institute Maryland [Facility]
4 NSF MPS Arlington 1319 United States 7.034854e+08 grid.457875.c 38.880566 [http://www.nsf.gov/dir/index.jsp?org=MPS] -77.110990 Directorate for Mathematical & Physical Sciences Virginia [Government]
5 NINDS Bethesda 564 United States 6.840811e+08 grid.416870.c 39.003826 [http://www.ninds.nih.gov/] -77.101180 National Institute of Neurological Disorders a... Maryland [Facility]
6 NHGRI Bethesda 159 United States 6.532875e+08 grid.280128.1 38.996967 [https://www.genome.gov/] -77.096930 National Human Genome Research Institute Maryland [Facility]
7 NIMH Bethesda 539 United States 5.899221e+08 grid.416868.5 39.003693 [https://www.nimh.nih.gov/index.shtml] -77.104570 National Institute of Mental Health Maryland [Facility]
8 EPSRC Swindon 91 United Kingdom 4.753546e+08 grid.421091.f 51.567093 [https://www.epsrc.ac.uk/] -1.784602 Engineering and Physical Sciences Research Cou... England [Government]
9 NSF EHR Arlington 134 United States 4.538210e+08 grid.457799.1 38.880580 [http://www.nsf.gov/dir/index.jsp?org=EHR] -77.111000 Directorate for Education & Human Resources Virginia [Government]

Top funders split by country of the funder

[13]:
px.bar(df,
       x="name", y="funding",
       color="country_name",
       title=f"Funding for {GRIDID} - by funder and country")

Correlation between number of grants and funding

[14]:
px.scatter(df,
           x="funding", y="count",
           color="name",
           height=600,
           title=f"Funding for {GRIDID} - Number of Grants VS Aggregated Funding Amount")
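
Because the funders query searches grants, the count column presumably holds the number of grants per funder. Under that assumption, dividing the aggregated funding by the count gives a rough average amount per grant. The sketch below copies three rows from the funders table above; funding_per_grant is a hypothetical helper.

```python
# (acronym, count, funding) copied from the funders table above
funders = [
    ("NCI", 849, 1.169651e9),
    ("NSF MPS", 1319, 7.034854e8),
    ("NHGRI", 159, 6.532875e8),
]

def funding_per_grant(count, funding):
    """Average funding per grant, guarding against a zero count."""
    return funding / count if count else 0.0

ranking = sorted(
    ((acronym, funding_per_grant(count, funding)) for acronym, count, funding in funders),
    key=lambda pair: pair[1],
    reverse=True,
)
for acronym, average in ranking:
    print(f"{acronym}: {average:,.0f} per grant")
```

On the full dataframe, the equivalent would be a new column computed as df["funding"] / df["count"].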


Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
