../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

General Publication Statistics about a Research Organization

This Notebook shows how it’s possible to extract basic indicators about a research organization programmatically, using the Dimensions Analytics API and Jupyter Notebooks.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.shortcuts import *
import os, sys, time, json
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependecies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7.4.2)
Connected to: https://app.dimensions.ai - DSL v1.27
Method: dsl.ini file

Choose a Research Organization

For the purpose of this exercise, we will are going to use grid.168010.e (Stanford University). Feel free though to change the parameters below as you want, eg by choosing another GRID organization.

[2]:
GRIDID = "grid.168010.e" #@param {type:"string"}

def grids_url(grids):
    "gen link to Dimensions webapp"
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join([x for x in grids])

from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Preview {} Dimensions &#x29c9;</a>'.format(grids_url([GRIDID]), GRIDID)))

Publications output by year

[3]:
tot = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return publications limit 1""", verbose=False).count_total
print(f"{GRIDID} has a total of {tot} publications in Dimensions")
grid.168010.e has a total of 249953 publications in Dimensions
[4]:
df = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return year limit 100""").as_dataframe()
df.rename(columns={"id": "year"}, inplace=True)
#
px.bar(df, x="year", y="count",
       title=f"Publications from {GRIDID} - by year")
Returned Year: 100
Time: 0.56s

Publications most cited in last 2 years

[5]:
data = dslquery(f"""search publications where research_orgs.id="{GRIDID}"
        return publications[doi+title+recent_citations+category_for+journal]
        sort by recent_citations limit 100""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'recent_citations', 'journal.title']]
Returned Publications: 100 (total = 249953)
Time: 0.97s
[5]:
title doi recent_citations journal.title
0 ImageNet Large Scale Visual Recognition Challenge 10.1007/s11263-015-0816-y 7321 International Journal of Computer Vision
1 A global reference for human genetic variation 10.1038/nature15393 4499 Nature
2 The Elements of Statistical Learning, Data Min... 10.1007/978-0-387-84858-7 4458 NaN
3 An integrated encyclopedia of DNA elements in ... 10.1038/nature11247 4229 Nature
4 Compressed Sensing 10.1109/tit.2006.871582 4082 IEEE Transactions on Information Theory
5 Planck 2015 results - XIII. Cosmological param... 10.1051/0004-6361/201525830 3707 Astronomy & Astrophysics
6 Building Theories from Case Study Research 10.5465/amr.1989.4308385 3559 Academy of Management Review
7 Global, regional, and national prevalence of o... 10.1016/s0140-6736(14)60460-8 3388 The Lancet
8 Regularization and variable selection via the ... 10.1111/j.1467-9868.2005.00503.x 3229 Journal of the Royal Statistical Society Serie...
9 phyloseq: An R Package for Reproducible Intera... 10.1371/journal.pone.0061217 3086 PLoS ONE

Publications most cited - all time

[6]:
data = dslquery(f"""search publications
                where research_orgs.id="{GRIDID}"
                return publications[doi+title+times_cited+category_for+journal]
                sort by times_cited limit 1000""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'times_cited', 'journal.title']]
Returned Publications: 1000 (total = 249953)
Time: 2.04s
[6]:
title doi times_cited journal.title
0 Initial sequencing and analysis of the human g... 10.1038/35057062 15997 Nature
1 Compressed Sensing 10.1109/tit.2006.871582 15811 IEEE Transactions on Information Theory
2 The american rheumatism association 1987 revis... 10.1002/art.1780310302 15717 Arthritis & Rheumatology
3 Cluster analysis and display of genome-wide ex... 10.1073/pnas.95.25.14863 12325 Proceedings of the National Academy of Science...
4 Building Theories from Case Study Research 10.5465/amr.1989.4308385 10920 Academy of Management Review
5 The 1982 revised criteria for the classificati... 10.1002/art.1780251101 10912 Arthritis & Rheumatology
6 Atomic Force Microscope 10.1103/physrevlett.56.930 10434 Physical Review Letters
7 An integrated encyclopedia of DNA elements in ... 10.1038/nature11247 10204 Nature
8 Molecular portraits of human breast tumours 10.1038/35021093 9888 Nature
9 ImageNet Large Scale Visual Recognition Challenge 10.1007/s11263-015-0816-y 9464 International Journal of Computer Vision

Publications most cited : which research areas?

[28]:
data = dslquery(f"""search publications
                    where research_orgs.id="{GRIDID}"
                    return publications[doi+title+times_cited+category_for+journal]
                    sort by times_cited limit 1000""")
Returned Publications: 1000 (total = 249953)
Time: 1.38s

Most publications have one or more associated Field Of Research (FOR) category, which is represented in the JSON like this:

{'category_for' : [{'id': '3292', 'name': '1402 Applied Economics'},
                    {'id': '3177', 'name': '1117 Public Health and Health Services'}]
                    }`

However since some publications may not have an associated FOR category, the resulting JSON in some cases may not have category_for as a key. Since we want to import the data into pandas we need to ensure the key is always there and has an empty list when no category is available.

[29]:
# dimcli.shortcuts.normalize_key takes: field name / json list / value to add when the field is not found
normalize_key("category_for", data.publications, [])
[30]:
df = pd.json_normalize(data.publications, record_path='category_for', meta=['doi', 'title', 'times_cited', ], errors='ignore' )
df.head()
[30]:
id name doi title times_cited
0 2620 0604 Genetics 10.1038/35057062 Initial sequencing and analysis of the human g... 15997
1 2206 06 Biological Sciences 10.1038/35057062 Initial sequencing and analysis of the human g... 15997
2 2208 08 Information and Computing Sciences 10.1109/tit.2006.871582 Compressed Sensing 15811
3 2209 09 Engineering 10.1109/tit.2006.871582 Compressed Sensing 15811
4 2210 10 Technology 10.1109/tit.2006.871582 Compressed Sensing 15811
[31]:
px.scatter(df,
           x="times_cited", y="name",
           marginal_x="histogram",
           marginal_y="histogram",
           hover_data=["doi", "title"],
           height=600,
           title=f"Publications from {GRIDID} - Research Areas VS Citations")

Publications most cited : which journals?

[32]:
data = dslquery(f"""search publications
                    where research_orgs.id="{GRIDID}"
                    return publications[doi+title+times_cited+category_for+journal]
                    sort by times_cited limit 1000""")

df = data.as_dataframe()
#
px.scatter(df,
           x="times_cited", y="journal.title",
           marginal_x="histogram",
           marginal_y="histogram",
           height=600,
           title=f"Publications from {GRIDID} - Journals VS Citations")
Returned Publications: 1000 (total = 249953)
Time: 1.45s

Top Funders (by aggregated funding amount)

[33]:
fundersdata = dsl.query(f"""search grants
                        where research_orgs.id="{GRIDID}"
                        return funders aggregate funding
                        sort by funding""")
df = fundersdata.as_dataframe()
df.head(10)
Returned Funders: 20
Time: 0.53s
[33]:
id count funding country_name acronym state_name city_name name types longitude linkout latitude
0 grid.48336.3a 788 1.059476e+09 United States NCI Maryland Rockville National Cancer Institute [Government] -77.101190 [http://www.cancer.gov/] 39.004326
1 grid.280785.0 770 1.040659e+09 United States NIGMS Maryland Bethesda National Institute of General Medical Sciences [Facility] -77.099380 [http://www.nigms.nih.gov/Pages/default.aspx] 38.997833
2 grid.419681.3 567 7.449747e+08 United States NIAID Maryland Bethesda National Institute of Allergy and Infectious D... [Facility] -77.111830 [http://www.niaid.nih.gov/Pages/default.aspx] 39.066647
3 grid.279885.9 609 7.437638e+08 United States NHLBI Maryland Bethesda National Heart Lung and Blood Institute [Facility] -77.100945 [http://www.nhlbi.nih.gov/] 39.004280
4 grid.457875.c 1292 6.756204e+08 United States NSF MPS Virginia Arlington Directorate for Mathematical & Physical Sciences [Government] -77.110990 [http://www.nsf.gov/dir/index.jsp?org=MPS] 38.880566
5 grid.280128.1 140 6.005008e+08 United States NHGRI Maryland Bethesda National Human Genome Research Institute [Facility] -77.096930 [https://www.genome.gov/] 38.996967
6 grid.416870.c 523 5.996880e+08 United States NINDS Maryland Bethesda National Institute of Neurological Disorders a... [Facility] -77.101180 [http://www.ninds.nih.gov/] 39.003826
7 grid.416868.5 521 5.463138e+08 United States NIMH Maryland Bethesda National Institute of Mental Health [Facility] -77.104570 [https://www.nimh.nih.gov/index.shtml] 39.003693
8 grid.418557.a 137 4.777585e+08 United States CIRM California San Francisco California Institute for Regenerative Medicine [Government] -122.392340 [https://www.cirm.ca.gov/] 37.778034
9 grid.457799.1 131 4.298322e+08 United States NSF EHR Virginia Arlington Directorate for Education & Human Resources [Government] -77.111000 [http://www.nsf.gov/dir/index.jsp?org=EHR] 38.880580

Top funders split by country of the funder

[34]:
px.bar(df,
       x="name", y="funding",
       color="country_name",
       title=f"Funding for {GRIDID} - by year")

Correlation between No of Publications VS Funding

[35]:
px.scatter(df,
           x="funding", y="count",
           color="name",
           height=600,
           title=f"Funding for {GRIDID} - Publications VS Aggregated Funding Amount")


Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg