General Publication Statistics about a Research Organization

This notebook shows how to extract basic indicators about a research organization programmatically, using the Dimensions Analytics API and Jupyter Notebooks.

[15]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:

!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Choose a Research Organization

For the purpose of this exercise, we are going to use grid.168010.e (Stanford University). Feel free to change the parameters below, e.g. by choosing another GRID organization.

[2]:

GRIDID = "grid.168010.e" #@param {type:"string"}

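def grids_url(grids):
    "Generate a link to the Dimensions webapp, filtered by the given GRID IDs"
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    # one or_facet_research_org parameter per organization
    return root + "&or_facet_research_org=".join(grids)

# IPython.display is the public import path for display and HTML
from IPython.display import display, HTML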
display(HTML('---<br /><a href="{}">Preview {} Dimensions &#x29c9;</a>'.format(grids_url([GRIDID]), GRIDID)))

---
Preview grid.168010.e Dimensions ⧉
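
The string concatenation in `grids_url` works for well-formed GRID IDs. A variant that hands the query-string assembly to the standard library might look like the sketch below. Note that `grids_url_encoded` is a hypothetical name, not part of Dimcli, and the second GRID ID is included purely to show how several organizations are combined.

```python
from urllib.parse import urlencode

def grids_url_encoded(grids):
    """Build a Dimensions webapp link for one or more GRID IDs (hypothetical helper)."""
    root = "https://app.dimensions.ai/discover/publication"
    # One or_facet_research_org parameter per organization, as in grids_url
    params = [("or_facet_research_org", grid_id) for grid_id in grids]
    return f"{root}?{urlencode(params)}"

# The second ID is only an illustrative placeholder for "another organization"
url = grids_url_encoded(["grid.168010.e", "grid.5335.0"])
print(url)
```

Passing a list of pairs to `urlencode` keeps the repeated parameter name and percent-encodes any unexpected characters in the IDs.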

Publications output by year

[3]:

tot = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return publications limit 1""", verbose=False).count_total
print(f"{GRIDID} has a total of {tot} publications in Dimensions")

grid.168010.e has a total of 300560 publications in Dimensions

[4]:

df = dsl.query(f"""search publications where research_orgs.id="{GRIDID}" return year limit 100""").as_dataframe()
df.rename(columns={"id": "year"}, inplace=True)
#
px.bar(df, x="year", y="count",
       title=f"Publications from {GRIDID} - by year")

Returned Year: 100
Time: 0.52s

Publications most cited in last 2 years

[5]:

data = dslquery(f"""search publications where research_orgs.id="{GRIDID}"
        return publications[doi+title+recent_citations+category_for+journal]
        sort by recent_citations limit 100""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'recent_citations', 'journal.title']]

Returned Publications: 100 (total = 300560)
Time: 1.04s

[5]:

	title	doi	recent_citations	journal.title
0	ImageNet Large Scale Visual Recognition Challenge	10.1007/s11263-015-0816-y	7132	International Journal of Computer Vision
1	DADA2: High resolution sample inference from I...	10.1038/nmeth.3869	5971	Nature Methods
2	The Elements of Statistical Learning, Data Min...	10.1007/978-0-387-84858-7	4626	NaN
3	phyloseq: An R Package for Reproducible Intera...	10.1371/journal.pone.0061217	4305	PLOS ONE
4	Reproducible, interactive, scalable and extens...	10.1038/s41587-019-0209-9	3801	Nature Biotechnology
5	Regularization Paths for Generalized Linear Mo...	10.18637/jss.v033.i01	3351	Journal of Statistical Software
6	Regularization and variable selection via the ...	10.1111/j.1467-9868.2005.00503.x	3147	Journal of the Royal Statistical Society Serie...
7	Dermatologist-level classification of skin can...	10.1038/nature21056	3047	Nature
8	Review of Particle Physics*	10.1103/physrevd.98.030001	2988	Physical Review D
9	Combining theory and experiment in electrocata...	10.1126/science.aad4998	2923	Science

Publications most cited - all time

[6]:

data = dslquery(f"""search publications
                where research_orgs.id="{GRIDID}"
                return publications[doi+title+times_cited+category_for+journal]
                sort by times_cited limit 1000""")
df = data.as_dataframe()
df.head(10)[['title', 'doi', 'times_cited', 'journal.title']]

Returned Publications: 1000 (total = 300560)
Time: 2.78s

[6]:

	title	doi	times_cited	journal.title
0	Compressed Sensing	10.1109/tit.2006.871582	17803	IEEE Transactions on Information Theory
1	Initial sequencing and analysis of the human g...	10.1038/35057062	17426	Nature
2	The american rheumatism association 1987 revis...	10.1002/art.1780310302	16489	Arthritis & Rheumatism
3	Building Theories from Case Study Research	10.5465/amr.1989.4308385	16294	Academy of Management Review
4	ImageNet Large Scale Visual Recognition Challenge	10.1007/s11263-015-0816-y	14916	International Journal of Computer Vision
5	The Elements of Statistical Learning, Data Min...	10.1007/978-0-387-84858-7	14465	NaN
6	Cluster analysis and display of genome-wide ex...	10.1073/pnas.95.25.14863	12815	Proceedings of the National Academy of Science...
7	Atomic Force Microscope	10.1103/physrevlett.56.930	11565	Physical Review Letters
8	The 1982 revised criteria for the classificati...	10.1002/art.1780251101	11421	Arthritis & Rheumatism
9	Molecular portraits of human breast tumours	10.1038/35021093	11132	Nature

Publications most cited: which research areas?

[7]:

data = dslquery(f"""search publications
                    where research_orgs.id="{GRIDID}"
                    return publications[doi+title+times_cited+category_for+journal]
                    sort by times_cited limit 1000""")

Returned Publications: 1000 (total = 300560)
Time: 1.88s

Most publications have one or more associated Field Of Research (FOR) categories, which are represented in the JSON like this:

{'category_for': [{'id': '3292', 'name': '1402 Applied Economics'},
                  {'id': '3177', 'name': '1117 Public Health and Health Services'}]}

However, since some publications may not have an associated FOR category, the resulting JSON may in some cases lack the category_for key. Since we want to import the data into pandas, we need to ensure the key is always present, with an empty list as its value when no category is available.

[8]:

# normalize_key (imported above via dimcli.utils) takes: field name / json list / value to add when the field is not found
normalize_key("category_for", data.publications, [])
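
To make the effect of this normalization step concrete, here is a minimal plain-Python sketch of the same idea. It is not Dimcli's implementation: `ensure_key` is a hypothetical helper, and the two sample records are made up for illustration (the first mirrors the JSON shape shown earlier).

```python
def ensure_key(records, key, default):
    """Add `key` with a default value to every dict that lacks it (hypothetical helper)."""
    for record in records:
        if key not in record:
            # Copy list defaults so records do not share one mutable object
            record[key] = list(default) if isinstance(default, list) else default
    return records

# Made-up sample records: one with a FOR category, one without the key at all
publications = [
    {"title": "With categories",
     "category_for": [{"id": "3292", "name": "1402 Applied Economics"}]},
    {"title": "Without categories"},
]

ensure_key(publications, "category_for", [])
print([len(p["category_for"]) for p in publications])
```

After the call every record carries a `category_for` list, so later code can iterate over it without checking whether the key exists.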

[9]:

df = pd.json_normalize(data.publications, record_path='category_for', meta=['doi', 'title', 'times_cited'], errors='ignore')
df.head()

[9]:

	id	name	doi	title	times_cited
0	2208	08 Information and Computing Sciences	10.1109/tit.2006.871582	Compressed Sensing	17803
1	2209	09 Engineering	10.1109/tit.2006.871582	Compressed Sensing	17803
2	2210	10 Technology	10.1109/tit.2006.871582	Compressed Sensing	17803
3	2867	0906 Electrical and Electronic Engineering	10.1109/tit.2006.871582	Compressed Sensing	17803
4	3001	1005 Communications Technologies	10.1109/tit.2006.871582	Compressed Sensing	17803
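
The table above has one row per publication and category pair, which is why the same DOI appears several times. The following sketch reproduces that flattening in plain Python so the shape of the result is easier to follow. It is an approximation of what `pd.json_normalize` does with `record_path` and `meta`, not its actual implementation. `flatten` is a hypothetical helper, the first record copies two of the rows shown above, and the second record is a made-up placeholder.

```python
def flatten(records, record_path, meta):
    """Emit one row per nested item, copying the `meta` fields from the parent record."""
    rows = []
    for record in records:
        for item in record.get(record_path, []):
            row = dict(item)                      # fields of the nested category
            for field in meta:
                row[field] = record.get(field)    # fields of the parent publication
            rows.append(row)
    return rows

publications = [
    {"doi": "10.1109/tit.2006.871582", "title": "Compressed Sensing", "times_cited": 17803,
     "category_for": [
         {"id": "2208", "name": "08 Information and Computing Sciences"},
         {"id": "2209", "name": "09 Engineering"},
     ]},
    # Made-up placeholder: a record whose category list is empty
    {"doi": None, "title": "Placeholder without categories", "times_cited": 0,
     "category_for": []},
]

rows = flatten(publications, "category_for", ["doi", "title", "times_cited"])
for row in rows:
    print(row["name"], "|", row["title"], "|", row["times_cited"])
```

One consequence worth keeping in mind: a publication with an empty category list contributes no rows, so it does not appear in the flattened data or in plots built from it.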

[10]:

px.scatter(df,
           x="times_cited", y="name",
           marginal_x="histogram",
           marginal_y="histogram",
           hover_data=["doi", "title"],
           height=600,
           title=f"Publications from {GRIDID} - Research Areas VS Citations")

Publications most cited: which journals?

[11]:

data = dslquery(f"""search publications
                    where research_orgs.id="{GRIDID}"
                    return publications[doi+title+times_cited+category_for+journal]
                    sort by times_cited limit 1000""")

df = data.as_dataframe()
#
px.scatter(df,
           x="times_cited", y="journal.title",
           marginal_x="histogram",
           marginal_y="histogram",
           height=600,
           title=f"Publications from {GRIDID} - Journals VS Citations")

Returned Publications: 1000 (total = 300560)
Time: 1.71s

Top Funders (by aggregated funding amount)

[12]:

fundersdata = dsl.query(f"""search grants
                        where research_orgs.id="{GRIDID}"
                        return funders aggregate funding
                        sort by funding""")
df = fundersdata.as_dataframe()
df.head(10)

Returned Funders: 20
Time: 0.65s

[12]:

	acronym	city_name	count	country_name	funding	id	latitude	linkout	longitude	name	state_name	types
0	NCI	Bethesda	849	United States	1.169651e+09	grid.48336.3a	39.004326	[http://www.cancer.gov/]	-77.101190	National Cancer Institute	Maryland	[Government]
1	NIGMS	Bethesda	804	United States	1.123588e+09	grid.280785.0	38.997833	[http://www.nigms.nih.gov/Pages/default.aspx]	-77.099380	National Institute of General Medical Sciences	Maryland	[Facility]
2	NIAID	Bethesda	597	United States	8.251580e+08	grid.419681.3	39.066647	[http://www.niaid.nih.gov/Pages/default.aspx]	-77.111830	National Institute of Allergy and Infectious D...	Maryland	[Facility]
3	NHLBI	Bethesda	649	United States	8.117882e+08	grid.279885.9	39.004280	[http://www.nhlbi.nih.gov/]	-77.100945	National Heart Lung and Blood Institute	Maryland	[Facility]
4	NSF MPS	Arlington	1319	United States	7.034854e+08	grid.457875.c	38.880566	[http://www.nsf.gov/dir/index.jsp?org=MPS]	-77.110990	Directorate for Mathematical & Physical Sciences	Virginia	[Government]
5	NINDS	Bethesda	564	United States	6.840811e+08	grid.416870.c	39.003826	[http://www.ninds.nih.gov/]	-77.101180	National Institute of Neurological Disorders a...	Maryland	[Facility]
6	NHGRI	Bethesda	159	United States	6.532875e+08	grid.280128.1	38.996967	[https://www.genome.gov/]	-77.096930	National Human Genome Research Institute	Maryland	[Facility]
7	NIMH	Bethesda	539	United States	5.899221e+08	grid.416868.5	39.003693	[https://www.nimh.nih.gov/index.shtml]	-77.104570	National Institute of Mental Health	Maryland	[Facility]
8	EPSRC	Swindon	91	United Kingdom	4.753546e+08	grid.421091.f	51.567093	[https://www.epsrc.ac.uk/]	-1.784602	Engineering and Physical Sciences Research Cou...	England	[Government]
9	NSF EHR	Arlington	134	United States	4.538210e+08	grid.457799.1	38.880580	[http://www.nsf.gov/dir/index.jsp?org=EHR]	-77.111000	Directorate for Education & Human Resources	Virginia	[Government]
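
A quick derived indicator from this table is the average funding per grant for each funder. The sketch below uses three rows copied from the output above. It assumes that `count` is the number of grant records per funder (the query searches grants), and it leaves the funding values in whatever unit the API returned, since the table does not state a currency.

```python
# (acronym, grant count, aggregated funding) copied from the table above
funders = [
    ("NCI", 849, 1.169651e+09),
    ("NSF MPS", 1319, 7.034854e+08),
    ("EPSRC", 91, 4.753546e+08),
]

# Average funding per grant; assumes count is the number of grants per funder
per_grant = {acronym: funding / count for acronym, count, funding in funders}

# Rank funders from the largest to the smallest average grant
ranked = sorted(per_grant.items(), key=lambda pair: pair[1], reverse=True)
for acronym, value in ranked:
    print(f"{acronym}: {value:,.0f} per grant")
```

With the full dataframe, the same ratio could be added as a column, for example `df["funding"] / df["count"]`.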

Top funders split by country of the funder

[13]:

px.bar(df,
       x="name", y="funding",
       color="country_name",
       title=f"Funding for {GRIDID} - by funder and country of the funder")

Correlation between Number of Grants and Funding

[14]:

# "count" comes from the grants search above, so it is the number of grants per funder
px.scatter(df,
           x="funding", y="count",
           color="name",
           height=600,
           title=f"Funding for {GRIDID} - Number of Grants VS Aggregated Funding Amount")

Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.