Enriching Grants part 2: Adding Publications Information from Dimensions¶
In this tutorial we enrich a grants dataset by adding information about number of publications per grant. This information will then let us draw some interesting statistics about the impact of grants & funders from the point of view of research outputs.
This tutorial builds on the previous one, Matching your grants records to Dimensions, and it assumes that our grants list already includes Dimensions IDs for each grant.
The grants dataset we are starting from focuses on the broad topic of ‘vaccines’ and can be downloaded here.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli tqdm plotly -U --quiet
import dimcli
from dimcli.utils import *
import sys, time, json
import pandas as pd
from tqdm.notebook import tqdm as progressbar
import plotly.express as px
if not 'google.colab' in sys.modules:
# make js dependecies local / needed by html exports
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
import getpass
KEY = getpass.getpass(prompt='API Key: ')
dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
KEY = ""
dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
Reusing the sample grants data from part-1¶
First, we are going to load the sample grants dataset from part-1 of this tutorial: “vaccines-grants-sample-part-1.csv”.
[3]:
grants = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-1.csv")
The dataset that was created in part-1 of this tutorial contains ~1000 grants. Even if we had more, the same steps descrived in what follows still apply (but of course it’ll take more time to extract and process the data).
[4]:
# remove rows without a Dimensions ID
grants.dropna(subset=["Dimensions ID"], inplace=True)
grantsids = grants['Dimensions ID'].to_list()
Now we can preview the contents of the file.
[5]:
grants.head(10)
[5]:
Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID | |
---|---|---|---|---|---|---|---|---|
0 | 30410203277 | 疫苗-整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China | grant.8172033 |
1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 |
2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada | grant.6962629 |
3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia | grant.6723913 |
4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia | grant.6722306 |
5 | 910292 | Dengue virus vaccine. | 130890.0 | 1991 | 1993 | National Health and Medical Research Council | Australia | grant.6716312 |
6 | 578221 | Engineering Inhalable Vaccines | 27386.0 | 2015 | 2016 | Natural Sciences and Engineering Research Council | Canada | grant.5526688 |
7 | IC18980360 | Schistosomiasis Vaccine Network. | 0.0 | 1998 | 2000 | European Commission | Belgium | grant.3733803 |
8 | 7621798 | Pneumococcal Ribosomal Vaccines | 46000.0 | 1977 | 1980 | Directorate for Biological Sciences | United States | grant.3274273 |
9 | 255890 | Rational vaccine design | 7138.0 | 2003 | 2004 | Natural Sciences and Engineering Research Council | Canada | grant.2936015 |
Extracting linked Publications data from Dimensions¶
Tip: see the API data model page for more details on which relationships exist between publications, grants and the other Dimensions source types.
The generic query for extracting publications linked to a grant goes like this:
search publications
where supporting_grant_ids in ['grant.3274273', 'grant.2936015', etc.. ]
return
publications[title+doi+year+supporting_grant_ids]
This query is pretty straighforward. Note that, since we are extracting a merged list publications for several grants at the same time, we are including in the results the ‘supporting_grant_ids’ data (return publications[title+doi+year+supporting_grant_ids]
). This will allow us to ‘disentangle’ the list later on so to know exactly how many publications are linked to each single grant.
Building a looped extraction¶
Since we are using the query above to extract linked publications for thousands of grants there are a few more things to keep in mind:
Grants IDs per query. In general, the API can handle up to 300-400 grants IDs per query. This is number isn’t set in stone though, but rather it should fine-tuned by trial and error, also considering the impact of the following points.
Query complexity. If the query contains many other constraints (ie
where
clauses) these will impact the query complexity/speed, hence indirectly the max number of grants IDs it can handle.Response time. It’s always useful to keep an eye on the time it takes to get back results: for example, it may be more efficient to retrieve less grants per query and have more queries overall.
Total number of records. One should keep an eye on the overall number of records (= publications) coming back from a single query: if it’s up to 1000, one
dsl.query
statement is enough. If instead there are more than 1000 records, that means we need to add another inner loop to extract all the data, or use thedsl.query_iterative
helper function.
[6]:
#
# the main query
#
q = """search publications
where supporting_grant_ids in {}
return publications[title+doi+year+supporting_grant_ids]"""
#
# let's loop through all grants IDs in chunks and query Dimensions
#
results = []
for chunk in progressbar(list(chunks_of(list(grantsids), 200))):
data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=False)
results += data.publications
time.sleep(1)
#
# put the data into a dataframe, remove duplicates and save
#
pubs = pd.DataFrame().from_dict(results)
print("Publications found: ", len(pubs))
pubs.drop_duplicates(subset='doi', inplace=True)
print("Unique publications found: ", len(pubs))
# turning lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
pubs['supporting_grant_ids'] = pubs['supporting_grant_ids'].apply(lambda x: ','.join(map(str, x)))
Publications found: 82027
Unique publications found: 78239
Let’s preview the publications data now:
[7]:
pubs.head(5)
[7]:
doi | supporting_grant_ids | title | year | |
---|---|---|---|---|
0 | 10.1158/1078-0432.ccr-21-3659 | grant.9018788,grant.2438856,grant.2440246 | Phase I trial combining chemokine-targeting wi... | 2022 |
1 | 10.1093/jac/dkab490 | grant.7751928,grant.2691196 | Proximal tubular dysfunction in pregnant women... | 2022 |
2 | 10.1038/s41375-021-01494-w | grant.6499387,grant.8884725,grant.7073139 | Inhibition of the deubiquitinating enzyme USP4... | 2022 |
3 | 10.1016/s1470-2045(21)00718-x | grant.3804391,grant.3536471,grant.3536478 | Trastuzumab with trimodality treatment for oes... | 2022 |
4 | 10.1080/2162402x.2021.2020983 | grant.2440246,grant.7922722,grant.2438818,gran... | Identification of Claudin 6-specific HLA class... | 2022 |
Final step: grouping publications by grant¶
The publications dataset we obtained can be ‘turned inside out’ so that we have one row per grant, and information about how many publications are linked to it.
One approach is to use a simple function that from a grantID
will return how many publications are related to it.
[8]:
def pubs_for_grantid(grantid):
global pubs
return pubs[pubs['supporting_grant_ids'].str.contains(grantid)]
Using this function, we can loop through the original list of grants and calculate the tot number of pubs for each of them.
The results are then used to enrich the original table with one extra column called ‘pubs’.
[9]:
l = []
for x in progressbar(grantsids):
l.append(len(pubs_for_grantid(x)))
grants['Resulting Publications'] = l
Let’s now preview the data:
[17]:
# grants.to_csv("vaccines-grants-sample-part-2.csv", index=False)
grants.head(5)
[17]:
Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID | Resulting Publications | |
---|---|---|---|---|---|---|---|---|---|
0 | 30410203277 | 疫苗-整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China | grant.8172033 | 0 |
1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 | 0 |
2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada | grant.6962629 | 0 |
3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia | grant.6723913 | 0 |
4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia | grant.6722306 | 1 |
Data Exploration¶
Now we can explore a bit the grants+publications dataset using the plotly express library.
Publications per grant by year and funding amount¶
[11]:
px.bar(grants,
x="End Year", y="Resulting Publications",
color="Funding Amount in USD",
hover_name="Title",
hover_data=['Dimensions ID', 'Start Year', 'End Year', 'Funder', 'Funder Country', "Grant Number"],
title=f"Publications per grant")
Publications per grant by country¶
[12]:
px.bar(grants,
x="Funder Country", y="Resulting Publications",
color="End Year",
hover_name="Title",
hover_data=['Dimensions ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
title=f"Publications per grant")
Correlation of num of publications to grant length¶
[13]:
px.scatter(grants.query('`Resulting Publications` > 0'),
y="End Year", x="Start Year",
size="Resulting Publications",
color="Funder Country",
marginal_x="histogram",
hover_name="Title",
hover_data=['Dimensions ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
trendline="ols",
title=f"Tot Publications vs grant length")
/Users/michele.pasin/Envs/jupyterlab/lib/python3.9/site-packages/statsmodels/tools/_testing.py:19: FutureWarning:
pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
Publications by grant funder¶
[14]:
funders = grants.query('`Resulting Publications` > 0')\
.groupby(['Funder', 'Funder Country'], as_index=False)\
.sum().sort_values(by=["Resulting Publications"], ascending=False)
[15]:
px.bar(funders,
y="Resulting Publications", x="Funder",
color="Funder Country",
hover_name="Funder",
hover_data=['Funder', 'Funder Country'],
height=500,
title=f"Funders")
Publications by grant funder vs funding amount¶
[16]:
px.scatter(funders,
y="Resulting Publications", x="Funding Amount in USD",
color="Funder Country",
hover_name="Funder",
hover_data=['Funder', 'Funder Country'],
title=f"Funders")
Conclusion¶
In this tutorial we have enriched a grants dataset on the topic of ‘vaccines’ by adding information about number of publications per grant. This information has let us draw some interesting statistics about the impact of grants & funders from the point of view of research outputs.
In the next tutorial we will continue the analysis by enriching the data also with patents and clinical trials information.
Note
The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.