Enriching Grants part 3: adding related Patents and Clinical Trials data
In this third and final part of the grants enrichment tutorial we are going to extract from Dimensions all Patents and Clinical Trials information linked to our vaccines grants dataset.
This tutorial builds on the previous one, Enriching Grants with Publications Information from Dimensions, and it assumes that our grants list already includes Dimensions IDs as well as publications counts for each grant.
The enriched grants list we are starting from can be downloaded here
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Sep 22, 2022
==
Prerequisites
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli tqdm plotly -U --quiet
import dimcli
from dimcli.utils import *
import sys, time, json
import pandas as pd
from tqdm.notebook import tqdm as progressbar
import plotly.express as px
if 'google.colab' not in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.9.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.2
Method: dsl.ini file
Reusing the enriched grants data from part 2
First, we are going to load the enriched grants dataset resulting from part 2 of this tutorial: “vaccines-grants-sample-part-2.csv”.
[3]:
grants = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-2.csv")
# pull out the grant IDs as a list
grantsids = list(grants['Dimensions ID'])
This file contains ~1k recent grants records on the topic of vaccines. Now we can preview the contents of the file.
[4]:
grants.head(10)
[4]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID | Resulting Publications |
---|---|---|---|---|---|---|---|---|---|
0 | 30410203277 | 疫苗-整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China | grant.8172033 | 0 |
1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 | 0 |
2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada | grant.6962629 | 0 |
3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia | grant.6723913 | 0 |
4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia | grant.6722306 | 1 |
5 | 910292 | Dengue virus vaccine. | 130890.0 | 1991 | 1993 | National Health and Medical Research Council | Australia | grant.6716312 | 0 |
6 | 578221 | Engineering Inhalable Vaccines | 27386.0 | 2015 | 2016 | Natural Sciences and Engineering Research Council | Canada | grant.5526688 | 0 |
7 | IC18980360 | Schistosomiasis Vaccine Network. | 0.0 | 1998 | 2000 | European Commission | Belgium | grant.3733803 | 0 |
8 | 7621798 | Pneumococcal Ribosomal Vaccines | 46000.0 | 1977 | 1980 | Directorate for Biological Sciences | United States | grant.3274273 | 0 |
9 | 255890 | Rational vaccine design | 7138.0 | 2003 | 2004 | Natural Sciences and Engineering Research Council | Canada | grant.2936015 | 0 |
Extracting linked Patents data
Using a methodology similar to the one used for publications, we can extract all patents linked to each grant in two steps:
1. Retrieve all the relevant patent records using the associated_grant_ids field (see also the data model and the patents API fields).
2. Group patents by grant ID so that we have a single count per grant.
Note: in this case we can iterate over 400 grants at a time, because grants generally have far fewer associated patents than publications.
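The batching step can be sketched in isolation. The snippet below is a minimal stand-in for the `chunks_of` helper imported from `dimcli.utils`; the name `chunked` and the sample IDs are made up for illustration, and the real helper may be implemented differently.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# hypothetical grant IDs, only used to show the batching behaviour
sample_ids = [f"grant.{n}" for n in range(1, 1001)]
batches = list(chunked(sample_ids, 400))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [400, 400, 200]
```

Each batch then becomes one API query, so a larger chunk size means fewer requests but more records returned per request.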
[5]:
#
# the main query
#
q = """search patents
where associated_grant_ids in {}
return patents[basics+associated_grant_ids]"""
#
# let's loop through all grants IDs in chunks and query Dimensions
#
print("===\nExtracting patents data ...")
results = []
for chunk in progressbar(list(chunks_of(list(grantsids), 400))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=False)
    results += data.patents
    time.sleep(1)
#
# put the patents data into a dataframe and remove duplicates
#
patents = pd.DataFrame.from_dict(results)
print("Patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Unique Patents found: ", len(patents))
if 'associated_grant_ids' in patents:
    # turning lists into strings to ensure compatibility with CSV loaded data
    # see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
    patents['associated_grant_ids'] = patents['associated_grant_ids'].apply(lambda x: ','.join(map(str, x)))
else:
    patents['associated_grant_ids'] = ""
#
# count patents per grant and enrich the original dataset
#
def patents_for_grantid(grantid):
    # regex=False matches the ID literally ('.' would otherwise act as a regex wildcard)
    return patents[patents['associated_grant_ids'].str.contains(grantid, regex=False)]
print("===\nCounting patents per grant...")
l = []
for x in progressbar(grantsids):
    l.append(len(patents_for_grantid(x)))
grants['Associated Patents'] = l
print("===\nDone")
===
Extracting patents data ...
Patents found: 2326
Unique Patents found: 2235
===
Counting patents per grant...
===
Done
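One caveat about the counting step: `str.contains` checks for a substring, so a grant ID that happens to be a prefix of a longer ID could be over-counted. The sketch below shows an exact-match alternative that works on the raw list-valued records, using only the standard library. All records and IDs are made up for illustration, and a plain `in` check stands in for the pandas call.

```python
from collections import Counter

# made-up records shaped like the API results: each patent lists its grant IDs
sample_patents = [
    {"id": "patent-A", "associated_grant_ids": ["grant.2438793", "grant.2438833"]},
    {"id": "patent-B", "associated_grant_ids": ["grant.2438793"]},
    {"id": "patent-C", "associated_grant_ids": ["grant.243879"]},
    {"id": "patent-D"},  # a record without the field
]
sample_grant_ids = ["grant.2438793", "grant.2438833", "grant.243879", "grant.1"]

# count each (patent, grant) link once, comparing whole IDs
links = Counter()
for patent in sample_patents:
    for grant_id in set(patent.get("associated_grant_ids") or []):
        links[grant_id] += 1

exact_counts = [links[g] for g in sample_grant_ids]
print(exact_counts)  # [2, 1, 1, 0]

# substring matching also finds "grant.243879" inside "grant.2438793"
joined = [",".join(p.get("associated_grant_ids") or []) for p in sample_patents]
substring_counts = [sum(g in s for s in joined) for g in sample_grant_ids]
print(substring_counts)  # [2, 1, 3, 0]
```

Whether such prefix collisions actually occur in a given grants list depends on the IDs involved, so this is worth checking rather than assuming.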
Let’s quickly preview the patents dataset, and the grants one, which now has an extra column counting patents.
[6]:
patents.head(5)
[6]:
| | assignee_names | assignees | associated_grant_ids | filing_status | granted_year | id | inventor_names | publication_date | times_cited | title | year |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | [Duquesne Univ of Holy Spirit] | [{'acronym': 'DU', 'city_name': 'Pittsburgh', ... | grant.2459509,grant.2480714,grant.2481896,gran... | Grant | 2018.0 | US-9994586-B2 | [GANGJEE ALEEM] | 2018-06-12 | 0 | Monocyclic, thieno, pyrido, and pyrrolo pyrimi... | 2016 |
1 | [Women and Infants Hospital of Rhode Island] | [{'city_name': 'Providence', 'country_name': '... | grant.2440150,grant.2480088 | Grant | 2018.0 | US-9980982-B2 | [MOORE RICHARD G, SINGH RAKESH K] | 2018-05-29 | 0 | HE4 based therapy for malignant disease | 2012 |
2 | [BioVentures LLC] | [{'city_name': 'Murfreesboro', 'country_name':... | grant.2480743 | Grant | 2018.0 | US-9974849-B2 | [NAKAGAWA MAYUMI, CHANG BYEONG S] | 2018-05-22 | 0 | Human Papilloma virus therapeutic vaccine | 2014 |
3 | [Duke University] | [{'city_name': 'Durham', 'country_name': 'Unit... | grant.2705114,grant.2435866,grant.2562651,gran... | Grant | 2018.0 | US-9974848-B2 | [SAMPSON JOHN H, MITCHELL DUANE A, BATICH KRIS... | 2018-05-22 | 6 | Tetanus toxoid and CCL3 improve DC vaccines | 2014 |
4 | [Fred Hutchinson Cancer Research Center, Fred ... | [{'acronym': 'FHCRC', 'city_name': 'Seattle', ... | grant.2416431 | Grant | 2018.0 | US-9956249-B2 | [BERNSTEIN IRWIN D, HADLAND BRANDON K] | 2018-05-01 | 2 | Compositions and methods for expansion of embr... | 2014 |
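The `associated_grant_ids` column shown above holds comma-separated strings rather than lists. A small sketch of the round trip, in case the list form is needed again later; the two IDs are taken from the second row of the preview, and `split_ids` is a hypothetical helper name.

```python
# value shaped like the `associated_grant_ids` field of one record
grant_ids = ["grant.2440150", "grant.2480088"]

# list -> string, as done in the extraction cell
as_text = ",".join(map(str, grant_ids))


def split_ids(text: str) -> list:
    """Turn a comma-separated string back into a list of IDs.

    An empty string maps to an empty list rather than [""].
    """
    return [part for part in text.split(",") if part]


restored = split_ids(as_text)
print(as_text)
print(restored)
```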
[7]:
grants.sort_values("Associated Patents", ascending=False).head(5)
[7]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID | Resulting Publications | Associated Patents |
---|---|---|---|---|---|---|---|---|---|---|
435 | P30CA008748 | Cancer Center Support Grant | 250916048.0 | 1977 | 2023 | National Cancer Institute | United States | grant.2438793 | 12716 | 316 |
431 | P30CA021765 | Cancer Center Support Grant (CCSG) | 117691216.0 | 1977 | 2024 | National Cancer Institute | United States | grant.2438833 | 2220 | 145 |
412 | P30CA033572 | Cancer Center Support Grant | 55374368.0 | 1981 | 2022 | National Cancer Institute | United States | grant.2438845 | 975 | 121 |
423 | P30CA016672 | Cancer Center Support Grant | 207129520.0 | 1978 | 2024 | National Cancer Institute | United States | grant.2438826 | 11839 | 120 |
310 | U54AI057159 | New England Regional Center of Excellence in B... | 113703600.0 | 2003 | 2015 | National Institute of Allergy and Infectious D... | United States | grant.2698981 | 438 | 82 |
Extracting linked Clinical Trials data
Now we can repeat the same process once more, for Clinical Trials. The field we need is again called associated_grant_ids (see also the clinical trials API docs).
As with patents, we can iterate over a few hundred grants at a time (the code below uses chunks of 300), because grants generally have far less associated content of this kind than publications.
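To make the query construction concrete, the sketch below fills the query template for a single small chunk and prints the resulting DSL string. It needs no API access. The two grant IDs appear in the clinical trials preview further down; the variable names are chosen for illustration only.

```python
import json

query_template = """search clinical_trials
    where associated_grant_ids in {}
    return clinical_trials[basics+associated_grant_ids]"""

# a tiny chunk of grant IDs; real chunks hold a few hundred
chunk = ["grant.7211819", "grant.3536471"]

# json.dumps renders the Python list as a double-quoted, bracketed list
query = query_template.format(json.dumps(chunk))
print(query)
```

Printing one formatted query like this is a quick way to check the filter before sending hundreds of IDs to the API.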
[9]:
#
# the main query
#
q = """search clinical_trials
where associated_grant_ids in {}
return clinical_trials[basics+associated_grant_ids]"""
#
# let's loop through all grants IDs in chunks and query Dimensions
#
print("===\nExtracting clinical trials data ...")
results = []
CHUNKSIZE = 300
for chunk in progressbar(list(chunks_of(list(grantsids), CHUNKSIZE))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=False)
    results += data.clinical_trials
    time.sleep(1)
#
# put the clinical trials data into a dataframe and remove duplicates
#
clinical_trials = pd.DataFrame.from_dict(results)
print("Clinical Trials found: ", len(clinical_trials))
clinical_trials.drop_duplicates(subset='id', inplace=True)
print("Unique Clinical Trials found: ", len(clinical_trials))
if 'associated_grant_ids' in clinical_trials:
    # turning lists into strings to ensure compatibility with CSV loaded data
    clinical_trials['associated_grant_ids'] = clinical_trials['associated_grant_ids'].apply(lambda x: ','.join(map(str, x)))
else:
    clinical_trials['associated_grant_ids'] = ""
#
# count clinical trials per grant and enrich the original dataset
#
def cltrials_for_grantid(grantid):
    # regex=False matches the ID literally ('.' would otherwise act as a regex wildcard)
    return clinical_trials[clinical_trials['associated_grant_ids'].str.contains(grantid, regex=False)]
print("===\nCounting clinical trials per grant...")
l = []
for x in progressbar(grantsids):
    l.append(len(cltrials_for_grantid(x)))
grants['Associated Clinical Trials'] = l
print("===\nDone")
===
Extracting clinical trials data ...
Clinical Trials found: 4970
Unique Clinical Trials found: 4769
===
Counting clinical trials per grant...
===
Done
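The gap between the "found" and "unique" totals arises because a record linked to grants in different chunks is returned once per chunk. The sketch below reproduces the de-duplication on made-up records with plain Python, keeping the first occurrence of each `id` as `drop_duplicates` does by default.

```python
# made-up results: the same trial comes back from two different chunks
results = [
    {"id": "trial-1", "associated_grant_ids": ["grant.A", "grant.B"]},
    {"id": "trial-2", "associated_grant_ids": ["grant.A"]},
    {"id": "trial-1", "associated_grant_ids": ["grant.A", "grant.B"]},
]

# keep the first record seen for each id, preserving the original order
unique_by_id = {}
for record in results:
    unique_by_id.setdefault(record["id"], record)
unique_results = list(unique_by_id.values())

print(len(results), len(unique_results))  # 3 2
```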
[10]:
clinical_trials.head(5)
[10]:
| | active_years | associated_grant_ids | id | investigators | title |
---|---|---|---|---|---|
0 | [2022, 2023] | grant.2438666,grant.9330831 | NCT05543265 | [[Mark A Clapp, MD, MPH, Principal Investigato... | Bridging the Gap From Postpartum to Primary Ca... |
1 | [2023, 2024, 2025, 2026, 2027] | grant.3536471 | NCT05538897 | [[Michaela O Grinsfelder, Principal Investigat... | A Phase IB and Randomized Phase II Trial of Me... |
2 | [2022, 2023, 2024, 2025, 2026, 2027, 2028, 202... | grant.3536099 | NCT05538663 | NaN | A Randomized Phase III Trial of Intravesical B... |
3 | [2022, 2023] | grant.7211819 | NCT05535777 | [[Peter Szilagyi, Principal Investigator, Univ... | Improving Influenza Vaccination Delivery Acros... |
4 | [2022, 2023] | grant.7211819 | NCT05525494 | [[Peter Szilagyi, MPH, MD, Principal Investiga... | Improving Influenza Vaccination Delivery Acros... |
[11]:
grants.sort_values("Associated Clinical Trials", ascending=False).head(5)
[11]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID | Resulting Publications | Associated Patents | Associated Clinical Trials |
---|---|---|---|---|---|---|---|---|---|---|---|
415 | P30CA015083 | Mayo Comprehensive Cancer Center Grant | 110102400.0 | 1980 | 2024 | National Cancer Institute | United States | grant.2438813 | 1096 | 24 | 459 |
423 | P30CA016672 | Cancer Center Support Grant | 207129520.0 | 1978 | 2024 | National Cancer Institute | United States | grant.2438826 | 11839 | 120 | 394 |
410 | U10CA032102 | Southwest Oncology Group Treatment Grant | 141286976.0 | 1981 | 2015 | National Cancer Institute | United States | grant.2693350 | 229 | 2 | 264 |
436 | P30CA015704 | Cancer Center Support Grant | 216676224.0 | 1977 | 2024 | National Cancer Institute | United States | grant.2438816 | 1350 | 30 | 251 |
412 | P30CA033572 | Cancer Center Support Grant | 55374368.0 | 1981 | 2022 | National Cancer Institute | United States | grant.2438845 | 975 | 121 | 249 |
Let’s now save the data and preview it.
[12]:
# uncomment the next line to save the data locally
# grants.to_csv("vaccines-grants-sample-part-3.csv", index=False)
grants.head(5)
[12]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID | Resulting Publications | Associated Patents | Associated Clinical Trials |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30410203277 | 疫苗-整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China | grant.8172033 | 0 | 0 | 0 |
1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 | 0 | 0 | 0 |
2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada | grant.6962629 | 0 | 0 | 0 |
3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia | grant.6723913 | 0 | 0 | 0 |
4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia | grant.6722306 | 1 | 0 | 0 |
Data Exploration
Now we can explore the grants+publications+patents+clinical_trials dataset a bit using the plotly express library.
How many linked objects overall?
[13]:
df = pd.DataFrame({
'measure' : ['Grants', 'Grants with pubs', 'Grants with Patents', 'Grants with Clinical Trials'],
'count' : [len(grants), len(grants[grants['Resulting Publications'] > 0]), len(grants[grants['Associated Patents'] > 0]), len(grants[grants['Associated Clinical Trials'] > 0])],
})
px.bar(df,
x="measure", y="count",
title=f"Grants: overview of associated objects found")
Patents and Clinical Trials by Year
[14]:
px.bar(grants,
x="End Year", y="Associated Patents",
color="Funding Amount in USD",
hover_name="Title",
hover_data=['Dimensions ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
title=f"Patents per grant")
[15]:
px.bar(grants,
x="End Year", y="Associated Clinical Trials",
color="Funding Amount in USD",
hover_name="Title",
hover_data=['Dimensions ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
title=f"Clinical Trials per grant")
Patents and Clinical Trials by Grant Funder
[16]:
# numeric_only=True restricts the sum to numeric columns; recent pandas versions
# no longer drop text columns (e.g. Title) silently when aggregating
funders_patents = grants.query('`Associated Patents` > 0')\
    .groupby(['Funder', 'Funder Country'], as_index=False)\
    .sum(numeric_only=True)\
    .sort_values(by=["Associated Patents"], ascending=False)
funders_trials = grants.query('`Associated Clinical Trials` > 0')\
    .groupby(['Funder', 'Funder Country'], as_index=False)\
    .sum(numeric_only=True)\
    .sort_values(by=["Associated Clinical Trials"], ascending=False)
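For readers less familiar with pandas, the filter, group, sum and sort chain above can be sketched with plain Python. The first three rows reuse figures from the tables shown earlier; the last row is entirely made up, and the variable names are illustrative.

```python
from collections import defaultdict

rows = [
    {"Funder": "National Cancer Institute", "Funder Country": "United States", "Associated Patents": 316},
    {"Funder": "National Cancer Institute", "Funder Country": "United States", "Associated Patents": 145},
    {"Funder": "National Health and Medical Research Council", "Funder Country": "Australia", "Associated Patents": 0},
    {"Funder": "Hypothetical Funder", "Funder Country": "Nowhere", "Associated Patents": 3},  # made up
]

# filter (patents > 0), then group by (funder, country) and sum
totals = defaultdict(int)
for row in rows:
    if row["Associated Patents"] > 0:
        totals[(row["Funder"], row["Funder Country"])] += row["Associated Patents"]

# sort funders by total patents, largest first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for (funder, country), total in ranked:
    print(funder, country, total)
```

Funders whose grants have no linked patents drop out at the filter step, which is why they do not appear in the bar charts that follow.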
[17]:
px.bar(funders_patents,
y="Associated Patents", x="Funder",
color="Funder Country",
hover_name="Funder",
hover_data=['Funder', 'Funder Country'],
title=f"Patents by Funders")
[18]:
px.bar(funders_trials,
y="Associated Clinical Trials", x="Funder",
color="Funder Country",
hover_name="Funder",
hover_data=['Funder', 'Funder Country'],
title=f"Clinical Trials by Funders")
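Exploring Correlations between dimensions
Tip: in each panel, points that fall along a straight diagonal line indicate a strong correlation between the two measures, while points that hug the two axes (forming a 90-degree angle) indicate no correlation.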
[19]:
px.scatter_matrix(grants,
dimensions=["Associated Patents", "Associated Clinical Trials", "Resulting Publications"],
color="Funder Country")
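The visual reading of the scatter matrix can be complemented with a number. The sketch below computes the Pearson correlation coefficient by hand for the five grants listed in the clinical trials ranking above. Five top-ranked rows are a biased sample, so the value only illustrates the calculation. On the full dataset, `grants[[...]].corr()` in pandas should give the equivalent matrix.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# 'Associated Patents' and 'Associated Clinical Trials' for the five grants above
patents_counts = [24, 120, 2, 30, 121]
trials_counts = [459, 394, 264, 251, 249]

r = pearson(patents_counts, trials_counts)
print(round(r, 2))
```

Values close to 1 or -1 correspond to the "straight diagonal" pattern, and values near 0 to the absence of a linear relationship.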
Conclusion
In this tutorial we have enriched a grants dataset on the topic of ‘vaccines’ by adding information about patents and clinical trials.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.