Enriching Grants part 3: adding related Patents and Clinical Trials data¶

In this third and final final part of the grants enrichment tutorial we are going to extract from Dimensions all Patents and Clinical Trials information linked to our vaccines grants datasets.

This tutorial builds on the previous one, Enriching Grants with Publications Information from Dimensions, and it assumes that our grants list already includes Dimensions IDs as well as publications counts for each grant.

The enriched grants list we are starting from can be downloaded here

[1]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Sep 22, 2022
==

Prerequisites¶

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:

!pip install dimcli tqdm plotly -U --quiet

import dimcli
from dimcli.utils import *

import sys, time, json
import pandas as pd
from tqdm.notebook import tqdm as progressbar

import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependecies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v0.9.9.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.2
Method: dsl.ini file

Reusing the enriched grants data from part-2¶

First, we are going to load the enriched grants dataset resulted from part-2 of this tutorial: “vaccines-grants-sample-part-2.csv”.

[3]:

grants = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-2.csv")
# pull out the grant IDs as a list
grantsids = list(grants['Dimensions ID'])

This file contains ~1k recent grants records on the topic of vaccines. Now we can preview the contents of the file.

[4]:

grants.head(10)

[4]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country	Dimensions ID	Resulting Publications
0	30410203277	疫苗－整体方案	1208.0	2004	2004	National Natural Science Foundation of China	China	grant.8172033	0
1	620792	Engineering Inhalable Vaccines	26956.0	2017	2018	Natural Sciences and Engineering Research Council	Canada	grant.7715379	0
2	599115	Engineering Inhalable Vaccines	26403.0	2016	2017	Natural Sciences and Engineering Research Council	Canada	grant.6962629	0
3	251564	HIV Vaccine research	442366.0	2003	2007	National Health and Medical Research Council	Australia	grant.6723913	0
4	334174	HIV Vaccine Development	236067.0	2005	2009	National Health and Medical Research Council	Australia	grant.6722306	1
5	910292	Dengue virus vaccine.	130890.0	1991	1993	National Health and Medical Research Council	Australia	grant.6716312	0
6	578221	Engineering Inhalable Vaccines	27386.0	2015	2016	Natural Sciences and Engineering Research Council	Canada	grant.5526688	0
7	IC18980360	Schistosomiasis Vaccine Network.	0.0	1998	2000	European Commission	Belgium	grant.3733803	0
8	7621798	Pneumococcal Ribosomal Vaccines	46000.0	1977	1980	Directorate for Biological Sciences	United States	grant.3274273	0
9	255890	Rational vaccine design	7138.0	2003	2004	Natural Sciences and Engineering Research Council	Canada	grant.2936015	0

Extracting linked Patents data¶

Using a similar methodology as with publications, we can easily extract all patents linked to each grant in two steps.

retrieve all the relevant patents records using the associated_grant_ids field (see also the data model and the patents API fields)
group patents by grant ID in so that we can have a single count per record

Note: in this case we can iterate 400 grants at a time cause in general there are much less associated patents per grant (compared to publications).

[5]:

#
# the main query
#
q = """search patents
        where associated_grant_ids in {}
      return patents[basics+associated_grant_ids]"""


#
# let's loop through all grants IDs in chunks and query Dimensions
#
print("===\nExtracting patents data ...")
results = []

for chunk in progressbar(list(chunks_of(list(grantsids), 400))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=False)
    results += data.patents
    time.sleep(1)

#
# put the patents data into a dataframe, remove duplicates and save
#
patents = pd.DataFrame().from_dict(results)
print("Patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Unique Patents found: ", len(patents))

if 'associated_grant_ids' in patents:
    # turning lists into strings to ensure compatibility with CSV loaded data
    # see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
    patents['associated_grant_ids'] = patents['associated_grant_ids'].apply(lambda x: ','.join(map(str, x)))
else:
    patents['associated_grant_ids'] = ""


#
# count patents per grant and enrich the original dataset
#
def patents_for_grantid(grantid):
  global patents
  return patents[patents['associated_grant_ids'].str.contains(grantid)]

print("===\nCounting patents per grant...")

l = []
for x in progressbar(grantsids):
  l.append(len(patents_for_grantid(x)))

grants['Associated Patents'] = l

print("===\nDone")

===
Extracting patents data ...

Patents found:  2326
Unique Patents found:  2235
===
Counting patents per grant...

===
Done

Let’s quickly preview the patents dataset, and the grants one, which now has an extra column counting patents.

[6]:

patents.head(5)

[6]:

	assignee_names	assignees	associated_grant_ids	filing_status	granted_year	id	inventor_names	publication_date	times_cited	title	year
0	[Duquesne Univ of Holy Spirit]	[{'acronym': 'DU', 'city_name': 'Pittsburgh', ...	grant.2459509,grant.2480714,grant.2481896,gran...	Grant	2018.0	US-9994586-B2	[GANGJEE ALEEM]	2018-06-12	0	Monocyclic, thieno, pyrido, and pyrrolo pyrimi...	2016
1	[Women and Infants Hospital of Rhode Island]	[{'city_name': 'Providence', 'country_name': '...	grant.2440150,grant.2480088	Grant	2018.0	US-9980982-B2	[MOORE RICHARD G, SINGH RAKESH K]	2018-05-29	0	HE4 based therapy for malignant disease	2012
2	[BioVentures LLC]	[{'city_name': 'Murfreesboro', 'country_name':...	grant.2480743	Grant	2018.0	US-9974849-B2	[NAKAGAWA MAYUMI, CHANG BYEONG S]	2018-05-22	0	Human Papilloma virus therapeutic vaccine	2014
3	[Duke University]	[{'city_name': 'Durham', 'country_name': 'Unit...	grant.2705114,grant.2435866,grant.2562651,gran...	Grant	2018.0	US-9974848-B2	[SAMPSON JOHN H, MITCHELL DUANE A, BATICH KRIS...	2018-05-22	6	Tetanus toxoid and CCL3 improve DC vaccines	2014
4	[Fred Hutchinson Cancer Research Center, Fred ...	[{'acronym': 'FHCRC', 'city_name': 'Seattle', ...	grant.2416431	Grant	2018.0	US-9956249-B2	[BERNSTEIN IRWIN D, HADLAND BRANDON K]	2018-05-01	2	Compositions and methods for expansion of embr...	2014

[7]:

grants.sort_values("Associated Patents", ascending=False).head(5)

[7]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country	Dimensions ID	Resulting Publications	Associated Patents
435	P30CA008748	Cancer Center Support Grant	250916048.0	1977	2023	National Cancer Institute	United States	grant.2438793	12716	316
431	P30CA021765	Cancer Center Support Grant (CCSG)	117691216.0	1977	2024	National Cancer Institute	United States	grant.2438833	2220	145
412	P30CA033572	Cancer Center Support Grant	55374368.0	1981	2022	National Cancer Institute	United States	grant.2438845	975	121
423	P30CA016672	Cancer Center Support Grant	207129520.0	1978	2024	National Cancer Institute	United States	grant.2438826	11839	120
310	U54AI057159	New England Regional Center of Excellence in B...	113703600.0	2003	2015	National Institute of Allergy and Infectious D...	United States	grant.2698981	438	82

Extracting linked Clinical Trials data¶

Now we can repeat the same process once more, for Clinical Trials. The field we need is called associated_grant_ids (see also the clinical trials API docs).

As with patents, we can iterate 400 grants at a time cause in general there is much less associated content per grant (compared to publications).

[9]:

#
# the main query
#
q = """search clinical_trials
        where associated_grant_ids in {}
      return clinical_trials[basics+associated_grant_ids]"""


#
# let's loop through all grants IDs in chunks and query Dimensions
#
print("===\nExtracting clinical trials data ...")
results = []
CHUNKSIZE = 300

for chunk in progressbar(list(chunks_of(list(grantsids), CHUNKSIZE))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=False)
    results += data.clinical_trials
    time.sleep(1)

#
# put the patents data into a dataframe, remove duplicates and save
#
clinical_trials = pd.DataFrame().from_dict(results)
print("Clinical Trials found: ", len(clinical_trials))
clinical_trials.drop_duplicates(subset='id', inplace=True)
print("Unique Clinical Trials found: ", len(clinical_trials))

if 'associated_grant_ids' in clinical_trials:
    # turning lists into strings to ensure compatibility with CSV loaded data
    clinical_trials['associated_grant_ids'] = clinical_trials['associated_grant_ids'].apply(lambda x: ','.join(map(str, x)))
else:
    clinical_trials['associated_grant_ids'] = ""


#
# count patents per grant and enrich the original dataset
#
def cltrials_for_grantid(grantid):
  global clinical_trials
  return clinical_trials[clinical_trials['associated_grant_ids'].str.contains(grantid)]

print("===\nCounting clinical trials per grant...")
l = []
for x in progressbar(grantsids):
  l.append(len(cltrials_for_grantid(x)))

grants['Associated Clinical Trials'] = l

print("===\nDone")

===
Extracting clinical trials data ...

Clinical Trials found:  4970
Unique Clinical Trials found:  4769
===
Counting clinical trials per grant...

===
Done

[10]:

clinical_trials.head(5)

[10]:

	active_years	associated_grant_ids	id	investigators	title
0	[2022, 2023]	grant.2438666,grant.9330831	NCT05543265	[[Mark A Clapp, MD, MPH, Principal Investigato...	Bridging the Gap From Postpartum to Primary Ca...
1	[2023, 2024, 2025, 2026, 2027]	grant.3536471	NCT05538897	[[Michaela O Grinsfelder, Principal Investigat...	A Phase IB and Randomized Phase II Trial of Me...
2	[2022, 2023, 2024, 2025, 2026, 2027, 2028, 202...	grant.3536099	NCT05538663	NaN	A Randomized Phase III Trial of Intravesical B...
3	[2022, 2023]	grant.7211819	NCT05535777	[[Peter Szilagyi, Principal Investigator, Univ...	Improving Influenza Vaccination Delivery Acros...
4	[2022, 2023]	grant.7211819	NCT05525494	[[Peter Szilagyi, MPH, MD, Principal Investiga...	Improving Influenza Vaccination Delivery Acros...

[11]:

grants.sort_values("Associated Clinical Trials", ascending=False).head(5)

[11]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country	Dimensions ID	Resulting Publications	Associated Patents	Associated Clinical Trials
415	P30CA015083	Mayo Comprehensive Cancer Center Grant	110102400.0	1980	2024	National Cancer Institute	United States	grant.2438813	1096	24	459
423	P30CA016672	Cancer Center Support Grant	207129520.0	1978	2024	National Cancer Institute	United States	grant.2438826	11839	120	394
410	U10CA032102	Southwest Oncology Group Treatment Grant	141286976.0	1981	2015	National Cancer Institute	United States	grant.2693350	229	2	264
436	P30CA015704	Cancer Center Support Grant	216676224.0	1977	2024	National Cancer Institute	United States	grant.2438816	1350	30	251
412	P30CA033572	Cancer Center Support Grant	55374368.0	1981	2022	National Cancer Institute	United States	grant.2438845	975	121	249

Let’s now save the data and preview it.

[12]:

# uncommment next line to save the data locally
# grants.to_csv("vaccines-grants-sample-part-3.csv", index=False)
grants.head(5)

[12]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country	Dimensions ID	Resulting Publications
0	30410203277	疫苗－整体方案	1208.0	2004	2004	National Natural Science Foundation of China	China	grant.8172033	0
1	620792	Engineering Inhalable Vaccines	26956.0	2017	2018	Natural Sciences and Engineering Research Council	Canada	grant.7715379	0
2	599115	Engineering Inhalable Vaccines	26403.0	2016	2017	Natural Sciences and Engineering Research Council	Canada	grant.6962629	0
3	251564	HIV Vaccine research	442366.0	2003	2007	National Health and Medical Research Council	Australia	grant.6723913	0
4	334174	HIV Vaccine Development	236067.0	2005	2009	National Health and Medical Research Council	Australia	grant.6722306	1

Data Exploration¶

Now we can explore a bit the grants+publications+patents+clinical_trials dataset using the plotly express library.

How many linked objects overall?¶

[13]:

df = pd.DataFrame({
    'measure' : ['Grants', 'Grants with pubs', 'Grants with Patents', 'Grants with Clinical Trials'],
    'count' : [len(grants), len(grants[grants['Resulting Publications'] > 0]), len(grants[grants['Associated Patents'] > 0]), len(grants[grants['Associated Clinical Trials'] > 0])],
})

px.bar(df,
       x="measure", y="count",
       title=f"Grants: overview of associated objects found")

Patents and Clinical Trials by Year¶

[14]:

px.bar(grants,
       x="End Year", y="Associated Patents",
       color="Funding Amount in USD",
       hover_name="Title",
       hover_data=['Dimensions ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
       title=f"Patents per grant")

[15]:

px.bar(grants,
       x="End Year", y="Associated Clinical Trials",
       color="Funding Amount in USD",
       hover_name="Title",
       hover_data=['Dimensions ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
       title=f"Clinical Trials per grant")

Patents and Clinical Trials by Grant Funder¶

[16]:

funders_patents = grants.query('`Associated Patents` > 0')\
    .groupby(['Funder', 'Funder Country'], as_index=False)\
    .sum()\
    .sort_values(by=["Associated Patents"], ascending=False)

funders_trials = grants.query('`Associated Clinical Trials` > 0')\
    .groupby(['Funder', 'Funder Country'], as_index=False)\
    .sum()\
    .sort_values(by=["Associated Clinical Trials"], ascending=False)

[17]:

px.bar(funders_patents,
       y="Associated Patents", x="Funder",
       color="Funder Country",
       hover_name="Funder",
       hover_data=['Funder', 'Funder Country'],
       title=f"Patents by Funders")

[18]:

px.bar(funders_trials,
       y="Associated Clinical Trials", x="Funder",
       color="Funder Country",
       hover_name="Funder",
       hover_data=['Funder', 'Funder Country'],
       title=f"Clinical Trials by Funders")

Exploring Correlations between dimensions¶

Tip: a straight diagonal indicates a strong correlation, while a 90 degree angle indicates no correlation.

[19]:

px.scatter_matrix(grants,
                  dimensions=["Associated Patents", "Associated Clinical Trials", "Resulting Publications"],
                  color="Funder Country")

Conclusion¶

In this tutorial we have enriched a grants dataset on the topic of ‘vaccines’ by adding information about patent and clinical trials.

Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.