Enriching Grants part 1: Matching your grants records to Dimensions¶
In this tutorial we are going to match a sample local grants dataset to Dimensions.
By matching we mean discovering the unique Dimensions identifier for each of these grants, so that we can then use the IDs to extract related objects from Dimensions (e.g. the researchers, publications, patents and clinical trials linked to the grants).
A sample grants list¶
Our starting point is a sample list of completed grants on the topic of vaccines, which contains common fields such as title, funder, grant/project ID and funding amount.
We will show below how to enrich this dataset with Dimensions IDs.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Apr 19, 2023
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli tqdm -U --quiet
import dimcli
from dimcli.utils import *
import sys, time
import pandas as pd
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file
Loading the sample grants data¶
First we are going to load the sample dataset “vaccines-grants-sample-part-0.csv”.
[3]:
grants_source = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-0.csv")
Now we can preview the contents of the file.
[4]:
grants_source.head(5)
[4]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country |
---|---|---|---|---|---|---|---|
0 | 30410203277 | 疫苗-整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China |
1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada |
2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada |
3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia |
4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia |
Matching grants data¶
There are two possible situations to consider.
A) Matching grants when we have a grant number¶
The Dimensions API includes a function ‘extract_grants’ that makes it easier to find a grant in Dimensions, by using information such as funder name and the funder grant identifier.
So one approach is the following:
[5]:
dsl.query("""extract_grants(grant_number="30410203277",
funder_name="National Natural Science Foundation of China")""").json
[5]:
{'grant_id': 'grant.8172033'}
[6]:
dsl.query("""extract_grants(grant_number="334174",
funder_name="National Health and Medical Research Council")""").json
[6]:
{'grant_id': 'grant.6722306'}
Note: this does not work without a funder name.
[7]:
dsl.query("""extract_grants(grant_number="30410203277",
funder_name="")""").json
[7]:
{'grant_id': None}
Alternatively, we can pass a Fundref ID instead of the funder name, if we have one (Fundref IDs are also available via GRID).
[8]:
dsl.query("""extract_grants(grant_number="30410203277",
fundref="501100001809")""").json
[8]:
{'grant_id': 'grant.8172033'}
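The two variants above (funder name or Fundref ID) can be wrapped in a small helper that builds the query string. This is only a sketch: `build_extract_grants_query` is a hypothetical name, not part of Dimcli, and it uses only the three `extract_grants` arguments shown in this tutorial. It refuses to build a query when the grant number or the funder information is missing, since the example above suggests such a query finds nothing.

```python
def build_extract_grants_query(grant_number, funder_name=None, fundref=None):
    """Build an extract_grants DSL string (hypothetical helper, not part of Dimcli)."""
    # Reject None, NaN (NaN is the only value not equal to itself) and blank strings
    if (grant_number is None or grant_number != grant_number
            or str(grant_number).strip() == ""):
        raise ValueError("grant_number is required")
    # Prefer the Fundref ID when available, otherwise fall back to the funder name
    if fundref:
        funder_arg = f'fundref="{fundref}"'
    elif funder_name:
        funder_arg = f'funder_name="{funder_name}"'
    else:
        raise ValueError("either funder_name or fundref is required")
    return f'extract_grants(grant_number="{grant_number}", {funder_arg})'


query = build_extract_grants_query("30410203277", fundref="501100001809")
print(query)
```

The resulting string can then be passed to `dsl.query(...)` exactly as in the cells above.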
B) What if we don’t have a grant number?¶
Then the only way is to:
- query Dimensions using the best grants metadata we have available
- if we get only one result, we just take it
- if we get more than one result, we need to manually review them at a later point, or develop some sort of algorithm to sort them by relevancy so that we can take the first result
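As an illustration of the last point, here is a minimal sketch of one possible relevancy ranking: candidates are scored by title similarity (using the standard library's `difflib`) plus a bonus when the start year matches. The function name `rank_candidates`, the 0.5 bonus and the candidate IDs are all made up for the example; the `id`, `title` and `start_year` keys mirror fields that Dimensions returns for grants.

```python
from difflib import SequenceMatcher


def rank_candidates(local, candidates):
    """Sort candidate grant records by similarity to the local record, best first."""
    def score(candidate):
        # Title similarity between 0.0 and 1.0, case-insensitive
        title_score = SequenceMatcher(
            None, local["title"].lower(), (candidate.get("title") or "").lower()
        ).ratio()
        # Arbitrary bonus (an assumption) when the start year agrees
        same_year = candidate.get("start_year") == local.get("start_year")
        return title_score + (0.5 if same_year else 0.0)

    return sorted(candidates, key=score, reverse=True)


# Placeholder records: the IDs below are illustrative, not real Dimensions IDs
local = {"title": "Engineering Inhalable Vaccines", "start_year": 2016}
candidates = [
    {"id": "grant.A", "title": "Engineering Inhalable Vaccines", "start_year": 2017},
    {"id": "grant.B", "title": "Engineering Inhalable Vaccines", "start_year": 2016},
    {"id": "grant.C", "title": "Engineering Inhalable Vaccines", "start_year": 2015},
]
best = rank_candidates(local, candidates)[0]
print(best["id"])
```

Here the three titles are identical, so the start year decides the ranking. Other fields, such as funder name or funding amount, could be added to the score in the same way.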
Let’s take some of the grants without a number:
[9]:
grants_without_number = grants_source[grants_source['Grant Number'].isnull()]
[10]:
grants_without_number.head(5)
[10]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country |
---|---|---|---|---|---|---|---|
38 | NaN | DENGUE VACCINE DEVELOPMENT | 50000.0 | 1985 | 1986 | United States Department of the Army | United States |
79 | NaN | Sterilization Vaccine for Cattle | 0.0 | 2006 | 2006 | Council for International Exchange of Scholars | United States |
80 | NaN | Vaccine Production in Plants | 0.0 | 2011 | 2011 | Council for International Exchange of Scholars | United States |
81 | NaN | Novel vaccine formulations against tuberculosis | 295003.0 | 2002 | 2005 | Canadian Institutes of Health Research | Canada |
446 | NaN | Development of recombinant TB vaccine | 101459.0 | 1999 | 2003 | Canadian Institutes of Health Research | Canada |
Now let’s try to find the third grant in the list above, using only its title and the funder country:
[11]:
%%dsldf
search grants
in title_only for "Vaccine Production in Plants"
where funder_countries.name="United States"
return grants
Returned Grants: 1 (total = 1)
Time: 0.65s
WARNINGS [1]
Field 'funder_countries' is deprecated in favor of funder_org_countries. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
[11]:
| | id | title | active_year | end_date | funder_org_name | funder_orgs | grant_number | language | original_title | start_date | start_year |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | grant.9923948 | Vaccine Production in Plants | [2011] | 2011-09-01 | Council for International Exchange of Scholars | [{'acronym': 'CIES', 'city_name': 'Washington ... | N/A | en | Vaccine Production in Plants | 2011-01-01 | 2011 |
[12]:
# we're in luck! only one record found
print("The Dimensions ID is: ", dsl_last_results.iloc[0]['id'])
The Dimensions ID is: grant.9923948
Back to our grants list¶
We can set up a loop to go through all the grants we want to get a Dimensions ID for, so that we can enrich our original dataset with those IDs. A simple approach is the following:
- first, we try the extract_grants function
- second, we try the search operation as a fallback
- if the search returns more than one record, we simply take the first one (even though in real life we’d want a more sophisticated approach)
- note: we pause for a second after each query so that we don’t hit the maximum query quota (~30 per minute)

NOTE For the purpose of this exercise, you can select fewer than the ~1200 grants in the original list, to speed things up.
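A fixed pause is the simplest way to stay under the quota. As an alternative sketch, the class below enforces a minimum interval between consecutive calls and only sleeps for the time still remaining. `Throttle` is a hypothetical helper, not part of Dimcli, and the two-second interval in the usage comment is simply 60 seconds divided by the approximate quota of 30 queries per minute mentioned above.

```python
import time


class Throttle:
    """Ensure consecutive calls to wait() are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable for testing
        self._sleep = sleep   # injectable for testing
        self._last = None     # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch (assumes a quota of about 30 queries per minute):
# throttle = Throttle(60 / 30)
# for ... in ...:
#     throttle.wait()
#     dsl.query(...)
```

Because time spent waiting for a slow query counts towards the interval, this approach pauses less than a fixed sleep when queries take several seconds.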
[13]:
grants = grants_source[:100].copy()
[14]:
grants.head(10)
[14]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country |
---|---|---|---|---|---|---|---|
0 | 30410203277 | 疫苗-整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China |
1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada |
2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada |
3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia |
4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia |
5 | 910292 | Dengue virus vaccine. | 130890.0 | 1991 | 1993 | National Health and Medical Research Council | Australia |
6 | 578221 | Engineering Inhalable Vaccines | 27386.0 | 2015 | 2016 | Natural Sciences and Engineering Research Council | Canada |
7 | IC18980360 | Schistosomiasis Vaccine Network. | 0.0 | 1998 | 2000 | European Commission | Belgium |
8 | 7621798 | Pneumococcal Ribosomal Vaccines | 46000.0 | 1977 | 1980 | Directorate for Biological Sciences | United States |
9 | 255890 | Rational vaccine design | 7138.0 | 2003 | 2004 | Natural Sciences and Engineering Research Council | Canada |
Setting up the loop
[25]:
# load the progress bar widget for jupyter
from tqdm.notebook import tqdm as progressbar

output = []

def find_grant_first_method(grantno, funder):
    match = dsl.query(f'''extract_grants(grant_number="{grantno}", funder_name="{funder}")''').json
    grant_id = match.get("grant_id")
    if grant_id:
        print("Found a match with method 1: ", grant_id)
        return grant_id

def find_grant_second_method(title, country):
    # match titles exactly - see also https://docs.dimensions.ai/dsl/language.html#using-triple-quotes
    res = dsl.query(f'''search grants in title_only for """ "{title}" """ where funder_org_countries.name="{country}" return grants''')
    if not res.errors:
        if res.grants and res.grants[0].get('id'):
            grant_id = res.grants[0].get('id')
            print("=== Found a match with method 2: ", grant_id)
            return grant_id

for index, row in progressbar(grants.iterrows(), total=grants.shape[0]):
    # get data from table
    grantno, funder = row['Grant Number'], row['Funder']
    # try first method
    grant_id = find_grant_first_method(grantno, funder)
    if not grant_id:
        # try second method
        title, country = row['Title'], row['Funder Country']
        grant_id = find_grant_second_method(title, country)
    if not grant_id:
        print("Failed - skipping")
    output.append(grant_id)
    time.sleep(1)
Found a match with method 1: grant.8172033
Returned Grants: 3 (total = 3)
Time: 0.56s
=== Found a match with method 2: grant.7715379
Returned Grants: 3 (total = 3)
Time: 3.35s
=== Found a match with method 2: grant.7715379
Found a match with method 1: grant.6723913
Found a match with method 1: grant.6722306
Found a match with method 1: grant.6716312
Returned Grants: 3 (total = 3)
Time: 7.82s
=== Found a match with method 2: grant.7715379
Found a match with method 1: grant.3733803
Found a match with method 1: grant.3274273
Returned Grants: 2 (total = 2)
Time: 1.62s
=== Found a match with method 2: grant.2863615
Returned Grants: 2 (total = 2)
Time: 0.55s
=== Found a match with method 2: grant.2863615
Returned Grants: 2 (total = 2)
Time: 6.07s
=== Found a match with method 2: grant.2807651
Returned Grants: 2 (total = 2)
Time: 3.28s
=== Found a match with method 2: grant.2807651
Returned Grants: 2 (total = 2)
Time: 6.50s
=== Found a match with method 2: grant.2920553
Found a match with method 1: grant.2710461
Found a match with method 1: grant.2657659
Found a match with method 1: grant.2640369
Found a match with method 1: grant.2640228
Found a match with method 1: grant.2426045
Found a match with method 1: grant.2426044
Found a match with method 1: grant.2426039
Found a match with method 1: grant.2425936
Found a match with method 1: grant.2425935
Found a match with method 1: grant.2425932
Found a match with method 1: grant.2425010
Found a match with method 1: grant.2424436
Found a match with method 1: grant.8068979
Found a match with method 1: grant.7106480
Found a match with method 1: grant.2639983
Found a match with method 1: grant.2355874
Found a match with method 1: grant.7566903
Found a match with method 1: grant.7042623
Found a match with method 1: grant.7040393
Found a match with method 1: grant.6731478
Found a match with method 1: grant.6728996
Found a match with method 1: grant.6723289
Found a match with method 1: grant.6718929
Found a match with method 1: grant.6713873
Returned Grants: 4 (total = 4)
Time: 2.53s
=== Found a match with method 2: grant.6585159
Returned Grants: 2 (total = 2)
Time: 4.45s
=== Found a match with method 2: grant.7667275
Returned Grants: 2 (total = 2)
Time: 5.75s
=== Found a match with method 2: grant.2957579
Returned Grants: 1 (total = 1)
Time: 0.53s
=== Found a match with method 2: grant.2972379
Returned Grants: 2 (total = 2)
Time: 0.66s
=== Found a match with method 2: grant.2964974
Returned Grants: 2 (total = 2)
Time: 2.65s
=== Found a match with method 2: grant.2957579
Returned Grants: 0
Time: 0.54s
Failed - skipping
Returned Grants: 1 (total = 1)
Time: 0.55s
=== Found a match with method 2: grant.2920553
Returned Grants: 3 (total = 3)
Time: 3.16s
=== Found a match with method 2: grant.7654331
Returned Grants: 2 (total = 2)
Time: 4.28s
=== Found a match with method 2: grant.2964974
Returned Grants: 2 (total = 2)
Time: 5.40s
=== Found a match with method 2: grant.2880945
Returned Grants: 0
Time: 0.58s
Failed - skipping
Returned Grants: 0
Time: 6.05s
Failed - skipping
Returned Grants: 2 (total = 2)
Time: 0.52s
=== Found a match with method 2: grant.2880945
Returned Grants: 1 (total = 1)
Time: 0.55s
=== Found a match with method 2: grant.2795214
Found a match with method 1: grant.2692263
Found a match with method 1: grant.2688631
Found a match with method 1: grant.2657670
Found a match with method 1: grant.2640366
Found a match with method 1: grant.2597598
Found a match with method 1: grant.2424826
Found a match with method 1: grant.2424691
Found a match with method 1: grant.2424653
Found a match with method 1: grant.2424634
Found a match with method 1: grant.2424633
Found a match with method 1: grant.2424632
Found a match with method 1: grant.2424631
Found a match with method 1: grant.2424630
Found a match with method 1: grant.2424629
Found a match with method 1: grant.2424455
Found a match with method 1: grant.2424372
Found a match with method 1: grant.2393802
Found a match with method 1: grant.2393756
Found a match with method 1: grant.7987079
Found a match with method 1: grant.8217971
Found a match with method 1: grant.8176576
Found a match with method 1: grant.8175344
Found a match with method 1: grant.8172020
Found a match with method 1: grant.8166566
Found a match with method 1: grant.7988463
Found a match with method 1: grant.100075050
Returned Grants: 2 (total = 2)
Time: 3.13s
=== Found a match with method 2: grant.7688859
Returned Grants: 1 (total = 1)
Time: 0.57s
=== Found a match with method 2: grant.9923948
Returned Grants: 1 (total = 1)
Time: 4.67s
=== Found a match with method 2: grant.7666696
Found a match with method 1: grant.8684816
Found a match with method 1: grant.8633207
Found a match with method 1: grant.8633389
Found a match with method 1: grant.8633097
Found a match with method 1: grant.8557183
Found a match with method 1: grant.8554194
Found a match with method 1: grant.8553708
Found a match with method 1: grant.8554075
Found a match with method 1: grant.8553908
Found a match with method 1: grant.8473170
Found a match with method 1: grant.8387795
Found a match with method 1: grant.8384515
Found a match with method 1: grant.8383668
Found a match with method 1: grant.7909702
Found a match with method 1: grant.7908868
Found a match with method 1: grant.7752832
Found a match with method 1: grant.7754464
Found a match with method 1: grant.7752982
Enriching the original list¶
Finally, we can take the Dimensions ID data we extracted and add it to the original grants table as an extra column.
[26]:
grants["Dimensions ID"] = output
[27]:
grants.head(10)
[27]:
| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID |
---|---|---|---|---|---|---|---|---|
0 | 30410203277 | 疫苗-整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China | grant.8172033 |
1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 |
2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 |
3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia | grant.6723913 |
4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia | grant.6722306 |
5 | 910292 | Dengue virus vaccine. | 130890.0 | 1991 | 1993 | National Health and Medical Research Council | Australia | grant.6716312 |
6 | 578221 | Engineering Inhalable Vaccines | 27386.0 | 2015 | 2016 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 |
7 | IC18980360 | Schistosomiasis Vaccine Network. | 0.0 | 1998 | 2000 | European Commission | Belgium | grant.3733803 |
8 | 7621798 | Pneumococcal Ribosomal Vaccines | 46000.0 | 1977 | 1980 | Directorate for Biological Sciences | United States | grant.3274273 |
9 | 255890 | Rational vaccine design | 7138.0 | 2003 | 2004 | Natural Sciences and Engineering Research Council | Canada | grant.2863615 |
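A quick sanity check on the new column can flag suspicious matches. In the table above, rows 1, 2 and 6 share the same title and all received grant.7715379 through the title search fallback, which suggests that at least some of them are not exact matches. The sketch below counts matched rows and lists any Dimensions ID assigned to more than one row. `summarize_matches` is a hypothetical helper, and the ID list is copied from the ten rows shown above; in the notebook the `output` list could be passed instead.

```python
from collections import Counter

# Dimensions IDs for rows 0-9, copied from the table above
matched_ids = [
    "grant.8172033", "grant.7715379", "grant.7715379", "grant.6723913",
    "grant.6722306", "grant.6716312", "grant.7715379", "grant.3733803",
    "grant.3274273", "grant.2863615",
]


def summarize_matches(ids):
    """Count matched rows and report IDs assigned to more than one row."""
    found = [i for i in ids if i]  # unmatched rows are stored as None
    counts = Counter(found)
    duplicates = {gid: n for gid, n in counts.items() if n > 1}
    return {"total": len(ids), "matched": len(found), "duplicates": duplicates}


summary = summarize_matches(matched_ids)
print(summary)
```

Rows that share an ID are good candidates for manual review or for a stricter second search, for example one that also compares the start year.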
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.