Enriching Grants part 1: Matching your grants records to Dimensions

In this tutorial we are going to match a sample local grants dataset to Dimensions.

By matching we mean discovering the unique Dimensions identifier for each of these grants, so that we can then use the IDs to extract more related objects from Dimensions (e.g. researchers, publications, patents and clinical trials related to the grants).

A sample grants list

Our starting point is a sample list of completed grants on the topic of vaccines, which contains common fields such as title, funder, grant/project ID, funding amount, etc.

We will show below how to enrich this dataset with Dimensions IDs.

[1]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Apr 19, 2023
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:

!pip install dimcli tqdm -U --quiet

import dimcli
from dimcli.utils import *

import sys, time
import pandas as pd


print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file

Loading the sample grants data

First we are going to load the sample dataset “vaccines-grants-sample-part-0.csv”.

[3]:

grants_source = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-0.csv")

Now we can preview the contents of the file.

[4]:

grants_source.head(5)

[4]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country
0	30410203277	疫苗－整体方案	1208.0	2004	2004	National Natural Science Foundation of China	China
1	620792	Engineering Inhalable Vaccines	26956.0	2017	2018	Natural Sciences and Engineering Research Council	Canada
2	599115	Engineering Inhalable Vaccines	26403.0	2016	2017	Natural Sciences and Engineering Research Council	Canada
3	251564	HIV Vaccine research	442366.0	2003	2007	National Health and Medical Research Council	Australia
4	334174	HIV Vaccine Development	236067.0	2005	2009	National Health and Medical Research Council	Australia

Matching grants data

There are two possible situations to consider.

A) Matching grants when we have a grant number

The Dimensions API includes a function ‘extract_grants’ that makes it easier to find a grant in Dimensions, by using information such as funder name and the funder grant identifier.

So one approach is the following:

[5]:

dsl.query("""extract_grants(grant_number="30410203277",
    funder_name="National Natural Science Foundation of China")""").json

[5]:

{'grant_id': 'grant.8172033'}

[6]:

dsl.query("""extract_grants(grant_number="334174",
    funder_name="National Health and Medical Research Council")""").json

[6]:

{'grant_id': 'grant.6722306'}

Note: this won’t work without a funder name.

[7]:

dsl.query("""extract_grants(grant_number="30410203277",
    funder_name="")""").json

[7]:

{'grant_id': None}

But we can pass a fundref ID instead of the funder name, if we have one (fundref IDs are also available via GRID).

[8]:

dsl.query("""extract_grants(grant_number="30410203277",
    fundref="501100001809")""").json

[8]:

{'grant_id': 'grant.8172033'}
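The calls in this section all share one shape, so the query string can be built by a small helper. The sketch below is only an illustration: the helper name is hypothetical, the argument names (grant_number, funder_name, fundref) are the ones used in the cells above, and the backslash escaping of quotes is an assumption rather than documented behavior. It refuses to build a query when neither a funder name nor a fundref ID is given, since the lookup above returned no match in that case.

```python
def build_extract_grants_query(grant_number, funder_name=None, fundref=None):
    """Return an extract_grants(...) query string (hypothetical helper)."""

    def quote(value):
        # Assumption: escaping backslashes and double quotes is sufficient.
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return '"' + text + '"'

    if not funder_name and not fundref:
        raise ValueError("a funder_name or a fundref ID is required")

    args = ["grant_number=" + quote(grant_number)]
    if fundref:
        # Prefer the fundref ID when we have it
        args.append("fundref=" + quote(fundref))
    else:
        args.append("funder_name=" + quote(funder_name))
    return "extract_grants(" + ", ".join(args) + ")"


query = build_extract_grants_query("30410203277", fundref="501100001809")
print(query)
```

The resulting string can then be passed to dsl.query(...) exactly as in the cells above.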

B) What if we don’t have a grant number?

Then the only way is to:

- query Dimensions using the best grants metadata we have available
- if we get only one result, take it
- if we get more than one result, either review them manually at a later point or develop an algorithm that sorts them by relevance, so that we can take the first result
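The three cases above can be sketched as a small triage function. This is a minimal illustration, not part of Dimcli: the function name and the returned dictionary layout are hypothetical, and the records are assumed to be plain dictionaries with an 'id' key, as in the search results returned by the API.

```python
def triage_candidates(records):
    """Classify search results as matched, ambiguous or unmatched."""
    ids = [r["id"] for r in records if r.get("id")]
    if len(ids) == 1:
        # exactly one result: we just take it
        return {"status": "matched", "grant_id": ids[0], "review": []}
    if len(ids) > 1:
        # several results: keep them all for manual review or ranking
        return {"status": "ambiguous", "grant_id": None, "review": ids}
    return {"status": "unmatched", "grant_id": None, "review": []}


# Placeholder IDs for illustration only, not real Dimensions records
print(triage_candidates([{"id": "grant.PLACEHOLDER-A"}]))
print(triage_candidates([{"id": "grant.PLACEHOLDER-A"}, {"id": "grant.PLACEHOLDER-B"}]))
print(triage_candidates([]))
```

Keeping the ambiguous IDs, rather than discarding them, makes the later manual review possible without re-running the queries.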

Let’s take some of the grants without a number:

[9]:

grants_without_number = grants_source[grants_source['Grant Number'].isnull()]

[10]:

grants_without_number.head(5)

[10]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country
38	NaN	DENGUE VACCINE DEVELOPMENT	50000.0	1985	1986	United States Department of the Army	United States
79	NaN	Sterilization Vaccine for Cattle	0.0	2006	2006	Council for International Exchange of Scholars	United States
80	NaN	Vaccine Production in Plants	0.0	2011	2011	Council for International Exchange of Scholars	United States
81	NaN	Novel vaccine formulations against tuberculosis	295003.0	2002	2005	Canadian Institutes of Health Research	Canada
446	NaN	Development of recombinant TB vaccine	101459.0	1999	2003	Canadian Institutes of Health Research	Canada

Now let’s try to find the third grant in the list above (“Vaccine Production in Plants”), using only its title and the funder country.

[11]:

%%dsldf

search grants
    in title_only for "Vaccine Production in Plants"
    where funder_countries.name="United States"
return grants

Returned Grants: 1 (total = 1)
Time: 0.65s
WARNINGS [1]
Field 'funder_countries' is deprecated in favor of funder_org_countries. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details

[11]:

	id	title	active_year	end_date	funder_org_name	funder_orgs	grant_number	language	original_title	start_date	start_year
0	grant.9923948	Vaccine Production in Plants	[2011]	2011-09-01	Council for International Exchange of Scholars	[{'acronym': 'CIES', 'city_name': 'Washington ...	N/A	en	Vaccine Production in Plants	2011-01-01	2011

[12]:

# we're in luck! only one record found
print("The Dimensions ID is: ", dsl_last_results.iloc[0]['id'])

The Dimensions ID is:  grant.9923948
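The warning above says that funder_countries is deprecated in favor of funder_org_countries. The same search can also be assembled as a string and sent with dsl.query(...), which is handy inside a loop. The sketch below uses the newer field name; both helper names are hypothetical, and the backslash escaping of quotes inside values is an assumption rather than documented behavior.

```python
def dsl_quote(value):
    """Wrap a value in double quotes (hypothetical helper).

    Assumption: escaping backslashes and double quotes is sufficient.
    """
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + text + '"'


def build_title_search(title, country):
    """Build a search on grant title and funder country (hypothetical helper)."""
    return (
        f"search grants in title_only for {dsl_quote(title)} "
        f"where funder_org_countries.name={dsl_quote(country)} "
        "return grants"
    )


print(build_title_search("Vaccine Production in Plants", "United States"))
```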

Back to our grants list

We can set up a loop that goes through all the grants we want a Dimensions ID for, so that we can enrich our original dataset with those IDs. A simple approach is the following:

- first, we try the extract_grants function
- second, we try the search operation as a fallback
  - if the search returns more than one record, we simply take the first one (in real life we’d want a more sophisticated approach)
- note: we pause for a second after processing each grant, to avoid hitting the maximum query quota (~30 per minute)
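A fixed pause is the simplest option. Since a single grant can trigger two queries (the lookup plus the fallback search), another possibility is to space out the individual queries instead. The sketch below is one way to do that, assuming the ~30 queries per minute figure mentioned above; the class is hypothetical and not part of Dimcli. The clock and sleep functions are injectable so that the demonstration runs instantly.

```python
import time


class Throttle:
    """Enforce a minimum gap of 60 / per_minute seconds between calls."""

    def __init__(self, per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# Demonstration with a fake clock, so nothing actually sleeps
fake_now = [0.0]
slept = []

def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds

throttle = Throttle(per_minute=30, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # a dsl.query(...) call would follow here
print(slept)
```

In a real run we would create Throttle() with its defaults and call wait() before each query.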

NOTE: For the purpose of this exercise, you can select fewer than the ~1200 grants in the original list, to speed things up.

[13]:

grants = grants_source[:100].copy()

[14]:

grants.head(10)

[14]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country
0	30410203277	疫苗－整体方案	1208.0	2004	2004	National Natural Science Foundation of China	China
1	620792	Engineering Inhalable Vaccines	26956.0	2017	2018	Natural Sciences and Engineering Research Council	Canada
2	599115	Engineering Inhalable Vaccines	26403.0	2016	2017	Natural Sciences and Engineering Research Council	Canada
3	251564	HIV Vaccine research	442366.0	2003	2007	National Health and Medical Research Council	Australia
4	334174	HIV Vaccine Development	236067.0	2005	2009	National Health and Medical Research Council	Australia
5	910292	Dengue virus vaccine.	130890.0	1991	1993	National Health and Medical Research Council	Australia
6	578221	Engineering Inhalable Vaccines	27386.0	2015	2016	Natural Sciences and Engineering Research Council	Canada
7	IC18980360	Schistosomiasis Vaccine Network.	0.0	1998	2000	European Commission	Belgium
8	7621798	Pneumococcal Ribosomal Vaccines	46000.0	1977	1980	Directorate for Biological Sciences	United States
9	255890	Rational vaccine design	7138.0	2003	2004	Natural Sciences and Engineering Research Council	Canada

Setting up the loop

[25]:

# load the progress bar widget for jupyter
from tqdm.notebook import tqdm as progressbar

output = []

def find_grant_first_method(grantno, funder):
  """Method 1: look up the grant by grant number and funder name."""
  match = dsl.query(f'''extract_grants(grant_number="{grantno}", funder_name="{funder}")''').json
  grant_id = match.get("grant_id")
  if grant_id:
    print("Found a match with method 1: ", grant_id)
  return grant_id  # None when there is no match

def find_grant_second_method(title, country):
  """Method 2: search by exact title and funder country, taking the first hit."""
  # match titles exactly - see also https://docs.dimensions.ai/dsl/language.html#using-triple-quotes
  res = dsl.query(f'''search grants in title_only for """ "{title}" """ where funder_org_countries.name="{country}" return grants''')
  if not res.errors and res.grants and res.grants[0].get('id'):
    grant_id = res.grants[0].get('id')
    print("=== Found a match with method 2: ", grant_id)
    return grant_id
  return None


for index, row in progressbar(grants.iterrows(), total=grants.shape[0]):
  # get data from table
  grantno, funder = row['Grant Number'], row['Funder']
  # try first method, but only if the row has a grant number (missing values are NaN)
  grant_id = None
  if pd.notna(grantno):
    grant_id = find_grant_first_method(grantno, funder)
  if not grant_id:
    # try second method
    title, country = row['Title'], row['Funder Country']
    grant_id = find_grant_second_method(title, country)
    if not grant_id:
      print("Failed - skipping")
  # keep one entry per row (None when no match), so the list lines up with the table
  output.append(grant_id)
  # pause to stay within the query quota
  time.sleep(1)

Found a match with method 1:  grant.8172033
Returned Grants: 3 (total = 3)
Time: 0.56s
=== Found a match with method 2:  grant.7715379
Returned Grants: 3 (total = 3)
Time: 3.35s
=== Found a match with method 2:  grant.7715379
Found a match with method 1:  grant.6723913
Found a match with method 1:  grant.6722306
Found a match with method 1:  grant.6716312
Returned Grants: 3 (total = 3)
Time: 7.82s
=== Found a match with method 2:  grant.7715379
Found a match with method 1:  grant.3733803
Found a match with method 1:  grant.3274273
Returned Grants: 2 (total = 2)
Time: 1.62s
=== Found a match with method 2:  grant.2863615
Returned Grants: 2 (total = 2)
Time: 0.55s
=== Found a match with method 2:  grant.2863615
Returned Grants: 2 (total = 2)
Time: 6.07s
=== Found a match with method 2:  grant.2807651
Returned Grants: 2 (total = 2)
Time: 3.28s
=== Found a match with method 2:  grant.2807651
Returned Grants: 2 (total = 2)
Time: 6.50s
=== Found a match with method 2:  grant.2920553
Found a match with method 1:  grant.2710461
Found a match with method 1:  grant.2657659
Found a match with method 1:  grant.2640369
Found a match with method 1:  grant.2640228
Found a match with method 1:  grant.2426045
Found a match with method 1:  grant.2426044
Found a match with method 1:  grant.2426039
Found a match with method 1:  grant.2425936
Found a match with method 1:  grant.2425935
Found a match with method 1:  grant.2425932
Found a match with method 1:  grant.2425010
Found a match with method 1:  grant.2424436
Found a match with method 1:  grant.8068979
Found a match with method 1:  grant.7106480
Found a match with method 1:  grant.2639983
Found a match with method 1:  grant.2355874
Found a match with method 1:  grant.7566903
Found a match with method 1:  grant.7042623
Found a match with method 1:  grant.7040393
Found a match with method 1:  grant.6731478
Found a match with method 1:  grant.6728996
Found a match with method 1:  grant.6723289
Found a match with method 1:  grant.6718929
Found a match with method 1:  grant.6713873
Returned Grants: 4 (total = 4)
Time: 2.53s
=== Found a match with method 2:  grant.6585159
Returned Grants: 2 (total = 2)
Time: 4.45s
=== Found a match with method 2:  grant.7667275
Returned Grants: 2 (total = 2)
Time: 5.75s
=== Found a match with method 2:  grant.2957579
Returned Grants: 1 (total = 1)
Time: 0.53s
=== Found a match with method 2:  grant.2972379
Returned Grants: 2 (total = 2)
Time: 0.66s
=== Found a match with method 2:  grant.2964974
Returned Grants: 2 (total = 2)
Time: 2.65s
=== Found a match with method 2:  grant.2957579
Returned Grants: 0
Time: 0.54s
Failed - skipping
Returned Grants: 1 (total = 1)
Time: 0.55s
=== Found a match with method 2:  grant.2920553
Returned Grants: 3 (total = 3)
Time: 3.16s
=== Found a match with method 2:  grant.7654331
Returned Grants: 2 (total = 2)
Time: 4.28s
=== Found a match with method 2:  grant.2964974
Returned Grants: 2 (total = 2)
Time: 5.40s
=== Found a match with method 2:  grant.2880945
Returned Grants: 0
Time: 0.58s
Failed - skipping
Returned Grants: 0
Time: 6.05s
Failed - skipping
Returned Grants: 2 (total = 2)
Time: 0.52s
=== Found a match with method 2:  grant.2880945
Returned Grants: 1 (total = 1)
Time: 0.55s
=== Found a match with method 2:  grant.2795214
Found a match with method 1:  grant.2692263
Found a match with method 1:  grant.2688631
Found a match with method 1:  grant.2657670
Found a match with method 1:  grant.2640366
Found a match with method 1:  grant.2597598
Found a match with method 1:  grant.2424826
Found a match with method 1:  grant.2424691
Found a match with method 1:  grant.2424653
Found a match with method 1:  grant.2424634
Found a match with method 1:  grant.2424633
Found a match with method 1:  grant.2424632
Found a match with method 1:  grant.2424631
Found a match with method 1:  grant.2424630
Found a match with method 1:  grant.2424629
Found a match with method 1:  grant.2424455
Found a match with method 1:  grant.2424372
Found a match with method 1:  grant.2393802
Found a match with method 1:  grant.2393756
Found a match with method 1:  grant.7987079
Found a match with method 1:  grant.8217971
Found a match with method 1:  grant.8176576
Found a match with method 1:  grant.8175344
Found a match with method 1:  grant.8172020
Found a match with method 1:  grant.8166566
Found a match with method 1:  grant.7988463
Found a match with method 1:  grant.100075050
Returned Grants: 2 (total = 2)
Time: 3.13s
=== Found a match with method 2:  grant.7688859
Returned Grants: 1 (total = 1)
Time: 0.57s
=== Found a match with method 2:  grant.9923948
Returned Grants: 1 (total = 1)
Time: 4.67s
=== Found a match with method 2:  grant.7666696
Found a match with method 1:  grant.8684816
Found a match with method 1:  grant.8633207
Found a match with method 1:  grant.8633389
Found a match with method 1:  grant.8633097
Found a match with method 1:  grant.8557183
Found a match with method 1:  grant.8554194
Found a match with method 1:  grant.8553708
Found a match with method 1:  grant.8554075
Found a match with method 1:  grant.8553908
Found a match with method 1:  grant.8473170
Found a match with method 1:  grant.8387795
Found a match with method 1:  grant.8384515
Found a match with method 1:  grant.8383668
Found a match with method 1:  grant.7909702
Found a match with method 1:  grant.7908868
Found a match with method 1:  grant.7752832
Found a match with method 1:  grant.7754464
Found a match with method 1:  grant.7752982
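In the log above, method 2 returns grant.7715379 three times, which is what we would expect when several grants share a title and we always take the first hit. One possible refinement is to compare the start year of each candidate with the 'Start Year' column of our own table. The sketch below assumes that candidates are dictionaries carrying the 'id' and 'start_year' fields seen in the search results earlier; the function name and the sample records are hypothetical.

```python
def pick_by_start_year(candidates, start_year):
    """Return the id of the only candidate starting in start_year, else None."""
    same_year = [c for c in candidates if c.get("start_year") == start_year]
    if len(same_year) == 1:
        return same_year[0]["id"]
    # no candidate or still ambiguous: leave it for manual review
    return None


# Illustrative records only: placeholder IDs, not real Dimensions data
candidates = [
    {"id": "grant.PLACEHOLDER-A", "start_year": 2015},
    {"id": "grant.PLACEHOLDER-B", "start_year": 2016},
    {"id": "grant.PLACEHOLDER-C", "start_year": 2017},
]
print(pick_by_start_year(candidates, 2016))
print(pick_by_start_year(candidates, 1999))
```

Returning None when the year does not single out one candidate keeps uncertain matches out of the enriched table.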

Enriching the original list

Finally, we can take the Dimensions ID data we extracted and add it to the original grants table as an extra column.

[26]:

grants["Dimensions ID"] = output

[27]:

grants.head(10)

[27]:

	Grant Number	Title	Funding Amount in USD	Start Year	End Year	Funder	Funder Country	Dimensions ID
0	30410203277	疫苗－整体方案	1208.0	2004	2004	National Natural Science Foundation of China	China	grant.8172033
1	620792	Engineering Inhalable Vaccines	26956.0	2017	2018	Natural Sciences and Engineering Research Council	Canada	grant.7715379
2	599115	Engineering Inhalable Vaccines	26403.0	2016	2017	Natural Sciences and Engineering Research Council	Canada	grant.7715379
3	251564	HIV Vaccine research	442366.0	2003	2007	National Health and Medical Research Council	Australia	grant.6723913
4	334174	HIV Vaccine Development	236067.0	2005	2009	National Health and Medical Research Council	Australia	grant.6722306
5	910292	Dengue virus vaccine.	130890.0	1991	1993	National Health and Medical Research Council	Australia	grant.6716312
6	578221	Engineering Inhalable Vaccines	27386.0	2015	2016	Natural Sciences and Engineering Research Council	Canada	grant.7715379
7	IC18980360	Schistosomiasis Vaccine Network.	0.0	1998	2000	European Commission	Belgium	grant.3733803
8	7621798	Pneumococcal Ribosomal Vaccines	46000.0	1977	1980	Directorate for Biological Sciences	United States	grant.3274273
9	255890	Rational vaccine design	7138.0	2003	2004	Natural Sciences and Engineering Research Council	Canada	grant.2863615
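Before relying on the new column, it is worth checking how many rows were matched and whether any Dimensions ID was assigned to more than one row. The sketch below does this with the ten IDs shown in the table above; the function name is hypothetical, and unmatched rows are assumed to hold None, as in the output list built by the loop.

```python
from collections import Counter


def summarize_matches(ids):
    """Count matched and unmatched rows, and list IDs used more than once."""
    matched = [i for i in ids if i]
    counts = Counter(matched)
    return {
        "total": len(ids),
        "matched": len(matched),
        "unmatched": len(ids) - len(matched),
        "repeated": {gid: n for gid, n in counts.items() if n > 1},
    }


# The 'Dimensions ID' values of the first ten rows shown above
first_ten = [
    "grant.8172033", "grant.7715379", "grant.7715379", "grant.6723913",
    "grant.6722306", "grant.6716312", "grant.7715379", "grant.3733803",
    "grant.3274273", "grant.2863615",
]
print(summarize_matches(first_ten))
```

Here grant.7715379 is reported as repeated three times, for the three ‘Engineering Inhalable Vaccines’ rows, so those rows are good candidates for a manual check. For the full run, summarize_matches(output) gives the same summary.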

Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.