
Enriching Grants part 1: Matching your grants records to Dimensions

In this tutorial we are going to match a sample local grants dataset to Dimensions.

By matching we mean discovering the unique Dimensions identifier for each of these grants, so that we can then use the IDs to extract further related objects from Dimensions (e.g. researchers, publications, patents and clinical trials related to the grants).

A sample grants list

Our starting point is a sample list of completed grants on the topic of vaccines, which contains common fields such as title, funder, grant/project ID and funding amount.

We will show below how to enrich this dataset with Dimensions IDs.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli tqdm -U --quiet

import dimcli
from dimcli.utils import *

import sys, time
import pandas as pd


print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Loading the sample grants data

First we are going to load the sample dataset “vaccines-grants-sample-part-0.csv”.

[3]:
grants_source = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-0.csv")

Now we can preview the contents of the file.

[4]:
grants_source.head(5)
[4]:
Grant Number Title Funding Amount in USD Start Year End Year Funder Funder Country
0 30410203277 疫苗-整体方案 1208.0 2004 2004 National Natural Science Foundation of China China
1 620792 Engineering Inhalable Vaccines 26956.0 2017 2018 Natural Sciences and Engineering Research Council Canada
2 599115 Engineering Inhalable Vaccines 26403.0 2016 2017 Natural Sciences and Engineering Research Council Canada
3 251564 HIV Vaccine research 442366.0 2003 2007 National Health and Medical Research Council Australia
4 334174 HIV Vaccine Development 236067.0 2005 2009 National Health and Medical Research Council Australia

Matching grants data

There are two possible situations to consider.

A) Matching grants when we have a grant number

The Dimensions API includes a function ‘extract_grants’ that makes it easier to find a grant in Dimensions, by using information such as funder name and the funder grant identifier.

So one approach is the following:

[5]:
dsl.query("""extract_grants(grant_number="30410203277",
    funder_name="National Natural Science Foundation of China")""").json
[5]:
{'grant_id': 'grant.8172033'}
[6]:
dsl.query("""extract_grants(grant_number="334174",
    funder_name="National Health and Medical Research Council")""").json
[6]:
{'grant_id': 'grant.6722306'}

Note: this won’t work without a funder name, as the empty funder_name below shows.

[7]:
dsl.query("""extract_grants(grant_number="30410203277",
    funder_name="")""").json
[7]:
{'grant_id': None}

But we can pass a fundref ID if we have it (also available via GRID)

[8]:
dsl.query("""extract_grants(grant_number="30410203277",
    fundref="501100001809")""").json
[8]:
{'grant_id': 'grant.8172033'}
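
The calls above differ only in how the funder is identified, so the query string can be assembled by a small helper. The following is a minimal sketch: build_extract_grants_query is a hypothetical name (it is not part of Dimcli), it only builds the string that would then be passed to dsl.query(...), and it assumes that the values contain no double quotes.

```python
def build_extract_grants_query(grant_number, funder_name=None, fundref=None):
    """Build an extract_grants query string (hypothetical helper, not a Dimcli function)."""
    # Prefer the Fundref ID when we have one, otherwise fall back to the funder name.
    if fundref:
        funder_arg = f'fundref="{fundref}"'
    elif funder_name:
        funder_arg = f'funder_name="{funder_name}"'
    else:
        # The empty funder_name example above returned no match,
        # so we refuse to build a query without any funder information.
        raise ValueError("a funder_name or a fundref ID is required")
    return f'extract_grants(grant_number="{grant_number}", {funder_arg})'


query = build_extract_grants_query("30410203277", fundref="501100001809")
print(query)
```

With a logged-in session, the result could be used as dsl.query(query).json, exactly like the hand-written queries above.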

B) What if we don’t have a grant number?

Then the only way is to

  • query Dimensions using the best grants metadata we have available

  • if we get only one result, we just take it

  • if we get more than one result, we need to manually review them at a later point, or develop some sort of algorithm to sort them by relevancy so that we can take the first result
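
As an illustration of the last point, one simple way to sort several candidates by relevance is to compare each candidate title with our local title. The sketch below uses difflib from the Python standard library; rank_candidates and the two candidate records are made up for the example (the IDs are placeholders, not real Dimensions IDs), and title similarity is just one possible criterion among many.

```python
from difflib import SequenceMatcher


def rank_candidates(local_title, candidates):
    """Sort candidate grant records by title similarity, best match first."""
    def score(candidate):
        title = candidate.get("title") or ""
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones
        return SequenceMatcher(None, local_title.lower(), title.lower()).ratio()
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates, shaped like the records returned by a grants search
candidates = [
    {"id": "grant.A", "title": "Vaccine Production in Animal Cells"},
    {"id": "grant.B", "title": "Vaccine Production in Plants"},
]
best = rank_candidates("Vaccine Production in Plants", candidates)[0]
print(best["id"])  # grant.B, the exact title match
```

In practice one might combine the title score with other fields we have locally, such as start year or funder name, before accepting the top candidate.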

Let’s take some of the grants without a number:

[9]:
grants_without_number = grants_source[grants_source['Grant Number'].isnull()]
[10]:
grants_without_number.head(5)
[10]:
Grant Number Title Funding Amount in USD Start Year End Year Funder Funder Country
38 NaN DENGUE VACCINE DEVELOPMENT 50000.0 1985 1986 United States Department of the Army United States
79 NaN Sterilization Vaccine for Cattle 0.0 2006 2006 Council for International Exchange of Scholars United States
80 NaN Vaccine Production in Plants 0.0 2011 2011 Council for International Exchange of Scholars United States
81 NaN Novel vaccine formulations against tuberculosis 295003.0 2002 2005 Canadian Institutes of Health Research Canada
446 NaN Development of recombinant TB vaccine 101459.0 1999 2003 Canadian Institutes of Health Research Canada

Now let’s try to find the third grant in the list above (“Vaccine Production in Plants”), using only its title and the funder country.

[11]:
%%dsldf

search grants
    in title_only for "Vaccine Production in Plants"
    where funder_countries.name="United States"
return grants
Returned Grants: 1 (total = 1)
Time: 0.59s
[11]:
active_year end_date funders funding_org_name grant_number id language original_title start_date start_year title
0 [2011] 2011-09-01 [{'acronym': 'CIES', 'city_name': 'Washington ... Council for International Exchange of Scholars N/A grant.9923948 en Vaccine Production in Plants 2011-01-01 2011 Vaccine Production in Plants
[12]:
# we're in luck! only one record found
print("The Dimensions ID is: ", dsl_last_results.iloc[0]['id'])
The Dimensions ID is:  grant.9923948

Back to our grants list

We can set up a loop to go through all the grants we want to get a Dimensions ID for, so that we can enrich our original dataset with those IDs. A simple approach is the following:

  • we try to use the extract_grants function first

  • second, we try the search operation as a fallback

    • if that returns more than one record, we simply take the first one (even though in real life we’d want a more sophisticated approach)

  • note: we pause for a second after each query to avoid hitting the maximum query quota (~30 per minute)
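
A fixed pause is the simplest option. As an alternative sketch, the pause can be derived from the quota itself: if the quota is read literally as 30 queries per minute, evenly spaced queries need about 2 seconds between them. The Throttle class below is a hypothetical helper, not part of Dimcli, and the value 30 is simply the approximate figure mentioned in this tutorial.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_minute` of them happen per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock      # injectable, which makes the class easy to test
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(max_per_minute=30)  # assumed quota, see the note above
print(throttle.min_interval)  # 2.0 seconds between queries
```

Calling throttle.wait() immediately before each query would then replace a fixed time.sleep(1), and it only sleeps for the time still missing since the previous query.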

NOTE For the purpose of this exercise, you can select fewer than the ~1200 grants in the original list, to speed things up.

[16]:
grants = grants_source[:100].copy()
[17]:
grants.head(10)
[17]:
Grant Number Title Funding Amount in USD Start Year End Year Funder Funder Country
0 30410203277 疫苗-整体方案 1208.0 2004 2004 National Natural Science Foundation of China China
1 620792 Engineering Inhalable Vaccines 26956.0 2017 2018 Natural Sciences and Engineering Research Council Canada
2 599115 Engineering Inhalable Vaccines 26403.0 2016 2017 Natural Sciences and Engineering Research Council Canada
3 251564 HIV Vaccine research 442366.0 2003 2007 National Health and Medical Research Council Australia
4 334174 HIV Vaccine Development 236067.0 2005 2009 National Health and Medical Research Council Australia
5 910292 Dengue virus vaccine. 130890.0 1991 1993 National Health and Medical Research Council Australia
6 578221 Engineering Inhalable Vaccines 27386.0 2015 2016 Natural Sciences and Engineering Research Council Canada
7 IC18980360 Schistosomiasis Vaccine Network. 0.0 1998 2000 European Commission Belgium
8 7621798 Pneumococcal Ribosomal Vaccines 46000.0 1977 1980 Directorate for Biological Sciences United States
9 255890 Rational vaccine design 7138.0 2003 2004 Natural Sciences and Engineering Research Council Canada

Setting up the loop

[18]:
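# load the progress bar widget for jupyter
import string
from tqdm.notebook import tqdm as progressbar

# one entry per row: the matched Dimensions ID, or None if nothing was found
output = []

def remove_punctuation(s):
  # strip ASCII punctuation so that quotes etc. cannot break the DSL query string
  return s.translate(str.maketrans('', '', string.punctuation))

def find_grant_first_method(grantno, funder):
  # method 1: exact lookup via extract_grants; returns None when there is no match
  if pd.isna(grantno):
    return None  # no grant number in our data, so there is nothing to look up
  match = dsl.query(f"""extract_grants(grant_number="{grantno}", funder_name="{funder}")""").json
  grant_id = match.get("grant_id")
  if grant_id:
    print("Found a match with method 1: ", grant_id)
    return grant_id

def find_grant_second_method(title, country):
  # method 2: title search restricted by funder country; we take the first result
  res = dsl.query(f"""search grants in title_only for "{title}" where funder_countries.name="{country}" return grants""")
  if not res.errors:
    if res.grants and res.grants[0].get('id'):
      grant_id = res.grants[0].get('id')
      print("=== Found a match with method 2: ", grant_id)
      return grant_id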


for index, row in progressbar(grants.iterrows(), total=grants.shape[0]):
  # get data from table
  grantno, funder = row['Grant Number'], row['Funder']
  # try first method
  grant_id = find_grant_first_method(grantno, funder)
  if not grant_id:
    # try second method
    title, country = remove_punctuation(row['Title']), row['Funder Country']
    grant_id = find_grant_second_method(title, country)
    if not grant_id:
      print("Failed - skipping")
  output.append(grant_id)
  time.sleep(1)

Found a match with method 1:  grant.8172033
Returned Grants: 3 (total = 3)
Time: 0.60s
=== Found a match with method 2:  grant.7715379
Returned Grants: 3 (total = 3)
Time: 0.51s
=== Found a match with method 2:  grant.7715379
Found a match with method 1:  grant.6723913
Found a match with method 1:  grant.6722306
Found a match with method 1:  grant.6716312
Returned Grants: 3 (total = 3)
Time: 0.54s
=== Found a match with method 2:  grant.7715379
Found a match with method 1:  grant.3733803
Found a match with method 1:  grant.3274273
Returned Grants: 4 (total = 4)
Time: 0.62s
=== Found a match with method 2:  grant.7637970
Returned Grants: 4 (total = 4)
Time: 28.84s
=== Found a match with method 2:  grant.7637970
Returned Grants: 2 (total = 2)
Time: 0.62s
=== Found a match with method 2:  grant.2807651
Returned Grants: 2 (total = 2)
Time: 1.62s
=== Found a match with method 2:  grant.2807651
Returned Grants: 2 (total = 2)
Time: 0.53s
=== Found a match with method 2:  grant.2920553
Found a match with method 1:  grant.2710461
Found a match with method 1:  grant.2657659
Found a match with method 1:  grant.2640369
Found a match with method 1:  grant.2640228
Found a match with method 1:  grant.2426045
Found a match with method 1:  grant.2426044
Found a match with method 1:  grant.2426039
Found a match with method 1:  grant.2425936
Found a match with method 1:  grant.2425935
Found a match with method 1:  grant.2425932
Found a match with method 1:  grant.2425010
Found a match with method 1:  grant.2424436
Found a match with method 1:  grant.8068979
Found a match with method 1:  grant.7106480
Found a match with method 1:  grant.2639983
Found a match with method 1:  grant.2355874
Found a match with method 1:  grant.7566903
Found a match with method 1:  grant.7042623
Found a match with method 1:  grant.7040393
Found a match with method 1:  grant.6731478
Found a match with method 1:  grant.6728996
Found a match with method 1:  grant.6723289
Found a match with method 1:  grant.6718929
Found a match with method 1:  grant.6713873
Returned Grants: 20 (total = 25)
Time: 0.67s
=== Found a match with method 2:  grant.9733127
Returned Grants: 2 (total = 2)
Time: 0.57s
=== Found a match with method 2:  grant.7667275
Returned Grants: 2 (total = 2)
Time: 0.53s
=== Found a match with method 2:  grant.2957579
Returned Grants: 2 (total = 2)
Time: 0.65s
=== Found a match with method 2:  grant.2869083
Returned Grants: 5 (total = 5)
Time: 0.60s
=== Found a match with method 2:  grant.4173726
Returned Grants: 2 (total = 2)
Time: 0.53s
=== Found a match with method 2:  grant.2957579
Returned Grants: 0
Time: 0.82s
Failed - skipping
Returned Grants: 1 (total = 1)
Time: 36.89s
=== Found a match with method 2:  grant.2920553
Returned Grants: 0
Time: 4.09s
Failed - skipping
Returned Grants: 5 (total = 5)
Time: 0.59s
=== Found a match with method 2:  grant.4173726
Returned Grants: 2 (total = 2)
Time: 0.62s
=== Found a match with method 2:  grant.2880945
Returned Grants: 0
Time: 0.64s
Failed - skipping
Returned Grants: 0
Time: 0.54s
Failed - skipping
Returned Grants: 2 (total = 2)
Time: 0.57s
=== Found a match with method 2:  grant.2880945
Returned Grants: 1 (total = 1)
Time: 0.62s
=== Found a match with method 2:  grant.2795214
Found a match with method 1:  grant.2692263
Found a match with method 1:  grant.2688631
Found a match with method 1:  grant.2657670
Found a match with method 1:  grant.2640366
Found a match with method 1:  grant.2597598
Found a match with method 1:  grant.2424826
Found a match with method 1:  grant.2424691
Found a match with method 1:  grant.2424653
Found a match with method 1:  grant.2424634
Found a match with method 1:  grant.2424633
Found a match with method 1:  grant.2424632
Found a match with method 1:  grant.2424631
Found a match with method 1:  grant.2424630
Found a match with method 1:  grant.2424629
Found a match with method 1:  grant.2424455
Found a match with method 1:  grant.2424372
Found a match with method 1:  grant.2393802
Found a match with method 1:  grant.2393756
Found a match with method 1:  grant.7987079
Found a match with method 1:  grant.8217971
Found a match with method 1:  grant.8176576
Found a match with method 1:  grant.8175344
Found a match with method 1:  grant.8172020
Found a match with method 1:  grant.8166566
Found a match with method 1:  grant.7988463
Found a match with method 1:  grant.100075050
Returned Grants: 2 (total = 2)
Time: 0.57s
=== Found a match with method 2:  grant.7688859
Returned Grants: 1 (total = 1)
Time: 0.60s
=== Found a match with method 2:  grant.9923948
Returned Grants: 1 (total = 1)
Time: 0.58s
=== Found a match with method 2:  grant.7666696
Found a match with method 1:  grant.8684816
Found a match with method 1:  grant.8633207
Found a match with method 1:  grant.8633389
Found a match with method 1:  grant.8633097
Found a match with method 1:  grant.8557183
Found a match with method 1:  grant.8554194
Found a match with method 1:  grant.8553708
Found a match with method 1:  grant.8554075
Found a match with method 1:  grant.8553908
Found a match with method 1:  grant.8473170
Found a match with method 1:  grant.8387795
Found a match with method 1:  grant.8384515
Found a match with method 1:  grant.8383668
Found a match with method 1:  grant.7909702
Found a match with method 1:  grant.7908868
Found a match with method 1:  grant.7752832
Found a match with method 1:  grant.7754464
Found a match with method 1:  grant.7752982

Enriching the original list

Finally, we can take the Dimensions ID data we extracted and add it to the original grants table as an extra column.

[19]:
grants["Dimensions ID"] = output
[20]:
grants.head(10)
[20]:
Grant Number Title Funding Amount in USD Start Year End Year Funder Funder Country Dimensions ID
0 30410203277 疫苗-整体方案 1208.0 2004 2004 National Natural Science Foundation of China China grant.8172033
1 620792 Engineering Inhalable Vaccines 26956.0 2017 2018 Natural Sciences and Engineering Research Council Canada grant.7715379
2 599115 Engineering Inhalable Vaccines 26403.0 2016 2017 Natural Sciences and Engineering Research Council Canada grant.7715379
3 251564 HIV Vaccine research 442366.0 2003 2007 National Health and Medical Research Council Australia grant.6723913
4 334174 HIV Vaccine Development 236067.0 2005 2009 National Health and Medical Research Council Australia grant.6722306
5 910292 Dengue virus vaccine. 130890.0 1991 1993 National Health and Medical Research Council Australia grant.6716312
6 578221 Engineering Inhalable Vaccines 27386.0 2015 2016 Natural Sciences and Engineering Research Council Canada grant.7715379
7 IC18980360 Schistosomiasis Vaccine Network. 0.0 1998 2000 European Commission Belgium grant.3733803
8 7621798 Pneumococcal Ribosomal Vaccines 46000.0 1977 1980 Directorate for Biological Sciences United States grant.3274273
9 255890 Rational vaccine design 7138.0 2003 2004 Natural Sciences and Engineering Research Council Canada grant.7637970
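
Before relying on the new column, it is worth checking how many rows were matched and whether the same Dimensions ID was assigned to more than one row. The sketch below does this with the standard library only, using the ten IDs shown in the table above; on the full table the same check could be run on the list of IDs taken from the "Dimensions ID" column.

```python
from collections import Counter

# Dimensions IDs of the first ten rows, copied from the table above
ids = [
    "grant.8172033", "grant.7715379", "grant.7715379", "grant.6723913",
    "grant.6722306", "grant.6716312", "grant.7715379", "grant.3733803",
    "grant.3274273", "grant.7637970",
]

matched = [i for i in ids if i]  # unmatched rows hold None
counts = Counter(matched)
duplicates = {grant_id: n for grant_id, n in counts.items() if n > 1}

print(f"Matched {len(matched)} of {len(ids)} rows")
print("IDs assigned to more than one row:", duplicates)
```

Here grant.7715379 appears three times, for three "Engineering Inhalable Vaccines" rows that have different grant numbers in our data. All three were matched through the title search, which takes the first result, so rows like these are good candidates for a manual review.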


Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
