
Enriching Grants part 1: Matching your grants records to Dimensions

In this tutorial we are going to match a sample local grants dataset to Dimensions.

By matching we mean discovering the unique Dimensions identifier for each of these grants, so that we can then use the IDs to extract further related objects from Dimensions (e.g. the researchers, publications, patents and clinical trials related to the grants).

A sample grants list

Our starting point is a sample list of completed grants on the topic of vaccines, which contains common fields such as title, funder, grant/project ID and funding amount.

We will show below how to enrich this dataset with Dimensions IDs.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[2]:
!pip install dimcli tqdm -U --quiet

import dimcli
from dimcli.utils import *

import sys, time
import pandas as pd


print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.8.2)
Connected to: https://app.dimensions.ai - DSL v1.28
Method: dsl.ini file

Loading the sample grants data

First we are going to load the sample dataset “vaccines-grants-sample-part-0.csv”.

[2]:
grants_source = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-0.csv")

Now we can preview the contents of the file.

[3]:
grants_source.head(5)
[3]:
   Grant Number  Title                           Funding Amount in USD  Start Year  End Year  Funder                                             Funder Country
0  30410203277   疫苗-整体方案                   1208.0                 2004        2004      National Natural Science Foundation of China       China
1  620792        Engineering Inhalable Vaccines  26956.0                2017        2018      Natural Sciences and Engineering Research Council  Canada
2  599115        Engineering Inhalable Vaccines  26403.0                2016        2017      Natural Sciences and Engineering Research Council  Canada
3  251564        HIV Vaccine research            442366.0               2003        2007      National Health and Medical Research Council       Australia
4  334174        HIV Vaccine Development         236067.0               2005        2009      National Health and Medical Research Council       Australia

Matching grants data

There are two possible situations to consider.

A) Matching grants when we have a grant number

The Dimensions API includes a function ‘extract_grants’ that makes it easier to find a grant in Dimensions, by using information such as funder name and the funder grant identifier.

So one approach is the following:

[4]:
dsl.query("""extract_grants(grant_number="30410203277",
    funder_name="National Natural Science Foundation of China")""").json
[4]:
{'grant_id': 'grant.8172033'}
[5]:
dsl.query("""extract_grants(grant_number="334174",
    funder_name="National Health and Medical Research Council")""").json
[5]:
{'grant_id': 'grant.6722306'}

Note: this won’t work without a funder name.

[6]:
dsl.query("""extract_grants(grant_number="30410203277",
    funder_name="")""").json
[6]:
{'grant_id': None}

But we can pass a fundref ID instead, if we have one (fundref IDs are also available via GRID).

[7]:
dsl.query("""extract_grants(grant_number="30410203277",
    fundref="501100001809")""").json
[7]:
{'grant_id': 'grant.8172033'}
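The two variants above differ only in which funder argument is passed. A minimal sketch of a helper that builds the query string from whichever identifier is available is shown here. The function name `build_extract_grants_query` is hypothetical, preferring the fundref ID over the funder name is an arbitrary choice, and values containing double quotes are not handled.

```python
def build_extract_grants_query(grant_number, funder_name=None, fundref=None):
    """Build an extract_grants query from a grant number plus one funder identifier."""
    if fundref:
        # assumption: prefer the fundref ID when both identifiers are given
        funder_arg = f'fundref="{fundref}"'
    elif funder_name:
        funder_arg = f'funder_name="{funder_name}"'
    else:
        # an empty funder name returned no match in the example above
        raise ValueError("extract_grants needs a funder name or a fundref ID")
    return f'extract_grants(grant_number="{grant_number}", {funder_arg})'


print(build_extract_grants_query("30410203277", fundref="501100001809"))
```

The resulting string can then be passed to `dsl.query(...)` exactly as in the cells above.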

B) What if we don’t have a grant number?

Then the only way is to

  • query Dimensions using the best grants metadata we have available

  • if we get only one result, we just take it

  • if we get more than one result, we need to manually review them at a later point, or develop some sort of algorithm to sort them by relevancy so that we can take the first result
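One possible way to sort multiple results by relevancy is to compare each candidate title with the local title. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. It is only an illustration: the function name, the candidate records and their IDs are made up, and it assumes each record is a dictionary with a `title` key.

```python
from difflib import SequenceMatcher


def rank_by_title_similarity(local_title, candidates):
    """Sort candidate records so that the closest title comes first."""
    def score(candidate):
        # ratio() is 1.0 for identical strings and lower for weaker matches
        return SequenceMatcher(
            None, local_title.lower(), candidate.get("title", "").lower()
        ).ratio()
    return sorted(candidates, key=score, reverse=True)


# hypothetical candidates, shaped like records returned by a grants search
candidates = [
    {"id": "grant.A", "title": "Vaccine Production in Plants and Algae: a Review"},
    {"id": "grant.B", "title": "Vaccine Production in Plants"},
]
best = rank_by_title_similarity("Vaccine Production in Plants", candidates)[0]
print(best["id"])
```

Title similarity alone can still pick the wrong grant, so a manual review of low-scoring matches remains advisable.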

Let’s take some of the grants without a number:

[ ]:
grants_without_number = grants_source[grants_source['Grant Number'].isnull()]
[ ]:
grants_without_number.head(5)

Now let’s try to find the second grant in the list above, using only its title and the funder country

[ ]:
%%dsldf

search grants
    in title_only for "Vaccine Production in Plants"
    where funder_countries.name="United States"
return grants
[ ]:
# we're in luck! only one record found
print("The Dimensions ID is: ", dsl_last_results.iloc[0]['id'])
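Taking the first row is only safe when the search returned exactly one record. A small sketch of a guard that makes the three cases explicit (no result, one result, several results) could look like this; the function name and status labels are hypothetical.

```python
def pick_unique_match(records):
    """Return (grant_id, status), where status is 'none', 'unique' or 'ambiguous'."""
    if not records:
        return None, "none"
    if len(records) == 1:
        return records[0].get("id"), "unique"
    # several candidates: leave the decision to a later review step
    return None, "ambiguous"


# hypothetical record list with a single hit
print(pick_unique_match([{"id": "grant.B", "title": "Vaccine Production in Plants"}]))
```

With a dataframe such as `dsl_last_results`, the records could be obtained with `dsl_last_results.to_dict("records")`.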

Back to our grants list

We can set up a loop to go through all the grants we want to get a Dimensions ID for, so that we can enrich our original dataset with those IDs. A simple approach is the following:

  • we try to use the extract_grants function first

  • second, we try the search operation as a fallback

    • if that returns more than one record, we simply take the first one (even though in real life we’d want a more sophisticated approach)

  • note: we pause a second after each query to ensure we don’t hit the max queries quota (~30 per minute)
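A fixed pause is the simplest way to stay under a quota. As an alternative, the sketch below spaces calls by a minimum interval and only sleeps for the time still needed. The `Throttle` class is hypothetical, and the value of 30 is simply the approximate quota quoted above, not a figure taken from the API documentation.

```python
import time


class Throttle:
    """Keep consecutive calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock    # injectable, which makes the class easy to test
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


throttle = Throttle(max_per_minute=30)  # assumption: ~30 queries per minute
# call throttle.wait() immediately before each dsl.query(...)
```

Because the wait happens before each query rather than once per row, rows that need both matching methods are throttled as well.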

NOTE For the purpose of this exercise, you can select fewer than the ~1200 grants in the original list, to speed things up.

[ ]:
grants = grants_source[:1200].copy()
[ ]:
grants.head(10)

Setting up the loop

[ ]:
# load the progress bar widget for jupyter
from tqdm.notebook import tqdm as progressbar

output = []

def remove_punctuation(s):
  import string
  return s.translate(str.maketrans('', '', string.punctuation))

def find_grant_first_method(grantno, funder):
  match = dsl.query(f"""extract_grants(grant_number="{grantno}", funder_name="{funder}")""").json
  grant_id = match.get("grant_id")
  if grant_id:
    print("Found a match with method 1: ", grant_id)
    return grant_id

def find_grant_second_method(title, country):
  res = dsl.query(f"""search grants in title_only for "{title}" where funder_countries.name="{country}" return grants""")
  if not res.errors:
    if res.grants and res.grants[0].get('id'):
      grant_id = res.grants[0].get('id')
      print("=== Found a match with method 2: ", grant_id)
      return grant_id


for index, row in progressbar(grants.iterrows(), total=grants.shape[0]):
  # get data from table
  grantno, funder = row['Grant Number'], row['Funder']
  # try first method, but only when a grant number is present (it is NaN otherwise)
  grant_id = find_grant_first_method(grantno, funder) if pd.notnull(grantno) else None
  if not grant_id:
    # try second method
    title, country = remove_punctuation(row['Title']), row['Funder Country']
    grant_id = find_grant_second_method(title, country)
    if not grant_id:
      print("Failed - skipping")
  output.append(grant_id)
  time.sleep(1)

Enriching the original list

Finally, we can take the Dimensions ID data we extracted and add it to the original grants table as an extra column.

[ ]:
grants["Dimensions ID"] = output
[ ]:
grants.head(10)
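Before saving, it can be useful to check how many grants actually received an ID. The sketch below counts matched and unmatched entries in a list like `output`, where the matching functions leave `None` for grants that could not be found. The function name and the sample values are illustrative only.

```python
def summarize_matches(ids):
    """Count how many entries received an ID and how many are still missing."""
    matched = sum(1 for grant_id in ids if grant_id)
    return {"total": len(ids), "matched": matched, "unmatched": len(ids) - matched}


# hypothetical values standing in for the `output` list built by the loop
sample_output = ["grant.8172033", None, "grant.6722306"]
print(summarize_matches(sample_output))
```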

Save the data

[ ]:
grants.to_csv("vaccines-grants-sample-part-1.csv", index=False)


Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
