Mapping GRID IDs to Organization Data¶
In this tutorial, we show how to use the Dimensions Analytics API to map organization data to GRID IDs.
GRID is a free and openly available global database of research-related organizations, providing each with a unique and persistent identifier. Dimensions uses this identifier to link organizations to publications, grants, etc.
Use case scenarios:
An analyst has a list of organizations of interest and wants to get details of their publications from Dimensions. To do this, they need to map the organizations to GRID IDs so they can extract information from the Dimensions database. The organization data can be run through the Dimensions API extract_affiliations function to extract GRID IDs, which can then be used to retrieve publication data and statistics (a query of this kind is sketched just below).
A second use case is to standardize messy organization data for analysis. For example, an analyst might have a set of affiliation data containing many variants of organization names (“University of Cambridge”, “Cambridge University”). By mapping to GRID IDs, the analyst can standardize the data so it’s easier to analyse.
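For example, once an organization is mapped to a GRID ID, its publications can be retrieved with a DSL query. The following is only a minimal sketch: it uses grid.40263.33 (Brown University, one of the IDs extracted later in this notebook) and assumes the logged-in dsl object created in the Prerequisites section below.
# Minimal sketch: retrieve a few publications for a single GRID ID
query = """search publications
           where research_orgs.id = "grid.40263.33"
           return publications[id+title+year]
           limit 5"""
results = dsl.query(query)
results.as_dataframe()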
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
To generate an API key from the Dimensions webapp, go to “My Account”. Under “General Settings”, the “API key” section has a “Create API key” button. More information on this can be found here.
[2]:
!pip install dimcli --quiet
import dimcli
from dimcli.utils import *
from dimcli.functions import extract_affiliations
import json
import sys
import pandas as pd
import re
import time
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
1. Importing Organization Data¶
There are several ways to obtain organization data. Below we show two different ways to obtain organization data that can be run through the Dimensions API for GRID ID mapping. For the purposes of this demonstration, we will use method 1. Please uncomment the cells in the other section if you wish to use that method instead.
Manually Generate Organization Data
Load Organization Data from Local Machine
Note - To map organization data to GRID IDs, the data must conform to the mapping specifications and contain values (if available) for the following 4 columns, with lowercase column headers:
name - name of the organization
city - city of the organization
state - state of the organization (use the full name of the state, not the acronym)
country - country of the organization
The user may supply either structured or unstructured organization data for mapping to GRID IDs, as in the following examples:
Structured Data e.g.,
[{"name":"Southwestern University", "city":"Georgetown", "state":"Texas", "country":"USA"}]
Unstructured Data e.g.,
[{"affiliation": "university of oxford, uk"}]
For the purposes of this notebook, we will use structured data in a pandas dataframe. Therefore, please ensure your organization dataset resembles the format shown under method 1, below.
1.1 Manually Generate Organization Data¶
The following cell builds an example organization dataset.
[3]:
# The following generates a table of organization data with 4 columns
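# Note: a couple of names are misspelled ('Augusta Univeristy', 'Duke Univerisity') and the
# Mayo Clinic record has no city/state, so the example also shows how the mapping copes
# with messy or incomplete input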
organization_names = pd.Series(['Augusta Univeristy', 'Baylor College of Medicine', 'Brown University', 'California Institute of Technology', 'Duke Univerisity',
'Emory University', 'Florida State University', 'Harvard Medical School', 'Kent State University', 'New York University', 'Mayo Clinic'])
organization_cities = pd.Series(['Augusta', 'Houston', 'Providence', 'Pasadena', 'Durham',
'Atlanta', 'Tallahassee', 'Boston', 'Kent', 'New York'])
organization_states = pd.Series(['Georgia', 'Texas', 'Rhode Island', 'California', 'North Carolina',
'Georgia', 'Florida', 'Massachusetts', 'Ohio', 'New York'])
organization_countries = pd.Series(['United States', 'United States', 'United States', 'United States', 'United States',
'United States', 'United States', 'United States', 'United States', 'United States', 'United States'])
orgs = pd.DataFrame({'name':organization_names, 'city':organization_cities, 'state':organization_states, 'country':organization_countries})
# Preview Dataset
orgs
[3]:
name | city | state | country | |
---|---|---|---|---|
0 | Augusta Univeristy | Augusta | Georgia | United States |
1 | Baylor College of Medicine | Houston | Texas | United States |
2 | Brown University | Providence | Rhode Island | United States |
3 | California Institute of Technology | Pasadena | California | United States |
4 | Duke Univerisity | Durham | North Carolina | United States |
5 | Emory University | Atlanta | Georgia | United States |
6 | Florida State University | Tallahassee | Florida | United States |
7 | Harvard Medical School | Boston | Massachusetts | United States |
8 | Kent State University | Kent | Ohio | United States |
9 | New York University | New York | New York | United States |
10 | Mayo Clinic | NaN | NaN | United States |
1.2 Load Organization Data from Local Machine¶
The following cells can be used to import an Excel file of organization data from your local machine.
This method is useful for when you need to map hundreds or thousands of organizations to GRID IDs, as the bulk process using the API will be much faster than any individual mapping.
Please uncomment the cells below if you wish to use this method.
[4]:
# # Upload the organization dataset from local machine
# from google.colab import files
# uploaded = files.upload()
[5]:
# # Load and preview the organization dataset into a pandas dataframe
# import io
# import pandas as pd
# orgs = pd.read_excel(io.BytesIO(uploaded['dataset_name.xlsx']))
# orgs.head()
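If you are running this notebook locally rather than on Google Colab, the upload step is not needed and the file can be read directly from disk. A minimal sketch (the file name is the same placeholder used above):
# # Alternative for local (non-Colab) runs: read the Excel file directly from disk
# import pandas as pd
# orgs = pd.read_excel('dataset_name.xlsx')
# orgs.head()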
2. Utilizing Dimensions API to Extract GRID IDs¶
The following cells will take our organization data and run it through the Dimensions API to pull back GRID IDs mapped to each organization.
Here, we use the “extract_affiliations” API function, which can be used to enrich private datasets containing non-disambiguated organization data with Dimensions GRID IDs.
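As a quick illustration of the function on its own, a single unstructured record (like the example in Section 1) can also be passed directly. A minimal sketch, where as_json=True is assumed to return the raw JSON result rather than a dataframe:
# Minimal sketch: map one unstructured affiliation record to GRID data
extract_affiliations([{"affiliation": "university of oxford, uk"}], as_json=True)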
[6]:
# First, we replace empty data with 'null' to satisfy mapping specifications
orgs = orgs.fillna('null')
[7]:
# Second, we convert the organization data from a dataframe to a list of dictionaries (JSON-style records) for GRID mapping
recs = orgs.to_dict(orient='records')
[8]:
# Then we take the organization data, run it through the API and return GRID IDs
# Chunk records into batches; the API takes up to 200 records at a time.
def chunk_records(l, n):
    for i in range(0, len(l), n):
        yield l[i : i + n]

# Use dimcli's extract_affiliations API wrapper to process the data
chunksize = 200
grid = pd.DataFrame()
for k, chunk in enumerate(chunk_records(recs, chunksize)):
    output = extract_affiliations(chunk, as_json=False)
    grid = pd.concat([grid, output], sort=False, ignore_index=True)
    # Pause to avoid overloading the API with too many calls too quickly
    time.sleep(1)
    print(f"{(k+1)*chunksize} records complete!")
200 records complete!
[9]:
# Preview the extracted GRID ID dataframe
# Note: data columns labeled with "input" are the original organization data supplied to the API
grid.head()
[9]:
input.city | input.country | input.name | input.state | grid_id | grid_name | grid_city | grid_state | grid_country | requires_review | geo_country_id | geo_country_name | geo_country_code | geo_state_id | geo_state_name | geo_state_code | geo_city_id | geo_city_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Augusta | United States | Augusta Univeristy | Georgia | grid.410427.4 | Augusta University | Augusta | Georgia | United States | False | 6252001 | United States | US | 4197000 | Georgia | US-GA | 4180531 | Augusta |
1 | Houston | United States | Baylor College of Medicine | Texas | grid.39382.33 | Baylor College of Medicine | Houston | Texas | United States | False | 6252001 | United States | US | 4736286 | Texas | US-TX | 4699066 | Houston |
2 | Providence | United States | Brown University | Rhode Island | grid.40263.33 | Brown University | Providence | Rhode Island | United States | False | 6252001 | United States | US | 5224323 | Rhode Island | US-RI | 5224151 | Providence |
3 | Pasadena | United States | California Institute of Technology | California | grid.20861.3d | California Institute of Technology | Pasadena | California | United States | False | 6252001 | United States | US | 5332921 | California | US-CA | 5381396 | Pasadena |
4 | Durham | United States | Duke Univerisity | North Carolina | grid.26009.3d | Duke University | Durham | North Carolina | United States | False | 6252001 | United States | US | 4482348 | North Carolina | US-NC | 4464368 | Durham |
Note: Some records returned by the GRID mapping may require manual review, as a single input can match more than one candidate organization (see below). The user can use this information to update the original organization data that is input to this notebook.
[10]:
grid['requires_review'] = grid['requires_review'].astype(str)
grid_review = grid.loc[grid['requires_review'] == 'True']
grid_review
[10]:
input.city | input.country | input.name | input.state | grid_id | grid_name | grid_city | grid_state | grid_country | requires_review | geo_country_id | geo_country_name | geo_country_code | geo_state_id | geo_state_name | geo_state_code | geo_city_id | geo_city_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 5313457 | Scottsdale |
11 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 4160021 | Jacksonville |
12 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 5313457 | Scottsdale |
13 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 4160021 | Jacksonville |
14 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 5313457 | Scottsdale |
15 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 4160021 | Jacksonville |
16 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 5313457 | Scottsdale |
17 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 4160021 | Jacksonville |
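Now that the organizations are mapped, the extracted grid_id values can be fed straight into DSL queries, for example to count publications per organization. The cell below is a minimal sketch rather than part of the original workflow; it assumes the DSL publications source and its research_orgs facet:
# Minimal sketch: publication counts per organization for the extracted GRID IDs
grid_ids = [g for g in grid['grid_id'].dropna().unique().tolist() if g]
query = f"""search publications
            where research_orgs.id in {json.dumps(grid_ids)}
            return research_orgs limit {len(grid_ids)}"""
dsl.query(query).as_dataframe()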
3. Save the GRID ID Dataset We Created¶
The following cell will export the GRID-ID-mapped organization data to a CSV file that can be saved to your local machine.
[11]:
# Save the pandas dataframe as a CSV file in the working environment
grid.to_csv('file_name.csv')
if 'google.colab' in sys.modules:
    from google.colab import files
    # download the file to your local machine
    files.download('file_name.csv')
Conclusions¶
In this notebook we have shown how to use the Dimensions Analytics API extract_affiliations function to assign GRID identifiers to organization data.
For more background, see the extract_affiliations function documentation, as well as the other functions available via the Dimensions API.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.