Mapping GRID IDs to Organization Data¶
In this tutorial, we show how to use the Dimensions Analytics API to map organization data to GRID IDs.
GRID is a free and openly available global database of research-related organizations, providing each with a unique and persistent identifier. Dimensions uses this identifier to link organizations to publications, grants, etc.
Use case scenarios:
An analyst has a list of organizations of interest and wants to get details of their publications from Dimensions. To do this, they need to map the organizations to GRID IDs so they can extract information from the Dimensions database. The organization data can be run through the Dimensions API extract_affiliations function to extract GRID IDs, which can then be used to retrieve publication data and statistics (a query of this kind is sketched just below).
A second use case is to standardize messy organization data for analysis. For example, an analyst might have a set of affiliation data containing many variants of organization names (“University of Cambridge”, “Cambridge University”). By mapping to GRID IDs, the analyst can standardize the data so it’s easier to analyse.
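For example, once an organization is mapped to a GRID ID, its publications can be retrieved with a DSL query. The following is only a minimal sketch: it uses grid.40263.33 (Brown University, one of the IDs extracted later in this notebook) and assumes the logged-in dsl object created in the Prerequisites section below.
# Minimal sketch: retrieve a few publications for a single GRID ID
query = """search publications
           where research_orgs.id = "grid.40263.33"
           return publications[id+title+year]
           limit 5"""
results = dsl.query(query)
results.as_dataframe()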
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
To generate an API key from the Dimensions webapp, go to “My Account”. Under “General Settings”, the “API key” section has a “Create API key” button. More information on this can be found here.
[2]:
!pip install dimcli --quiet
import dimcli
from dimcli.utils import *
from dimcli.functions import extract_affiliations
import json
import sys
import pandas as pd
import re
import time
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
1. Importing Organization Data¶
There are several ways to obtain organization data. Below we show two different ways to obtain organization data that can be run through the Dimensions API for GRID ID mapping. For the purposes of this demonstration, we will use method 1. Please uncomment the cells in the other section if you wish to use that method instead.
Manually Generate Organization Data
Load Organization Data from Local Machine
Note - To map organization data to GRID IDs, the data must conform to the mapping specifications and contain values (if available) for the following 4 columns, with lowercase column headers:
name - name of the organization
city - city of the organization
state - state of the organization (use the full name of the state, not the acronym)
country - country of the organization
The user may supply either structured or unstructured organization data for mapping to GRID IDs, as in the following examples:
Structured Data e.g.,
[{"name":"Southwestern University", "city":"Georgetown", "state":"Texas", "country":"USA"}]
Unstructured Data e.g.,
[{"affiliation": "university of oxford, uk"}]
For the purposes of this notebook, we will use structured data in a pandas dataframe. Therefore, please ensure your organization dataset resembles the format shown under method 1, below.
1.1 Manually Generate Organization Data¶
The following cell builds an example organization dataset.
[3]:
# The following generates a table of organization data with 4 columns
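# Note: a couple of names are misspelled ('Augusta Univeristy', 'Duke Univerisity') and the
# Mayo Clinic record has no city/state, so the example also shows how the mapping copes
# with messy or incomplete input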
organization_names = pd.Series(['Augusta Univeristy', 'Baylor College of Medicine', 'Brown University', 'California Institute of Technology', 'Duke Univerisity',
'Emory University', 'Florida State University', 'Harvard Medical School', 'Kent State University', 'New York University', 'Mayo Clinic'])
organization_cities = pd.Series(['Augusta', 'Houston', 'Providence', 'Pasadena', 'Durham',
'Atlanta', 'Tallahassee', 'Boston', 'Kent', 'New York'])
organization_states = pd.Series(['Georgia', 'Texas', 'Rhode Island', 'California', 'North Carolina',
'Georgia', 'Florida', 'Massachusetts', 'Ohio', 'New York'])
organization_countries = pd.Series(['United States', 'United States', 'United States', 'United States', 'United States',
'United States', 'United States', 'United States', 'United States', 'United States', 'United States'])
orgs = pd.DataFrame({'name':organization_names, 'city':organization_cities, 'state':organization_states, 'country':organization_countries})
# Preview Dataset
orgs
[3]:
name | city | state | country | |
---|---|---|---|---|
0 | Augusta Univeristy | Augusta | Georgia | United States |
1 | Baylor College of Medicine | Houston | Texas | United States |
2 | Brown University | Providence | Rhode Island | United States |
3 | California Institute of Technology | Pasadena | California | United States |
4 | Duke Univerisity | Durham | North Carolina | United States |
5 | Emory University | Atlanta | Georgia | United States |
6 | Florida State University | Tallahassee | Florida | United States |
7 | Harvard Medical School | Boston | Massachusetts | United States |
8 | Kent State University | Kent | Ohio | United States |
9 | New York University | New York | New York | United States |
10 | Mayo Clinic | NaN | NaN | United States |
1.2 Load Organization Data from Local Machine¶
The following cells can be used to import an Excel file of organization data from your local machine.
This method is useful for when you need to map hundreds or thousands of organizations to GRID IDs, as the bulk process using the API will be much faster than any individual mapping.
Please uncomment the cells below if you wish to use this method.
[4]:
# # Upload the organization dataset from local machine
# from google.colab import files
# uploaded = files.upload()
[5]:
# # Load and preview the organization dataset into a pandas dataframe
# import io
# import pandas as pd
# orgs = pd.read_excel(io.BytesIO(uploaded['dataset_name.xlsx']))
# orgs.head()
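If you are running this notebook locally rather than on Google Colab, the upload step is not needed and the file can be read directly from disk. A minimal sketch (the file name is the same placeholder used above):
# # Alternative for local (non-Colab) runs: read the Excel file directly from disk
# import pandas as pd
# orgs = pd.read_excel('dataset_name.xlsx')
# orgs.head()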
2. Utilizing Dimensions API to Extract GRID IDs¶
The following cells will take our organization data and run it through the Dimensions API to pull back GRID IDs mapped to each organization.
Here, we use the “extract_affiliations” API function, which can be used to enrich private datasets containing non-disambiguated organization data with Dimensions GRID IDs.
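As a quick illustration of the function on its own, a single unstructured record (like the example in Section 1) can also be passed directly. A minimal sketch, where as_json=True is assumed to return the raw JSON result rather than a dataframe:
# Minimal sketch: map one unstructured affiliation record to GRID data
extract_affiliations([{"affiliation": "university of oxford, uk"}], as_json=True)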
[6]:
# First, we replace empty data with 'null' to satisfy mapping specifications
orgs = orgs.fillna('null')
[7]:
# Second, we convert the organization data from a dataframe to a list of dictionaries (JSON-style records) for GRID mapping
recs = orgs.to_dict(orient='records')
[8]:
# Then we take the organization data, run it through the API and return GRID IDs
# Chunk records into batches; the API takes up to 200 records at a time.
def chunk_records(l, n):
    for i in range(0, len(l), n):
        yield l[i : i + n]

# Use dimcli's extract_affiliations API wrapper to process the data
chunksize = 200
grid = pd.DataFrame()
for k, chunk in enumerate(chunk_records(recs, chunksize)):
    output = extract_affiliations(chunk, as_json=False)
    grid = pd.concat([grid, output], sort=False, ignore_index=True)
    # Pause to avoid overloading the API with too many calls too quickly
    time.sleep(1)
    print(f"{(k+1)*chunksize} records complete!")
200 records complete!
[9]:
# Preview the extracted GRID ID dataframe
# Note: data columns labeled with "input" are the original organization data supplied to the API
grid.head()
[9]:
input.city | input.country | input.name | input.state | grid_id | grid_name | grid_city | grid_state | grid_country | requires_review | geo_country_id | geo_country_name | geo_country_code | geo_state_id | geo_state_name | geo_state_code | geo_city_id | geo_city_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Augusta | United States | Augusta Univeristy | Georgia | grid.410427.4 | Augusta University | Augusta | Georgia | United States | False | 6252001 | United States | US | 4197000 | Georgia | US-GA | 4180531 | Augusta |
1 | Houston | United States | Baylor College of Medicine | Texas | grid.39382.33 | Baylor College of Medicine | Houston | Texas | United States | False | 6252001 | United States | US | 4736286 | Texas | US-TX | 4699066 | Houston |
2 | Providence | United States | Brown University | Rhode Island | grid.40263.33 | Brown University | Providence | Rhode Island | United States | False | 6252001 | United States | US | 5224323 | Rhode Island | US-RI | 5224151 | Providence |
3 | Pasadena | United States | California Institute of Technology | California | grid.20861.3d | California Institute of Technology | Pasadena | California | United States | False | 6252001 | United States | US | 5332921 | California | US-CA | 5381396 | Pasadena |
4 | Durham | United States | Duke Univerisity | North Carolina | grid.26009.3d | Duke University | Durham | North Carolina | United States | False | 6252001 | United States | US | 4482348 | North Carolina | US-NC | 4464368 | Durham |
Note: Some records returned by the GRID mapping may require manual review, as a single input can match more than one candidate organization (see below). The user can use this information to update the original organization data that is input to this notebook.
[10]:
grid['requires_review'] = grid['requires_review'].astype(str)
grid_review = grid.loc[grid['requires_review'] == 'True']
grid_review
[10]:
input.city | input.country | input.name | input.state | grid_id | grid_name | grid_city | grid_state | grid_country | requires_review | geo_country_id | geo_country_name | geo_country_code | geo_state_id | geo_state_name | geo_state_code | geo_city_id | geo_city_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 5313457 | Scottsdale |
11 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 4160021 | Jacksonville |
12 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 5313457 | Scottsdale |
13 | null | United States | Mayo Clinic | null | grid.417468.8 | Mayo Clinic | Scottsdale | Arizona | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 4160021 | Jacksonville |
14 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 5313457 | Scottsdale |
15 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 5551752 | Arizona | US-AZ | 4160021 | Jacksonville |
16 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 5313457 | Scottsdale |
17 | null | United States | Mayo Clinic | null | grid.417467.7 | Mayo Clinic | Jacksonville | Florida | United States | True | 6252001 | United States | US | 4155751 | Florida | US-FL | 4160021 | Jacksonville |
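Now that the organizations are mapped, the extracted grid_id values can be fed straight into DSL queries, for example to count publications per organization. The cell below is a minimal sketch rather than part of the original workflow; it assumes the DSL publications source and its research_orgs facet:
# Minimal sketch: publication counts per organization for the extracted GRID IDs
grid_ids = [g for g in grid['grid_id'].dropna().unique().tolist() if g]
query = f"""search publications
            where research_orgs.id in {json.dumps(grid_ids)}
            return research_orgs limit {len(grid_ids)}"""
dsl.query(query).as_dataframe()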
3. Save the GRID ID Dataset We Created¶
The following cell will export the GRID-ID-mapped organization data to a CSV file that can be saved to your local machine.
[11]:
# Save the pandas dataframe as a CSV file in the working environment
grid.to_csv('file_name.csv')
if 'google.colab' in sys.modules:
    from google.colab import files
    # download the file to your local machine
    files.download('file_name.csv')
Conclusions¶
In this notebook we have shown how to use the Dimensions Analytics API extract_affiliations function to assign GRID identifiers to organization data.
For more background, see the extract_affiliations function documentation, as well as the other functions available via the Dimensions API.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.