
Mapping GRID IDs to Organization Data

In this tutorial, we show how to use the Dimensions Analytics API and organization data to extract GRID IDs.

GRID is a free and openly available global database of research-related organizations that gives each one a unique and persistent identifier. Dimensions uses this identifier to link organizations to publications, grants, etc.

Use case scenarios:

  • An analyst has a list of organizations of interest and wants to get details of their publications from Dimensions. To do this, they need to map the organizations to GRID IDs so they can extract information from the Dimensions database. The organization data can be run through the Dimensions API extract_affiliations function to extract GRID IDs, which can then be used to get publication statistics.

  • A second use case is to standardize messy organization data for analysis. For example, an analyst might have a set of affiliation data containing many variants of organization names (“University of Cambridge”, “Cambridge University”). By mapping to GRID IDs, the analyst can standardize the data so it’s easier to analyse.
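
The second use case can be sketched offline as follows. This is only an illustration: the table below is hand-written, and the GRID ID in it is a placeholder rather than a real identifier. The column names follow the extract_affiliations output shown later in this notebook.

```python
import pandas as pd

# Hypothetical mapping output; "grid.EXAMPLE.1" is a placeholder, not a real GRID ID
mapped = pd.DataFrame({
    "input.name": ["University of Cambridge", "Cambridge University", "Univ. of Cambridge"],
    "grid_id": ["grid.EXAMPLE.1"] * 3,
    "grid_name": ["University of Cambridge"] * 3,
})

# Collapse every name variant onto a single identifier and canonical name
variants_per_org = (
    mapped.groupby(["grid_id", "grid_name"])["input.name"]
    .apply(list)
    .reset_index(name="variants")
)
print(variants_per_org)
```

After grouping, each organization appears once, with all of its spelling variants collected in one place.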

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

To generate an API key from the Dimensions webapp, go to “My Account”. Under “General Settings” there is an “API key” section where there is a “Create API key” button. More information on this can be found here.

[5]:
!pip install dimcli --quiet

import dimcli
from dimcli.utils import *
from dimcli.functions import extract_affiliations

import json
import sys
import pandas as pd
import re
import time

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"

if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.9.1)
Connected to: https://app.dimensions.ai - DSL v1.31
Method: dsl.ini file

1. Importing Organization Data

There are several ways to obtain organization data. Below we show 2 ways to obtain organization data that can be run through the Dimensions API for GRID ID mapping. For the purposes of this demonstration, we will use method 1. Please uncomment the cells of method 2 if you wish to use that method instead.

  1. Manually Generate Organization Data

  2. Load Organization Data from Local Machine

Note: To map organization data to GRID IDs, the data must conform to the mapping specifications and contain the following 4 columns, with lowercase column headers (values may be left empty where data is not available):

  • name - name of the organization

  • city - city of the organization

  • state - state of the organization (use the full name of the state, not the abbreviation)

  • country - country of the organization
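
A minimal sketch of how a dataset could be brought into this shape is shown below. The helper name prepare_org_columns is hypothetical (it is not part of Dimcli), and we assume that extra columns can simply be dropped.

```python
import pandas as pd

REQUIRED_COLUMNS = ["name", "city", "state", "country"]

def prepare_org_columns(df):
    """Return a copy with lowercase, stripped headers and exactly the 4 required columns."""
    out = df.copy()
    out.columns = [str(c).strip().lower() for c in out.columns]
    for col in REQUIRED_COLUMNS:
        if col not in out.columns:
            out[col] = None  # column missing in the input: keep it, but empty
    return out[REQUIRED_COLUMNS]  # assumption: any other columns are not needed

# Hypothetical raw data with inconsistent headers and no state column
raw = pd.DataFrame({"Name": ["Brown University"], " City": ["Providence"], "Country": ["United States"]})
prepared = prepare_org_columns(raw)
print(prepared.columns.tolist())
```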

The user may use structured or unstructured organization data for mapping to GRID IDs like the following:

  • Structured Data e.g., [{"name":"Southwestern University", "city":"Georgetown", "state":"Texas", "country":"USA"}]

  • Unstructured Data e.g., [{"affiliation": "university of oxford, uk"}]

For purposes of this notebook, we will be utilizing structured data in a pandas dataframe. Therefore, please ensure your organization dataset resembles the format observed under method 1, below.
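
To make the difference between the two shapes concrete, the sketch below turns a structured record into an unstructured one by joining its fields into a single string. The helper to_affiliation_record is hypothetical, and we assume missing fields are None or empty strings.

```python
def to_affiliation_record(row):
    """Join the non-empty structured fields into one free-text affiliation string."""
    parts = [row.get(key) for key in ("name", "city", "state", "country")]
    text = ", ".join(str(p) for p in parts if p)
    return {"affiliation": text}

structured = [{"name": "Southwestern University", "city": "Georgetown", "state": "Texas", "country": "USA"}]
unstructured = [to_affiliation_record(r) for r in structured]
print(unstructured)
```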

1.1 Manually Generate Organization Data

The following cell builds an example organization dataset.

[2]:
# The following generates a table of organization data with 4 columns
organization_names = pd.Series(['Augusta Univeristy', 'Baylor College of Medicine', 'Brown University', 'California Institute of Technology', 'Duke Univerisity',
                                'Emory University', 'Florida State University', 'Harvard Medical School', 'Kent State University', 'New York University', 'Mayo Clinic'])
organization_cities = pd.Series(['Augusta', 'Houston', 'Providence', 'Pasadena', 'Durham',
                                 'Atlanta', 'Tallahassee', 'Boston', 'Kent', 'New York'])
organization_states = pd.Series(['Georgia', 'Texas', 'Rhode Island', 'California', 'North Carolina',
                                 'Georgia', 'Florida', 'Massachusetts', 'Ohio', 'New York'])
organization_countries = pd.Series(['United States', 'United States', 'United States', 'United States', 'United States',
                                    'United States', 'United States', 'United States', 'United States', 'United States', 'United States'])

orgs = pd.DataFrame({'name':organization_names, 'city':organization_cities, 'state':organization_states, 'country':organization_countries})

# Preview Dataset
orgs
[2]:
name city state country
0 Augusta Univeristy Augusta Georgia United States
1 Baylor College of Medicine Houston Texas United States
2 Brown University Providence Rhode Island United States
3 California Institute of Technology Pasadena California United States
4 Duke Univerisity Durham North Carolina United States
5 Emory University Atlanta Georgia United States
6 Florida State University Tallahassee Florida United States
7 Harvard Medical School Boston Massachusetts United States
8 Kent State University Kent Ohio United States
9 New York University New York New York United States
10 Mayo Clinic NaN NaN United States

1.2 Load Organization Data from Local Machine

The following cells can be used to import an Excel file of organization data from a local machine.

This method is useful when you need to map hundreds or thousands of organizations to GRID IDs, as the bulk process using the API will be much faster than mapping organizations individually.

Please uncomment the cells below if you wish to use this method.

[ ]:
# # Upload the organization dataset from local machine

# from google.colab import files
# uploaded = files.upload()
[ ]:
# # Load and preview the organization dataset into a pandas dataframe

# import io
# import pandas as pd

# orgs = pd.read_excel(io.BytesIO(uploaded['dataset_name.xlsx']))

# orgs.head()
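
Outside of Colab, the upload step is not needed and the file can be read directly from disk. The following is a minimal sketch under that assumption; load_orgs is a hypothetical helper, and the demo writes a small temporary CSV file only so that the example is self-contained.

```python
import os
import tempfile

import pandas as pd

def load_orgs(path):
    """Load organization data from a CSV or Excel file, chosen by file extension."""
    if path.lower().endswith((".xlsx", ".xls")):
        return pd.read_excel(path)  # requires an Excel engine such as openpyxl
    return pd.read_csv(path)

# Self-contained demo: write a one-row CSV to a temporary folder and read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "orgs_demo.csv")
    pd.DataFrame({
        "name": ["Emory University"],
        "city": ["Atlanta"],
        "state": ["Georgia"],
        "country": ["United States"],
    }).to_csv(demo_path, index=False)
    loaded = load_orgs(demo_path)

print(loaded)
```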

2. Utilizing Dimensions API to Extract GRID IDs

The following cells will take our organization data and run it through the Dimensions API to pull back GRID IDs mapped to each organization.

Here, we use the “extract_affiliations” API function, which can be used to enrich private datasets containing non-disambiguated organization data with Dimensions GRID IDs.

[3]:
# First, we replace empty data with 'null' to satisfy mapping specifications

orgs = orgs.fillna('null')
[4]:
# Second, we convert the organization data from a dataframe to a list of dictionaries (JSON-like records) for GRID mapping

recs = orgs.to_dict(orient='records')
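
As a small offline illustration of the two preparation steps above, the sketch below applies them to a two-row table taken from the example data and prints the record that had missing values.

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["Kent State University", "Mayo Clinic"],
    "city": ["Kent", None],
    "state": ["Ohio", None],
    "country": ["United States", "United States"],
})

# Same two steps as above: fill gaps with 'null', then convert each row to a dictionary
demo_recs = demo.fillna("null").to_dict(orient="records")
print(demo_recs[1])
```
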
[6]:
# Then we will take the organization data, run it through the API and return GRID IDs

# Split records into batches; the API takes up to 200 records at a time.
def chunk_records(l, n):
    for i in range(0, len(l), n):
        yield l[i : i + n]

# Use dimcli's extract_affiliations API wrapper to process the data

chunksize = 200
results = []
for k,chunk in enumerate(chunk_records(recs, chunksize)):
    output = extract_affiliations(chunk, as_json=False)
    results.append(output)
    # Pause to avoid overloading API with too many calls too quickly
    time.sleep(1)
    print(f"{min((k+1)*chunksize, len(recs))} records complete!")
# DataFrame.append was removed in pandas 2.0, so the batches are combined with pd.concat
grid = pd.concat(results, sort=False, ignore_index=True) if results else pd.DataFrame()
11 records complete!
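
To see how the batching behaves on a larger dataset without calling the API, the sketch below runs the same chunking logic over 450 made-up records. The records are placeholders created only for this illustration.

```python
def chunk_records(records, size):
    """Yield successive batches of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start : start + size]

# 450 placeholder records, not real organization data
fake_records = [{"name": f"Org {i}"} for i in range(450)]
sizes = [len(chunk) for chunk in chunk_records(fake_records, 200)]
print(sizes)  # [200, 200, 50]
```

The last batch is smaller than the others, so each API call stays within the 200-record limit.
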
[7]:
# Preview the extracted GRID ID dataframe
# Note: data columns labeled with "input" are the original organization data supplied to the API

grid.head()
[7]:
input.name input.city input.state input.country grid_id grid_name grid_city grid_state grid_country requires_review geo_country_id geo_country_name geo_country_code geo_state_id geo_state_name geo_state_code geo_city_id geo_city_name
0 Augusta Univeristy Augusta Georgia United States grid.410427.4 Augusta University Augusta Georgia United States False 6252001 United States US 4197000 Georgia US-GA 4180531 Augusta
1 Baylor College of Medicine Houston Texas United States grid.39382.33 Baylor College of Medicine Houston Texas United States False 6252001 United States US 4736286 Texas US-TX 4699066 Houston
2 Brown University Providence Rhode Island United States grid.40263.33 Brown University Providence Rhode Island United States False 6252001 United States US 5224323 Rhode Island US-RI 5224151 Providence
3 California Institute of Technology Pasadena California United States grid.20861.3d California Institute of Technology Pasadena California United States False 6252001 United States US 5332921 California US-CA 5381396 Pasadena
4 Duke Univerisity Durham North Carolina United States grid.26009.3d Duke University Durham North Carolina United States False 6252001 United States US 4482348 North Carolina US-NC 4464368 Durham

Note: Some records returned in the GRID mapping may require manual review, because a single input can match more than one organization (see below). The user can use this information to update the original organization data supplied to this notebook.

[8]:
grid['requires_review'] = grid['requires_review'].astype(str)
grid_review = grid.loc[grid['requires_review'] == 'True']
grid_review
[8]:
input.name input.city input.state input.country grid_id grid_name grid_city grid_state grid_country requires_review geo_country_id geo_country_name geo_country_code geo_state_id geo_state_name geo_state_code geo_city_id geo_city_name
10 Mayo Clinic null null United States grid.417468.8 Mayo Clinic Scottsdale Arizona United States True 6252001 United States US 5551752 Arizona US-AZ 5313457 Scottsdale
11 Mayo Clinic null null United States grid.417468.8 Mayo Clinic Scottsdale Arizona United States True 6252001 United States US 5551752 Arizona US-AZ 4160021 Jacksonville
12 Mayo Clinic null null United States grid.417468.8 Mayo Clinic Scottsdale Arizona United States True 6252001 United States US 4155751 Florida US-FL 5313457 Scottsdale
13 Mayo Clinic null null United States grid.417468.8 Mayo Clinic Scottsdale Arizona United States True 6252001 United States US 4155751 Florida US-FL 4160021 Jacksonville
14 Mayo Clinic null null United States grid.417467.7 Mayo Clinic Jacksonville Florida United States True 6252001 United States US 5551752 Arizona US-AZ 5313457 Scottsdale
15 Mayo Clinic null null United States grid.417467.7 Mayo Clinic Jacksonville Florida United States True 6252001 United States US 5551752 Arizona US-AZ 4160021 Jacksonville
16 Mayo Clinic null null United States grid.417467.7 Mayo Clinic Jacksonville Florida United States True 6252001 United States US 4155751 Florida US-FL 5313457 Scottsdale
17 Mayo Clinic null null United States grid.417467.7 Mayo Clinic Jacksonville Florida United States True 6252001 United States US 4155751 Florida US-FL 4160021 Jacksonville
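
The review table repeats each candidate several times because of the different geographic combinations. One possible way to make it easier to read is to reduce it to the distinct candidate organizations per input, as in this sketch. The small table below is a hand-copied subset of the rows above, not a new API result.

```python
import pandas as pd

# Subset of the review rows shown above (2 rows per candidate GRID ID)
review = pd.DataFrame({
    "input.name": ["Mayo Clinic"] * 4,
    "grid_id": ["grid.417468.8", "grid.417468.8", "grid.417467.7", "grid.417467.7"],
    "grid_name": ["Mayo Clinic"] * 4,
    "grid_city": ["Scottsdale", "Scottsdale", "Jacksonville", "Jacksonville"],
    "grid_state": ["Arizona", "Arizona", "Florida", "Florida"],
})

# Keep one row per distinct candidate organization
candidates = (
    review[["input.name", "grid_id", "grid_name", "grid_city", "grid_state"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(candidates)
```

With the candidates listed side by side, the analyst can add the missing city or state to the input data and rerun the mapping.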

3. Save the GRID ID Dataset we created

The following cell exports the GRID ID mapped organization data to a CSV file that can be saved to your local machine.

[ ]:
# temporarily save pandas dataframe as file in colab environment
grid.to_csv('file_name.csv')

if 'google.colab' in sys.modules:

    from google.colab import files

    # download file to local machine
    files.download('file_name.csv')

Conclusions

In this notebook we have shown how to use the Dimensions Analytics API extract_affiliations function to assign GRID identifiers to organization data.
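
As a possible next step for the first use case, the extracted GRID IDs can be placed into a DSL query for publications. The sketch below only builds the query string. The helper build_publications_query is hypothetical, and the filter field research_orgs.id and the returned fields are assumptions that should be checked against the DSL documentation for your API version.

```python
import json

def build_publications_query(grid_ids, limit=100):
    """Build a DSL query string for publications linked to the given GRID IDs."""
    ids = json.dumps(sorted(set(grid_ids)))  # de-duplicated, double-quoted list
    # Assumption: research_orgs.id is the publication filter for organization IDs
    return (
        f"search publications where research_orgs.id in {ids} "
        f"return publications[id+title+year] limit {limit}"
    )

query = build_publications_query(["grid.410427.4", "grid.39382.33"])
print(query)
```

With a logged-in Dimcli session, a string like this could then be passed to dsl.query(query).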

For more background, see the extract_affiliations function documentation, as well as the other functions available via the Dimensions API.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
