Mapping Organization IDs to Organization Data¶
In this tutorial, we show how to use the Dimensions Analytics API to map organization data to Dimensions organization IDs.
Use case scenarios:
An analyst has a list of organizations of interest and wants to get details of their publications from Dimensions. To do this, they need to map the organizations to organization IDs so they can extract information from the Dimensions database. The organization data can be run through the Dimensions API extract_affiliations function to extract IDs, which can then be used to retrieve publication statistics.
A second use case is to standardize messy organization data for analysis. For example, an analyst might have a set of affiliation data containing many variants of organization names (“University of Cambridge”, “Cambridge University”). By mapping to IDs, the analyst can standardize the data so it’s easier to analyse.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Sep 10, 2025
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
To generate an API key from the Dimensions webapp, go to “My Account”. Under “General Settings” there is an “API key” section where there is a “Create API key” button. More information on this can be found here.
[2]:
!pip install dimcli --quiet
import dimcli
from dimcli.utils import *
from dimcli.functions import extract_affiliations
import json
import sys
import pandas as pd
import re
import time
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.4)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.12
Method: dsl.ini file
1. Importing Organization Data¶
There are several ways to obtain organization data. Below we show two different ways to obtain organization data that can be run through the Dimensions API for ID mapping. For the purposes of this demonstration, we will use method 1. Please uncomment the other sections if you wish to use those methods instead.
Manually Generate Organization Data
Load Organization Data from Local Machine
Note - To map organization data to IDs, the data must conform to the mapping specifications and contain data (if available) for the following 4 columns, with lowercase column headers:

name - name of the organization
city - city of the organization
state - state of the organization (use the full name of the state, not an acronym)
country - country of the organization
The user may use structured or unstructured organization data for mapping to IDs like the following:
Structured data, e.g.:
[{"name":"Southwestern University", "city":"Georgetown", "state":"Texas", "country":"USA"}]
Unstructured data, e.g.:
[{"affiliation": "university of oxford, uk"}]
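The two accepted record shapes can be illustrated with a small check. This is a minimal sketch for clarity only: `is_valid_record` is a hypothetical helper, not the API's own validation, and extract_affiliations may accept partially filled structured records as well.

```python
# Minimal sketch: the two record shapes accepted for affiliation mapping.
# is_valid_record is a hypothetical helper, not part of dimcli.
structured = [{"name": "Southwestern University", "city": "Georgetown",
               "state": "Texas", "country": "USA"}]
unstructured = [{"affiliation": "university of oxford, uk"}]

def is_valid_record(rec):
    """Return True if a record matches either accepted shape."""
    return set(rec) == {"name", "city", "state", "country"} or set(rec) == {"affiliation"}

print(all(is_valid_record(r) for r in structured + unstructured))  # True
```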
For purposes of this notebook, we will be utilizing structured data in a pandas dataframe. Therefore, please ensure your organization dataset resembles the format observed under method 1, below.
1.1 Manually Generate Organization Data¶
The following cell builds an example organization dataset.
[3]:
# The following generates a table of organization data with 4 columns.
# Some names are misspelled on purpose ('Univeristy', 'Univerisity') to show how the mapping handles messy input.
organization_names = pd.Series(['Augusta Univeristy', 'Baylor College of Medicine', 'Brown University', 'California Institute of Technology', 'Duke Univerisity',
'Emory University', 'Florida State University', 'Harvard Medical School', 'Kent State University', 'New York University', 'Mayo Clinic'])
organization_cities = pd.Series(['Augusta', 'Houston', 'Providence', 'Pasadena', 'Durham',
'Atlanta', 'Tallahassee', 'Boston', 'Kent', 'New York'])
organization_states = pd.Series(['Georgia', 'Texas', 'Rhode Island', 'California', 'North Carolina',
'Georgia', 'Florida', 'Massachusetts', 'Ohio', 'New York'])
organization_countries = pd.Series(['United States', 'United States', 'United States', 'United States', 'United States',
'United States', 'United States', 'United States', 'United States', 'United States', 'United States'])
orgs = pd.DataFrame({'name':organization_names, 'city':organization_cities, 'state':organization_states, 'country':organization_countries})
# Preview Dataset
orgs
[3]:
| name | city | state | country | |
|---|---|---|---|---|
| 0 | Augusta Univeristy | Augusta | Georgia | United States |
| 1 | Baylor College of Medicine | Houston | Texas | United States |
| 2 | Brown University | Providence | Rhode Island | United States |
| 3 | California Institute of Technology | Pasadena | California | United States |
| 4 | Duke Univerisity | Durham | North Carolina | United States |
| 5 | Emory University | Atlanta | Georgia | United States |
| 6 | Florida State University | Tallahassee | Florida | United States |
| 7 | Harvard Medical School | Boston | Massachusetts | United States |
| 8 | Kent State University | Kent | Ohio | United States |
| 9 | New York University | New York | New York | United States |
| 10 | Mayo Clinic | NaN | NaN | United States |
1.2 Load Organization Data from Local Machine¶
The following cells can be used to import an Excel file of organization data from your local machine.
This method is useful for when you need to map hundreds or thousands of organizations to IDs, as the bulk process using the API will be much faster than any individual mapping.
Please uncomment the cells below if you wish to use this method.
[4]:
# # Upload the organization dataset from local machine
# from google.colab import files
# uploaded = files.upload()
[5]:
# # Load and preview the organization dataset into a pandas dataframe
# import io
# import pandas as pd
# orgs = pd.read_excel(io.BytesIO(uploaded['dataset_name.xlsx']))
# orgs.head()
2. Utilizing Dimensions API to Extract IDs¶
The following cells will take our organization data and run it through the Dimensions API to pull back IDs mapped to each organization.
Here, we utilize the extract_affiliations API function, which can be used to enrich private datasets containing non-disambiguated organization data with Dimensions organization IDs.
[6]:
# First, we replace empty data with 'null' to satisfy mapping specifications
orgs = orgs.fillna('null')
[7]:
# Second, we will convert organization data from a dataframe to a dictionary (json) for ID mapping
recs = orgs.to_dict(orient='records')
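The two preparation steps above can be sketched on a one-row frame. The demo DataFrame here is invented for illustration and is not part of the notebook's pipeline.

```python
import pandas as pd

# Minimal sketch of the two preparation steps: missing values become the
# string 'null' (per the mapping specifications), then rows become dicts.
demo = pd.DataFrame({"name": ["Mayo Clinic"], "city": [None],
                     "state": [None], "country": ["United States"]})
demo_recs = demo.fillna("null").to_dict(orient="records")
print(demo_recs)
# [{'name': 'Mayo Clinic', 'city': 'null', 'state': 'null', 'country': 'United States'}]
```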
[8]:
# Then we take the organization data, run it through the API, and return organization IDs.
# Chunk the records into batches: the API accepts up to 200 records per call.
def chunk_records(l, n):
    for i in range(0, len(l), n):
        yield l[i : i + n]
# Use dimcli's extract_affiliations API wrapper to process the data
chunksize = 200
org_data = pd.DataFrame()
for k, chunk in enumerate(chunk_records(recs, chunksize)):
    output = extract_affiliations(chunk, as_json=False)
    org_data = pd.concat([org_data, output])
    # Pause to avoid overloading the API with too many calls too quickly
    time.sleep(1)
    print(f"{min((k+1)*chunksize, len(recs))} records complete!")
11 records complete!
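The chunking generator used above can be checked in isolation. This is a minimal standalone sketch of the same logic on a small list.

```python
# Minimal sketch of the batching logic: split a list into chunks of
# at most n items, as done before calling extract_affiliations.
def chunk_records(l, n):
    for i in range(0, len(l), n):
        yield l[i : i + n]

chunks = list(chunk_records(list(range(5)), 2))
print(chunks)  # [[0, 1], [2, 3], [4]]
```

With a chunk size of 200, the 11-record dataset in this notebook fits in a single API call.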
[9]:
# Preview the extracted organization ID dataframe
# Note: data columns labeled with "input" are the original organization data supplied to the API
org_data.head()
[9]:
| input.city | input.country | input.name | input.state | grid_id | grid_name | grid_city | grid_state | grid_country | requires_review | geo_country_id | geo_country_name | geo_country_code | geo_state_id | geo_state_name | geo_state_code | geo_city_id | geo_city_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Augusta | United States | Augusta Univeristy | Georgia | grid.410427.4 | Augusta University | Augusta | Georgia | United States | False | 6252001 | United States | US | 4197000 | Georgia | US-GA | 4180531 | Augusta |
| 1 | Houston | United States | Baylor College of Medicine | Texas | grid.39382.33 | Baylor College of Medicine | Houston | Texas | United States | False | 6252001 | United States | US | 4736286 | Texas | US-TX | 4699066 | Houston |
| 2 | Providence | United States | Brown University | Rhode Island | grid.40263.33 | Brown University | Providence | Rhode Island | United States | False | 6252001 | United States | US | 5224323 | Rhode Island | US-RI | 5224151 | Providence |
| 3 | Pasadena | United States | California Institute of Technology | California | grid.20861.3d | California Institute of Technology | Pasadena | California | United States | False | 6252001 | United States | US | 5332921 | California | US-CA | 5381396 | Pasadena |
| 4 | Durham | United States | Duke Univerisity | North Carolina | grid.26009.3d | Duke University | Durham | North Carolina | United States | False | 6252001 | United States | US | 4482348 | North Carolina | US-NC | 4464368 | Durham |
Note: Some records returned by the mapping may require manual review, since a result may match more than one candidate organization (see below). You can use this information to correct the original organization data supplied to this notebook.
[10]:
org_data['requires_review'] = org_data['requires_review'].astype(str)
org_data_review = org_data.loc[org_data['requires_review'] == 'True']
org_data_review
[10]:
| input.city | input.country | input.name | input.state | grid_id | grid_name | grid_city | grid_state | grid_country | requires_review | geo_country_id | geo_country_name | geo_country_code | geo_state_id | geo_state_name | geo_state_code | geo_city_id | geo_city_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The table is empty: none of the mapped records in this run require review.
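The review filter above can also be written as a direct boolean mask, avoiding the string conversion. A minimal sketch on invented data, assuming requires_review is returned as a boolean column:

```python
import pandas as pd

# Minimal sketch: isolate rows flagged for manual review using a
# boolean mask (demo data invented for illustration).
demo = pd.DataFrame({"grid_name": ["A Univ", "B Univ"],
                     "requires_review": [False, True]})
needs_review = demo.loc[demo["requires_review"]]
print(needs_review["grid_name"].tolist())  # ['B Univ']
```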
3. Save the ID Dataset we created¶
The following cell will export the ID-mapped organization data to a csv file that can be saved to your local machine.
[11]:
# temporarily save pandas dataframe as file in colab environment
org_data.to_csv('file_name.csv')
if 'google.colab' in sys.modules:
    from google.colab import files
    # download file to local machine
    files.download('file_name.csv')
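The extracted GRID IDs can now feed the first use case from the introduction: querying Dimensions for publications from these organizations. The sketch below only builds the query string; the exact DSL filter field is an assumption based on the publications source schema, and running the query requires the authenticated `dsl` object from the login cell above.

```python
# Hedged sketch: build a DSL publications query from mapped GRID IDs.
# The research_orgs.id filter is assumed from the Dimensions DSL schema;
# check the DSL documentation before relying on it.
grid_ids = ["grid.410427.4", "grid.39382.33"]  # sample IDs from the mapping above

id_list = ", ".join(f'"{g}"' for g in grid_ids)
query = (f"search publications where research_orgs.id in [{id_list}] "
         "return publications[id+title+year] limit 10")
print(query)
# To execute: results = dsl.query(query)
```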
Conclusions¶
In this notebook we have shown how to use the Dimensions Analytics API extract_affiliations function to assign identifiers to organization data.
For more background, see the extract_affiliations function documentation, as well as the other functions available via the Dimensions API.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out the associated GitHub repository for examples, the source code of these tutorials, and much more.