
The Organizations API: Features Overview

This tutorial provides an overview of the Organizations data source available via the Dimensions Analytics API.

Organizations data in Dimensions is based on GRID, the Global Research Identifier Database.

The topics covered in this notebook are:

  • How to align your affiliation data with GRID/Dimensions using the API disambiguation service

  • How to retrieve organizations metadata using the search fields available

  • How to use the schema API to obtain some statistics about the Organizations data available

Prerequisites

Please install the latest versions of the libraries used below to run this notebook (the first cell installs dimcli and logs in).

[22]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai"

!pip install dimcli -U --quiet


# import all libraries and login
import dimcli
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

import pandas as pd
import sys
import json
import time
from tqdm.notebook import tqdm as pbar
DimCli v0.6.6.5 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)

1. Matching affiliation data to GRID IDs using extract_affiliations

The API function extract_affiliations (docs) can be used to enrich private datasets that contain non-disambiguated organization data with Dimensions IDs, so that you can then take advantage of the wealth of linked data available in Dimensions.

For example, let’s assume our dataset has four columns (affiliation name, city, state and country), any of which can be empty:

[23]:
affiliations = [
                ['University of Nebraska–Lincoln', 'Lincoln', 'Nebraska', 'United States'],
                ['Tarbiat Modares University', 'Tehran', '', 'Iran'],
                ['Harvard University', 'Cambridge', 'Massachusetts', 'United States'],
                ['China Academy of Chinese Medical Sciences', 'Beijing', '', 'China'],
                ['Liaoning University', 'Shenyang', '', 'China'],
                ['Liaoning Normal University', 'Dalian', '', 'China'],
                ['P.G. Department of Zoology and Research Centre, Shri Shiv Chhatrapati College of Arts, Commerce and Science, Junnar 410502, Pune, India.', '', '', ''],
                ['Sungkyunkwan University', 'Seoul', '', 'South Korea'],
                ['Centre for Materials for Electronics Technology', 'Pune', '', 'India'],
                ['Institut Necker-Enfants Malades (INEM), INSERM U1151-CNRS UMR8253, Université de Paris, Faculté de Médecine, 156 rue de Vaugirard, 75730 Paris Cedex 15, France', '', '', '']
                ]

We want to look up GRID identifiers for those affiliations using structured affiliation matching, where name, city, state and country are passed as separate arguments.

[24]:
for d in pbar(affiliations):
    res = dsl.query(f"""extract_affiliations(name="{d[0]}", city="{d[1]}", state="{d[2]}", country="{d[3]}")""")
    time.sleep(0.5)
    print(res.json)
{'results': [{'institutes': [{'institute': {'id': 'grid.24434.35', 'name': 'University of Nebraska–Lincoln', 'city': 'Lincoln', 'state': 'Nebraska', 'country': 'United States'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 5072006, 'name': 'Lincoln'}], 'states': [{'geonames_id': 5073708, 'name': 'Nebraska'}], 'countries': [{'geonames_id': 6252001, 'name': 'United States', 'code': 'US'}]}, 'input': {'name': 'University of Nebraska–Lincoln', 'city': 'Lincoln', 'state': 'Nebraska', 'country': 'United States'}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.412266.5', 'name': 'Tarbiat Modares University', 'city': 'Tehran', 'state': None, 'country': 'Iran'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 112931, 'name': 'Tehran'}], 'states': [{'geonames_id': 110791, 'name': 'Tehran'}], 'countries': [{'geonames_id': 130758, 'name': 'Iran', 'code': 'IR'}]}, 'input': {'name': 'Tarbiat Modares University', 'city': 'Tehran', 'state': '', 'country': 'Iran'}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.38142.3c', 'name': 'Harvard University', 'city': 'Cambridge', 'state': 'Massachusetts', 'country': 'United States'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 4931972, 'name': 'Cambridge'}], 'states': [{'geonames_id': 6254926, 'name': 'Massachusetts'}], 'countries': [{'geonames_id': 6252001, 'name': 'United States', 'code': 'US'}]}, 'input': {'name': 'Harvard University', 'city': 'Cambridge', 'state': 'Massachusetts', 'country': 'United States'}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.410318.f', 'name': 'China Academy of Chinese Medical Sciences', 'city': 'Beijing', 'state': None, 'country': 'China'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1816670, 'name': 'Beijing'}], 'states': [{'geonames_id': 2038349, 'name': 'Beijing'}], 'countries': [{'geonames_id': 1814991, 'name': 'China', 'code': 'CN'}]}, 'input': {'name': 'China Academy of Chinese Medical Sciences', 'city': 'Beijing', 'state': '', 'country': 'China'}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.411356.4', 'name': 'Liaoning University', 'city': 'Shenyang', 'state': None, 'country': 'China'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 2034937, 'name': 'Shenyang'}], 'states': [], 'countries': [{'geonames_id': 1814991, 'name': 'China', 'code': 'CN'}]}, 'input': {'name': 'Liaoning University', 'city': 'Shenyang', 'state': '', 'country': 'China'}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.440818.1', 'name': 'Liaoning Normal University', 'city': 'Dalian', 'state': None, 'country': 'China'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1814087, 'name': 'Dalian'}], 'states': [], 'countries': [{'geonames_id': 1814991, 'name': 'China', 'code': 'CN'}]}, 'input': {'name': 'Liaoning Normal University', 'city': 'Dalian', 'state': '', 'country': 'China'}}]}
{'results': [{'institutes': [], 'geo': {'cities': [], 'states': [], 'countries': []}, 'input': {'name': 'P.G. Department of Zoology and Research Centre, Shri Shiv Chhatrapati College of Arts, Commerce and Science, Junnar 410502, Pune, India.', 'city': '', 'state': '', 'country': ''}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.264381.a', 'name': 'Sungkyunkwan University', 'city': 'Seoul', 'state': None, 'country': 'South Korea'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1835848, 'name': 'Seoul'}], 'states': [], 'countries': [{'geonames_id': 1835841, 'name': 'South Korea', 'code': 'KR'}]}, 'input': {'name': 'Sungkyunkwan University', 'city': 'Seoul', 'state': '', 'country': 'South Korea'}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.494569.3', 'name': 'Centre for Materials for Electronics Technology', 'city': 'Pune', 'state': None, 'country': 'India'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1259229, 'name': 'Pune'}], 'states': [{'geonames_id': 1264418, 'name': 'Maharashtra'}], 'countries': [{'geonames_id': 1269750, 'name': 'India', 'code': 'IN'}]}, 'input': {'name': 'Centre for Materials for Electronics Technology', 'city': 'Pune', 'state': '', 'country': 'India'}}]}
{'results': [{'institutes': [{'institute': {'id': 'grid.410511.0', 'name': 'Paris 12 Val de Marne University', 'city': 'Paris', 'state': None, 'country': 'France'}, 'metadata': {'requires_manual_review': True}}, {'institute': {'id': 'grid.5842.b', 'name': 'University of Paris-Sud', 'city': 'Orsay', 'state': None, 'country': 'France'}, 'metadata': {'requires_manual_review': True}}, {'institute': {'id': 'grid.11318.3a', 'name': 'Paris 13 University', 'city': 'Villetaneuse', 'state': None, 'country': 'France'}, 'metadata': {'requires_manual_review': True}}], 'geo': {'cities': [{'geonames_id': 2989204, 'name': 'Orsay'}, {'geonames_id': 2988507, 'name': 'Paris'}, {'geonames_id': 2968275, 'name': 'Villetaneuse'}], 'states': [], 'countries': [{'geonames_id': 3017382, 'name': 'France', 'code': 'FR'}]}, 'input': {'name': 'Institut Necker-Enfants Malades (INEM), INSERM U1151-CNRS UMR8253, Université de Paris, Faculté de Médecine, 156 rue de Vaugirard, 75730 Paris Cedex 15, France', 'city': '', 'state': '', 'country': ''}}]}
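
Each response nests the candidate organizations under results, then institutes. The following is a minimal sketch of how such a response could be flattened into one row per candidate. The helper name flatten_structured is hypothetical (it is not part of dimcli), and the two sample responses are copied from the output above and trimmed to the keys used here.

```python
# Two responses copied from the output above, trimmed to the keys used here.
harvard = {'results': [{
    'institutes': [{
        'institute': {'id': 'grid.38142.3c', 'name': 'Harvard University',
                      'city': 'Cambridge', 'state': 'Massachusetts',
                      'country': 'United States'},
        'metadata': {'requires_manual_review': False}}],
    'input': {'name': 'Harvard University'}}]}

# The input name is shortened here for readability.
no_match = {'results': [{
    'institutes': [],
    'input': {'name': 'P.G. Department of Zoology and Research Centre'}}]}


def flatten_structured(response):
    """Hypothetical helper: one row per candidate institute.

    An input with no candidates still yields one row, with empty values,
    so that unmatched affiliations are not silently dropped.
    """
    rows = []
    for result in response.get('results', []):
        input_name = result.get('input', {}).get('name', '')
        institutes = result.get('institutes', [])
        if not institutes:
            rows.append({'input_name': input_name, 'grid_id': None,
                         'grid_name': None, 'requires_manual_review': None})
        for item in institutes:
            rows.append({
                'input_name': input_name,
                'grid_id': item['institute']['id'],
                'grid_name': item['institute']['name'],
                'requires_manual_review': item['metadata']['requires_manual_review'],
            })
    return rows


rows = flatten_structured(harvard) + flatten_structured(no_match)
for row in rows:
    print(row)
```

A list of rows like this can be passed directly to pd.DataFrame(rows) if a table is more convenient.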

If we combine the affiliation data into a single long string, we can also perform the same kind of operation using the unstructured affiliation matching.

[25]:
# implicit results
for d in pbar(affiliations):
    merged = f"{d[0]} {d[1]} {d[2]} {d[3]}"
    res = dsl.query(f"""extract_affiliations(affiliation="{merged}")""")
    time.sleep(0.5)
    print(res.json)
{'results': [{'matches': [{'affiliation_part': 'University of Nebraska–Lincoln Lincoln Nebraska United States', 'institutes': [{'institute': {'id': 'grid.24434.35', 'name': 'University of Nebraska–Lincoln', 'city': 'Lincoln', 'state': 'Nebraska', 'country': 'United States'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 5072006, 'name': 'Lincoln'}], 'states': [{'geonames_id': 5073708, 'name': 'Nebraska'}], 'countries': [{'geonames_id': 6252001, 'name': 'United States', 'code': 'US'}]}}], 'input': {'affiliation': 'University of Nebraska–Lincoln Lincoln Nebraska United States'}}]}
{'results': [{'matches': [{'affiliation_part': 'Tarbiat Modares University Tehran Iran', 'institutes': [{'institute': {'id': 'grid.412266.5', 'name': 'Tarbiat Modares University', 'city': 'Tehran', 'state': None, 'country': 'Iran'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 112931, 'name': 'Tehran'}], 'states': [{'geonames_id': 110791, 'name': 'Tehran'}], 'countries': [{'geonames_id': 130758, 'name': 'Iran', 'code': 'IR'}]}}], 'input': {'affiliation': 'Tarbiat Modares University Tehran  Iran'}}]}
{'results': [{'matches': [{'affiliation_part': 'Harvard University Cambridge Massachusetts United States', 'institutes': [{'institute': {'id': 'grid.38142.3c', 'name': 'Harvard University', 'city': 'Cambridge', 'state': 'Massachusetts', 'country': 'United States'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 4931972, 'name': 'Cambridge'}], 'states': [{'geonames_id': 6254926, 'name': 'Massachusetts'}], 'countries': [{'geonames_id': 6252001, 'name': 'United States', 'code': 'US'}]}}], 'input': {'affiliation': 'Harvard University Cambridge Massachusetts United States'}}]}
{'results': [{'matches': [{'affiliation_part': 'China Academy of Chinese Medical Sciences Beijing China', 'institutes': [{'institute': {'id': 'grid.410318.f', 'name': 'China Academy of Chinese Medical Sciences', 'city': 'Beijing', 'state': None, 'country': 'China'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1816670, 'name': 'Beijing'}], 'states': [{'geonames_id': 2038349, 'name': 'Beijing'}], 'countries': [{'geonames_id': 1814991, 'name': 'China', 'code': 'CN'}]}}], 'input': {'affiliation': 'China Academy of Chinese Medical Sciences Beijing  China'}}]}
{'results': [{'matches': [{'affiliation_part': 'Liaoning University Shenyang China', 'institutes': [{'institute': {'id': 'grid.411356.4', 'name': 'Liaoning University', 'city': 'Shenyang', 'state': None, 'country': 'China'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 2034937, 'name': 'Shenyang'}], 'states': [{'geonames_id': 2036115, 'name': 'Liaoning'}], 'countries': [{'geonames_id': 1814991, 'name': 'China', 'code': 'CN'}]}}], 'input': {'affiliation': 'Liaoning University Shenyang  China'}}]}
{'results': [{'matches': [{'affiliation_part': 'Liaoning Normal University Dalian China', 'institutes': [{'institute': {'id': 'grid.440818.1', 'name': 'Liaoning Normal University', 'city': 'Dalian', 'state': None, 'country': 'China'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1814087, 'name': 'Dalian'}], 'states': [{'geonames_id': 2036115, 'name': 'Liaoning'}], 'countries': [{'geonames_id': 1814991, 'name': 'China', 'code': 'CN'}]}}], 'input': {'affiliation': 'Liaoning Normal University Dalian  China'}}]}
{'results': [{'matches': [{'affiliation_part': 'P.G. Department of Zoology and Research Centre, Shri Shiv Chhatrapati College of Arts, Commerce and Science, Junnar 410502, Pune, India', 'institutes': [], 'geo': {'cities': [{'geonames_id': 1268761, 'name': 'Junnar'}], 'states': [{'geonames_id': 1264418, 'name': 'Maharashtra'}], 'countries': [{'geonames_id': 1269750, 'name': 'India', 'code': 'IN'}]}}], 'input': {'affiliation': 'P.G. Department of Zoology and Research Centre, Shri Shiv Chhatrapati College of Arts, Commerce and Science, Junnar 410502, Pune, India.   '}}]}
{'results': [{'matches': [{'affiliation_part': 'Sungkyunkwan University Seoul South Korea', 'institutes': [{'institute': {'id': 'grid.264381.a', 'name': 'Sungkyunkwan University', 'city': 'Seoul', 'state': None, 'country': 'South Korea'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1835848, 'name': 'Seoul'}], 'states': [], 'countries': [{'geonames_id': 1835841, 'name': 'South Korea', 'code': 'KR'}]}}], 'input': {'affiliation': 'Sungkyunkwan University Seoul  South Korea'}}]}
{'results': [{'matches': [{'affiliation_part': 'Centre for Materials for Electronics Technology Pune India', 'institutes': [{'institute': {'id': 'grid.494569.3', 'name': 'Centre for Materials for Electronics Technology', 'city': 'Pune', 'state': None, 'country': 'India'}, 'metadata': {'requires_manual_review': False}}], 'geo': {'cities': [{'geonames_id': 1259229, 'name': 'Pune'}], 'states': [{'geonames_id': 1264418, 'name': 'Maharashtra'}], 'countries': [{'geonames_id': 1269750, 'name': 'India', 'code': 'IN'}]}}], 'input': {'affiliation': 'Centre for Materials for Electronics Technology Pune  India'}}]}
{'results': [{'matches': [{'affiliation_part': 'Institut Necker-Enfants Malades (INEM), INSERM U1151-CNRS UMR8253, Université de Paris, Faculté de Médecine, 156 rue de Vaugirard, 75730 Paris Cedex 15, France', 'institutes': [{'institute': {'id': 'grid.11318.3a', 'name': 'Paris 13 University', 'city': 'Villetaneuse', 'state': None, 'country': 'France'}, 'metadata': {'requires_manual_review': True}}, {'institute': {'id': 'grid.410511.0', 'name': 'Paris 12 Val de Marne University', 'city': 'Paris', 'state': None, 'country': 'France'}, 'metadata': {'requires_manual_review': True}}, {'institute': {'id': 'grid.5842.b', 'name': 'University of Paris-Sud', 'city': 'Orsay', 'state': None, 'country': 'France'}, 'metadata': {'requires_manual_review': True}}], 'geo': {'cities': [{'geonames_id': 2989204, 'name': 'Orsay'}, {'geonames_id': 2988507, 'name': 'Paris'}, {'geonames_id': 2968275, 'name': 'Villetaneuse'}], 'states': [], 'countries': [{'geonames_id': 3017382, 'name': 'France', 'code': 'FR'}]}}], 'input': {'affiliation': 'Institut Necker-Enfants Malades (INEM), INSERM U1151-CNRS UMR8253, Université de Paris, Faculté de Médecine, 156 rue de Vaugirard, 75730 Paris Cedex 15, France   '}}]}
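
In the unstructured output the candidates sit one level deeper, under results, then matches, then institutes. As a rough sketch, each response could be classified as matched, needing review, or unmatched by looking at the requires_manual_review flag. The function name match_status and the three status labels are assumptions made for this example, and the sample responses are copied from the output above and trimmed.

```python
# Responses copied from the output above, trimmed to the keys used here.
liaoning = {'results': [{'matches': [{'institutes': [
    {'institute': {'id': 'grid.411356.4', 'name': 'Liaoning University'},
     'metadata': {'requires_manual_review': False}}]}]}]}

paris = {'results': [{'matches': [{'institutes': [
    {'institute': {'id': 'grid.11318.3a', 'name': 'Paris 13 University'},
     'metadata': {'requires_manual_review': True}},
    {'institute': {'id': 'grid.410511.0', 'name': 'Paris 12 Val de Marne University'},
     'metadata': {'requires_manual_review': True}},
    {'institute': {'id': 'grid.5842.b', 'name': 'University of Paris-Sud'},
     'metadata': {'requires_manual_review': True}}]}]}]}

junnar = {'results': [{'matches': [{'institutes': []}]}]}


def match_status(response):
    """Hypothetical helper: return (status, list of candidate GRID IDs)."""
    institutes = [
        item
        for result in response.get('results', [])
        for match in result.get('matches', [])
        for item in match.get('institutes', [])
    ]
    if not institutes:
        return 'unmatched', []
    ids = [item['institute']['id'] for item in institutes]
    # Any candidate flagged for review makes the whole record a review case.
    needs_review = any(item['metadata']['requires_manual_review'] for item in institutes)
    return ('review' if needs_review else 'matched'), ids


for label, response in [('liaoning', liaoning), ('paris', paris), ('junnar', junnar)]:
    print(label, match_status(response))
```

Separating the review and unmatched cases makes it easier to decide which affiliations need a manual check before the GRID IDs are used downstream.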

NOTE: the above commands also support bulk querying, which reduces the number of API queries needed. Check out the docs for more info.
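
As an illustration of bulk querying, the sketch below groups affiliations into batches and builds one query string per batch. It assumes the bulk form takes a json= argument holding a list of records, which should be verified against the docs before use. The helper names chunks and build_bulk_query and the batch size of 2 are hypothetical, and any limit on batch size should also be taken from the docs.

```python
import json

# A few rows in the same four-column layout used earlier in this notebook.
affiliations = [
    ['Harvard University', 'Cambridge', 'Massachusetts', 'United States'],
    ['Liaoning University', 'Shenyang', '', 'China'],
    ['Sungkyunkwan University', 'Seoul', '', 'South Korea'],
]


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_query(rows):
    """Hypothetical helper: build one query string for a batch of rows.

    ASSUMPTION: the bulk syntax is extract_affiliations(json=[...]).
    json.dumps also takes care of escaping double quotes inside values.
    """
    keys = ('name', 'city', 'state', 'country')
    payload = [dict(zip(keys, row)) for row in rows]
    return "extract_affiliations(json=" + json.dumps(payload, ensure_ascii=False) + ")"


queries = [build_bulk_query(batch) for batch in chunks(affiliations, 2)]
for q in queries:
    print(q)
```

Each query string could then be passed to dsl.query(...) in the same way as the single-record queries above.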

2. Searching GRID organizations

This can be done using full text search and/or fielded search.

Returning facets

[31]:
%%dsldf
search organizations
  for "new york"
return country_name
Returned Country_name: 10
[31]:
id count
0 United States 220
1 Albania 1
2 Canada 1
3 China 1
4 Czechia 1
5 France 1
6 Italy 1
7 South Korea 1
8 United Arab Emirates 1
9 United Kingdom 1
[32]:
%%dsldf
search organizations
  for "new york"
  where country_name = "United States"
return types
Returned Types: 8
[32]:
id count
0 Education 63
1 Nonprofit 45
2 Government 38
3 Other 26
4 Healthcare 19
5 Archive 9
6 Facility 5
7 Company 3

3. A closer look at the organizations data statistics

The Dimensions Search Language exposes its metadata programmatically, including the supported sources and entities, along with their fields, facets, fieldsets, metrics and search fields.

[33]:
# ask the API to describe its own schema
res = dsl.query("describe schema")

docs_for = ['organizations']
header = "sources"

# collect one row per field of each selected source
d = {"sources": [], 'field': [], 'type': [], 'description': [], 'is_filter': [], 'is_entity': [], 'is_facet': []}
for S in docs_for:
    source_fields = res.json[header][S]['fields']
    for x in sorted(source_fields):
        d[header].append(S)
        d['field'].append(x)
        d['type'].append(source_fields[x]['type'])
        d['description'].append(source_fields[x]['description'])
        d['is_filter'].append(source_fields[x]['is_filter'])
        # not every field declares these two flags, so default to False
        d['is_facet'].append(source_fields[x].get('is_facet', False))
        d['is_entity'].append(source_fields[x].get('is_entity', False))

fields = pd.DataFrame.from_dict(d)
fields
[33]:
sources field type description is_filter is_entity is_facet
0 organizations acronym string GRID acronym of the organization. E.g., "UT" f... True False False
1 organizations city_name string GRID name of the organization country. E.g., "... True False True
2 organizations cnrs_ids string CNRS IDs for this organization True False False
3 organizations country_name string GRID name of the organization country. E.g., "... True False True
4 organizations established integer Year when the organization was estabilished True False False
5 organizations external_ids_fundref string Fundref IDs for this organization True False False
6 organizations hesa_ids string HESA IDs for this organization True False False
7 organizations id string GRID ID of the organization. E.g., "grid.26999... True False False
8 organizations isni_ids string ISNI IDs for this organization True False False
9 organizations latitude float None False False False
10 organizations linkout string None False False False
11 organizations longitude float None False False False
12 organizations name string GRID name of the organization. E.g., "Universi... True False False
13 organizations organization_child_ids string Child organization IDs True False False
14 organizations organization_parent_ids string Parent organization IDs True False False
15 organizations organization_related_ids string Related organization IDs True False False
16 organizations orgref_ids string OrgRef IDs for this organization True False False
17 organizations state_name string GRID name of the organization country. E.g., "... True False True
18 organizations types string Type of an organization. Available types inclu... True False True
19 organizations ucas_ids string UCAS IDs for this organization True False False
20 organizations ukprn_ids string UKPRN IDs for this organization True False False
21 organizations wikidata_ids string WikiData IDs for this organization True False False
22 organizations wikipedia_url string Wikipedia URL False False False

We can use the field information above to draw up some quick statistics about the organizations source.

To do this, we use the operator is not empty to automatically generate queries such as "search organizations where field_name is not empty return organizations limit 1", and then read the total_count value from the JSON we get back (exposed by dimcli as count_total) for our statistics.

[34]:
# one query with `is not empty` for field-filters
q_template = """search organizations where {} is not empty return organizations[id] limit 1"""

# seed results with total number of orgs
totorgs = dsl.query("""search organizations return organizations[id] limit 1""", verbose=False).count_total
stats = [
    {'filter_by': 'All Organizations (no filter)', 'results' : totorgs}
]

for index, row in pbar(list(fields.iterrows())):
    # print("\n===", row['field'])
    q = q_template.format(row['field'])  # the template has a single placeholder
    res = dsl.query(q, verbose=False)
    time.sleep(0.5)
    stats.append({'filter_by': row['field'], 'results' : res.count_total})


# save to a dataframe
df = pd.DataFrame(stats)
df.sort_values("results", inplace=True, ascending=False)
df

[34]:
filter_by results
0 All Organizations (no filter) 98739
8 id 98739
13 name 98739
4 country_name 98737
2 city_name 98666
19 types 96367
10 latitude 89252
12 longitude 89252
11 linkout 88736
5 established 78269
9 isni_ids 43762
1 acronym 37793
18 state_name 36449
23 wikipedia_url 32590
22 wikidata_ids 32515
17 orgref_ids 14950
15 organization_parent_ids 11663
6 external_ids_fundref 10109
16 organization_related_ids 4070
14 organization_child_ids 2996
3 cnrs_ids 833
21 ukprn_ids 173
7 hesa_ids 172
20 ucas_ids 153
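
The raw counts are easier to compare as a share of all organizations. The sketch below computes a coverage percentage for a handful of fields, using counts copied from the table above. The variable names are illustrative only.

```python
# Counts copied from the results table above.
total_orgs = 98739
counts = {
    'country_name': 98737,
    'established': 78269,
    'isni_ids': 43762,
    'ucas_ids': 153,
}

# Share of organizations with a non-empty value, as a percentage.
coverage = {field: round(100 * n / total_orgs, 1) for field, n in counts.items()}

# Print fields from best to worst coverage.
for field, pct in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{field:<15} {pct:>5}%")
```

With the df dataframe built in the previous cell, the same figure could be added as a column with df["results"] / totorgs * 100.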

Let’s visualize the data as a bar chart with Plotly.

[ ]:
!pip install plotly
[36]:
import plotly.express as px
[37]:
px.bar(df, x="filter_by", y="results",
       title="Fields distribution for GRID data")