The Dimcli Python library: Working with Pandas Dataframes¶
Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas dataframe objects.
Dataframes are then easy to sort, analyse, export as CSV and use within visualisation softwares.
pandas is a popular software library written for the Python programming language for data manipulation and analysis.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Aug 10, 2023
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli --quiet
import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
import getpass
KEY = getpass.getpass(prompt='API Key: ')
dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
KEY = ""
dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.7
Method: dsl.ini file
1. General method to transform JSON query results into a dataframe¶
The DslDataset.as_dataframe
method allows to quickly turn any query results into a dataframe.
[3]:
# we'll reuse this query later on
query = """search publications for "graphene"
where year in [2013:2019]
return publications[id+type+volume+year+issue+title+journal+authors] sort by times_cited limit 1000"""
res = dsl.query(query)
Returned Publications: 1000 (total = 504501)
Time: 2.52s
WARNINGS [1]
Field current_organization_id of the authors field is deprecated and will be removed in the next major release.
[4]:
df = res.as_dataframe()
df.head(10)
[4]:
id | title | authors | issue | type | volume | year | journal.id | journal.title | |
---|---|---|---|---|---|---|---|---|---|
0 | pub.1024857999 | Van der Waals heterostructures | [{'affiliations': [{'city': 'Manchester', 'cit... | 7459 | article | 499 | 2013 | jour.1018957 | Nature |
1 | pub.1050119463 | The chemistry of two-dimensional layered trans... | [{'affiliations': [{'city': 'New Brunswick', '... | 4 | article | 5 | 2013 | jour.1041224 | Nature Chemistry |
2 | pub.1007330696 | Combining theory and experiment in electrocata... | [{'affiliations': [{'city': 'Stanford', 'city_... | 6321 | article | 355 | 2017 | jour.1346339 | Science |
3 | pub.1019126274 | The Li-Ion Rechargeable Battery: A Perspective | [{'affiliations': [{'city': 'Austin', 'city_id... | 4 | article | 135 | 2013 | jour.1081898 | Journal of the American Chemical Society |
4 | pub.1032956475 | Black phosphorus field-effect transistors | [{'affiliations': [{'city': 'Shanghai', 'city_... | 5 | article | 9 | 2014 | jour.1037429 | Nature Nanotechnology |
5 | pub.1036410643 | Review of Particle Physics | [{'affiliations': [], 'corresponding': '', 'cu... | 9 | article | 38 | 2014 | jour.1327822 | Chinese Physics C |
6 | pub.1055085164 | Aggregation-Induced Emission: Together We Shin... | [{'affiliations': [{'city': 'Shenzhen', 'city_... | 21 | article | 115 | 2015 | jour.1077147 | Chemical Reviews |
7 | pub.1004394295 | Interface engineering of highly efficient pero... | [{'affiliations': [{'city': 'Los Angeles', 'ci... | 6196 | article | 345 | 2014 | jour.1346339 | Science |
8 | pub.1045181228 | The emergence of perovskite solar cells | [{'affiliations': [{'city': 'Sydney', 'city_id... | 7 | article | 8 | 2014 | jour.1037430 | Nature Photonics |
9 | pub.1015305822 | Raman spectroscopy as a versatile tool for stu... | [{'affiliations': [{'city': 'Cambridge', 'city... | 4 | article | 8 | 2013 | jour.1037429 | Nature Nanotechnology |
Pandas dataframes offer a myriad of utilities for inspecting data. Check out the official docs or google a pandas tutorial to lean more about it.
[5]:
# the table shape
df.shape
[5]:
(1000, 9)
[6]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()
[6]:
2014 222
2013 183
2015 173
2016 155
2017 125
2018 83
2019 59
Name: year, dtype: int64
[7]:
# eg distribution of publication [type]
df['type'].value_counts()
[7]:
article 995
monograph 4
book 1
Name: type, dtype: int64
2. Dataframe Methods for ‘Publications’ queries¶
What follows are specialized versions of the as_dataframe
method for results sets composed of publication records.
Extracting authors: as_dataframe_authors
¶
Publications authors are usually returned by the Dimensions API inside a nested JSON object in the authors_affiliations
sub-key.
Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication
This methods allows to quickly extract that data and return a dataframe with one row per author.
[8]:
authors = res.as_dataframe_authors()
authors.head()
[8]:
affiliations | corresponding | current_organization_id | first_name | last_name | orcid | raw_affiliation | researcher_id | pub_id | |
---|---|---|---|---|---|---|---|---|---|
0 | [{'city': 'Manchester', 'city_id': 2643123, 'c... | grid.5379.8 | A. K. | Geim | [0000-0003-2861-8331] | [School of Physics and Astronomy, University o... | ur.0721730631.45 | pub.1024857999 | |
1 | [{'city': 'Manchester', 'city_id': 2643123, 'c... | True | grid.5379.8 | I. V. | Grigorieva | [0000-0001-5991-7778] | [School of Physics and Astronomy, University o... | ur.0767105504.29 | pub.1024857999 |
2 | [{'city': 'New Brunswick', 'city_id': 5101717,... | True | grid.5335.0 | Manish | Chhowalla | None | [Materials Science and Engineering, Rutgers Un... | ur.0633062306.03 | pub.1050119463 |
3 | [{'city': 'Ulsan', 'city_id': 1833747, 'countr... | grid.42687.3f | Hyeon Suk | Shin | None | [Interdisciplinary School of Green Energy and ... | ur.07617630407.83 | pub.1050119463 | |
4 | [{'city': 'Singapore', 'city_id': 1880252, 'co... | grid.4280.e | Goki | Eda | None | [Department of Physics, National University of... | ur.01150450507.27 | pub.1050119463 |
Using the authors dataframe, we can easily get the top ten values for current_organization_id
.
[9]:
authors['current_organization_id'].value_counts()[:10]
[9]:
grid.168010.e 148
133
grid.59025.3b 83
grid.12527.33 81
grid.59053.3a 78
grid.19006.3e 78
grid.166341.7 65
grid.41156.37 64
grid.89336.37 63
grid.116068.8 60
Name: current_organization_id, dtype: int64
Explanation: the most frequent organization turns out to be grid.168010.e. The first result is empty, meaning that for those authors Dimensions has no info about
current_organization_id
.
Extracting Affiliations: as_dataframe_authors_affiliations
¶
As you can see from the results of the previous section, the affiliations
of each author is yet another nested JSON object.
Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication
The as_dataframe_authors_affiliations
method allows to quickly extract that affiliations data and return a dataframe with one row per affiliation.
This can be useful e.g. if one wants to count research organizations at the time of writing (as opposed to current_organization_id
, which is the most recent organization of a researcher).
[10]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()
[10]:
aff_city | aff_city_id | aff_country | aff_country_code | aff_id | aff_name | aff_raw_affiliation | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Manchester | 2643123.0 | United Kingdom | GB | grid.5379.8 | University of Manchester | School of Physics and Astronomy, University of... | pub.1024857999 | ur.0721730631.45 | A. K. | Geim | ||
1 | Manchester | 2643123.0 | United Kingdom | GB | grid.5379.8 | University of Manchester | Centre for Mesoscience and Nanotechnology, Uni... | pub.1024857999 | ur.0721730631.45 | A. K. | Geim | ||
2 | Manchester | 2643123.0 | United Kingdom | GB | grid.5379.8 | University of Manchester | School of Physics and Astronomy, University of... | pub.1024857999 | ur.0767105504.29 | I. V. | Grigorieva | ||
3 | New Brunswick | 5101717.0 | United States | US | grid.430387.b | Rutgers, The State University of New Jersey | Materials Science and Engineering, Rutgers Uni... | New Jersey | US-NJ | pub.1050119463 | ur.0633062306.03 | Manish | Chhowalla |
4 | Ulsan | 1833747.0 | South Korea | KR | grid.42687.3f | Ulsan National Institute of Science and Techno... | Interdisciplinary School of Green Energy and L... | pub.1050119463 | ur.07617630407.83 | Hyeon Suk | Shin |
[11]:
affiliations.describe(include="all")
[11]:
aff_city | aff_city_id | aff_country | aff_country_code | aff_id | aff_name | aff_raw_affiliation | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 | 8353 |
unique | 524 | 527 | 52 | 52 | 852 | 1132 | 3155 | 106 | 53 | 990 | 4619 | 3645 | 2406 |
top | United States | US | Stanford University | Department of Materials Science and Engineerin... | pub.1019661721 | Yi | Wang | ||||||
freq | 555 | 555 | 2758 | 2758 | 830 | 300 | 48 | 5129 | 5333 | 108 | 161 | 70 | 308 |
Let’s get the top ten values for aff_id
.
[12]:
affiliations['aff_id'].value_counts()[:10]
[12]:
830
grid.168010.e 300
grid.59025.3b 185
grid.166341.7 146
grid.8217.c 130
grid.19006.3e 119
grid.12527.33 105
grid.21729.3f 99
grid.5379.8 94
grid.116068.8 93
Name: aff_id, dtype: int64
Another example: we can now easily analyze the data by country too.
[13]:
affiliations['aff_country'].value_counts()[:10]
[13]:
United States 2758
China 2072
United Kingdom 377
375
Germany 311
Singapore 298
South Korea 231
Japan 225
Australia 187
Switzerland 140
Name: aff_country, dtype: int64
Explanation: the vast majority of authors in this dataset are from China, closely followed by the USA.
3. Dataframe Methods for ‘Grants’ queries¶
Extracting Funders: as_dataframe_funders
¶
Grant funders authors are usually returned by the Dimensions API inside a nested JSON object in the funders
sub-key.
This methods allows to quickly extract that data and return a dataframe with one row per funder.
[26]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 14738)
Time: 1.13s
[25]:
res.as_dataframe_funders().head(10)
[25]:
grant_id | grant_title | grant_start_date | grant_end_date |
---|
Extracting investigators: as_dataframe_investigators
¶
Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the investigators
sub-key.
This methods allows to quickly extract that data and return a dataframe with one row per investigator.
NOTE:
investigators
are not returned by default in a grants query hence one must specify this in the query results
[16]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigators] limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 14738)
Time: 1.74s
WARNINGS [1]
Field 'grant_number' is deprecated in favor of project_numbers. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
[17]:
res.as_dataframe_investigators().head(10)
[17]:
affiliations | first_name | id | last_name | middle_name | role | grant_id | grant_title | grant_start_date | grant_end_date | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'city': 'Melbourne', 'city_id': '2158177', '... | Coralie | None | Boulet | PI | grant.13535694 | Regulation of Plasmodium falciparum egress fro... | 2024-02-01 | 2026-01-31 | |
1 | [{'city': 'São Paulo', 'city_id': '3448439', '... | Célia | ur.01311140517.22 | da Silva Garcia | Regina | PI | grant.9698278 | Calcium signaling in Plasmodium falciparum gam... | 2023-11-01 | 2023-12-15 |
2 | [{'city': 'Allschwil 3', 'city_id': '2661604',... | Penelope | ur.01356421006.17 | Vounatsou | PI | grant.13219531 | Spatio-temporal modelling of climate variabili... | 2023-10-01 | 2026-09-30 | |
3 | [{'city': 'Heidelberg', 'city_id': '2907911', ... | Ina | ur.01006224125.53 | Danquah | PI | grant.13219531 | Spatio-temporal modelling of climate variabili... | 2023-10-01 | 2026-09-30 | |
4 | [{'city': None, 'city_id': '184745', 'country'... | Stephen | ur.01363557726.68 | Munga | Co-PI | grant.13219531 | Spatio-temporal modelling of climate variabili... | 2023-10-01 | 2026-09-30 | |
5 | [{'city': None, 'city_id': '2357163', 'country... | Ali | ur.01057017117.60 | Sié | Co-PI | grant.13219531 | Spatio-temporal modelling of climate variabili... | 2023-10-01 | 2026-09-30 | |
6 | [{'city': 'SAN ANTONIO', 'city_id': '4726206',... | Zhuolin | None | Qu | None | PI | grant.13535333 | LEAPS-MPS: Development of Novel Multistage Mod... | 2023-09-01 | 2025-08-31 |
7 | [{'city': None, 'city_id': '2648579', 'country... | Matthias | ur.01315014330.55 | Marti | PI | grant.13306682 | Defining molecular determinants of Plasmodium ... | 2023-08-31 | 2025-08-30 | |
8 | [{'city': None, 'city_id': '2648579', 'country... | Barbara | ur.013052533674.51 | Stokes | Hoyt | PI | grant.13306682 | Defining molecular determinants of Plasmodium ... | 2023-08-31 | 2025-08-30 |
9 | NaN | Victor | None | Yman | PI | grant.13528667 | Integrated serological surveillance of malaria... | 2023-07-01 | 2026-06-30 |
4. Dataframe Methods for ‘Concepts’ queries¶
These methods can be used with all content types that support the extraction of concepts, i.e., publications
or grants
. See the official documentation for more details.
Extracting Concepts: as_dataframe_concepts
¶
The as_dataframe_concepts
method allows to quickly extract all concepts attached to a record, one row per concept, so to make it easier to do operations like counting or plotting the results.
NOTE: concepts are normalized noun phrases describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publications with ID ‘pub.1122072646’:
{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
{'concept': 'conversion', 'relevance': 0.055053872555463006},
{'concept': 'formic acid', 'relevance': 0.048144671935356},
{'concept': 'CO2', 'relevance': 0.032150964737607}
[........]
],
}
The as_dataframe_concepts
extracts all concepts data from JSON to a dataframe (ps this is functionally similarly to pandas’s explode method). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses, like frequency
and score_avg
.
[18]:
q = """search publications for "graphene"
where year=2019
return publications[id+title+year+concepts_scores] limit 100"""
concepts = dsl.query(q).as_dataframe_concepts()
print("Concepts found: ", len(concepts))
print("Unique: ", len(concepts.drop_duplicates("concept")))
concepts.head()
Returned Publications: 100 (total = 119693)
Time: 5.27s
Concepts found: 5053
Unique: 3213
[18]:
id | title | year | concepts_count | concept | score | frequency | score_avg | |
---|---|---|---|---|---|---|---|---|
0 | pub.1146069136 | Physical Characteristics of Cement Mortar Prep... | 2019 | 7 | characteristics | 0.070 | 10 | 0.30910 |
1 | pub.1146069136 | Physical Characteristics of Cement Mortar Prep... | 2019 | 7 | physical characteristics | 0.068 | 1 | 0.06800 |
2 | pub.1146069136 | Physical Characteristics of Cement Mortar Prep... | 2019 | 7 | oxide | 0.058 | 14 | 0.42843 |
3 | pub.1146069136 | Physical Characteristics of Cement Mortar Prep... | 2019 | 7 | Prepared | 0.046 | 1 | 0.04600 |
4 | pub.1146069136 | Physical Characteristics of Cement Mortar Prep... | 2019 | 7 | glass | 0.026 | 2 | 0.22250 |
[19]:
concepts.drop_duplicates("concept").sort_values("score_avg", ascending=False).head(10)[['concept', 'score_avg']]
[19]:
concept | score_avg | |
---|---|---|
1693 | mesoporous silica nanoparticles | 0.918 |
558 | two-dimensional transition metal dichalcogenides | 0.916 |
3216 | quantum state transfer | 0.906 |
559 | spin-orbit coupling | 0.901 |
4377 | MoP/SiO2 catalyst | 0.886 |
3699 | hydrogen evolution reaction | 0.884 |
3700 | oxygen reduction reaction | 0.879 |
2862 | silica nano particles | 0.871 |
560 | quantum spin Hall effect | 0.870 |
561 | Ising spin-orbit coupling | 0.868 |
Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores
or concepts
with Publications queries, but only concepts
with Grants queries.
If a relevancy score is not available, the as_dataframe_concepts
function will simulate one based on the concept rank (=its position in the document-concepts list).
For more information on how to make use of concepts
please see the Topic Modeling Analysis Tutorial and the Official documentation.
Conclusions¶
Moving Dimensions API results to pandas dataframes makes it easier to analyze the data and answer research questions.
Note: the examples above only scratch the surface of what can be done with pandas!
Tip: see also the Dimcli: Magic Commands notebook to find out what shortcuts are available for these dataframe methods.
Note
The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.