../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

The Dimcli Python library: Working with Pandas Dataframes

Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas dataframe objects.

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation softwares.

pandas is a popular software library written for the Python programming language for data manipulation and analysis.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Aug 10, 2023
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli --quiet

import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.7
Method: dsl.ini file

1. General method to transform JSON query results into a dataframe

The DslDataset.as_dataframe method allows to quickly turn any query results into a dataframe.

[3]:
# we'll reuse this query later on
query = """search publications for "graphene"
            where year in [2013:2019]
            return publications[id+type+volume+year+issue+title+journal+authors] sort by times_cited limit 1000"""
res = dsl.query(query)
Returned Publications: 1000 (total = 504501)
Time: 2.52s
WARNINGS [1]
Field current_organization_id of the authors field is deprecated and will be removed in the next major release.
[4]:
df = res.as_dataframe()
df.head(10)
[4]:
id title authors issue type volume year journal.id journal.title
0 pub.1024857999 Van der Waals heterostructures [{'affiliations': [{'city': 'Manchester', 'cit... 7459 article 499 2013 jour.1018957 Nature
1 pub.1050119463 The chemistry of two-dimensional layered trans... [{'affiliations': [{'city': 'New Brunswick', '... 4 article 5 2013 jour.1041224 Nature Chemistry
2 pub.1007330696 Combining theory and experiment in electrocata... [{'affiliations': [{'city': 'Stanford', 'city_... 6321 article 355 2017 jour.1346339 Science
3 pub.1019126274 The Li-Ion Rechargeable Battery: A Perspective [{'affiliations': [{'city': 'Austin', 'city_id... 4 article 135 2013 jour.1081898 Journal of the American Chemical Society
4 pub.1032956475 Black phosphorus field-effect transistors [{'affiliations': [{'city': 'Shanghai', 'city_... 5 article 9 2014 jour.1037429 Nature Nanotechnology
5 pub.1036410643 Review of Particle Physics [{'affiliations': [], 'corresponding': '', 'cu... 9 article 38 2014 jour.1327822 Chinese Physics C
6 pub.1055085164 Aggregation-Induced Emission: Together We Shin... [{'affiliations': [{'city': 'Shenzhen', 'city_... 21 article 115 2015 jour.1077147 Chemical Reviews
7 pub.1004394295 Interface engineering of highly efficient pero... [{'affiliations': [{'city': 'Los Angeles', 'ci... 6196 article 345 2014 jour.1346339 Science
8 pub.1045181228 The emergence of perovskite solar cells [{'affiliations': [{'city': 'Sydney', 'city_id... 7 article 8 2014 jour.1037430 Nature Photonics
9 pub.1015305822 Raman spectroscopy as a versatile tool for stu... [{'affiliations': [{'city': 'Cambridge', 'city... 4 article 8 2013 jour.1037429 Nature Nanotechnology

Pandas dataframes offer a myriad of utilities for inspecting data. Check out the official docs or google a pandas tutorial to lean more about it.

[5]:
# the table shape
df.shape
[5]:
(1000, 9)
[6]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()
[6]:
2014    222
2013    183
2015    173
2016    155
2017    125
2018     83
2019     59
Name: year, dtype: int64
[7]:
# eg distribution of publication [type]
df['type'].value_counts()
[7]:
article      995
monograph      4
book           1
Name: type, dtype: int64

2. Dataframe Methods for ‘Publications’ queries

What follows are specialized versions of the as_dataframe method for results sets composed of publication records.

Extracting authors: as_dataframe_authors

Publications authors are usually returned by the Dimensions API inside a nested JSON object in the authors_affiliations sub-key.

Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This methods allows to quickly extract that data and return a dataframe with one row per author.

[8]:
authors = res.as_dataframe_authors()
authors.head()
[8]:
affiliations corresponding current_organization_id first_name last_name orcid raw_affiliation researcher_id pub_id
0 [{'city': 'Manchester', 'city_id': 2643123, 'c... grid.5379.8 A. K. Geim [0000-0003-2861-8331] [School of Physics and Astronomy, University o... ur.0721730631.45 pub.1024857999
1 [{'city': 'Manchester', 'city_id': 2643123, 'c... True grid.5379.8 I. V. Grigorieva [0000-0001-5991-7778] [School of Physics and Astronomy, University o... ur.0767105504.29 pub.1024857999
2 [{'city': 'New Brunswick', 'city_id': 5101717,... True grid.5335.0 Manish Chhowalla None [Materials Science and Engineering, Rutgers Un... ur.0633062306.03 pub.1050119463
3 [{'city': 'Ulsan', 'city_id': 1833747, 'countr... grid.42687.3f Hyeon Suk Shin None [Interdisciplinary School of Green Energy and ... ur.07617630407.83 pub.1050119463
4 [{'city': 'Singapore', 'city_id': 1880252, 'co... grid.4280.e Goki Eda None [Department of Physics, National University of... ur.01150450507.27 pub.1050119463

Using the authors dataframe, we can easily get the top ten values for current_organization_id.

[9]:
authors['current_organization_id'].value_counts()[:10]
[9]:
grid.168010.e    148
                 133
grid.59025.3b     83
grid.12527.33     81
grid.59053.3a     78
grid.19006.3e     78
grid.166341.7     65
grid.41156.37     64
grid.89336.37     63
grid.116068.8     60
Name: current_organization_id, dtype: int64

Explanation: the most frequent organization turns out to be grid.168010.e. The first result is empty, meaning that for those authors Dimensions has no info about current_organization_id.

Extracting Affiliations: as_dataframe_authors_affiliations

As you can see from the results of the previous section, the affiliations of each author is yet another nested JSON object.

Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The as_dataframe_authors_affiliations method allows to quickly extract that affiliations data and return a dataframe with one row per affiliation.

This can be useful e.g. if one wants to count research organizations at the time of writing (as opposed to current_organization_id, which is the most recent organization of a researcher).

[10]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()
[10]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
0 Manchester 2643123.0 United Kingdom GB grid.5379.8 University of Manchester School of Physics and Astronomy, University of... pub.1024857999 ur.0721730631.45 A. K. Geim
1 Manchester 2643123.0 United Kingdom GB grid.5379.8 University of Manchester Centre for Mesoscience and Nanotechnology, Uni... pub.1024857999 ur.0721730631.45 A. K. Geim
2 Manchester 2643123.0 United Kingdom GB grid.5379.8 University of Manchester School of Physics and Astronomy, University of... pub.1024857999 ur.0767105504.29 I. V. Grigorieva
3 New Brunswick 5101717.0 United States US grid.430387.b Rutgers, The State University of New Jersey Materials Science and Engineering, Rutgers Uni... New Jersey US-NJ pub.1050119463 ur.0633062306.03 Manish Chhowalla
4 Ulsan 1833747.0 South Korea KR grid.42687.3f Ulsan National Institute of Science and Techno... Interdisciplinary School of Green Energy and L... pub.1050119463 ur.07617630407.83 Hyeon Suk Shin
[11]:
affiliations.describe(include="all")
[11]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
count 8353 8353 8353 8353 8353 8353 8353 8353 8353 8353 8353 8353 8353
unique 524 527 52 52 852 1132 3155 106 53 990 4619 3645 2406
top United States US Stanford University Department of Materials Science and Engineerin... pub.1019661721 Yi Wang
freq 555 555 2758 2758 830 300 48 5129 5333 108 161 70 308

Let’s get the top ten values for aff_id.

[12]:
affiliations['aff_id'].value_counts()[:10]
[12]:
                 830
grid.168010.e    300
grid.59025.3b    185
grid.166341.7    146
grid.8217.c      130
grid.19006.3e    119
grid.12527.33    105
grid.21729.3f     99
grid.5379.8       94
grid.116068.8     93
Name: aff_id, dtype: int64

Another example: we can now easily analyze the data by country too.

[13]:
affiliations['aff_country'].value_counts()[:10]
[13]:
United States     2758
China             2072
United Kingdom     377
                   375
Germany            311
Singapore          298
South Korea        231
Japan              225
Australia          187
Switzerland        140
Name: aff_country, dtype: int64

Explanation: the vast majority of authors in this dataset are from China, closely followed by the USA.

3. Dataframe Methods for ‘Grants’ queries

Extracting Funders: as_dataframe_funders

Grant funders authors are usually returned by the Dimensions API inside a nested JSON object in the funders sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per funder.

[26]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 14738)
Time: 1.13s
[25]:
res.as_dataframe_funders().head(10)
[25]:
grant_id grant_title grant_start_date grant_end_date

Extracting investigators: as_dataframe_investigators

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the investigators sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per investigator.

NOTE: investigators are not returned by default in a grants query hence one must specify this in the query results

[16]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigators] limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 14738)
Time: 1.74s
WARNINGS [1]
Field 'grant_number' is deprecated in favor of project_numbers. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
[17]:
res.as_dataframe_investigators().head(10)
[17]:
affiliations first_name id last_name middle_name role grant_id grant_title grant_start_date grant_end_date
0 [{'city': 'Melbourne', 'city_id': '2158177', '... Coralie None Boulet PI grant.13535694 Regulation of Plasmodium falciparum egress fro... 2024-02-01 2026-01-31
1 [{'city': 'São Paulo', 'city_id': '3448439', '... Célia ur.01311140517.22 da Silva Garcia Regina PI grant.9698278 Calcium signaling in Plasmodium falciparum gam... 2023-11-01 2023-12-15
2 [{'city': 'Allschwil 3', 'city_id': '2661604',... Penelope ur.01356421006.17 Vounatsou PI grant.13219531 Spatio-temporal modelling of climate variabili... 2023-10-01 2026-09-30
3 [{'city': 'Heidelberg', 'city_id': '2907911', ... Ina ur.01006224125.53 Danquah PI grant.13219531 Spatio-temporal modelling of climate variabili... 2023-10-01 2026-09-30
4 [{'city': None, 'city_id': '184745', 'country'... Stephen ur.01363557726.68 Munga Co-PI grant.13219531 Spatio-temporal modelling of climate variabili... 2023-10-01 2026-09-30
5 [{'city': None, 'city_id': '2357163', 'country... Ali ur.01057017117.60 Sié Co-PI grant.13219531 Spatio-temporal modelling of climate variabili... 2023-10-01 2026-09-30
6 [{'city': 'SAN ANTONIO', 'city_id': '4726206',... Zhuolin None Qu None PI grant.13535333 LEAPS-MPS: Development of Novel Multistage Mod... 2023-09-01 2025-08-31
7 [{'city': None, 'city_id': '2648579', 'country... Matthias ur.01315014330.55 Marti PI grant.13306682 Defining molecular determinants of Plasmodium ... 2023-08-31 2025-08-30
8 [{'city': None, 'city_id': '2648579', 'country... Barbara ur.013052533674.51 Stokes Hoyt PI grant.13306682 Defining molecular determinants of Plasmodium ... 2023-08-31 2025-08-30
9 NaN Victor None Yman PI grant.13528667 Integrated serological surveillance of malaria... 2023-07-01 2026-06-30

4. Dataframe Methods for ‘Concepts’ queries

These methods can be used with all content types that support the extraction of concepts, i.e., publications or grants. See the official documentation for more details.

Extracting Concepts: as_dataframe_concepts

The as_dataframe_concepts method allows to quickly extract all concepts attached to a record, one row per concept, so to make it easier to do operations like counting or plotting the results.

NOTE: concepts are normalized noun phrases describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publications with ID ‘pub.1122072646’:

{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
                    {'concept': 'conversion', 'relevance': 0.055053872555463006},
                    {'concept': 'formic acid', 'relevance': 0.048144671935356},
                    {'concept': 'CO2', 'relevance': 0.032150964737607}
                    [........]
                    ],
 }

The as_dataframe_concepts extracts all concepts data from JSON to a dataframe (ps this is functionally similarly to pandas’s explode method). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses, like frequency and score_avg.

[18]:
q = """search publications for "graphene"
            where year=2019
       return publications[id+title+year+concepts_scores] limit 100"""

concepts = dsl.query(q).as_dataframe_concepts()

print("Concepts found: ", len(concepts))
print("Unique: ", len(concepts.drop_duplicates("concept")))

concepts.head()
Returned Publications: 100 (total = 119693)
Time: 5.27s
Concepts found:  5053
Unique:  3213
[18]:
id title year concepts_count concept score frequency score_avg
0 pub.1146069136 Physical Characteristics of Cement Mortar Prep... 2019 7 characteristics 0.070 10 0.30910
1 pub.1146069136 Physical Characteristics of Cement Mortar Prep... 2019 7 physical characteristics 0.068 1 0.06800
2 pub.1146069136 Physical Characteristics of Cement Mortar Prep... 2019 7 oxide 0.058 14 0.42843
3 pub.1146069136 Physical Characteristics of Cement Mortar Prep... 2019 7 Prepared 0.046 1 0.04600
4 pub.1146069136 Physical Characteristics of Cement Mortar Prep... 2019 7 glass 0.026 2 0.22250
[19]:
concepts.drop_duplicates("concept").sort_values("score_avg", ascending=False).head(10)[['concept', 'score_avg']]
[19]:
concept score_avg
1693 mesoporous silica nanoparticles 0.918
558 two-dimensional transition metal dichalcogenides 0.916
3216 quantum state transfer 0.906
559 spin-orbit coupling 0.901
4377 MoP/SiO2 catalyst 0.886
3699 hydrogen evolution reaction 0.884
3700 oxygen reduction reaction 0.879
2862 silica nano particles 0.871
560 quantum spin Hall effect 0.870
561 Ising spin-orbit coupling 0.868

Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores or concepts with Publications queries, but only concepts with Grants queries.

If a relevancy score is not available, the as_dataframe_concepts function will simulate one based on the concept rank (=its position in the document-concepts list).

For more information on how to make use of concepts please see the Topic Modeling Analysis Tutorial and the Official documentation.

Conclusions

Moving Dimensions API results to pandas dataframes makes it easier to analyze the data and answer research questions.

Note: the examples above only scratch the surface of what can be done with pandas!

Tip: see also the Dimcli: Magic Commands notebook to find out what shortcuts are available for these dataframe methods.



Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg