The Dimcli Python library: Working with Pandas Dataframes¶

Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas dataframe objects.

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation softwares.

pandas is a popular software library written for the Python programming language for data manipulation and analysis.

[1]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Aug 10, 2023
==

Prerequisites¶

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:

!pip install dimcli --quiet

import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v1.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.7
Method: dsl.ini file

1. General method to transform JSON query results into a dataframe¶

The DslDataset.as_dataframe method allows to quickly turn any query results into a dataframe.

[3]:

# we'll reuse this query later on
query = """search publications for "graphene"
            where year in [2013:2019]
            return publications[id+type+volume+year+issue+title+journal+authors] sort by times_cited limit 1000"""
res = dsl.query(query)

Returned Publications: 1000 (total = 504501)
Time: 2.52s
WARNINGS [1]
Field current_organization_id of the authors field is deprecated and will be removed in the next major release.

[4]:

df = res.as_dataframe()
df.head(10)

[4]:

	id	title	authors	issue	type	volume	year	journal.id	journal.title
0	pub.1024857999	Van der Waals heterostructures	[{'affiliations': [{'city': 'Manchester', 'cit...	7459	article	499	2013	jour.1018957	Nature
1	pub.1050119463	The chemistry of two-dimensional layered trans...	[{'affiliations': [{'city': 'New Brunswick', '...	4	article	5	2013	jour.1041224	Nature Chemistry
2	pub.1007330696	Combining theory and experiment in electrocata...	[{'affiliations': [{'city': 'Stanford', 'city_...	6321	article	355	2017	jour.1346339	Science
3	pub.1019126274	The Li-Ion Rechargeable Battery: A Perspective	[{'affiliations': [{'city': 'Austin', 'city_id...	4	article	135	2013	jour.1081898	Journal of the American Chemical Society
4	pub.1032956475	Black phosphorus field-effect transistors	[{'affiliations': [{'city': 'Shanghai', 'city_...	5	article	9	2014	jour.1037429	Nature Nanotechnology
5	pub.1036410643	Review of Particle Physics	[{'affiliations': [], 'corresponding': '', 'cu...	9	article	38	2014	jour.1327822	Chinese Physics C
6	pub.1055085164	Aggregation-Induced Emission: Together We Shin...	[{'affiliations': [{'city': 'Shenzhen', 'city_...	21	article	115	2015	jour.1077147	Chemical Reviews
7	pub.1004394295	Interface engineering of highly efficient pero...	[{'affiliations': [{'city': 'Los Angeles', 'ci...	6196	article	345	2014	jour.1346339	Science
8	pub.1045181228	The emergence of perovskite solar cells	[{'affiliations': [{'city': 'Sydney', 'city_id...	7	article	8	2014	jour.1037430	Nature Photonics
9	pub.1015305822	Raman spectroscopy as a versatile tool for stu...	[{'affiliations': [{'city': 'Cambridge', 'city...	4	article	8	2013	jour.1037429	Nature Nanotechnology

Pandas dataframes offer a myriad of utilities for inspecting data. Check out the official docs or google a pandas tutorial to lean more about it.

[5]:

# the table shape
df.shape

[5]:

(1000, 9)

[6]:

# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()

[6]:

2014    222
2013    183
2015    173
2016    155
2017    125
2018     83
2019     59
Name: year, dtype: int64

[7]:

# eg distribution of publication [type]
df['type'].value_counts()

[7]:

article      995
monograph      4
book           1
Name: type, dtype: int64

2. Dataframe Methods for ‘Publications’ queries¶

What follows are specialized versions of the as_dataframe method for results sets composed of publication records.

Extracting authors: `as_dataframe_authors`¶

Publications authors are usually returned by the Dimensions API inside a nested JSON object in the authors_affiliations sub-key.

Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This methods allows to quickly extract that data and return a dataframe with one row per author.

[8]:

authors = res.as_dataframe_authors()
authors.head()

[8]:

	affiliations	corresponding	current_organization_id	first_name	last_name	orcid	raw_affiliation	researcher_id	pub_id
0	[{'city': 'Manchester', 'city_id': 2643123, 'c...		grid.5379.8	A. K.	Geim	[0000-0003-2861-8331]	[School of Physics and Astronomy, University o...	ur.0721730631.45	pub.1024857999
1	[{'city': 'Manchester', 'city_id': 2643123, 'c...	True	grid.5379.8	I. V.	Grigorieva	[0000-0001-5991-7778]	[School of Physics and Astronomy, University o...	ur.0767105504.29	pub.1024857999
2	[{'city': 'New Brunswick', 'city_id': 5101717,...	True	grid.5335.0	Manish	Chhowalla	None	[Materials Science and Engineering, Rutgers Un...	ur.0633062306.03	pub.1050119463
3	[{'city': 'Ulsan', 'city_id': 1833747, 'countr...		grid.42687.3f	Hyeon Suk	Shin	None	[Interdisciplinary School of Green Energy and ...	ur.07617630407.83	pub.1050119463
4	[{'city': 'Singapore', 'city_id': 1880252, 'co...		grid.4280.e	Goki	Eda	None	[Department of Physics, National University of...	ur.01150450507.27	pub.1050119463

Using the authors dataframe, we can easily get the top ten values for current_organization_id.

[9]:

authors['current_organization_id'].value_counts()[:10]

[9]:

grid.168010.e    148
                 133
grid.59025.3b     83
grid.12527.33     81
grid.59053.3a     78
grid.19006.3e     78
grid.166341.7     65
grid.41156.37     64
grid.89336.37     63
grid.116068.8     60
Name: current_organization_id, dtype: int64

Explanation: the most frequent organization turns out to be grid.168010.e. The first result is empty, meaning that for those authors Dimensions has no info about current_organization_id.

Extracting Affiliations: `as_dataframe_authors_affiliations`¶

As you can see from the results of the previous section, the affiliations of each author is yet another nested JSON object.

Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The as_dataframe_authors_affiliations method allows to quickly extract that affiliations data and return a dataframe with one row per affiliation.

This can be useful e.g. if one wants to count research organizations at the time of writing (as opposed to current_organization_id, which is the most recent organization of a researcher).

[10]:

affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()

[10]:

	aff_city	aff_city_id	aff_country	aff_country_code	aff_id	aff_name	aff_raw_affiliation	aff_state	aff_state_code	pub_id	researcher_id	first_name	last_name
0	Manchester	2643123.0	United Kingdom	GB	grid.5379.8	University of Manchester	School of Physics and Astronomy, University of...			pub.1024857999	ur.0721730631.45	A. K.	Geim
1	Manchester	2643123.0	United Kingdom	GB	grid.5379.8	University of Manchester	Centre for Mesoscience and Nanotechnology, Uni...			pub.1024857999	ur.0721730631.45	A. K.	Geim
2	Manchester	2643123.0	United Kingdom	GB	grid.5379.8	University of Manchester	School of Physics and Astronomy, University of...			pub.1024857999	ur.0767105504.29	I. V.	Grigorieva
3	New Brunswick	5101717.0	United States	US	grid.430387.b	Rutgers, The State University of New Jersey	Materials Science and Engineering, Rutgers Uni...	New Jersey	US-NJ	pub.1050119463	ur.0633062306.03	Manish	Chhowalla
4	Ulsan	1833747.0	South Korea	KR	grid.42687.3f	Ulsan National Institute of Science and Techno...	Interdisciplinary School of Green Energy and L...			pub.1050119463	ur.07617630407.83	Hyeon Suk	Shin

[11]:

affiliations.describe(include="all")

[11]:

	aff_city	aff_city_id	aff_country	aff_country_code	aff_id	aff_name	aff_raw_affiliation	aff_state	aff_state_code	pub_id	researcher_id	first_name	last_name
count	8353	8353	8353	8353	8353	8353	8353	8353	8353	8353	8353	8353	8353
unique	524	527	52	52	852	1132	3155	106	53	990	4619	3645	2406
top			United States	US		Stanford University	Department of Materials Science and Engineerin...			pub.1019661721		Yi	Wang
freq	555	555	2758	2758	830	300	48	5129	5333	108	161	70	308

Let’s get the top ten values for aff_id.

[12]:

affiliations['aff_id'].value_counts()[:10]

[12]:

                 830
grid.168010.e    300
grid.59025.3b    185
grid.166341.7    146
grid.8217.c      130
grid.19006.3e    119
grid.12527.33    105
grid.21729.3f     99
grid.5379.8       94
grid.116068.8     93
Name: aff_id, dtype: int64

Another example: we can now easily analyze the data by country too.

[13]:

affiliations['aff_country'].value_counts()[:10]

[13]:

United States     2758
China             2072
United Kingdom     377
                   375
Germany            311
Singapore          298
South Korea        231
Japan              225
Australia          187
Switzerland        140
Name: aff_country, dtype: int64

Explanation: the vast majority of authors in this dataset are from China, closely followed by the USA.

3. Dataframe Methods for ‘Grants’ queries¶

Extracting Funders: `as_dataframe_funders`¶

Grant funders authors are usually returned by the Dimensions API inside a nested JSON object in the funders sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per funder.

[26]:

# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 14738)
Time: 1.13s

[25]:

res.as_dataframe_funders().head(10)

[25]:

	grant_id	grant_title	grant_start_date	grant_end_date

Extracting investigators: `as_dataframe_investigators`¶

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the investigators sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per investigator.

NOTE: investigators are not returned by default in a grants query hence one must specify this in the query results

[16]:

# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigators] limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 14738)
Time: 1.74s
WARNINGS [1]
Field 'grant_number' is deprecated in favor of project_numbers. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details

[17]:

res.as_dataframe_investigators().head(10)

[17]:

	affiliations	first_name	id	last_name	middle_name	role	grant_id	grant_title	grant_start_date	grant_end_date
0	[{'city': 'Melbourne', 'city_id': '2158177', '...	Coralie	None	Boulet		PI	grant.13535694	Regulation of Plasmodium falciparum egress fro...	2024-02-01	2026-01-31
1	[{'city': 'São Paulo', 'city_id': '3448439', '...	Célia	ur.01311140517.22	da Silva Garcia	Regina	PI	grant.9698278	Calcium signaling in Plasmodium falciparum gam...	2023-11-01	2023-12-15
2	[{'city': 'Allschwil 3', 'city_id': '2661604',...	Penelope	ur.01356421006.17	Vounatsou		PI	grant.13219531	Spatio-temporal modelling of climate variabili...	2023-10-01	2026-09-30
3	[{'city': 'Heidelberg', 'city_id': '2907911', ...	Ina	ur.01006224125.53	Danquah		PI	grant.13219531	Spatio-temporal modelling of climate variabili...	2023-10-01	2026-09-30
4	[{'city': None, 'city_id': '184745', 'country'...	Stephen	ur.01363557726.68	Munga		Co-PI	grant.13219531	Spatio-temporal modelling of climate variabili...	2023-10-01	2026-09-30
5	[{'city': None, 'city_id': '2357163', 'country...	Ali	ur.01057017117.60	Sié		Co-PI	grant.13219531	Spatio-temporal modelling of climate variabili...	2023-10-01	2026-09-30
6	[{'city': 'SAN ANTONIO', 'city_id': '4726206',...	Zhuolin	None	Qu	None	PI	grant.13535333	LEAPS-MPS: Development of Novel Multistage Mod...	2023-09-01	2025-08-31
7	[{'city': None, 'city_id': '2648579', 'country...	Matthias	ur.01315014330.55	Marti		PI	grant.13306682	Defining molecular determinants of Plasmodium ...	2023-08-31	2025-08-30
8	[{'city': None, 'city_id': '2648579', 'country...	Barbara	ur.013052533674.51	Stokes	Hoyt	PI	grant.13306682	Defining molecular determinants of Plasmodium ...	2023-08-31	2025-08-30
9	NaN	Victor	None	Yman		PI	grant.13528667	Integrated serological surveillance of malaria...	2023-07-01	2026-06-30

4. Dataframe Methods for ‘Concepts’ queries¶

These methods can be used with all content types that support the extraction of concepts, i.e., publications or grants. See the official documentation for more details.

Extracting Concepts: `as_dataframe_concepts`¶

The as_dataframe_concepts method allows to quickly extract all concepts attached to a record, one row per concept, so to make it easier to do operations like counting or plotting the results.

NOTE: concepts are normalized noun phrases describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publications with ID ‘pub.1122072646’:

{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
                    {'concept': 'conversion', 'relevance': 0.055053872555463006},
                    {'concept': 'formic acid', 'relevance': 0.048144671935356},
                    {'concept': 'CO2', 'relevance': 0.032150964737607}
                    [........]
                    ],
 }

The as_dataframe_concepts extracts all concepts data from JSON to a dataframe (ps this is functionally similarly to pandas’s explode method). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses, like frequency and score_avg.

[18]:

q = """search publications for "graphene"
            where year=2019
       return publications[id+title+year+concepts_scores] limit 100"""

concepts = dsl.query(q).as_dataframe_concepts()

print("Concepts found: ", len(concepts))
print("Unique: ", len(concepts.drop_duplicates("concept")))

concepts.head()

Returned Publications: 100 (total = 119693)
Time: 5.27s
Concepts found:  5053
Unique:  3213

[18]:

	id	title	year	concepts_count	concept	score	frequency	score_avg
0	pub.1146069136	Physical Characteristics of Cement Mortar Prep...	2019	7	characteristics	0.070	10	0.30910
1	pub.1146069136	Physical Characteristics of Cement Mortar Prep...	2019	7	physical characteristics	0.068	1	0.06800
2	pub.1146069136	Physical Characteristics of Cement Mortar Prep...	2019	7	oxide	0.058	14	0.42843
3	pub.1146069136	Physical Characteristics of Cement Mortar Prep...	2019	7	Prepared	0.046	1	0.04600
4	pub.1146069136	Physical Characteristics of Cement Mortar Prep...	2019	7	glass	0.026	2	0.22250

[19]:

concepts.drop_duplicates("concept").sort_values("score_avg", ascending=False).head(10)[['concept', 'score_avg']]

[19]:

	concept	score_avg
1693	mesoporous silica nanoparticles	0.918
558	two-dimensional transition metal dichalcogenides	0.916
3216	quantum state transfer	0.906
559	spin-orbit coupling	0.901
4377	MoP/SiO2 catalyst	0.886
3699	hydrogen evolution reaction	0.884
3700	oxygen reduction reaction	0.879
2862	silica nano particles	0.871
560	quantum spin Hall effect	0.870
561	Ising spin-orbit coupling	0.868

Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores or concepts with Publications queries, but only concepts with Grants queries.

If a relevancy score is not available, the as_dataframe_concepts function will simulate one based on the concept rank (=its position in the document-concepts list).

For more information on how to make use of concepts please see the Topic Modeling Analysis Tutorial and the Official documentation.

Conclusions¶

Moving Dimensions API results to pandas dataframes makes it easier to analyze the data and answer research questions.

Note: the examples above only scratch the surface of what can be done with pandas!

Tip: see also the Dimcli: Magic Commands notebook to find out what shortcuts are available for these dataframe methods.

Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

The Dimcli Python library: Working with Pandas Dataframes¶

Prerequisites¶

1. General method to transform JSON query results into a dataframe¶

2. Dataframe Methods for ‘Publications’ queries¶

Extracting authors: as_dataframe_authors¶

Extracting Affiliations: as_dataframe_authors_affiliations¶

3. Dataframe Methods for ‘Grants’ queries¶

Extracting Funders: as_dataframe_funders¶

Extracting investigators: as_dataframe_investigators¶