../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

The Dimcli Python library: Working with Pandas Dataframes

Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas dataframe objects.

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation softwares.

pandas is a popular software library written for the Python programming language for data manipulation and analysis.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli --quiet

import dimcli
from dimcli.shortcuts import *
import json
import sys
import pandas as pd
#

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7)
Connected to: https://app.dimensions.ai - DSL v1.25
Method: dsl.ini file

1. General method to transform JSON query results into a dataframe

The DslDataset.as_dataframe method allows to quickly turn any query results into a dataframe.

[2]:
# we'll reuse this query later on
query = """search publications for "graphene"
            where year in [2013:2019]
            return publications sort by times_cited limit 1000"""
res = dsl.query(query)
Returned Publications: 1000 (total = 425229)
[3]:
df = res.as_dataframe()
df.head(10)
[3]:
title author_affiliations volume issue pages type year id journal.id journal.title
0 The chemistry of two-dimensional layered trans... [[{'first_name': 'Manish', 'last_name': 'Chhow... 5 4 263-275 article 2013 pub.1050119463 jour.1041224 Nature Chemistry
1 Van der Waals heterostructures [[{'first_name': 'A. K.', 'last_name': 'Geim',... 499 7459 419-425 article 2013 pub.1024857999 jour.1018957 Nature
2 Interface engineering of highly efficient pero... [[{'first_name': 'Huanping', 'last_name': 'Zho... 345 6196 542-546 article 2014 pub.1004394295 jour.1346339 Science
3 Black phosphorus field-effect transistors [[{'first_name': 'Likai', 'last_name': 'Li', '... 9 5 372-377 article 2014 pub.1032956475 jour.1037429 Nature Nanotechnology
4 The Li-ion rechargeable battery: a perspective. [[{'first_name': 'John B.', 'last_name': 'Good... 135 4 1167-76 article 2013 pub.1019126274 jour.1081898 Journal of the American Chemical Society
5 Nanoenergy, Nanotechnology Applied for Energy ... NaN NaN NaN NaN book 2013 pub.1031762191 NaN NaN
6 Phosphorene: an unexplored 2D semiconductor wi... [[{'first_name': 'Han', 'last_name': 'Liu', 'i... 8 4 4033-41 article 2014 pub.1009826879 jour.1038917 ACS Nano
7 Raman spectroscopy as a versatile tool for stu... [[{'first_name': 'Andrea C.', 'last_name': 'Fe... 8 4 235-246 article 2013 pub.1015305822 jour.1037429 Nature Nanotechnology
8 Carbon Nanotubes: Present and Future Commercia... [[{'first_name': 'Michael F. L.', 'last_name':... 339 6119 535-539 article 2013 pub.1007405937 jour.1346339 Science
9 The emergence of perovskite solar cells [[{'first_name': 'Martin A.', 'last_name': 'Gr... 8 7 506-514 article 2014 pub.1045181228 jour.1037430 Nature Photonics

Pandas dataframes offer a myriad of utilities for inspecting data. Check out the official docs or google a pandas tutorial to lean more about it.

[4]:
# the table shape
df.shape
[4]:
(1000, 10)
[5]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()
[5]:
2013    324
2014    299
2015    210
2016    105
2017     50
2018     11
2019      1
Name: year, dtype: int64
[6]:
# eg distribution of publication [type]
df['type'].value_counts()
[6]:
article    997
book         2
chapter      1
Name: type, dtype: int64

2. Dataframe Methods for ‘Publications’ queries

What follows are specialized versions of the as_dataframe method for results sets composed of publication records.

Extracting authors: as_dataframe_authors

Publications authors are usually returned by the Dimensions API inside a nested JSON object in the authors_affiliations sub-key.

Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This methods allows to quickly extract that data and return a dataframe with one row per author.

[7]:
authors = res.as_dataframe_authors()
authors.head()
[7]:
first_name last_name initials corresponding orcid current_organization_id researcher_id affiliations pub_id
0 Manish Chhowalla True grid.5335.0 ur.0633062306.03 [{'id': 'grid.430387.b', 'name': 'Rutgers, The... pub.1050119463
1 Hyeon Suk Shin grid.42687.3f ur.07617630407.83 [{'id': 'grid.42687.3f', 'name': 'Ulsan Nation... pub.1050119463
2 Goki Eda grid.4280.e ur.01150450507.27 [{'id': 'grid.4280.e', 'name': 'National Unive... pub.1050119463
3 Lain-Jong Li ['0000-0002-4059-7783'] grid.45672.32 ur.01313340113.13 [{'id': 'grid.28665.3f', 'name': 'Academia Sin... pub.1050119463
4 Kian Ping Loh ['0000-0002-1491-743X'] grid.4280.e ur.0752174033.73 [{'id': 'grid.4280.e', 'name': 'National Unive... pub.1050119463

Using the authors dataframe, we can easily get the top ten values for current_organization_id.

[8]:
authors['current_organization_id'].value_counts()[:10]

Extracting Affiliations: as_dataframe_authors_affiliations

As you can see from the results of the previous section, the affiliations of each author is yet another nested JSON object.

Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The as_dataframe_authors_affiliations method allows to quickly extract that affiliations data and return a dataframe with one row per affiliation.

This can be useful e.g. if one wants to count research organizations at the time of writing (as opposed to current_organization_id, which is the most recent organization of a researcher).

[9]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()
[9]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
0 grid.430387.b Rutgers, The State University of New Jersey New Brunswick 5.10172e+06 United States US New Jersey US-NJ pub.1050119463 ur.0633062306.03 Manish Chhowalla
1 grid.42687.3f Ulsan National Institute of Science and Techno... Ulsan 1.83375e+06 South Korea KR pub.1050119463 ur.07617630407.83 Hyeon Suk Shin
2 grid.4280.e National University of Singapore Singapore 1.88025e+06 Singapore SG pub.1050119463 ur.01150450507.27 Goki Eda
3 grid.4280.e National University of Singapore Singapore 1.88025e+06 Singapore SG pub.1050119463 ur.01150450507.27 Goki Eda
4 grid.4280.e National University of Singapore Singapore 1.88025e+06 Singapore SG pub.1050119463 ur.01150450507.27 Goki Eda
[10]:
affiliations.describe(include="all")
[10]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
count 7810 7810 7810 7810 7810 7810 7810 7810 7810 7810 7810 7810
unique 727 1078 446 448 51 51 53 53 985 4363 3500 2097
top Nanyang Technological University United States US pub.1019661721 Yi Wang
freq 989 242 989 995 2162 2162 5350 5350 108 185 72 325

Let’s get the top ten values for aff_id.

[11]:
affiliations['aff_id'].value_counts()[:10]

Another example: we can now easily analyze the data by country too.

[12]:
affiliations['aff_country'].value_counts()[:10]

3. Dataframe Methods for ‘Grants’ queries

Extracting Funders: as_dataframe_funders

Grant funders authors are usually returned by the Dimensions API inside a nested JSON object in the funders sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per funder.

[13]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 11076)
[14]:
res.as_dataframe_funders().head(10)
[14]:
id linkout name country_name types longitude city_name latitude acronym state_name grant_id grant_title grant_start_date grant_end_date
0 grid.270680.b [http://ec.europa.eu/index_en.htm] European Commission Belgium [Government] 4.363670 Brussels 50.851650 EC NaN grant.8964341 mAlaRIa Sex dEtermination 2020-10-01 2022-09-30
1 grid.419681.3 [http://www.niaid.nih.gov/Pages/default.aspx] National Institute of Allergy and Infectious D... United States [Facility] -77.111830 Bethesda 39.066647 NIAID Maryland grant.9018944 Establishment of the New York University Vacci... 2020-04-10 2026-11-30
2 grid.419681.3 [http://www.niaid.nih.gov/Pages/default.aspx] National Institute of Allergy and Infectious D... United States [Facility] -77.111830 Bethesda 39.066647 NIAID Maryland grant.9020234 Antigen discovery for transmission-blocking va... 2020-04-06 2025-03-31
3 grid.419681.3 [http://www.niaid.nih.gov/Pages/default.aspx] National Institute of Allergy and Infectious D... United States [Facility] -77.111830 Bethesda 39.066647 NIAID Maryland grant.9020279 Repurposing kinase inhibitor chemotypes as ant... 2020-04-03 2025-03-31
4 grid.270680.b [http://ec.europa.eu/index_en.htm] European Commission Belgium [Government] 4.363670 Brussels 50.851650 EC NaN grant.8585457 Estimating the Prevalence of AntiMicrobial Res... 2020-04-01 2022-03-31
5 grid.420089.7 [http://www.nichd.nih.gov/Pages/index.aspx] National Institute of Child Health and Human D... United States [Facility] -77.100420 Bethesda 39.001095 NICHD Maryland grant.9018783 Physiological Functions of Female Reproductive... 2020-04-01 2025-03-31
6 grid.419681.3 [http://www.niaid.nih.gov/Pages/default.aspx] National Institute of Allergy and Infectious D... United States [Facility] -77.111830 Bethesda 39.066647 NIAID Maryland grant.9019363 Determining the mechanism of antibody-mediated... 2020-04-01 2022-03-31
7 grid.419681.3 [http://www.niaid.nih.gov/Pages/default.aspx] National Institute of Allergy and Infectious D... United States [Facility] -77.111830 Bethesda 39.066647 NIAID Maryland grant.9019371 Computational models of naturally acquired imm... 2020-04-01 2025-03-31
8 grid.280785.0 [http://www.nigms.nih.gov/Pages/default.aspx] National Institute of General Medical Sciences United States [Facility] -77.099380 Bethesda 38.997833 NIGMS Maryland grant.9018600 A Biomedical Mass Spectrometry Resource: Ongoi... 2020-04-01 2023-03-31
9 grid.425888.b [http://www.snf.ch/en] Swiss National Science Foundation Switzerland [Government] 7.432395 Bern 46.949230 SNF NaN grant.8968483 Adressing concerns over gene drive based malar... 2020-04-01 2020-06-30

Extracting investigators: as_dataframe_investigators

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the investigator_details sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per investigator.

NOTE: investigator_details are not returned by default in a grants query hence one must specify this in the query results

[15]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigator_details] limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 11076)
WARNINGS [2]
Field 'title_language' is deprecated in favor of language_title. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
Field 'project_num' is deprecated in favor of grant_number. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
[16]:
res.as_dataframe_investigators().head(10)
[16]:
affiliations first_name last_name role id middle_name grant_id grant_title grant_start_date grant_end_date
0 [{'state': 'NY', 'state_code': 'US-NY', 'city'... MARK JOSEPH MULLIGAN PI None None grant.9018944 Establishment of the New York University Vacci... 2020-04-10 2026-11-30
1 [{'state': '', 'state_code': None, 'city': 'SH... YAMING CAO PI None None grant.9020234 Antigen discovery for transmission-blocking va... 2020-04-06 2025-03-31
2 [{'state': '', 'state_code': None, 'city': 'RO... KELLY CHIBALE PI None None grant.9020279 Repurposing kinase inhibitor chemotypes as ant... 2020-04-03 2025-03-31
3 NaN WILLIAM ZUERCHER Co-PI None None grant.9020279 Repurposing kinase inhibitor chemotypes as ant... 2020-04-03 2025-03-31
4 NaN LORI FERRINS Co-PI None None grant.9020279 Repurposing kinase inhibitor chemotypes as ant... 2020-04-03 2025-03-31
5 [{'state': 'CT', 'state_code': 'US-CT', 'city'... JIANJUN SUN PI None None grant.9018783 Physiological Functions of Female Reproductive... 2020-04-01 2025-03-31
6 [{'state': 'MN', 'state_code': 'US-MN', 'city'... GEOFFREY T HART PI None None grant.9019363 Determining the mechanism of antibody-mediated... 2020-04-01 2022-03-31
7 [{'state': 'CA', 'state_code': 'US-CA', 'city'... BRYAN R GREENHOUSE PI None None grant.9019371 Computational models of naturally acquired imm... 2020-04-01 2025-03-31
8 NaN ATUL J. BUTTE Co-PI None None grant.9019371 Computational models of naturally acquired imm... 2020-04-01 2025-03-31
9 NaN PRASANNA JAGANNATHAN Co-PI None None grant.9019371 Computational models of naturally acquired imm... 2020-04-01 2025-03-31

4. Dataframe Methods for ‘Concepts’ queries

These methods can be used with all content types that support the extraction of concepts, i.e., publications or grants. See the official documentation for more details.

Extracting Concepts: as_dataframe_concepts

The as_dataframe_concepts method allows to quickly extract all concepts attached to a record, one row per concept, so to make it easier to do operations like counting or plotting the results.

NOTE: concepts are normalized noun phrases describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publications with ID ‘pub.1122072646’:

{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
                    {'concept': 'conversion', 'relevance': 0.055053872555463006},
                    {'concept': 'formic acid', 'relevance': 0.048144671935356},
                    {'concept': 'CO2', 'relevance': 0.032150964737607}
                    [........]
                    ],
 }

The as_dataframe_concepts extracts all concepts data from JSON to a dataframe (ps this is functionally similarly to pandas’s explode method). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses, like frequency and score_avg.

[9]:
q = """search publications for "graphene"
            where year=2019
       return publications[id+title+year+concepts_scores] limit 100"""

concepts = dsl.query(q).as_dataframe_concepts()

print("Concepts found: ", len(concepts))
print("Unique: ", len(concepts.drop_duplicates("concept")))

concepts.head()
Returned Publications: 100 (total = 100865)
Concepts found:  4925
Unique:  3071
[9]:
title id year concepts_count concept score frequency score_avg
0 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 non-woven fiber mats 0.78424 1 0.78424
1 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 polymer matrix 0.72761 3 0.64380
2 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 atom transfer radical polymerization 0.72668 1 0.72668
3 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 transfer radical polymerization 0.70781 1 0.70781
4 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 ray photoelectron spectroscopy 0.69869 3 0.64896
[14]:
concepts.drop_duplicates("concept").sort_values("score_avg", ascending=False).head(10)[['concept', 'score_avg']]
[14]:
concept score_avg
3027 graphene-based antenna 0.90147
2404 oxygen evolution reaction 0.87529
2405 oxygen reduction reaction 0.87487
2406 hydrogen evolution reaction 0.87120
2724 fluorinated carbon materials 0.86460
1327 MoP/SiO2 catalyst 0.86052
4711 POM water oxidation catalysts 0.85027
1392 high internal phase emulsions 0.84142
3028 dipole antenna 0.83857
4124 phenolic foam composites 0.83428

Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores or concepts with Publications queries, but only concepts with Grants queries.

If a relevancy score is not available, the as_dataframe_concepts function will simulate one based on the concept rank (=its position in the document-concepts list).

For more information on how to make use of concepts please see the Topic Modeling Analysis Tutorial and the Official documentation.

Conclusions

Moving Dimensions API results to pandas dataframes makes it easier to analyze the data and answer research questions.

Note: the examples above only scratch the surface of what can be done with pandas!

Tip: see also the Dimcli: Magic Commands notebook to find out what shortcuts are available for these dataframe methods.



Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg