../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

The Dimcli Python library: Working with Pandas Dataframes

Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas dataframe objects.

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation softwares.

pandas is a popular software library written for the Python programming language for data manipulation and analysis.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:
!pip install dimcli --quiet

import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. General method to transform JSON query results into a dataframe

The DslDataset.as_dataframe method allows to quickly turn any query results into a dataframe.

[2]:
# we'll reuse this query later on
query = """search publications for "graphene"
            where year in [2013:2019]
            return publications[id+type+volume+year+issue+title+journal+authors] sort by times_cited limit 1000"""
res = dsl.query(query)
Returned Publications: 1000 (total = 484143)
Time: 4.79s
[3]:
df = res.as_dataframe()
df.head(10)
[3]:
authors id issue title type volume year journal.id journal.title
0 [{'affiliations': [{'city': 'New Brunswick', '... pub.1050119463 4 The chemistry of two-dimensional layered trans... article 5 2013 jour.1041224 Nature Chemistry
1 [{'affiliations': [{'city': 'Manchester', 'cit... pub.1024857999 7459 Van der Waals heterostructures article 499 2013 jour.1018957 Nature
2 [{'affiliations': [{'city': 'Shanghai', 'city_... pub.1032956475 5 Black phosphorus field-effect transistors article 9 2014 jour.1037429 Nature Nanotechnology
3 [{'affiliations': [{'city': 'Austin', 'city_id... pub.1019126274 4 The Li-Ion Rechargeable Battery: A Perspective article 135 2013 jour.1081898 Journal of the American Chemical Society
4 [{'affiliations': [{'city': 'Los Angeles', 'ci... pub.1004394295 6196 Interface engineering of highly efficient pero... article 345 2014 jour.1346339 Science
5 [{'affiliations': [{'city': 'Stanford', 'city_... pub.1007330696 6321 Combining theory and experiment in electrocata... article 355 2017 jour.1346339 Science
6 [{'affiliations': [{'city': 'Nanshan', 'city_i... pub.1055085164 21 Aggregation-Induced Emission: Together We Shin... article 115 2015 jour.1077147 Chemical Reviews
7 [{'affiliations': [], 'corresponding': '', 'cu... pub.1009826879 4 Phosphorene: An Unexplored 2D Semiconductor wi... article 8 2014 jour.1038917 ACS Nano
8 [{'affiliations': [{'city': 'Sydney', 'city_id... pub.1045181228 7 The emergence of perovskite solar cells article 8 2014 jour.1037430 Nature Photonics
9 [{'affiliations': [{'city': 'Cambridge', 'city... pub.1015305822 4 Raman spectroscopy as a versatile tool for stu... article 8 2013 jour.1037429 Nature Nanotechnology

Pandas dataframes offer a myriad of utilities for inspecting data. Check out the official docs or google a pandas tutorial to lean more about it.

[4]:
# the table shape
df.shape
[4]:
(1000, 9)
[5]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()
[5]:
2014    256
2013    236
2015    192
2016    151
2017     99
2018     50
2019     16
Name: year, dtype: int64
[6]:
# eg distribution of publication [type]
df['type'].value_counts()
[6]:
article      996
book           2
monograph      2
Name: type, dtype: int64

2. Dataframe Methods for ‘Publications’ queries

What follows are specialized versions of the as_dataframe method for results sets composed of publication records.

Extracting authors: as_dataframe_authors

Publications authors are usually returned by the Dimensions API inside a nested JSON object in the authors_affiliations sub-key.

Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This methods allows to quickly extract that data and return a dataframe with one row per author.

[7]:
authors = res.as_dataframe_authors()
authors.head()
[7]:
affiliations corresponding current_organization_id first_name last_name orcid raw_affiliation researcher_id pub_id
0 [{'city': 'New Brunswick', 'city_id': 5101717,... True grid.5335.0 Manish Chhowalla None [Materials Science and Engineering, Rutgers Un... ur.0633062306.03 pub.1050119463
1 [{'city': 'Ulsan', 'city_id': 1833747, 'countr... grid.42687.3f Hyeon Suk Shin None [Interdisciplinary School of Green Energy and ... ur.07617630407.83 pub.1050119463
2 [{'city': 'Singapore', 'city_id': 1880252, 'co... grid.4280.e Goki Eda None [Department of Physics, National University of... ur.01150450507.27 pub.1050119463
3 [{'city': 'Taipei', 'city_id': 1668341, 'count... grid.45672.32 Lain-Jong Li [0000-0002-4059-7783] [Institute of Atomic and Molecular Sciences, A... ur.01313340113.13 pub.1050119463
4 [{'city': 'Singapore', 'city_id': 1880252, 'co... grid.4280.e Kian Ping Loh [0000-0002-1491-743X] [Department of Chemistry, National University ... ur.0752174033.73 pub.1050119463

Using the authors dataframe, we can easily get the top ten values for current_organization_id.

[8]:
authors['current_organization_id'].value_counts()[:10]
[8]:
grid.168010.e    157
                 148
grid.59053.3a     96
grid.19006.3e     95
grid.59025.3b     92
grid.12527.33     82
grid.41156.37     70
grid.116068.8     69
grid.47840.3f     59
grid.166341.7     58
Name: current_organization_id, dtype: int64

Explanation: the most frequent organization turns out to be grid.168010.e. The first result is empty, meaning that for those authors Dimensions has no info about current_organization_id.

Extracting Affiliations: as_dataframe_authors_affiliations

As you can see from the results of the previous section, the affiliations of each author is yet another nested JSON object.

Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The as_dataframe_authors_affiliations method allows to quickly extract that affiliations data and return a dataframe with one row per affiliation.

This can be useful e.g. if one wants to count research organizations at the time of writing (as opposed to current_organization_id, which is the most recent organization of a researcher).

[9]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()
[9]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
0 New Brunswick 5101717.0 United States US grid.430387.b Rutgers, The State University of New Jersey Materials Science and Engineering, Rutgers Uni... New Jersey US-NJ pub.1050119463 ur.0633062306.03 Manish Chhowalla
1 Ulsan 1833747.0 South Korea KR grid.42687.3f Ulsan National Institute of Science and Techno... Interdisciplinary School of Green Energy and L... pub.1050119463 ur.07617630407.83 Hyeon Suk Shin
2 Singapore 1880252.0 Singapore SG grid.4280.e National University of Singapore Department of Physics, National University of ... pub.1050119463 ur.01150450507.27 Goki Eda
3 Singapore 1880252.0 Singapore SG grid.4280.e National University of Singapore Department of Chemistry, National University o... pub.1050119463 ur.01150450507.27 Goki Eda
4 Singapore 1880252.0 Singapore SG grid.4280.e National University of Singapore Graphene Research Centre, National University ... pub.1050119463 ur.01150450507.27 Goki Eda
[10]:
affiliations.describe(include="all")
[10]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
count 8240 8240.0 8240 8240 8240 8240 8240 8240 8240 8240 8240 8240 8240
unique 519 522.0 56 56 816 1161 3008 108 55 990 4506 3568 2287
top Beijing 1816670.0 United States US Stanford University pub.1019661721 Yi Wang
freq 604 604.0 2664 2664 944 280 263 5084 5278 108 172 70 331

Let’s get the top ten values for aff_id.

[11]:
affiliations['aff_id'].value_counts()[:10]
[11]:
                 944
grid.168010.e    280
grid.59025.3b    195
grid.166341.7    129
grid.19006.3e    121
grid.8217.c      112
grid.12527.33    106
grid.21729.3f     99
grid.5379.8       91
grid.116068.8     91
Name: aff_id, dtype: int64

Another example: we can now easily analyze the data by country too.

[12]:
affiliations['aff_country'].value_counts()[:10]
[12]:
United States     2664
China             2021
                   451
United Kingdom     395
Singapore          310
Germany            286
South Korea        282
Japan              212
Australia          170
Canada             167
Name: aff_country, dtype: int64

Explanation: the vast majority of authors in this dataset are from China, closely followed by the USA.

3. Dataframe Methods for ‘Grants’ queries

Extracting Funders: as_dataframe_funders

Grant funders authors are usually returned by the Dimensions API inside a nested JSON object in the funders sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per funder.

[13]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 12941)
Time: 1.52s
[14]:
res.as_dataframe_funders().head(10)
[14]:
acronym city_name country_name id latitude linkout longitude name types state_name grant_id grant_title grant_start_date grant_end_date
0 EC Brussels Belgium grid.270680.b 50.851650 [http://ec.europa.eu/index_en.htm] 4.363670 European Commission [Government] NaN grant.9661764 In vitro brain microvascular model to tackle f... 2022-09-01 2024-08-31
1 EDCTP The Hague Netherlands grid.453375.3 52.080574 [http://www.edctp.org/] 4.344426 European & Developing Countries Clinical Trial... [Nonprofit] NaN grant.9764814 Role of host immune responses in Plasmodium fa... 2022-04-01 2024-01-29
2 FRQS Montreal Canada grid.453037.6 45.506490 [http://www.frqs.gouv.qc.ca/en/] -73.572200 Fonds de Recherche du Québec - Santé [Government] Quebec grant.9721403 Elucidation of the mechanisms controlling prot... 2022-04-01 2025-03-31
3 BBSRC Swindon United Kingdom grid.418100.c 51.566742 [http://www.bbsrc.ac.uk/] -1.784771 Biotechnology and Biological Sciences Research... [Government] Wiltshire grant.9779816 BBSRC-NSF/BIO - Host immune suppression as a k... 2022-03-01 2025-02-28
4 ANR Paris France grid.22058.3d 48.845260 [http://www.agence-nationale-recherche.fr/en/p... 2.377769 National Agency for Research [Government] NaN grant.9947992 Ivermectin Fluorescent drug to decipher molecu... 2022-01-20 2025-07-19
5 ANR Paris France grid.22058.3d 48.845260 [http://www.agence-nationale-recherche.fr/en/p... 2.377769 National Agency for Research [Government] NaN grant.9947714 Setting boundaries: Cohesin and mutually exclu... 2022-01-20 2025-01-19
6 ANR Paris France grid.22058.3d 48.845260 [http://www.agence-nationale-recherche.fr/en/p... 2.377769 National Agency for Research [Government] NaN grant.9947606 The epigenetic role of MORC in apicomplexan li... 2022-01-20 2026-01-19
7 FCT Lisbon Portugal grid.22919.31 38.709667 [http://www.fct.pt/] -9.153090 Foundation for Science and Technology [Nonprofit] NaN grant.9951207 Plasmodium exploitation of host immune complem... 2022-01-03 2023-07-02
8 SNF Bern Switzerland grid.425888.b 46.949230 [http://www.snf.ch/en] 7.432395 Swiss National Science Foundation [Government] NaN grant.9959780 Structural basis of erythrocyte invasion by Pl... 2022-01-01 2023-12-31
9 NHMRC Canberra Australia grid.431143.0 -35.281240 [https://www.nhmrc.gov.au/] 149.123730 National Health and Medical Research Council [Government] Australian Capital Territory grant.9707893 Intermittent preventive treatment in pregnancy... 2022-01-01 2025-12-31

Extracting investigators: as_dataframe_investigators

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the investigators sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per investigator.

NOTE: investigators are not returned by default in a grants query hence one must specify this in the query results

[17]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigators] limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 12941)
Time: 2.29s
[18]:
res.as_dataframe_investigators().head(10)
[18]:
grant_id grant_title grant_start_date grant_end_date

4. Dataframe Methods for ‘Concepts’ queries

These methods can be used with all content types that support the extraction of concepts, i.e., publications or grants. See the official documentation for more details.

Extracting Concepts: as_dataframe_concepts

The as_dataframe_concepts method allows to quickly extract all concepts attached to a record, one row per concept, so to make it easier to do operations like counting or plotting the results.

NOTE: concepts are normalized noun phrases describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publications with ID ‘pub.1122072646’:

{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
                    {'concept': 'conversion', 'relevance': 0.055053872555463006},
                    {'concept': 'formic acid', 'relevance': 0.048144671935356},
                    {'concept': 'CO2', 'relevance': 0.032150964737607}
                    [........]
                    ],
 }

The as_dataframe_concepts extracts all concepts data from JSON to a dataframe (ps this is functionally similarly to pandas’s explode method). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses, like frequency and score_avg.

[19]:
q = """search publications for "graphene"
            where year=2019
       return publications[id+title+year+concepts_scores] limit 100"""

concepts = dsl.query(q).as_dataframe_concepts()

print("Concepts found: ", len(concepts))
print("Unique: ", len(concepts.drop_duplicates("concept")))

concepts.head()
Returned Publications: 100 (total = 116623)
Time: 1.15s
Concepts found:  5346
Unique:  3339
[19]:
id title year concepts_count concept score frequency score_avg
0 pub.1129328198 Study of mechanochemistry of carbon nanotube u... 2019 6 study 0.088 31 0.21619
1 pub.1129328198 Study of mechanochemistry of carbon nanotube u... 2019 6 principles 0.047 2 0.13200
2 pub.1129328198 Study of mechanochemistry of carbon nanotube u... 2019 6 mechanochemistry 0.015 1 0.01500
3 pub.1129328198 Study of mechanochemistry of carbon nanotube u... 2019 6 first principles 0.012 1 0.01200
4 pub.1129328198 Study of mechanochemistry of carbon nanotube u... 2019 6 carbon nanotubes 0.010 6 0.38033
[20]:
concepts.drop_duplicates("concept").sort_values("score_avg", ascending=False).head(10)[['concept', 'score_avg']]
[20]:
concept score_avg
1660 mesoporous silica nanoparticles 0.919
550 two-dimensional transition metal dichalcogenides 0.918
551 spin-orbit coupling 0.903
4333 fiber Bragg grating 0.885
2793 oxygen reduction reaction 0.882
3465 MoP/SiO2 catalyst 0.878
2335 silica nano particles 0.874
552 quantum spin Hall effect 0.872
2794 oxygen evolution reaction 0.870
553 Ising spin-orbit coupling 0.869

Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores or concepts with Publications queries, but only concepts with Grants queries.

If a relevancy score is not available, the as_dataframe_concepts function will simulate one based on the concept rank (=its position in the document-concepts list).

For more information on how to make use of concepts please see the Topic Modeling Analysis Tutorial and the Official documentation.

Conclusions

Moving Dimensions API results to pandas dataframes makes it easier to analyze the data and answer research questions.

Note: the examples above only scratch the surface of what can be done with pandas!

Tip: see also the Dimcli: Magic Commands notebook to find out what shortcuts are available for these dataframe methods.



Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg