../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

The Dimcli Python library: Working with Pandas Dataframes

Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas dataframe objects.

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation softwares.

pandas is a popular software library written for the Python programming language for data manipulation and analysis.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli --quiet

import dimcli
from dimcli.shortcuts import *
import json
import sys
import pandas as pd
#

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7.4.2)
Connected to: https://app.dimensions.ai - DSL v1.27
Method: dsl.ini file

1. General method to transform JSON query results into a dataframe

The DslDataset.as_dataframe method allows to quickly turn any query results into a dataframe.

[9]:
# we'll reuse this query later on
query = """search publications for "graphene"
            where year in [2013:2019]
            return publications[id+type+volume+year+issue+title+journal+authors] sort by times_cited limit 1000"""
res = dsl.query(query)
Returned Publications: 1000 (total = 426278)
Time: 2.46s
[10]:
df = res.as_dataframe()
df.head(10)
[10]:
id year authors volume issue type title journal.id journal.title
0 pub.1050119463 2013 [{'first_name': 'Manish', 'last_name': 'Chhowa... 5 4 article The chemistry of two-dimensional layered trans... jour.1041224 Nature Chemistry
1 pub.1024857999 2013 [{'first_name': 'A. K.', 'last_name': 'Geim', ... 499 7459 article Van der Waals heterostructures jour.1018957 Nature
2 pub.1004394295 2014 [{'first_name': 'Huanping', 'last_name': 'Zhou... 345 6196 article Interface engineering of highly efficient pero... jour.1346339 Science
3 pub.1032956475 2014 [{'first_name': 'Likai', 'last_name': 'Li', 'c... 9 5 article Black phosphorus field-effect transistors jour.1037429 Nature Nanotechnology
4 pub.1019126274 2013 [{'first_name': 'John B.', 'last_name': 'Goode... 135 4 article The Li-ion rechargeable battery: a perspective. jour.1081898 Journal of the American Chemical Society
5 pub.1031762191 2013 NaN NaN NaN book Nanoenergy, Nanotechnology Applied for Energy ... NaN NaN
6 pub.1009826879 2014 [{'first_name': 'Han', 'last_name': 'Liu', 'co... 8 4 article Phosphorene: an unexplored 2D semiconductor wi... jour.1038917 ACS Nano
7 pub.1059158429 2016 [{'first_name': 'C.', 'last_name': 'Patrignani... 40 10 article Review of Particle Physics jour.1327822 Chinese Physics C
8 pub.1015305822 2013 [{'first_name': 'Andrea C.', 'last_name': 'Fer... 8 4 article Raman spectroscopy as a versatile tool for stu... jour.1037429 Nature Nanotechnology
9 pub.1045181228 2014 [{'first_name': 'Martin A.', 'last_name': 'Gre... 8 7 article The emergence of perovskite solar cells jour.1037430 Nature Photonics

Pandas dataframes offer a myriad of utilities for inspecting data. Check out the official docs or google a pandas tutorial to lean more about it.

[11]:
# the table shape
df.shape
[11]:
(1000, 9)
[12]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()
[12]:
2013    310
2014    294
2015    210
2016    111
2017     58
2018     15
2019      2
Name: year, dtype: int64
[13]:
# eg distribution of publication [type]
df['type'].value_counts()
[13]:
article    997
book         2
chapter      1
Name: type, dtype: int64

2. Dataframe Methods for ‘Publications’ queries

What follows are specialized versions of the as_dataframe method for results sets composed of publication records.

Extracting authors: as_dataframe_authors

Publications authors are usually returned by the Dimensions API inside a nested JSON object in the authors_affiliations sub-key.

Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This methods allows to quickly extract that data and return a dataframe with one row per author.

[14]:
authors = res.as_dataframe_authors()
authors.head()
[14]:
first_name last_name corresponding orcid current_organization_id researcher_id affiliations pub_id
0 Manish Chhowalla True grid.5335.0 ur.0633062306.03 [{'id': 'grid.430387.b', 'name': 'Rutgers, The... pub.1050119463
1 Hyeon Suk Shin grid.42687.3f ur.07617630407.83 [{'id': 'grid.42687.3f', 'name': 'Ulsan Nation... pub.1050119463
2 Goki Eda grid.4280.e ur.01150450507.27 [{'id': 'grid.4280.e', 'name': 'National Unive... pub.1050119463
3 Lain-Jong Li ['0000-0002-4059-7783'] grid.45672.32 ur.01313340113.13 [{'id': 'grid.28665.3f', 'name': 'Academia Sin... pub.1050119463
4 Kian Ping Loh ['0000-0002-1491-743X'] grid.4280.e ur.0752174033.73 [{'id': 'grid.4280.e', 'name': 'National Unive... pub.1050119463

Using the authors dataframe, we can easily get the top ten values for current_organization_id.

[15]:
authors['current_organization_id'].value_counts()[:10]
[15]:
                 202
grid.168010.e    160
grid.59025.3b    131
grid.5379.8       92
grid.116068.8     86
grid.59053.3a     81
grid.12527.33     76
grid.19006.3e     76
grid.5333.6       72
grid.13402.34     71
Name: current_organization_id, dtype: int64

Explanation: the most frequent organization turns out to be grid.168010.e. The first result is empty, meaning that for those authors Dimensions has no info about current_organization_id.

Extracting Affiliations: as_dataframe_authors_affiliations

As you can see from the results of the previous section, the affiliations of each author is yet another nested JSON object.

Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The as_dataframe_authors_affiliations method allows to quickly extract that affiliations data and return a dataframe with one row per affiliation.

This can be useful e.g. if one wants to count research organizations at the time of writing (as opposed to current_organization_id, which is the most recent organization of a researcher).

[16]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()
[16]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
0 grid.430387.b Rutgers, The State University of New Jersey New Brunswick 5.10172e+06 United States US New Jersey US-NJ pub.1050119463 ur.0633062306.03 Manish Chhowalla
1 grid.42687.3f Ulsan National Institute of Science and Techno... Ulsan 1.83375e+06 South Korea KR pub.1050119463 ur.07617630407.83 Hyeon Suk Shin
2 grid.4280.e National University of Singapore Singapore 1.88025e+06 Singapore SG pub.1050119463 ur.01150450507.27 Goki Eda
3 grid.4280.e National University of Singapore Singapore 1.88025e+06 Singapore SG pub.1050119463 ur.01150450507.27 Goki Eda
4 grid.4280.e National University of Singapore Singapore 1.88025e+06 Singapore SG pub.1050119463 ur.01150450507.27 Goki Eda
[17]:
affiliations.describe(include="all")
[17]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
count 7937 7937 7937 7937 7937 7937 7937 7937 7937 7937 7937 7937
unique 730 1119 449 451 52 52 53 53 985 4395 3512 2150
top Nanyang Technological University United States US pub.1019661721 Yi Wang
freq 1062 237 1062 1068 2196 2196 5433 5433 108 192 72 318

Let’s get the top ten values for aff_id.

[18]:
affiliations['aff_id'].value_counts()[:10]
[18]:
                 1062
grid.59025.3b     237
grid.168010.e     226
grid.19006.3e     135
grid.4280.e       116
grid.21729.3f     116
grid.8217.c       113
grid.5379.8       112
grid.116068.8     104
grid.12527.33      90
Name: aff_id, dtype: int64

Another example: we can now easily analyze the data by country too.

[19]:
affiliations['aff_country'].value_counts()[:10]
[19]:
United States     2196
China             1811
                  1062
Singapore          377
United Kingdom     344
South Korea        330
Germany            231
Australia          180
Japan              162
Switzerland        128
Name: aff_country, dtype: int64

Explanation: the vast majority of authors in this dataset are from China, closely followed by the USA.

3. Dataframe Methods for ‘Grants’ queries

Extracting Funders: as_dataframe_funders

Grant funders authors are usually returned by the Dimensions API inside a nested JSON object in the funders sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per funder.

[20]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 11605)
Time: 1.78s
[21]:
res.as_dataframe_funders().head(10)
[21]:
id name acronym latitude types linkout country_name longitude city_name state_name grant_id grant_title grant_start_date grant_end_date
0 grid.270680.b European Commission EC 50.851650 [Government] [http://ec.europa.eu/index_en.htm] Belgium 4.363670 Brussels NaN grant.9064941 The molecular mechanisms of sex determination ... 2021-09-01 2023-08-31
1 grid.270680.b European Commission EC 50.851650 [Government] [http://ec.europa.eu/index_en.htm] Belgium 4.363670 Brussels NaN grant.9065769 Surveilling Malaria through machine learning a... 2021-04-01 2023-03-31
2 grid.425888.b Swiss National Science Foundation SNF 46.949230 [Government] [http://www.snf.ch/en] Switzerland 7.432395 Bern NaN grant.8968483 Adressing concerns over gene drive based malar... 2020-11-01 2021-01-31
3 grid.25111.36 FWF Austrian Science Fund FWF 48.219303 [Nonprofit] [https://www.fwf.ac.at/en/] Austria 16.352383 Vienna NaN grant.8966928 Further insights into the pathogenesis of avia... 2020-10-01 2023-09-30
4 grid.270680.b European Commission EC 50.851650 [Government] [http://ec.europa.eu/index_en.htm] Belgium 4.363670 Brussels NaN grant.8964341 mAlaRIa Sex dEtermination 2020-10-01 2022-09-30
5 grid.457875.c Directorate for Mathematical & Physical Sciences NSF MPS 38.880566 [Government] [http://www.nsf.gov/dir/index.jsp?org=MPS] United States -77.110990 Arlington Virginia grant.9314067 Accurate and Interpretable Machine Learning fo... 2020-09-01 2023-08-31
6 grid.418100.c Biotechnology and Biological Sciences Research... BBSRC 51.566742 [Government] [http://www.bbsrc.ac.uk/] United Kingdom -1.784771 Swindon Wiltshire grant.9288877 Characterisation of a new family of zinc finge... 2020-08-31 2023-08-30
7 grid.248883.d Canadian Institutes of Health Research CIHR 45.381893 [Government] [http://www.cihr-irsc.gc.ca/e/193.html] Canada -75.745224 Ottawa Ontario grant.9213792 Mechanism of protein translocation across the ... 2020-08-01 2022-07-31
8 grid.419681.3 National Institute of Allergy and Infectious D... NIAID 39.066647 [Facility] [http://www.niaid.nih.gov/Pages/default.aspx] United States -77.111830 Bethesda Maryland grant.9292319 Benzoxaborole-based antimalarial drug discovery 2020-07-08 2022-06-30
9 grid.452896.4 European Research Council ERC 50.856167 [Government] [http://erc.europa.eu/] Belgium 4.359973 Brussels NaN grant.9244737 Quantifying the spread of P. falciparum malaria 2020-07-01 2025-06-30

Extracting investigators: as_dataframe_investigators

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the investigator_details sub-key.

This methods allows to quickly extract that data and return a dataframe with one row per investigator.

NOTE: investigator_details are not returned by default in a grants query hence one must specify this in the query results

[22]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigator_details] limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 11605)
Time: 2.04s
WARNINGS [2]
Field 'title_language' is deprecated in favor of language_title. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
Field 'project_num' is deprecated in favor of grant_number. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
[23]:
res.as_dataframe_investigators().head(10)
[23]:
id first_name middle_name last_name role affiliations grant_id grant_title grant_start_date grant_end_date
0 ur.016225703001.43 Marie-Valentine None Florin PI [{'id': 'grid.5333.6', 'name': 'EPF Lausanne',... grant.8968483 Adressing concerns over gene drive based malar... 2020-11-01 2021-01-31
1 ur.01207752573.80 Herbert WEISSENBÖCK PI [{'id': 'grid.6583.8', 'name': 'Veterinärmediz... grant.8966928 Further insights into the pathogenesis of avia... 2020-10-01 2023-09-30
2 None David None Benkeser PI [{'id': 'grid.189967.8', 'name': 'Emory Univer... grant.9314067 Accurate and Interpretable Machine Learning fo... 2020-09-01 2023-08-31
3 None Katarzyna Kinga Modrzynska PI [{'id': 'grid.8756.c', 'name': 'University of ... grant.9288877 Characterisation of a new family of zinc finge... 2020-08-31 2023-08-30
4 ur.0667564616.06 Ruijie Darius Teo PI NaN grant.9213792 Mechanism of protein translocation across the ... 2020-08-01 2022-07-31
5 ur.07631437760.18 Peter Dirk Tieleman Co-PI NaN grant.9213792 Mechanism of protein translocation across the ... 2020-08-01 2022-07-31
6 None MARISSA None AUBREY PI [{'id': None, 'name': 'BORAGEN INC', 'city': '... grant.9292319 Benzoxaborole-based antimalarial drug discovery 2020-07-08 2022-06-30
7 None Jan Teun None BOUSEMA PI [{'id': 'grid.5590.9', 'name': 'STICHTING KATH... grant.9244737 Quantifying the spread of P. falciparum malaria 2020-07-01 2025-06-30
8 ur.010462641253.51 José Pedro Gil PI [{'id': 'grid.4714.6', 'name': 'Karolinska ins... grant.9242920 Beyond pvmdr1 and pvcrt: Search for relevant d... 2020-07-01 2021-06-30
9 ur.07665260653.48 Jose Pedro Gil PI [{'id': 'grid.4714.6', 'name': 'Karolinska ins... grant.9242920 Beyond pvmdr1 and pvcrt: Search for relevant d... 2020-07-01 2021-06-30

4. Dataframe Methods for ‘Concepts’ queries

These methods can be used with all content types that support the extraction of concepts, i.e., publications or grants. See the official documentation for more details.

Extracting Concepts: as_dataframe_concepts

The as_dataframe_concepts method allows to quickly extract all concepts attached to a record, one row per concept, so to make it easier to do operations like counting or plotting the results.

NOTE: concepts are normalized noun phrases describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publications with ID ‘pub.1122072646’:

{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
                    {'concept': 'conversion', 'relevance': 0.055053872555463006},
                    {'concept': 'formic acid', 'relevance': 0.048144671935356},
                    {'concept': 'CO2', 'relevance': 0.032150964737607}
                    [........]
                    ],
 }

The as_dataframe_concepts extracts all concepts data from JSON to a dataframe (ps this is functionally similarly to pandas’s explode method). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses, like frequency and score_avg.

[24]:
q = """search publications for "graphene"
            where year=2019
       return publications[id+title+year+concepts_scores] limit 100"""

concepts = dsl.query(q).as_dataframe_concepts()

print("Concepts found: ", len(concepts))
print("Unique: ", len(concepts.drop_duplicates("concept")))

concepts.head()
Returned Publications: 100 (total = 101443)
Time: 1.23s
Concepts found:  4880
Unique:  3050
[24]:
title year id concepts_count concept score frequency score_avg
0 Smart Non-Woven Fiber Mats with Light-Induced ... 2019 pub.1123764889 63 non-woven fiber mats 0.78424 1 0.78424
1 Smart Non-Woven Fiber Mats with Light-Induced ... 2019 pub.1123764889 63 polymer matrix 0.72761 3 0.64380
2 Smart Non-Woven Fiber Mats with Light-Induced ... 2019 pub.1123764889 63 atom transfer radical polymerization 0.72668 1 0.72668
3 Smart Non-Woven Fiber Mats with Light-Induced ... 2019 pub.1123764889 63 transfer radical polymerization 0.70781 1 0.70781
4 Smart Non-Woven Fiber Mats with Light-Induced ... 2019 pub.1123764889 63 ray photoelectron spectroscopy 0.69869 4 0.65301
[25]:
concepts.drop_duplicates("concept").sort_values("score_avg", ascending=False).head(10)[['concept', 'score_avg']]
[25]:
concept score_avg
3111 graphene-based antenna 0.90147
2554 oxygen evolution reaction 0.87529
2555 oxygen reduction reaction 0.87487
2556 hydrogen evolution reaction 0.87120
2867 fluorinated carbon materials 0.86460
1477 MoP/SiO2 catalyst 0.86052
2384 high internal phase emulsions 0.84142
3112 dipole antenna 0.83857
4385 phenolic foam composites 0.83428
3866 hole transport layer 0.83396

Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores or concepts with Publications queries, but only concepts with Grants queries.

If a relevancy score is not available, the as_dataframe_concepts function will simulate one based on the concept rank (=its position in the document-concepts list).

For more information on how to make use of concepts please see the Topic Modeling Analysis Tutorial and the Official documentation.

Conclusions

Moving Dimensions API results to pandas dataframes makes it easier to analyze the data and answer research questions.

Note: the examples above only scratch the surface of what can be done with pandas!

Tip: see also the Dimcli: Magic Commands notebook to find out what shortcuts are available for these dataframe methods.



Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg