The Dimcli Python library: Working with Pandas Dataframes¶
Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas dataframe objects.
Dataframes are then easy to sort, analyse, export as CSV and use within visualisation softwares.
pandas is a popular software library written for the Python programming language for data manipulation and analysis.
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.
[1]:
!pip install dimcli --quiet
import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
import getpass
KEY = getpass.getpass(prompt='API Key: ')
dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
KEY = ""
dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.8.2)
Connected to: https://app.dimensions.ai - DSL v1.28
Method: dsl.ini file
1. General method to transform JSON query results into a dataframe¶
The DslDataset.as_dataframe
method allows to quickly turn any query results into a dataframe.
[9]:
# we'll reuse this query later on
query = """search publications for "graphene"
where year in [2013:2019]
return publications[id+type+volume+year+issue+title+journal+authors] sort by times_cited limit 1000"""
res = dsl.query(query)
Returned Publications: 1000 (total = 426278)
Time: 2.46s
[10]:
df = res.as_dataframe()
df.head(10)
[10]:
id | year | authors | volume | issue | type | title | journal.id | journal.title | |
---|---|---|---|---|---|---|---|---|---|
0 | pub.1050119463 | 2013 | [{'first_name': 'Manish', 'last_name': 'Chhowa... | 5 | 4 | article | The chemistry of two-dimensional layered trans... | jour.1041224 | Nature Chemistry |
1 | pub.1024857999 | 2013 | [{'first_name': 'A. K.', 'last_name': 'Geim', ... | 499 | 7459 | article | Van der Waals heterostructures | jour.1018957 | Nature |
2 | pub.1004394295 | 2014 | [{'first_name': 'Huanping', 'last_name': 'Zhou... | 345 | 6196 | article | Interface engineering of highly efficient pero... | jour.1346339 | Science |
3 | pub.1032956475 | 2014 | [{'first_name': 'Likai', 'last_name': 'Li', 'c... | 9 | 5 | article | Black phosphorus field-effect transistors | jour.1037429 | Nature Nanotechnology |
4 | pub.1019126274 | 2013 | [{'first_name': 'John B.', 'last_name': 'Goode... | 135 | 4 | article | The Li-ion rechargeable battery: a perspective. | jour.1081898 | Journal of the American Chemical Society |
5 | pub.1031762191 | 2013 | NaN | NaN | NaN | book | Nanoenergy, Nanotechnology Applied for Energy ... | NaN | NaN |
6 | pub.1009826879 | 2014 | [{'first_name': 'Han', 'last_name': 'Liu', 'co... | 8 | 4 | article | Phosphorene: an unexplored 2D semiconductor wi... | jour.1038917 | ACS Nano |
7 | pub.1059158429 | 2016 | [{'first_name': 'C.', 'last_name': 'Patrignani... | 40 | 10 | article | Review of Particle Physics | jour.1327822 | Chinese Physics C |
8 | pub.1015305822 | 2013 | [{'first_name': 'Andrea C.', 'last_name': 'Fer... | 8 | 4 | article | Raman spectroscopy as a versatile tool for stu... | jour.1037429 | Nature Nanotechnology |
9 | pub.1045181228 | 2014 | [{'first_name': 'Martin A.', 'last_name': 'Gre... | 8 | 7 | article | The emergence of perovskite solar cells | jour.1037430 | Nature Photonics |
Pandas dataframes offer a myriad of utilities for inspecting data. Check out the official docs or google a pandas tutorial to lean more about it.
[11]:
# the table shape
df.shape
[11]:
(1000, 9)
[12]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()
[12]:
2013 310
2014 294
2015 210
2016 111
2017 58
2018 15
2019 2
Name: year, dtype: int64
[13]:
# eg distribution of publication [type]
df['type'].value_counts()
[13]:
article 997
book 2
chapter 1
Name: type, dtype: int64
2. Dataframe Methods for ‘Publications’ queries¶
What follows are specialized versions of the as_dataframe
method for results sets composed of publication records.
Extracting authors: as_dataframe_authors
¶
Publications authors are usually returned by the Dimensions API inside a nested JSON object in the authors_affiliations
sub-key.
Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication
This methods allows to quickly extract that data and return a dataframe with one row per author.
[14]:
authors = res.as_dataframe_authors()
authors.head()
[14]:
first_name | last_name | corresponding | orcid | current_organization_id | researcher_id | affiliations | pub_id | |
---|---|---|---|---|---|---|---|---|
0 | Manish | Chhowalla | True | grid.5335.0 | ur.0633062306.03 | [{'id': 'grid.430387.b', 'name': 'Rutgers, The... | pub.1050119463 | |
1 | Hyeon Suk | Shin | grid.42687.3f | ur.07617630407.83 | [{'id': 'grid.42687.3f', 'name': 'Ulsan Nation... | pub.1050119463 | ||
2 | Goki | Eda | grid.4280.e | ur.01150450507.27 | [{'id': 'grid.4280.e', 'name': 'National Unive... | pub.1050119463 | ||
3 | Lain-Jong | Li | ['0000-0002-4059-7783'] | grid.45672.32 | ur.01313340113.13 | [{'id': 'grid.28665.3f', 'name': 'Academia Sin... | pub.1050119463 | |
4 | Kian Ping | Loh | ['0000-0002-1491-743X'] | grid.4280.e | ur.0752174033.73 | [{'id': 'grid.4280.e', 'name': 'National Unive... | pub.1050119463 |
Using the authors dataframe, we can easily get the top ten values for current_organization_id
.
[15]:
authors['current_organization_id'].value_counts()[:10]
[15]:
202
grid.168010.e 160
grid.59025.3b 131
grid.5379.8 92
grid.116068.8 86
grid.59053.3a 81
grid.12527.33 76
grid.19006.3e 76
grid.5333.6 72
grid.13402.34 71
Name: current_organization_id, dtype: int64
Explanation: the most frequent organization turns out to be grid.168010.e. The first result is empty, meaning that for those authors Dimensions has no info about
current_organization_id
.
Extracting Affiliations: as_dataframe_authors_affiliations
¶
As you can see from the results of the previous section, the affiliations
of each author is yet another nested JSON object.
Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication
The as_dataframe_authors_affiliations
method allows to quickly extract that affiliations data and return a dataframe with one row per affiliation.
This can be useful e.g. if one wants to count research organizations at the time of writing (as opposed to current_organization_id
, which is the most recent organization of a researcher).
[16]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()
[16]:
aff_id | aff_name | aff_city | aff_city_id | aff_country | aff_country_code | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | grid.430387.b | Rutgers, The State University of New Jersey | New Brunswick | 5.10172e+06 | United States | US | New Jersey | US-NJ | pub.1050119463 | ur.0633062306.03 | Manish | Chhowalla |
1 | grid.42687.3f | Ulsan National Institute of Science and Techno... | Ulsan | 1.83375e+06 | South Korea | KR | pub.1050119463 | ur.07617630407.83 | Hyeon Suk | Shin | ||
2 | grid.4280.e | National University of Singapore | Singapore | 1.88025e+06 | Singapore | SG | pub.1050119463 | ur.01150450507.27 | Goki | Eda | ||
3 | grid.4280.e | National University of Singapore | Singapore | 1.88025e+06 | Singapore | SG | pub.1050119463 | ur.01150450507.27 | Goki | Eda | ||
4 | grid.4280.e | National University of Singapore | Singapore | 1.88025e+06 | Singapore | SG | pub.1050119463 | ur.01150450507.27 | Goki | Eda |
[17]:
affiliations.describe(include="all")
[17]:
aff_id | aff_name | aff_city | aff_city_id | aff_country | aff_country_code | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7937 | 7937 | 7937 | 7937 | 7937 | 7937 | 7937 | 7937 | 7937 | 7937 | 7937 | 7937 |
unique | 730 | 1119 | 449 | 451 | 52 | 52 | 53 | 53 | 985 | 4395 | 3512 | 2150 |
top | Nanyang Technological University | United States | US | pub.1019661721 | Yi | Wang | ||||||
freq | 1062 | 237 | 1062 | 1068 | 2196 | 2196 | 5433 | 5433 | 108 | 192 | 72 | 318 |
Let’s get the top ten values for aff_id
.
[18]:
affiliations['aff_id'].value_counts()[:10]
[18]:
1062
grid.59025.3b 237
grid.168010.e 226
grid.19006.3e 135
grid.4280.e 116
grid.21729.3f 116
grid.8217.c 113
grid.5379.8 112
grid.116068.8 104
grid.12527.33 90
Name: aff_id, dtype: int64
Another example: we can now easily analyze the data by country too.
[19]:
affiliations['aff_country'].value_counts()[:10]
[19]:
United States 2196
China 1811
1062
Singapore 377
United Kingdom 344
South Korea 330
Germany 231
Australia 180
Japan 162
Switzerland 128
Name: aff_country, dtype: int64
Explanation: the vast majority of authors in this dataset are from China, closely followed by the USA.
3. Dataframe Methods for ‘Grants’ queries¶
Extracting Funders: as_dataframe_funders
¶
Grant funders authors are usually returned by the Dimensions API inside a nested JSON object in the funders
sub-key.
This methods allows to quickly extract that data and return a dataframe with one row per funder.
[20]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 11605)
Time: 1.78s
[21]:
res.as_dataframe_funders().head(10)
[21]:
id | name | acronym | latitude | types | linkout | country_name | longitude | city_name | state_name | grant_id | grant_title | grant_start_date | grant_end_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | grid.270680.b | European Commission | EC | 50.851650 | [Government] | [http://ec.europa.eu/index_en.htm] | Belgium | 4.363670 | Brussels | NaN | grant.9064941 | The molecular mechanisms of sex determination ... | 2021-09-01 | 2023-08-31 |
1 | grid.270680.b | European Commission | EC | 50.851650 | [Government] | [http://ec.europa.eu/index_en.htm] | Belgium | 4.363670 | Brussels | NaN | grant.9065769 | Surveilling Malaria through machine learning a... | 2021-04-01 | 2023-03-31 |
2 | grid.425888.b | Swiss National Science Foundation | SNF | 46.949230 | [Government] | [http://www.snf.ch/en] | Switzerland | 7.432395 | Bern | NaN | grant.8968483 | Adressing concerns over gene drive based malar... | 2020-11-01 | 2021-01-31 |
3 | grid.25111.36 | FWF Austrian Science Fund | FWF | 48.219303 | [Nonprofit] | [https://www.fwf.ac.at/en/] | Austria | 16.352383 | Vienna | NaN | grant.8966928 | Further insights into the pathogenesis of avia... | 2020-10-01 | 2023-09-30 |
4 | grid.270680.b | European Commission | EC | 50.851650 | [Government] | [http://ec.europa.eu/index_en.htm] | Belgium | 4.363670 | Brussels | NaN | grant.8964341 | mAlaRIa Sex dEtermination | 2020-10-01 | 2022-09-30 |
5 | grid.457875.c | Directorate for Mathematical & Physical Sciences | NSF MPS | 38.880566 | [Government] | [http://www.nsf.gov/dir/index.jsp?org=MPS] | United States | -77.110990 | Arlington | Virginia | grant.9314067 | Accurate and Interpretable Machine Learning fo... | 2020-09-01 | 2023-08-31 |
6 | grid.418100.c | Biotechnology and Biological Sciences Research... | BBSRC | 51.566742 | [Government] | [http://www.bbsrc.ac.uk/] | United Kingdom | -1.784771 | Swindon | Wiltshire | grant.9288877 | Characterisation of a new family of zinc finge... | 2020-08-31 | 2023-08-30 |
7 | grid.248883.d | Canadian Institutes of Health Research | CIHR | 45.381893 | [Government] | [http://www.cihr-irsc.gc.ca/e/193.html] | Canada | -75.745224 | Ottawa | Ontario | grant.9213792 | Mechanism of protein translocation across the ... | 2020-08-01 | 2022-07-31 |
8 | grid.419681.3 | National Institute of Allergy and Infectious D... | NIAID | 39.066647 | [Facility] | [http://www.niaid.nih.gov/Pages/default.aspx] | United States | -77.111830 | Bethesda | Maryland | grant.9292319 | Benzoxaborole-based antimalarial drug discovery | 2020-07-08 | 2022-06-30 |
9 | grid.452896.4 | European Research Council | ERC | 50.856167 | [Government] | [http://erc.europa.eu/] | Belgium | 4.359973 | Brussels | NaN | grant.9244737 | Quantifying the spread of P. falciparum malaria | 2020-07-01 | 2025-06-30 |
Extracting investigators: as_dataframe_investigators
¶
Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the investigator_details
sub-key.
This methods allows to quickly extract that data and return a dataframe with one row per investigator.
NOTE:
investigator_details
are not returned by default in a grants query hence one must specify this in the query results
[22]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigator_details] limit 1000"""
res = dsl.query(query)
Returned Grants: 1000 (total = 11605)
Time: 2.04s
WARNINGS [2]
Field 'title_language' is deprecated in favor of language_title. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
Field 'project_num' is deprecated in favor of grant_number. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
[23]:
res.as_dataframe_investigators().head(10)
[23]:
id | first_name | middle_name | last_name | role | affiliations | grant_id | grant_title | grant_start_date | grant_end_date | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ur.016225703001.43 | Marie-Valentine | None | Florin | PI | [{'id': 'grid.5333.6', 'name': 'EPF Lausanne',... | grant.8968483 | Adressing concerns over gene drive based malar... | 2020-11-01 | 2021-01-31 |
1 | ur.01207752573.80 | Herbert | WEISSENBÖCK | PI | [{'id': 'grid.6583.8', 'name': 'Veterinärmediz... | grant.8966928 | Further insights into the pathogenesis of avia... | 2020-10-01 | 2023-09-30 | |
2 | None | David | None | Benkeser | PI | [{'id': 'grid.189967.8', 'name': 'Emory Univer... | grant.9314067 | Accurate and Interpretable Machine Learning fo... | 2020-09-01 | 2023-08-31 |
3 | None | Katarzyna | Kinga | Modrzynska | PI | [{'id': 'grid.8756.c', 'name': 'University of ... | grant.9288877 | Characterisation of a new family of zinc finge... | 2020-08-31 | 2023-08-30 |
4 | ur.0667564616.06 | Ruijie | Darius | Teo | PI | NaN | grant.9213792 | Mechanism of protein translocation across the ... | 2020-08-01 | 2022-07-31 |
5 | ur.07631437760.18 | Peter | Dirk | Tieleman | Co-PI | NaN | grant.9213792 | Mechanism of protein translocation across the ... | 2020-08-01 | 2022-07-31 |
6 | None | MARISSA | None | AUBREY | PI | [{'id': None, 'name': 'BORAGEN INC', 'city': '... | grant.9292319 | Benzoxaborole-based antimalarial drug discovery | 2020-07-08 | 2022-06-30 |
7 | None | Jan Teun | None | BOUSEMA | PI | [{'id': 'grid.5590.9', 'name': 'STICHTING KATH... | grant.9244737 | Quantifying the spread of P. falciparum malaria | 2020-07-01 | 2025-06-30 |
8 | ur.010462641253.51 | José | Pedro | Gil | PI | [{'id': 'grid.4714.6', 'name': 'Karolinska ins... | grant.9242920 | Beyond pvmdr1 and pvcrt: Search for relevant d... | 2020-07-01 | 2021-06-30 |
9 | ur.07665260653.48 | Jose | Pedro | Gil | PI | [{'id': 'grid.4714.6', 'name': 'Karolinska ins... | grant.9242920 | Beyond pvmdr1 and pvcrt: Search for relevant d... | 2020-07-01 | 2021-06-30 |
4. Dataframe Methods for ‘Concepts’ queries¶
These methods can be used with all content types that support the extraction of concepts, i.e., publications
or grants
. See the official documentation for more details.
Extracting Concepts: as_dataframe_concepts
¶
The as_dataframe_concepts
method allows to quickly extract all concepts attached to a record, one row per concept, so to make it easier to do operations like counting or plotting the results.
NOTE: concepts are normalized noun phrases describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publications with ID ‘pub.1122072646’:
{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
{'concept': 'conversion', 'relevance': 0.055053872555463006},
{'concept': 'formic acid', 'relevance': 0.048144671935356},
{'concept': 'CO2', 'relevance': 0.032150964737607}
[........]
],
}
The as_dataframe_concepts
extracts all concepts data from JSON to a dataframe (ps this is functionally similarly to pandas’s explode method). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses, like frequency
and score_avg
.
[24]:
q = """search publications for "graphene"
where year=2019
return publications[id+title+year+concepts_scores] limit 100"""
concepts = dsl.query(q).as_dataframe_concepts()
print("Concepts found: ", len(concepts))
print("Unique: ", len(concepts.drop_duplicates("concept")))
concepts.head()
Returned Publications: 100 (total = 101443)
Time: 1.23s
Concepts found: 4880
Unique: 3050
[24]:
title | year | id | concepts_count | concept | score | frequency | score_avg | |
---|---|---|---|---|---|---|---|---|
0 | Smart Non-Woven Fiber Mats with Light-Induced ... | 2019 | pub.1123764889 | 63 | non-woven fiber mats | 0.78424 | 1 | 0.78424 |
1 | Smart Non-Woven Fiber Mats with Light-Induced ... | 2019 | pub.1123764889 | 63 | polymer matrix | 0.72761 | 3 | 0.64380 |
2 | Smart Non-Woven Fiber Mats with Light-Induced ... | 2019 | pub.1123764889 | 63 | atom transfer radical polymerization | 0.72668 | 1 | 0.72668 |
3 | Smart Non-Woven Fiber Mats with Light-Induced ... | 2019 | pub.1123764889 | 63 | transfer radical polymerization | 0.70781 | 1 | 0.70781 |
4 | Smart Non-Woven Fiber Mats with Light-Induced ... | 2019 | pub.1123764889 | 63 | ray photoelectron spectroscopy | 0.69869 | 4 | 0.65301 |
[25]:
concepts.drop_duplicates("concept").sort_values("score_avg", ascending=False).head(10)[['concept', 'score_avg']]
[25]:
concept | score_avg | |
---|---|---|
3111 | graphene-based antenna | 0.90147 |
2554 | oxygen evolution reaction | 0.87529 |
2555 | oxygen reduction reaction | 0.87487 |
2556 | hydrogen evolution reaction | 0.87120 |
2867 | fluorinated carbon materials | 0.86460 |
1477 | MoP/SiO2 catalyst | 0.86052 |
2384 | high internal phase emulsions | 0.84142 |
3112 | dipole antenna | 0.83857 |
4385 | phenolic foam composites | 0.83428 |
3866 | hole transport layer | 0.83396 |
Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores
or concepts
with Publications queries, but only concepts
with Grants queries.
If a relevancy score is not available, the as_dataframe_concepts
function will simulate one based on the concept rank (=its position in the document-concepts list).
For more information on how to make use of concepts
please see the Topic Modeling Analysis Tutorial and the Official documentation.
Conclusions¶
Moving Dimensions API results to pandas dataframes makes it easier to analyze the data and answer research questions.
Note: the examples above only scratch the surface of what can be done with pandas!
Tip: see also the Dimcli: Magic Commands notebook to find out what shortcuts are available for these dataframe methods.
Note
The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.