The Datasets API: Features Overview

This tutorial provides an overview of the Datasets data source available via the Dimensions Analytics API.

This notebook covers two topics: sample dataset queries, and a closer look at Datasets statistics.

Prerequisites

Please install the latest versions of the libraries used below (dimcli, plotly and tqdm) to run this notebook.

[1]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}

!pip install dimcli plotly tqdm -U --quiet

# load common libraries
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated

import time
import json
from tqdm.notebook import tqdm as progress

import plotly.express as px
from plotly.offline import plot

import dimcli
from dimcli.shortcuts import *

dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
DimCli v0.6.4.2 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)

1. Sample Dataset Queries

For the following queries, we restrict our search to the keyword ‘graphene’. You can change the keyword to explore other topics.

[8]:
TOPIC = "graphene" #@param {type: "string"}

Searching datasets by keyword

We can easily find datasets that mention the keyword graphene and sort them with the most recent first.

[9]:
df = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return datasets[basics+license]
    sort by date_created limit 100
""").as_dataframe()
Returned Datasets: 100 (total = 691)
[10]:
df.head(3)
[10]:
title id authors keywords year license.url license.name license.value journal.id journal.title
0 Data_Sheet_1_Three-Dimensional PrGO-Based Sand... 11911530 [{'name': 'Yangqiang Zhao', 'orcid': ''}, {'na... [molybdenum sulfide, graphene oxide, anode, co... 2020 https://creativecommons.org/licenses/by/4.0/ CC BY 4.0 1 NaN NaN
1 Data_Sheet_1_Tubular Graphene Nano-Scroll Coat... 11906064 [{'name': 'Minyuan Shi', 'orcid': ''}, {'name'... [silicon anodes, lithium-ion battery, three-di... 2020 https://creativecommons.org/licenses/by/4.0/ CC BY 4.0 1 NaN NaN
2 Molecular Insights into the Loading and Dynami... 11894361 [{'name': 'Mina Mahdavi', 'orcid': ''}, {'name... [PEGGO drug delivery systems, PEG chain length... 2020 https://creativecommons.org/licenses/by-nc/4.0/ CC BY-NC 4.0 10 NaN NaN
[11]:
px.pie(df, names="license.name",
       title=f"License share among the 100 most recent datasets about '{TOPIC}'")

Returning associated grants and publication data

Whenever information about the publication associated with a dataset is available, it can be retrieved via the associated_publication_id field. Similarly, links between datasets and grants are exposed via the associated_grant_ids field.

[12]:
dfPubsAndGrants = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
    where associated_grant_ids is not empty
    and associated_publication_id is not empty
return datasets[basics+associated_publication_id+associated_grant_ids+category_for]
    sort by date_created desc limit 50
""").as_dataframe()
Returned Datasets: 50 (total = 194)
[13]:
dfPubsAndGrants.head(3)
[13]:
associated_publication_id associated_grant_ids id authors keywords year category_for title journal.id journal.title
0 pub.1122695248 [grant.8188484] 10331408 [{'name': 'Yanyan Liu', 'orcid': ''}, {'name':... [nitrogen-doped carbon, hierarchical pores, bi... 2019 [{'id': '2203', 'name': '03 Chemical Sciences'... Table_1_Catalytically Active Carbon From Catta... jour.1049812 Frontiers in Chemistry
1 pub.1122330425 [grant.7443065] 10260083 [{'name': 'Gabriel R. Schleder', 'orcid': ''},... [stability evaluation, novel compounds, use ma... 2019 [{'id': '2203', 'name': '03 Chemical Sciences'... Exploring Two-Dimensional Materials Thermodyna... jour.1041450 ACS Applied Materials & Interfaces
2 pub.1121529253 [grant.5234741, grant.6805090, grant.6447261] 9996017 [{'name': 'Prince Ravat', 'orcid': ''}, {'name... [compound, sp 2 carbon atoms, SOMO, theory cal... 2019 [{'id': '2203', 'name': '03 Chemical Sciences'... Benzo[cd]triangulene: A Spin 1/2\nGraphene Fra... jour.1077200 The Journal of Organic Chemistry
[14]:
fig = px.histogram(dfPubsAndGrants.sort_values('journal.title'),
                   x="journal.title", barmode="group",
                   title=f"Overview of journals linked to the 50 most recent datasets about '{TOPIC}'")
fig.show()
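
Because associated_grant_ids holds a list, a single dataset can link to several grants. The following is a minimal sketch of how such records could be flattened into one row per dataset-grant pair. It uses plain dictionaries with IDs copied from the sample output above (typed as strings here, which is an assumption), and the helper name dataset_grant_pairs is hypothetical, not part of dimcli.

```python
# Records shaped like the rows returned above (IDs copied from the sample output).
records = [
    {"id": "10331408", "associated_publication_id": "pub.1122695248",
     "associated_grant_ids": ["grant.8188484"]},
    {"id": "9996017", "associated_publication_id": "pub.1121529253",
     "associated_grant_ids": ["grant.5234741", "grant.6805090", "grant.6447261"]},
]


def dataset_grant_pairs(rows):
    """Hypothetical helper: yield one (dataset_id, grant_id) pair per linked grant."""
    for row in rows:
        # A missing or empty field is treated as "no linked grants"
        # (an assumption of this sketch).
        for grant_id in row.get("associated_grant_ids") or []:
            yield row["id"], grant_id


pairs = list(dataset_grant_pairs(records))
for dataset_id, grant_id in pairs:
    print(dataset_id, grant_id)
```

When working with the DataFrame directly, pandas' DataFrame.explode applied to the associated_grant_ids column offers a similar reshaping.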

Aggregating results using facets

Dataset results can be grouped using facets. For example, we can see the top funders, research organizations or researchers related to our datasets (note: the ‘count’ column gives the number of dataset records in each group).
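
To make the meaning of ‘count’ concrete, here is a local analogue of a facet built with collections.Counter. The records and funder names are made up for illustration. The API computes facets on the server over the full result set, so this sketch only mirrors the idea and does not describe how the API works internally.

```python
from collections import Counter

# Made-up dataset records; each one lists the funders linked to it.
datasets = [
    {"id": "d1", "funders": ["Funder A", "Funder B"]},
    {"id": "d2", "funders": ["Funder A"]},
    {"id": "d3", "funders": []},
]

# Count how many dataset records reference each funder.
# set() guards against counting one dataset twice for the same funder.
facet = Counter(f for d in datasets for f in set(d["funders"]))

# Print the groups, largest first.
for name, count in facet.most_common():
    print(name, count)
```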

Top funders

[21]:
df = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return funders limit 100
""").as_dataframe()
Returned Funders: 100
[22]:
df.head(3)
[22]:
id count name linkout types acronym country_name city_name longitude latitude state_name
0 grid.419696.5 77 National Natural Science Foundation of China [http://www.nsfc.gov.cn/publish/portal1/] [Government] NSFC China Beijing 116.339830 40.005177 NaN
1 grid.270680.b 27 European Commission [http://ec.europa.eu/index_en.htm] [Government] EC Belgium Brussels 4.363670 50.851650 NaN
2 grid.452896.4 24 European Research Council [http://erc.europa.eu/] [Government] ERC Belgium Brussels 4.359973 50.856167 NaN
[23]:
px.scatter(df, x="count", y="name",
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top funders referenced in datasets about '{TOPIC}', all time")

Top research organizations

Note: research organizations are linked to datasets via the datasets’ associated publication.

[24]:
df = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return research_orgs limit 10
""").as_dataframe()
Returned Research_orgs: 10
[25]:
df.head()
[25]:
id count longitude latitude country_name acronym name city_name linkout types state_name
0 grid.418028.7 16 13.283713 52.448765 Germany FHI Fritz Haber Institute of the Max Planck Society Berlin [http://www.fhi-berlin.mpg.de/] [Facility] NaN
1 grid.34555.32 15 30.511314 50.441902 Ukraine KNU Taras Shevchenko National University of Kyiv Kyiv [http://www.univ.kiev.ua/en/] [Education] NaN
2 grid.10347.31 14 101.656390 3.120833 Malaysia UM University of Malaya Kuala Lumpur [https://www.um.edu.my/] [Education] NaN
3 grid.419547.a 11 8.229722 49.989445 Germany NaN Max Planck Institute for Polymer Research Mainz [http://www.mpip-mainz.mpg.de/home/en] [Facility] NaN
4 grid.454832.c 11 103.858444 36.051895 China LICP Lanzhou Institute of Chemical Physics Lanzhou [http://english.licp.cas.cn/] [Facility] NaN
[26]:
px.pie(df, names="country_name",
       title=f"Global distribution of research organisations linked to datasets about '{TOPIC}'")

Top contributors

Note: researchers are linked to datasets via the datasets’ associated publication.

[27]:
dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return researchers limit 10
""").as_dataframe()
Returned Researchers: 10
[27]:
id count research_orgs last_name first_name orcid_id
0 ur.01046566370.94 15 [grid.6734.6, grid.418028.7] Frank Benjamin NaN
1 ur.01333027111.96 15 [grid.5018.c, grid.4372.2, grid.6582.9, grid.1... Schlögl Robert NaN
2 ur.01347070166.03 15 [grid.415616.1, grid.440789.6, grid.418028.7, ... Khavryuchenko Oleksiy V NaN
3 ur.013720757463.40 15 [grid.481551.c, grid.424048.e, grid.4372.2, gr... Hermann Klaus E [0000-0002-3861-3916]
4 ur.0714064131.80 15 [grid.420264.6, grid.440957.b, grid.4372.2, gr... Trunschke Annette NaN
5 ur.01305567747.71 11 [grid.7450.6, grid.6582.9, grid.48166.3d, grid... Müllen Kläus NaN
6 ur.0750742345.75 11 [grid.418929.f, grid.266097.c, grid.4372.2, gr... Feng Xin Liang [0000-0003-3885-2703]
7 ur.010452216040.82 9 [grid.59025.3b, grid.33763.32, grid.410651.7, ... Liu Бинь [0000-0002-5836-2333]
8 ur.01046574666.06 9 [grid.410726.6, grid.32566.34, grid.7177.6, gr... Zhang Hong NaN
9 ur.01205145514.44 9 [grid.454798.3, grid.412638.a, grid.202665.5, ... Li Ning [0000-0003-1684-4454]

2. A closer look at Datasets statistics

The Dimensions Search Language exposes metadata programmatically, such as the supported sources and entities, along with their fields, facets, fieldsets, metrics and search fields.

[2]:
%dsldocs datasets
[2]:
sources field type description is_filter is_entity is_facet
0 datasets associated_grant_ids string Dimensions IDs of the grants associated to the... True False False
1 datasets associated_publication_id string The Dimensions ID of the publication linked to... True False False
2 datasets authors json Ordered list of the dataset authors. ORCIDs ar... True False False
3 datasets category_bra categories `Broad Research Areas <https: ... True True True
4 datasets category_for categories `ANZSRC Fields of Research classification <htt... True True True
5 datasets category_hra categories `Health Research Areas <https: ... True True True
6 datasets category_hrcs_hc categories `HRCS - Health Categories <https: ... True True True
7 datasets category_hrcs_rac categories `HRCS – Research Activity Codes <https: ... True True True
8 datasets category_icrp_cso categories `ICRP Common Scientific Outline <https: ... True True True
9 datasets category_icrp_ct categories `ICRP Cancer Types <https: ... True True True
10 datasets category_rcdc categories `Research, Condition, and Disease Categorizati... True True True
11 datasets date date The publication date of the dataset, eg "2018-... True False False
12 datasets date_created date The creation date of the dataset. True False False
13 datasets date_embargo date The embargo date of the dataset. True False False
14 datasets date_inserted date Date when the record was inserted into Dimensi... True False False
15 datasets date_modified date The last modification date of the dataset. True False False
16 datasets description string Description of the dataset. False False False
17 datasets doi string Dataset DOI. True False False
18 datasets figshare_url string Figshare URL for the dataset. False False False
19 datasets funder_countries countries The country linked to the organisation funding... True True True
20 datasets funders organizations The GRID organisations funding the dataset. True True True
21 datasets id string Dimensions dataset ID. True False False
22 datasets journal journals The journal a data set belongs to. True True True
23 datasets keywords string Keywords used to describe the dataset (from au... True False True
24 datasets language_desc string Dataset title language, as ISO 639-1 language ... True False True
25 datasets language_title string Dataset title language, as ISO 639-1 language ... True False True
26 datasets license json The dataset licence, as a structured JSON cont... True False False
27 datasets publication_ids string The Dimensions IDs of the publications the dat... True False False
28 datasets repository_id string The ID of the repository of the dataset. True False True
29 datasets research_org_cities cities City of the organisations the publication auth... True True True
30 datasets research_org_countries countries Country of the organisations the publication a... True True True
31 datasets research_org_states states State of the organisations the publication aut... True True True
32 datasets research_orgs organizations GRID organisations linked to the publication a... True True True
33 datasets researchers researchers Dimensions researchers IDs associated to the d... True True True
34 datasets title string Title of the dataset. False False False
35 datasets year integer Year of publication of the dataset. True False True

The fields list shown above can be extracted via the following DSL query:

[3]:
data = dsl.query("""describe source datasets""")
fields = sorted(data.fields.keys())
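
If only the fields usable as facets are needed, the same metadata can be filtered. The sketch below assumes that each entry of data.fields is a dictionary carrying the is_filter and is_facet flags shown in the table above. The small fields_meta dictionary is a stand-in for the real API response so that the snippet runs offline.

```python
# Stand-in for data.fields; flag values copied from the table above.
fields_meta = {
    "funders": {"type": "organizations", "is_filter": True, "is_facet": True},
    "title": {"type": "string", "is_filter": False, "is_facet": False},
    "year": {"type": "integer", "is_filter": True, "is_facet": True},
    "doi": {"type": "string", "is_filter": True, "is_facet": False},
}

# Keep only the field names flagged as facets, in alphabetical order.
facet_fields = sorted(
    name for name, meta in fields_meta.items() if meta.get("is_facet")
)
print(facet_fields)
```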

Counting records per field

Using the list of fields obtained above, we can draw up some general statistics about the Datasets content type in Dimensions.

To do this, we use the is not empty operator to automatically generate queries such as search datasets where {field_name} is not empty return datasets limit 1, and then use the total_count field in the JSON response for our statistics.

[4]:
q_template = """search datasets where {} is not empty return datasets[id] limit 1"""

# seed results with the total number of datasets
total = dsl.query("""search datasets return datasets[id] limit 1""", verbose=False).count_total
stats = [
    {'filter_by': 'no filter (=all records)', 'results' : total}
]

for f in progress(fields):
    q = q_template.format(f)
    res = dsl.query(q, verbose=False)
    time.sleep(0.5)
    stats.append({'filter_by': f, 'results' : res.count_total})


df = pd.DataFrame(stats)
df.sort_values("results", inplace=True, ascending=False)
df

[4]:
filter_by results
0 no filter (=all records) 1475172
12 date 1475172
35 title 1475172
27 license 1475172
22 id 1475172
19 figshare_url 1475172
15 date_inserted 1475172
13 date_created 1475172
36 year 1475172
3 authors 1475172
29 repository_id 1475169
17 description 1475160
26 language_title 1475148
25 language_desc 1475148
16 date_modified 1475115
18 doi 1474451
24 keywords 1473445
2 associated_publication_id 954043
23 journal 953892
34 researchers 936690
33 research_orgs 929191
31 research_org_countries 907209
30 research_org_cities 906714
5 category_for 698596
21 funders 574878
20 funder_countries 574712
32 research_org_states 364002
1 associated_grant_ids 356088
11 category_rcdc 335519
7 category_hrcs_hc 181856
6 category_hra 164675
9 category_icrp_cso 161734
4 category_bra 157629
8 category_hrcs_rac 120520
10 category_icrp_ct 65806
28 publication_ids 43655
14 date_embargo 7088
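
Raw counts are easier to compare as a share of all records. The sketch below uses three figures copied from the output above; the coverage name is introduced here for illustration only.

```python
# Counts copied from the output above.
total = 1475172
counts = {
    "associated_publication_id": 954043,
    "funders": 574878,
    "date_embargo": 7088,
}

# Percentage of all dataset records in which each field is populated.
coverage = {field: round(100 * n / total, 1) for field, n in counts.items()}

for field, pct in coverage.items():
    print(f"{field}: {pct}%")
```

With the DataFrame built above, the same ratio could be added as a column by dividing the results column by the total and multiplying by 100.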

Creating a bar chart

NOTE: a standalone version of this chart is also available online

[5]:
from datetime import date
today = date.today().strftime("%d/%m/%Y")

fig = px.bar(df, x='filter_by', y='results',
             title=f"Number of dataset records per API field (as of {today})")
plot(fig, filename = 'dataset-fields-overview.html', auto_open=False)
fig.show()