
The Datasets API: Features Overview

This tutorial provides an overview of the Datasets data source available via the Dimensions Analytics API.

The topics covered in this notebook are:

  • Sample dataset queries: searching by keyword, retrieving associated grants and publications data, and aggregating results using facets.

  • A closer look at Datasets statistics: counting records per field and their yearly distribution.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *

import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. Sample Dataset Queries

For the following queries, we will restrict our search using the keyword ‘graphene’. You can of course change it to explore other topics.

[3]:
TOPIC = "graphene" #@param {type: "string"}

Searching datasets by keyword

We can easily discover datasets mentioning the keyword ‘graphene’ and sort them by most recent first.

[6]:
df = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return datasets[basics+license_name+license_url]
    sort by date_created limit 100
""").as_dataframe()
Returned Datasets: 100 (total = 1845)
Time: 0.74s
[7]:
df.head(3)
[7]:
authors id keywords license_name license_url title year journal.id journal.title
0 [{'name': 'Huaxu Zhou'}, {'name': 'Yao Ding'},... dataset.57983323 [electrochemical detection, silica nanochannel... CC BY 4.0 https://creativecommons.org/licenses/by/4.0/ DataSheet1_Silica Nanochannel Array Film Suppo... 2022 NaN NaN
1 [{'name': 'Francesca Zummo'}, {'name': 'Pietro... dataset.57980490 [hippocampal neurons, graphene, chemical funct... CC BY 4.0 https://creativecommons.org/licenses/by/4.0/ Table_1_Bidirectional Modulation of Neuronal C... 2022 NaN NaN
2 [{'name': 'Vipin Singh'}, {'name': 'Shanta Raj... dataset.57980625 [chromeno spirooxindoles, azomethine ylides, h... CC BY 4.0 https://creativecommons.org/licenses/by/4.0/ DataSheet1_Graphene Oxide Catalyzed Synthesis ... 2022 NaN NaN
[9]:
px.pie(df,
       names="license_name",
       title=f"Share of different license of the 100 most recent datasets about '{TOPIC}'")

Returning associated grants and publication data

Whenever information about the publication associated with a dataset is available, it can be retrieved via the associated_publication_id field. Similarly, links between datasets and grants are exposed via the associated_grant_ids field.

[10]:
dfPubsAndGrants = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
    where associated_grant_ids is not empty
    and  associated_publication_id is not empty
return datasets[basics+associated_publication_id+associated_grant_ids+category_for]
    sort by date_created desc limit 50
""").as_dataframe()
Returned Datasets: 50 (total = 426)
Time: 0.61s
[11]:
dfPubsAndGrants.head(3)
[11]:
associated_grant_ids associated_publication_id authors category_for id keywords title year journal.id journal.title
0 [grant.8128998, grant.9211019, grant.8122546, ... pub.1144550732 [{'name': 'Kai Sun'}, {'name': 'Chen Wang'}, {... [{'id': '2209', 'name': '09 Engineering'}, {'i... dataset.57932510 [synthetic structural control, sulfur redox ki... Ion-Selective\nCovalent Organic Framework Memb... 2022 jour.1041450 ACS Applied Materials & Interfaces
1 [grant.8158929, grant.8160387] pub.1143784077 [{'name': 'Yilan Li'}, {'name': 'Kaiguang Yang... [{'id': '2210', 'name': '10 Technology'}, {'id... dataset.57827484 [contain specific biomarkers, comparative prot... Surface Nanosieving Polyether Sulfone Particle... 2021 jour.1345331 Analytical Chemistry
2 [grant.8158929, grant.8160387] pub.1143784077 [{'name': 'Yilan Li'}, {'name': 'Kaiguang Yang... [{'id': '2210', 'name': '10 Technology'}, {'id... dataset.57827485 [contain specific biomarkers, comparative prot... Surface Nanosieving Polyether Sulfone Particle... 2021 jour.1345331 Analytical Chemistry
[12]:
fig = px.histogram(dfPubsAndGrants.sort_values('journal.title'),
                   x="journal.title",
                   barmode="group",
                   title=f"Overiew of journals linked to the most recent 100 datasets about '{TOPIC}'")
fig.show()
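
As an optional follow-up (not part of the original notebook), the publication IDs collected above can be resolved into full publication records with a second query against the publications source. This is a minimal sketch that assumes the dfPubsAndGrants dataframe from the previous cell and uses only standard DSL publication fields:

# Hypothetical follow-up cell: resolve the linked publication IDs into publication records.
pub_ids = dfPubsAndGrants['associated_publication_id'].dropna().unique().tolist()
pubs = dsl.query(f"""
search publications
    where id in {json.dumps(pub_ids)}
return publications[id+doi+title+year]
    limit 100
""").as_dataframe()
pubs.head()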

Aggregating results using facets

Dataset results can be grouped using facets. For example, we can see which funders, research organizations or researchers are most often related to our datasets (note: the ‘count’ column represents the number of dataset records in each group).

Top funders

[19]:
df = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return funders limit 100
""").as_dataframe()
Returned Funders: 100
Time: 0.61s
[20]:
df.head(3)
[20]:
acronym city_name count country_name id latitude linkout longitude name types state_name
0 NSFC Beijing 178 China grid.419696.5 40.005177 [http://www.nsfc.gov.cn/publish/portal1/] 116.339830 National Natural Science Foundation of China [Government] NaN
1 EC Brussels 73 Belgium grid.270680.b 50.851650 [http://ec.europa.eu/index_en.htm] 4.363670 European Commission [Government] NaN
2 EPSRC Swindon 66 United Kingdom grid.421091.f 51.567093 [https://www.epsrc.ac.uk/] -1.784602 Engineering and Physical Sciences Research Cou... [Government] England
[21]:
px.scatter(df,
           x="count", y="name",
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top funder referenced in datasets about '{TOPIC}', all time")

Top research organizations

Note: research organizations are linked to datasets via the datasets’ associated publication.

[22]:
df = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return research_orgs limit 10
""").as_dataframe()
Returned Research_orgs: 10
Time: 0.52s
[23]:
df.head()
[23]:
acronym city_name count country_name id latitude linkout longitude name types state_name
0 UM Kuala Lumpur 26 Malaysia grid.10347.31 3.120833 [https://www.um.edu.my/] 101.656390 University of Malaya [Education] NaN
1 UPM Seri Kembangan 16 Malaysia grid.11142.37 2.992025 [http://www.upm.edu.my/] 101.716240 Universiti Putra Malaysia [Education] NaN
2 FHI Berlin 16 Germany grid.418028.7 52.448765 [http://www.fhi-berlin.mpg.de/] 13.283713 Fritz Haber Institute of the Max Planck Society [Facility] NaN
3 KNU Kyiv 15 Ukraine grid.34555.32 50.441902 [http://www.univ.kiev.ua/en/] 30.511314 Taras Shevchenko National University of Kyiv [Education] NaN
4 NaN Cambridge 15 United Kingdom grid.5335.0 52.204453 [http://www.cam.ac.uk/] 0.114908 University of Cambridge [Education] NaN
[24]:
px.pie(df,
       names="country_name",
       title=f"Global distribution of research organisations linked to datasets about '{TOPIC}'")

Top contributors

Note: researchers are linked to datasets via the datasets’ associated publication.

[25]:
dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return researchers limit 10
""").as_dataframe()
Returned Researchers: 10
Time: 0.60s
[25]:
count first_name id last_name research_orgs orcid_id
0 15 Benjamin ur.01046566370.94 Frank [grid.6734.6, grid.418028.7, grid.419564.b] NaN
1 15 Robert ur.01333027111.96 Schlögl [grid.5379.8, grid.423905.9, grid.410726.6, gr... NaN
2 15 Oleksiy V ur.01347070166.03 Khavryuchenko [grid.34555.32, grid.415616.1, grid.418751.e, ... NaN
3 15 Klaus E ur.013720757463.40 Hermann [grid.4372.2, grid.5164.6, grid.14095.39, grid... [0000-0002-3861-3916]
4 15 Annette ur.0714064131.80 Trunschke [grid.418028.7, grid.440957.b, grid.4372.2, gr... NaN
5 13 Xinliang ur.0750742345.75 Feng [grid.5335.0, grid.419547.a, grid.4488.0, grid... [0000-0003-3885-2703]
6 12 Shahabaldin ur.010553550461.38 Rezania [grid.31501.36, grid.263333.4, grid.410877.d] [0000-0001-8943-3045]
7 12 Ali Akbar ur.011210700466.21 Mohammadi [grid.502998.f] NaN
8 12 Seyedeh Solmaz ur.01230367167.34 Talebi [grid.508797.4, grid.444858.1, grid.411705.6] NaN
9 12 Nahid Tavakkoli ur.012742050435.43 Nezhad [grid.411583.a] NaN
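
Facets also lend themselves to simple trend analysis. As an extra example (not part of the original notebook), faceting on year shows how many datasets about the topic were published each year:

# Hypothetical extra cell: yearly distribution of datasets matching the topic.
trend = dsl.query(f"""
search datasets
    in full_data for "{TOPIC}"
return year limit 50
""").as_dataframe()
trend = trend.rename(columns={'id': 'year'})  # facet results expose the year in the 'id' column
px.bar(trend.sort_values('year'),
       x="year", y="count",
       title=f"Datasets about '{TOPIC}' per publication year")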

2. A closer look at Datasets statistics

The Dimensions Search Language programmatically exposes metadata about supported sources and entities, together with their fields, facets, fieldsets, metrics and search fields. The %dsldocs magic command below lists this documentation for the datasets source.

[26]:
%dsldocs datasets
[26]:
sources field type description is_filter is_entity is_facet
0 datasets associated_grant_ids string The Dimensions IDs of the grants linked to the... True False False
1 datasets associated_publication publication_links Publication linked to the dataset (single value). True True True
2 datasets associated_publication_id string The Dimensions ID of the publication linked to... True False False
3 datasets authors json Ordered list of the dataset authors. ORCIDs ar... True False False
4 datasets category_bra categories `Broad Research Areas <https://dimensions.fres... True True True
5 datasets category_for categories `ANZSRC Fields of Research classification <htt... True True True
6 datasets category_hra categories `Health Research Areas <https://dimensions.fre... True True True
7 datasets category_hrcs_hc categories `HRCS - Health Categories <https://dimensions.... True True True
8 datasets category_hrcs_rac categories `HRCS – Research Activity Codes <https://dimen... True True True
9 datasets category_icrp_cso categories `ICRP Common Scientific Outline <https://dimen... True True True
10 datasets category_icrp_ct categories `ICRP Cancer Types <https://dimensions.freshde... True True True
11 datasets category_rcdc categories `Research, Condition, and Disease Categorizati... True True True
12 datasets category_sdg categories SDG - Sustainable Development Goals True True True
13 datasets date date The publication date of the dataset, eg "2018-... True False False
14 datasets date_created date The creation date of the dataset. True False False
15 datasets date_embargo date The embargo date of the dataset. True False False
16 datasets date_inserted date Date when the record was inserted into Dimensi... True False False
17 datasets date_modified date The last modification date of the dataset. True False False
18 datasets description string Description of the dataset. False False False
19 datasets dimensions_url string Link pointing to the Dimensions web application False False False
20 datasets doi string Dataset DOI. True False False
21 datasets figshare_url string Figshare URL for the dataset. False False False
22 datasets funder_countries countries The country linked to the organisation funding... True True True
23 datasets funders organizations The GRID organisations funding the dataset. True True True
24 datasets id string Dimensions dataset ID. True False False
25 datasets journal journals The journal a data set belongs to. True True True
26 datasets keywords string Keywords used to describe the dataset (from au... True False True
27 datasets language_desc string Dataset title language, as ISO 639-1 language ... True False True
28 datasets language_title string Dataset title language, as ISO 639-1 language ... True False True
29 datasets license_name string The dataset licence name, e.g. 'CC BY 4.0'. True False True
30 datasets license_url string The dataset licence URL, e.g. 'https://creativ... False False False
31 datasets repository_id string The ID of the repository of the dataset. True False True
32 datasets research_org_cities cities City of the organisations the publication auth... True True True
33 datasets research_org_countries countries Country of the organisations the publication a... True True True
34 datasets research_org_states states State of the organisations the publication aut... True True True
35 datasets research_orgs organizations GRID organisations linked to the publication a... True True True
36 datasets researchers researchers Dimensions researchers IDs associated to the d... True True True
37 datasets title string Title of the dataset. False False False
38 datasets year integer Year of publication of the dataset. True False True

The fields list shown above can be extracted via the following DSL query:

[27]:
data = dsl.query("""describe source datasets""")
fields = sorted([x for x in data.fields.keys()])
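
As a quick optional check (not in the original cell), printing the size of the list and a few entries confirms that it matches the documentation table above:

# Optional check: the extracted field names should match the %dsldocs table above.
print(f"{len(fields)} fields, e.g. {fields[:5]}")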

Counting records for each field

Using the fields list obtained above, it is possible to draw up some general statistics about the Datasets content type in Dimensions.

To do this, we use the is not empty operator to automatically generate queries of the form search datasets where {field_name} is not empty return datasets limit 1, and then use the total_count field of the JSON returned for our statistics.

[28]:
q_template = """search datasets where {} is not empty return datasets[id] limit 1"""

# seed results with the total number of dataset records (no filter)
total = dsl.query("""search datasets return datasets[id] limit 1""", verbose=False).count_total
stats = [
    {'filter_by': 'no filter (=all records)', 'results' : total}
]

for f in progress(fields):
    q = q_template.format(f)
    res = dsl.query(q, verbose=False)
    time.sleep(0.5)
    stats.append({'filter_by': f, 'results' : res.count_total})


df = pd.DataFrame(stats)
df.sort_values("results", inplace=True, ascending=False)
df
[28]:
filter_by results
0 no filter (=all records) 11231163
17 date_inserted 11231163
32 repository_id 11231163
25 id 11231163
22 figshare_url 11231163
21 doi 11231163
20 dimensions_url 11231163
38 title 11230951
29 language_title 11222040
28 language_desc 11222040
14 date 11217876
39 year 11217876
4 authors 10565064
19 description 8407088
6 category_for 7603144
31 license_url 6046461
18 date_modified 4579912
8 category_hrcs_hc 3155791
30 license_name 2458161
2 associated_publication 2217950
3 associated_publication_id 2217950
26 journal 2213954
37 researchers 2178513
36 research_orgs 2164742
35 research_org_states 2122878
34 research_org_countries 2122878
33 research_org_cities 2122807
24 funders 1488258
23 funder_countries 1488254
15 date_created 1398342
27 keywords 1396620
9 category_hrcs_rac 1055373
11 category_icrp_ct 962895
12 category_rcdc 917847
10 category_icrp_cso 873010
1 associated_grant_ids 854844
5 category_bra 487587
7 category_hra 301492
13 category_sdg 177423
16 date_embargo 8732
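
As a small extra step (not in the original notebook), the counts above can also be expressed as a percentage of all Dataset records, which makes the relative coverage of each field easier to compare:

# Optional: field coverage as a percentage of all dataset records (uses 'df' and 'total' from above).
df['pct_of_all'] = (df['results'] / total * 100).round(2)
df.head(10)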

Creating a bar chart

NOTE: a standalone version of this chart is also available online

[29]:
from datetime import date
today = date.today().strftime("%d/%m/%Y")

fig = px.bar(df,
             x=df['filter_by'],
             y=df['results'],
             title=f"No of Dataset records per API field (as of {today})")

plot(fig, filename = 'dataset-fields-overview.html', auto_open=False)
fig.show()

Counting the yearly distribution of field/records data

[30]:
#
# get how many dataset records have values for each field, for each year
#

q_template = """search datasets where {} is not empty return year limit 150"""

# seed with all records data (no filter)
seed = dsl.query("""search datasets return year limit 150""", verbose=False).as_dataframe()
seed['segment'] = "all records"

for f in progress(fields):
    q = q_template.format(f)
    res = dsl.query(q, verbose=False).as_dataframe()
    res['segment'] = f
    seed = pd.concat([seed, res], ignore_index=True)
    time.sleep(0.5)

seed = seed.rename(columns={'id' : 'year'})
seed = seed.astype({'year': 'int32'})

#
# fill in (normalize) missing years in order to build a line chart
#

yrange = [seed['year'].min(), seed['year'].max()]
# TIP yrange[1]+1 to make sure max value is included
all_years = [x for x in range(yrange[0], yrange[1]+1)]

def add_missing_years(field_name):
    global seed
    known_years = list(seed[seed["segment"] == field_name]['year'])
    l = []
    for x in all_years:
        if x not in known_years:
            l.append({'segment' : field_name , 'year' : x, 'count': 0 })
    if l:
        seed = pd.concat([seed, pd.DataFrame(l)], ignore_index=True)

all_field_names = seed['segment'].value_counts().index.tolist()
for field in all_field_names:
    add_missing_years(field)
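
Before plotting, a quick optional sanity check (not part of the original notebook) can confirm that every segment now has one data point per year in the range:

# Optional check: after gap-filling, each segment should cover every year in the range.
years_per_segment = seed.groupby('segment')['year'].nunique()
print(years_per_segment.eq(len(all_years)).all())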

Creating a line chart

NOTE: a standalone version of this chart is also available online

A few things to remember:

  • There are a lot of overlapping lines, as many fields appear frequently; hence it is useful to click on the legend entries in the right panel to hide/reveal specific segments.

  • We set a start year to avoid having a long tail of (very few) datasets published a long time ago.

[31]:
start_year = 1980

# need to sort otherwise the chart is messed up!
temp = seed.query(f"year >= {start_year}").sort_values(["segment", "year"])
#
fig = px.line(temp,
              x="year",
              y="count",
              color="segment",
               title=f"Dataset fields available, segmented by year {today})")

plot(fig, filename = 'dataset-fields-by-year-count.html', auto_open=False)
fig.show()

Where to find out more

Please have a look at the official documentation for more information on Datasets.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
