
Searching Datasets using the Dimensions API: an Introduction

The purpose of this notebook is to demonstrate the basics of querying datasets with the Dimensions Analytics API.

Please also have a look at the official documentation for more information on Datasets. Note: a standalone version of the dataset charts generated in section 2 of this tutorial is also available online: dataset fields overview | distribution of dataset fields per year.

Prerequisites

Install the latest versions of these libraries to run this notebook.

[1]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}

!pip install dimcli plotly tqdm -U --quiet

# load common libraries
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated in recent pandas versions

import time
import json
from tqdm.notebook import tqdm as progress

import plotly.express as px
from plotly.offline import plot

import dimcli
from dimcli.shortcuts import *

dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
DimCli v0.6.4.2 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)

2. A general look at Datasets and their statistics

The Dimensions Search Language programmatically exposes metadata about the supported sources and entities, along with their fields, facets, fieldsets, metrics and search fields.

[2]:
%dsldocs datasets
[2]:
sources field type description is_filter is_entity is_facet
0 datasets associated_grant_ids string Dimensions IDs of the grants associated to the... True False False
1 datasets associated_publication_id string The Dimensions ID of the publication linked to... True False False
2 datasets authors json Ordered list of the dataset authors. ORCIDs ar... True False False
3 datasets category_bra categories `Broad Research Areas <https: ... True True True
4 datasets category_for categories `ANZSRC Fields of Research classification <htt... True True True
5 datasets category_hra categories `Health Research Areas <https: ... True True True
6 datasets category_hrcs_hc categories `HRCS - Health Categories <https: ... True True True
7 datasets category_hrcs_rac categories `HRCS – Research Activity Codes <https: ... True True True
8 datasets category_icrp_cso categories `ICRP Common Scientific Outline <https: ... True True True
9 datasets category_icrp_ct categories `ICRP Cancer Types <https: ... True True True
10 datasets category_rcdc categories `Research, Condition, and Disease Categorizati... True True True
11 datasets date date The publication date of the dataset, eg "2018-... True False False
12 datasets date_created date The creation date of the dataset. True False False
13 datasets date_embargo date The embargo date of the dataset. True False False
14 datasets date_inserted date Date when the record was inserted into Dimensi... True False False
15 datasets date_modified date The last modification date of the dataset. True False False
16 datasets description string Description of the dataset. False False False
17 datasets doi string Dataset DOI. True False False
18 datasets figshare_url string Figshare URL for the dataset. False False False
19 datasets funder_countries countries The country linked to the organisation funding... True True True
20 datasets funders organizations The GRID organisations funding the dataset. True True True
21 datasets id string Dimensions dataset ID. True False False
22 datasets journal journals The journal a data set belongs to. True True True
23 datasets keywords string Keywords used to describe the dataset (from au... True False True
24 datasets language_desc string Dataset title language, as ISO 639-1 language ... True False True
25 datasets language_title string Dataset title language, as ISO 639-1 language ... True False True
26 datasets license json The dataset licence, as a structured JSON cont... True False False
27 datasets publication_ids string The Dimensions IDs of the publications the dat... True False False
28 datasets repository_id string The ID of the repository of the dataset. True False True
29 datasets research_org_cities cities City of the organisations the publication auth... True True True
30 datasets research_org_countries countries Country of the organisations the publication a... True True True
31 datasets research_org_states states State of the organisations the publication aut... True True True
32 datasets research_orgs organizations GRID organisations linked to the publication a... True True True
33 datasets researchers researchers Dimensions researchers IDs associated to the d... True True True
34 datasets title string Title of the dataset. False False False
35 datasets year integer Year of publication of the dataset. True False True

The fields list shown above can be extracted via the following DSL query:

[3]:
data = dsl.query("""describe source datasets""")
fields = sorted([x for x in data.fields.keys()])
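As a quick sanity check, we can look at how many fields were extracted and preview a few of them. This is just a minimal illustrative sketch reusing the fields list built above:

# preview the extracted fields list
print(f"{len(fields)} fields available for the datasets source")
print(fields[:5])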

Counting records for each field

Using the fields list obtained above, it is possible to draw up some general statistics about the Datasets content type in Dimensions.

In order to do this, we use the is not empty operator to automatically generate queries of the form search datasets where {field_name} is not empty return datasets limit 1, and then use the total_count value in the JSON we get back for our statistics.
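For example, one such generated query and its record count could be inspected as follows. This is a minimal sketch (using the doi field purely as an example); the .json attribute of a dimcli result holds the raw API response, where the count sits under _stats, while .count_total is the shortcut used in the loop below:

# inspect a single generated query and its total record count
example_query = "search datasets where doi is not empty return datasets[id] limit 1"
res = dsl.query(example_query, verbose=False)
print(res.json["_stats"]["total_count"])  # count from the raw JSON response
print(res.count_total)                    # equivalent dimcli shortcut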

[4]:
q_template = """search datasets where {} is not empty return datasets[id] limit 1"""

# seed results with total number of datasets
total = dsl.query("""search datasets return datasets[id] limit 1""", verbose=False).count_total
stats = [
    {'filter_by': 'no filter (=all records)', 'results' : total}
]

for f in progress(fields):
    q = q_template.format(f)
    res = dsl.query(q, verbose=False)
    time.sleep(0.5)
    stats.append({'filter_by': f, 'results' : res.count_total})


df = pd.DataFrame.from_dict(stats)
df.sort_values("results", inplace=True, ascending=False)
df

[4]:
filter_by results
0 no filter (=all records) 1475172
12 date 1475172
35 title 1475172
27 license 1475172
22 id 1475172
19 figshare_url 1475172
15 date_inserted 1475172
13 date_created 1475172
36 year 1475172
3 authors 1475172
29 repository_id 1475169
17 description 1475160
26 language_title 1475148
25 language_desc 1475148
16 date_modified 1475115
18 doi 1474451
24 keywords 1473445
2 associated_publication_id 954043
23 journal 953892
34 researchers 936690
33 research_orgs 929191
31 research_org_countries 907209
30 research_org_cities 906714
5 category_for 698596
21 funders 574878
20 funder_countries 574712
32 research_org_states 364002
1 associated_grant_ids 356088
11 category_rcdc 335519
7 category_hrcs_hc 181856
6 category_hra 164675
9 category_icrp_cso 161734
4 category_bra 157629
8 category_hrcs_rac 120520
10 category_icrp_ct 65806
28 publication_ids 43655
14 date_embargo 7088
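To put these counts in context, a possible follow-up (a small sketch reusing the df and total variables defined above; the coverage_pct column name is just illustrative) is to express each field's coverage as a percentage of all dataset records:

# coverage of each field as a percentage of all dataset records
df["coverage_pct"] = (df["results"] / total * 100).round(1)
df.head(10)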

Creating a bar chart

NOTE: a standalone version of this chart is also available online

[5]:
from datetime import date
today = date.today().strftime("%d/%m/%Y")

fig = px.bar(df, x=df['filter_by'], y=df['results'],
             title=f"No of Dataset records per API field (as of {today})")
plot(fig, filename = 'dataset-fields-overview.html', auto_open=False)
fig.show()
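Since a few fields (e.g. date_embargo) have counts orders of magnitude smaller than the rest, a logarithmic y-axis can make them easier to compare. This is an optional variant of the chart above, not part of the original workflow:

# same chart with a logarithmic y axis, so small counts remain visible
fig_log = px.bar(df, x="filter_by", y="results", log_y=True,
                 title=f"No of Dataset records per API field (log scale, as of {today})")
fig_log.show()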