
Working with lists in the Dimensions API

In this notebook we are going to show:

  • How to use lists in order to write more efficient DSL queries

  • How lists can be used to concatenate the results of one query with another query

  • How these methods can be used for real-world applications, e.g. getting publications/patents/grants that cite my publications

Prerequisites: Installing the Dimensions Library and Logging in

[1]:

# @markdown # Get the API library and login
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}


!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

import json
import pandas as pd
import numpy as np
Dimcli - Dimensions API Client (v0.6.9)
Connected to endpoint: https://app.dimensions.ai - DSL version: 1.24
Method: dsl.ini file

1. How do we use lists in the Dimensions API?

We use lists in the API because they are easier to read, and easier to work with.

Here is a query without lists.

How many publications were produced by either Monash University or the University of Melbourne (grid.1002.3, grid.1008.9) in either 2019 or 2020? Be really careful with your brackets!

[2]:
%%dsldf

search publications
where
      (
          research_orgs.id = "grid.1008.9"
       or research_orgs.id = "grid.1002.3"
       )
  and (
          year = 2019
       or year = 2020
       )
return publications
limit 1

Returned Publications: 1 (total = 32721)
[2]:
id title volume author_affiliations type year issue pages journal.id journal.title
0 pub.1125408894 Posttraumatic anger: a confirmatory factor ana... 11 [[{'first_name': 'Grazia', 'last_name': 'Cesch... article 2020 1 1731127 jour.1045059 European Journal of Psychotraumatology

The query above could get really messy. What if I wanted 20 institutions? What if I wanted the last ten years? (or, or, or, or, or…) and (or, or, or, or, or…)

By using lists we can quickly add a large number of conditions by means of an easy-to-read square-bracket notation:

[3]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 32721)
[3]:
id
0 pub.1125408894
1 pub.1125679504
2 pub.1121881108
3 pub.1125399511
4 pub.1125025149
... ...
95 pub.1124285831
96 pub.1124593123
97 pub.1123956872
98 pub.1124795761
99 pub.1124554498

100 rows × 1 columns

2. What are all the things that we can make lists of in the Dimensions API?

What are the internal Entities that we might put in a list?

[4]:
%dsldocs
dsl_last_results[dsl_last_results['is_entity']==True]
[4]:
sources field type description is_filter is_entity is_facet
6 publications category_bra categories `Broad Research Areas <https://app.dimensions.... True True True
7 publications category_for categories `ANZSRC Fields of Research classification <htt... True True True
8 publications category_hra categories `Health Research Areas <https://app.dimensions... True True True
9 publications category_hrcs_hc categories `HRCS - Health Categories <https://app.dimensi... True True True
10 publications category_hrcs_rac categories `HRCS – Research Activity Codes <https://app.d... True True True
... ... ... ... ... ... ... ...
260 datasets research_org_cities cities City of the organisations the publication auth... True True True
261 datasets research_org_countries countries Country of the organisations the publication a... True True True
262 datasets research_org_states states State of the organisations the publication aut... True True True
263 datasets research_orgs organizations GRID organisations linked to the publication a... True True True
264 datasets researchers researchers Dimensions researchers IDs associated to the d... True True True

93 rows × 7 columns

What about lists of ids?

[5]:
%dsldocs
dsl_last_results[dsl_last_results['field'].str.contains('id')==True]
[5]:
sources field type description is_filter is_entity is_facet
1 publications altmetric_id integer AltMetric Publication ID True False False
23 publications id string Dimensions publication ID. True False False
32 publications pmcid string PubMed Central ID. True False False
33 publications pmid string PubMed ID. True False False
37 publications reference_ids string Dimensions publication ID for publications in ... True False False
48 publications supporting_grant_ids string Grants supporting a publication, returned as a... True False False
84 grants id string Dimensions grant ID. True False False
105 patents associated_grant_ids string Dimensions IDs of the grants associated to the... True False False
114 patents cited_by_ids string Dimensions IDs of the patents that cite this p... True False False
125 patents id string Dimensions patent ID True False False
135 patents publication_ids string Dimensions IDs of the publications related to ... True False False
137 patents reference_ids string Dimensions IDs of the patents which are cited ... True False False
145 clinical_trials associated_grant_ids string Dimensions IDs of the grants associated to the... True False False
161 clinical_trials id string Dimensions clinical trial ID True False False
166 clinical_trials publication_ids string Dimensions IDs of the publications related to ... True False False
180 policy_documents id string Dimensions policy document ID True False False
182 policy_documents publication_ids string Dimensions IDs of the publications related to ... True False False
193 researchers id string Dimensions researcher ID. True False False
197 researchers nih_ppid string The PI Profile ID (i.e., ppid) is a Researcher... True False False
199 researchers orcid_id string `ORCID <https://orcid.org/>`_ ID. True False False
206 organizations cnrs_ids string CNRS IDs for this organization True False False
209 organizations external_ids_fundref string Fundref IDs for this organization True False False
210 organizations hesa_ids string HESA IDs for this organization True False False
211 organizations id string GRID ID of the organization. E.g., "grid.26999... True False False
212 organizations isni_ids string ISNI IDs for this organization True False False
221 organizations organization_child_ids string Child organization IDs True False False
222 organizations organization_parent_ids string Parent organization IDs True False False
223 organizations organization_related_ids string Related organization IDs True False False
224 organizations orgref_ids string OrgRef IDs for this organization True False False
227 organizations ucas_ids string UCAS IDs for this organization True False False
228 organizations ukprn_ids string UKPRN IDs for this organization True False False
229 organizations wikidata_ids string WikiData IDs for this organization True False False
231 datasets associated_grant_ids string Dimensions IDs of the grants associated to the... True False False
232 datasets associated_publication_id string The Dimensions ID of the publication linked to... True False False
252 datasets id string Dimensions dataset ID. True False False
258 datasets publication_ids string The Dimensions IDs of the publications the dat... True False False
259 datasets repository_id string The ID of the repository of the dataset. True False True

What are the external entities that we can put in a list?

  • a list of ISSN’s

  • a list of External Grant IDs

  • a list of DOIs

  • a list of categories
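As a quick illustration (using made-up DOIs), any such external-identifier list can be rendered into DSL list syntax with `json.dumps`:

```python
import json

# Made-up DOIs standing in for any list of external identifiers
dois = ["10.1234/example.one", "10.1234/example.two"]

# json.dumps renders the Python list in exactly the square-bracket
# syntax the DSL expects, e.g. for a filter like: where doi in [...]
dsl_list = json.dumps(dois)
print(dsl_list)  # ["10.1234/example.one", "10.1234/example.two"]
```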

3. Making a list from the results of a query

The list syntax for the Dimensions API is the same as the list syntax for json, so we can use python’s json-to-string functions to make a list of ids for us from the previous query.

Let’s run our example query again.

[6]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 32721)
[6]:
id
0 pub.1125408894
1 pub.1125679504
2 pub.1121881108
3 pub.1125399511
4 pub.1125025149
... ...
95 pub.1124285831
96 pub.1124593123
97 pub.1123956872
98 pub.1124795761
99 pub.1124554498

100 rows × 1 columns

[7]:
json.dumps(list(dsl_last_results.id))


[7]:
'["pub.1125408894", "pub.1125679504", "pub.1121881108", "pub.1125399511", "pub.1125025149", "pub.1116652110", "pub.1125617654", "pub.1125508088", "pub.1125663277", "pub.1126702062", "pub.1126705781", "pub.1125756054", "pub.1125902856", "pub.1124284911", "pub.1125617777", "pub.1125753634", "pub.1125545402", "pub.1125609619", "pub.1125664038", "pub.1124885124", "pub.1126013995", "pub.1124342595", "pub.1124670593", "pub.1124633600", "pub.1125152336", "pub.1125488818", "pub.1125502024", "pub.1125488824", "pub.1124372639", "pub.1124060229", "pub.1127171381", "pub.1126502051", "pub.1124550366", "pub.1126697386", "pub.1125488819", "pub.1127351832", "pub.1126704737", "pub.1127397677", "pub.1127419474", "pub.1127403602", "pub.1127453116", "pub.1127449188", "pub.1126125606", "pub.1127579881", "pub.1124132392", "pub.1124830567", "pub.1125402193", "pub.1125770299", "pub.1127514900", "pub.1123822663", "pub.1124157789", "pub.1124072948", "pub.1124132388", "pub.1124224145", "pub.1124845775", "pub.1124681882", "pub.1124587103", "pub.1125035645", "pub.1125334218", "pub.1124838496", "pub.1125548282", "pub.1125950685", "pub.1125765179", "pub.1125707207", "pub.1125707849", "pub.1125675430", "pub.1124438141", "pub.1126819202", "pub.1125491401", "pub.1125799833", "pub.1126004931", "pub.1124370224", "pub.1124906983", "pub.1124633536", "pub.1123836575", "pub.1126648156", "pub.1125929238", "pub.1126173952", "pub.1125494084", "pub.1125118148", "pub.1123936541", "pub.1127421180", "pub.1125041346", "pub.1127361771", "pub.1125115778", "pub.1127453886", "pub.1127594182", "pub.1127258408", "pub.1126698864", "pub.1125664006", "pub.1127361827", "pub.1127551208", "pub.1127618456", "pub.1127579821", "pub.1124456088", "pub.1124285831", "pub.1124593123", "pub.1123956872", "pub.1124795761", "pub.1124554498"]'

Let’s try to use this list of IDs.

Unfortunately, you can’t just paste the Python expression directly into the query; the DSL parser does not evaluate Python.

[8]:
%%dsldf
  search publications
  where id in [json.dumps(list(dsl_last_results.id))]

  return publications

Returned Errors: 1
1 QuerySyntaxError found
1 ParserError found
  * [Line 2:15] ('json') no viable alternative at input '[json'

…so let’s get our results back again

[9]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 32721)
[9]:
id
0 pub.1125408894
1 pub.1125679504
2 pub.1121881108
3 pub.1125399511
4 pub.1125025149
... ...
95 pub.1124285831
96 pub.1124593123
97 pub.1123956872
98 pub.1124795761
99 pub.1124554498

100 rows × 1 columns

… and use the Python way of calling the Dimensions API instead, interpolating the list with an f-string

[10]:
dsl.query(f"""

 search publications
  where id in {json.dumps(list(dsl_last_results.id))}

  return publications


""").as_dataframe()

Returned Publications: 20 (total = 100)

Putting both parts of this example together

[11]:
# Step 1. Get the list of publications..

pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 100
                """).as_dataframe()

# Step 2. Put the list into the next query...

dsl.query_iterative(f"""
                 search publications
                    where id in {json.dumps(list(pubs.id))}
                    return publications
""").as_dataframe().head(5)
Returned Publications: 100 (total = 32721)
1000 / ...
100 / 100
===
Records extracted: 100
[11]:
title author_affiliations volume issue pages type year id journal.id journal.title
0 Posttraumatic anger: a confirmatory factor ana... [[{'first_name': 'Grazia', 'last_name': 'Cesch... 11 1 1731127 article 2020 pub.1125408894 jour.1045059 European Journal of Psychotraumatology
1 Direct assessment of mental health and metabol... [[{'first_name': 'Peter S', 'last_name': 'Azzo... 13 1 1732665 article 2020 pub.1125679504 jour.1041075 Global Health Action
2 The large-scale implementation and evaluation ... [[{'first_name': 'Bengianni', 'last_name': 'Pi... 25 1 1-11 article 2020 pub.1121881108 jour.1097842 International Journal of Adolescence and Youth
3 Structural brain changes with lifetime trauma ... [[{'first_name': 'Marie-Laure', 'last_name': '... 11 1 1733247 article 2020 pub.1125399511 jour.1045059 European Journal of Psychotraumatology
4 Exploring cultural differences in the use of e... [[{'first_name': 'Amanda', 'last_name': 'Nagul... 11 1 1729033 article 2020 pub.1125025149 jour.1045059 European Journal of Psychotraumatology

4. Doing something useful: get all the publications that cite my publications

[12]:
pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 100
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query_iterative(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe().head()
Returned Publications: 100 (total = 32721)
1000 / ...
25 / 25
===
Records extracted: 25
[12]:
title author_affiliations volume issue pages type year id journal.id journal.title
0 Non-photochemical quenching, a non-invasive pr... [[{'first_name': 'Pranali', 'last_name': 'Deor... 1 1 32-43 article 2020 pub.1125663277 NaN NaN
1 Distribution and pyrethroid resistance status ... [[{'first_name': 'Hitoshi', 'last_name': 'Kawa... 13 1 213 article 2020 pub.1126895840 jour.1039458 Parasites & Vectors
2 Reducing ignorance about who dies of what: res... [[{'first_name': 'Alan D.', 'last_name': 'Lope... 18 1 58 article 2020 pub.1125488824 jour.1032885 BMC Medicine
3 Implementing and scaling verbal autopsies: int... [[{'first_name': 'Ross M.', 'last_name': 'Boyc... 18 1 53 article 2020 pub.1125488825 jour.1032885 BMC Medicine
4 Addressing critical knowledge and capacity gap... [[{'first_name': 'Tim', 'last_name': 'Adair', ... 18 1 46 article 2020 pub.1125488821 jour.1032885 BMC Medicine

5. How long can lists get?

The limit depends partly on the overall query string length, but there is also a hard cap: the `in` filter accepts fewer than 512 items.

This won’t work

[13]:
pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe()
Returned Publications: 1000 (total = 32721)
Returned Errors: 1
Semantic Error
Semantic errors found:
        Filter operator 'in' requires 0 < items < 512. '1000 is out of this range'.
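A simple guard against this error is to check the list length before building the query (a sketch, using hypothetical ids):

```python
# Hypothetical list of 1000 publication ids
ids = [f"pub.{1000000 + n}" for n in range(1000)]

# The 'in' operator requires 0 < items < 512, so check before querying
if len(ids) < 512:
    print("safe to use in a single 'in' filter")
else:
    print(f"{len(ids)} items: split the list into chunks first")
```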

This will

[14]:
pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 250
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe().head(2)
Returned Publications: 250 (total = 32721)
Returned Publications: 20 (total = 78)
[14]:
author_affiliations issue year type title id pages volume journal.id journal.title
0 [[{'first_name': 'Pranali', 'last_name': 'Deor... 1 2020 article Non-photochemical quenching, a non-invasive pr... pub.1125663277 32-43 1 NaN NaN
1 [[{'first_name': 'Hitoshi', 'last_name': 'Kawa... 1 2020 article Distribution and pyrethroid resistance status ... pub.1126895840 213 13 jour.1039458 Parasites & Vectors

What if I need a very long list?

The Dimcli library can break your list of results up into chunks.

We then loop through each chunk, run a query for each one, and stitch the results back together at the end.
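What the chunking step does can be sketched in plain Python (hypothetical ids; dimcli's `.chunks()` operates on its own result object, so this is just an illustration):

```python
# Hypothetical list of 1000 publication ids
ids = [f"pub.{1000000 + n}" for n in range(1000)]

# Slice the list into batches of 250, each safely under the 512-item limit
chunks = [ids[i:i + 250] for i in range(0, len(ids), 250)]

print(len(chunks))     # 4
print(len(chunks[0]))  # 250
```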

[15]:
# Step 1 - same as before - except now we want the query in chunks

pubs_chunks = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """).chunks(250)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results

query_results = []

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return publications
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

pd.concat(query_results).\
   drop_duplicates(subset='id').\
   head(2)

Returned Publications: 1000 (total = 32721)
1000 / ...
79 / 79
===
Records extracted: 79
1000 / ...
53 / 53
===
Records extracted: 53
1000 / ...
70 / 70
===
Records extracted: 70
1000 / ...
127 / 127
===
Records extracted: 127
[15]:
id title volume author_affiliations type year issue pages journal.id journal.title
0 pub.1125663277 Non-photochemical quenching, a non-invasive pr... 1 [[{'first_name': 'Pranali', 'last_name': 'Deor... article 2020 1 32-43 NaN NaN
1 pub.1126895840 Distribution and pyrethroid resistance status ... 13 [[{'first_name': 'Hitoshi', 'last_name': 'Kawa... article 2020 1 213 jour.1039458 Parasites & Vectors

6. What if I want to get the researchers associated with the publications that cite my institution?

[16]:
# Step 1 - same as before

pubs_chunks = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """).chunks(250)

query_results = []

# Step 2 - same as before, but now we return researchers instead of publications

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return researchers limit 1000
                        """).as_dataframe()
      # Warning 1, If there are more than 1000 researchers involved in this query, you will miss some
      )

# Step 3 - join the queries back together, this time using a groupby statement to sum the counts

my_researchers = pd.concat(query_results).\
                 groupby(['id','first_name','last_name']).\
                  agg({'count':'sum'}).\
                  sort_values(by='count', ascending=False).\
                  head(10)

Returned Publications: 1000 (total = 32721)
Returned Researchers: 434
Returned Researchers: 485
Returned Researchers: 213
Returned Researchers: 417

7. What if I want to get all the researchers associated with the publications that cite my institution?

[17]:
# Step 1 - same as before

pubs_chunks = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """).chunks(250)

query_results = []

# Step 2 - almost the same as before -
#          except now we are asking for the as_dataframe_authors data frame

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return publications
                        """).as_dataframe_authors() # I have changed this line from as_dataframe to as_dataframe_authors
      )

# Step 3 - join the publications back together

researcher_pubs = pd.concat(query_results).\
                drop_duplicates(subset=['researcher_id','pub_id'])


# Step 4 - count up the publications using a groupby statement

my_researchers = researcher_pubs[researcher_pubs['researcher_id'] != ''].\
    groupby(['researcher_id']).\
    agg({'first_name':'max','last_name':'max','pub_id':'count'}).\
    sort_values(by='pub_id', ascending=False).\
    reset_index()

my_researchers.\
    head(10)

Returned Publications: 1000 (total = 32721)
1000 / ...
79 / 79
===
Records extracted: 79
1000 / ...
53 / 53
===
Records extracted: 53
1000 / ...
70 / 70
===
Records extracted: 70
1000 / ...
127 / 127
===
Records extracted: 127
[17]:
researcher_id first_name last_name pub_id
0 ur.014233544433.66 Shirui Pan 6
1 ur.01125623170.44 Alan D. Lopez 3
2 ur.01224315435.68 Rachelle Buchbinder 3
3 ur.010771505735.21 Xingquan Zhu 3
4 ur.01167177047.84 Antonis C. Antoniou 3
5 ur.0725130324.01 Clare L Ardern 2
6 ur.01356312225.35 Joaquín Bernal-Bayard 2
7 ur.01357125251.51 Quaid D. Morris 2
8 ur.0616256460.84 John L. Hopper 2
9 ur.0724772466.27 Christopher I. Pakes 2

8. …and if we want details about our researchers, we can put our list of researchers into the researchers source

See the researcher source docs for more details.
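The `groupby(np.arange(len(df)) // 250)` idiom used in the next cell splits a DataFrame into fixed-size chunks. A toy sketch with chunk size 4:

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for my_researchers
df = pd.DataFrame({"researcher_id": [f"ur.{n}" for n in range(10)]})

# Floor-dividing the row positions labels each row with a chunk number:
# rows 0-3 -> 0, rows 4-7 -> 1, rows 8-9 -> 2
chunk_sizes = [len(group) for _, group in df.groupby(np.arange(len(df)) // 4)]
print(chunk_sizes)  # [4, 4, 2]
```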

[18]:
## First, we need to chunk up our researcher list

query_results = []

for g, rschr in my_researchers.groupby(np.arange(len(my_researchers)) // 250):
          # This does *almost* the same thing as the chunks command used above

     myreslist = json.dumps(list(rschr.researcher_id))

     query_results.append(

                  dsl.query_iterative(f"""
                        search researchers
                            where id in {myreslist}
                            return researchers
                        """).as_dataframe() #
      )


pd.concat(query_results).head()
1000 / ...
248 / 248
===
Records extracted: 248
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
249 / 249
===
Records extracted: 249
1000 / ...
30 / 30
===
Records extracted: 30
[18]:
id last_name first_name research_orgs orcid_id
0 ur.015157705046.28 Szczęsny Robert [{'id': 'grid.8756.c', 'linkout': ['http://www... NaN
1 ur.014764355205.36 Speed Cathy [{'id': 'grid.47170.35', 'linkout': ['http://w... NaN
2 ur.01264721100.06 Barrowdale Daniel [{'id': 'grid.5335.0', 'linkout': ['http://www... [0000-0003-1661-3939]
3 ur.015577114163.64 Tang Tiffany Obog [{'id': 'grid.1002.3', 'linkout': ['http://www... NaN
4 ur.014557047701.50 English Ruth [{'id': 'grid.450634.0', 'linkout': ['http://w... NaN

9. Patents example (patents -> publications)

Using the same method, we can retrieve all patents citing publications from my institution.

[19]:
%dsldocs patents
dsl_last_results[dsl_last_results['field']=='publication_ids']
[19]:
sources field type description is_filter is_entity is_facet
37 patents publication_ids string Dimensions IDs of the publications related to ... True False False
[20]:
# Step 1 - same as before - except now we want the query in chunks

pubs_chunks = dsl.query_iterative("""
                  search publications
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2015
                    return publications[id]
                """).chunks(250)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with patents, and reference_ids was replaced by publication_ids

query_results = []

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search patents
                            where publication_ids in {mypubslist}
                            return patents
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

cited_patents = pd.concat(query_results).\
   drop_duplicates(subset='id')

cited_patents.head(2)
1000 / ...
1000 / 8409
2000 / 8409
3000 / 8409
4000 / 8409
5000 / 8409
6000 / 8409
7000 / 8409
8000 / 8409
8409 / 8409
===
Records extracted: 8409
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
===
Records extracted: 0
1000 / ...
===
Records extracted: 0
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
9 / 9
===
Records extracted: 9
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
1 / 1
===
Records extracted: 1
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
1 / 1
===
Records extracted: 1
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
5 / 5
===
Records extracted: 5
1000 / ...
8 / 8
===
Records extracted: 8
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
5 / 5
===
Records extracted: 5
1000 / ...
6 / 6
===
Records extracted: 6
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
11 / 11
===
Records extracted: 11
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
8 / 8
===
Records extracted: 8
1000 / ...
===
Records extracted: 0
1000 / ...
5 / 5
===
Records extracted: 5
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
===
Records extracted: 0
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
===
Records extracted: 0
[20]:
id title filing_status inventor_names assignee_names year assignees publication_date times_cited granted_year
0 WO-2017129644-A1 COMPOSITIONS WITH SPECIFIC OLIGOSACCHARIDES TO... Application [BLANCHARD, CARINE, NEMBRINI, Chiara] [NESTEC SA] 2017 [{'id': 'grid.419905.0', 'name': 'Nestlé (Swit... 2017-08-03 0.0 NaN
1 WO-2017129642-A1 COMPOSITION FOR USE IN THE PREVENTION AND/OR T... Application [BLANCHARD, CARINE, NEMBRINI, Chiara] [NESTEC SA] 2017 [{'id': 'grid.419905.0', 'name': 'Nestlé (Swit... 2017-08-03 0.0 NaN
[21]:
%dsldocs patents
dsl_last_results[dsl_last_results['type']=='organizations']
[21]:
sources field type description is_filter is_entity is_facet
6 patents assignees organizations Disambiguated GRID organisations who own or ha... True True True
19 patents current_assignees organizations Disambiguated GRID organisations currenlty own... True True True
24 patents funders organizations GRID organisations funding the patent. True True True
33 patents original_assignees organizations Disambiguated GRID organisations that first ow... True True True
[22]:
import json
cited_patents_assignees = cited_patents.explode('assignees')

cited_patents_assignees['assignee_grid_id'] = cited_patents_assignees['assignees'].\
    apply(lambda g: g['id'] if type(g) == dict else 0 )

cited_patents_assignees['assignee_name'] = cited_patents_assignees['assignees'].\
    apply(lambda g: g['name'] if type(g) == dict else 0 )

cited_patents_assignees.\
    groupby(['assignee_grid_id','assignee_name']).\
    agg({'id':'count'}).\
    sort_values(by='id', ascending=False).\
    head(20)
[22]:
id
assignee_grid_id assignee_name
0 0 29
grid.428999.7 Pasteur Institute 5
grid.1058.c Murdoch Children's Research Institute 4
grid.453773.1 Wisconsin Alumni Research Foundation 3
grid.419905.0 Nestlé (Switzerland) 3
grid.420918.6 Imperial Innovations (United Kingdom) 3
grid.1055.1 Peter MacCallum Cancer Centre 3
grid.25879.31 University of Pennsylvania 2
grid.1042.7 Walter and Eliza Hall Institute of Medical Research 2
grid.4444.0 French National Centre for Scientific Research 2
grid.452266.1 Campus Science Support Facilities 2
grid.1003.2 University of Queensland 2
grid.420377.5 NEC (Japan) 2
grid.420214.1 Sanofi (Germany) 2
grid.419859.8 NEC Corporation of America 2
grid.419318.6 Intel (United States) 2
grid.417521.4 Institute of Molecular Biotechnology 2
grid.29857.31 Pennsylvania State University 2
grid.431532.2 Mesoblast (United States) 2
grid.7429.8 French Institute of Health and Medical Research 2

10. Clinical Trials (clinical trials -> publications)

Using the same method, we can retrieve all clinical trials citing publications from my institution.

[23]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['field']=='research_orgs']
[23]:
sources field type description is_filter is_entity is_facet
26 clinical_trials research_orgs organizations GRID organizations involved, e.g. as sponsors ... True True True
[24]:
# Step 1 - same as before - except now we want the query in chunks

clinical_trials_chunks = dsl.query_iterative("""
                  search publications
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2015
                    return publications[id]
                """).chunks(400)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with clinical_trials, and reference_ids was replaced by publication_ids

query_results = []

for c in clinical_trials_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search clinical_trials
                            where publication_ids in {mypubslist}
                            return clinical_trials[all]
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

cited_clinical_trials = pd.concat(query_results).\
   drop_duplicates(subset='id')

cited_clinical_trials.head(2)
1000 / ...
1000 / 8409
2000 / 8409
3000 / 8409
4000 / 8409
5000 / 8409
6000 / 8409
7000 / 8409
8000 / 8409
8409 / 8409
===
Records extracted: 8409
1000 / ...
10 / 10
===
Records extracted: 10
[... similar progress output for the remaining 21 chunks omitted ...]
[24]:
FOR_first id interventions category_rcdc RCDC organizations research_orgs category_bra title HRCS_HC ... funders gender brief_title researchers category_hrcs_rac HRCS_RAC category_icrp_cso category_icrp_ct associated_grant_ids funder_groups
0 [{'id': '2211', 'name': '11 Medical and Health... NCT00987454 [{'type': 'Drug', 'name': 'Sodium Chloride 0.9... [{'id': '387', 'name': 'Neurosciences'}, {'id'... [{'id': '387', 'name': 'Neurosciences'}, {'id'... [{'id': 'grid.5802.f', 'name': 'Johannes Guten... [{'id': 'grid.5802.f', 'name': 'Johannes Guten... [{'id': '4001', 'name': 'Clinical Medicine and... A Randomised, Placebo-controlled Trial of Eryt... [{'id': '904', 'name': 'Injuries and Accidents... ... [{'id': 'grid.431143.0', 'name': 'National Hea... All Erythropoietin in Traumatic Brain Injury (EPO-... NaN NaN NaN NaN NaN NaN NaN
1 [{'id': '2217', 'name': '17 Psychology and Cog... NCT03112850 [{'type': 'Behavioral', 'name': 'Auditory Trai... [{'id': '508', 'name': 'Clinical Trials and Su... [{'id': '508', 'name': 'Clinical Trials and Su... [{'id': 'grid.1027.4', 'name': 'Swinburne Univ... [{'id': 'grid.1027.4', 'name': 'Swinburne Univ... [{'id': '4003', 'name': 'Public Health'}] Investigating the Impact of Hearing Aid Use an... [{'id': '905', 'name': 'Mental Health'}, {'id'... ... NaN All Cognition Effects of Hearing Aids and Auditory... [{'id': 'ur.01243702010.56', 'last_name': 'Mey... NaN NaN NaN NaN NaN NaN

2 rows × 37 columns

[25]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['type']=='organizations']
[25]:
sources field type description is_filter is_entity is_facet
17 clinical_trials funders organizations GRID funding organisations that are involved w... True True True
26 clinical_trials research_orgs organizations GRID organizations involved, e.g. as sponsors ... True True True
[26]:
cited_clinical_trials_orgs = cited_clinical_trials.explode('research_orgs')

cited_clinical_trials_orgs['research_orgs_grid_id'] = cited_clinical_trials_orgs['research_orgs'].\
    apply(lambda g: g['id'] if type(g) == dict else 0 )

cited_clinical_trials_orgs['research_orgs_name'] = cited_clinical_trials_orgs['research_orgs'].\
    apply(lambda g: g['name'] if type(g) == dict else 0 )

cited_clinical_trials_orgs.\
    groupby(['research_orgs_grid_id','research_orgs_name']).\
    agg({'id':'count'}).\
    sort_values(by='id', ascending=False).\
    head(20)
[26]:
id
research_orgs_grid_id research_orgs_name
grid.1008.9 University of Melbourne 11
grid.431143.0 National Health and Medical Research Council 11
grid.416153.4 Royal Melbourne Hospital 6
grid.21107.35 Johns Hopkins University 6
grid.1002.3 Monash University 5
grid.1058.c Murdoch Children's Research Institute 5
grid.1055.1 Peter MacCallum Cancer Centre 5
grid.416100.2 Royal Brisbane and Women's Hospital 4
grid.419681.3 National Institute of Allergy and Infectious Diseases 4
grid.277151.7 Centre Hospitalier Universitaire de Nantes 4
grid.1623.6 The Alfred Hospital 4
grid.413249.9 Royal Prince Alfred Hospital 4
grid.5650.6 Academic Medical Center 4
grid.411109.c Virgen del Rocío University Hospital 4
grid.416259.d Royal Women's Hospital 4
grid.84393.35 Hospital Universitari i Politècnic La Fe 4
grid.1003.2 University of Queensland 4
grid.17091.3e University of British Columbia 3
grid.413574.0 Alberta Health Services 3
grid.27860.3b University of California, Davis 3

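The `explode`/`apply` pattern used above, which flattens the nested `research_orgs` dictionaries into plain columns, can be sketched on a toy frame (the data below is made up for illustration):

```python
import pandas as pd

# Each row holds a list of org dicts, as returned by the API.
df = pd.DataFrame({
    "id": ["pub.1", "pub.2"],
    "research_orgs": [
        [{"id": "grid.1008.9", "name": "University of Melbourne"},
         {"id": "grid.1002.3", "name": "Monash University"}],
        None,  # records with no orgs survive explode as a single NaN row
    ],
})

# explode turns each list element into its own row
flat = df.explode("research_orgs")

# pull fields out of each dict, defaulting to 0 for missing values,
# mirroring the lambda used in the cell above
flat["grid_id"] = flat["research_orgs"].apply(
    lambda g: g["id"] if isinstance(g, dict) else 0)
flat["org_name"] = flat["research_orgs"].apply(
    lambda g: g["name"] if isinstance(g, dict) else 0)

print(flat[["id", "grid_id", "org_name"]])
```

Once flattened like this, the frame is ready for the `groupby`/`agg` counting step shown above.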
11. Grants (publications -> grants)

Using the same method, we can retrieve all grants funding publications from my institution.

[27]:
%dsldocs publications
dsl_last_results[dsl_last_results['field'].str.contains('ids')]
[27]:
sources field type description is_filter is_entity is_facet
37 publications reference_ids string Dimensions publication ID for publications in ... True False False
48 publications supporting_grant_ids string Grants supporting a publication, returned as a... True False False
[28]:
# Step 1 - same as before - except now we also return supporting_grant_ids

publications = dsl.query_iterative("""
                  search publications
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2020
                    return publications[id+supporting_grant_ids]
                """).as_dataframe()

# Step 2 - we can get the grants IDs directly from publications this time.
# So as a second step, we want to pull grants metadata using these identifiers.

pubs_grants = publications.explode('supporting_grant_ids')

grants_from_pubs = pd.DataFrame(pubs_grants.supporting_grant_ids.unique()).\
                   dropna().\
                   rename(columns={0:'id'})

query_results = []

for g, gnts in grants_from_pubs.groupby(np.arange(len(grants_from_pubs)) // 250):
      # This does *almost* the same thing as the .chunks() command used above

      myglist = json.dumps(list(gnts.id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search grants
                            where id in {myglist}
                          return grants[all]
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

grant_details = pd.concat(query_results).\
   drop_duplicates(subset='id')

grant_details.head(5)
1000 / ...
1000 / 6285
2000 / 6285
3000 / 6285
4000 / 6285
5000 / 6285
6000 / 6285
6285 / 6285
===
Records extracted: 6285
1000 / ...
248 / 248
===
Records extracted: 248
[... similar progress output for the remaining 13 chunks omitted ...]
[28]:
funding_cad end_date title_language language_title research_org_name HRCS_HC category_hrcs_hc funding_nzd date_inserted original_title ... FOR active_year funding_chf language id category_icrp_cso research_orgs research_org_state_codes funding_org_acronym funding_org_city
0 12544062.0 2024-04-30 en en HENNEPIN HEALTHCARE RESEARCH INSTITUTE [{'id': '911', 'name': 'Cancer'}] [{'id': '911', 'name': 'Cancer'}] 14354176.0 2019-09-05 ASPirin in Reducing Events in the Elderly - eX... ... [{'id': '3053', 'name': '1103 Clinical Science... [2019, 2020, 2021, 2022, 2023, 2024] 9394464.0 en grant.8556107 NaN NaN NaN NaN NaN
1 630010.0 2024-04-30 en en University of Massachusetts Medical School [{'id': '911', 'name': 'Cancer'}] [{'id': '911', 'name': 'Cancer'}] 720921.0 2019-06-28 NanoOptogenetic immunotherapy for B cell lymphoma ... [{'id': '3142', 'name': '1112 Oncology and Car... [2019, 2020, 2021, 2022, 2023, 2024] 471826.0 en grant.8474795 [{'id': '3767', 'name': '5.3 Systemic Therapie... [{'id': 'grid.168645.8', 'country_name': 'Unit... [{'id': 'US-MA', 'name': 'Massachusetts'}] NaN NaN
2 589505.0 2021-03-31 en en University of California, San Diego [{'id': '897', 'name': 'Neurological'}] [{'id': '897', 'name': 'Neurological'}] 674575.0 2019-06-28 Prediction of seizure lateralization and posto... ... [{'id': '3120', 'name': '1109 Neurosciences'}] [2019, 2020, 2021] 441491.0 en grant.8473885 NaN [{'id': 'grid.266100.3', 'country_name': 'Unit... [{'id': 'US-CA', 'name': 'California'}] NaN NaN
3 1169289.0 2023-03-31 en en Texas A&M Health Science Center [{'id': '894', 'name': 'Cardiovascular'}] [{'id': '894', 'name': 'Cardiovascular'}] 1343265.0 2019-04-23 Role of TET dioxygenase associated immune mech... ... [{'id': '2620', 'name': '0604 Genetics'}, {'id... [2019, 2020, 2021, 2022, 2023] 874596.0 en grant.8388899 NaN [{'id': 'grid.412408.b', 'country_name': 'Unit... [{'id': 'US-TX', 'name': 'Texas'}] NaN NaN
4 1005090.0 2022-02-28 en en Case Western Reserve University [{'id': '911', 'name': 'Cancer'}] [{'id': '911', 'name': 'Cancer'}] 1154319.0 2019-04-23 HistoTools: A suite of digital pathology tool... ... [{'id': '3142', 'name': '1112 Oncology and Car... [2019, 2020, 2021, 2022] 751846.0 en grant.8388167 [{'id': '3761', 'name': '4.1 Technology Develo... [{'id': 'grid.67105.35', 'country_name': 'Unit... [{'id': 'US-OH', 'name': 'Ohio'}] NaN NaN

5 rows × 56 columns

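The `np.arange(len(...)) // 250` trick in Step 2 is worth unpacking: integer division of each row's position produces a group label that changes every 250 rows, so `groupby` splits the frame into fixed-size chunks. A minimal sketch with made-up IDs and a chunk size of 4:

```python
import numpy as np
import pandas as pd

# Toy frame of 10 hypothetical grant IDs
df = pd.DataFrame({"id": [f"grant.{n}" for n in range(10)]})

# Rows 0-3 get label 0, rows 4-7 get label 1, rows 8-9 get label 2,
# so groupby yields chunks of at most 4 rows each.
for g, chunk in df.groupby(np.arange(len(df)) // 4):
    print(g, list(chunk.id))
```

Keeping each chunk well under the DSL's limits on filter-list length is the reason for batching the grant IDs at all.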
[29]:
pubs_grants.groupby('supporting_grant_ids').\
    agg({'id':'count'}).\
    reset_index().\
    rename(columns={'id':'pubs','supporting_grant_ids':'id'}).\
    merge(grant_details[['id','original_title','funding_usd']],
          on='id').\
    sort_values(by='pubs', ascending=False)

[29]:
id pubs original_title funding_usd
1451 grant.6711717 26 ARC Centre of Excellence in Exciton Science 22669994.0
767 grant.3931418 17 ARC Centre of Excellence in Convergent Bio-Nan... 19674412.0
2294 grant.7874297 16 Advancing Nanomedicine through Particle Techno... 617151.0
2741 grant.7877829 13 Novel therapies for psychiatric disorders 617440.0
2778 grant.7878111 13 Novel therapies, risk pathways and prevention ... 643236.0
... ... ... ... ...
1256 grant.5300885 1 OP: Ultrafast and Optomechanical Properties of... 370000.0
1257 grant.5346997 1 Testing genotype-guided therapies in pancreati... 132643.0
1259 grant.5475480 1 UCLA Clinical Translational Science Institute 52803736.0
1260 grant.5475517 1 Unraveling the mammalian secretory pathway thr... 1550000.0
3304 grant.8681118 1 Development of molecular markers for applicati... NaN

3305 rows × 4 columns

Why didn’t I use resulting_publication_ids?

Because, as the warning in the query below shows, that field is deprecated; linking from publications via supporting_grant_ids (as in section 11) is the supported route.

[30]:
%%dsldf

search grants
where resulting_publication_ids in ["pub.1005269097"]
Returned Grants: 3 (total = 3)
WARNINGS [1]
Field 'resulting_publication_ids' is deprecated. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
[30]:
id title end_date title_language project_num original_title funding_org_name language funders start_date active_year start_year
0 grant.4320525 Nanoscale X-Ray Imaging and Dynamics of Electr... 2021-12-14 en DE-SC0001805 Nanoscale X-Ray Imaging and Dynamics of Electr... Office of Basic Energy Sciences en [{'id': 'grid.452988.a', 'linkout': ['https://... 2018-12-15 [2018, 2019, 2020, 2021] 2018
1 grant.3660654 Strain-induced modification of nanoscale mater... 2018-07-31 en 1411335 Strain-induced modification of nanoscale mater... Directorate for Mathematical & Physical Sciences en [{'id': 'grid.457875.c', 'linkout': ['http://w... 2014-08-15 [2014, 2015, 2016, 2017, 2018] 2014
2 grant.3100327 Magnetic Transition Metal Nanowires 2013-09-30 en 0906957 Magnetic Transition Metal Nanowires Directorate for Mathematical & Physical Sciences en [{'id': 'grid.457875.c', 'linkout': ['http://w... 2009-08-15 [2009, 2010, 2011, 2012, 2013] 2009

Conclusions

Lists are a simple data structure with a great many applications.

When used in conjunction with the DSL, they make it easy to concatenate the results of one query with another, e.g. in order to navigate through the links available in Dimensions (from publications to grants, patents etc…).
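The core pattern reduces to a few lines: run a first query, serialize the resulting IDs with `json.dumps`, and interpolate them into the filter of a second query. A network-free sketch (the IDs below are placeholders; actually running the query requires a logged-in `dsl` object):

```python
import json

# Hypothetical IDs standing in for the result of a first query
my_pubs = ["pub.1005269097", "pub.1125408894"]

# json.dumps renders a Python list as a DSL-compatible quoted list,
# ready to drop into an `in [...]` filter of a second query
id_list = json.dumps(my_pubs)

query = f"""search clinical_trials
    where publication_ids in {id_list}
    return clinical_trials[id+title]"""

print(query)
# dsl.query_iterative(query)  # would execute the second query
```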

See also this patents tutorial or this clinical trials tutorial for more in-depth applications of the queries discussed above.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg