
Working with lists in the Dimensions API

In this notebook we are going to show:

  • How to use lists in order to write more efficient DSL queries

  • How lists can be used to chain the results of one query into another

  • How these methods can be used for real-world applications, e.g. getting publications/patents/grants that cite my publications

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:
!pip install dimcli -U --quiet

import dimcli
from dimcli.utils import *
import sys
import json
import pandas as pd
import numpy as np

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. How do we use lists in the Dimensions API?

We use lists in the API because they make queries easier to read and easier to work with.

Here is a query without lists.

How many publications were produced by either Monash University or the University of Melbourne (grid.1002.3, grid.1008.9) in either 2019 or 2020? Be really careful with your brackets!

[2]:
%%dsldf

search publications
where
      (
          research_orgs.id = "grid.1008.9"
       or research_orgs.id = "grid.1002.3"
       )
  and (
          year = 2019
       or year = 2020
       )
return publications
limit 1

Returned Publications: 1 (total = 45670)
Time: 0.71s
[2]:
authors id issue pages title type volume year journal.id journal.title
0 [{'affiliations': [{'city': 'Lisbon', 'city_id... pub.1141637539 3 188-197 How silence facilitates verbal participation article 3 2020 jour.1374325 English Language Teaching Educational Journal

The query above could get really messy. What if I wanted 20 institutions? What if I wanted the last ten years? (or, or, or, or, or…) and (or, or, or, or, or)

By using lists we can quickly add a large number of conditions by means of an easy-to-read square-brackets notation:

[3]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 2.34s
[3]:
id
0 pub.1141637539
1 pub.1141305289
2 pub.1138102366
3 pub.1135883909
4 pub.1135432149
... ...
95 pub.1134203058
96 pub.1134203025
97 pub.1134202605
98 pub.1134190313
99 pub.1134190104

100 rows × 1 columns

2. What are all the things that we can make lists of in the Dimensions API?

What are the internal entities that we might put in a list?

[4]:
%dsldocs
dsl_last_results[dsl_last_results['is_entity']==True]
[4]:
sources field type description is_filter is_entity is_facet
6 clinical_trials category_bra categories `Broad Research Areas <https://app.dimensions.... True True True
7 clinical_trials category_for categories `ANZSRC Fields of Research classification <htt... True True True
8 clinical_trials category_hra categories `Health Research Areas <https://dimensions.fre... True True True
9 clinical_trials category_hrcs_hc categories `HRCS - Health Categories <https://dimensions.... True True True
10 clinical_trials category_hrcs_rac categories `HRCS – Research Activity Codes <https://dimen... True True True
... ... ... ... ... ... ... ...
321 reports responsible_orgs_cities cities Cities of the organisations responsible for th... True True True
322 reports responsible_orgs_countries countries Countries of the organisations responsible for... True True True
324 reports responsible_orgs_states states States of the organisations responsible for th... True True True
327 researchers current_research_org organizations The most recent research organization linked t... True True True
340 researchers research_orgs organizations All research organizations linked to the resea... True True True

123 rows × 7 columns

What about lists of IDs?

[5]:
%dsldocs
dsl_last_results[dsl_last_results['field'].str.contains('id')==True]
[5]:
sources field type description is_filter is_entity is_facet
4 clinical_trials associated_grant_ids string Dimensions IDs of the grants associated to the... True False False
21 clinical_trials id string Dimensions clinical trial ID True False False
27 clinical_trials publication_ids string Publications mentioned in clinical trials (exc... True False False
33 datasets associated_grant_ids string The Dimensions IDs of the grants linked to the... True False False
35 datasets associated_publication_id string The Dimensions ID of the publication linked to... True False False
57 datasets id string Dimensions dataset ID. True False False
64 datasets repository_id string The ID of the repository of the dataset. True False True
105 grants id string Dimensions grant ID. True False False
122 organizations cnrs_ids string CNRS IDs for this organization True False False
126 organizations external_ids_fundref string Fundref IDs for this organization True False False
127 organizations hesa_ids string HESA IDs for this organization True False False
128 organizations id string GRID ID of the organization. E.g., "grid.26999... True False False
129 organizations isni_ids string ISNI IDs for this organization True False False
140 organizations organization_child_ids string Child organization IDs True False False
141 organizations organization_parent_ids string Parent organization IDs True False False
142 organizations organization_related_ids string Related organization IDs True False False
143 organizations orgref_ids string OrgRef IDs for this organization True False False
145 organizations ror_ids string ROR IDs for this organization True False False
149 organizations ucas_ids string UCAS IDs for this organization True False False
150 organizations ukprn_ids string UKPRN IDs for this organization True False False
151 organizations wikidata_ids string WikiData IDs for this organization True False False
160 patents associated_grant_ids string Dimensions IDs of the grants associated to the... True False False
169 patents cited_by_ids string Dimensions IDs of the patents that cite this p... True False False
177 patents family_id integer Identifier of the corresponding `EPO patent fa... True False True
183 patents id string Dimensions patent ID True False False
194 patents publication_ids string Dimensions IDs of the publications related to ... True False False
197 patents reference_ids string Dimensions IDs of the patents which are cited ... True False False
212 policy_documents id string Dimensions policy document ID True False False
214 policy_documents publication_ids string Dimensions IDs of the publications related to ... True False False
225 publications altmetric_id integer Altmetric Publication ID True False False
226 publications arxiv_id string The publications arXiv identifier (e.g. ‘arXiv... True False False
242 publications clinical_trial_ids string Clinical Trial IDs mentioned in publications f... True False False
254 publications id string Dimensions publication ID. True False False
264 publications pmcid string PubMed Central ID. True False False
265 publications pmid string PubMed ID. True False False
269 publications reference_ids string Dimensions publication ID for publications in ... True False False
283 publications supporting_grant_ids string Grants supporting a publication, returned as a... True False False
306 reports external_ids json External identifiers available from the report... True False False
310 reports id string Dimensions report ID. True False False
332 researchers id string Dimensions researcher ID. True False False
336 researchers nih_ppid string The PI Profile ID (i.e., ppid) is a Researcher... True False False
338 researchers orcid_id string `ORCID <https://orcid.org/>`_ ID. True False False
343 source_titles id string The Dimensions ID of the source. True False False

What are the external entities that we can put in a list?

  • a list of ISSNs

  • a list of External Grant IDs

  • a list of DOIs (see the example query after this list)

  • a list of categories
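
For example, a list of DOIs can be used directly as a filter. A minimal sketch (the DOIs below are just placeholders):

%%dsldf
search publications
  where doi in ["10.1000/xyz123", "10.1000/182"]
return publications[id+doi+title]
limit 5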

3. Making a list from the results of a query

The list syntax for the Dimensions API is the same as the list syntax for JSON, so we can use Python's json-to-string functions to build a list of IDs from the results of a previous query.
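
For instance, a plain Python list serialized with json.dumps already reads as valid DSL list syntax:

json.dumps(["grid.1008.9", "grid.1002.3"])
# -> '["grid.1008.9", "grid.1002.3"]'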

Let’s run our example query again.

[6]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 1.43s
[6]:
id
0 pub.1141637539
1 pub.1141305289
2 pub.1138102366
3 pub.1135883909
4 pub.1135432149
... ...
95 pub.1134203058
96 pub.1134203025
97 pub.1134202605
98 pub.1134190313
99 pub.1134190104

100 rows × 1 columns

[7]:
json.dumps(list(dsl_last_results.id))


[7]:
'["pub.1141637539", "pub.1141305289", "pub.1138102366", "pub.1135883909", "pub.1135432149", "pub.1134822967", "pub.1134491333", "pub.1134281073", "pub.1134266008", "pub.1134265994", "pub.1134265901", "pub.1134265655", "pub.1134265653", "pub.1134265625", "pub.1134258780", "pub.1134258376", "pub.1134257556", "pub.1134256126", "pub.1134256124", "pub.1134256123", "pub.1134254481", "pub.1134252657", "pub.1134251430", "pub.1134250811", "pub.1134250027", "pub.1134248622", "pub.1134199431", "pub.1134198997", "pub.1134198446", "pub.1134198388", "pub.1134196709", "pub.1134196597", "pub.1134196290", "pub.1134196127", "pub.1134195884", "pub.1134195597", "pub.1134195228", "pub.1134194637", "pub.1134193894", "pub.1134193864", "pub.1134193611", "pub.1134192883", "pub.1134192144", "pub.1134192022", "pub.1133732704", "pub.1133137340", "pub.1133137326", "pub.1133137275", "pub.1132992249", "pub.1132838273", "pub.1128236326", "pub.1134262302", "pub.1134257598", "pub.1134257594", "pub.1134257519", "pub.1134257515", "pub.1134246135", "pub.1134233491", "pub.1134221342", "pub.1134220735", "pub.1134201275", "pub.1134199760", "pub.1134199173", "pub.1134199132", "pub.1134196260", "pub.1134196172", "pub.1134195488", "pub.1134194209", "pub.1134193250", "pub.1134193081", "pub.1134191741", "pub.1134191365", "pub.1134190135", "pub.1134190078", "pub.1134186262", "pub.1134185924", "pub.1134185750", "pub.1134185659", "pub.1134184944", "pub.1134184639", "pub.1134183226", "pub.1134182538", "pub.1134181948", "pub.1133342464", "pub.1133065204", "pub.1132594886", "pub.1132412597", "pub.1131466688", "pub.1134233730", "pub.1134233503", "pub.1134233412", "pub.1134215867", "pub.1134215652", "pub.1134215451", "pub.1134208463", "pub.1134203058", "pub.1134203025", "pub.1134202605", "pub.1134190313", "pub.1134190104"]'

Let’s try to use this list of IDs.

Unfortunately, you can't just put your results directly into the query:

[8]:
%%dsldf
  search publications
  where id in [json.dumps(list(dsl_last_results.id))]

  return publications

Returned Errors: 1
Time: 1.25s
1 QueryError found
4 QuerySyntaxErrors found

4 ParserErrors found
  * [Line 2:15] ('json') no viable alternative at input '[json'
  * [Line 2:26] ('list') extraneous input 'list' expecting {'for', 'in', '('}
  * [Line 2:31] ('dsl_last_results') mismatched input 'dsl_last_results' expecting {'for', 'in', '('}
  * [Line 2:52] (']') mismatched input ']' expecting <EOF>

… so let's get our results back again

[9]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 1.14s
[9]:
id
0 pub.1141637539
1 pub.1141305289
2 pub.1138102366
3 pub.1135883909
4 pub.1135432149
... ...
95 pub.1134203058
96 pub.1134203025
97 pub.1134202605
98 pub.1134190313
99 pub.1134190104

100 rows × 1 columns

… and use the Python way of calling the Dimensions API instead: an f-string interpolates the JSON list into the query text before it is sent.

[10]:
dsl.query(f"""

 search publications
  where id in {json.dumps(list(dsl_last_results.id))}

  return publications


""").as_dataframe()

f"""

 search publications
  where id in {json.dumps(list(dsl_last_results.id))}

  return publications


"""
Returned Publications: 20 (total = 100)
Time: 3.08s
[10]:
'\n\n search publications\n  where id in ["pub.1141637539", "pub.1141305289", "pub.1138102366", "pub.1135883909", "pub.1135432149", "pub.1134822967", "pub.1134491333", "pub.1134281073", "pub.1134266008", "pub.1134265994", "pub.1134265901", "pub.1134265655", "pub.1134265653", "pub.1134265625", "pub.1134258780", "pub.1134258376", "pub.1134257556", "pub.1134256126", "pub.1134256124", "pub.1134256123", "pub.1134254481", "pub.1134252657", "pub.1134251430", "pub.1134250811", "pub.1134250027", "pub.1134248622", "pub.1134199431", "pub.1134198997", "pub.1134198446", "pub.1134198388", "pub.1134196709", "pub.1134196597", "pub.1134196290", "pub.1134196127", "pub.1134195884", "pub.1134195597", "pub.1134195228", "pub.1134194637", "pub.1134193894", "pub.1134193864", "pub.1134193611", "pub.1134192883", "pub.1134192144", "pub.1134192022", "pub.1133732704", "pub.1133137340", "pub.1133137326", "pub.1133137275", "pub.1132992249", "pub.1132838273", "pub.1128236326", "pub.1134262302", "pub.1134257598", "pub.1134257594", "pub.1134257519", "pub.1134257515", "pub.1134246135", "pub.1134233491", "pub.1134221342", "pub.1134220735", "pub.1134201275", "pub.1134199760", "pub.1134199173", "pub.1134199132", "pub.1134196260", "pub.1134196172", "pub.1134195488", "pub.1134194209", "pub.1134193250", "pub.1134193081", "pub.1134191741", "pub.1134191365", "pub.1134190135", "pub.1134190078", "pub.1134186262", "pub.1134185924", "pub.1134185750", "pub.1134185659", "pub.1134184944", "pub.1134184639", "pub.1134183226", "pub.1134182538", "pub.1134181948", "pub.1133342464", "pub.1133065204", "pub.1132594886", "pub.1132412597", "pub.1131466688", "pub.1134233730", "pub.1134233503", "pub.1134233412", "pub.1134215867", "pub.1134215652", "pub.1134215451", "pub.1134208463", "pub.1134203058", "pub.1134203025", "pub.1134202605", "pub.1134190313", "pub.1134190104"]\n\n  return publications\n\n\n'

Putting both parts of this example together

[11]:
# Step 1. Get the list of publications..

pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 100
                """).as_dataframe()

# Step 2. Put the list into the next query...

dsl.query_iterative(f"""
                 search publications
                    where id in {json.dumps(list(pubs.id))}
                    return publications
""").as_dataframe().head(5)
Returned Publications: 100 (total = 45670)
Time: 1.80s
Starting iteration with limit=1000 skip=0 ...
0-100 / 100 (0.80s)
===
Records extracted: 100
[11]:
authors id issue pages title type volume year journal.id journal.title
0 [{'affiliations': [{'city': 'Lisbon', 'city_id... pub.1141637539 3 188-197 How silence facilitates verbal participation article 3 2020 jour.1374325 English Language Teaching Educational Journal
1 [{'affiliations': [{'city': 'Sydney', 'city_id... pub.1141305289 1 83-138 The expression of emotions in Kunbarlang and i... article 27 2020 jour.1129762 Pragmatics & Cognition
2 [{'affiliations': [{'city': None, 'city_id': N... pub.1138102366 110 42-44 Ophthalmic nurses: vital team members in the p... article 33 2020 jour.1099554 Community Eye Health Journal
3 [{'affiliations': [{'city': 'Melbourne', 'city... pub.1135883909 4 1474-1482 Charting a way forward? Post-juristocracy, dem... article 18 2020 jour.1139636 International Journal of Constitutional Law
4 [{'affiliations': [{'city': 'Melbourne', 'city... pub.1135432149 2 100 An Interactive Approach to Contraception Teach... article 14 2020 jour.1051336 South-East Asian Journal of Medical Education

4. Doing something useful: Get all the publications that cite my publications

[12]:
pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 100
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query_iterative(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe().head()
Returned Publications: 100 (total = 45670)
Time: 1.36s
Starting iteration with limit=1000 skip=0 ...
0-391 / 391 (1.44s)
===
Records extracted: 391
[12]:
authors id pages title type volume year journal.id journal.title issue
0 [{'affiliations': [{'city': 'Monastir', 'city_... pub.1144584350 132398 New spiropyrrolothiazole derivatives bearing a... article 1254 2022 jour.1009258 Journal of Molecular Structure NaN
1 [{'affiliations': [{'city': 'Gwangju', 'city_i... pub.1144716570 112636 A new paradigm in sarcopenia: Cognitive impair... article 147 2022 jour.1371653 Biomedicine & Pharmacotherapy NaN
2 [{'affiliations': [{'city': 'Changsha', 'city_... pub.1142243251 133215 Intercalation and covalent bonding strategies ... article 431 2022 jour.1127675 Chemical Engineering Journal NaN
3 [{'affiliations': [{'city': 'Wuhan', 'city_id'... pub.1144581821 104948 Muscle aging amelioration by yeast protein sup... article 89 2022 jour.1041965 Journal of Functional Foods NaN
4 [{'affiliations': [{'city': 'Tehran', 'city_id... pub.1144368116 118310 Peer-to-peer decentralized energy trading fram... article 308 2022 jour.1029519 Applied Energy NaN

5. How long can lists get?

Lists passed to the `in` operator must contain fewer than 512 items, and overall query-string length matters too.

This won't work:

[13]:
pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe()
Returned Publications: 1000 (total = 45670)
Time: 2.89s
Returned Errors: 1
Time: 0.80s
1 QueryError found
Semantic errors found:
        Filter operator 'in' requires 0 < items < 512. '1000 is out of this range'.

This will:

[14]:
pubs = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 250
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe().head(2)
Returned Publications: 250 (total = 45670)
Time: 1.95s
Returned Publications: 20 (total = 1009)
Time: 1.69s
[14]:
authors id pages title type volume year journal.id journal.title issue
0 [{'affiliations': [{'city': 'Monastir', 'city_... pub.1144584350 132398 New spiropyrrolothiazole derivatives bearing a... article 1254 2022 jour.1009258 Journal of Molecular Structure NaN
1 [{'affiliations': [{'city': 'Guangzhou', 'city... pub.1144235680 100917 Designed Biomass Materials for “Green” Electro... article 125 2022 jour.1045007 Progress in Materials Science NaN

What if I need a very long list?

The Dimcli library can break the results of a query up into chunks.

We then loop through each chunk, run a query with it, and stick the results back together again at the end.
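
Under the hood this is just list slicing. A minimal plain-Python sketch of the same idea (not Dimcli's actual implementation):

def chunked(seq, size=250):
    # Yield successive slices of `seq`, each at most `size` items long,
    # small enough for a single `in` filter.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]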

[15]:
# Step 1 - same as before - except now we want the query in chunks

pubs_chunks = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """).chunks(250)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results

query_results = []

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return publications
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

pd.concat(query_results).\
   drop_duplicates(subset='id').\
   head(2)

Returned Publications: 1000 (total = 45670)
Time: 1.78s
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1009 (3.11s)
1000-1009 / 1009 (1.96s)
===
Records extracted: 1009
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1016 (3.28s)
1000-1016 / 1016 (1.46s)
===
Records extracted: 1016
Starting iteration with limit=1000 skip=0 ...
0-871 / 871 (3.43s)
===
Records extracted: 871
Starting iteration with limit=1000 skip=0 ...
0-910 / 910 (2.59s)
===
Records extracted: 910
[15]:
authors id pages title type volume year journal.id journal.title issue
0 [{'affiliations': [{'city': 'Monastir', 'city_... pub.1144584350 132398 New spiropyrrolothiazole derivatives bearing a... article 1254 2022.0 jour.1009258 Journal of Molecular Structure NaN
1 [{'affiliations': [{'city': 'Guangzhou', 'city... pub.1144235680 100917 Designed Biomass Materials for “Green” Electro... article 125 2022.0 jour.1045007 Progress in Materials Science NaN

6. What if I want to get the researchers associated with the publications that cite my institution?

[16]:
# Step 1 - same as before

pubs_chunks = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """).chunks(250)

query_results = []

# Step 2 - same as before, but now we return researchers instead of publications

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return researchers limit 1000
                        """).as_dataframe()
      # Warning: if there are more than 1000 researchers involved in this query, you will miss some
      )

# Step 3 - join the results back together, this time using a groupby statement to sum the counts

my_researchers = pd.concat(query_results).\
                 groupby(['id','first_name','last_name']).\
                  agg({'count':'sum'}).\
                  sort_values(by='count', ascending=False).\
                  head(10)

Returned Publications: 1000 (total = 45670)
Time: 4.17s
Returned Researchers: 1000
Time: 2.60s
Returned Researchers: 1000
Time: 2.38s
Returned Researchers: 1000
Time: 1.57s
Returned Researchers: 1000
Time: 1.65s

7. What if I want to get all the researchers associated with the publications that cite my institution?

[17]:
# Step 1 - same as before

pubs_chunks = dsl.query("""
                  search publications
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id]
                    limit 1000
                """).chunks(250)

query_results = []

# Step 2 - almost the same as before -
#          except now we are asking for the as_dataframe_authors data frame

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return publications[id+title+authors]
                        """).as_dataframe_authors() # I have changed this line from as_dataframe to as_datframe_authors
      )

# Step 3 - join the publications back together

researcher_pubs = pd.concat(query_results).\
                drop_duplicates(subset=['researcher_id','pub_id'])


# Step 4 - count up the publications using a groupby statement

my_researchers = researcher_pubs[researcher_pubs['researcher_id'] != ''].\
    groupby(['researcher_id']).\
    agg({'first_name':'max','last_name':'max','pub_id':'count'}).\
    sort_values(by='pub_id', ascending=False).\
    reset_index()

my_researchers.\
    head(10)

Returned Publications: 1000 (total = 45670)
Time: 1.70s
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1009 (2.13s)
1000-1009 / 1009 (2.78s)
===
Records extracted: 1009
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1016 (5.23s)
1000-1016 / 1016 (3.22s)
===
Records extracted: 1016
Starting iteration with limit=1000 skip=0 ...
0-871 / 871 (4.01s)
===
Records extracted: 871
Starting iteration with limit=1000 skip=0 ...
0-910 / 910 (2.81s)
===
Records extracted: 910
[17]:
researcher_id first_name last_name pub_id
0 ur.0631757532.92 Gin S. Malhi 22
1 ur.015337322331.59 Erica Bell 18
2 ur.012754752112.35 Richard J. Porter 17
3 ur.01115612112.63 Rena Yadlapati 13
4 ur.01234653773.64 Farhad Ravandi 12
5 ur.0735337324.35 Roger T Mulder 11
6 ur.01044576346.22 Frank R. Dunshea 11
7 ur.01224616510.22 John E. Pandolfino 11
8 ur.011606313504.49 Peter J. Kahrilas 10
9 ur.0772267026.25 C. Prakash Gyawali 10

8. … and if we want details about our researchers, we can put our list of researcher IDs into the researchers source

See the researchers source docs for more details.
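
The cell below chunks the dataframe with groupby(np.arange(len(my_researchers)) // 250). A tiny illustration of that idiom, using made-up data:

demo = pd.DataFrame({'researcher_id': ['ur.%s' % i for i in range(600)]})
for g, chunk in demo.groupby(np.arange(len(demo)) // 250):
    print(g, len(chunk))  # prints 0 250, then 1 250, then 2 100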

[18]:
## First, we need to chunk up our researcher list

query_results = []

for g, rschr in my_researchers.groupby(np.arange(len(my_researchers)) // 250):
          # This does *almost* the same thing as the chunks command used above

     myreslist = json.dumps(list(rschr.researcher_id))

     query_results.append(

                  dsl.query_iterative(f"""
                        search researchers
                            where id in {myreslist}
                            return researchers
                        """).as_dataframe() #
      )


pd.concat(query_results).head()
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.44s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-246 / 246 (1.15s)
===
Records extracted: 246
[18]:
first_name id last_name orcid_id research_orgs
0 Jennifer A ur.0777662157.75 Summers [0000-0002-3156-0479] [{'city_name': 'Dunedin', 'country_name': 'New...
1 Francois Henri ur.07764267264.89 Nosten [0000-0002-7951-0745] [{'acronym': 'LSTM', 'city_name': 'Liverpool',...
2 Tony A ur.0775731540.59 Blakely [0000-0002-6995-4369] [{'acronym': 'QUT', 'city_name': 'Brisbane', '...
3 Chandra Prakash ur.0772267026.25 Gyawali NaN [{'acronym': 'QMUL', 'city_name': 'London', 'c...
4 Adam Kenneth ur.0771337646.65 Wheatley [0000-0002-5593-9387] [{'city_name': 'Melbourne', 'country_name': 'A...

9. Patents example (patents -> publications)

Using the same method, we can retrieve all patents citing publications from my institution.

[19]:
%dsldocs patents
dsl_last_results[dsl_last_results['field']=='publication_ids']
[19]:
sources field type description is_filter is_entity is_facet
41 patents publication_ids string Dimensions IDs of the publications related to ... True False False
[20]:
# Step 1 - same as before - except now we want the query in chunks

pubs_chunks = dsl.query_iterative("""
                  search publications
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2015
                    return publications[id]
                """).chunks(250)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed two things below: publications was replaced with patents, and reference_ids was replaced by publication_ids

query_results = []

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search patents
                            where publication_ids in {mypubslist}
                            return patents
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

cited_patents = pd.concat(query_results).\
   drop_duplicates(subset='id')

cited_patents.head(2)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 8884 (2.54s)
1000-2000 / 8884 (1.41s)
2000-3000 / 8884 (2.06s)
3000-4000 / 8884 (1.82s)
4000-5000 / 8884 (1.63s)
5000-6000 / 8884 (3.83s)
6000-7000 / 8884 (1.95s)
7000-8000 / 8884 (1.93s)
8000-8884 / 8884 (1.70s)
===
Records extracted: 8884
Starting iteration with limit=1000 skip=0 ...
0-5 / 5 (0.73s)
===
Records extracted: 5
Starting iteration with limit=1000 skip=0 ...
===
Records extracted: 0
[20]:
assignee_names assignees filing_status id inventor_names publication_date times_cited title year granted_year
0 [PASTEUR INSTITUT, INST NAT SANTE RECH MED] [{'city_name': 'Paris', 'country_name': 'Franc... Application WO-2018015815-A2 [MECHERI SALAH, MENARD ROBERT, DEMARTA-GATSI C... 2018-01-25 0 PLASMODIUM WITH HISTAMINE RELEASING FACTOR (HR... 2017 NaN
1 [KDDI Corp] [{'city_name': 'Tokyo', 'country_name': 'Japan... Grant US-10547380-B2 [ISHIMURA SHOTA] 2020-01-28 0 Polarization tracking device, optical receivin... 2019 2020.0
[21]:
%dsldocs patents
dsl_last_results[dsl_last_results['type']=='organizations']
[21]:
sources field type description is_filter is_entity is_facet
6 patents assignees organizations Disambiguated GRID organisations who own or ha... True True True
19 patents current_assignees organizations Disambiguated GRID organisations currently own... True True True
27 patents funders organizations GRID organisations funding the patent. True True True
37 patents original_assignees organizations Disambiguated GRID organisations that first ow... True True True
[22]:
import json
cited_patents_assignees = cited_patents.explode('assignees')

cited_patents_assignees['assignee_grid_id'] = cited_patents_assignees['assignees'].\
    apply(lambda g: g['id'] if type(g) == dict else 0 )

cited_patents_assignees['assignee_name'] = cited_patents_assignees['assignees'].\
    apply(lambda g: g['name'] if type(g) == dict else 0 )

cited_patents_assignees.\
    groupby(['assignee_grid_id','assignee_name']).\
    agg({'id':'count'}).\
    sort_values(by='id', ascending=False).\
    head(20)
[22]:
id
assignee_grid_id assignee_name
0 0 126
grid.4280.e National University of Singapore 15
grid.7429.8 French Institute of Health and Medical Research 14
grid.453420.4 SingHealth 13
grid.431072.3 AbbVie (United States) 10
grid.116068.8 Massachusetts Institute of Technology 10
grid.4444.0 French National Centre for Scientific Research 8
grid.453773.1 Wisconsin Alumni Research Foundation 7
grid.1008.9 University of Melbourne 7
grid.417921.8 Incyte (United States) 6
grid.421350.1 Northrop Grumman (United States) 6
grid.6363.0 Charité - University Medicine Berlin 6
grid.1042.7 Walter and Eliza Hall Institute of Medical Research 5
grid.16753.36 Northwestern University 5
grid.467402.3 Helsinn (Switzerland) 5
grid.51462.34 Memorial Sloan Kettering Cancer Center 5
grid.1058.c Murdoch Children's Research Institute 5
grid.422530.2 Waters (United Kingdom) 4
grid.420451.6 Google (United States) 4
grid.168010.e Stanford University 4

10. Clinical Trials (clinical trials -> publications)

Using the same method, we can retrieve all clinical trials citing publications from my institution.

[23]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['field']=='research_orgs']
[23]:
sources field type description is_filter is_entity is_facet
30 clinical_trials research_orgs organizations GRID organizations involved, e.g. as sponsors ... True True True
[25]:
# Step 1 - same as before - except now we want the query in chunks

clinical_trials_chunks = dsl.query_iterative("""
                  search publications
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2015
                    return publications[id]
                """).chunks(400)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed two things below: publications was replaced with clinical_trials, and reference_ids was replaced by publication_ids

query_results = []

for c in clinical_trials_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search clinical_trials
                            where publication_ids in {mypubslist}
                            return clinical_trials[basics+research_orgs]
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

cited_clinical_trials = pd.concat(query_results).\
   drop_duplicates(subset='id')

cited_clinical_trials.head(2)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 8884 (3.10s)
1000-2000 / 8884 (2.89s)
2000-3000 / 8884 (3.20s)
3000-4000 / 8884 (3.63s)
4000-5000 / 8884 (3.87s)
5000-6000 / 8884 (3.81s)
6000-7000 / 8884 (1.63s)
7000-8000 / 8884 (1.48s)
8000-8884 / 8884 (1.90s)
===
Records extracted: 8884
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (1.13s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
===
Records extracted: 0
[25]:
active_years id investigators title research_orgs
0 [2021, 2022, 2023, 2024] NCT05170594 [[Haiyan Liu, , Principal Investigator, The Se... A Prospective Study of Bevacizumab Combined Wi... NaN
1 [2021, 2022, 2023] NCT05147818 NaN Risk Factors for Acute Kidney Injury in Diabet... [{'city_name': 'Assiut', 'country_name': 'Egyp...
[28]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['type']=='organizations']
[28]:
sources field type description is_filter is_entity is_facet
19 clinical_trials funders organizations GRID funding organisations that are involved w... True True True
30 clinical_trials research_orgs organizations GRID organizations involved, e.g. as sponsors ... True True True
[29]:
cited_clinical_trials_orgs = cited_clinical_trials.explode('research_orgs')

cited_clinical_trials_orgs['research_orgs_grid_id'] = cited_clinical_trials_orgs['research_orgs'].\
    apply(lambda g: g['id'] if type(g) == dict else 0 )

cited_clinical_trials_orgs['research_orgs_name'] = cited_clinical_trials_orgs['research_orgs'].\
    apply(lambda g: g['name'] if type(g) == dict else 0 )

cited_clinical_trials_orgs.\
    groupby(['research_orgs_grid_id','research_orgs_name']).\
    agg({'id':'count'}).\
    sort_values(by='id', ascending=False).\
    head(20)
[29]:
id
research_orgs_grid_id research_orgs_name
grid.1002.3 Monash University 6
0 0 4
grid.4367.6 Washington University in St. Louis 4
grid.506076.2 Istanbul University Cerrahpaşa 4
grid.223827.e University of Utah 4
grid.84393.35 Hospital Universitari i Politècnic La Fe 4
grid.411438.b Hospital Universitari Germans Trias i Pujol 3
grid.194645.b University of Hong Kong 3
grid.21925.3d University of Pittsburgh 3
grid.248762.d BC Cancer Agency 3
grid.2515.3 Boston Children's Hospital 3
grid.252487.e Assiut University 3
grid.412860.9 Wake Forest Baptist Medical Center 3
grid.428313.f Corporació Sanitària Parc Taulí 3
grid.266102.1 University of California, San Francisco 3
grid.81821.32 Hospital Universitario La Paz 3
grid.411052.3 Central University Hospital of Asturias 3
grid.411349.a Reina Sofia Hospital 3
grid.417574.4 Abbott (United States) 2
grid.440627.3 University of the Andes, Chile 2

11. Grants (publications -> grants)

Using the same method, we can retrieve all grants funding publications from my institution.

[30]:
%dsldocs publications
dsl_last_results[dsl_last_results['field'].str.contains('ids')]
[30]:
sources field type description is_filter is_entity is_facet
20 publications clinical_trial_ids string Clinical Trial IDs mentioned in publications f... True False False
47 publications reference_ids string Dimensions publication ID for publications in ... True False False
61 publications supporting_grant_ids string Grants supporting a publication, returned as a... True False False
[38]:
# Step 1 - same as before - except this time we keep the full result as a dataframe

publications = dsl.query_iterative("""
                  search publications
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2020
                    return publications[id+supporting_grant_ids]
                """).as_dataframe()

# Step 2 - we can get the grants IDs directly from publications this time.
# So as a second step, we want to pull grants metadata using these identifiers.

pubs_grants = publications.explode('supporting_grant_ids')

grants_from_pubs = pd.DataFrame(pubs_grants.supporting_grant_ids.unique()).\
                   dropna().\
                   rename(columns={0:'id'})

query_results = []

for g, gnts in grants_from_pubs.groupby(np.arange(len(grants_from_pubs)) // 250):
          # This does *almost* the same thing as the chunks command used above

      myglist = json.dumps(list(gnts.id))

      query_results.append(

                  dsl.query_iterative(f"""
                        search grants
                            where id in {myglist}
                          return grants[basics+funding_usd]
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

grant_details = pd.concat(query_results).\
   drop_duplicates(subset='id')

grant_details.head(5)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 14092 (1.94s)
1000-2000 / 14092 (1.86s)
2000-3000 / 14092 (1.22s)
3000-4000 / 14092 (1.21s)
4000-5000 / 14092 (4.85s)
5000-6000 / 14092 (1.20s)
6000-7000 / 14092 (1.26s)
7000-8000 / 14092 (1.19s)
8000-9000 / 14092 (1.18s)
9000-10000 / 14092 (1.26s)
10000-11000 / 14092 (1.38s)
11000-12000 / 14092 (1.25s)
12000-13000 / 14092 (1.23s)
13000-14000 / 14092 (1.18s)
14000-14092 / 14092 (1.13s)
===
Records extracted: 14092
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.01s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-247 / 247 (0.97s)
===
Records extracted: 247
[38]:
active_year end_date funders funding_org_name funding_usd grant_number id language original_title start_date start_year title
0 [2021, 2022, 2023, 2024] 2024-03-31 [{'acronym': 'DOE SC', 'city_name': 'Washingto... Office of Science 14773638.0 DE-SC0007881 grant.4320658 en Particle Physics and Cosmology Research 2021-07-01 2021.0 Particle Physics and Cosmology Research
1 [2020, 2021, 2022] 2022-03-31 [{'acronym': 'JSPS', 'city_name': 'Tokyo', 'co... Japan Society for the Promotion of Science 62350.0 20K21008 grant.9399908 ja Analysis of bounding flight in birds by dynami... 2020-07-30 2020.0 Analysis of bounding flight in birds by dynami...
2 [2020, 2021, 2022, 2023, 2024, 2025] 2025-06-30 [{'acronym': 'NIAID', 'city_name': 'Bethesda',... National Institute of Allergy and Infectious D... 1587803.0 R01AI153400 grant.9293892 en Polymerase Inhibitors of Respiratory Syncytial... 2020-07-01 2020.0 Polymerase Inhibitors of Respiratory Syncytial...
3 [2020, 2021, 2022, 2023, 2024, 2025] 2025-05-31 [{'acronym': 'NIAID', 'city_name': 'Bethesda',... National Institute of Allergy and Infectious D... 1082990.0 R01AI148407 grant.9293759 en Investigating the basis of MAIT cell antigen p... 2020-06-15 2020.0 Investigating the basis of MAIT cell antigen p...
4 [2020, 2021, 2022, 2023] 2023-06-01 [{'acronym': 'ARC', 'city_name': 'Canberra', '... Australian Research Council 333250.0 DE200100264 grant.8676066 en Giving Mums a fair go: culturally-responsive c... 2020-06-01 2020.0 Giving Mums a fair go: culturally-responsive c...
[37]:
pubs_grants.groupby('supporting_grant_ids').\
    agg({'id':'count'}).\
    reset_index().\
    rename(columns={'id':'pubs','supporting_grant_ids':'id'}).\
    merge(grant_details[['id','original_title','funding_usd']],
          on='id').\
    sort_values(by='pubs', ascending=False)

[37]:
id pubs original_title funding_usd
6803 grant.8470130 63 Consolidated Grant 2019 2724578.0
6811 grant.8470469 61 Particle Physics Consolidated Grant from the U... 3097905.0
6273 grant.7924120 61 GRIDPP PROJECT COORDINATION NaN
2155 grant.4320658 57 Particle Physics and Cosmology Research 14773638.0
206 grant.2438793 55 Cancer Center Support Grant 266136064.0
... ... ... ... ...
3151 grant.6447242 1 Participatory Urban Living for Sustainable Env... 5696332.0
3150 grant.6447136 1 Towards Novel Inert (Photo-)toxic Ru(II) Polyp... 2317685.0
3149 grant.6446143 1 Understanding how an MLH1 promoter polymorphis... 952550.0
3148 grant.6446046 1 Macronutrients and Metabolic Health - Understa... 2822859.0
3997 grant.6803312 1 Macrophage Dysfunction in Obesity, Diabetes an... 12071621.0

7995 rows × 4 columns

Conclusions

Lists are a simple data structure with a great number of applications.

When used in conjunction with the DSL, they make it easy to chain the results of one query into another, e.g. in order to navigate through the links available in Dimensions (from publications to grants, patents, etc.).

See also this patents tutorial or this clinical trials tutorial for more in-depth applications of the queries discussed above.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
