Working with lists in the Dimensions API¶
In this notebook we are going to show:
How to use lists in order to write more efficient DSL queries
How lists can be used to concatenate the results of one query with another query
How these methods can be used for real-world applications, e.g. getting publications/patents/grants that cite my publications
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[1]:
!pip install dimcli -U --quiet
import dimcli
from dimcli.utils import *
import sys
import json
import pandas as pd
import numpy as np
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
import getpass
KEY = getpass.getpass(prompt='API Key: ')
dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
KEY = ""
dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
1. How do we use lists in the Dimensions API?¶
We use lists in the API because they make queries easier to read and easier to work with.
Here is a query without lists.
How many publications were produced by either Monash University or the University of Melbourne (grid.1002.3, grid.1008.9) in either 2019 or 2020? Be really careful with your brackets!
[2]:
%%dsldf
search publications
where
(
research_orgs.id = "grid.1008.9"
or research_orgs.id = "grid.1002.3"
)
and (
year = 2019
or year = 2020
)
return publications
limit 1
Returned Publications: 1 (total = 45670)
Time: 0.71s
[2]:
authors | id | issue | pages | title | type | volume | year | journal.id | journal.title | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Lisbon', 'city_id... | pub.1141637539 | 3 | 188-197 | How silence facilitates verbal participation | article | 3 | 2020 | jour.1374325 | English Language Teaching Educational Journal |
The query above could get really messy. What if I wanted 20 institutions? What if I wanted the last ten years? (or, or, or, or, or…) and (or, or, or, or, or…)
By using lists we can quickly add a large number of conditions by means of an easy-to-read square-bracket notation:
[3]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 2.34s
[3]:
id | |
---|---|
0 | pub.1141637539 |
1 | pub.1141305289 |
2 | pub.1138102366 |
3 | pub.1135883909 |
4 | pub.1135432149 |
... | ... |
95 | pub.1134203058 |
96 | pub.1134203025 |
97 | pub.1134202605 |
98 | pub.1134190313 |
99 | pub.1134190104 |
100 rows × 1 columns
2. What are all the things that we can make lists of in the Dimensions API?¶
What are the internal entities that we might put in a list?¶
[4]:
%dsldocs
dsl_last_results[dsl_last_results['is_entity']==True]
[4]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
6 | clinical_trials | category_bra | categories | `Broad Research Areas <https://app.dimensions.... | True | True | True |
7 | clinical_trials | category_for | categories | `ANZSRC Fields of Research classification <htt... | True | True | True |
8 | clinical_trials | category_hra | categories | `Health Research Areas <https://dimensions.fre... | True | True | True |
9 | clinical_trials | category_hrcs_hc | categories | `HRCS - Health Categories <https://dimensions.... | True | True | True |
10 | clinical_trials | category_hrcs_rac | categories | `HRCS – Research Activity Codes <https://dimen... | True | True | True |
... | ... | ... | ... | ... | ... | ... | ... |
321 | reports | responsible_orgs_cities | cities | Cities of the organisations responsible for th... | True | True | True |
322 | reports | responsible_orgs_countries | countries | Countries of the organisations responsible for... | True | True | True |
324 | reports | responsible_orgs_states | states | States of the organisations responsible for th... | True | True | True |
327 | researchers | current_research_org | organizations | The most recent research organization linked t... | True | True | True |
340 | researchers | research_orgs | organizations | All research organizations linked to the resea... | True | True | True |
123 rows × 7 columns
What about lists of IDs?¶
[5]:
%dsldocs
dsl_last_results[dsl_last_results['field'].str.contains('id')==True]
[5]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
4 | clinical_trials | associated_grant_ids | string | Dimensions IDs of the grants associated to the... | True | False | False |
21 | clinical_trials | id | string | Dimensions clinical trial ID | True | False | False |
27 | clinical_trials | publication_ids | string | Publications mentioned in clinical trials (exc... | True | False | False |
33 | datasets | associated_grant_ids | string | The Dimensions IDs of the grants linked to the... | True | False | False |
35 | datasets | associated_publication_id | string | The Dimensions ID of the publication linked to... | True | False | False |
57 | datasets | id | string | Dimensions dataset ID. | True | False | False |
64 | datasets | repository_id | string | The ID of the repository of the dataset. | True | False | True |
105 | grants | id | string | Dimensions grant ID. | True | False | False |
122 | organizations | cnrs_ids | string | CNRS IDs for this organization | True | False | False |
126 | organizations | external_ids_fundref | string | Fundref IDs for this organization | True | False | False |
127 | organizations | hesa_ids | string | HESA IDs for this organization | True | False | False |
128 | organizations | id | string | GRID ID of the organization. E.g., "grid.26999... | True | False | False |
129 | organizations | isni_ids | string | ISNI IDs for this organization | True | False | False |
140 | organizations | organization_child_ids | string | Child organization IDs | True | False | False |
141 | organizations | organization_parent_ids | string | Parent organization IDs | True | False | False |
142 | organizations | organization_related_ids | string | Related organization IDs | True | False | False |
143 | organizations | orgref_ids | string | OrgRef IDs for this organization | True | False | False |
145 | organizations | ror_ids | string | ROR IDs for this organization | True | False | False |
149 | organizations | ucas_ids | string | UCAS IDs for this organization | True | False | False |
150 | organizations | ukprn_ids | string | UKPRN IDs for this organization | True | False | False |
151 | organizations | wikidata_ids | string | WikiData IDs for this organization | True | False | False |
160 | patents | associated_grant_ids | string | Dimensions IDs of the grants associated to the... | True | False | False |
169 | patents | cited_by_ids | string | Dimensions IDs of the patents that cite this p... | True | False | False |
177 | patents | family_id | integer | Identifier of the corresponding `EPO patent fa... | True | False | True |
183 | patents | id | string | Dimensions patent ID | True | False | False |
194 | patents | publication_ids | string | Dimensions IDs of the publications related to ... | True | False | False |
197 | patents | reference_ids | string | Dimensions IDs of the patents which are cited ... | True | False | False |
212 | policy_documents | id | string | Dimensions policy document ID | True | False | False |
214 | policy_documents | publication_ids | string | Dimensions IDs of the publications related to ... | True | False | False |
225 | publications | altmetric_id | integer | Altmetric Publication ID | True | False | False |
226 | publications | arxiv_id | string | The publications arXiv identifier (e.g. ‘arXiv... | True | False | False |
242 | publications | clinical_trial_ids | string | Clinical Trial IDs mentioned in publications f... | True | False | False |
254 | publications | id | string | Dimensions publication ID. | True | False | False |
264 | publications | pmcid | string | PubMed Central ID. | True | False | False |
265 | publications | pmid | string | PubMed ID. | True | False | False |
269 | publications | reference_ids | string | Dimensions publication ID for publications in ... | True | False | False |
283 | publications | supporting_grant_ids | string | Grants supporting a publication, returned as a... | True | False | False |
306 | reports | external_ids | json | External identifiers available from the report... | True | False | False |
310 | reports | id | string | Dimensions report ID. | True | False | False |
332 | researchers | id | string | Dimensions researcher ID. | True | False | False |
336 | researchers | nih_ppid | string | The PI Profile ID (i.e., ppid) is a Researcher... | True | False | False |
338 | researchers | orcid_id | string | `ORCID <https://orcid.org/>`_ ID. | True | False | False |
343 | source_titles | id | string | The Dimensions ID of the source. | True | False | False |
What are the external entities that we can put in a list?¶
a list of ISSNs
a list of External Grant IDs
a list of DOIs
a list of categories
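For example, here is a minimal sketch of filtering by one of these external identifiers, using a list of DOIs (the DOI values below are placeholders, not real identifiers):
# Hypothetical example: filter publications by an external-entity list (DOIs).
# The DOI strings are placeholders only.
dsl.query("""
search publications
    where doi in ["10.1000/placeholder.1", "10.1000/placeholder.2"]
    return publications[id+doi+title]
""").as_dataframe()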
3. Making a list from the results of a query¶
The list syntax for the Dimensions API is the same as the list syntax for JSON, so we can use Python's json-to-string functions to build a list of IDs for us from the previous query.
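For instance, a minimal illustration (reusing two of the publication IDs returned above): json.dumps turns a Python list of strings into exactly the square-bracket, double-quoted notation the DSL expects.
import json
ids = ["pub.1141637539", "pub.1141305289"]
json.dumps(ids)   # -> '["pub.1141637539", "pub.1141305289"]'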
Let’s run our example query again.
[6]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 1.43s
[6]:
id | |
---|---|
0 | pub.1141637539 |
1 | pub.1141305289 |
2 | pub.1138102366 |
3 | pub.1135883909 |
4 | pub.1135432149 |
... | ... |
95 | pub.1134203058 |
96 | pub.1134203025 |
97 | pub.1134202605 |
98 | pub.1134190313 |
99 | pub.1134190104 |
100 rows × 1 columns
[7]:
json.dumps(list(dsl_last_results.id))
[7]:
'["pub.1141637539", "pub.1141305289", "pub.1138102366", "pub.1135883909", "pub.1135432149", "pub.1134822967", "pub.1134491333", "pub.1134281073", "pub.1134266008", "pub.1134265994", "pub.1134265901", "pub.1134265655", "pub.1134265653", "pub.1134265625", "pub.1134258780", "pub.1134258376", "pub.1134257556", "pub.1134256126", "pub.1134256124", "pub.1134256123", "pub.1134254481", "pub.1134252657", "pub.1134251430", "pub.1134250811", "pub.1134250027", "pub.1134248622", "pub.1134199431", "pub.1134198997", "pub.1134198446", "pub.1134198388", "pub.1134196709", "pub.1134196597", "pub.1134196290", "pub.1134196127", "pub.1134195884", "pub.1134195597", "pub.1134195228", "pub.1134194637", "pub.1134193894", "pub.1134193864", "pub.1134193611", "pub.1134192883", "pub.1134192144", "pub.1134192022", "pub.1133732704", "pub.1133137340", "pub.1133137326", "pub.1133137275", "pub.1132992249", "pub.1132838273", "pub.1128236326", "pub.1134262302", "pub.1134257598", "pub.1134257594", "pub.1134257519", "pub.1134257515", "pub.1134246135", "pub.1134233491", "pub.1134221342", "pub.1134220735", "pub.1134201275", "pub.1134199760", "pub.1134199173", "pub.1134199132", "pub.1134196260", "pub.1134196172", "pub.1134195488", "pub.1134194209", "pub.1134193250", "pub.1134193081", "pub.1134191741", "pub.1134191365", "pub.1134190135", "pub.1134190078", "pub.1134186262", "pub.1134185924", "pub.1134185750", "pub.1134185659", "pub.1134184944", "pub.1134184639", "pub.1134183226", "pub.1134182538", "pub.1134181948", "pub.1133342464", "pub.1133065204", "pub.1132594886", "pub.1132412597", "pub.1131466688", "pub.1134233730", "pub.1134233503", "pub.1134233412", "pub.1134215867", "pub.1134215652", "pub.1134215451", "pub.1134208463", "pub.1134203058", "pub.1134203025", "pub.1134202605", "pub.1134190313", "pub.1134190104"]'
Let’s try to use this list of IDs.
Unfortunately, you can't just paste a Python expression straight into the query; the DSL does not evaluate Python:
[8]:
%%dsldf
search publications
where id in [json.dumps(list(dsl_last_results.id))]
return publications
Returned Errors: 1
Time: 1.25s
1 QueryError found
4 QuerySyntaxErrors found
4 ParserErrors found
* [Line 2:15] ('json') no viable alternative at input '[json'
* [Line 2:26] ('list') extraneous input 'list' expecting {'for', 'in', '('}
* [Line 2:31] ('dsl_last_results') mismatched input 'dsl_last_results' expecting {'for', 'in', '('}
* [Line 2:52] (']') mismatched input ']' expecting <EOF>
…so let’s get our results back again
[9]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 1.14s
[9]:
id | |
---|---|
0 | pub.1141637539 |
1 | pub.1141305289 |
2 | pub.1138102366 |
3 | pub.1135883909 |
4 | pub.1135432149 |
... | ... |
95 | pub.1134203058 |
96 | pub.1134203025 |
97 | pub.1134202605 |
98 | pub.1134190313 |
99 | pub.1134190104 |
100 rows × 1 columns
…and use the Python way of calling the Dimensions API instead
[10]:
# Run the query, using a Python f-string to splice the list of IDs into the DSL
dsl.query(f"""
search publications
where id in {json.dumps(list(dsl_last_results.id))}
return publications
""").as_dataframe()
# Displaying the same f-string shows what the expanded query actually looks like
f"""
search publications
where id in {json.dumps(list(dsl_last_results.id))}
return publications
"""
Returned Publications: 20 (total = 100)
Time: 3.08s
[10]:
'\n\n search publications\n where id in ["pub.1141637539", "pub.1141305289", "pub.1138102366", "pub.1135883909", "pub.1135432149", "pub.1134822967", "pub.1134491333", "pub.1134281073", "pub.1134266008", "pub.1134265994", "pub.1134265901", "pub.1134265655", "pub.1134265653", "pub.1134265625", "pub.1134258780", "pub.1134258376", "pub.1134257556", "pub.1134256126", "pub.1134256124", "pub.1134256123", "pub.1134254481", "pub.1134252657", "pub.1134251430", "pub.1134250811", "pub.1134250027", "pub.1134248622", "pub.1134199431", "pub.1134198997", "pub.1134198446", "pub.1134198388", "pub.1134196709", "pub.1134196597", "pub.1134196290", "pub.1134196127", "pub.1134195884", "pub.1134195597", "pub.1134195228", "pub.1134194637", "pub.1134193894", "pub.1134193864", "pub.1134193611", "pub.1134192883", "pub.1134192144", "pub.1134192022", "pub.1133732704", "pub.1133137340", "pub.1133137326", "pub.1133137275", "pub.1132992249", "pub.1132838273", "pub.1128236326", "pub.1134262302", "pub.1134257598", "pub.1134257594", "pub.1134257519", "pub.1134257515", "pub.1134246135", "pub.1134233491", "pub.1134221342", "pub.1134220735", "pub.1134201275", "pub.1134199760", "pub.1134199173", "pub.1134199132", "pub.1134196260", "pub.1134196172", "pub.1134195488", "pub.1134194209", "pub.1134193250", "pub.1134193081", "pub.1134191741", "pub.1134191365", "pub.1134190135", "pub.1134190078", "pub.1134186262", "pub.1134185924", "pub.1134185750", "pub.1134185659", "pub.1134184944", "pub.1134184639", "pub.1134183226", "pub.1134182538", "pub.1134181948", "pub.1133342464", "pub.1133065204", "pub.1132594886", "pub.1132412597", "pub.1131466688", "pub.1134233730", "pub.1134233503", "pub.1134233412", "pub.1134215867", "pub.1134215652", "pub.1134215451", "pub.1134208463", "pub.1134203058", "pub.1134203025", "pub.1134202605", "pub.1134190313", "pub.1134190104"]\n\n return publications\n\n\n'
Putting both parts of this example together¶
[11]:
# Step 1. Get the list of publications..
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
""").as_dataframe()
# Step 2. Put the list into the next query...
dsl.query_iterative(f"""
search publications
where id in {json.dumps(list(pubs.id))}
return publications
""").as_dataframe().head(5)
Returned Publications: 100 (total = 45670)
Time: 1.80s
Starting iteration with limit=1000 skip=0 ...
0-100 / 100 (0.80s)
===
Records extracted: 100
[11]:
authors | id | issue | pages | title | type | volume | year | journal.id | journal.title | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Lisbon', 'city_id... | pub.1141637539 | 3 | 188-197 | How silence facilitates verbal participation | article | 3 | 2020 | jour.1374325 | English Language Teaching Educational Journal |
1 | [{'affiliations': [{'city': 'Sydney', 'city_id... | pub.1141305289 | 1 | 83-138 | The expression of emotions in Kunbarlang and i... | article | 27 | 2020 | jour.1129762 | Pragmatics & Cognition |
2 | [{'affiliations': [{'city': None, 'city_id': N... | pub.1138102366 | 110 | 42-44 | Ophthalmic nurses: vital team members in the p... | article | 33 | 2020 | jour.1099554 | Community Eye Health Journal |
3 | [{'affiliations': [{'city': 'Melbourne', 'city... | pub.1135883909 | 4 | 1474-1482 | Charting a way forward? Post-juristocracy, dem... | article | 18 | 2020 | jour.1139636 | International Journal of Constitutional Law |
4 | [{'affiliations': [{'city': 'Melbourne', 'city... | pub.1135432149 | 2 | 100 | An Interactive Approach to Contraception Teach... | article | 14 | 2020 | jour.1051336 | South-East Asian Journal of Medical Education |
4. Doing something useful: Get all the publications that cite my publications¶
[12]:
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
""")
mypubslist = json.dumps(list(pubs.as_dataframe().id))
dsl.query_iterative(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe().head()
Returned Publications: 100 (total = 45670)
Time: 1.36s
Starting iteration with limit=1000 skip=0 ...
0-391 / 391 (1.44s)
===
Records extracted: 391
[12]:
authors | id | pages | title | type | volume | year | journal.id | journal.title | issue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Monastir', 'city_... | pub.1144584350 | 132398 | New spiropyrrolothiazole derivatives bearing a... | article | 1254 | 2022 | jour.1009258 | Journal of Molecular Structure | NaN |
1 | [{'affiliations': [{'city': 'Gwangju', 'city_i... | pub.1144716570 | 112636 | A new paradigm in sarcopenia: Cognitive impair... | article | 147 | 2022 | jour.1371653 | Biomedicine & Pharmacotherapy | NaN |
2 | [{'affiliations': [{'city': 'Changsha', 'city_... | pub.1142243251 | 133215 | Intercalation and covalent bonding strategies ... | article | 431 | 2022 | jour.1127675 | Chemical Engineering Journal | NaN |
3 | [{'affiliations': [{'city': 'Wuhan', 'city_id'... | pub.1144581821 | 104948 | Muscle aging amelioration by yeast protein sup... | article | 89 | 2022 | jour.1041965 | Journal of Functional Foods | NaN |
4 | [{'affiliations': [{'city': 'Tehran', 'city_id... | pub.1144368116 | 118310 | Peer-to-peer decentralized energy trading fram... | article | 308 | 2022 | jour.1029519 | Applied Energy | NaN |
5. How long can lists get?¶
The limit depends partly on the total string length of the query, but there is also a hard cap: the 'in' filter accepts fewer than 512 items.
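As a plain-Python sketch (assuming a hypothetical list of IDs called all_ids), you could batch a long list yourself to stay under that cap; Dimcli's chunks() helper, used further down, automates the same idea.
# Split a long list of IDs into batches of at most 500 items,
# so that each 'in' filter stays below the 512-item limit.
def batches(items, size=500):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g.: for batch in batches(all_ids): run one query per batch (all_ids is hypothetical)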
This won’t work¶
[13]:
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""")
mypubslist = json.dumps(list(pubs.as_dataframe().id))
dsl.query(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe()
Returned Publications: 1000 (total = 45670)
Time: 2.89s
Returned Errors: 1
Time: 0.80s
1 QueryError found
Semantic errors found:
Filter operator 'in' requires 0 < items < 512. '1000 is out of this range'.
This will¶
[14]:
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 250
""")
mypubslist = json.dumps(list(pubs.as_dataframe().id))
dsl.query(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe().head(2)
Returned Publications: 250 (total = 45670)
Time: 1.95s
Returned Publications: 20 (total = 1009)
Time: 1.69s
[14]:
authors | id | pages | title | type | volume | year | journal.id | journal.title | issue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Monastir', 'city_... | pub.1144584350 | 132398 | New spiropyrrolothiazole derivatives bearing a... | article | 1254 | 2022 | jour.1009258 | Journal of Molecular Structure | NaN |
1 | [{'affiliations': [{'city': 'Guangzhou', 'city... | pub.1144235680 | 100917 | Designed Biomass Materials for “Green” Electro... | article | 125 | 2022 | jour.1045007 | Progress in Materials Science | NaN |
What if I need a very long list?¶
The Dimcli library can break the results of a query up into chunks.
We then loop through each chunk, run a query with it, and stick the results back together again at the end.
[15]:
# Step 1 - same as before - except now we want the query in chunks
pubs_chunks = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""").chunks(250)
# Step 2 - almost the same as before - except now we use a for loop to loop through our results
query_results = []
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
pd.concat(query_results).\
drop_duplicates(subset='id').\
head(2)
Returned Publications: 1000 (total = 45670)
Time: 1.78s
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1009 (3.11s)
1000-1009 / 1009 (1.96s)
===
Records extracted: 1009
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1016 (3.28s)
1000-1016 / 1016 (1.46s)
===
Records extracted: 1016
Starting iteration with limit=1000 skip=0 ...
0-871 / 871 (3.43s)
===
Records extracted: 871
Starting iteration with limit=1000 skip=0 ...
0-910 / 910 (2.59s)
===
Records extracted: 910
[15]:
authors | id | pages | title | type | volume | year | journal.id | journal.title | issue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Monastir', 'city_... | pub.1144584350 | 132398 | New spiropyrrolothiazole derivatives bearing a... | article | 1254 | 2022.0 | jour.1009258 | Journal of Molecular Structure | NaN |
1 | [{'affiliations': [{'city': 'Guangzhou', 'city... | pub.1144235680 | 100917 | Designed Biomass Materials for “Green” Electro... | article | 125 | 2022.0 | jour.1045007 | Progress in Materials Science | NaN |
6. What if I want to get the researchers associated with the publications that cite my institution?¶
[16]:
# Step 1 - same as before
pubs_chunks = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""").chunks(250)
query_results = []
# Step 2 - same as before, but now we return researchers instead of publications
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query(f"""
search publications
where reference_ids in {mypubslist}
return researchers limit 1000
""").as_dataframe()
# Warning: if more than 1000 researchers are involved in this query, some will be missed
)
# Step 3 - join the queries back together, this time using a groupby statement to sum the counts
my_researchers = pd.concat(query_results).\
groupby(['id','first_name','last_name']).\
agg({'count':'sum'}).\
sort_values(by='count', ascending=False).\
head(10)
Returned Publications: 1000 (total = 45670)
Time: 4.17s
Returned Researchers: 1000
Time: 2.60s
Returned Researchers: 1000
Time: 2.38s
Returned Researchers: 1000
Time: 1.57s
Returned Researchers: 1000
Time: 1.65s
7. What if I want to get all the researchers associated with the publications that cite my institution?¶
[17]:
# Step 1 - same as before
pubs_chunks = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""").chunks(250)
query_results = []
# Step 2 - almost the same as before -
# except now we are asking for the as_dataframe_authors data frame
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search publications
where reference_ids in {mypubslist}
return publications[id+title+authors]
""").as_dataframe_authors() # I have changed this line from as_dataframe to as_datframe_authors
)
# Step 3 - join the publications back together
researcher_pubs = pd.concat(query_results).\
drop_duplicates(subset=['researcher_id','pub_id'])
# Step 4 - count up the publications using a groupby statement
my_researchers = researcher_pubs[researcher_pubs['researcher_id'] != ''].\
groupby(['researcher_id']).\
agg({'first_name':'max','last_name':'max','pub_id':'count'}).\
sort_values(by='pub_id', ascending=False).\
reset_index()
my_researchers.\
head(10)
Returned Publications: 1000 (total = 45670)
Time: 1.70s
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1009 (2.13s)
1000-1009 / 1009 (2.78s)
===
Records extracted: 1009
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1016 (5.23s)
1000-1016 / 1016 (3.22s)
===
Records extracted: 1016
Starting iteration with limit=1000 skip=0 ...
0-871 / 871 (4.01s)
===
Records extracted: 871
Starting iteration with limit=1000 skip=0 ...
0-910 / 910 (2.81s)
===
Records extracted: 910
[17]:
researcher_id | first_name | last_name | pub_id | |
---|---|---|---|---|
0 | ur.0631757532.92 | Gin S. | Malhi | 22 |
1 | ur.015337322331.59 | Erica | Bell | 18 |
2 | ur.012754752112.35 | Richard J. | Porter | 17 |
3 | ur.01115612112.63 | Rena | Yadlapati | 13 |
4 | ur.01234653773.64 | Farhad | Ravandi | 12 |
5 | ur.0735337324.35 | Roger T | Mulder | 11 |
6 | ur.01044576346.22 | Frank R. | Dunshea | 11 |
7 | ur.01224616510.22 | John E. | Pandolfino | 11 |
8 | ur.011606313504.49 | Peter J. | Kahrilas | 10 |
9 | ur.0772267026.25 | C. Prakash | Gyawali | 10 |
8. …and if we want details about our researchers, we can put our list of researcher IDs into the researchers source¶
See the researcher source docs for more details.
[18]:
## First, we need to chunk up our researcher list
query_results = []
for g, rschr in my_researchers.groupby(np.arange(len(my_researchers)) // 250):
# This does *almost* the same thing as the chunks command used above
myreslist = json.dumps(list(rschr.researcher_id))
query_results.append(
dsl.query_iterative(f"""
search researchers
where id in {myreslist}
return researchers
""").as_dataframe() #
)
pd.concat(query_results).head()
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.44s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.62s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.66s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.09s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.76s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.28s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.22s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.23s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (17.52s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.49s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.96s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.91s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.45s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.24s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.84s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.74s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.90s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.44s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.93s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.14s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.09s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.84s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.32s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.78s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.01s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.28s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.47s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.84s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.26s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.07s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.12s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.24s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.23s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.38s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.16s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.39s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.06s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.22s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.37s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.09s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.49s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.15s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.03s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.16s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-246 / 246 (1.15s)
===
Records extracted: 246
[18]:
first_name | id | last_name | orcid_id | research_orgs | |
---|---|---|---|---|---|
0 | Jennifer A | ur.0777662157.75 | Summers | [0000-0002-3156-0479] | [{'city_name': 'Dunedin', 'country_name': 'New... |
1 | Francois Henri | ur.07764267264.89 | Nosten | [0000-0002-7951-0745] | [{'acronym': 'LSTM', 'city_name': 'Liverpool',... |
2 | Tony A | ur.0775731540.59 | Blakely | [0000-0002-6995-4369] | [{'acronym': 'QUT', 'city_name': 'Brisbane', '... |
3 | Chandra Prakash | ur.0772267026.25 | Gyawali | NaN | [{'acronym': 'QMUL', 'city_name': 'London', 'c... |
4 | Adam Kenneth | ur.0771337646.65 | Wheatley | [0000-0002-5593-9387] | [{'city_name': 'Melbourne', 'country_name': 'A... |
9. Patents example (patents -> publications)¶
Using the same method, we can retrieve all patents citing publications from my institution.
[19]:
%dsldocs patents
dsl_last_results[dsl_last_results['field']=='publication_ids']
[19]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
41 | patents | publication_ids | string | Dimensions IDs of the publications related to ... | True | False | False |
[20]:
# Step 1 - same as before - except now we want the query in chunks
pubs_chunks = dsl.query_iterative("""
search publications
where research_orgs.id in ["grid.1008.9"]
and year = 2015
return publications[id]
""").chunks(250)
# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with patents, and reference_ids was replaced by publication_ids
query_results = []
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search patents
where publication_ids in {mypubslist}
return patents
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
cited_patents = pd.concat(query_results).\
drop_duplicates(subset='id')
cited_patents.head(2)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 8884 (2.54s)
1000-2000 / 8884 (1.41s)
2000-3000 / 8884 (2.06s)
3000-4000 / 8884 (1.82s)
4000-5000 / 8884 (1.63s)
5000-6000 / 8884 (3.83s)
6000-7000 / 8884 (1.95s)
7000-8000 / 8884 (1.93s)
8000-8884 / 8884 (1.70s)
===
Records extracted: 8884
Starting iteration with limit=1000 skip=0 ...
0-5 / 5 (0.73s)
===
Records extracted: 5
Starting iteration with limit=1000 skip=0 ...
0-24 / 24 (0.83s)
===
Records extracted: 24
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (0.88s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-10 / 10 (1.15s)
===
Records extracted: 10
Starting iteration with limit=1000 skip=0 ...
0-10 / 10 (1.55s)
===
Records extracted: 10
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.68s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-15 / 15 (0.86s)
===
Records extracted: 15
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.72s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-20 / 20 (0.94s)
===
Records extracted: 20
Starting iteration with limit=1000 skip=0 ...
0-13 / 13 (0.87s)
===
Records extracted: 13
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (0.98s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (0.75s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-10 / 10 (0.74s)
===
Records extracted: 10
Starting iteration with limit=1000 skip=0 ...
0-2 / 2 (1.20s)
===
Records extracted: 2
Starting iteration with limit=1000 skip=0 ...
0-26 / 26 (0.86s)
===
Records extracted: 26
Starting iteration with limit=1000 skip=0 ...
0-5 / 5 (1.06s)
===
Records extracted: 5
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.72s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-17 / 17 (2.38s)
===
Records extracted: 17
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (0.67s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-20 / 20 (0.70s)
===
Records extracted: 20
Starting iteration with limit=1000 skip=0 ...
0-26 / 26 (0.69s)
===
Records extracted: 26
Starting iteration with limit=1000 skip=0 ...
0-5 / 5 (0.72s)
===
Records extracted: 5
Starting iteration with limit=1000 skip=0 ...
0-36 / 36 (0.69s)
===
Records extracted: 36
Starting iteration with limit=1000 skip=0 ...
0-15 / 15 (0.73s)
===
Records extracted: 15
Starting iteration with limit=1000 skip=0 ...
0-21 / 21 (0.66s)
===
Records extracted: 21
Starting iteration with limit=1000 skip=0 ...
0-18 / 18 (0.70s)
===
Records extracted: 18
Starting iteration with limit=1000 skip=0 ...
0-16 / 16 (0.71s)
===
Records extracted: 16
Starting iteration with limit=1000 skip=0 ...
0-18 / 18 (0.75s)
===
Records extracted: 18
Starting iteration with limit=1000 skip=0 ...
0-28 / 28 (0.70s)
===
Records extracted: 28
Starting iteration with limit=1000 skip=0 ...
0-22 / 22 (2.21s)
===
Records extracted: 22
Starting iteration with limit=1000 skip=0 ...
0-18 / 18 (0.69s)
===
Records extracted: 18
Starting iteration with limit=1000 skip=0 ...
0-21 / 21 (0.74s)
===
Records extracted: 21
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (0.74s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-1 / 1 (0.92s)
===
Records extracted: 1
Starting iteration with limit=1000 skip=0 ...
0-1 / 1 (0.68s)
===
Records extracted: 1
Starting iteration with limit=1000 skip=0 ...
===
Records extracted: 0
[20]:
assignee_names | assignees | filing_status | id | inventor_names | publication_date | times_cited | title | year | granted_year | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [PASTEUR INSTITUT, INST NAT SANTE RECH MED] | [{'city_name': 'Paris', 'country_name': 'Franc... | Application | WO-2018015815-A2 | [MECHERI SALAH, MENARD ROBERT, DEMARTA-GATSI C... | 2018-01-25 | 0 | PLASMODIUM WITH HISTAMINE RELEASING FACTOR (HR... | 2017 | NaN |
1 | [KDDI Corp] | [{'city_name': 'Tokyo', 'country_name': 'Japan... | Grant | US-10547380-B2 | [ISHIMURA SHOTA] | 2020-01-28 | 0 | Polarization tracking device, optical receivin... | 2019 | 2020.0 |
[21]:
%dsldocs patents
dsl_last_results[dsl_last_results['type']=='organizations']
[21]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
6 | patents | assignees | organizations | Disambiguated GRID organisations who own or ha... | True | True | True |
19 | patents | current_assignees | organizations | Disambiguated GRID organisations currenlty own... | True | True | True |
27 | patents | funders | organizations | GRID organisations funding the patent. | True | True | True |
37 | patents | original_assignees | organizations | Disambiguated GRID organisations that first ow... | True | True | True |
[22]:
import json
cited_patents_assignees = cited_patents.explode('assignees')
cited_patents_assignees['assignee_grid_id'] = cited_patents_assignees['assignees'].\
apply(lambda g: g['id'] if type(g) == dict else 0 )
cited_patents_assignees['assignee_name'] = cited_patents_assignees['assignees'].\
apply(lambda g: g['name'] if type(g) == dict else 0 )
cited_patents_assignees.\
groupby(['assignee_grid_id','assignee_name']).\
agg({'id':'count'}).\
sort_values(by='id', ascending=False).\
head(20)
[22]:
id | ||
---|---|---|
assignee_grid_id | assignee_name | |
0 | 0 | 126 |
grid.4280.e | National University of Singapore | 15 |
grid.7429.8 | French Institute of Health and Medical Research | 14 |
grid.453420.4 | SingHealth | 13 |
grid.431072.3 | AbbVie (United States) | 10 |
grid.116068.8 | Massachusetts Institute of Technology | 10 |
grid.4444.0 | French National Centre for Scientific Research | 8 |
grid.453773.1 | Wisconsin Alumni Research Foundation | 7 |
grid.1008.9 | University of Melbourne | 7 |
grid.417921.8 | Incyte (United States) | 6 |
grid.421350.1 | Northrop Grumman (United States) | 6 |
grid.6363.0 | Charité - University Medicine Berlin | 6 |
grid.1042.7 | Walter and Eliza Hall Institute of Medical Research | 5 |
grid.16753.36 | Northwestern University | 5 |
grid.467402.3 | Helsinn (Switzerland) | 5 |
grid.51462.34 | Memorial Sloan Kettering Cancer Center | 5 |
grid.1058.c | Murdoch Children's Research Institute | 5 |
grid.422530.2 | Waters (United Kingdom) | 4 |
grid.420451.6 | Google (United States) | 4 |
grid.168010.e | Stanford University | 4 |
10. Clinical Trials (clinical trials -> publications)¶
Using the same method, we can retrieve all clinical trials citing publications from my institution.
[23]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['field']=='research_orgs']
[23]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
30 | clinical_trials | research_orgs | organizations | GRID organizations involved, e.g. as sponsors ... | True | True | True |
[25]:
# Step 1 - same as before - except now we want the query in chunks
clinical_trials_chunks = dsl.query_iterative("""
search publications
where research_orgs.id in ["grid.1008.9"]
and year = 2015
return publications[id]
""").chunks(400)
# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with clinical_trials, and reference_ids was replaced by publication_ids
query_results = []
for c in clinical_trials_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search clinical_trials
where publication_ids in {mypubslist}
return clinical_trials[basics+research_orgs]
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
cited_clinical_trials = pd.concat(query_results).\
drop_duplicates(subset='id')
cited_clinical_trials.head(2)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 8884 (3.10s)
1000-2000 / 8884 (2.89s)
2000-3000 / 8884 (3.20s)
3000-4000 / 8884 (3.63s)
4000-5000 / 8884 (3.87s)
5000-6000 / 8884 (3.81s)
6000-7000 / 8884 (1.63s)
7000-8000 / 8884 (1.48s)
8000-8884 / 8884 (1.90s)
===
Records extracted: 8884
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (1.13s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (1.21s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (1.01s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-16 / 16 (0.85s)
===
Records extracted: 16
Starting iteration with limit=1000 skip=0 ...
0-13 / 13 (2.91s)
===
Records extracted: 13
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (3.13s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (1.20s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (1.15s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (1.26s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (0.89s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (1.31s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-13 / 13 (0.88s)
===
Records extracted: 13
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (0.85s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (1.25s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (2.15s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.84s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-11 / 11 (0.87s)
===
Records extracted: 11
Starting iteration with limit=1000 skip=0 ...
0-4 / 4 (1.57s)
===
Records extracted: 4
Starting iteration with limit=1000 skip=0 ...
0-6 / 6 (0.92s)
===
Records extracted: 6
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (1.04s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-2 / 2 (0.92s)
===
Records extracted: 2
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (1.21s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
===
Records extracted: 0
[25]:
active_years | id | investigators | title | research_orgs | |
---|---|---|---|---|---|
0 | [2021, 2022, 2023, 2024] | NCT05170594 | [[Haiyan Liu, , Principal Investigator, The Se... | A Prospective Study of Bevacizumab Combined Wi... | NaN |
1 | [2021, 2022, 2023] | NCT05147818 | NaN | Risk Factors for Acute Kidney Injury in Diabet... | [{'city_name': 'Assiut', 'country_name': 'Egyp... |
[28]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['type']=='organizations']
[28]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
19 | clinical_trials | funders | organizations | GRID funding organisations that are involved w... | True | True | True |
30 | clinical_trials | research_orgs | organizations | GRID organizations involved, e.g. as sponsors ... | True | True | True |
[29]:
cited_clinical_trials_orgs = cited_clinical_trials.explode('research_orgs')
cited_clinical_trials_orgs['research_orgs_grid_id'] = cited_clinical_trials_orgs['research_orgs'].\
apply(lambda g: g['id'] if type(g) == dict else 0 )
cited_clinical_trials_orgs['research_orgs_name'] = cited_clinical_trials_orgs['research_orgs'].\
apply(lambda g: g['name'] if type(g) == dict else 0 )
cited_clinical_trials_orgs.\
groupby(['research_orgs_grid_id','research_orgs_name']).\
agg({'id':'count'}).\
sort_values(by='id', ascending=False).\
head(20)
[29]:
id | ||
---|---|---|
research_orgs_grid_id | research_orgs_name | |
grid.1002.3 | Monash University | 6 |
0 | 0 | 4 |
grid.4367.6 | Washington University in St. Louis | 4 |
grid.506076.2 | Istanbul University Cerrahpaşa | 4 |
grid.223827.e | University of Utah | 4 |
grid.84393.35 | Hospital Universitari i Politècnic La Fe | 4 |
grid.411438.b | Hospital Universitari Germans Trias i Pujol | 3 |
grid.194645.b | University of Hong Kong | 3 |
grid.21925.3d | University of Pittsburgh | 3 |
grid.248762.d | BC Cancer Agency | 3 |
grid.2515.3 | Boston Children's Hospital | 3 |
grid.252487.e | Assiut University | 3 |
grid.412860.9 | Wake Forest Baptist Medical Center | 3 |
grid.428313.f | Corporació Sanitària Parc Taulí | 3 |
grid.266102.1 | University of California, San Francisco | 3 |
grid.81821.32 | Hospital Universitario La Paz | 3 |
grid.411052.3 | Central University Hospital of Asturias | 3 |
grid.411349.a | Reina Sofia Hospital | 3 |
grid.417574.4 | Abbott (United States) | 2 |
grid.440627.3 | University of the Andes, Chile | 2 |
11. Grants (publications -> grants)¶
Using the same method, we can retrieve all grants funding publications from my institution.
[30]:
%dsldocs publications
dsl_last_results[dsl_last_results['field'].str.contains('ids')]
[30]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
20 | publications | clinical_trial_ids | string | Clinical Trial IDs mentioned in publications f... | True | False | False |
47 | publications | reference_ids | string | Dimensions publication ID for publications in ... | True | False | False |
61 | publications | supporting_grant_ids | string | Grants supporting a publication, returned as a... | True | False | False |
[38]:
# Step 1 - get all the publications from my institution, together with their supporting grant IDs
publications = dsl.query_iterative("""
search publications
where research_orgs.id in ["grid.1008.9"]
and year = 2020
return publications[id+supporting_grant_ids]
""").as_dataframe()
# Step 2 - we can get the grant IDs directly from the publications this time,
# so as a second step we pull the grants metadata using those identifiers.
pubs_grants = publications.explode('supporting_grant_ids')
grants_from_pubs = pd.DataFrame(pubs_grants.supporting_grant_ids.unique()).\
dropna().\
rename(columns={0:'id'})
query_results = []
for g, gnts in grants_from_pubs.groupby(np.arange(len(grants_from_pubs)) // 250):
# This does *almost* the same thing as the chunks command used above
myglist = json.dumps(list(gnts.id))
query_results.append(
dsl.query_iterative(f"""
search grants
where id in {myglist}
return grants[basics+funding_usd]
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
grant_details = pd.concat(query_results).\
drop_duplicates(subset='id')
grant_details.head(5)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 14092 (1.94s)
1000-2000 / 14092 (1.86s)
2000-3000 / 14092 (1.22s)
3000-4000 / 14092 (1.21s)
4000-5000 / 14092 (4.85s)
5000-6000 / 14092 (1.20s)
6000-7000 / 14092 (1.26s)
7000-8000 / 14092 (1.19s)
8000-9000 / 14092 (1.18s)
9000-10000 / 14092 (1.26s)
10000-11000 / 14092 (1.38s)
11000-12000 / 14092 (1.25s)
12000-13000 / 14092 (1.23s)
13000-14000 / 14092 (1.18s)
14000-14092 / 14092 (1.13s)
===
Records extracted: 14092
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.01s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.29s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.12s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.07s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.03s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.70s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-248 / 248 (0.98s)
===
Records extracted: 248
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.72s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.73s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.96s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.97s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.99s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.71s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.80s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.66s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.78s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.06s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.97s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.93s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.85s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-247 / 247 (0.97s)
===
Records extracted: 247
[38]:
active_year | end_date | funders | funding_org_name | funding_usd | grant_number | id | language | original_title | start_date | start_year | title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [2021, 2022, 2023, 2024] | 2024-03-31 | [{'acronym': 'DOE SC', 'city_name': 'Washingto... | Office of Science | 14773638.0 | DE-SC0007881 | grant.4320658 | en | Particle Physics and Cosmology Research | 2021-07-01 | 2021.0 | Particle Physics and Cosmology Research |
1 | [2020, 2021, 2022] | 2022-03-31 | [{'acronym': 'JSPS', 'city_name': 'Tokyo', 'co... | Japan Society for the Promotion of Science | 62350.0 | 20K21008 | grant.9399908 | ja | Analysis of bounding flight in birds by dynami... | 2020-07-30 | 2020.0 | Analysis of bounding flight in birds by dynami... |
2 | [2020, 2021, 2022, 2023, 2024, 2025] | 2025-06-30 | [{'acronym': 'NIAID', 'city_name': 'Bethesda',... | National Institute of Allergy and Infectious D... | 1587803.0 | R01AI153400 | grant.9293892 | en | Polymerase Inhibitors of Respiratory Syncytial... | 2020-07-01 | 2020.0 | Polymerase Inhibitors of Respiratory Syncytial... |
3 | [2020, 2021, 2022, 2023, 2024, 2025] | 2025-05-31 | [{'acronym': 'NIAID', 'city_name': 'Bethesda',... | National Institute of Allergy and Infectious D... | 1082990.0 | R01AI148407 | grant.9293759 | en | Investigating the basis of MAIT cell antigen p... | 2020-06-15 | 2020.0 | Investigating the basis of MAIT cell antigen p... |
4 | [2020, 2021, 2022, 2023] | 2023-06-01 | [{'acronym': 'ARC', 'city_name': 'Canberra', '... | Australian Research Council | 333250.0 | DE200100264 | grant.8676066 | en | Giving Mums a fair go: culturally-responsive c... | 2020-06-01 | 2020.0 | Giving Mums a fair go: culturally-responsive c... |
[37]:
pubs_grants.groupby('supporting_grant_ids').\
agg({'id':'count'}).\
reset_index().\
rename(columns={'id':'pubs','supporting_grant_ids':'id'}).\
merge(grant_details[['id','original_title','funding_usd']],
on='id').\
sort_values(by='pubs', ascending=False)
[37]:
id | pubs | original_title | funding_usd | |
---|---|---|---|---|
6803 | grant.8470130 | 63 | Consolidated Grant 2019 | 2724578.0 |
6811 | grant.8470469 | 61 | Particle Physics Consolidated Grant from the U... | 3097905.0 |
6273 | grant.7924120 | 61 | GRIDPP PROJECT COORDINATION | NaN |
2155 | grant.4320658 | 57 | Particle Physics and Cosmology Research | 14773638.0 |
206 | grant.2438793 | 55 | Cancer Center Support Grant | 266136064.0 |
... | ... | ... | ... | ... |
3151 | grant.6447242 | 1 | Participatory Urban Living for Sustainable Env... | 5696332.0 |
3150 | grant.6447136 | 1 | Towards Novel Inert (Photo-)toxic Ru(II) Polyp... | 2317685.0 |
3149 | grant.6446143 | 1 | Understanding how an MLH1 promoter polymorphis... | 952550.0 |
3148 | grant.6446046 | 1 | Macronutrients and Metabolic Health - Understa... | 2822859.0 |
3997 | grant.6803312 | 1 | Macrophage Dysfunction in Obesity, Diabetes an... | 12071621.0 |
7995 rows × 4 columns
Conclusions¶
Lists are a simple data structure that can have a great number of applications.
When used in conjunction with the DSL language, they make it easy to concatenate the results of one query with another, e.g. in order to navigate through the links available in Dimensions (from publications to grants, patents, etc.).
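As a recap, the general pattern used throughout this notebook looks roughly like this (a sketch assuming dsl is an authenticated dimcli.Dsl instance, here following the publications-to-patents link):
# 1. Run a first query and extract the IDs you care about
seed_pubs = dsl.query_iterative("""
    search publications
    where research_orgs.id in ["grid.1008.9"] and year = 2020
    return publications[id]
""").as_dataframe()

# 2. Serialise the IDs as a JSON list (same syntax as a DSL list),
#    keeping the list below the 512-item limit
id_list = json.dumps(list(seed_pubs.id)[:500])

# 3. Feed the list into a second query over another source
citing_patents = dsl.query_iterative(f"""
    search patents
    where publication_ids in {id_list}
    return patents
""").as_dataframe()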
See also this patents tutorial or this clinical trials tutorial for more in-depth applications of the queries discussed above.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated Github repository for examples, the source code of these tutorials and much more.