Working with lists in the Dimensions API¶
In this notebook we are going to show:
How to use lists in order to write more efficient DSL queries
How lists can be used to concatenate the results of one query with another query
How these methods can be used for real-world applications, e.g. getting publications/patents/grants that cite my publications
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[1]:
!pip install dimcli -U --quiet
import dimcli
from dimcli.utils import *
import sys
import json
import pandas as pd
import numpy as np
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
import getpass
KEY = getpass.getpass(prompt='API Key: ')
dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
KEY = ""
dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
1. How do we use lists in the Dimensions API?¶
We use lists in the API because they make queries easier to read and easier to work with.
Here is a query without lists.
How many publications were produced by either Monash University or the University of Melbourne (grid.1002.3, grid.1008.9) in either 2019 or 2020? Be really careful with your brackets!
[2]:
%%dsldf
search publications
where
(
research_orgs.id = "grid.1008.9"
or research_orgs.id = "grid.1002.3"
)
and (
year = 2019
or year = 2020
)
return publications
limit 1
Returned Publications: 1 (total = 45670)
Time: 0.71s
[2]:
authors | id | issue | pages | title | type | volume | year | journal.id | journal.title | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Lisbon', 'city_id... | pub.1141637539 | 3 | 188-197 | How silence facilitates verbal participation | article | 3 | 2020 | jour.1374325 | English Language Teaching Educational Journal |
The query above could get really messy. What if I wanted 20 institutions? What if I wanted the last ten years? (or, or, or, or, or…) and (or, or, or, or, or…)
By using lists we can quickly add a large number of conditions by means of an easy-to-read square-bracket notation:
[3]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 2.34s
[3]:
id | |
---|---|
0 | pub.1141637539 |
1 | pub.1141305289 |
2 | pub.1138102366 |
3 | pub.1135883909 |
4 | pub.1135432149 |
... | ... |
95 | pub.1134203058 |
96 | pub.1134203025 |
97 | pub.1134202605 |
98 | pub.1134190313 |
99 | pub.1134190104 |
100 rows × 1 columns
2. What are all the things that we can make lists of in the Dimensions API?¶
What are the internal entities that we might put in a list?¶
[4]:
%dsldocs
dsl_last_results[dsl_last_results['is_entity']==True]
[4]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
6 | clinical_trials | category_bra | categories | `Broad Research Areas <https://app.dimensions.... | True | True | True |
7 | clinical_trials | category_for | categories | `ANZSRC Fields of Research classification <htt... | True | True | True |
8 | clinical_trials | category_hra | categories | `Health Research Areas <https://dimensions.fre... | True | True | True |
9 | clinical_trials | category_hrcs_hc | categories | `HRCS - Health Categories <https://dimensions.... | True | True | True |
10 | clinical_trials | category_hrcs_rac | categories | `HRCS – Research Activity Codes <https://dimen... | True | True | True |
... | ... | ... | ... | ... | ... | ... | ... |
321 | reports | responsible_orgs_cities | cities | Cities of the organisations responsible for th... | True | True | True |
322 | reports | responsible_orgs_countries | countries | Countries of the organisations responsible for... | True | True | True |
324 | reports | responsible_orgs_states | states | States of the organisations responsible for th... | True | True | True |
327 | researchers | current_research_org | organizations | The most recent research organization linked t... | True | True | True |
340 | researchers | research_orgs | organizations | All research organizations linked to the resea... | True | True | True |
123 rows × 7 columns
What about lists of IDs?¶
[5]:
%dsldocs
dsl_last_results[dsl_last_results['field'].str.contains('id')==True]
[5]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
4 | clinical_trials | associated_grant_ids | string | Dimensions IDs of the grants associated to the... | True | False | False |
21 | clinical_trials | id | string | Dimensions clinical trial ID | True | False | False |
27 | clinical_trials | publication_ids | string | Publications mentioned in clinical trials (exc... | True | False | False |
33 | datasets | associated_grant_ids | string | The Dimensions IDs of the grants linked to the... | True | False | False |
35 | datasets | associated_publication_id | string | The Dimensions ID of the publication linked to... | True | False | False |
57 | datasets | id | string | Dimensions dataset ID. | True | False | False |
64 | datasets | repository_id | string | The ID of the repository of the dataset. | True | False | True |
105 | grants | id | string | Dimensions grant ID. | True | False | False |
122 | organizations | cnrs_ids | string | CNRS IDs for this organization | True | False | False |
126 | organizations | external_ids_fundref | string | Fundref IDs for this organization | True | False | False |
127 | organizations | hesa_ids | string | HESA IDs for this organization | True | False | False |
128 | organizations | id | string | GRID ID of the organization. E.g., "grid.26999... | True | False | False |
129 | organizations | isni_ids | string | ISNI IDs for this organization | True | False | False |
140 | organizations | organization_child_ids | string | Child organization IDs | True | False | False |
141 | organizations | organization_parent_ids | string | Parent organization IDs | True | False | False |
142 | organizations | organization_related_ids | string | Related organization IDs | True | False | False |
143 | organizations | orgref_ids | string | OrgRef IDs for this organization | True | False | False |
145 | organizations | ror_ids | string | ROR IDs for this organization | True | False | False |
149 | organizations | ucas_ids | string | UCAS IDs for this organization | True | False | False |
150 | organizations | ukprn_ids | string | UKPRN IDs for this organization | True | False | False |
151 | organizations | wikidata_ids | string | WikiData IDs for this organization | True | False | False |
160 | patents | associated_grant_ids | string | Dimensions IDs of the grants associated to the... | True | False | False |
169 | patents | cited_by_ids | string | Dimensions IDs of the patents that cite this p... | True | False | False |
177 | patents | family_id | integer | Identifier of the corresponding `EPO patent fa... | True | False | True |
183 | patents | id | string | Dimensions patent ID | True | False | False |
194 | patents | publication_ids | string | Dimensions IDs of the publications related to ... | True | False | False |
197 | patents | reference_ids | string | Dimensions IDs of the patents which are cited ... | True | False | False |
212 | policy_documents | id | string | Dimensions policy document ID | True | False | False |
214 | policy_documents | publication_ids | string | Dimensions IDs of the publications related to ... | True | False | False |
225 | publications | altmetric_id | integer | Altmetric Publication ID | True | False | False |
226 | publications | arxiv_id | string | The publications arXiv identifier (e.g. ‘arXiv... | True | False | False |
242 | publications | clinical_trial_ids | string | Clinical Trial IDs mentioned in publications f... | True | False | False |
254 | publications | id | string | Dimensions publication ID. | True | False | False |
264 | publications | pmcid | string | PubMed Central ID. | True | False | False |
265 | publications | pmid | string | PubMed ID. | True | False | False |
269 | publications | reference_ids | string | Dimensions publication ID for publications in ... | True | False | False |
283 | publications | supporting_grant_ids | string | Grants supporting a publication, returned as a... | True | False | False |
306 | reports | external_ids | json | External identifiers available from the report... | True | False | False |
310 | reports | id | string | Dimensions report ID. | True | False | False |
332 | researchers | id | string | Dimensions researcher ID. | True | False | False |
336 | researchers | nih_ppid | string | The PI Profile ID (i.e., ppid) is a Researcher... | True | False | False |
338 | researchers | orcid_id | string | `ORCID <https://orcid.org/>`_ ID. | True | False | False |
343 | source_titles | id | string | The Dimensions ID of the source. | True | False | False |
What are the external entities that we can put in a list?¶
a list of ISSNs
a list of External Grant IDs
a list of DOIs
a list of categories
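For example, here is a minimal sketch of filtering by one of these external identifiers, using a list of DOIs (the DOI values below are placeholders, not real identifiers):
# Hypothetical example: filter publications by an external-entity list (DOIs).
# The DOI strings are placeholders only.
dsl.query("""
search publications
    where doi in ["10.1000/placeholder.1", "10.1000/placeholder.2"]
    return publications[id+doi+title]
""").as_dataframe()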
3. Making a list from the results of a query¶
The list syntax for the Dimensions API is the same as the list syntax for JSON, so we can use Python's json-to-string functions to build a list of IDs for us from the previous query.
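For instance, a minimal illustration (reusing two of the publication IDs returned above): json.dumps turns a Python list of strings into exactly the square-bracket, double-quoted notation the DSL expects.
import json
ids = ["pub.1141637539", "pub.1141305289"]
json.dumps(ids)   # -> '["pub.1141637539", "pub.1141305289"]'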
Let’s run our example query again.
[6]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 1.43s
[6]:
id | |
---|---|
0 | pub.1141637539 |
1 | pub.1141305289 |
2 | pub.1138102366 |
3 | pub.1135883909 |
4 | pub.1135432149 |
... | ... |
95 | pub.1134203058 |
96 | pub.1134203025 |
97 | pub.1134202605 |
98 | pub.1134190313 |
99 | pub.1134190104 |
100 rows × 1 columns
[7]:
json.dumps(list(dsl_last_results.id))
[7]:
'["pub.1141637539", "pub.1141305289", "pub.1138102366", "pub.1135883909", "pub.1135432149", "pub.1134822967", "pub.1134491333", "pub.1134281073", "pub.1134266008", "pub.1134265994", "pub.1134265901", "pub.1134265655", "pub.1134265653", "pub.1134265625", "pub.1134258780", "pub.1134258376", "pub.1134257556", "pub.1134256126", "pub.1134256124", "pub.1134256123", "pub.1134254481", "pub.1134252657", "pub.1134251430", "pub.1134250811", "pub.1134250027", "pub.1134248622", "pub.1134199431", "pub.1134198997", "pub.1134198446", "pub.1134198388", "pub.1134196709", "pub.1134196597", "pub.1134196290", "pub.1134196127", "pub.1134195884", "pub.1134195597", "pub.1134195228", "pub.1134194637", "pub.1134193894", "pub.1134193864", "pub.1134193611", "pub.1134192883", "pub.1134192144", "pub.1134192022", "pub.1133732704", "pub.1133137340", "pub.1133137326", "pub.1133137275", "pub.1132992249", "pub.1132838273", "pub.1128236326", "pub.1134262302", "pub.1134257598", "pub.1134257594", "pub.1134257519", "pub.1134257515", "pub.1134246135", "pub.1134233491", "pub.1134221342", "pub.1134220735", "pub.1134201275", "pub.1134199760", "pub.1134199173", "pub.1134199132", "pub.1134196260", "pub.1134196172", "pub.1134195488", "pub.1134194209", "pub.1134193250", "pub.1134193081", "pub.1134191741", "pub.1134191365", "pub.1134190135", "pub.1134190078", "pub.1134186262", "pub.1134185924", "pub.1134185750", "pub.1134185659", "pub.1134184944", "pub.1134184639", "pub.1134183226", "pub.1134182538", "pub.1134181948", "pub.1133342464", "pub.1133065204", "pub.1132594886", "pub.1132412597", "pub.1131466688", "pub.1134233730", "pub.1134233503", "pub.1134233412", "pub.1134215867", "pub.1134215652", "pub.1134215451", "pub.1134208463", "pub.1134203058", "pub.1134203025", "pub.1134202605", "pub.1134190313", "pub.1134190104"]'
Let’s try to use this list of IDs.
Unfortunately, you can't just paste a Python expression straight into the query; the DSL does not evaluate Python:
[8]:
%%dsldf
search publications
where id in [json.dumps(list(dsl_last_results.id))]
return publications
Returned Errors: 1
Time: 1.25s
1 QueryError found
4 QuerySyntaxErrors found
4 ParserErrors found
* [Line 2:15] ('json') no viable alternative at input '[json'
* [Line 2:26] ('list') extraneous input 'list' expecting {'for', 'in', '('}
* [Line 2:31] ('dsl_last_results') mismatched input 'dsl_last_results' expecting {'for', 'in', '('}
* [Line 2:52] (']') mismatched input ']' expecting <EOF>
…so let’s get our results back again
[9]:
%%dsldf
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
Returned Publications: 100 (total = 45670)
Time: 1.14s
[9]:
id | |
---|---|
0 | pub.1141637539 |
1 | pub.1141305289 |
2 | pub.1138102366 |
3 | pub.1135883909 |
4 | pub.1135432149 |
... | ... |
95 | pub.1134203058 |
96 | pub.1134203025 |
97 | pub.1134202605 |
98 | pub.1134190313 |
99 | pub.1134190104 |
100 rows × 1 columns
…and use the Python way of calling the Dimensions API instead
[10]:
# Run the query, using a Python f-string to splice the list of IDs into the DSL
dsl.query(f"""
search publications
where id in {json.dumps(list(dsl_last_results.id))}
return publications
""").as_dataframe()
# Displaying the same f-string shows what the expanded query actually looks like
f"""
search publications
where id in {json.dumps(list(dsl_last_results.id))}
return publications
"""
Returned Publications: 20 (total = 100)
Time: 3.08s
[10]:
'\n\n search publications\n where id in ["pub.1141637539", "pub.1141305289", "pub.1138102366", "pub.1135883909", "pub.1135432149", "pub.1134822967", "pub.1134491333", "pub.1134281073", "pub.1134266008", "pub.1134265994", "pub.1134265901", "pub.1134265655", "pub.1134265653", "pub.1134265625", "pub.1134258780", "pub.1134258376", "pub.1134257556", "pub.1134256126", "pub.1134256124", "pub.1134256123", "pub.1134254481", "pub.1134252657", "pub.1134251430", "pub.1134250811", "pub.1134250027", "pub.1134248622", "pub.1134199431", "pub.1134198997", "pub.1134198446", "pub.1134198388", "pub.1134196709", "pub.1134196597", "pub.1134196290", "pub.1134196127", "pub.1134195884", "pub.1134195597", "pub.1134195228", "pub.1134194637", "pub.1134193894", "pub.1134193864", "pub.1134193611", "pub.1134192883", "pub.1134192144", "pub.1134192022", "pub.1133732704", "pub.1133137340", "pub.1133137326", "pub.1133137275", "pub.1132992249", "pub.1132838273", "pub.1128236326", "pub.1134262302", "pub.1134257598", "pub.1134257594", "pub.1134257519", "pub.1134257515", "pub.1134246135", "pub.1134233491", "pub.1134221342", "pub.1134220735", "pub.1134201275", "pub.1134199760", "pub.1134199173", "pub.1134199132", "pub.1134196260", "pub.1134196172", "pub.1134195488", "pub.1134194209", "pub.1134193250", "pub.1134193081", "pub.1134191741", "pub.1134191365", "pub.1134190135", "pub.1134190078", "pub.1134186262", "pub.1134185924", "pub.1134185750", "pub.1134185659", "pub.1134184944", "pub.1134184639", "pub.1134183226", "pub.1134182538", "pub.1134181948", "pub.1133342464", "pub.1133065204", "pub.1132594886", "pub.1132412597", "pub.1131466688", "pub.1134233730", "pub.1134233503", "pub.1134233412", "pub.1134215867", "pub.1134215652", "pub.1134215451", "pub.1134208463", "pub.1134203058", "pub.1134203025", "pub.1134202605", "pub.1134190313", "pub.1134190104"]\n\n return publications\n\n\n'
Putting both parts of this example together¶
[11]:
# Step 1. Get the list of publications..
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
""").as_dataframe()
# Step 2. Put the list into the next query...
dsl.query_iterative(f"""
search publications
where id in {json.dumps(list(pubs.id))}
return publications
""").as_dataframe().head(5)
Returned Publications: 100 (total = 45670)
Time: 1.80s
Starting iteration with limit=1000 skip=0 ...
0-100 / 100 (0.80s)
===
Records extracted: 100
[11]:
authors | id | issue | pages | title | type | volume | year | journal.id | journal.title | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Lisbon', 'city_id... | pub.1141637539 | 3 | 188-197 | How silence facilitates verbal participation | article | 3 | 2020 | jour.1374325 | English Language Teaching Educational Journal |
1 | [{'affiliations': [{'city': 'Sydney', 'city_id... | pub.1141305289 | 1 | 83-138 | The expression of emotions in Kunbarlang and i... | article | 27 | 2020 | jour.1129762 | Pragmatics & Cognition |
2 | [{'affiliations': [{'city': None, 'city_id': N... | pub.1138102366 | 110 | 42-44 | Ophthalmic nurses: vital team members in the p... | article | 33 | 2020 | jour.1099554 | Community Eye Health Journal |
3 | [{'affiliations': [{'city': 'Melbourne', 'city... | pub.1135883909 | 4 | 1474-1482 | Charting a way forward? Post-juristocracy, dem... | article | 18 | 2020 | jour.1139636 | International Journal of Constitutional Law |
4 | [{'affiliations': [{'city': 'Melbourne', 'city... | pub.1135432149 | 2 | 100 | An Interactive Approach to Contraception Teach... | article | 14 | 2020 | jour.1051336 | South-East Asian Journal of Medical Education |
4. Doing something useful: Get all the publications that cite my publications¶
[12]:
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 100
""")
mypubslist = json.dumps(list(pubs.as_dataframe().id))
dsl.query_iterative(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe().head()
Returned Publications: 100 (total = 45670)
Time: 1.36s
Starting iteration with limit=1000 skip=0 ...
0-391 / 391 (1.44s)
===
Records extracted: 391
[12]:
authors | id | pages | title | type | volume | year | journal.id | journal.title | issue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Monastir', 'city_... | pub.1144584350 | 132398 | New spiropyrrolothiazole derivatives bearing a... | article | 1254 | 2022 | jour.1009258 | Journal of Molecular Structure | NaN |
1 | [{'affiliations': [{'city': 'Gwangju', 'city_i... | pub.1144716570 | 112636 | A new paradigm in sarcopenia: Cognitive impair... | article | 147 | 2022 | jour.1371653 | Biomedicine & Pharmacotherapy | NaN |
2 | [{'affiliations': [{'city': 'Changsha', 'city_... | pub.1142243251 | 133215 | Intercalation and covalent bonding strategies ... | article | 431 | 2022 | jour.1127675 | Chemical Engineering Journal | NaN |
3 | [{'affiliations': [{'city': 'Wuhan', 'city_id'... | pub.1144581821 | 104948 | Muscle aging amelioration by yeast protein sup... | article | 89 | 2022 | jour.1041965 | Journal of Functional Foods | NaN |
4 | [{'affiliations': [{'city': 'Tehran', 'city_id... | pub.1144368116 | 118310 | Peer-to-peer decentralized energy trading fram... | article | 308 | 2022 | jour.1029519 | Applied Energy | NaN |
5. How long can lists get?¶
The limit depends partly on the total string length of the query, but there is also a hard cap: the 'in' filter accepts fewer than 512 items.
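As a plain-Python sketch (assuming a hypothetical list of IDs called all_ids), you could batch a long list yourself to stay under that cap; Dimcli's chunks() helper, used further down, automates the same idea.
# Split a long list of IDs into batches of at most 500 items,
# so that each 'in' filter stays below the 512-item limit.
def batches(items, size=500):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g.: for batch in batches(all_ids): run one query per batch (all_ids is hypothetical)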
This won’t work¶
[13]:
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""")
mypubslist = json.dumps(list(pubs.as_dataframe().id))
dsl.query(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe()
Returned Publications: 1000 (total = 45670)
Time: 2.89s
Returned Errors: 1
Time: 0.80s
1 QueryError found
Semantic errors found:
Filter operator 'in' requires 0 < items < 512. '1000 is out of this range'.
This will¶
[14]:
pubs = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 250
""")
mypubslist = json.dumps(list(pubs.as_dataframe().id))
dsl.query(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe().head(2)
Returned Publications: 250 (total = 45670)
Time: 1.95s
Returned Publications: 20 (total = 1009)
Time: 1.69s
[14]:
authors | id | pages | title | type | volume | year | journal.id | journal.title | issue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Monastir', 'city_... | pub.1144584350 | 132398 | New spiropyrrolothiazole derivatives bearing a... | article | 1254 | 2022 | jour.1009258 | Journal of Molecular Structure | NaN |
1 | [{'affiliations': [{'city': 'Guangzhou', 'city... | pub.1144235680 | 100917 | Designed Biomass Materials for “Green” Electro... | article | 125 | 2022 | jour.1045007 | Progress in Materials Science | NaN |
What if I need a very long list?¶
The Dimcli library can break the results of a query up into chunks.
We then loop through each chunk, run a query with it, and stick the results back together again at the end.
[15]:
# Step 1 - same as before - except now we want the query in chunks
pubs_chunks = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""").chunks(250)
# Step 2 - almost the same as before - except now we use a for loop to loop through our results
query_results = []
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search publications
where reference_ids in {mypubslist}
return publications
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
pd.concat(query_results).\
drop_duplicates(subset='id').\
head(2)
Returned Publications: 1000 (total = 45670)
Time: 1.78s
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1009 (3.11s)
1000-1009 / 1009 (1.96s)
===
Records extracted: 1009
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1016 (3.28s)
1000-1016 / 1016 (1.46s)
===
Records extracted: 1016
Starting iteration with limit=1000 skip=0 ...
0-871 / 871 (3.43s)
===
Records extracted: 871
Starting iteration with limit=1000 skip=0 ...
0-910 / 910 (2.59s)
===
Records extracted: 910
[15]:
authors | id | pages | title | type | volume | year | journal.id | journal.title | issue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [{'affiliations': [{'city': 'Monastir', 'city_... | pub.1144584350 | 132398 | New spiropyrrolothiazole derivatives bearing a... | article | 1254 | 2022.0 | jour.1009258 | Journal of Molecular Structure | NaN |
1 | [{'affiliations': [{'city': 'Guangzhou', 'city... | pub.1144235680 | 100917 | Designed Biomass Materials for “Green” Electro... | article | 125 | 2022.0 | jour.1045007 | Progress in Materials Science | NaN |
6. What if I want to get the researchers associated with the publications that cite my institution?¶
[16]:
# Step 1 - same as before
pubs_chunks = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""").chunks(250)
query_results = []
# Step 2 - same as before, but now we return researchers instead of publications
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query(f"""
search publications
where reference_ids in {mypubslist}
return researchers limit 1000
""").as_dataframe()
# Warning: if more than 1000 researchers are involved in this query, some will be missed
)
# Step 3 - join the queries back together, this time using a groupby statement to sum the counts
my_researchers = pd.concat(query_results).\
groupby(['id','first_name','last_name']).\
agg({'count':'sum'}).\
sort_values(by='count', ascending=False).\
head(10)
Returned Publications: 1000 (total = 45670)
Time: 4.17s
Returned Researchers: 1000
Time: 2.60s
Returned Researchers: 1000
Time: 2.38s
Returned Researchers: 1000
Time: 1.57s
Returned Researchers: 1000
Time: 1.65s
7. What if I want to get all the researchers associated with the publications that cite my institution?¶
[17]:
# Step 1 - same as before
pubs_chunks = dsl.query("""
search publications
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
and year in [2019:2020]
return publications[id]
limit 1000
""").chunks(250)
query_results = []
# Step 2 - almost the same as before -
# except now we are asking for the as_dataframe_authors data frame
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search publications
where reference_ids in {mypubslist}
return publications[id+title+authors]
""").as_dataframe_authors() # I have changed this line from as_dataframe to as_datframe_authors
)
# Step 3 - join the publications back together
researcher_pubs = pd.concat(query_results).\
drop_duplicates(subset=['researcher_id','pub_id'])
# Step 4 - count up the publications using a groupby statement
my_researchers = researcher_pubs[researcher_pubs['researcher_id'] != ''].\
groupby(['researcher_id']).\
agg({'first_name':'max','last_name':'max','pub_id':'count'}).\
sort_values(by='pub_id', ascending=False).\
reset_index()
my_researchers.\
head(10)
Returned Publications: 1000 (total = 45670)
Time: 1.70s
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1009 (2.13s)
1000-1009 / 1009 (2.78s)
===
Records extracted: 1009
Starting iteration with limit=1000 skip=0 ...
0-1000 / 1016 (5.23s)
1000-1016 / 1016 (3.22s)
===
Records extracted: 1016
Starting iteration with limit=1000 skip=0 ...
0-871 / 871 (4.01s)
===
Records extracted: 871
Starting iteration with limit=1000 skip=0 ...
0-910 / 910 (2.81s)
===
Records extracted: 910
[17]:
researcher_id | first_name | last_name | pub_id | |
---|---|---|---|---|
0 | ur.0631757532.92 | Gin S. | Malhi | 22 |
1 | ur.015337322331.59 | Erica | Bell | 18 |
2 | ur.012754752112.35 | Richard J. | Porter | 17 |
3 | ur.01115612112.63 | Rena | Yadlapati | 13 |
4 | ur.01234653773.64 | Farhad | Ravandi | 12 |
5 | ur.0735337324.35 | Roger T | Mulder | 11 |
6 | ur.01044576346.22 | Frank R. | Dunshea | 11 |
7 | ur.01224616510.22 | John E. | Pandolfino | 11 |
8 | ur.011606313504.49 | Peter J. | Kahrilas | 10 |
9 | ur.0772267026.25 | C. Prakash | Gyawali | 10 |
8. …and if we want details about our researchers, we can put our list of researcher IDs into the researchers source¶
See the researcher source docs for more details.
[18]:
## First, we need to chunk up our researcher list
query_results = []
for g, rschr in my_researchers.groupby(np.arange(len(my_researchers)) // 250):
# This does *almost* the same thing as the chunks command used above
myreslist = json.dumps(list(rschr.researcher_id))
query_results.append(
dsl.query_iterative(f"""
search researchers
where id in {myreslist}
return researchers
""").as_dataframe() #
)
pd.concat(query_results).head()
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.44s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.62s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.66s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.09s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.76s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.28s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.22s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.23s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (17.52s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.49s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.96s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.91s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.45s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.24s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.84s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.74s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.90s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.44s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.93s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.14s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.09s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.84s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.32s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.78s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.01s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.28s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.47s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.84s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.26s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.07s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.12s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.24s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.23s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.38s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.16s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.39s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.06s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.22s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.37s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.09s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (2.49s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.15s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.03s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (3.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.16s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.08s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.04s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.05s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-246 / 246 (1.15s)
===
Records extracted: 246
[18]:
first_name | id | last_name | orcid_id | research_orgs | |
---|---|---|---|---|---|
0 | Jennifer A | ur.0777662157.75 | Summers | [0000-0002-3156-0479] | [{'city_name': 'Dunedin', 'country_name': 'New... |
1 | Francois Henri | ur.07764267264.89 | Nosten | [0000-0002-7951-0745] | [{'acronym': 'LSTM', 'city_name': 'Liverpool',... |
2 | Tony A | ur.0775731540.59 | Blakely | [0000-0002-6995-4369] | [{'acronym': 'QUT', 'city_name': 'Brisbane', '... |
3 | Chandra Prakash | ur.0772267026.25 | Gyawali | NaN | [{'acronym': 'QMUL', 'city_name': 'London', 'c... |
4 | Adam Kenneth | ur.0771337646.65 | Wheatley | [0000-0002-5593-9387] | [{'city_name': 'Melbourne', 'country_name': 'A... |
9. Patents example (patents -> publications)¶
Using the same method, we can retrieve all patents citing publications from my institution.
[19]:
%dsldocs patents
dsl_last_results[dsl_last_results['field']=='publication_ids']
[19]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
41 | patents | publication_ids | string | Dimensions IDs of the publications related to ... | True | False | False |
[20]:
# Step 1 - same as before - except now we want the query in chunks
pubs_chunks = dsl.query_iterative("""
search publications
where research_orgs.id in ["grid.1008.9"]
and year = 2015
return publications[id]
""").chunks(250)
# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with patents, and reference_ids was replaced by publication_ids
query_results = []
for c in pubs_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search patents
where publication_ids in {mypubslist}
return patents
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
cited_patents = pd.concat(query_results).\
drop_duplicates(subset='id')
cited_patents.head(2)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 8884 (2.54s)
1000-2000 / 8884 (1.41s)
2000-3000 / 8884 (2.06s)
3000-4000 / 8884 (1.82s)
4000-5000 / 8884 (1.63s)
5000-6000 / 8884 (3.83s)
6000-7000 / 8884 (1.95s)
7000-8000 / 8884 (1.93s)
8000-8884 / 8884 (1.70s)
===
Records extracted: 8884
Starting iteration with limit=1000 skip=0 ...
0-5 / 5 (0.73s)
===
Records extracted: 5
Starting iteration with limit=1000 skip=0 ...
0-24 / 24 (0.83s)
===
Records extracted: 24
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (0.88s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-10 / 10 (1.15s)
===
Records extracted: 10
Starting iteration with limit=1000 skip=0 ...
0-10 / 10 (1.55s)
===
Records extracted: 10
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.68s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-15 / 15 (0.86s)
===
Records extracted: 15
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.72s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-20 / 20 (0.94s)
===
Records extracted: 20
Starting iteration with limit=1000 skip=0 ...
0-13 / 13 (0.87s)
===
Records extracted: 13
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (0.98s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (0.75s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-10 / 10 (0.74s)
===
Records extracted: 10
Starting iteration with limit=1000 skip=0 ...
0-2 / 2 (1.20s)
===
Records extracted: 2
Starting iteration with limit=1000 skip=0 ...
0-26 / 26 (0.86s)
===
Records extracted: 26
Starting iteration with limit=1000 skip=0 ...
0-5 / 5 (1.06s)
===
Records extracted: 5
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.72s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-17 / 17 (2.38s)
===
Records extracted: 17
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (0.67s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-20 / 20 (0.70s)
===
Records extracted: 20
Starting iteration with limit=1000 skip=0 ...
0-26 / 26 (0.69s)
===
Records extracted: 26
Starting iteration with limit=1000 skip=0 ...
0-5 / 5 (0.72s)
===
Records extracted: 5
Starting iteration with limit=1000 skip=0 ...
0-36 / 36 (0.69s)
===
Records extracted: 36
Starting iteration with limit=1000 skip=0 ...
0-15 / 15 (0.73s)
===
Records extracted: 15
Starting iteration with limit=1000 skip=0 ...
0-21 / 21 (0.66s)
===
Records extracted: 21
Starting iteration with limit=1000 skip=0 ...
0-18 / 18 (0.70s)
===
Records extracted: 18
Starting iteration with limit=1000 skip=0 ...
0-16 / 16 (0.71s)
===
Records extracted: 16
Starting iteration with limit=1000 skip=0 ...
0-18 / 18 (0.75s)
===
Records extracted: 18
Starting iteration with limit=1000 skip=0 ...
0-28 / 28 (0.70s)
===
Records extracted: 28
Starting iteration with limit=1000 skip=0 ...
0-22 / 22 (2.21s)
===
Records extracted: 22
Starting iteration with limit=1000 skip=0 ...
0-18 / 18 (0.69s)
===
Records extracted: 18
Starting iteration with limit=1000 skip=0 ...
0-21 / 21 (0.74s)
===
Records extracted: 21
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (0.74s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-1 / 1 (0.92s)
===
Records extracted: 1
Starting iteration with limit=1000 skip=0 ...
0-1 / 1 (0.68s)
===
Records extracted: 1
Starting iteration with limit=1000 skip=0 ...
===
Records extracted: 0
[20]:
assignee_names | assignees | filing_status | id | inventor_names | publication_date | times_cited | title | year | granted_year | |
---|---|---|---|---|---|---|---|---|---|---|
0 | [PASTEUR INSTITUT, INST NAT SANTE RECH MED] | [{'city_name': 'Paris', 'country_name': 'Franc... | Application | WO-2018015815-A2 | [MECHERI SALAH, MENARD ROBERT, DEMARTA-GATSI C... | 2018-01-25 | 0 | PLASMODIUM WITH HISTAMINE RELEASING FACTOR (HR... | 2017 | NaN |
1 | [KDDI Corp] | [{'city_name': 'Tokyo', 'country_name': 'Japan... | Grant | US-10547380-B2 | [ISHIMURA SHOTA] | 2020-01-28 | 0 | Polarization tracking device, optical receivin... | 2019 | 2020.0 |
[21]:
%dsldocs patents
dsl_last_results[dsl_last_results['type']=='organizations']
[21]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
6 | patents | assignees | organizations | Disambiguated GRID organisations who own or ha... | True | True | True |
19 | patents | current_assignees | organizations | Disambiguated GRID organisations currenlty own... | True | True | True |
27 | patents | funders | organizations | GRID organisations funding the patent. | True | True | True |
37 | patents | original_assignees | organizations | Disambiguated GRID organisations that first ow... | True | True | True |
[22]:
import json
cited_patents_assignees = cited_patents.explode('assignees')
cited_patents_assignees['assignee_grid_id'] = cited_patents_assignees['assignees'].\
apply(lambda g: g['id'] if type(g) == dict else 0 )
cited_patents_assignees['assignee_name'] = cited_patents_assignees['assignees'].\
apply(lambda g: g['name'] if type(g) == dict else 0 )
cited_patents_assignees.\
groupby(['assignee_grid_id','assignee_name']).\
agg({'id':'count'}).\
sort_values(by='id', ascending=False).\
head(20)
[22]:
id | ||
---|---|---|
assignee_grid_id | assignee_name | |
0 | 0 | 126 |
grid.4280.e | National University of Singapore | 15 |
grid.7429.8 | French Institute of Health and Medical Research | 14 |
grid.453420.4 | SingHealth | 13 |
grid.431072.3 | AbbVie (United States) | 10 |
grid.116068.8 | Massachusetts Institute of Technology | 10 |
grid.4444.0 | French National Centre for Scientific Research | 8 |
grid.453773.1 | Wisconsin Alumni Research Foundation | 7 |
grid.1008.9 | University of Melbourne | 7 |
grid.417921.8 | Incyte (United States) | 6 |
grid.421350.1 | Northrop Grumman (United States) | 6 |
grid.6363.0 | Charité - University Medicine Berlin | 6 |
grid.1042.7 | Walter and Eliza Hall Institute of Medical Research | 5 |
grid.16753.36 | Northwestern University | 5 |
grid.467402.3 | Helsinn (Switzerland) | 5 |
grid.51462.34 | Memorial Sloan Kettering Cancer Center | 5 |
grid.1058.c | Murdoch Children's Research Institute | 5 |
grid.422530.2 | Waters (United Kingdom) | 4 |
grid.420451.6 | Google (United States) | 4 |
grid.168010.e | Stanford University | 4 |
10. Clinical Trials (clinical trials -> publications)¶
Using the same method, we can retrieve all clinical trials citing publications from my institution.
[23]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['field']=='research_orgs']
[23]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
30 | clinical_trials | research_orgs | organizations | GRID organizations involved, e.g. as sponsors ... | True | True | True |
[25]:
# Step 1 - same as before - except now we want the query in chunks
clinical_trials_chunks = dsl.query_iterative("""
search publications
where research_orgs.id in ["grid.1008.9"]
and year = 2015
return publications[id]
""").chunks(400)
# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with clinical_trials, and reference_ids was replaced by publication_ids
query_results = []
for c in clinical_trials_chunks:
mypubslist = json.dumps(list(pd.DataFrame(c).id))
query_results.append(
dsl.query_iterative(f"""
search clinical_trials
where publication_ids in {mypubslist}
return clinical_trials[basics+research_orgs]
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
cited_clinical_trials = pd.concat(query_results).\
drop_duplicates(subset='id')
cited_clinical_trials.head(2)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 8884 (3.10s)
1000-2000 / 8884 (2.89s)
2000-3000 / 8884 (3.20s)
3000-4000 / 8884 (3.63s)
4000-5000 / 8884 (3.87s)
5000-6000 / 8884 (3.81s)
6000-7000 / 8884 (1.63s)
7000-8000 / 8884 (1.48s)
8000-8884 / 8884 (1.90s)
===
Records extracted: 8884
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (1.13s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (1.21s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (1.01s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-16 / 16 (0.85s)
===
Records extracted: 16
Starting iteration with limit=1000 skip=0 ...
0-13 / 13 (2.91s)
===
Records extracted: 13
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (3.13s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (1.20s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (1.15s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (1.26s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (0.89s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
0-9 / 9 (1.31s)
===
Records extracted: 9
Starting iteration with limit=1000 skip=0 ...
0-13 / 13 (0.88s)
===
Records extracted: 13
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (0.85s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (1.25s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-14 / 14 (2.15s)
===
Records extracted: 14
Starting iteration with limit=1000 skip=0 ...
0-8 / 8 (0.84s)
===
Records extracted: 8
Starting iteration with limit=1000 skip=0 ...
0-11 / 11 (0.87s)
===
Records extracted: 11
Starting iteration with limit=1000 skip=0 ...
0-4 / 4 (1.57s)
===
Records extracted: 4
Starting iteration with limit=1000 skip=0 ...
0-6 / 6 (0.92s)
===
Records extracted: 6
Starting iteration with limit=1000 skip=0 ...
0-7 / 7 (1.04s)
===
Records extracted: 7
Starting iteration with limit=1000 skip=0 ...
0-2 / 2 (0.92s)
===
Records extracted: 2
Starting iteration with limit=1000 skip=0 ...
0-3 / 3 (1.21s)
===
Records extracted: 3
Starting iteration with limit=1000 skip=0 ...
===
Records extracted: 0
[25]:
active_years | id | investigators | title | research_orgs | |
---|---|---|---|---|---|
0 | [2021, 2022, 2023, 2024] | NCT05170594 | [[Haiyan Liu, , Principal Investigator, The Se... | A Prospective Study of Bevacizumab Combined Wi... | NaN |
1 | [2021, 2022, 2023] | NCT05147818 | NaN | Risk Factors for Acute Kidney Injury in Diabet... | [{'city_name': 'Assiut', 'country_name': 'Egyp... |
[28]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['type']=='organizations']
[28]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
19 | clinical_trials | funders | organizations | GRID funding organisations that are involved w... | True | True | True |
30 | clinical_trials | research_orgs | organizations | GRID organizations involved, e.g. as sponsors ... | True | True | True |
[29]:
cited_clinical_trials_orgs = cited_clinical_trials.explode('research_orgs')
cited_clinical_trials_orgs['research_orgs_grid_id'] = cited_clinical_trials_orgs['research_orgs'].\
apply(lambda g: g['id'] if type(g) == dict else 0 )
cited_clinical_trials_orgs['research_orgs_name'] = cited_clinical_trials_orgs['research_orgs'].\
apply(lambda g: g['name'] if type(g) == dict else 0 )
cited_clinical_trials_orgs.\
groupby(['research_orgs_grid_id','research_orgs_name']).\
agg({'id':'count'}).\
sort_values(by='id', ascending=False).\
head(20)
[29]:
id | ||
---|---|---|
research_orgs_grid_id | research_orgs_name | |
grid.1002.3 | Monash University | 6 |
0 | 0 | 4 |
grid.4367.6 | Washington University in St. Louis | 4 |
grid.506076.2 | Istanbul University Cerrahpaşa | 4 |
grid.223827.e | University of Utah | 4 |
grid.84393.35 | Hospital Universitari i Politècnic La Fe | 4 |
grid.411438.b | Hospital Universitari Germans Trias i Pujol | 3 |
grid.194645.b | University of Hong Kong | 3 |
grid.21925.3d | University of Pittsburgh | 3 |
grid.248762.d | BC Cancer Agency | 3 |
grid.2515.3 | Boston Children's Hospital | 3 |
grid.252487.e | Assiut University | 3 |
grid.412860.9 | Wake Forest Baptist Medical Center | 3 |
grid.428313.f | Corporació Sanitària Parc Taulí | 3 |
grid.266102.1 | University of California, San Francisco | 3 |
grid.81821.32 | Hospital Universitario La Paz | 3 |
grid.411052.3 | Central University Hospital of Asturias | 3 |
grid.411349.a | Reina Sofia Hospital | 3 |
grid.417574.4 | Abbott (United States) | 2 |
grid.440627.3 | University of the Andes, Chile | 2 |
11. Grants (publications -> grants)¶
Using the same method, we can retrieve all grants funding publications from my institution.
[30]:
%dsldocs publications
dsl_last_results[dsl_last_results['field'].str.contains('ids')]
[30]:
sources | field | type | description | is_filter | is_entity | is_facet | |
---|---|---|---|---|---|---|---|
20 | publications | clinical_trial_ids | string | Clinical Trial IDs mentioned in publications f... | True | False | False |
47 | publications | reference_ids | string | Dimensions publication ID for publications in ... | True | False | False |
61 | publications | supporting_grant_ids | string | Grants supporting a publication, returned as a... | True | False | False |
[38]:
# Step 1 - get all the publications from my institution, together with their supporting grant IDs
publications = dsl.query_iterative("""
search publications
where research_orgs.id in ["grid.1008.9"]
and year = 2020
return publications[id+supporting_grant_ids]
""").as_dataframe()
# Step 2 - we can get the grant IDs directly from the publications this time,
# so as a second step we pull the grants metadata using those identifiers.
pubs_grants = publications.explode('supporting_grant_ids')
grants_from_pubs = pd.DataFrame(pubs_grants.supporting_grant_ids.unique()).\
dropna().\
rename(columns={0:'id'})
query_results = []
for g, gnts in grants_from_pubs.groupby(np.arange(len(grants_from_pubs)) // 250):
# This does *almost* the same thing as the chunks command used above
myglist = json.dumps(list(gnts.id))
query_results.append(
dsl.query_iterative(f"""
search grants
where id in {myglist}
return grants[basics+funding_usd]
""").as_dataframe()
)
# Step 3 - join our results back together again, and get rid of duplicates
grant_details = pd.concat(query_results).\
drop_duplicates(subset='id')
grant_details.head(5)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 14092 (1.94s)
1000-2000 / 14092 (1.86s)
2000-3000 / 14092 (1.22s)
3000-4000 / 14092 (1.21s)
4000-5000 / 14092 (4.85s)
5000-6000 / 14092 (1.20s)
6000-7000 / 14092 (1.26s)
7000-8000 / 14092 (1.19s)
8000-9000 / 14092 (1.18s)
9000-10000 / 14092 (1.26s)
10000-11000 / 14092 (1.38s)
11000-12000 / 14092 (1.25s)
12000-13000 / 14092 (1.23s)
13000-14000 / 14092 (1.18s)
14000-14092 / 14092 (1.13s)
===
Records extracted: 14092
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.01s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.29s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.00s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.12s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.07s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.03s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.70s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-248 / 248 (0.98s)
===
Records extracted: 248
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.72s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.73s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.96s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.97s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.99s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.02s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.71s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.80s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.66s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.78s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (1.06s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.95s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.97s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.93s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.98s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.94s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-250 / 250 (0.85s)
===
Records extracted: 250
Starting iteration with limit=1000 skip=0 ...
0-247 / 247 (0.97s)
===
Records extracted: 247
[38]:
active_year | end_date | funders | funding_org_name | funding_usd | grant_number | id | language | original_title | start_date | start_year | title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [2021, 2022, 2023, 2024] | 2024-03-31 | [{'acronym': 'DOE SC', 'city_name': 'Washingto... | Office of Science | 14773638.0 | DE-SC0007881 | grant.4320658 | en | Particle Physics and Cosmology Research | 2021-07-01 | 2021.0 | Particle Physics and Cosmology Research |
1 | [2020, 2021, 2022] | 2022-03-31 | [{'acronym': 'JSPS', 'city_name': 'Tokyo', 'co... | Japan Society for the Promotion of Science | 62350.0 | 20K21008 | grant.9399908 | ja | Analysis of bounding flight in birds by dynami... | 2020-07-30 | 2020.0 | Analysis of bounding flight in birds by dynami... |
2 | [2020, 2021, 2022, 2023, 2024, 2025] | 2025-06-30 | [{'acronym': 'NIAID', 'city_name': 'Bethesda',... | National Institute of Allergy and Infectious D... | 1587803.0 | R01AI153400 | grant.9293892 | en | Polymerase Inhibitors of Respiratory Syncytial... | 2020-07-01 | 2020.0 | Polymerase Inhibitors of Respiratory Syncytial... |
3 | [2020, 2021, 2022, 2023, 2024, 2025] | 2025-05-31 | [{'acronym': 'NIAID', 'city_name': 'Bethesda',... | National Institute of Allergy and Infectious D... | 1082990.0 | R01AI148407 | grant.9293759 | en | Investigating the basis of MAIT cell antigen p... | 2020-06-15 | 2020.0 | Investigating the basis of MAIT cell antigen p... |
4 | [2020, 2021, 2022, 2023] | 2023-06-01 | [{'acronym': 'ARC', 'city_name': 'Canberra', '... | Australian Research Council | 333250.0 | DE200100264 | grant.8676066 | en | Giving Mums a fair go: culturally-responsive c... | 2020-06-01 | 2020.0 | Giving Mums a fair go: culturally-responsive c... |
[37]:
pubs_grants.groupby('supporting_grant_ids').\
agg({'id':'count'}).\
reset_index().\
rename(columns={'id':'pubs','supporting_grant_ids':'id'}).\
merge(grant_details[['id','original_title','funding_usd']],
on='id').\
sort_values(by='pubs', ascending=False)
[37]:
id | pubs | original_title | funding_usd | |
---|---|---|---|---|
6803 | grant.8470130 | 63 | Consolidated Grant 2019 | 2724578.0 |
6811 | grant.8470469 | 61 | Particle Physics Consolidated Grant from the U... | 3097905.0 |
6273 | grant.7924120 | 61 | GRIDPP PROJECT COORDINATION | NaN |
2155 | grant.4320658 | 57 | Particle Physics and Cosmology Research | 14773638.0 |
206 | grant.2438793 | 55 | Cancer Center Support Grant | 266136064.0 |
... | ... | ... | ... | ... |
3151 | grant.6447242 | 1 | Participatory Urban Living for Sustainable Env... | 5696332.0 |
3150 | grant.6447136 | 1 | Towards Novel Inert (Photo-)toxic Ru(II) Polyp... | 2317685.0 |
3149 | grant.6446143 | 1 | Understanding how an MLH1 promoter polymorphis... | 952550.0 |
3148 | grant.6446046 | 1 | Macronutrients and Metabolic Health - Understa... | 2822859.0 |
3997 | grant.6803312 | 1 | Macrophage Dysfunction in Obesity, Diabetes an... | 12071621.0 |
7995 rows × 4 columns
Conclusions¶
Lists are a simple data structure that can have a great number of applications.
When used in conjunction with the DSL language, they make it easy to concatenate the results of one query with another, e.g. in order to navigate through the links available in Dimensions (from publications to grants, patents, etc.).
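As a recap, the general pattern used throughout this notebook looks roughly like this (a sketch assuming dsl is an authenticated dimcli.Dsl instance, here following the publications-to-patents link):
# 1. Run a first query and extract the IDs you care about
seed_pubs = dsl.query_iterative("""
    search publications
    where research_orgs.id in ["grid.1008.9"] and year = 2020
    return publications[id]
""").as_dataframe()

# 2. Serialise the IDs as a JSON list (same syntax as a DSL list),
#    keeping the list below the 512-item limit
id_list = json.dumps(list(seed_pubs.id)[:500])

# 3. Feed the list into a second query over another source
citing_patents = dsl.query_iterative(f"""
    search patents
    where publication_ids in {id_list}
    return patents
""").as_dataframe()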
See also this patents tutorial or this clinical trials tutorial for more in-depth applications of the queries discussed above.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated Github repository for examples, the source code of these tutorials and much more.