Extracting researchers based on affiliations and publications history¶
The purpose of this notebook is to demonstrate how to extract researcher data using the Dimensions API.
Specifically, we will look at a concrete use case. We want to find all researchers matching these two criteria:
they are or have been affiliated with a specific GRID organization
they have published within a chosen time frame
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.
[1]:
!pip install dimcli -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
import pandas as pd
from IPython.display import Image
from IPython.core.display import HTML
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
import getpass
KEY = getpass.getpass(prompt='API Key: ')
dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
KEY = ""
dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.8.2)
Connected to: https://app.dimensions.ai - DSL v1.28
Method: dsl.ini file
Select an organization ID¶
For the purpose of this exercise, we are going to use grid.258806.1 and the time frame 2013-2018. Feel free to change the parameters below as you wish, e.g. by choosing another GRID organization.
[14]:
# sample org: grid.258806.1
GRIDID = "grid.258806.1" #@param {type:"string"}
START_YEAR = 2013 #@param {type:"slider", min:1900, max:2020, step: 1}
END_YEAR = 2018 #@param {type:"slider", min:1900, max:2020, step: 1}
if START_YEAR > END_YEAR: START_YEAR = END_YEAR  # sanity check: the start year cannot exceed the end year
YEARS = f"[{START_YEAR}:{END_YEAR}]"
Background: understanding the data model¶
In order to process researcher affiliation data in the context of publications, we should first take the time to understand how this data is structured in Dimensions.
The JSON results of any query with shape search publications where .... return publications consist of a list of publications. If we open up one single publication record we will immediately see that authors are stored in a nested object authors, which contains a list of dictionaries. Each element in this list represents one single publication author and includes other information, e.g. first name, last name, researcher ID, the organizations the author is affiliated with, etc.
For example, in order to extract the second author of the tenth publication from our results we would do the following: results.publications[9]['authors'][1]:
# author info
...
{'first_name': 'Noboru',
'last_name': 'Sebe',
'orcid': '',
'current_organization_id': 'grid.258806.1',
'researcher_id': 'ur.010647607673.28',
'affiliations': [{'id': 'grid.258806.1',
'name': 'Kyushu Institute of Technology',
'city': 'Kitakyushu',
'city_id': 1859307,
'country': 'Japan',
'country_code': 'JP',
'state': None,
'state_code': None}]}
...
Here’s an object model diagram summing up how the data is structured.
[15]:
Image(url= "http://api-sample-data.dimensions.ai/diagrams/data_model_researchers_publications1.v2.jpg", width=800)
[15]:

There are a few important things to keep in mind:
Publication authors VS researchers. In Dimensions, publication authors don’t necessarily have a researcher ID (e.g. because they haven’t been disambiguated yet). So a publication may have N authors (stored in JSON within the authors key), but only a subset of these include a researcher_id link. PS: see also the searching for researchers section of the API docs for more info on this topic.
Time of the affiliation. Researchers can be affiliated with a specific GRID organization either at the time of speaking (now) or at the time of writing (i.e. when the article was published). The DSL uses different properties to express this fact: current_research_org or simply research_orgs. For the sake of this exercise, we will look at both.
Denormalized fields. Both the Publication and Researcher sources include a research_orgs field - both are ‘denormalized’ shortcut versions of the data you’d find via the authors structure in publications. However, they don’t have the same meaning: for publications, the field contains the union of all authors’ research organizations, while for researchers, it is the set of all research organizations a single individual has been affiliated with throughout their career (as far as Dimensions knows, of course!).
So, in the real world we often have scenarios like the following one:
[16]:
Image(url= "http://api-sample-data.dimensions.ai/diagrams/data_model_researchers_publications2.v2.jpg", width=1000)
[16]:

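To make the first point above concrete, here is a minimal sketch showing how to tell disambiguated authors apart. The record below is made up for illustration (it is not real API output), but it follows the author structure shown earlier:

```python
# Hypothetical publication record, shaped like the authors structure shown above.
pub = {
    "authors": [
        {"first_name": "Noboru", "last_name": "Sebe",
         "researcher_id": "ur.010647607673.28",
         "current_organization_id": "grid.258806.1",
         "affiliations": [{"id": "grid.258806.1"}]},
        {"first_name": "Jane", "last_name": "Doe",
         "researcher_id": "",  # not disambiguated: no researcher ID
         "current_organization_id": "",
         "affiliations": [{"id": "grid.5379.8"}]},
    ]
}

# Only authors with a non-empty researcher_id can be linked to a Researcher record.
disambiguated = [a for a in pub["authors"] if a.get("researcher_id")]
print(len(pub["authors"]), len(disambiguated))  # 2 1
```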
Methodology: two options available¶
It turns out that there are two possible ways to extract these data, depending on whether we start our queries from Publications or from Researchers.
Starting from the Publication source, we can first filter publications based on our constraints (e.g. year range [2013-2018] and research_orgs="grid.258806.1" - but it could be any other query parameters); second, we loop over all of these publications so as to extract all relevant researchers using the affiliation data.
Starting from the Researcher source, we would first filter researchers based on our constraints (e.g. with research_orgs="grid.258806.1"); second, we would search for publications linked to these researchers which have been published in the time frame [2013-2018]; lastly, we extract all relevant researchers using the affiliation data.
As we will find out later, both approaches are perfectly valid and return the same results.
The first approach is generally quicker as it has only two steps (the second one has three).
In real world situations though, deciding which approach is best depends on the specific query filters being used and on the impact these filters have on the overall performance/speed of the data extraction. There is no fixed rule and a bit of trial & error can go a long way in helping you optimize your data extraction algorithm!
Approach 1. From publications to researchers¶
Starting from the publications source, the steps are as follows:
Filtering publications for year range [2013-2018] and research_orgs="grid.258806.1", i.e.
search publications where year in [2013:2018] and research_orgs="grid.258806.1" return publications
Looping over publications’ authors and extracting relevant researchers:
if ['current_organization_id'] == "grid.258806.1" => that gives us the affiliations at the time of speaking
or if ['affiliation']['id'] == "grid.258806.1" => that gives us the affiliations at the time of publishing
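The two checks above can be sketched as plain-Python predicates. The helper names are hypothetical (not part of Dimcli); they operate on the author dictionaries shown in the data model section:

```python
def affiliated_now(author, grid_id):
    # affiliation at the time of speaking (the author's current organization)
    return author.get("current_organization_id") == grid_id

def affiliated_when_published(author, grid_id):
    # affiliation at the time of publishing (listed in the publication's affiliations)
    return any(aff.get("id") == grid_id for aff in author.get("affiliations", []))

# Example: an author who has since moved to another institution.
author = {"current_organization_id": "grid.5379.8",
          "affiliations": [{"id": "grid.258806.1"}]}
print(affiliated_now(author, "grid.258806.1"))             # False
print(affiliated_when_published(author, "grid.258806.1"))  # True
```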
First off, we get all publications matching our search criteria by using a loop query.
[17]:
pubs = dsl.query_iterative(f"""search publications
where year in {YEARS} and research_orgs="{GRIDID}"
return publications[id+title+doi+year+type+authors+journal+issue+volume]""")
Starting iteration with limit=1000 skip=0 ...
0-693 / 693 (1.23s)
===
Records extracted: 693
First we want to know how many researchers linked to these publications were affiliated with the GRID organization when the publication was created (note: they may be at a different institution right now).
The affiliation data in publications represents exactly that: we can thus loop over it (for each publication/author) and keep only the entries matching our GRID ID.
TIP: affiliations can be extracted easily thanks to one of the ‘dataframe’ transformation methods in Dimcli: as_dataframe_authors_affiliations
[18]:
# extract affiliations from a publications list
affiliations = pubs.as_dataframe_authors_affiliations()
# select only affiliations for GRIDID
authors_historical = affiliations[affiliations['aff_id'] == GRIDID].copy()
# remove duplicates by eliminating publication-specific data
authors_historical.drop(columns=['pub_id'], inplace=True)
authors_historical.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at time of writing:", authors_historical.researcher_id.nunique(), "\n===")
# preview the data
authors_historical
===
Researchers with affiliation to grid.258806.1 at time of writing: 546
===
[18]:
aff_id | aff_name | aff_city | aff_city_id | aff_country | aff_country_code | aff_state | aff_state_code | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | Siewteng | Sim | |||
1 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.01116323260.31 | Yoshito | Andou | ||
6 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.010362311021.55 | Kubra | Eksiler | ||
7 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.011724474721.84 | Satoshi | Iikubo | ||
12 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.016307426665.31 | R. | Okamoto | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4025 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.011222056445.46 | Harald | KLEINE | ||
4033 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.010000541204.77 | Ryuji | TAKAHASHI | ||
4062 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.010305013750.53 | Satoshi | Hiai | ||
4072 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.014011430535.10 | Kei | Yamamura | ||
4073 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.014512005745.07 | Aoi | Honda |
546 rows × 11 columns
Note: the first ‘publication-affiliations’ dataframe we get may contain duplicate records - e.g. if an author has more than one publication, they will be listed more than once. That’s why we have an extra step where we drop the pub_id column and simply count unique researchers, based on their researcher ID.
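The dedup step works because, once the publication-specific pub_id column is dropped, rows for the same researcher become identical. A tiny standalone example with dummy data:

```python
import pandas as pd

# Two publications by researcher ur.1 produce two affiliation rows (dummy data).
df = pd.DataFrame({
    "pub_id":        ["pub.1", "pub.2", "pub.2"],
    "researcher_id": ["ur.1",  "ur.1",  "ur.2"],
    "aff_id":        ["grid.258806.1"] * 3,
})

# Drop the publication-specific column, then keep one row per researcher.
unique = df.drop(columns=["pub_id"]).drop_duplicates("researcher_id")
print(unique.researcher_id.tolist())  # ['ur.1', 'ur.2']
```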
Next, we want to know how many of these researchers are currently affiliated with the GRID organization. This can be achieved simply by taking into consideration a different field, called current_organization_id, available at the outer level of the JSON author structure (see the data model section above) - outside the affiliations list.
Luckily Dimcli includes another handy method for unpacking authors into a dataframe: as_dataframe_authors
[19]:
authors_current = pubs.as_dataframe_authors()
authors_current = authors_current[authors_current['current_organization_id'] == GRIDID].copy()
authors_current.drop(columns=['pub_id'], inplace=True)
authors_current.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at the time of speaking:", authors_current.researcher_id.nunique(), "\n===")
authors_current
===
Researchers with affiliation to grid.258806.1 at the time of speaking: 518
===
[19]:
first_name | last_name | corresponding | orcid | current_organization_id | researcher_id | affiliations | |
---|---|---|---|---|---|---|---|
1 | Yoshito | Andou | grid.258806.1 | ur.01116323260.31 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
6 | Kubra | Eksiler | grid.258806.1 | ur.010362311021.55 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
7 | Satoshi | Iikubo | ['0000-0002-5186-4058'] | grid.258806.1 | ur.011724474721.84 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
12 | R. | Okamoto | grid.258806.1 | ur.016307426665.31 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
22 | Shuzi | Hayase | grid.258806.1 | ur.01055753603.27 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
... | ... | ... | ... | ... | ... | ... | ... |
3455 | Makoto | TAKENAKA | grid.258806.1 | ur.015326226575.37 | [{'name': 'Kagawa Prefectural Industrial Techn... | ||
3475 | Ryuji | TAKAHASHI | grid.258806.1 | ur.010000541204.77 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
3504 | Satoshi | Hiai | True | grid.258806.1 | ur.010305013750.53 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
3514 | Kei | Yamamura | grid.258806.1 | ur.014011430535.10 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
3515 | Aoi | Honda | ['0000-0002-4485-1523'] | grid.258806.1 | ur.014512005745.07 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... |
518 rows × 7 columns
Approach 2. From researchers to publications¶
Using this approach, we start our search from the ‘researchers’ database (instead of the ‘publications’ database).
There are 3 main steps:
Filtering researchers with research_orgs=GRID-ID (note: this gives us affiliated researchers at any point in time)
search researchers where research_orgs="grid.258806.1" return researchers
Searching for publications linked to these researchers and linked to GRID-ID, which have been published in the time frame
[2013-2018]
search publications where researchers.id in {LIST OF IDS} and year in [2013:2018] and research_orgs="grid.258806.1" return publications
NOTE: this is a variation of the Approach 1 query above: we have just added the researchers' IDs filter (thus reducing the search space)
Extracting relevant researchers from publications, using the same exact steps as in approach 1 above.
if
['current_organization_id'] == "grid.258806.1"
=> that gives us the affiliations at the time of speaking
or if
['affiliation']['id'] == "grid.258806.1"
=> that gives us the affiliations at the time of publishing
[20]:
q = f"""search researchers where research_orgs="{GRIDID}"
return researchers[basics]"""
researchers_json = dsl.query_iterative(q)
researchers = researchers_json.as_dataframe()
researchers.head()
Starting iteration with limit=1000 skip=0 ...
0-1000 / 8976 (0.86s)
1000-2000 / 8993 (1.41s)
2000-3000 / 8993 (1.08s)
3000-4000 / 8976 (2.34s)
4000-5000 / 8993 (1.16s)
5000-6000 / 8976 (0.90s)
6000-7000 / 8976 (1.27s)
7000-8000 / 8976 (0.97s)
8000-8993 / 8993 (1.06s)
===
Records extracted: 8993
[20]:
id | first_name | last_name | research_orgs | orcid_id | |
---|---|---|---|---|---|
0 | ur.012161120453.51 | Seita | Seki | [{'id': 'grid.258806.1', 'types': ['Education'... | NaN |
1 | ur.014746662301.94 | Takeshi | Yoshinaga | [{'id': 'grid.258806.1', 'types': ['Education'... | NaN |
2 | ur.016064142555.51 | Charles Ronald | Harahap | [{'id': 'grid.258806.1', 'types': ['Education'... | NaN |
3 | ur.016437530541.00 | Yurie | Sugimoto | [{'id': 'grid.258806.1', 'types': ['Education'... | NaN |
4 | ur.011113737576.59 | Zhen | Wang | [{'id': 'grid.258806.1', 'types': ['Education'... | [0000-0002-9136-3100] |
Now we need to select only the researchers who have published in the time frame [2013:2018]. So for each researcher ID we must extract the full publication history in order to verify their relevance.
The most efficient way to do this is to use a query that extracts the publication history for several researchers at the same time (so as to avoid overrunning our API quota) and then, as a second step, to produce a clean list of relevant researchers from it.
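The chunking logic used in the next cell (Dimcli ships it as chunks_of) boils down to slicing the ID list into fixed-size batches. A plain-Python equivalent, for illustration only:

```python
def chunks_of(items, size):
    """Yield successive size-long slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 7 IDs split into batches of 3 -> batch sizes 3, 3, 1
ids = [f"ur.{n}" for n in range(7)]
print([len(c) for c in chunks_of(ids, 3)])  # [3, 3, 1]
```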
[21]:
results = []
researchers_ids = list(researchers['id'])
# number of researcher IDs per query: keeps each query safely below the 1000-records limit
CHUNKS_SIZE = 300
q = """search publications
       where researchers.id in {}
       and year in {}
       and research_orgs="{}"
       return publications[id+title+doi+year+type+authors+journal+issue+volume] limit 1000"""
from dimcli.shortcuts import chunks_of
for chunk in chunks_of(researchers_ids, size=CHUNKS_SIZE):
    data = dsl.query(q.format(json.dumps(chunk), YEARS, GRIDID))
    try:
        results += data.publications
    except Exception:
        pass  # this chunk returned no publications
print("---\nFound", len(results), "publications for the given criteria (including duplicates)")
# simulate a DSL payload using Dimcli
pubs_v2 = dimcli.DslDataset.from_publications_list(results)
# transform to a dataframe to remove duplicates quickly
pubs_v2_df = pubs_v2.as_dataframe()
pubs_v2_df.drop_duplicates("id", inplace=True)
print("Final result:", len(pubs_v2_df), "unique publications")
Returned Publications: 32 (total = 32)
Time: 0.78s
Returned Publications: 54 (total = 54)
Time: 1.34s
Returned Publications: 78 (total = 78)
Time: 1.27s
Returned Publications: 43 (total = 43)
Time: 0.87s
Returned Publications: 25 (total = 25)
Time: 0.86s
Returned Publications: 88 (total = 88)
Time: 1.00s
Returned Publications: 70 (total = 70)
Time: 0.92s
Returned Publications: 62 (total = 62)
Time: 0.76s
Returned Publications: 24 (total = 24)
Time: 0.66s
Returned Publications: 46 (total = 46)
Time: 0.66s
Returned Publications: 75 (total = 75)
Time: 0.77s
Returned Publications: 47 (total = 47)
Time: 0.66s
Returned Publications: 29 (total = 29)
Time: 0.97s
Returned Publications: 80 (total = 80)
Time: 0.78s
Returned Publications: 89 (total = 89)
Time: 0.79s
Returned Publications: 21 (total = 21)
Time: 0.64s
Returned Publications: 41 (total = 41)
Time: 0.63s
Returned Publications: 66 (total = 66)
Time: 0.76s
Returned Publications: 17 (total = 17)
Time: 0.67s
Returned Publications: 36 (total = 36)
Time: 0.63s
Returned Publications: 67 (total = 67)
Time: 0.98s
Returned Publications: 46 (total = 46)
Time: 0.94s
Returned Publications: 76 (total = 76)
Time: 0.94s
Returned Publications: 54 (total = 54)
Time: 0.94s
Returned Publications: 59 (total = 59)
Time: 0.77s
Returned Publications: 50 (total = 50)
Time: 0.70s
Returned Publications: 36 (total = 36)
Time: 0.70s
Returned Publications: 46 (total = 46)
Time: 0.79s
Returned Publications: 89 (total = 89)
Time: 0.99s
Returned Publications: 61 (total = 61)
Time: 0.78s
---
Found 1607 publications for the given criteria (including duplicates)
Final result: 680 unique publications
This step is basically the same as in approach 1 above.
[22]:
# extract affiliations from a publications list
affiliations_v2 = pubs_v2.as_dataframe_authors_affiliations()
# select only affiliations for GRIDID
authors_historical_v2 = affiliations_v2[affiliations_v2['aff_id'] == GRIDID].copy()
# remove duplicates by eliminating publication-specific data
authors_historical_v2.drop(columns=['pub_id'], inplace=True)
authors_historical_v2.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at time of writing:", authors_historical_v2.researcher_id.nunique(), "\n===")
# preview the data
authors_historical_v2
===
Researchers with affiliation to grid.258806.1 at time of writing: 546
===
[22]:
aff_id | aff_name | aff_city | aff_city_id | aff_country | aff_country_code | aff_state | aff_state_code | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.011113737576.59 | Zhen | Wang | ||
1 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.011773542351.82 | Muhammad Akmal | Kamarudin | ||
2 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | Ng Chi | Huey | |||
3 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.010311155152.66 | Fu | Yang | ||
4 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.010416773662.27 | Manish | Pandey | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9445 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.015771611513.28 | Yoshiro | Fukui | ||
9885 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.011176635061.41 | Takao | Kodama | ||
9908 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.014036341366.26 | Kei | Ohnishi | ||
10098 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.0775361534.49 | Ken'ichi | Yokoyama | ||
10122 | grid.258806.1 | Kyushu Institute of Technology | Kitakyushu | 1.85931e+06 | Japan | JP | ur.01156670436.97 | Hiroki | Obata |
546 rows × 11 columns
Here too, the procedure is exactly the same as in approach 1.
[23]:
authors_current_v2 = pubs_v2.as_dataframe_authors()
authors_current_v2 = authors_current_v2[authors_current_v2['current_organization_id'] == GRIDID].copy()
authors_current_v2.drop(columns=['pub_id'], inplace=True)
authors_current_v2.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at the time of speaking:", authors_current_v2.researcher_id.nunique(), "\n===")
authors_current_v2
===
Researchers with affiliation to grid.258806.1 at the time of speaking: 518
===
[23]:
first_name | last_name | corresponding | orcid | current_organization_id | researcher_id | affiliations | |
---|---|---|---|---|---|---|---|
0 | Zhen | Wang | ['0000-0002-9136-3100'] | grid.258806.1 | ur.011113737576.59 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
1 | Muhammad Akmal | Kamarudin | ['0000-0002-2256-5948'] | grid.258806.1 | ur.011773542351.82 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
3 | Fu | Yang | ['0000-0001-7673-8026'] | grid.258806.1 | ur.010311155152.66 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
4 | Manish | Pandey | ['0000-0003-0963-8097'] | grid.258806.1 | ur.010416773662.27 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
5 | Gaurav | Kapil | grid.258806.1 | ur.07620457442.79 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
... | ... | ... | ... | ... | ... | ... | ... |
8268 | Yoshiro | Fukui | grid.258806.1 | ur.015771611513.28 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
8680 | Takao | Kodama | True | grid.258806.1 | ur.011176635061.41 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
8701 | Kei | Ohnishi | grid.258806.1 | ur.014036341366.26 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | ||
8868 | Ken'ichi | Yokoyama | True | grid.258806.1 | ur.0775361534.49 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... | |
8891 | Hiroki | Obata | grid.258806.1 | ur.01156670436.97 | [{'id': 'grid.258806.1', 'name': 'Kyushu Insti... |
518 rows × 7 columns
Conclusions¶
As anticipated above, both approaches are equally valid and in fact they return the same (or very similar) number of results. Let’s compare them:
[24]:
# create summary table
data = [['1', len(authors_current), len(authors_historical),
"""search publications where year in [2013:2018] and research_orgs="grid.258806.1" return publications""",
],
['2', len(authors_current_v2), len(authors_historical_v2),
"""search researchers where research_orgs="grid.258806.1" return researchers --- then --- search publications where researchers.id in {IDS} and year in [2013:2018] and research_orgs={GRIDID} return publications""",
]]
pd.DataFrame(data, columns = ['Method', 'Authors (current)', 'Authors (historical)', 'Query'])
[24]:
Method | Authors (current) | Authors (historical) | Query | |
---|---|---|---|---|
0 | 1 | 518 | 546 | search publications where year in [2013:2018] ... |
1 | 2 | 518 | 546 | search researchers where research_orgs="grid.2... |
In some cases you might encounter small differences in the total number of records returned by the two approaches (e.g. one method returns 1-2 more records than the other).
This is usually due to a synchronization delay between Dimensions databases (e.g. publications and researchers). The differences are negligible in most cases and, in general, it’s enough to run the same extraction again after a day or two for them to disappear.
So which approach should you choose? It depends on which (and how many) filters are being used in order to identify a suitable results set for your research question.
The first approach is generally quicker as it has only two steps, as opposed to the second method which has three. However, if your initial publications query returns lots of results (e.g. for a large institution or a big time frame), it may be quicker to try out method 2 instead.
The second approach can be handy if one wants to pre-filter researchers using one of the other available properties (e.g. last_grant_year).
So, in general, deciding which approach is best depends on the specific query filters being used and on the impact these filters have on the overall performance/speed of the data extraction.
There is no fixed rule and a bit of trial & error can go a long way in helping you optimize your data extraction algorithm!
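If you do see small differences between the two approaches, a quick way to inspect them is to compare the researcher ID sets directly. The sketch below uses dummy sets so it runs standalone; with the dataframes built above you would use e.g. set(authors_current["researcher_id"]) instead:

```python
# Dummy stand-ins for the researcher IDs returned by the two approaches.
ids_v1 = {"ur.1", "ur.2", "ur.3"}           # e.g. set(authors_current["researcher_id"])
ids_v2 = {"ur.1", "ur.2", "ur.3", "ur.4"}   # e.g. set(authors_current_v2["researcher_id"])

# Symmetric difference: IDs found by only one of the two approaches.
only_in_one = ids_v1 ^ ids_v2
print(sorted(only_in_one))  # ['ur.4']
```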
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated GitHub repository for examples, the source code of these tutorials and much more.