
Extracting researchers based on affiliations and publication history

The purpose of this notebook is to demonstrate how to extract researcher data using the Dimensions API.

Specifically, we will look at a concrete use case: we want to find all researchers matching these two criteria:

  1. they are or have been affiliated to a specific GRID organization

  2. they have published within a chosen time frame

For the purpose of this exercise, we are going to use grid.258806.1 and the time frame 2013-2018. Feel free to change the parameters below as you wish, e.g. by choosing another GRID organization.

[1]:
# sample org: grid.258806.1
GRIDID = "grid.258806.1" #@param {type:"string"}
START_YEAR = 2013 #@param {type:"slider", min:1900, max:2020, step: 1}
END_YEAR = 2018 #@param {type:"slider", min:1900, max:2020, step: 1}

if START_YEAR > END_YEAR: START_YEAR = END_YEAR  # ensure a valid year range
YEARS = f"[{START_YEAR}:{END_YEAR}]"

Before we start, let’s also load some useful libraries and login with the Dimensions API.

[2]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai"

!pip install dimcli -U --quiet

# import all libraries and login
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()


import json
import pandas as pd
from pandas.io.json import json_normalize
import numpy as np
from tqdm import tnrange, tqdm_notebook as bar
from time import sleep
from IPython.display import Image
from IPython.core.display import HTML
Dimcli - Dimensions API Client (v0.6.9)
Connected to endpoint: https://app.dimensions.ai - DSL version: 1.24
Method: dsl.ini file

Background: understanding the data model

In order to process researcher affiliation data in the context of publications, we should first take the time to understand how this data is structured in Dimensions.

The JSON results of any query with the shape search publications where .... return publications consist of a list of publications. If we open up a single publication record, we immediately see that authors are stored in a nested object authors containing a list of dictionaries. Each element in this list represents a single publication author and includes other information, e.g. name, surname, ID, and the organizations they are affiliated with.

For example, in order to extract the second author of the eleventh publication from our results (list indices are zero-based) we would do the following: results.publications[10]['authors'][1]:

# author info
...
    {'first_name': 'Noboru',
     'last_name': 'Sebe',
     'orcid': '',
     'current_organization_id': 'grid.258806.1',
     'researcher_id': 'ur.010647607673.28',
     'affiliations': [{'id': 'grid.258806.1',
       'name': 'Kyushu Institute of Technology',
       'city': 'Kitakyushu',
       'city_id': 1859307,
       'country': 'Japan',
       'country_code': 'JP',
       'state': None,
       'state_code': None}]}
 ...
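
As a quick illustration, here is a minimal sketch of how one could collect all disambiguated researcher IDs from such a result set (assuming results holds the payload of a publications query like the one above; authors without a researcher_id are skipped):

# Sketch: collect the unique researcher IDs found in a publications result set
researcher_ids = {
    author['researcher_id']
    for pub in results.publications
    for author in pub.get('authors', [])
    if author.get('researcher_id')  # skip authors that have not been disambiguated
}
print(len(researcher_ids), "unique researchers found")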

Here’s an object model diagram summing up how the data is structured.

[3]:
Image(url= "http://api-sample-data.dimensions.ai/diagrams/data_model_researchers_publications1.v2.jpg", width=800)
[3]:

There are a few important things to keep in mind:

  • Publication authors vs. researchers. In Dimensions, publication authors don’t necessarily have a researcher ID (e.g. because they haven’t been disambiguated yet). So a publication may have N authors (stored in the JSON under the authors key), but only a subset of these includes a researcher_id link. See also the searching for researchers section of the API docs for more info on this topic.

  • Time of the affiliation. Researchers can be affiliated to a specific GRID organization either at the time of speaking (now) or at the time of writing (i.e. when the article was published). The DSL uses different properties to express this distinction: current_research_org and research_orgs, respectively. For the sake of this exercise, we will look at both (see the sketch after this list).

  • Denormalized fields. Both the Publication and Researcher sources include a research_orgs field; both are ‘denormalized’ shortcut versions of the data you’d find via the authors structure in publications. However they don’t have the same meaning: for publications, the field contains the union of all authors’ research organizations, while for researchers it is the set of all research organizations a single individual has been affiliated to throughout their career (as far as Dimensions knows, of course!).
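
To make the ‘time of writing’ vs ‘time of speaking’ distinction concrete, here is a minimal sketch of the two checks applied to a single author dictionary (one element of a publication’s authors list, as in the example above):

# Sketch: the two affiliation checks on one author dictionary
def affiliated_now(author, grid_id=GRIDID):
    # affiliation at the time of speaking (i.e. now)
    return author.get('current_organization_id') == grid_id

def affiliated_when_published(author, grid_id=GRIDID):
    # affiliation at the time of writing (i.e. when the article was published)
    return any(aff.get('id') == grid_id for aff in author.get('affiliations', []))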

So, in the real world we often have scenarios like the following one:

[4]:
Image(url= "http://api-sample-data.dimensions.ai/diagrams/data_model_researchers_publications2.v2.jpg", width=1000)
[4]:

Methodology: two options available

It turns out that there are two possible ways to extract these data, depending on whether we start our queries from Publications or from Researchers.

  1. Starting from the Publication source, we can first filter publications based on our constraints (e.g. year range [2013-2018] and research_orgs=“grid.258806.1”, but it could be any other query parameters); second, we loop over all of these publications to extract the relevant researchers using the affiliation data.

  2. Starting from the Researcher source, we first filter researchers based on our constraints (e.g. research_orgs=“grid.258806.1”); second, we search for publications linked to these researchers that were published in the time frame [2013-2018]; lastly, we extract the relevant researchers using the affiliation data.

As we will find out later, both approaches are perfectly valid and return the same results.

The first approach is generally quicker as it has only two steps (the second one has three).

In real world situations though, deciding which approach is best depends on the specific query filters being used and on the impact these filters have on the overall performance/speed of the data extraction. There is no fixed rule and a bit of trial & error can go a long way in helping you optimize your data extraction algorithm!

Approach 1. From publications to researchers

Starting from the publications source, the steps are as follows:

  1. Filtering publications for year range [2013-2018] and research_orgs=“grid.258806.1”

    • i.e. search publications where year in [2013:2018] and research_orgs="grid.258806.1" return publications

  2. Looping over publications’ authors and extracting relevant researchers

    • if ['current_organization_id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of speaking

    • or if ['affiliation']['id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of publishing

First off, we get all publications matching our search criteria by using an iterative (‘loop’) query.

[5]:
pubs = dsl.query_iterative(f"""search publications
                                where year in {YEARS} and research_orgs="{GRIDID}"
                            return publications[id+title+doi+year+type+authors+journal+issue+volume]""")
1000 / ...
705 / 705
===
Records extracted: 705

Researchers affiliated to the GRID-ID at the time of writing

First we want to know how many researchers linked to these publications were affiliated to the GRID organization when the publication was created (note: they may be at a different institution right now).

The affiliation data in publications represents exactly that: we can thus loop over it (for each publication/author) and keep only the entries matching our GRID ID.

TIP: affiliations can be extracted easily thanks to one of the ‘dataframe’ transformation methods in Dimcli: as_dataframe_authors_affiliations.

[6]:
# extract affiliations from a publications list
affiliations = pubs.as_dataframe_authors_affiliations()
# select only affiliations for GRIDID
authors_historical = affiliations[affiliations['aff_id'] == GRIDID].copy()
# remove duplicates by eliminating publication-specific data
authors_historical.drop(columns=['pub_id'], inplace=True)
authors_historical.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at time of writing:", authors_historical.researcher_id.nunique(), "\n===")
# preview the data
authors_historical
===
Researchers with affiliation to grid.258806.1 at time of writing: 547
===
[6]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code researcher_id first_name last_name
0 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP Siewteng Sim
1 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.01116323260.31 Yoshito Andou
6 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.010362311021.55 Kubra Eksiler
7 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.011724474721.84 Satoshi Iikubo
12 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.016307426665.31 R. Okamoto
... ... ... ... ... ... ... ... ... ... ... ...
4229 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.010000541204.77 Ryuji TAKAHASHI
4262 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.012624021122.89 Ko Ichinose
4265 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.014011430535.10 Kei Yamamura
4266 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.014512005745.07 Aoi Honda
4267 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.010305013750.53 Satoshi Hiai

547 rows × 11 columns

Note: the first ‘publication-affiliations’ dataframe we get may contain duplicate records, e.g. an author with more than one publication will be listed more than once. That’s why we add an extra step where we drop the pub_id column and count unique researchers, based on their researcher ID.
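
Incidentally, some author rows carry no researcher ID at all (they haven’t been disambiguated, as discussed in the data model section). Here is a quick sketch to quantify them, assuming non-disambiguated authors appear with an empty researcher_id string (as in the preview above):

# Sketch: count author rows for our GRID ID that lack a disambiguated researcher ID
mask = (affiliations['aff_id'] == GRIDID) & (affiliations['researcher_id'] == "")
print("Author rows without a researcher ID:", mask.sum())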

Researchers affiliated to the GRID-ID at the time of speaking, i.e. now

This can be achieved simply by taking into consideration a different field, current_organization_id, which is available at the outer level of the JSON author structure (see the data model section above), outside the affiliations list.

Luckily Dimcli includes another handy method for unpacking authors into a dataframe: as_dataframe_authors.

[7]:
authors_current = pubs.as_dataframe_authors()
authors_current = authors_current[authors_current['current_organization_id'] == GRIDID].copy()
authors_current.drop(columns=['pub_id'], inplace=True)
authors_current.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at the time of speaking:", authors_current.researcher_id.nunique(), "\n===")
authors_current
===
Researchers with affiliation to grid.258806.1 at the time of speaking: 514
===
[7]:
first_name last_name initials corresponding orcid current_organization_id researcher_id affiliations
1 Yoshito Andou grid.258806.1 ur.01116323260.31 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
6 Kubra Eksiler grid.258806.1 ur.010362311021.55 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
7 Satoshi Iikubo ['0000-0002-5186-4058'] grid.258806.1 ur.011724474721.84 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
12 R. Okamoto grid.258806.1 ur.016307426665.31 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
22 Shuzi Hayase grid.258806.1 ur.01055753603.27 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
... ... ... ... ... ... ... ... ...
3560 Ryuji TAKAHASHI grid.258806.1 ur.010000541204.77 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
3593 Ko Ichinose grid.258806.1 ur.012624021122.89 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
3596 Kei Yamamura grid.258806.1 ur.014011430535.10 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
3597 Aoi Honda ['0000-0002-4485-1523'] grid.258806.1 ur.014512005745.07 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
3598 Satoshi Hiai True grid.258806.1 ur.010305013750.53 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...

514 rows × 8 columns

Approach 2. From researchers to publications

Using this approach, we start our search from the ‘researchers’ database (instead of the ‘publications’ database).

There are 3 main steps:

  1. Filtering researchers with research_orgs=GRID-ID (note: this gives us affiliated researchers at any point in time)

    • search researchers where research_orgs="grid.258806.1" return researchers

  2. Searching for publications linked to these researchers and linked to GRID-ID, which have been published in the time frame [2013-2018]

    • search publications where researchers.id in {LIST OF IDS} and year in [2013:2018] and research_orgs="grid.258806.1" return publications

    • NOTE: this is a variation of the Approach-1 query above: we have just added the researcher IDs filter (thus reducing the search space)

  3. Extracting the relevant researchers from publications, using exactly the same steps as in approach 1 above.

    • if ['current_organization_id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of speaking

    • or if ['affiliation']['id'] == "grid.258806.1"

      • => that gives us the affiliations at the time of publishing

[8]:
q = f"""search researchers where research_orgs="{GRIDID}"
        return researchers[basics]"""
researchers_json = dsl.query_iterative(q)
researchers = researchers_json.as_dataframe()
researchers.head()
1000 / ...
1000 / 7490
2000 / 7490
3000 / 7490
4000 / 7490
5000 / 7490
6000 / 7490
7000 / 7490
7490 / 7490
===
Records extracted: 7490
[8]:
id last_name first_name research_orgs orcid_id
0 ur.010617300440.41 Tomizaki Kin-Ya [{'id': 'grid.32197.3e', 'linkout': ['http://w... NaN
1 ur.014652747700.37 Tomizaki Kin-Ya [{'id': 'grid.32197.3e', 'linkout': ['http://w... NaN
2 ur.012161120453.51 Seki Seita [{'id': 'grid.258806.1', 'linkout': ['https://... NaN
3 ur.012476062713.53 Fujii Hitoshi [{'id': 'grid.39158.36', 'linkout': ['https://... NaN
4 ur.010472574715.67 Kaido Chikara [{'id': 'grid.471761.2', 'linkout': ['http://w... NaN

Now we need to select only the researchers who have published in the time frame [2013:2018]. So for each researcher ID we must extract the full publication history in order to verify their relevance.

The most efficient way to do this is to use a query that extracts the publication history for several researchers at once (so as to avoid overrunning our API quota), and then, as a second step, to produce a clean list of relevant researchers from it.

[9]:
results = []
researchers_ids = list(researchers['id'])
# number of researcher IDs per query, so that we never hit the 1000-records limit per query
CHUNKS_SIZE = 300

q = """search publications
                where researchers.id in {}
                and year in {}
                and research_orgs="{}"
            return publications[id+title+doi+year+type+authors+journal+issue+volume] limit 1000"""


for chunk in chunks_of(researchers_ids, size=CHUNKS_SIZE):
    data = dsl.query(q.format(json.dumps(chunk), YEARS, GRIDID))
    try:
        results += data.publications
    except Exception:
        # a chunk may return no publications (or an error payload)
        pass

print("---\nFound", len(results), "publications for the given criteria (including duplicates)")

# simulate a DSL payload using Dimcli
pubs_v2 = dimcli.DslDataset.from_publications_list(results)

# transform to a dataframe to remove duplicates quickly
pubs_v2_df = pubs_v2.as_dataframe()
pubs_v2_df.drop_duplicates("id", inplace=True)
print("Final result:", len(pubs_v2_df), "unique publications")
Returned Publications: 36 (total = 36)
Returned Publications: 45 (total = 45)
Returned Publications: 92 (total = 92)
Returned Publications: 65 (total = 65)
Returned Publications: 21 (total = 21)
Returned Publications: 125 (total = 125)
Returned Publications: 71 (total = 71)
Returned Publications: 29 (total = 29)
Returned Publications: 83 (total = 83)
Returned Publications: 67 (total = 67)
Returned Publications: 42 (total = 42)
Returned Publications: 39 (total = 39)
Returned Publications: 126 (total = 126)
Returned Publications: 21 (total = 21)
Returned Publications: 51 (total = 51)
Returned Publications: 71 (total = 71)
Returned Publications: 42 (total = 42)
Returned Publications: 24 (total = 24)
Returned Publications: 139 (total = 139)
Returned Publications: 59 (total = 59)
Returned Publications: 43 (total = 43)
Returned Publications: 101 (total = 101)
Returned Publications: 32 (total = 32)
Returned Publications: 42 (total = 42)
Returned Publications: 129 (total = 129)
---
Found 1595 publications for the given criteria (including duplicates)
Final result: 683 unique publications

Researchers affiliated to the GRID-ID at the time of writing

This step is basically the same as in approach 1 above.

[10]:
# extract affiliations from a publications list
affiliations_v2 = pubs_v2.as_dataframe_authors_affiliations()
# select only affiliations for GRIDID
authors_historical_v2 = affiliations_v2[affiliations_v2['aff_id'] == GRIDID].copy()
# remove duplicates by eliminating publication-specific data
authors_historical_v2.drop(columns=['pub_id'], inplace=True)
authors_historical_v2.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at time of writing:", authors_historical_v2.researcher_id.nunique(), "\n===")
# preview the data
authors_historical_v2
===
Researchers with affiliation to grid.258806.1 at time of writing: 547
===
[10]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code researcher_id first_name last_name
0 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.011113737576.59 Zhen Wang
1 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.011773542351.82 Muhammad Akmal Kamarudin
2 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP Ng Chi Huey
3 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.010311155152.66 Fu Yang
4 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.010416773662.27 Manish Pandey
... ... ... ... ... ... ... ... ... ... ... ...
9149 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.014036341366.26 Kei Ohnishi
9368 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.0775361534.49 Ken'ichi Yokoyama
9451 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.014002425662.11 Taiki Torigoe
9523 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.011176635061.41 Takao Kodama
9990 grid.258806.1 Kyushu Institute of Technology Kitakyushu 1.85931e+06 Japan JP ur.01156670436.97 Hiroki Obata

547 rows × 11 columns

Researchers affiliated to the GRID-ID at the time of speaking, i.e. now

Here too, the procedure is exactly the same as in approach 1.

[11]:
authors_current_v2 = pubs_v2.as_dataframe_authors()
authors_current_v2 = authors_current_v2[authors_current_v2['current_organization_id'] == GRIDID].copy()
authors_current_v2.drop(columns=['pub_id'], inplace=True)
authors_current_v2.drop_duplicates('researcher_id', inplace=True)
print(f"===\nResearchers with affiliation to {GRIDID} at the time of speaking:", authors_current_v2.researcher_id.nunique(), "\n===")
authors_current_v2
===
Researchers with affiliation to grid.258806.1 at the time of speaking: 514
===
[11]:
first_name last_name initials corresponding orcid current_organization_id researcher_id affiliations
0 Zhen Wang ['0000-0002-9136-3100'] grid.258806.1 ur.011113737576.59 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
1 Muhammad Akmal Kamarudin ['0000-0002-2256-5948'] grid.258806.1 ur.011773542351.82 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
3 Fu Yang ['0000-0001-7673-8026'] grid.258806.1 ur.010311155152.66 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
4 Manish Pandey ['0000-0003-0963-8097'] grid.258806.1 ur.010416773662.27 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
5 Gaurav Kapil grid.258806.1 ur.07620457442.79 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
... ... ... ... ... ... ... ... ...
7907 Kei Ohnishi grid.258806.1 ur.014036341366.26 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
8119 Ken'ichi Yokoyama True grid.258806.1 ur.0775361534.49 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
8202 Taiki Torigoe grid.258806.1 ur.014002425662.11 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
8255 Takao Kodama True grid.258806.1 ur.011176635061.41 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...
8671 Hiroki Obata grid.258806.1 ur.01156670436.97 [{'id': 'grid.258806.1', 'name': 'Kyushu Insti...

514 rows × 8 columns

Conclusions

As anticipated above, both approaches are equally valid and in fact they return the same (or very similar) number of results. Let’s compare them:

[12]:
# create summary table
data = [['1', len(authors_current), len(authors_historical),
         """search publications where year in [2013:2018] and research_orgs="grid.258806.1" return publication""",
         ],
        ['2', len(authors_current_v2), len(authors_historical_v2),
         """search researchers where research_orgs="grid.258806.1" return researchers --- then --- search publications where researchers.id in {IDS} and year in [2013:2018] and research_orgs={GRIDID} return publications""",
        ]]

pd.DataFrame(data, columns = ['Method', 'Authors (current)', 'Authors (historical)',  'Query'])
[12]:
Method Authors (current) Authors (historical) Query
0 1 514 547 search publications where year in [2013:2018] ...
1 2 514 547 search researchers where research_orgs="grid.2...

Why are the total counts different?

In some cases you might encounter small differences in the total number of records returned by the two approaches (e.g. one method returns 1-2 more records than the other).

This is usually due to a synchronization delay between the Dimensions databases (e.g. publications and researchers). The differences are negligible in most cases; in general, it’s enough to run the same extraction again after a day or two for the problem to disappear.
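
If you want to check the overlap programmatically, here is a minimal sketch comparing the researcher IDs extracted by the two approaches (using the dataframes built above):

# Sketch: compare the researcher IDs found by the two approaches
ids_1 = set(authors_historical['researcher_id'].dropna())
ids_2 = set(authors_historical_v2['researcher_id'].dropna())
print("Same researchers from both approaches:", ids_1 == ids_2)
print("Found by only one approach:", ids_1.symmetric_difference(ids_2) or "none")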

So which method should I choose?

It depends on which (and how many) filters are used to identify a suitable result set for your research question.

  • The first approach is generally quicker as it has only two steps, as opposed to the second method that has three.

  • However, if your initial publications query returns lots of results (e.g. for a large institution or a long time frame), it may be quicker to try out method 2 instead.

  • The second approach can be handy if one wants to pre-filter researchers using one of the other available properties (e.g. last_grant_year).

So, in general, deciding which approach is best depends on the specific query filters being used and on the impact these filters have on the overall performance/speed of the data extraction.

There is no fixed rule and a bit of trial & error can go a long way in helping you optimize your data extraction algorithm!



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
