
Journal Profiling Part 1: Getting the Data

This Python notebook shows how to use the Dimensions Analytics API to extract publications data for a specific journal, as well as its authors and affiliations.

This tutorial is the first of a series that uses the data extracted in order to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to:

  • extract all publications data for a given journal

  • have a quick look at the publications’ authors and affiliations

  • review how many authors have been disambiguated with a Dimensions Researcher ID

  • produce a dataset of non-disambiguated authors that can be used for manual disambiguation

Prerequisites: Installing the Dimensions Library and Logging in

[1]:

# @markdown # Get the API library and login
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}


!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

#
# load common libraries
import time
import sys
import json
import os
import pandas as pd
from pandas.io.json import json_normalize
from tqdm.notebook import tqdm as progress

#
# charts libs
# import plotly_express as px
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
    os.mkdir(FOLDER_NAME)

def save(df,filename_dot_csv):
    df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)
DimCli v0.6.8.1 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)

Selecting a Journal and Extracting All Publications Metadata

[2]:
#@title Select a journal from the dropdown
#@markdown If the journal isn't there, you can type in the exact name instead.

journal_title = "Nature Genetics" #@param ['Nature', 'Nature Communications', 'Nature Biotechnology', 'Nature Medicine', 'Nature Genetics', 'Nature Neuroscience', 'Nature Structural & Molecular Biology', 'Nature Methods', 'Nature Cell Biology', 'Nature Immunology', 'Nature Reviews Drug Discovery', 'Nature Materials', 'Nature Physics', 'Nature Reviews Neuroscience', 'Nature Nanotechnology', 'Nature Reviews Genetics', 'Nature Reviews Urology', 'Nature Reviews Molecular Cell Biology', 'Nature Precedings', 'Nature Reviews Cancer', 'Nature Photonics', 'Nature Reviews Immunology', 'Nature Reviews Cardiology', 'Nature Reviews Gastroenterology & Hepatology', 'Nature Reviews Clinical Oncology', 'Nature Reviews Endocrinology', 'Nature Reviews Neurology', 'Nature Chemical Biology', 'Nature Reviews Microbiology', 'Nature Geoscience', 'Nature Reviews Rheumatology', 'Nature Climate Change', 'Nature Reviews Nephrology', 'Nature Chemistry', 'Nature Digest', 'Nature Protocols', 'Nature Middle East', 'Nature India', 'Nature China', 'Nature Plants', 'Nature Microbiology', 'Nature Ecology & Evolution', 'Nature Astronomy', 'Nature Energy', 'Nature Human Behaviour', 'AfCS-Nature Molecule Pages', 'Human Nature', 'Nature Reviews Disease Primers', 'Nature Biomedical Engineering', 'Nature Reports Stem Cells', 'Nature Reviews Materials', 'Nature Sustainability', 'Nature Catalysis', 'Nature Electronics', 'Nature Reviews Chemistry', 'Nature Metabolism', 'Nature Reviews Physics', 'Nature Machine Intelligence', 'NCI Nature Pathway Interaction Database', 'Nature Reports: Climate Change'] {allow-input: true}
start_year = 2015  #@param {type: "number"}
#@markdown ---

# PS
# To get titles from the API one can do this:
# > %dsldf search publications where journal.title~"Nature" and publisher="Springer Nature" return journal limit 100
# > ", ".join([f"'{x}'" for x in list(dsl_last_results.title)])
#

q_template = """search publications where
    journal.title="{}" and
    year>={}
    return publications[basics+altmetric+times_cited]"""
q = q_template.format(journal_title, start_year)
print("DSL Query:\n----\n", q, "\n----")
pubs = dsl.query_iterative(q, limit=500)

DSL Query:
----
 search publications where
    journal.title="Nature Genetics" and
    year>=2015
    return publications[basics+altmetric+times_cited]
----
500 / ...
500 / 1480
1000 / 1480
1480 / 1480
===
Records extracted: 1480
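As the progress log above shows, `query_iterative` pages through the results in fixed-size batches (here 500 at a time) until every record has been retrieved. Conceptually it works like the sketch below, where `fetch_page` is a hypothetical stand-in for a single API call:

```python
# Conceptual sketch of batched retrieval, as done by dsl.query_iterative.
# `fetch_page` is a hypothetical stand-in for one API round-trip.

def fetch_page(all_records, skip, limit):
    """Pretend API call: return one slice of the full result set."""
    return all_records[skip:skip + limit]

def query_iterative_sketch(all_records, limit=500):
    results = []
    while True:
        batch = fetch_page(all_records, skip=len(results), limit=limit)
        results.extend(batch)
        if batch:
            print(f"{len(results)} / {len(all_records)}")
        if len(batch) < limit:
            # a short (or empty) batch means we've reached the end
            break
    return results

records = query_iterative_sketch(list(range(1480)), limit=500)
```

The real method additionally formats the DSL query with `skip`/`limit` clauses and merges the JSON payloads, but the stopping condition is the same: keep fetching until a batch comes back smaller than the page size.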

Save the data as a CSV file in case we want to reuse it later

[3]:
dfpubs = pubs.as_dataframe()
save(dfpubs,"1_publications.csv")
# preview the publications
dfpubs.head(10)
[3]:
altmetric title year pages type times_cited author_affiliations id journal.id journal.title volume issue
0 1.0 Publisher Correction: Comprehensive analysis o... 2020 1-1 article 0 [[{'first_name': 'Isidro', 'last_name': 'Corté... pub.1127548497 jour.1103138 Nature Genetics NaN NaN
1 14.0 Uncoupling histone H3K4 trimethylation from de... 2020 1-11 article 0 [[{'first_name': 'Delphine', 'last_name': 'Dou... pub.1127504415 jour.1103138 Nature Genetics NaN NaN
2 38.0 Evaluating two different models of peanut’s or... 2020 1-3 article 1 [[{'first_name': 'David J.', 'last_name': 'Ber... pub.1127499311 jour.1103138 Nature Genetics NaN NaN
3 3.0 Reply to: Evaluating two different models of p... 2020 1-4 article 1 [[{'first_name': 'Weijian', 'last_name': 'Zhua... pub.1127503074 jour.1103138 Nature Genetics NaN NaN
4 195.0 Genetic identification of cell types underlyin... 2020 482-493 article 1 [[{'first_name': 'Julien', 'last_name': 'Bryoi... pub.1127148454 jour.1103138 Nature Genetics 52 5
5 193.0 Genome-wide association meta-analyses combinin... 2020 494-504 article 1 [[{'first_name': 'Maria Teresa', 'last_name': ... pub.1127145284 jour.1103138 Nature Genetics 52 5
6 10.0 Elevated sorbitol underlies a heritable neurop... 2020 469-470 article 0 [[{'first_name': 'Eva', 'last_name': 'Morava',... pub.1127348606 jour.1103138 Nature Genetics 52 5
7 12.0 Spt5-mediated enhancer transcription directly ... 2020 505-515 article 1 [[{'first_name': 'Johanna', 'last_name': 'Fitz... pub.1126151817 jour.1103138 Nature Genetics 52 5
8 241.0 Identifying genetic variants underlying phenot... 2020 534-540 article 2 [[{'first_name': 'Yoav', 'last_name': 'Voichek... pub.1126635876 jour.1103138 Nature Genetics 52 5
9 3.0 Enhancer–promoter interactions and transcription 2020 470-471 article 0 [[{'first_name': 'Douglas R.', 'last_name': 'H... pub.1127392559 jour.1103138 Nature Genetics 52 5
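Because every dataset is written to disk via the `save` helper, a later session can reload it with pandas instead of re-querying the API. A minimal round-trip sketch, using a throwaway frame rather than the real publications data:

```python
import os
import pandas as pd

FOLDER_NAME = "journal-profile-data"
os.makedirs(FOLDER_NAME, exist_ok=True)

# same pattern as the save() helper defined above
demo = pd.DataFrame({"id": ["pub.1", "pub.2"], "year": [2019, 2020]})
demo.to_csv(FOLDER_NAME + "/demo.csv", index=False)

# reload in a later session without hitting the API again
reloaded = pd.read_csv(FOLDER_NAME + "/demo.csv")
assert reloaded.equals(demo)
```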

Extract the authors data

[4]:
# preview the authors data
authors = pubs.as_dataframe_authors()
save(authors,"1_publications_authors.csv")
authors.head(10)
[4]:
first_name last_name initials corresponding orcid current_organization_id researcher_id affiliations is_bogus pub_id
0 Isidro Cortés-Ciriano [{'id': 'grid.38142.3c', 'name': 'Harvard Univ... NaN pub.1127548497
1 Jake June-Koo Lee [{'id': 'grid.38142.3c', 'name': 'Harvard Univ... NaN pub.1127548497
2 Ruibin Xi [{'id': 'grid.11135.37', 'name': 'Peking Unive... NaN pub.1127548497
3 Dhawal Jain [{'id': 'grid.38142.3c', 'name': 'Harvard Univ... NaN pub.1127548497
4 Youngsook L. Jung [{'id': 'grid.38142.3c', 'name': 'Harvard Univ... NaN pub.1127548497
5 Lixing Yang [{'name': 'Ben May Department for Cancer Resea... NaN pub.1127548497
6 Dmitry Gordenin [{'id': 'grid.280664.e', 'name': 'National Ins... NaN pub.1127548497
7 Leszek J. Klimczak [{'id': 'grid.280664.e', 'name': 'National Ins... NaN pub.1127548497
8 Cheng-Zhong Zhang [{'id': 'grid.38142.3c', 'name': 'Harvard Univ... NaN pub.1127548497
9 David S. Pellman [{'id': 'grid.65499.37', 'name': 'Dana-Farber ... NaN pub.1127548497

Extract the affiliations data

[5]:
affiliations = pubs.as_dataframe_authors_affiliations()
save(affiliations,"1_publications_affiliations.csv")
affiliations.head(10)
[5]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
0 grid.38142.3c Harvard University Cambridge 4.93197e+06 United States US Massachusetts US-MA pub.1127548497 Isidro Cortés-Ciriano
1 Ludwig Center at Harvard, Boston, MA, USA pub.1127548497 Isidro Cortés-Ciriano
2 grid.5335.0 University of Cambridge Cambridge 2.65394e+06 United Kingdom GB pub.1127548497 Isidro Cortés-Ciriano
3 grid.225360.0 European Bioinformatics Institute Cambridge 2.65394e+06 United Kingdom GB pub.1127548497 Isidro Cortés-Ciriano
4 grid.38142.3c Harvard University Cambridge 4.93197e+06 United States US Massachusetts US-MA pub.1127548497 Jake June-Koo Lee
5 Ludwig Center at Harvard, Boston, MA, USA pub.1127548497 Jake June-Koo Lee
6 grid.11135.37 Peking University Beijing 1.81667e+06 China CN pub.1127548497 Ruibin Xi
7 grid.38142.3c Harvard University Cambridge 4.93197e+06 United States US Massachusetts US-MA pub.1127548497 Dhawal Jain
8 grid.38142.3c Harvard University Cambridge 4.93197e+06 United States US Massachusetts US-MA pub.1127548497 Youngsook L. Jung
9 Ben May Department for Cancer Research, Univer... pub.1127548497 Lixing Yang

Some stats about authors

  • count how many authors in total

  • count how many authors have a researcher ID

  • count how many unique researchers IDs we have in total

[6]:
researchers = authors.query("researcher_id!=''")
#
df = pd.DataFrame({
    'measure' : ['Authors in total (non unique)', 'Authors with a researcher ID', 'Authors with a researcher ID (unique)'],
    'count' : [len(authors), len(researchers), researchers['researcher_id'].nunique()],
})
px.bar(df, x="measure", y="count", title=f"Author stats for {journal_title} (from {start_year})")
[7]:
# save the researchers data to a file
save(researchers, "1_authors_with_researchers_id.csv")

A quick look at authors without a Dimensions Researcher ID

We’re not going to try to disambiguate them here, but it’s still worth taking a quick look at them…

Looks like the most common surname is Wang, while the most common first name is an empty value

[8]:
authors_without_id = authors.query("researcher_id==''")
authors_without_id[['first_name', 'last_name']].describe()
[8]:
first_name last_name
count 9776 9776
unique 4280 4183
top
freq 560 408
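On string columns, `describe()` reports count, unique, top (the mode) and freq, which is why an empty first name can come out on top: every missing value collapses into the same empty string. A toy illustration:

```python
import pandas as pd

# a few author names, some with the first name missing (empty string)
names = pd.DataFrame({"first_name": ["", "", "Wei", "Anna"]})

stats = names["first_name"].describe()
# for object columns describe() returns count / unique / top / freq;
# the empty string is the most frequent value, so it is the 'top'
print(stats["top"], stats["freq"])
```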

The top ten ‘ambiguous’ surnames all appear to be Asian: a well-known problem in author-name disambiguation!

[9]:
authors_without_id['last_name'].value_counts()[:10]
[9]:
         408
Wang     183
Li       148
Zhang    148
Liu      128
Chen     108
Yang      71
Zhao      63
Lee       57
Kim       50
Name: last_name, dtype: int64

Any common patterns?

If we group the data by first name + surname, some interesting patterns emerge:

  • some entries are not actually persons (presumably the result of bad source data in Dimensions, e.g. from the publisher)

  • there are some apparently meaningful name+surname combinations with a lot of hits

  • relatively few Asian names appear among the top entries

[24]:
authors_without_id = authors_without_id.groupby(["first_name", "last_name"]).size().reset_index().rename(columns={0: "frequency"})
authors_without_id.sort_values("frequency", ascending=False, inplace=True)
authors_without_id.head(20)
[24]:
first_name last_name frequency
0 408
2427 Jaakko Tuomilehto 13
6090 Wei Zhao 12
2513 James G. Wilson 12
2529 James P. Cook 11
4570 Olle Melander 11
805 Brooke LaFlamme 10
3093 Jouke-Jan Hottenga 10
2772 Jie Huang 10
4885 Qiong Yang 10
1313 Daniela Toniolo 9
91 Aarno Palotie 9
3475 Lars Lind 9
3247 Kari Stefansson 9
322 Andre Franke 9
4406 Najaf Amin 9
6441 Ying Wu 9
409 Andrew P. Morris 9
5962 Tõnu Esko 9
4478 Nicholas G. Martin 9
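The groupby/size/rename chain used above is a generic frequency-count pattern in pandas; on a toy frame it looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Wei", "Wei", "Jie"],
    "last_name":  ["Zhao", "Zhao", "Huang"],
})

# count occurrences of each (first_name, last_name) pair:
# size() yields a Series named 0 after reset_index(), hence the rename
freq = (df.groupby(["first_name", "last_name"])
          .size().reset_index()
          .rename(columns={0: "frequency"})
          .sort_values("frequency", ascending=False))
```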

Creating an export for manual curation

For the next tasks, we will focus on the disambiguated authors as the Researcher ID links will let us carry out useful analyses.

Still, we can save the results for authors with missing IDs and attempt some manual disambiguation later. To this end, adding a simple Google-search URL for each name helps make sense of the data quickly.

[25]:
from dimcli.shortcuts import google_url

authors_without_id['search_url'] = authors_without_id.apply(lambda x: google_url(x['first_name'] + " " +x['last_name'] ), axis=1)

authors_without_id.head(20)
[25]:
first_name last_name frequency search_url
0 408 https://www.google.com/search?q=%20
2427 Jaakko Tuomilehto 13 https://www.google.com/search?q=Jaakko%20Tuomi...
6090 Wei Zhao 12 https://www.google.com/search?q=Wei%20Zhao
2513 James G. Wilson 12 https://www.google.com/search?q=James%20G.%20W...
2529 James P. Cook 11 https://www.google.com/search?q=James%20P.%20Cook
4570 Olle Melander 11 https://www.google.com/search?q=Olle%20Melander
805 Brooke LaFlamme 10 https://www.google.com/search?q=Brooke%20LaFlamme
3093 Jouke-Jan Hottenga 10 https://www.google.com/search?q=Jouke-Jan%20Ho...
2772 Jie Huang 10 https://www.google.com/search?q=Jie%20Huang
4885 Qiong Yang 10 https://www.google.com/search?q=Qiong%20Yang
1313 Daniela Toniolo 9 https://www.google.com/search?q=Daniela%20Toniolo
91 Aarno Palotie 9 https://www.google.com/search?q=Aarno%20Palotie
3475 Lars Lind 9 https://www.google.com/search?q=Lars%20Lind
3247 Kari Stefansson 9 https://www.google.com/search?q=Kari%20Stefansson
322 Andre Franke 9 https://www.google.com/search?q=Andre%20Franke
4406 Najaf Amin 9 https://www.google.com/search?q=Najaf%20Amin
6441 Ying Wu 9 https://www.google.com/search?q=Ying%20Wu
409 Andrew P. Morris 9 https://www.google.com/search?q=Andrew%20P.%20...
5962 Tõnu Esko 9 https://www.google.com/search?q=T%C3%B5nu%20Esko
4478 Nicholas G. Martin 9 https://www.google.com/search?q=Nicholas%20G.%...
[26]:
# save the data
save(authors_without_id, "1_authors_without_researchers_id.csv")
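If you are not using dimcli, the `google_url` helper is easy to approximate with the standard library. A stand-in sketch (the function name `google_search_url` is mine, not dimcli's) that percent-encodes names the same way as the URLs in the table above:

```python
from urllib.parse import quote

def google_search_url(text):
    """Build a Google search link, percent-encoding spaces and non-ASCII chars."""
    return "https://www.google.com/search?q=" + quote(text)

print(google_search_url("Jaakko Tuomilehto"))
# https://www.google.com/search?q=Jaakko%20Tuomilehto
```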

That’s it!

Now let’s go and open this in Google Sheets

[13]:
# for colab users: download everything
if 'google.colab' in sys.modules:
    from google.colab import auth
    auth.authenticate_user()

    import gspread
    from gspread_dataframe import get_as_dataframe, set_with_dataframe
    from oauth2client.client import GoogleCredentials

    gc = gspread.authorize(GoogleCredentials.get_application_default())

    title = 'Authors_without_IDs'
    sh = gc.create(title)
    worksheet = gc.open(title).sheet1
    set_with_dataframe(worksheet, authors_without_id)
    spreadsheet_url = "https://docs.google.com/spreadsheets/d/%s" % sh.id
    print(spreadsheet_url)


Note

The Dimensions Analytics API allows you to carry out sophisticated research-data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
