
Journal Profiling Part 2: Impact Metrics

This Python notebook shows how to use the Dimensions Analytics API to extract publications data for a specific journal, as well as its authors and affiliations.

This tutorial is the second of a series that uses the data extracted in order to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to:

  • Load the researchers data previously extracted

  • Enrich it by building a dataset focused on their impact in terms of number of papers, citations, etc.

  • Visualize the results with Plotly for a quick overview

[12]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
[2]:
#
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
    os.mkdir(FOLDER_NAME)

def save(df,filename_dot_csv):
    df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)

Measuring the Impact of Researchers within a Journal

Goal: from the list of authors and publications we previously extracted, we want to create a new dataset focused on researchers, with the following information:

  • Number of papers

  • Mean citations

  • Mean Altmetric Attention Score

  • Last publication year

This data will allow us to gauge the ‘impact’ of a researcher within the journal.

Load the publications and authors data previously saved

NOTE If you are using Google Colab or don’t have the data available, do the following:

  1. Open the ‘Files’ panel in Google Colab and create a new folder called journal-profile-data.

  2. Grab this file, unzip it, open the enclosed folder and upload the files 1_publications.csv and 1_publications_authors.csv to Google Colab (via the ‘Upload’ menu, or by dragging them into the panel window).

  3. Move the files into the journal-profile-data folder you just created.

[3]:
dfpubs = pd.read_csv(FOLDER_NAME+"/1_publications.csv")
authors = pd.read_csv(FOLDER_NAME+"/1_publications_authors.csv")
[4]:
# replace empty values with 0 to allow bulk calculations
dfpubs = dfpubs.fillna(0)
authors = authors.fillna(0)
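One side effect of the zero fill is worth keeping in mind: a publication with no recorded score then counts as a score of 0 and pulls averages down, whereas pandas skips missing values by default. A minimal sketch with made-up values (the toy DataFrame and its numbers are illustrative assumptions, not taken from the journal data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second publication has no Altmetric score.
toy_pubs = pd.DataFrame({
    "id": ["pub.1", "pub.2", "pub.3"],
    "altmetric": [10.0, np.nan, 20.0],
})

# Series.mean() skips NaN by default, so only two values are averaged.
mean_skipping_missing = toy_pubs["altmetric"].mean()

# After fillna(0) the missing score is treated as 0 and enters the average.
mean_with_zero_fill = toy_pubs.fillna(0)["altmetric"].mean()

print(mean_skipping_missing, mean_with_zero_fill)  # 15.0 10.0
```

Which behaviour is preferable depends on whether a missing score should be read as "no attention" or as "unknown".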

Isolate the Researchers data (= authors with an ID)

[5]:
researchers = authors.query("researcher_id!=0")
print("Researchers total:",  len(researchers))
researchers.head(10)
Researchers total: 45040
[5]:
affiliations corresponding current_organization_id first_name last_name orcid raw_affiliation researcher_id pub_id
264 [{'city': 'Los Angeles', 'city_id': 5368361, '... True grid.19006.3e Yi Ding ['0000-0003-3595-2493'] ['Bioinformatics Interdepartmental Program, Un... ur.010112262235.93 pub.1144028502
265 [{'city': 'Los Angeles', 'city_id': 5368361, '... True grid.19006.3e Kangcheng Hou ['0000-0001-7110-5596'] ['Bioinformatics Interdepartmental Program, Un... ur.016361002743.43 pub.1144028502
266 [{'city': 'Los Angeles', 'city_id': 5368361, '... 0 grid.19006.3e Kathryn S. Burch ['0000-0001-9624-2108'] ['Bioinformatics Interdepartmental Program, Un... ur.016425610250.34 pub.1144028502
267 [{'city': 'Los Angeles', 'city_id': 5368361, '... 0 grid.19006.3e Sandra Lapinska 0 ['Bioinformatics Interdepartmental Program, Un... ur.012302603635.50 pub.1144028502
268 [{'city': 'Aarhus', 'city_id': 2624652, 'count... 0 grid.7048.b Florian Privé 0 ['Department of Economics and Business Economi... ur.013660120354.44 pub.1144028502
269 [{'city': 'Aarhus', 'city_id': 2624652, 'count... 0 grid.7048.b Bjarni Vilhjálmsson ['0000-0003-2277-9249'] ['Department of Economics and Business Economi... ur.0603337465.80 pub.1144028502
270 [{'city': 'Los Angeles', 'city_id': 5368361, '... 0 grid.19006.3e Sriram Sankararaman 0 ['Bioinformatics Interdepartmental Program, Un... ur.0575736217.50 pub.1144028502
271 [{'city': 'Los Angeles', 'city_id': 5368361, '... True grid.19006.3e Bogdan Pasaniuc ['0000-0002-0227-2056'] ['Bioinformatics Interdepartmental Program, Un... ur.0737513674.23 pub.1144028502
272 [{'city': 'Singapore', 'city_id': 1880252, 'co... True grid.414735.0 Emmanuelle Szenker-Ravi ['0000-0003-4839-737X'] ['Laboratory of Human Genetics and Therapeutic... ur.016624747106.72 pub.1143833644
274 [{'city': 'Singapore', 'city_id': 1880252, 'co... 0 grid.4280.e Muznah Khatoo 0 ['Laboratory of Human Genetics and Therapeutic... ur.0771147362.34 pub.1143833644

Enrich the data with Impact Statistics

First, let’s group by the researcher ID field to collapse duplicate rows and count the publications of each researcher

[6]:
researchers_impact = researchers[['researcher_id', 'pub_id']].groupby('researcher_id', as_index=False).count().sort_values(by=['pub_id'], ascending=False).reset_index(drop=True)
researchers_impact.rename(columns={"pub_id": "pubs"}, inplace=True)
researchers_impact.head(10)
[6]:
researcher_id pubs
0 ur.0723426172.10 77
1 ur.01277776417.51 51
2 ur.0641525362.39 42
3 ur.011675454737.09 39
4 ur.01264737414.70 39
5 ur.012264440652.05 36
6 ur.01247426430.47 36
7 ur.01317433110.75 36
8 ur.0637651205.48 35
9 ur.01174076626.46 34
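To see what this grouping step does, here is the same chain applied to a tiny made-up table. The identifiers `ur.A`, `pub.1` and so on are hypothetical placeholders, not real Dimensions IDs:

```python
import pandas as pd

# Hypothetical author rows: one row per (researcher, publication) pair.
toy_authors = pd.DataFrame({
    "researcher_id": ["ur.A", "ur.B", "ur.A", "ur.C", "ur.A", "ur.B"],
    "pub_id": ["pub.1", "pub.1", "pub.2", "pub.2", "pub.3", "pub.3"],
})

# Same chain as in the cell above: group, count rows, sort, rename.
toy_counts = (
    toy_authors[["researcher_id", "pub_id"]]
    .groupby("researcher_id", as_index=False)
    .count()
    .sort_values(by=["pub_id"], ascending=False)
    .reset_index(drop=True)
    .rename(columns={"pub_id": "pubs"})
)

print(toy_counts)
```

Note that `count()` counts rows, so a researcher listed twice on the same publication would be counted twice; if that can happen in your data, `nunique()` may be the safer aggregation.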

Second, for each researcher ID we can look up all of their publications in order to calculate the following metrics:

  • citations mean

  • altmetric mean

  • last publication year

Also, we add a new field with the Dimensions URL of the researcher, as it can be handy later on to open up their profile page online.
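As a rough illustration of the URL field, a link to a researcher's publications can also be built by hand. The sketch below uses the URL pattern that appears in the commented-out helper in the next cell; the function name `researcher_publications_url` is a hypothetical one chosen for this example, and the pattern may change on the Dimensions side:

```python
def researcher_publications_url(researcher_id: str) -> str:
    """Build a Dimensions web-app link listing a researcher's publications.

    Assumption: the URL pattern is the one from the commented-out helper
    in this notebook; it is not guaranteed to remain stable.
    """
    base = "https://app.dimensions.ai/discover/publication"
    return f"{base}?and_facet_researcher={researcher_id}"


example_url = researcher_publications_url("ur.0723426172.10")
print(example_url)
```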

[7]:

def get_name_surname(researcher_id):
    """
    eg
    >>> get_name_surname("ur.0723426172.10")
    'Kari Stefansson'
    """
    q = "researcher_id=='%s'" % researcher_id
    x = researchers.query(q)['first_name'].value_counts().idxmax()
    y = researchers.query(q)['last_name'].value_counts().idxmax()
    return f"{x} {y}"


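# NOTE: dimensions_url() used in the loop below is the helper imported via
# `from dimcli.utils import *`; in the run shown here it returned None.
# The commented-out function is a local alternative that builds the URL by hand.
# def dimensions_url(researcher_id):
#     return f"https://app.dimensions.ai/discover/publication?and_facet_researcher={researcher_id}"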

fullnames, citations, altmetric, last_year, urls = [], [], [], [], []

for i, row in tqdm(researchers_impact.iterrows(), total=researchers_impact.shape[0]):
    q = "researcher_id=='%s'" % row['researcher_id']
    pub_ids = list(researchers.query(q)['pub_id'])
    fullnames.append(get_name_surname(row['researcher_id']))
    citations.append(dfpubs[dfpubs['id'].isin(pub_ids)]['times_cited'].mean())
    altmetric.append(dfpubs[dfpubs['id'].isin(pub_ids)]['altmetric'].mean())
    last_year.append(dfpubs[dfpubs['id'].isin(pub_ids)]['year'].max())
    urls.append(dimensions_url(row['researcher_id']))

researchers_impact['full_name'] = fullnames
researchers_impact['citations_mean'] = citations
researchers_impact['altmetric_mean'] = altmetric
researchers_impact['last_pub_year'] = last_year
researchers_impact['url'] = urls
# finally..
print("Researchers total:",  len(researchers_impact))
researchers_impact.head(10)
Researchers total: 23346
[7]:
researcher_id pubs full_name citations_mean altmetric_mean last_pub_year url
0 ur.0723426172.10 77 Kari Stefansson 201.493506 248.818182 2021 None
1 ur.01277776417.51 51 Unnur Thorsteinsdottir 131.862745 201.666667 2021 None
2 ur.0641525362.39 42 Gonçalo R. Abecasis 223.714286 167.809524 2021 None
3 ur.011675454737.09 39 Cornelia M van Duijn 216.102564 303.179487 2021 None
4 ur.01264737414.70 39 Tõnu Esko 244.564103 339.923077 2021 None
5 ur.012264440652.05 36 Jerome I. Rotter 176.555556 175.638889 2021 None
6 ur.01247426430.47 36 Gudmar Thorleifsson 142.805556 164.694444 2021 None
7 ur.01317433110.75 36 Caroline Hayward 256.166667 311.166667 2021 None
8 ur.0637651205.48 35 Daniel F Gudbjartsson 124.057143 211.142857 2021 None
9 ur.01174076626.46 34 André G Uitterlinden 228.058824 274.558824 2021 None
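The row-by-row loop above is easy to follow but can be slow with tens of thousands of researchers. A possible alternative, sketched here on made-up data, is to merge the two tables once and aggregate per researcher. The toy frames and their values are assumptions for illustration; the column names mirror those used in this notebook:

```python
import pandas as pd

# Hypothetical stand-ins for the `researchers` and `dfpubs` DataFrames.
toy_researchers = pd.DataFrame({
    "researcher_id": ["ur.A", "ur.A", "ur.B"],
    "pub_id": ["pub.1", "pub.2", "pub.2"],
})
toy_pubs = pd.DataFrame({
    "id": ["pub.1", "pub.2"],
    "times_cited": [10, 30],
    "altmetric": [4.0, 8.0],
    "year": [2019, 2021],
})

# Attach the publication metrics to every (researcher, publication) row.
merged = toy_researchers.merge(toy_pubs, left_on="pub_id", right_on="id", how="left")

# One pass over the merged table computes all per-researcher metrics.
toy_impact = (
    merged.groupby("researcher_id", as_index=False)
    .agg(
        pubs=("pub_id", "count"),
        citations_mean=("times_cited", "mean"),
        altmetric_mean=("altmetric", "mean"),
        last_pub_year=("year", "max"),
    )
    .sort_values("pubs", ascending=False)
    .reset_index(drop=True)
)

print(toy_impact)
```

This should give the same figures as the loop as long as each researcher/publication pair appears only once; the loop's `isin` filter ignores repeated pairs, while the merge would count them again.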

Save the data

[8]:
save(researchers_impact, "2_researchers_impact_metrics.csv")

A Couple of Visualizations

[9]:
top100 = researchers_impact[:100]
px.scatter(top100,
           x="full_name", y="pubs",
           hover_name="full_name",
           hover_data=['citations_mean', 'altmetric_mean'],
           marginal_y="histogram",
           height=600,
           title="Researchers Impact - top 100")
[10]:
px.scatter(top100,
           x="citations_mean", y="altmetric_mean",
           hover_name="full_name",
           hover_data=['pubs', 'citations_mean', 'altmetric_mean'],
           color="pubs",
           size="pubs",
           height=600,
           title="Researchers Impact (citations vs altmetric)")
[11]:
px.scatter(top100,
           x="citations_mean", y="altmetric_mean",
           hover_name="full_name",
           hover_data=['pubs', 'citations_mean', 'altmetric_mean'],
           color="pubs",
           size="pubs",
           facet_col="last_pub_year",
           height=600,
           title="Researchers Impact (citations vs altmetric) by last publication year")


Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
