Journal Profiling Part 2: Impact Metrics¶

This Python notebook shows how to use the Dimensions Analytics API to extract publications data for a specific journal, as well its authors and affiliations.

This tutorial is the second of a series that uses the data extracted in order to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to:

Load the researchers data previously extracted
Enrich it by building a dataset focusing on their impact in terms of no of papers, citations etc..
Visualize the results with plotly to have a quick overview of the results

[12]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites¶

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:

!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependecies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

[2]:

#
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
    os.mkdir(FOLDER_NAME)

def save(df,filename_dot_csv):
    df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)

Measuring the Impact of Researchers within a Journal¶

Goal: from the list of authors and publications we previously extracted, we want to create a new dataset focused on researchers with the following information: * Number of papers * Citations median * Altmetric Attention Score median * Last publication year

This data will allow to determine the ‘impact’ of a researcher within the journal.

Load the publications and authors data previously saved¶

NOTE If you are using Google Colab or don’t have the data available, just do the following: 1. open up the ‘Files’ panel in Google Colab and create a new folder journal-profile-data 2. grab this file, unzip it, open the enclosed folder and upload the files called 1_publications.csv and 1_publications_authors.csv to Google Colab (‘Upload’ menu or also by dragging then inside the panel window) 3. move the files inside the journal-profile-data folder you just created

[3]:

dfpubs = pd.read_csv(FOLDER_NAME+"/1_publications.csv")
authors = pd.read_csv(FOLDER_NAME+"/1_publications_authors.csv")

[4]:

# replace empty values with 0 so to allow bulk calculations
dfpubs = dfpubs.fillna(0)
authors = authors.fillna(0)

Isolate the Researchers data (= authors with an ID)¶

[5]:

researchers = authors.query("researcher_id!=0")
print("Researchers total:",  len(researchers))
researchers.head(10)

Researchers total: 45040

[5]:

	affiliations	corresponding	current_organization_id	first_name	last_name	orcid	raw_affiliation	researcher_id	pub_id
264	[{'city': 'Los Angeles', 'city_id': 5368361, '...	True	grid.19006.3e	Yi	Ding	['0000-0003-3595-2493']	['Bioinformatics Interdepartmental Program, Un...	ur.010112262235.93	pub.1144028502
265	[{'city': 'Los Angeles', 'city_id': 5368361, '...	True	grid.19006.3e	Kangcheng	Hou	['0000-0001-7110-5596']	['Bioinformatics Interdepartmental Program, Un...	ur.016361002743.43	pub.1144028502
266	[{'city': 'Los Angeles', 'city_id': 5368361, '...	0	grid.19006.3e	Kathryn S.	Burch	['0000-0001-9624-2108']	['Bioinformatics Interdepartmental Program, Un...	ur.016425610250.34	pub.1144028502
267	[{'city': 'Los Angeles', 'city_id': 5368361, '...	0	grid.19006.3e	Sandra	Lapinska	0	['Bioinformatics Interdepartmental Program, Un...	ur.012302603635.50	pub.1144028502
268	[{'city': 'Aarhus', 'city_id': 2624652, 'count...	0	grid.7048.b	Florian	Privé	0	['Department of Economics and Business Economi...	ur.013660120354.44	pub.1144028502
269	[{'city': 'Aarhus', 'city_id': 2624652, 'count...	0	grid.7048.b	Bjarni	Vilhjálmsson	['0000-0003-2277-9249']	['Department of Economics and Business Economi...	ur.0603337465.80	pub.1144028502
270	[{'city': 'Los Angeles', 'city_id': 5368361, '...	0	grid.19006.3e	Sriram	Sankararaman	0	['Bioinformatics Interdepartmental Program, Un...	ur.0575736217.50	pub.1144028502
271	[{'city': 'Los Angeles', 'city_id': 5368361, '...	True	grid.19006.3e	Bogdan	Pasaniuc	['0000-0002-0227-2056']	['Bioinformatics Interdepartmental Program, Un...	ur.0737513674.23	pub.1144028502
272	[{'city': 'Singapore', 'city_id': 1880252, 'co...	True	grid.414735.0	Emmanuelle	Szenker-Ravi	['0000-0003-4839-737X']	['Laboratory of Human Genetics and Therapeutic...	ur.016624747106.72	pub.1143833644
274	[{'city': 'Singapore', 'city_id': 1880252, 'co...	0	grid.4280.e	Muznah	Khatoo	0	['Laboratory of Human Genetics and Therapeutic...	ur.0771147362.34	pub.1143833644

Enrich the data with Impact Statistics¶

First, let’s pivot on the researcher ID field to eliminate duplicates and count them

[6]:

researchers_impact = researchers[['researcher_id', 'pub_id']].groupby('researcher_id', as_index=False).count().sort_values(by=['pub_id'], ascending=False).reset_index(drop=True)
researchers_impact.rename(columns={"pub_id": "pubs"}, inplace=True)
researchers_impact.head(10)

[6]:

	researcher_id	pubs
0	ur.0723426172.10	77
1	ur.01277776417.51	51
2	ur.0641525362.39	42
3	ur.011675454737.09	39
4	ur.01264737414.70	39
5	ur.012264440652.05	36
6	ur.01247426430.47	36
7	ur.01317433110.75	36
8	ur.0637651205.48	35
9	ur.01174076626.46	34

Second, for each researcher ID we can query all of his/her publications so to calculate the following metrics:

citations median
altmetric median
last publication year

Also, we add a new field with the Dimensions URL of the researcher, as it can be handy later on to open up its profile page online.

[7]:

def get_name_surname(researcher_id):
    """
    eg
    >>> get_name_surname("ur.0723426172.10")
    'Kari Stefansson'
    """
    q = "researcher_id=='%s'" % researcher_id
    x = researchers.query(q)['first_name'].value_counts().idxmax()
    y = researchers.query(q)['last_name'].value_counts().idxmax()
    return f"{x} {y}"


# def dimensions_url(researcher_id):
#     return f"https://app.dimensions.ai/discover/publication?and_facet_researcher={researcher_id}"

fullnames, citations, altmetric, last_year, urls = [], [], [], [], []

for i, row in tqdm(researchers_impact.iterrows(), total=researchers_impact.shape[0]):
    q = "researcher_id=='%s'" % row['researcher_id']
    pub_ids = list(researchers.query(q)['pub_id'])
    fullnames.append(get_name_surname(row['researcher_id']))
    citations.append(dfpubs[dfpubs['id'].isin(pub_ids)]['times_cited'].mean())
    altmetric.append(dfpubs[dfpubs['id'].isin(pub_ids)]['altmetric'].mean())
    last_year.append(dfpubs[dfpubs['id'].isin(pub_ids)]['year'].max())
    urls.append(dimensions_url(row['researcher_id']))

researchers_impact['full_name'] = fullnames
researchers_impact['citations_mean'] = citations
researchers_impact['altmetric_mean'] = altmetric
researchers_impact['last_pub_year'] = last_year
researchers_impact['url'] = urls
# finally..
print("Researchers total:",  len(researchers_impact))
researchers_impact.head(10)

Researchers total: 23346

[7]:

	researcher_id	pubs	full_name	citations_mean	altmetric_mean	last_pub_year	url
0	ur.0723426172.10	77	Kari Stefansson	201.493506	248.818182	2021	None
1	ur.01277776417.51	51	Unnur Thorsteinsdottir	131.862745	201.666667	2021	None
2	ur.0641525362.39	42	Gonçalo R. Abecasis	223.714286	167.809524	2021	None
3	ur.011675454737.09	39	Cornelia M van Duijn	216.102564	303.179487	2021	None
4	ur.01264737414.70	39	Tõnu Esko	244.564103	339.923077	2021	None
5	ur.012264440652.05	36	Jerome I. Rotter	176.555556	175.638889	2021	None
6	ur.01247426430.47	36	Gudmar Thorleifsson	142.805556	164.694444	2021	None
7	ur.01317433110.75	36	Caroline Hayward	256.166667	311.166667	2021	None
8	ur.0637651205.48	35	Daniel F Gudbjartsson	124.057143	211.142857	2021	None
9	ur.01174076626.46	34	André G Uitterlinden	228.058824	274.558824	2021	None

Save the data

[8]:

save(researchers_impact, "2_researchers_impact_metrics.csv")

Couple of Dataviz¶

[9]:

top100 = researchers_impact[:100]
px.scatter(top100,
           x="full_name", y="pubs",
           hover_name="full_name",
           hover_data=['citations_mean', 'altmetric_mean'],
           marginal_y="histogram",
           height=600,
           title="Researchers Impact - top 100")

[10]:

px.scatter(top100,
           x="citations_mean", y="altmetric_mean",
           hover_name="full_name",
           hover_data=['pubs', 'citations_mean', 'altmetric_mean'],
           color="pubs",
           size="pubs",
           height=600,
           title="Researchers Impact (citations vs pubs)")

[11]:

px.scatter(top100,
           x="citations_mean", y="altmetric_mean",
           hover_name="full_name",
           hover_data=['pubs', 'citations_mean', 'altmetric_mean'],
           color="pubs",
           size="pubs",
           facet_col="last_pub_year",
           height=600,
           title="Researchers Impact (citations vs pubs) by last publication year")

Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.