
Journal Profiling Part 5: Competing Journals Analysis

This Python notebook shows how to use the Dimensions Analytics API to create a competing-journals analysis report, starting from a specific journal and using its authors’ information.

This tutorial is the fifth in a series that uses the extracted data to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to

  • Load the researchers’ impact metrics data previously extracted (see parts 1, 2 and 3)

  • Get the full publications history for these researchers

  • Use this new publications dataset to determine which other journals these researchers publish in most frequently

  • Build some visualizations in order to have a quick overview of the results

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.shortcuts import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7.4.2)
Connected to: https://app.dimensions.ai - DSL v1.27
Method: dsl.ini file

Finally, let’s set up a folder to store the data we are going to extract:

[2]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
    os.mkdir(FOLDER_NAME)

def save(df,filename_dot_csv):
    df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)

Competing Journals

From our researchers master list, we now want to extract the following:

  • the full list of publications for an N-year period

  • the full list of journals, with a count of publications per journal

This new dataset will let us draw some conclusions about which journals compete with the one we selected at the beginning.

First, let’s reload the data obtained in the previous steps.

NOTE If you are using Google Colab or don’t have the data available, just do the following:

  • open the ‘Files’ panel in Google Colab and create a new folder called journal-profile-data

  • grab this file, unzip it, open the enclosed folder and upload the file called 2_researchers_impact_metrics.csv to Google Colab (via the ‘Upload’ menu or by dragging it into the panel window)

  • move the file into the journal-profile-data folder you just created
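
If you prefer to do the upload programmatically, here is a minimal sketch for Google Colab. It uses Colab’s files.upload() picker, so it assumes you have already downloaded and unzipped the archive on your local machine; the flow shown is just one way to do it.

if 'google.colab' in sys.modules:
    import shutil
    from google.colab import files
    # make sure the data folder exists
    os.makedirs(FOLDER_NAME, exist_ok=True)
    # opens a file picker: select 2_researchers_impact_metrics.csv from your machine
    uploaded = files.upload()
    for fname in uploaded:
        shutil.move(fname, os.path.join(FOLDER_NAME, fname))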

[3]:
#
researchers = pd.read_csv(FOLDER_NAME+"/2_researchers_impact_metrics.csv")
#
print("Total researchers:", len(researchers))
researchers.head(5)
Total researchers: 19565
[3]:
researcher_id pubs full_name citations_mean altmetric_mean last_pub_year url
0 ur.0723426172.10 63 Kari Stefansson 129.555556 229.603175 2020 https://app.dimensions.ai/discover/publication...
1 ur.01277776417.51 45 Unnur Thorsteinsdottir 93.088889 160.622222 2019 https://app.dimensions.ai/discover/publication...
2 ur.0641525362.39 35 Gonçalo R Abecasis 134.257143 138.514286 2020 https://app.dimensions.ai/discover/publication...
3 ur.01247426430.47 33 Gudmar Thorleifsson 94.121212 171.878788 2019 https://app.dimensions.ai/discover/publication...
4 ur.01317433110.75 33 Caroline Hayward 157.333333 296.212121 2020 https://app.dimensions.ai/discover/publication...
[4]:
# TIP to speed this up I'm taking only the top 2000 researchers!
# for a full analysis, just comment out the next line
researchers = researchers[:2000]

What the query looks like

The approach we’re taking consists of pulling all the publications data first, so that we can count journals as a second step.

This approach may take some time (as we’re potentially retrieving a lot of publications data), but it will lead to precise results.

The query template to use looks like this (for a couple of researchers only):

[5]:
%%dsldf
search publications where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"]
    and year >= 2015 and journal is not empty
    and journal.id != "jour.1103138"
return publications[id+journal] limit 10
Returned Publications: 10 (total = 161)
Time: 0.94s
[5]:
id journal.id journal.title
0 pub.1126893330 jour.1300829 Communications Biology
1 pub.1124191534 jour.1043282 Nature Communications
2 pub.1123951767 jour.1043282 Nature Communications
3 pub.1127766109 jour.1043282 Nature Communications
4 pub.1125690142 jour.1300829 Communications Biology
5 pub.1130504374 jour.1014075 New England Journal of Medicine
6 pub.1128712754 jour.1018957 Nature
7 pub.1130096425 jour.1040791 Circulation Genomic and Precision Medicine
8 pub.1130077601 jour.1014075 New England Journal of Medicine
9 pub.1128316340 jour.1091325 European Heart Journal
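
To make the ‘count journals as a second step’ idea concrete before scaling it up, here is a minimal sketch that runs the same sample query via dsl.query and tallies the journal titles with pandas (the two researcher IDs are the same ones used above):

sample = dsl.query("""search publications
    where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"]
    and year >= 2015 and journal is not empty
    and journal.id != "jour.1103138"
    return publications[id+journal] limit 1000""")
# count how often each journal appears in the sample
sample.as_dataframe()['journal.title'].value_counts().head(10)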

Extracting all publications/journals information

This part may take some time to run (depending on how many years back one wants to go), so you may want to get a coffee while you wait.

[6]:
#
journal_id = "jour.1103138" # = Nature Genetics
start_year = 2018

# our list of researchers
llist = list(researchers['researcher_id'])
#
# the query
q2 = """search publications
            where researchers.id in {}
            and year >= {} and journal is not empty and journal.id != "{}"
    return publications[id+journal+year]"""
[8]:

VERBOSE = True
RESEARCHER_ITERATOR_NO = 400

pubs = pd.DataFrame()
for chunk in progress(list(chunks_of(llist, RESEARCHER_ITERATOR_NO))):
    # get all pubs
    query = q2.format(json.dumps(chunk), start_year, journal_id)
    res = dsl.query_iterative(query, verbose=VERBOSE)
    if pubs.empty:
        # first time, init the dataframe
        pubs = res.as_dataframe()
    else:
        pubs = pubs.append(res.as_dataframe())
Starting iteration with limit=1000 skip=0 ...
0-1000 / 15127 (1.67s)
1000-2000 / 15127 (1.73s)
2000-3000 / 15127 (1.01s)
3000-4000 / 15127 (1.52s)
4000-5000 / 15127 (1.53s)
5000-6000 / 15127 (1.76s)
6000-7000 / 15127 (0.95s)
7000-8000 / 15127 (0.95s)
8000-9000 / 15127 (1.58s)
9000-10000 / 15127 (0.86s)
10000-11000 / 15127 (1.57s)
11000-12000 / 15127 (1.56s)
12000-13000 / 15127 (1.98s)
13000-14000 / 15127 (0.94s)
14000-15000 / 15127 (0.97s)
15000-15127 / 15127 (1.17s)
===
Records extracted: 15127
Starting iteration with limit=1000 skip=0 ...
0-1000 / 13961 (0.92s)
1000-2000 / 13961 (1.04s)
2000-3000 / 13961 (0.96s)
3000-4000 / 13961 (0.85s)
4000-5000 / 13961 (0.86s)
5000-6000 / 13961 (0.98s)
6000-7000 / 13961 (0.92s)
7000-8000 / 13961 (0.96s)
8000-9000 / 13961 (0.86s)
9000-10000 / 13961 (1.50s)
10000-11000 / 13961 (1.73s)
11000-12000 / 13961 (1.05s)
12000-13000 / 13961 (1.57s)
13000-13961 / 13961 (1.06s)
===
Records extracted: 13961
Starting iteration with limit=1000 skip=0 ...
0-1000 / 13285 (1.56s)
1000-2000 / 13285 (0.95s)
2000-3000 / 13285 (0.94s)
3000-4000 / 13285 (0.97s)
4000-5000 / 13285 (1.48s)
5000-6000 / 13285 (1.66s)
6000-7000 / 13285 (0.95s)
7000-8000 / 13285 (0.91s)
8000-9000 / 13285 (2.01s)
9000-10000 / 13285 (0.84s)
10000-11000 / 13285 (1.53s)
11000-12000 / 13285 (2.03s)
12000-13000 / 13285 (1.44s)
13000-13285 / 13285 (1.57s)
===
Records extracted: 13285
Starting iteration with limit=1000 skip=0 ...
0-1000 / 14285 (2.17s)
1000-2000 / 14285 (3.39s)
2000-3000 / 14285 (2.15s)
3000-4000 / 14285 (1.82s)
4000-5000 / 14285 (2.10s)
5000-6000 / 14285 (1.56s)
6000-7000 / 14285 (1.74s)
7000-8000 / 14285 (1.87s)
8000-9000 / 14285 (1.67s)
9000-10000 / 14285 (1.57s)
10000-11000 / 14285 (1.67s)
11000-12000 / 14285 (1.74s)
12000-13000 / 14285 (1.73s)
13000-14000 / 14285 (1.93s)
14000-14285 / 14285 (1.61s)
===
Records extracted: 14285
Starting iteration with limit=1000 skip=0 ...
0-1000 / 12741 (2.33s)
1000-2000 / 12741 (2.16s)
2000-3000 / 12741 (1.54s)
3000-4000 / 12741 (2.81s)
4000-5000 / 12741 (1.64s)
5000-6000 / 12741 (1.67s)
6000-7000 / 12741 (1.46s)
7000-8000 / 12741 (1.90s)
8000-9000 / 12741 (1.45s)
9000-10000 / 12741 (1.65s)
10000-11000 / 12741 (1.89s)
11000-12000 / 12741 (1.57s)
12000-12741 / 12741 (1.78s)
===
Records extracted: 12741
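
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the accumulation pattern above may fail on a recent pandas install. A sketch of an equivalent loop using pd.concat:

frames = []
for chunk in progress(list(chunks_of(llist, RESEARCHER_ITERATOR_NO))):
    query = q2.format(json.dumps(chunk), start_year, journal_id)
    res = dsl.query_iterative(query, verbose=VERBOSE)
    frames.append(res.as_dataframe())
# concatenate all the chunks into a single dataframe
pubs = pd.concat(frames, ignore_index=True)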

[9]:
# remove duplicate publications, if they have the same PUB_ID
pubs = pubs.drop_duplicates(subset="id")
# preview the data
pubs
[9]:
id year journal.id journal.title
0 pub.1124795380 2020 jour.1045059 European Journal of Psychotraumatology
1 pub.1130479619 2020 jour.1041075 Global Health Action
2 pub.1125404225 2020 jour.1045337 Scientific Reports
3 pub.1125109477 2020 jour.1023439 Genome Biology
4 pub.1127505679 2020 jour.1043282 Nature Communications
... ... ... ... ...
15122 pub.1100301829 2018 jour.1048149 European Thyroid Journal
15123 pub.1111681430 2018 jour.1011409 Human Heredity
15124 pub.1101241166 2018 jour.1049794 Oncology Research and Treatment
15125 pub.1101248219 2018 jour.1100504 Dementia and Geriatric Cognitive Disorders
15126 pub.1109951444 2018 jour.1276748 SSRN Electronic Journal

15111 rows × 4 columns

Now we can create a journals-only dataset that includes counts per year and a grand total.

[10]:
journals = pubs.copy()
# drop pub_id column
journals = journals.drop(['id'], axis=1)
#
# add total column
journals['total'] = journals.groupby('journal.id')['journal.id'].transform('count')
journals['total_year'] = journals.groupby(['journal.id', 'year'])['journal.id'].transform('count')
#
# remove multiple counts for same journal
journals = journals.drop_duplicates()
journals = journals.reset_index(drop=True)
#
# sort by total count
journals = journals.sort_values('total', ascending=False)
#
# save
save(journals, "5.competing_journals.csv" )
print("======\nDone")

#preview the data
journals.head(10)
======
Done
[10]:
year journal.id journal.title total total_year
9998 2018 jour.1293558 bioRxiv 1341 486
4226 2019 jour.1293558 bioRxiv 1341 583
409 2020 jour.1293558 bioRxiv 1341 272
4406 2019 jour.1101548 European Neuropsychopharmacology 480 465
1009 2020 jour.1101548 European Neuropsychopharmacology 480 7
11482 2018 jour.1101548 European Neuropsychopharmacology 480 8
4420 2019 jour.1043282 Nature Communications 353 136
10151 2018 jour.1043282 Nature Communications 353 123
4 2020 jour.1043282 Nature Communications 353 94
386 2020 jour.1369542 medRxiv 318 252
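
Note that each journal appears once per year in this table, with the grand total repeated on every row. If you just want one row per journal, ranked by the overall count, a quick way using the dataframe above is:

# one row per journal, ranked by overall publication count
journals.drop_duplicates(subset="journal.id")[["journal.title", "total"]].head(10)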

Visualizations

[11]:

threshold = 100
temp = journals.sort_values("total", ascending=False)[:threshold]

px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total' ],
       title=f"Top {threshold} competitors for {journal_id} (based on publications data from {start_year})")
[12]:
threshold = 200
temp = journals.sort_values("year", ascending=True).groupby("year").head(threshold)

px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       facet_row="year",
       height=900,
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total' ],
       title=f"Top {threshold} competitors for {journal_id} - segmented by year")

NOTE the European Neuropsychopharmacology journal has a massive jump in 2019 because it published a lot of conference proceedings that year! See also the journal’s Dimensions page for comparison.
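
If you want to check this directly from the data we extracted, a quick look at that journal’s yearly counts (its Dimensions ID, jour.1101548, appears in the table above) is enough:

# yearly publication counts for European Neuropsychopharmacology
journals[journals["journal.id"] == "jour.1101548"][["year", "total_year"]].sort_values("year")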



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
