
Journal Profiling Part 5: Competing Journals Analysis

This Python notebook shows how to use the Dimensions Analytics API to create a competing journals analysis report, starting from a specific journal and using its authors’ information.

This tutorial is the fifth in a series that uses the extracted data to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to

  • Load the researchers’ impact metrics data previously extracted (see parts 1, 2 and 3)

  • Get the full publications history for these researchers

  • Use this new publications dataset to identify the journals these researchers have also published in most frequently

  • Build some visualizations in order to have a quick overview of the results

[12]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Finally, let’s set up a folder to store the data we are going to extract:

[2]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
    os.mkdir(FOLDER_NAME)

def save(df,filename_dot_csv):
    df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)

Competing Journals

From our researchers master list, we now want to extract the following:

  • the full list of publications for an N-year period

  • the full list of journals, with a count of publications per journal

This new dataset will let us draw some conclusions about which journals compete with the one we selected at the beginning.

First, let’s reload the data obtained in the previous steps.

NOTE If you are using Google Colab or don’t have the data available, do the following:

  • open the ‘Files’ panel in Google Colab and create a new folder journal-profile-data

  • grab this file, unzip it, open the enclosed folder and upload the file called 2_researchers_impact_metrics.csv to Google Colab (via the ‘Upload’ menu, or by dragging it inside the panel window)

  • move the file inside the journal-profile-data folder you just created
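Alternatively, on Google Colab the upload can be scripted with Colab’s built-in files.upload helper. This is just a minimal sketch (files.upload opens a file picker and saves the selected files in the current working directory):

# optional sketch: scripted upload on Google Colab
if 'google.colab' in sys.modules:
    from google.colab import files
    uploaded = files.upload()  # opens a file picker; files land in the current directory
    for fname in uploaded:
        os.rename(fname, FOLDER_NAME + "/" + fname)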

[3]:
#
researchers = pd.read_csv(FOLDER_NAME+"/2_researchers_impact_metrics.csv")
#
print("Total researchers:", len(researchers))
researchers.head(5)
Total researchers: 23346
[3]:
researcher_id pubs full_name citations_mean altmetric_mean last_pub_year url
0 ur.0723426172.10 77 Kari Stefansson 201.493506 248.818182 2021 NaN
1 ur.01277776417.51 51 Unnur Thorsteinsdottir 131.862745 201.666667 2021 NaN
2 ur.0641525362.39 42 Gonçalo R. Abecasis 223.714286 167.809524 2021 NaN
3 ur.011675454737.09 39 Cornelia M van Duijn 216.102564 303.179487 2021 NaN
4 ur.01264737414.70 39 Tõnu Esko 244.564103 339.923077 2021 NaN
[4]:
# TIP: to speed this up, I'm taking only the top 2000 researchers!
# for a full analysis, just comment out the next line
researchers = researchers[:2000]

What the query looks like

The approach we’re taking consists of pulling all the publications data first, so that we can count journals as a second step.

This may take some time (we’re potentially retrieving a lot of publication records), but it leads to precise results.
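As an aside, a quicker but coarser alternative would be to let the API aggregate counts server-side via the DSL journal facet, instead of downloading every publication record. A minimal sketch (assuming the journal facet; not used in this tutorial, because facet results are capped and counts from separate researcher chunks cannot simply be summed: a publication whose co-authors fall into different chunks would be counted twice):

# sketch only, not the approach used below: server-side aggregation via the journal facet
q_facet = """search publications
                where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"]
                and year >= 2015 and journal is not empty
                and journal.id != "jour.1103138"
        return journal limit 20"""
# dsl.query(q_facet).as_dataframe()  # one row per journal, with a pre-aggregated count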

The query template to use looks like this (for a couple of researchers only):

[5]:
%%dsldf
search publications where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"]
    and year >= 2015 and journal is not empty
    and journal.id != "jour.1103138"
return publications[id+journal] limit 10
Returned Publications: 10 (total = 209)
Time: 0.62s
[5]:
id journal.id journal.title
0 pub.1144294896 jour.1044693 BMJ Open
1 pub.1144045413 jour.1100952 Arteriosclerosis Thrombosis and Vascular Biology
2 pub.1144018365 jour.1369542 medRxiv
3 pub.1143745202 jour.1018957 Nature
4 pub.1143816650 jour.1293558 bioRxiv
5 pub.1143538243 jour.1041454 Science Translational Medicine
6 pub.1142697518 jour.1293558 bioRxiv
7 pub.1142590280 jour.1045682 Blood Cancer Journal
8 pub.1141693039 jour.1300829 Communications Biology
9 pub.1138032051 jour.1399822 Arthritis & Rheumatology

Extracting all publications/journals information

This part may take some time to run (depending on how many years back one wants to go), so you may want to get a coffee while you wait...

[6]:
#
journal_id = "jour.1103138" # = Nature Genetics
start_year = 2018

# our list of researchers
llist = list(researchers['researcher_id'])
#
# the query
q2 = """search publications
            where researchers.id in {}
            and year >= {} and journal is not empty and journal.id != "{}"
    return publications[id+journal+year]"""
[7]:

VERBOSE = True
RESEARCHER_ITERATOR_NO = 400

pubs = pd.DataFrame()
for chunk in progress(list(chunks_of(llist, RESEARCHER_ITERATOR_NO))):
    # get all pubs for this chunk of researchers
    query = q2.format(json.dumps(chunk), start_year, journal_id)
    res = dsl.query_iterative(query, verbose=VERBOSE)
    if pubs.empty:
        # first chunk: init the dataframe
        pubs = res.as_dataframe()
    else:
        # append the new results (the returned copy must be reassigned)
        pubs = pd.concat([pubs, res.as_dataframe()], ignore_index=True)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 23456 (2.23s)
1000-2000 / 23456 (2.17s)
2000-3000 / 23456 (2.36s)
3000-4000 / 23456 (1.87s)
4000-5000 / 23456 (1.93s)
5000-6000 / 23456 (1.86s)
6000-7000 / 23456 (1.78s)
7000-8000 / 23456 (2.04s)
8000-9000 / 23456 (1.91s)
9000-10000 / 23456 (1.89s)
10000-11000 / 23456 (1.86s)
11000-12000 / 23456 (1.76s)
12000-13000 / 23456 (1.79s)
13000-14000 / 23456 (1.69s)
14000-15000 / 23456 (1.82s)
15000-16000 / 23456 (1.87s)
16000-17000 / 23456 (1.77s)
17000-18000 / 23456 (1.73s)
18000-19000 / 23456 (1.94s)
19000-20000 / 23456 (3.05s)
20000-21000 / 23456 (2.24s)
21000-22000 / 23456 (1.99s)
22000-23000 / 23456 (1.90s)
23000-23456 / 23456 (1.63s)
===
Records extracted: 23456
Starting iteration with limit=1000 skip=0 ...
0-1000 / 21078 (1.98s)
1000-2000 / 21078 (1.98s)
2000-3000 / 21078 (1.99s)
3000-4000 / 21078 (1.75s)
4000-5000 / 21078 (1.81s)
5000-6000 / 21078 (1.85s)
6000-7000 / 21078 (1.84s)
7000-8000 / 21078 (1.95s)
8000-9000 / 21078 (1.77s)
9000-10000 / 21078 (1.85s)
10000-11000 / 21078 (1.86s)
11000-12000 / 21078 (1.83s)
12000-13000 / 21078 (1.75s)
13000-14000 / 21078 (1.80s)
14000-15000 / 21078 (1.86s)
15000-16000 / 21078 (1.66s)
16000-17000 / 21078 (1.77s)
17000-18000 / 21078 (1.79s)
18000-19000 / 21078 (1.80s)
19000-20000 / 21078 (1.78s)
20000-21000 / 21078 (1.73s)
21000-21078 / 21078 (1.77s)
===
Records extracted: 21078
Starting iteration with limit=1000 skip=0 ...
0-1000 / 22729 (1.88s)
1000-2000 / 22729 (1.86s)
2000-3000 / 22729 (2.30s)
3000-4000 / 22729 (1.94s)
4000-5000 / 22729 (1.83s)
5000-6000 / 22729 (1.87s)
6000-7000 / 22729 (1.91s)
7000-8000 / 22729 (1.89s)
8000-9000 / 22729 (2.05s)
9000-10000 / 22729 (1.76s)
10000-11000 / 22729 (1.89s)
11000-12000 / 22729 (1.84s)
12000-13000 / 22729 (1.86s)
13000-14000 / 22729 (1.92s)
14000-15000 / 22729 (1.85s)
15000-16000 / 22729 (1.83s)
16000-17000 / 22729 (1.86s)
17000-18000 / 22729 (1.86s)
18000-19000 / 22729 (1.84s)
19000-20000 / 22729 (2.17s)
20000-21000 / 22729 (1.82s)
21000-22000 / 22729 (1.99s)
22000-22729 / 22729 (1.75s)
===
Records extracted: 22729
Starting iteration with limit=1000 skip=0 ...
0-1000 / 19177 (1.87s)
1000-2000 / 19177 (1.86s)
2000-3000 / 19177 (1.92s)
3000-4000 / 19177 (1.62s)
4000-5000 / 19177 (1.62s)
5000-6000 / 19177 (1.64s)
6000-7000 / 19177 (1.68s)
7000-8000 / 19177 (1.68s)
8000-9000 / 19177 (1.80s)
9000-10000 / 19177 (1.60s)
10000-11000 / 19177 (1.58s)
11000-12000 / 19177 (1.70s)
12000-13000 / 19177 (1.81s)
13000-14000 / 19177 (1.73s)
14000-15000 / 19177 (1.68s)
15000-16000 / 19177 (1.77s)
16000-17000 / 19177 (1.77s)
17000-18000 / 19177 (1.69s)
18000-19000 / 19177 (1.73s)
19000-19177 / 19177 (1.52s)
===
Records extracted: 19177
Starting iteration with limit=1000 skip=0 ...
0-1000 / 19307 (1.81s)
1000-2000 / 19307 (1.99s)
2000-3000 / 19307 (1.90s)
3000-4000 / 19307 (1.66s)
4000-5000 / 19307 (1.68s)
5000-6000 / 19307 (1.65s)
6000-7000 / 19307 (1.60s)
7000-8000 / 19307 (1.65s)
8000-9000 / 19307 (1.77s)
9000-10000 / 19307 (1.80s)
10000-11000 / 19307 (1.72s)
11000-12000 / 19307 (1.62s)
12000-13000 / 19307 (1.88s)
13000-14000 / 19307 (1.72s)
14000-15000 / 19307 (1.70s)
15000-16000 / 19307 (1.72s)
16000-17000 / 19307 (1.65s)
17000-18000 / 19307 (1.63s)
18000-19000 / 19307 (1.71s)
19000-19307 / 19307 (1.62s)
===
Records extracted: 19307
[8]:
# remove duplicate publications: the same publication can be returned by more
# than one researcher chunk (e.g. when co-authors fall into different chunks)
pubs = pubs.drop_duplicates(subset="id")
# preview the data
pubs
[8]:
id year journal.id journal.title
0 pub.1144058594 2022 jour.1088601 Maturitas
1 pub.1144115098 2022 jour.1098341 Journal of Food Composition and Analysis
2 pub.1144111238 2022 jour.1400578 Ophthalmology Science
3 pub.1142542983 2022 jour.1088601 Maturitas
4 pub.1142207532 2022 jour.1034064 Food Quality and Preference
... ... ... ... ...
23451 pub.1101137729 2018 jour.1119070 Journal of Alzheimer's Disease
23452 pub.1101137719 2018 jour.1119070 Journal of Alzheimer's Disease
23453 pub.1100924477 2018 jour.1018190 Nederlands Tijdschrift voor Geneeskunde
23454 pub.1100522151 2018 jour.1046789 Food and Nutrition Sciences
23455 pub.1084865232 2018 jour.1101614 European Psychiatry

23456 rows × 4 columns

Now we can create a journals-only dataset that includes counts per year, and a grand total.

[9]:
journals = pubs.copy()
# drop the pub_id column
journals = journals.drop(['id'], axis=1)
#
# add total columns: per journal overall, and per journal per year
journals['total'] = journals.groupby('journal.id')['journal.id'].transform('count')
journals['total_year'] = journals.groupby(['journal.id', 'year'])['journal.id'].transform('count')
#
# remove repeated rows for the same journal/year
journals = journals.drop_duplicates()
journals = journals.reset_index(drop=True)
#
# sort by total count
journals = journals.sort_values('total', ascending=False)
#
# save
save(journals, "5.competing_journals.csv")
print("======\nDone")

# preview the data
journals.head(10)
======
Done
[9]:
year journal.id journal.title total total_year
65 2021 jour.1293558 bioRxiv 1811 320
18024 2018 jour.1293558 bioRxiv 1811 497
5980 2020 jour.1293558 bioRxiv 1811 403
12009 2019 jour.1293558 bioRxiv 1811 591
11999 2019 jour.1369542 medRxiv 994 65
5978 2020 jour.1369542 medRxiv 994 442
18 2021 jour.1369542 medRxiv 994 487
131 2021 jour.1043282 Nature Communications 566 132
6062 2020 jour.1043282 Nature Communications 566 166
18117 2018 jour.1043282 Nature Communications 566 130

Visualizations

[10]:

threshold = 100
temp = journals.sort_values("total", ascending=False)[:threshold]

px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total'],
       title=f"Top {threshold} competitors for {journal_id} (based on publications data from {start_year})")
[11]:
threshold = 200
# take the top journals (by total count) for each year
temp = journals.sort_values("total", ascending=False).groupby("year").head(threshold)

px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       facet_row="year",
       height=900,
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total'],
       title=f"Top {threshold} competitors for {journal_id} - segmented by year")

NOTE The European Neuropsychopharmacology journal shows a massive jump in 2019 because it published a lot of conference proceedings that year! See also the journal’s Dimensions page for comparison.
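This can be double-checked directly on the journals dataframe we built above, e.g.:

# quick sanity check: per-year publication counts for one journal
journals[journals["journal.title"] == "European Neuropsychopharmacology"][["year", "total_year"]].drop_duplicates().sort_values("year")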



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated GitHub repository for examples, the source code of these tutorials and much more.
