Journal Profiling Part 5: Competing Journals Analysis¶
This Python notebook shows how to use the Dimensions Analytics API to create a competing journals analysis report, starting from a specific journal and using its authors' information.
This tutorial is the fifth in a series that uses the extracted data to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.
In this notebook we are going to
Load the researchers impact metrics data previously extracted (see parts 1-2-3)
Get the full publications history for these researchers
Use this new publications dataset to determine which other journals these researchers publish in most frequently
Build some visualizations in order to have a quick overview of the results
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.
[1]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if not 'google.colab' in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.8.2)
Connected to: https://app.dimensions.ai - DSL v1.28
Method: dsl.ini file
Finally, let’s set up a folder to store the data we are going to extract:
[2]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not os.path.exists(FOLDER_NAME):
    os.mkdir(FOLDER_NAME)

def save(df, filename_dot_csv):
    df.to_csv(FOLDER_NAME + "/" + filename_dot_csv, index=False)
Competing Journals¶
From our researchers master list, we now want to extract the following:
the full list of publications for an N-year period
the full list of journals, with counts of how many publications appeared in each journal
This new dataset will let us draw some conclusions about which journals compete with the one we selected at the beginning.
First let’s reload the data obtained in previous steps¶
NOTE If you are using Google Colab or don’t have the data available, just do the following:
* open up the ‘Files’ panel in Google Colab and create a new folder journal-profile-data
* grab this file, unzip it, open the enclosed folder and upload the file called 2_researchers_impact_metrics.csv to Google Colab (‘Upload’ menu, or by dragging it inside the panel window)
* move the file inside the journal-profile-data folder you just created
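Alternatively, the same steps can be done programmatically. A minimal sketch, assuming you are running in Colab and have already downloaded and unzipped the file locally (the upload dialog lets you pick it):
# a sketch for Colab users: upload the CSV via the file picker and move it into the data folder
import shutil
from google.colab import files   # only available inside Google Colab

uploaded = files.upload()        # pick 2_researchers_impact_metrics.csv in the dialog
for fname in uploaded:
    shutil.move(fname, "journal-profile-data/" + fname)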
[3]:
#
researchers = pd.read_csv(FOLDER_NAME+"/2_researchers_impact_metrics.csv")
#
print("Total researchers:", len(researchers))
researchers.head(5)
Total researchers: 19565
[3]:
 | researcher_id | pubs | full_name | citations_mean | altmetric_mean | last_pub_year | url
---|---|---|---|---|---|---|---
0 | ur.0723426172.10 | 63 | Kari Stefansson | 129.555556 | 229.603175 | 2020 | https://app.dimensions.ai/discover/publication... |
1 | ur.01277776417.51 | 45 | Unnur Thorsteinsdottir | 93.088889 | 160.622222 | 2019 | https://app.dimensions.ai/discover/publication... |
2 | ur.0641525362.39 | 35 | Gonçalo R Abecasis | 134.257143 | 138.514286 | 2020 | https://app.dimensions.ai/discover/publication... |
3 | ur.01247426430.47 | 33 | Gudmar Thorleifsson | 94.121212 | 171.878788 | 2019 | https://app.dimensions.ai/discover/publication... |
4 | ur.01317433110.75 | 33 | Caroline Hayward | 157.333333 | 296.212121 | 2020 | https://app.dimensions.ai/discover/publication... |
[4]:
# TIP to speed this up I'm taking only the top 2000 researchers!
# for a full analysis, just comment out the next line
researchers = researchers[:2000]
What the query looks like¶
The approach we’re taking consists of pulling all the publications data first, so that we can count journals as a second step.
This approach may take some time (as we’re potentially retrieving a lot of publications data), but it will lead to precise results.
The query template to use looks like this (for a couple of researchers only):
[5]:
%%dsldf
search publications where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"]
and year >= 2015 and journal is not empty
and journal.id != "jour.1103138"
return publications[id+journal] limit 10
Returned Publications: 10 (total = 161)
Time: 0.94s
[5]:
 | id | journal.id | journal.title
---|---|---|---
0 | pub.1126893330 | jour.1300829 | Communications Biology |
1 | pub.1124191534 | jour.1043282 | Nature Communications |
2 | pub.1123951767 | jour.1043282 | Nature Communications |
3 | pub.1127766109 | jour.1043282 | Nature Communications |
4 | pub.1125690142 | jour.1300829 | Communications Biology |
5 | pub.1130504374 | jour.1014075 | New England Journal of Medicine |
6 | pub.1128712754 | jour.1018957 | Nature |
7 | pub.1130096425 | jour.1040791 | Circulation Genomic and Precision Medicine |
8 | pub.1130077601 | jour.1014075 | New England Journal of Medicine |
9 | pub.1128316340 | jour.1091325 | European Heart Journal |
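For reference, the %%dsldf cell magic above is just a shortcut: the same query can be run programmatically with the dsl object we created at login, which is the pattern used for the full extraction below. A minimal sketch:
# same query as above, run via the Dsl object instead of the %%dsldf magic
sample_q = """search publications where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"]
    and year >= 2015 and journal is not empty
    and journal.id != "jour.1103138"
return publications[id+journal] limit 10"""

sample_df = dsl.query(sample_q).as_dataframe()   # dimcli result -> pandas dataframe
sample_df.head()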
Extracting all publications/journals information¶
This part may take some time to run (depending on how many years back you want to go), so you may want to get a coffee while you wait.
[6]:
#
journal_id = "jour.1103138" # = Nature Genetics
start_year = 2018
# our list of researchers
llist = list(researchers['researcher_id'])
#
# the query
q2 = """search publications
where researchers.id in {}
and year >= {} and journal is not empty and journal.id != "{}"
return publications[id+journal+year]"""
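In the next cell we split the researchers list into batches with dimcli's chunks_of helper, so that each query keeps the in filter to a manageable size. A quick illustration of what the helper does, assuming it behaves as used below (i.e. it yields consecutive slices of a given length):
# chunks_of splits a list into consecutive batches of the given size
print(list(chunks_of(list(range(10)), 4)))
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]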
[8]:
VERBOSE = True
RESEARCHER_ITERATOR_NO = 400

pubs = pd.DataFrame()
for chunk in progress(list(chunks_of(llist, RESEARCHER_ITERATOR_NO))):
    # get all pubs for this batch of researchers
    query = q2.format(json.dumps(chunk), start_year, journal_id)
    res = dsl.query_iterative(query, verbose=VERBOSE)
    if pubs.empty:
        # first batch: init the dataframe
        pubs = res.as_dataframe()
    else:
        # append the new results to the dataframe
        pubs = pd.concat([pubs, res.as_dataframe()], ignore_index=True)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 15127 (1.67s)
1000-2000 / 15127 (1.73s)
2000-3000 / 15127 (1.01s)
3000-4000 / 15127 (1.52s)
4000-5000 / 15127 (1.53s)
5000-6000 / 15127 (1.76s)
6000-7000 / 15127 (0.95s)
7000-8000 / 15127 (0.95s)
8000-9000 / 15127 (1.58s)
9000-10000 / 15127 (0.86s)
10000-11000 / 15127 (1.57s)
11000-12000 / 15127 (1.56s)
12000-13000 / 15127 (1.98s)
13000-14000 / 15127 (0.94s)
14000-15000 / 15127 (0.97s)
15000-15127 / 15127 (1.17s)
===
Records extracted: 15127
Starting iteration with limit=1000 skip=0 ...
0-1000 / 13961 (0.92s)
1000-2000 / 13961 (1.04s)
2000-3000 / 13961 (0.96s)
3000-4000 / 13961 (0.85s)
4000-5000 / 13961 (0.86s)
5000-6000 / 13961 (0.98s)
6000-7000 / 13961 (0.92s)
7000-8000 / 13961 (0.96s)
8000-9000 / 13961 (0.86s)
9000-10000 / 13961 (1.50s)
10000-11000 / 13961 (1.73s)
11000-12000 / 13961 (1.05s)
12000-13000 / 13961 (1.57s)
13000-13961 / 13961 (1.06s)
===
Records extracted: 13961
Starting iteration with limit=1000 skip=0 ...
0-1000 / 13285 (1.56s)
1000-2000 / 13285 (0.95s)
2000-3000 / 13285 (0.94s)
3000-4000 / 13285 (0.97s)
4000-5000 / 13285 (1.48s)
5000-6000 / 13285 (1.66s)
6000-7000 / 13285 (0.95s)
7000-8000 / 13285 (0.91s)
8000-9000 / 13285 (2.01s)
9000-10000 / 13285 (0.84s)
10000-11000 / 13285 (1.53s)
11000-12000 / 13285 (2.03s)
12000-13000 / 13285 (1.44s)
13000-13285 / 13285 (1.57s)
===
Records extracted: 13285
Starting iteration with limit=1000 skip=0 ...
0-1000 / 14285 (2.17s)
1000-2000 / 14285 (3.39s)
2000-3000 / 14285 (2.15s)
3000-4000 / 14285 (1.82s)
4000-5000 / 14285 (2.10s)
5000-6000 / 14285 (1.56s)
6000-7000 / 14285 (1.74s)
7000-8000 / 14285 (1.87s)
8000-9000 / 14285 (1.67s)
9000-10000 / 14285 (1.57s)
10000-11000 / 14285 (1.67s)
11000-12000 / 14285 (1.74s)
12000-13000 / 14285 (1.73s)
13000-14000 / 14285 (1.93s)
14000-14285 / 14285 (1.61s)
===
Records extracted: 14285
Starting iteration with limit=1000 skip=0 ...
0-1000 / 12741 (2.33s)
1000-2000 / 12741 (2.16s)
2000-3000 / 12741 (1.54s)
3000-4000 / 12741 (2.81s)
4000-5000 / 12741 (1.64s)
5000-6000 / 12741 (1.67s)
6000-7000 / 12741 (1.46s)
7000-8000 / 12741 (1.90s)
8000-9000 / 12741 (1.45s)
9000-10000 / 12741 (1.65s)
10000-11000 / 12741 (1.89s)
11000-12000 / 12741 (1.57s)
12000-12741 / 12741 (1.78s)
===
Records extracted: 12741
[9]:
# remove duplicate publications, if they have the same PUB_ID
pubs = pubs.drop_duplicates(subset="id")
# preview the data
pubs
[9]:
 | id | year | journal.id | journal.title
---|---|---|---|---
0 | pub.1124795380 | 2020 | jour.1045059 | European Journal of Psychotraumatology |
1 | pub.1130479619 | 2020 | jour.1041075 | Global Health Action |
2 | pub.1125404225 | 2020 | jour.1045337 | Scientific Reports |
3 | pub.1125109477 | 2020 | jour.1023439 | Genome Biology |
4 | pub.1127505679 | 2020 | jour.1043282 | Nature Communications |
... | ... | ... | ... | ... |
15122 | pub.1100301829 | 2018 | jour.1048149 | European Thyroid Journal |
15123 | pub.1111681430 | 2018 | jour.1011409 | Human Heredity |
15124 | pub.1101241166 | 2018 | jour.1049794 | Oncology Research and Treatment |
15125 | pub.1101248219 | 2018 | jour.1100504 | Dementia and Geriatric Cognitive Disorders |
15126 | pub.1109951444 | 2018 | jour.1276748 | SSRN Electronic Journal |
15111 rows × 4 columns
Now we can create a journals-only dataset that includes counts per year, and a grand total.
[10]:
journals = pubs.copy()
# drop the pub_id column
journals = journals.drop(['id'], axis=1)
#
# add total counts, overall and per year
journals['total'] = journals.groupby('journal.id')['journal.id'].transform('count')
journals['total_year'] = journals.groupby(['journal.id', 'year'])['journal.id'].transform('count')
#
# remove multiple counts for the same journal
journals = journals.drop_duplicates()
journals = journals.reset_index(drop=True)
#
# sort by total count
journals = journals.sort_values('total', ascending=False)
#
# save
save(journals, "5.competing_journals.csv")
print("======\nDone")
# preview the data
journals.head(10)
======
Done
[10]:
 | year | journal.id | journal.title | total | total_year
---|---|---|---|---|---
9998 | 2018 | jour.1293558 | bioRxiv | 1341 | 486 |
4226 | 2019 | jour.1293558 | bioRxiv | 1341 | 583 |
409 | 2020 | jour.1293558 | bioRxiv | 1341 | 272 |
4406 | 2019 | jour.1101548 | European Neuropsychopharmacology | 480 | 465 |
1009 | 2020 | jour.1101548 | European Neuropsychopharmacology | 480 | 7 |
11482 | 2018 | jour.1101548 | European Neuropsychopharmacology | 480 | 8 |
4420 | 2019 | jour.1043282 | Nature Communications | 353 | 136 |
10151 | 2018 | jour.1043282 | Nature Communications | 353 | 123 |
4 | 2020 | jour.1043282 | Nature Communications | 353 | 94 |
386 | 2020 | jour.1369542 | medRxiv | 318 | 252 |
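Optionally, before plotting it can be useful to reshape this dataset into a journals-by-year matrix, so the yearly counts can be eyeballed side by side. A small sketch (the journals_by_year name is just for illustration):
# optional: pivot into a journal x year matrix of publication counts
journals_by_year = journals.pivot_table(index="journal.title",
                                        columns="year",
                                        values="total_year",
                                        aggfunc="sum").fillna(0).astype(int)
# sort by the overall count and preview the top journals
journals_by_year.assign(all_years=journals_by_year.sum(axis=1)) \
                .sort_values("all_years", ascending=False).head(10)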
Visualizations¶
[11]:
threshold = 100
temp = journals.sort_values("total", ascending=False)[:threshold]
px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total'],
       title=f"Top {threshold} competitors for {journal_id} (based on publications data from {start_year})")
[12]:
threshold = 200
temp = journals.sort_values("year", ascending=True).groupby("year").head(threshold)
px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       facet_row="year",
       height=900,
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total'],
       title=f"Top {threshold} competitors for {journal_id} - segmented by year")
NOTE the European Neuropsychopharmacology journal has a massive jump in 2019 because it published a lot of conference proceedings! See also the journal’s Dimensions page for comparison.
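This is easy to verify from the journals dataframe we built above, by looking at the per-year counts for that journal:
# per-year breakdown for a single journal
journals[journals["journal.title"] == "European Neuropsychopharmacology"][["year", "total_year", "total"]]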
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.