Journal Profiling Part 5: Competing Journals Analysis¶
This Python notebook shows how to use the Dimensions Analytics API to create a competing journals analysis report, starting from a specific journal and using its authors’ information.
This tutorial is the fifth in a series that uses the extracted data to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.
In this notebook we are going to:
Load the researcher impact metrics data previously extracted (see parts 1-3)
Get the full publications history for these researchers
Use this new publications dataset to determine which journals these researchers have also published in most frequently
Build some visualizations to get a quick overview of the results
[12]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[1]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if 'google.colab' not in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
Finally, let’s set up a folder to store the data we are going to extract:
[2]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not os.path.exists(FOLDER_NAME):
    os.mkdir(FOLDER_NAME)

def save(df, filename_dot_csv):
    df.to_csv(FOLDER_NAME + "/" + filename_dot_csv, index=False)
Competing Journals¶
From our researchers master list, we now want to extract the following:
the full list of publications for an N-year period
the full list of journals, with counts of how many publications appeared in each journal
This new dataset will let us draw some conclusions about which journals compete with the one we selected at the beginning.
First let’s reload the data obtained in previous steps¶
NOTE If you are using Google Colab or don’t have the data available, just do the following:
* open up the ‘Files’ panel in Google Colab and create a new folder journal-profile-data
* grab this file, unzip it, open the enclosed folder and upload the file called 2_researchers_impact_metrics.csv to Google Colab (‘Upload’ menu, or by dragging it inside the panel window)
* move the file inside the journal-profile-data folder you just created
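Alternatively, if you are working in Colab you can do the upload programmatically. The snippet below is just a sketch using Colab’s files helper; it assumes the journal-profile-data folder has already been created by the setup cell above.
# Optional sketch for Google Colab users: upload the CSV via code instead of the Files panel
from google.colab import files
uploaded = files.upload()  # select 2_researchers_impact_metrics.csv in the dialog
for name in uploaded:
    # move the uploaded file into the data folder created earlier
    os.rename(name, os.path.join(FOLDER_NAME, name))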
[3]:
#
researchers = pd.read_csv(FOLDER_NAME+"/2_researchers_impact_metrics.csv")
#
print("Total researchers:", len(researchers))
researchers.head(5)
Total researchers: 23346
[3]:
researcher_id | pubs | full_name | citations_mean | altmetric_mean | last_pub_year | url | |
---|---|---|---|---|---|---|---|
0 | ur.0723426172.10 | 77 | Kari Stefansson | 201.493506 | 248.818182 | 2021 | NaN |
1 | ur.01277776417.51 | 51 | Unnur Thorsteinsdottir | 131.862745 | 201.666667 | 2021 | NaN |
2 | ur.0641525362.39 | 42 | Gonçalo R. Abecasis | 223.714286 | 167.809524 | 2021 | NaN |
3 | ur.011675454737.09 | 39 | Cornelia M van Duijn | 216.102564 | 303.179487 | 2021 | NaN |
4 | ur.01264737414.70 | 39 | Tõnu Esko | 244.564103 | 339.923077 | 2021 | NaN |
[4]:
# TIP to speed this up I'm taking only the top 2000 researchers!
# for a full analysis, just comment out the next line
researchers = researchers[:2000]
What the query looks like¶
The approach we’re taking consists of pulling all the publications data first, so that we can count journals in a second step.
This approach may take some time (as we’re potentially retrieving a lot of publications data), but it will lead to precise results.
The query template to use looks like this (for a couple of researchers only):
[5]:
%%dsldf
search publications where researchers.id in ["ur.01277776417.51", "ur.0637651205.48"]
and year >= 2015 and journal is not empty
and journal.id != "jour.1103138"
return publications[id+journal] limit 10
Returned Publications: 10 (total = 209)
Time: 0.62s
[5]:
id | journal.id | journal.title | |
---|---|---|---|
0 | pub.1144294896 | jour.1044693 | BMJ Open |
1 | pub.1144045413 | jour.1100952 | Arteriosclerosis Thrombosis and Vascular Biology |
2 | pub.1144018365 | jour.1369542 | medRxiv |
3 | pub.1143745202 | jour.1018957 | Nature |
4 | pub.1143816650 | jour.1293558 | bioRxiv |
5 | pub.1143538243 | jour.1041454 | Science Translational Medicine |
6 | pub.1142697518 | jour.1293558 | bioRxiv |
7 | pub.1142590280 | jour.1045682 | Blood Cancer Journal |
8 | pub.1141693039 | jour.1300829 | Communications Biology |
9 | pub.1138032051 | jour.1399822 | Arthritis & Rheumatology |
Extracting all publications/journals information¶
This part may take some time to run (depending on how many years back you want to go), so you may want to get a coffee while you wait.
[6]:
#
journal_id = "jour.1103138" # = Nature Genetics
start_year = 2018
# our list of researchers
llist = list(researchers['researcher_id'])
#
# the query
q2 = """search publications
where researchers.id in {}
and year >= {} and journal is not empty and journal.id != "{}"
return publications[id+journal+year]"""
[7]:
VERBOSE = True
RESEARCHER_ITERATOR_NO = 400

pubs = pd.DataFrame()
for chunk in progress(list(chunks_of(llist, RESEARCHER_ITERATOR_NO))):
    # get all pubs for this chunk of researchers
    query = q2.format(json.dumps(chunk), start_year, journal_id)
    res = dsl.query_iterative(query, verbose=VERBOSE)
    if pubs.empty:
        # first time: init the dataframe
        pubs = res.as_dataframe()
    else:
        # concat returns a new dataframe, so reassign
        pubs = pd.concat([pubs, res.as_dataframe()], ignore_index=True)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 23456 (2.23s)
1000-2000 / 23456 (2.17s)
2000-3000 / 23456 (2.36s)
3000-4000 / 23456 (1.87s)
4000-5000 / 23456 (1.93s)
5000-6000 / 23456 (1.86s)
6000-7000 / 23456 (1.78s)
7000-8000 / 23456 (2.04s)
8000-9000 / 23456 (1.91s)
9000-10000 / 23456 (1.89s)
10000-11000 / 23456 (1.86s)
11000-12000 / 23456 (1.76s)
12000-13000 / 23456 (1.79s)
13000-14000 / 23456 (1.69s)
14000-15000 / 23456 (1.82s)
15000-16000 / 23456 (1.87s)
16000-17000 / 23456 (1.77s)
17000-18000 / 23456 (1.73s)
18000-19000 / 23456 (1.94s)
19000-20000 / 23456 (3.05s)
20000-21000 / 23456 (2.24s)
21000-22000 / 23456 (1.99s)
22000-23000 / 23456 (1.90s)
23000-23456 / 23456 (1.63s)
===
Records extracted: 23456
Starting iteration with limit=1000 skip=0 ...
0-1000 / 21078 (1.98s)
1000-2000 / 21078 (1.98s)
2000-3000 / 21078 (1.99s)
3000-4000 / 21078 (1.75s)
4000-5000 / 21078 (1.81s)
5000-6000 / 21078 (1.85s)
6000-7000 / 21078 (1.84s)
7000-8000 / 21078 (1.95s)
8000-9000 / 21078 (1.77s)
9000-10000 / 21078 (1.85s)
10000-11000 / 21078 (1.86s)
11000-12000 / 21078 (1.83s)
12000-13000 / 21078 (1.75s)
13000-14000 / 21078 (1.80s)
14000-15000 / 21078 (1.86s)
15000-16000 / 21078 (1.66s)
16000-17000 / 21078 (1.77s)
17000-18000 / 21078 (1.79s)
18000-19000 / 21078 (1.80s)
19000-20000 / 21078 (1.78s)
20000-21000 / 21078 (1.73s)
21000-21078 / 21078 (1.77s)
===
Records extracted: 21078
Starting iteration with limit=1000 skip=0 ...
0-1000 / 22729 (1.88s)
1000-2000 / 22729 (1.86s)
2000-3000 / 22729 (2.30s)
3000-4000 / 22729 (1.94s)
4000-5000 / 22729 (1.83s)
5000-6000 / 22729 (1.87s)
6000-7000 / 22729 (1.91s)
7000-8000 / 22729 (1.89s)
8000-9000 / 22729 (2.05s)
9000-10000 / 22729 (1.76s)
10000-11000 / 22729 (1.89s)
11000-12000 / 22729 (1.84s)
12000-13000 / 22729 (1.86s)
13000-14000 / 22729 (1.92s)
14000-15000 / 22729 (1.85s)
15000-16000 / 22729 (1.83s)
16000-17000 / 22729 (1.86s)
17000-18000 / 22729 (1.86s)
18000-19000 / 22729 (1.84s)
19000-20000 / 22729 (2.17s)
20000-21000 / 22729 (1.82s)
21000-22000 / 22729 (1.99s)
22000-22729 / 22729 (1.75s)
===
Records extracted: 22729
Starting iteration with limit=1000 skip=0 ...
0-1000 / 19177 (1.87s)
1000-2000 / 19177 (1.86s)
2000-3000 / 19177 (1.92s)
3000-4000 / 19177 (1.62s)
4000-5000 / 19177 (1.62s)
5000-6000 / 19177 (1.64s)
6000-7000 / 19177 (1.68s)
7000-8000 / 19177 (1.68s)
8000-9000 / 19177 (1.80s)
9000-10000 / 19177 (1.60s)
10000-11000 / 19177 (1.58s)
11000-12000 / 19177 (1.70s)
12000-13000 / 19177 (1.81s)
13000-14000 / 19177 (1.73s)
14000-15000 / 19177 (1.68s)
15000-16000 / 19177 (1.77s)
16000-17000 / 19177 (1.77s)
17000-18000 / 19177 (1.69s)
18000-19000 / 19177 (1.73s)
19000-19177 / 19177 (1.52s)
===
Records extracted: 19177
Starting iteration with limit=1000 skip=0 ...
0-1000 / 19307 (1.81s)
1000-2000 / 19307 (1.99s)
2000-3000 / 19307 (1.90s)
3000-4000 / 19307 (1.66s)
4000-5000 / 19307 (1.68s)
5000-6000 / 19307 (1.65s)
6000-7000 / 19307 (1.60s)
7000-8000 / 19307 (1.65s)
8000-9000 / 19307 (1.77s)
9000-10000 / 19307 (1.80s)
10000-11000 / 19307 (1.72s)
11000-12000 / 19307 (1.62s)
12000-13000 / 19307 (1.88s)
13000-14000 / 19307 (1.72s)
14000-15000 / 19307 (1.70s)
15000-16000 / 19307 (1.72s)
16000-17000 / 19307 (1.65s)
17000-18000 / 19307 (1.63s)
18000-19000 / 19307 (1.71s)
19000-19307 / 19307 (1.62s)
===
Records extracted: 19307
[8]:
# remove duplicate publications, if they have the same PUB_ID
pubs = pubs.drop_duplicates(subset="id")
# preview the data
pubs
[8]:
id | year | journal.id | journal.title | |
---|---|---|---|---|
0 | pub.1144058594 | 2022 | jour.1088601 | Maturitas |
1 | pub.1144115098 | 2022 | jour.1098341 | Journal of Food Composition and Analysis |
2 | pub.1144111238 | 2022 | jour.1400578 | Ophthalmology Science |
3 | pub.1142542983 | 2022 | jour.1088601 | Maturitas |
4 | pub.1142207532 | 2022 | jour.1034064 | Food Quality and Preference |
... | ... | ... | ... | ... |
23451 | pub.1101137729 | 2018 | jour.1119070 | Journal of Alzheimer's Disease |
23452 | pub.1101137719 | 2018 | jour.1119070 | Journal of Alzheimer's Disease |
23453 | pub.1100924477 | 2018 | jour.1018190 | Nederlands Tijdschrift voor Geneeskunde |
23454 | pub.1100522151 | 2018 | jour.1046789 | Food and Nutrition Sciences |
23455 | pub.1084865232 | 2018 | jour.1101614 | European Psychiatry |
23456 rows × 4 columns
Now we can create a journals-only dataset that includes counts per year and a grand total.
[9]:
journals = pubs.copy()
# drop the pub_id column
journals = journals.drop(['id'], axis=1)
#
# add total and per-year count columns
journals['total'] = journals.groupby('journal.id')['journal.id'].transform('count')
journals['total_year'] = journals.groupby(['journal.id', 'year'])['journal.id'].transform('count')
#
# remove multiple counts for the same journal/year
journals = journals.drop_duplicates()
journals = journals.reset_index(drop=True)
#
# sort by total count
journals = journals.sort_values('total', ascending=False)
#
# save
save(journals, "5.competing_journals.csv")
print("======\nDone")
# preview the data
journals.head(10)
======
Done
[9]:
year | journal.id | journal.title | total | total_year | |
---|---|---|---|---|---|
65 | 2021 | jour.1293558 | bioRxiv | 1811 | 320 |
18024 | 2018 | jour.1293558 | bioRxiv | 1811 | 497 |
5980 | 2020 | jour.1293558 | bioRxiv | 1811 | 403 |
12009 | 2019 | jour.1293558 | bioRxiv | 1811 | 591 |
11999 | 2019 | jour.1369542 | medRxiv | 994 | 65 |
5978 | 2020 | jour.1369542 | medRxiv | 994 | 442 |
18 | 2021 | jour.1369542 | medRxiv | 994 | 487 |
131 | 2021 | jour.1043282 | Nature Communications | 566 | 132 |
6062 | 2020 | jour.1043282 | Nature Communications | 566 | 166 |
18117 | 2018 | jour.1043282 | Nature Communications | 566 | 130 |
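As a side note, the same counts can also be obtained with a plain groupby/size aggregation over the publications dataframe. The sketch below is not part of the original steps, but it produces equivalent total and total_year columns and can be easier to read than the transform + drop_duplicates approach.
# Alternative sketch: per-journal counts via groupby/size instead of transform + drop_duplicates
totals = (
    pubs.groupby(['journal.id', 'journal.title', 'year'])
        .size()
        .reset_index(name='total_year')
)
totals['total'] = totals.groupby('journal.id')['total_year'].transform('sum')
totals.sort_values(['total', 'year'], ascending=False).head(10)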
Visualizations¶
[10]:
threshold = 100
temp = journals.sort_values("total", ascending=False)[:threshold]
px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total'],
       title=f"Top {threshold} competitors for {journal_id} (based on publications data from {start_year})")
[11]:
threshold = 200
temp = journals.sort_values("year", ascending=True).groupby("year").head(threshold)
px.bar(temp,
       x="journal.title", y="total_year",
       color="year",
       facet_row="year",
       height=900,
       hover_name="journal.title",
       hover_data=['journal.id', 'journal.title', 'total'],
       title=f"Top {threshold} competitors for {journal_id} - segmented by year")
NOTE the European Neuropsychopharmacology journal shows a massive jump in 2019 because it published a lot of conference proceedings that year! See also the journal’s Dimensions page for comparison.
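If you want to verify this in the data, a quick filter on the journals dataframe (just a sketch) shows the per-year counts:
# Sketch: inspect the per-year counts for a single journal to verify the 2019 spike
eur_np = journals[journals['journal.title'] == "European Neuropsychopharmacology"]
eur_np[['year', 'total_year', 'total']].sort_values('year')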
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.