Citation Analysis: Journals Citing a Research Organization¶
This notebook shows how to use the Dimensions Analytics API to discover which academic journals most frequently cite publications from authors affiliated with a selected research organization. These are the steps:
We start from a GRID identifier (representing a research organization in Dimensions)
We then select all publications citing research where at least one author is affiliated with that GRID organization
Finally, we group these publications by source (journal) and analyse the findings
[11]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
1. Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[5]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if not 'google.colab' in sys.modules:
    # make JS dependencies local (needed by HTML exports)
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
A couple of utilities to simplify exporting the results we find as CSV files:
[2]:
#
# data-saving utils
#
DATAFOLDER = "extraction1"
#
if not os.path.exists(DATAFOLDER):
    !mkdir $DATAFOLDER
    print("==\nCreated data folder:", DATAFOLDER + "/")
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")
2. Choose a Research Organization¶
For the purposes of this exercise, we are going to use grid.414299.3. Feel free to change the parameters below as you wish, e.g. by choosing another GRID organization.
[3]:
GRIDID = "grid.414299.3" #@param {type:"string"}
#@markdown The start/end year of the publications used for the citation analysis
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}
if YEAR_END < YEAR_START:
    YEAR_END = YEAR_START
#
# gen link to Dimensions
#
def dimensions_url(grids):
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join([x for x in grids])
from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Preview {} in Dimensions ⧉</a>'.format(dimensions_url([GRIDID]), GRIDID)))
3. Building a Publications Baseset¶
First we extract all publications where at least one of the authors is affiliated with the selected GRID organization (GRIDID).
This will then let us query for citing publications using the reference_ids field (see the Dimensions API data model for more details).
[4]:
publications = dsl.query_iterative(f"""
search publications
where research_orgs.id = "{GRIDID}"
and year in [{YEAR_START}:{YEAR_END}]
return publications[id+title+doi+year]
""")
#
# save the data
pubs_cited = publications.as_dataframe()
save_as_csv(pubs_cited, f"pubs_{GRIDID}")
Starting iteration with limit=1000 skip=0 ...
0-1000 / 2705 (1.21s)
1000-2000 / 2705 (0.88s)
2000-2705 / 2705 (0.68s)
===
Records extracted: 2705
===
Saved: extraction1/pubs_grid.414299.3.csv
4. Extracting Publications Citing the Baseset¶
In the next step we extract all publications citing the publications previously extracted. This query returns JSON data which can be further analyzed, e.g. to count the number of unique journals they were published in.
E.g.:
'publications': [
{'journal': {'id': 'jour.1295784',
'title': 'IEEE Transactions on Cognitive and Developmental Systems'},
'publisher': 'Institute of Electrical and Electronics Engineers (IEEE)',
'year': 2018,
'id': 'pub.1061542201',
'issn': ['2379-8920', '2379-8939']},
{'journal': {'id': 'jour.1043581', 'title': 'International Geology Review'},
'publisher': 'Taylor & Francis',
'year': 2018,
'id': 'pub.1087302818',
'issn': ['0020-6814', '1938-2839']}, etc..
This is the query template we use.
[5]:
query_template = """search publications
where journal is not empty
and reference_ids in {}
return publications[id+journal+issn+year+publisher]"""
Note the {} placeholder, which is where we insert lists of publication IDs (from the previous extraction) at each iteration. Batching keeps each query from getting too long (fewer than 400 IDs per batch is a safe way to avoid API errors).
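As a quick illustration of the batching, dimcli's chunks_of helper (imported above via from dimcli.utils import *) slices a list into fixed-size chunks. The IDs below are made up, just to show the mechanics:

# illustrative only: split 1000 hypothetical IDs into chunks of at most 400
sample_ids = ["pub.%07d" % n for n in range(1, 1001)]
for batch in chunks_of(sample_ids, 400):
    print(len(batch))   # prints 400, 400, 200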
[6]:
pubids = list(pubs_cited['id'])
#
# loop through all source-publications IDs in chunks and query Dimensions
print("===\nExtracting publications data ...")
results = []
BATCHSIZE = 200
VERBOSE = False # set to True to see extraction logs
for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
    query = query_template.format(json.dumps(chunk))
    data = dsl.query_iterative(query, verbose=VERBOSE)
    results += data.publications
    time.sleep(0.5)
#
# put the citing pub data into a dataframe, remove duplicates and save
pubs_citing = pd.DataFrame().from_dict(results)
print("===\nCiting Publications found: ", len(pubs_citing))
pubs_citing.drop_duplicates(subset='id', inplace=True)
print("Unique Citing Publications found: ", len(pubs_citing))
#
# split up nested journal columns into two columns
journals = pubs_citing['journal'].apply(pd.Series).rename(columns={"id": "journal.id", "title": "journal.title"})
pubs_citing = pd.concat([pubs_citing.drop(['journal'], axis=1), journals], axis=1)
#
# save
save_as_csv(pubs_citing, f"pubs_citing_{GRIDID}")
#
# preview the data
print("===\nPreview:")
pubs_citing.head(10)
===
Extracting publications data ...
===
Citing Publications found: 64886
Unique Citing Publications found: 57991
===
Saved: extraction1/pubs_citing_grid.414299.3.csv
===
Preview:
[6]:
|   | id | issn | publisher | year | journal.id | journal.title |
|---|---|---|---|---|---|---|
| 0 | pub.1144498077 | [0016-2361, 1873-7153] | Elsevier | 2022 | jour.1044923 | Fuel |
| 1 | pub.1144184694 | [0883-9417, 1532-8228] | Elsevier | 2022 | jour.1097229 | Archives of Psychiatric Nursing |
| 2 | pub.1142566480 | [1746-8094, 1746-8108] | Elsevier | 2022 | jour.1039070 | Biomedical Signal Processing and Control |
| 3 | pub.1144814981 | [1664-1078] | Frontiers | 2022 | jour.1044598 | Frontiers in Psychology |
| 4 | pub.1144783633 | [0140-0118, 1741-0444] | Springer Nature | 2022 | jour.1005585 | Medical & Biological Engineering & Computing |
| 5 | pub.1144777341 | [0214-1582, 1578-1399] | Elsevier | 2022 | jour.1107232 | Revista de Senología y Patología Mamaria |
| 6 | pub.1144781125 | [2050-0068, 2050-0068] | Wiley | 2022 | jour.1320607 | Clinical & Translational Immunology |
| 7 | pub.1144764680 | NaN | Research Square Platform LLC | 2022 | jour.1380788 | Research Square |
| 8 | pub.1144744277 | [1661-6596, 1422-0067] | MDPI | 2022 | jour.1028874 | International Journal of Molecular Sciences |
| 9 | pub.1144706261 | [1038-5282, 1440-1584] | Wiley | 2022 | jour.1103707 | Australian Journal of Rural Health |
5. Journal Analysis¶
Finally, we can analyze the citing publications by grouping them by source journal. This can be achieved easily thanks to pandas' DataFrame methods.
pandas is a popular Python library for data manipulation and analysis.
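For example, the per-journal counts computed in the cells below boil down to a single groupby (an equivalent sketch of what the value_counts call used later does):

# equivalent sketch: number of citing publications per journal
pubs_citing.groupby("journal.title").size().sort_values(ascending=False).head()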
Number of unique journals¶
[7]:
pubs_citing['journal.id'].describe()
[7]:
count 57991
unique 7295
top jour.1037553
freq 710
Name: journal.id, dtype: object
Most frequent journals¶
[8]:
# count journals and rename columns
journals = pubs_citing['journal.title'].value_counts()
journals = journals.to_frame().reset_index().rename(columns= {"index": 'journal.title', 'journal.title': 'count'})
journals.index.name = 'index'
#
# save
save_as_csv(journals, f"top_journals_citing_{GRIDID}")
#preview
journals.head(100)
===
Saved: extraction1/top_journals_citing_grid.414299.3.csv
[8]:
| index | journal.title | count |
|---|---|---|
| 0 | PLOS ONE | 710 |
| 1 | Inflammatory Bowel Diseases | 342 |
| 2 | World Journal of Gastroenterology | 321 |
| 3 | Scientific Reports | 288 |
| 4 | bioRxiv | 286 |
| ... | ... | ... |
| 95 | Journal of Pediatric Gastroenterology and Nutr... | 73 |
| 96 | Emergency Medicine Australasia | 72 |
| 97 | Techniques in Coloproctology | 72 |
| 98 | BJU International | 72 |
| 99 | Intensive Care Medicine | 71 |

100 rows × 2 columns
Top 100 journals chart¶
[9]:
px.bar(journals[:100],
x="journal.title", y="count",
title=f"Top 100 journals citing publications from {GRIDID}")
Top 20 journals by year chart¶
[10]:
THRESHOLD = 20 #@param {type: "slider", min: 10, max: 100}
# suppress empty values
pubs_citing.fillna("-no value-", inplace=True)
# make publications list smaller by only showing top journals
pubs_citing_topjournals = pubs_citing[pubs_citing['journal.title'].isin(list(journals[:THRESHOLD]['journal.title']))].sort_values('journal.title')
# build histogram
px.histogram(pubs_citing_topjournals,
x="year",
color="journal.title",
title=f"Top {THRESHOLD} journals citing publications from {GRIDID} - by year")
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out the associated GitHub repository for examples, the source code of these tutorials, and much more.