Citation Analysis: Journals Cited by a Research Organization¶
This notebook shows how to use the Dimensions Analytics API to discover which academic journals are most frequently cited by authors affiliated with a selected research organization. These are the steps:
We start from a specific organization GRID ID (and other parameters of choice)
Using the publications API, we extract all publications authored by researchers at that institution. For each publication, we store all outgoing citation IDs using the reference_ids field
We query the API again to obtain other useful metadata for those outgoing citations, e.g. title, publisher, journal etc.
We analyse the data, in particular by segmenting it by journal and publisher
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Sep 22, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.9.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.2
Method: dsl.ini file
1. Choosing a Research Organization¶
We can use the organizations API to find the GRID ID for the University of California, Berkeley.
[3]:
%%dsldf
search organizations for "berkeley university" return organizations
Returned Organizations: 1 (total = 1)
Time: 0.61s
[3]:
 | id | name | acronym | city_name | country_name | latitude | linkout | longitude | state_name | types |
---|---|---|---|---|---|---|---|---|---|---|
0 | grid.47840.3f | University of California, Berkeley | UCB | Berkeley | United States | 37.87216 | [http://www.berkeley.edu/] | -122.258575 | California | [Education] |
The ID we are looking for is grid.47840.3f.
1.1 Selecting a Field of Research ID¶
Similarly, we can use the API to identify the most relevant Field of Research (FoR) categories for publications from this organization.
By using a specific FoR category we can make the subsequent data extraction & analysis a bit more focused.
[4]:
%%dsldf
search publications
where research_orgs.id = "grid.47840.3f"
return category_for limit 10
Returned Category_for: 10
Time: 0.97s
[4]:
 | id | name | count |
---|---|---|---|
0 | 2206 | 06 Biological Sciences | 42951 |
1 | 2202 | 02 Physical Sciences | 42053 |
2 | 2209 | 09 Engineering | 38325 |
3 | 2211 | 11 Medical and Health Sciences | 33814 |
4 | 2203 | 03 Chemical Sciences | 24475 |
5 | 2201 | 01 Mathematical Sciences | 19433 |
6 | 2208 | 08 Information and Computing Sciences | 17366 |
7 | 2581 | 0601 Biochemistry and Cell Biology | 16245 |
8 | 2471 | 0306 Physical Chemistry (incl. Structural) | 13440 |
9 | 2217 | 17 Psychology and Cognitive Sciences | 13398 |
For example, let’s focus on 08 Information and Computing Sciences, ID 2208.
Finally, we can also select a specific year range, e.g. 2015 to 2021.
Let’s save all of these variables so that we can reference them later on.
[5]:
GRIDID = "grid.47840.3f" #@param {type:"string"}
FOR_CODE = "2208" #@param {type:"string"}
#@markdown The start/end year of publications used to extract citations
YEAR_START = 2015 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2021 #@param {type: "slider", min: 1950, max: 2021}
if YEAR_END < YEAR_START:
    YEAR_END = YEAR_START
2. Getting the IDs of the outgoing citations¶
In this section we use the Publications API to extract the Dimensions ID of all publications referenced by authors in the selected research organization.
These identifiers can be found in the reference_ids field.
[7]:
publications = dsl.query_iterative(f"""
search publications
where research_orgs.id = "{GRIDID}"
and year in [{YEAR_START}:{YEAR_END}]
and category_for.id="{FOR_CODE}"
return publications[id+doi+reference_ids]
""")
#
# preview the data
pubs_and_citations = publications.as_dataframe().explode("reference_ids")
pubs_and_citations.head(5)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 5068 (1.31s)
1000-2000 / 5068 (2.01s)
2000-3000 / 5068 (1.07s)
3000-4000 / 5068 (1.17s)
4000-5000 / 5068 (1.33s)
5000-5068 / 5068 (1.23s)
===
Records extracted: 5068
[7]:
 | id | doi | reference_ids |
---|---|---|---|
0 | pub.1146101788 | 10.1109/icicas53977.2021.00031 | [pub.1121966168, pub.1121972405, pub.112197331... |
1 | pub.1144313713 | 10.1002/9780470015902.a0029363 | [pub.1137478539, pub.1125842877, pub.112565184... |
2 | pub.1144210288 | 10.1093/jncics/pkab099 | [pub.1143292627, pub.1009104520, pub.103834136... |
3 | pub.1142934064 | 10.1145/3485007 | [pub.1007394032, pub.1004190151, pub.105272630... |
4 | pub.1141526845 | 10.1145/3478535 | [pub.1023640338, pub.1127037784, pub.106438915... |
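To make the reshaping step concrete, here is a minimal sketch in plain Python of what the explode call does to the reference_ids column: each publication is repeated once per referenced ID. The sample records and the helper name `explode_references` are made up for illustration and are not API output. Unlike pandas' explode, which keeps a row with a missing value for publications without references, this sketch simply skips them.

```python
# Hypothetical sample records, shaped like the publications returned above
sample_pubs = [
    {"id": "pub.A", "reference_ids": ["pub.X", "pub.Y"]},
    {"id": "pub.B", "reference_ids": ["pub.X"]},
    {"id": "pub.C"},  # a record with no reference_ids at all
]

def explode_references(pubs):
    """Return one (citing_id, cited_id) pair per outgoing citation."""
    pairs = []
    for pub in pubs:
        # records lacking the field contribute no pairs
        for ref in pub.get("reference_ids") or []:
            pairs.append((pub["id"], ref))
    return pairs

pairs = explode_references(sample_pubs)
print(pairs)
```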
2.1 Removing duplicates and counting most frequent citations¶
Since multiple authors/publications from our organization will be referencing the same target publications, we may have various duplicates in our reference_ids column.
So we want to remove those duplicates, while at the same time retaining that information by adding a new column size that counts how frequently a certain publication was cited.
This can be easily achieved using pandas’ group-by function:
[10]:
# consider only IDs column
df = pubs_and_citations[['reference_ids']]
# group by ID and count
citations = df.groupby(df.columns.tolist(),as_index=False).size().sort_values("size", ascending=False)
# preview the data, most cited ID first
citations.head(10)
[10]:
 | reference_ids | size |
---|---|---|
62077 | pub.1093359587 | 202 |
69920 | pub.1095689025 | 133 |
29065 | pub.1038140272 | 96 |
7290 | pub.1009767488 | 86 |
34655 | pub.1045321436 | 76 |
39735 | pub.1052031051 | 68 |
66651 | pub.1094727707 | 68 |
44994 | pub.1061179979 | 66 |
7503 | pub.1010020120 | 63 |
62982 | pub.1093626237 | 56 |
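The same de-duplicate-and-count idea can be sketched without pandas, using `collections.Counter` from the standard library. The IDs below are invented placeholders; the point is only to show that each distinct ID ends up with one entry carrying its frequency.

```python
from collections import Counter

# Hypothetical flat list of cited IDs, one entry per outgoing citation
cited_ids = ["pub.X", "pub.Y", "pub.X", "pub.Z", "pub.X", "pub.Y"]

# Count how many times each ID occurs (duplicates collapse into one key)
counts = Counter(cited_ids)

# most_common() orders by descending count, similar in spirit to
# sort_values("size", ascending=False) in the cell above
top = counts.most_common(2)
print(top)
```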
3. Enriching the citations IDs with other publication metadata¶
In this step we use the outgoing citation IDs obtained above to query the publications API again.
The goal is to retrieve more publication metadata so that we can group citations by criteria of interest, e.g. the journal they belong to. For example:
source_title
publisher
year
doi
NOTE Since we can have lots of publications to go through, the list of IDs is split into smaller chunks so that each resulting API query is never too long (more info here).
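As a rough illustration of the chunking idea, the sketch below splits a list of IDs into fixed-size groups and builds one query string per group. The helper `chunked` is a hypothetical stand-in written for this example (it is not Dimcli's own `chunks_of`), the IDs are invented, and no API call is made.

```python
import json

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Same query shape as the notebook uses, with a shorter return statement
QUERY_TEMPLATE = """search publications
    where id in {}
    return publications[id+doi]
    limit 1000"""

sample_ids = [f"pub.{n}" for n in range(1, 8)]  # 7 hypothetical IDs

# json.dumps turns a Python list into the ["a", "b"] syntax the query expects
queries = [QUERY_TEMPLATE.format(json.dumps(chunk))
           for chunk in chunked(sample_ids, 3)]

print(len(queries))   # number of API calls that would be needed
print(queries[-1])    # the last, shorter chunk
```

Keeping the chunk size below the query's limit (400 versus 1000 in the notebook) means every ID in a chunk can be returned in a single response.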
[11]:
#
# get a list of citation IDs
pubids = list(citations['reference_ids'])
#
# DSL query - PS change the return statement to extract different metadata of interest
query_template = """search publications
where id in {}
return publications[id+doi+journal+year+publisher+type+issn]
limit 1000"""
#
# loop through all references-publications IDs in chunks and query Dimensions
print(f"===\nExtracting publications data for {len(pubids)} citations...")
results = []
BATCHSIZE = 400
VERBOSE = False # set to True to see extraction logs
for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
    query = query_template.format(json.dumps(chunk))
    data = dsl.query(query, verbose=VERBOSE)
    results += data.publications
    time.sleep(0.5)
#
# load the cited publications data into a dataframe
pubs_cited = pd.DataFrame().from_dict(results)
print("===\nCited Publications found: ", len(pubs_cited))
#
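# transform the 'journal' column because it contains nested data;
# records without a journal yield a spurious column named 0, dropped here if present
temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id",
                                                             "title": "journal.title"}).drop(columns=[0], errors="ignore")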
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1).sort_values('type')
pubs_cited.head(10)
===
Extracting publications data for 87159 citations...
===
Cited Publications found: 87139
[11]:
 | doi | id | publisher | type | year | issn | journal.id | journal.title |
---|---|---|---|---|---|---|---|---|
87138 | 10.1214/aoms/1177704711 | pub.1042438804 | Institute of Mathematical Statistics | article | 1962.0 | [0003-4851, 2168-8990] | jour.1018844 | The Annals of Mathematical Statistics |
72855 | 10.1109/surv.2014.012214.00007 | pub.1061446928 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1553-877X, 2373-745X] | jour.1139536 | IEEE Communications Surveys & Tutorials |
72854 | 10.1109/surv.2014.032014.00094 | pub.1061446943 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1553-877X, 2373-745X] | jour.1139536 | IEEE Communications Surveys & Tutorials |
72853 | 10.1109/lsp.2014.2351822 | pub.1061378903 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1070-9908, 1558-2361] | jour.1033580 | IEEE Signal Processing Letters |
72852 | 10.1109/mm.2014.61 | pub.1061408931 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [0272-1732, 1937-4143] | jour.1125669 | IEEE Micro |
72851 | 10.1109/mits.2014.2343262 | pub.1061407712 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1939-1390, 1941-1197] | jour.1140577 | IEEE Intelligent Transportation Systems Magazine |
72850 | 10.1109/lsp.2014.2334306 | pub.1061378828 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1070-9908, 1558-2361] | jour.1033580 | IEEE Signal Processing Letters |
72849 | 10.1109/mra.2014.2360283 | pub.1061419755 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1070-9932, 1558-223X] | jour.1033567 | IEEE Robotics & Automation Magazine |
72848 | 10.1109/msp.2014.107 | pub.1061424107 | Institute of Electrical and Electronics Engine... | article | 2015.0 | [1540-7993, 1558-4046] | jour.1033568 | IEEE Security & Privacy |
72847 | 10.1109/lsp.2015.2393295 | pub.1061379116 | Institute of Electrical and Electronics Engine... | article | 2015.0 | [1070-9908, 1558-2361] | jour.1033580 | IEEE Signal Processing Letters |
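The flattening of the nested journal field can also be sketched in plain Python, which makes the handling of records without a journal explicit. The function name `flatten_journal` and the sample records are assumptions made for this example only.

```python
def flatten_journal(record):
    """Copy a record, replacing the nested 'journal' dict with flat keys."""
    flat = {key: value for key, value in record.items() if key != "journal"}
    journal = record.get("journal") or {}  # absent for non-journal items
    flat["journal.id"] = journal.get("id")
    flat["journal.title"] = journal.get("title")
    return flat

# Hypothetical records shaped like the API results
sample_records = [
    {"id": "pub.X", "type": "article",
     "journal": {"id": "jour.1", "title": "Journal One"}},
    {"id": "pub.Y", "type": "proceeding"},  # no journal information
]

flat_records = [flatten_journal(record) for record in sample_records]
print(flat_records)
```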
3.1 Adding the citations counts¶
We achieve this by joining this data with the citation counts we extracted before, that is, the citations dataframe.
Note: if there are a lot of publications, this step can take some time.
[12]:
pubs_cited = pubs_cited.merge(citations, left_on='id', right_on='reference_ids')
pubs_cited.head(10)
[12]:
 | doi | id | publisher | type | year | issn | journal.id | journal.title | reference_ids | size |
---|---|---|---|---|---|---|---|---|---|---|
0 | 10.1214/aoms/1177704711 | pub.1042438804 | Institute of Mathematical Statistics | article | 1962.0 | [0003-4851, 2168-8990] | jour.1018844 | The Annals of Mathematical Statistics | pub.1042438804 | 1 |
1 | 10.1109/surv.2014.012214.00007 | pub.1061446928 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1553-877X, 2373-745X] | jour.1139536 | IEEE Communications Surveys & Tutorials | pub.1061446928 | 1 |
2 | 10.1109/surv.2014.032014.00094 | pub.1061446943 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1553-877X, 2373-745X] | jour.1139536 | IEEE Communications Surveys & Tutorials | pub.1061446943 | 1 |
3 | 10.1109/lsp.2014.2351822 | pub.1061378903 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1070-9908, 1558-2361] | jour.1033580 | IEEE Signal Processing Letters | pub.1061378903 | 1 |
4 | 10.1109/mm.2014.61 | pub.1061408931 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [0272-1732, 1937-4143] | jour.1125669 | IEEE Micro | pub.1061408931 | 1 |
5 | 10.1109/mits.2014.2343262 | pub.1061407712 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1939-1390, 1941-1197] | jour.1140577 | IEEE Intelligent Transportation Systems Magazine | pub.1061407712 | 1 |
6 | 10.1109/lsp.2014.2334306 | pub.1061378828 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1070-9908, 1558-2361] | jour.1033580 | IEEE Signal Processing Letters | pub.1061378828 | 1 |
7 | 10.1109/mra.2014.2360283 | pub.1061419755 | Institute of Electrical and Electronics Engine... | article | 2014.0 | [1070-9932, 1558-223X] | jour.1033567 | IEEE Robotics & Automation Magazine | pub.1061419755 | 1 |
8 | 10.1109/msp.2014.107 | pub.1061424107 | Institute of Electrical and Electronics Engine... | article | 2015.0 | [1540-7993, 1558-4046] | jour.1033568 | IEEE Security & Privacy | pub.1061424107 | 1 |
9 | 10.1109/lsp.2015.2393295 | pub.1061379116 | Institute of Electrical and Electronics Engine... | article | 2015.0 | [1070-9908, 1558-2361] | jour.1033580 | IEEE Signal Processing Letters | pub.1061379116 | 1 |
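Since pandas' merge performs an inner join by default, only IDs present on both sides survive the join. The sketch below reproduces that behaviour with a dictionary lookup; all values are invented, and it is meant as an illustration of the join semantics rather than a replacement for the cell above.

```python
# Hypothetical metadata rows and citation counts
metadata = [
    {"id": "pub.X", "publisher": "Publisher One"},
    {"id": "pub.Y", "publisher": "Publisher Two"},
]
citation_counts = {"pub.X": 3, "pub.Y": 1, "pub.Z": 5}  # pub.Z has no metadata

# Inner join: keep only rows whose ID appears on both sides
joined = [
    {**row, "size": citation_counts[row["id"]]}
    for row in metadata
    if row["id"] in citation_counts
]
print(joined)
```

Here pub.Z drops out because no metadata row matches it, which is what happens to any cited ID the second API query did not return.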
4. Journal Analysis¶
Finally, we can analyze the cited publications by grouping them by source journal. This can be achieved easily thanks to pandas’ DataFrame methods.
4.1 Number of Unique journals¶
[13]:
pubs_cited['journal.id'].describe()
[13]:
count 59877
unique 6577
top jour.1017736
freq 779
Name: journal.id, dtype: object
4.2 Most frequent journals¶
[14]:
journals = pubs_cited.value_counts(['journal.title', 'publisher'])
journals = journals.to_frame('citations').reset_index().rename(columns={'journal.title': 'title'})
journals.index.name = 'index'
#preview
journals.head(100)
[14]:
index | title | publisher | citations |
---|---|---|---|
0 | The Journal of Chemical Physics | AIP Publishing | 779 |
1 | ACM Transactions on Graphics | Association for Computing Machinery (ACM) | 680 |
2 | IEEE Transactions on Information Theory | Institute of Electrical and Electronics Engine... | 655 |
3 | Nature | Springer Nature | 594 |
4 | Proceedings of the National Academy of Science... | Proceedings of the National Academy of Sciences | 592 |
... | ... | ... | ... |
95 | Artificial Intelligence | Elsevier | 97 |
96 | Psychological Review | American Psychological Association (APA) | 96 |
97 | Expert Systems with Applications | Elsevier | 96 |
98 | JAMA | American Medical Association (AMA) | 96 |
99 | IEEE Transactions on Multimedia | Institute of Electrical and Electronics Engine... | 95 |
100 rows × 3 columns
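Note that the citations column above counts rows, i.e. the number of distinct cited publications per journal, since pubs_cited holds one row per cited publication. If one wanted instead to weight each publication by how often it was cited (the size column), a possible variation is sketched below in plain Python with invented data. This is an alternative reading of the data, not the method used in this notebook.

```python
from collections import Counter

# Hypothetical rows: one per cited publication, with its citation count ('size')
rows = [
    {"journal.title": "Journal One", "size": 5},
    {"journal.title": "Journal One", "size": 1},
    {"journal.title": "Journal Two", "size": 2},
    {"journal.title": "Journal Two", "size": 2},
    {"journal.title": "Journal Two", "size": 1},
]

distinct = Counter()  # number of distinct cited publications per journal
weighted = Counter()  # total citations per journal, summing 'size'
for row in rows:
    distinct[row["journal.title"]] += 1
    weighted[row["journal.title"]] += row["size"]

print(distinct.most_common())
print(weighted.most_common())
```

In this toy data the two rankings disagree: Journal Two has more distinct cited publications, while Journal One collects more citations overall.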
4.3 Top 50 journals chart, by publisher¶
[15]:
px.bar(journals[:50],
x="title", y="citations", color="publisher",
height=900,
title=f"Top 50 journals cited by {GRIDID} (focus: FoR {FOR_CODE} and time span {YEAR_START}:{YEAR_END})")
4.4 Top 20 journals by year of the cited publication¶
[16]:
THRESHOLD = 20 #@param {type: "slider", min: 10, max: 100}
# suppress empty values
pubs_cited.fillna("-no value-", inplace=True)
# make publications list smaller by only showing top journals
pubs_citing_topjournals = pubs_cited[pubs_cited['journal.title'].isin(list(journals[:THRESHOLD]['title']))].sort_values('journal.title')
# build histogram
px.histogram(pubs_citing_topjournals,
x="year",
color="journal.title",
height=600,
title=f"Top {THRESHOLD} journals cited by {GRIDID} - by year of the cited publication")
Conclusions¶
In this notebook we have shown how to use the Dimensions Analytics API to discover which academic journals are most frequently cited by authors affiliated with a selected research organization.
This only scratches the surface of the possible applications of publication data, but hopefully it’ll give you a few basic tools to get started building your own applications. For more background, see the list of fields available via the Publications API.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.