
Citation Analysis: Journals Cited by a Research Organization

This notebook shows how to use the Dimensions Analytics API to discover which academic journals are most frequently cited by authors affiliated with a selected research organization. These are the steps:

  1. We start from a specific organization's GRID ID (plus a few other parameters of choice)

  2. Using the publications API, we extract all publications authored by researchers at that institution. For each publication, we store the IDs of all outgoing citations, using the reference_ids field

  3. We query the API again to obtain other useful metadata for those outgoing citations, e.g. title, publisher, journal, etc.

  4. We analyse the data, in particular by segmenting it by journal and publisher

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.9.1)
Connected to: https://app.dimensions.ai - DSL v1.31
Method: dsl.ini file

A couple of utilities to simplify exporting the results we find as CSV files:
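A minimal sketch of such a helper follows (save_as_csv is our own name, not a Dimcli utility): it saves any pandas DataFrame to a CSV file inside an exports/ folder, creating the folder if needed.

[ ]:
# a sketch of a CSV-export helper (save_as_csv is hypothetical, not part of Dimcli)
import os

def save_as_csv(df, filename, folder="exports"):
    """Save a pandas DataFrame to CSV, creating the target folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    print("Saved:", path)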

1. Choosing a Research Organization

We can use the organizations API to find the GRID ID for the University of California, Berkeley.

[2]:
%%dsldf

search organizations for "berkeley university" return organizations
Returned Organizations: 1 (total = 1)
Time: 0.59s
[2]:
state_name acronym city_name linkout name country_name types longitude latitude id
0 California UCB Berkeley [http://www.berkeley.edu/] University of California, Berkeley United States [Education] -122.258575 37.87216 grid.47840.3f

The ID we are looking for is grid.47840.3f.

1.1 Selecting a Field of Research ID

Similarly, we can use the API to identify the most relevant Field of Research (FoR) categories for publications from this organization.

By using a specific FoR category we can make the subsequent data extraction & analysis a bit more focused.

[3]:
%%dsldf

search publications
    where research_orgs.id = "grid.47840.3f"
    return category_for limit 10
Returned Category_for: 10
Time: 0.80s
[3]:
id count name
0 2206 39983 06 Biological Sciences
1 2209 35966 09 Engineering
2 2202 34844 02 Physical Sciences
3 2211 30612 11 Medical and Health Sciences
4 2203 20993 03 Chemical Sciences
5 2201 18004 01 Mathematical Sciences
6 2208 15779 08 Information and Computing Sciences
7 2581 15198 0601 Biochemistry and Cell Biology
8 2620 11982 0604 Genetics
9 2217 11622 17 Psychology and Cognitive Sciences

For example, let’s focus on 08 Information and Computing Sciences, ID 2208.

Finally, we can also select a specific year range, e.g. the last five years.

Let’s save all of these variables so that we can reference them later on.

[4]:
GRIDID = "grid.47840.3f" #@param {type:"string"}

FOR_CODE = "2208"  #@param {type:"string"}


#@markdown The start/end year of publications used to extract citations
YEAR_START = 2015 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2021 #@param {type: "slider", min: 1950, max: 2021}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

2. Getting the IDs of the outgoing citations

In this section we use the Publications API to extract the Dimensions IDs of all publications referenced by authors at the selected research organization.

These identifiers can be found in the reference_ids field.

[5]:
publications = dsl.query_iterative(f"""

    search publications
        where research_orgs.id = "{GRIDID}"
        and year in [{YEAR_START}:{YEAR_END}]
        and category_for.id="{FOR_CODE}"
        return publications[id+doi+unnest(reference_ids)]

""")

#
# preview the data
pubs_and_citations = publications.as_dataframe()
pubs_and_citations.head(5)
Starting iteration with limit=1000 skip=0 ...
0-1000 / 4478 (1.18s)
1000-2000 / 4478 (1.10s)
2000-3000 / 4478 (1.17s)
3000-4000 / 4478 (1.17s)
4000-4478 / 4478 (0.98s)
4478-4478 / 4478 (0.86s)
===
Records extracted: 106119
[5]:
id doi reference_ids
0 pub.1137371560 10.1016/j.chb.2021.106814 pub.1104586211
1 pub.1137371560 10.1016/j.chb.2021.106814 pub.1019022681
2 pub.1137371560 10.1016/j.chb.2021.106814 pub.1010603626
3 pub.1137371560 10.1016/j.chb.2021.106814 pub.1007402080
4 pub.1137371560 10.1016/j.chb.2021.106814 pub.1070970563

2.1 Removing duplicates and counting most frequent citations

Since multiple authors/publications from our organization will be referencing the same target publications, we may have various duplicates in our reference_ids column.

So we want to remove those duplicates, while at the same time retaining that information by adding a new size column that counts how frequently a certain publication was cited.

This can be easily achieved using pandas’ groupby function:

[6]:
# consider only IDs column
df = pubs_and_citations[['reference_ids']]
# group by ID and count
citations = df.groupby(df.columns.tolist(),as_index=False).size().sort_values("size", ascending=False)
# preview the data, most cited ID first
citations.head(10)
[6]:
reference_ids size
55446 pub.1093359587 148
62647 pub.1095689025 108
26134 pub.1038140272 81
6600 pub.1009767488 67
35747 pub.1052031051 66
59629 pub.1094727707 64
31157 pub.1045321436 61
56285 pub.1093626237 52
40184 pub.1061179979 52
6794 pub.1010020120 50

3. Enriching the citations IDs with other publication metadata

In this step we use the outgoing citation IDs obtained above to query the publications API again.

The goal is to retrieve more publication metadata, so that we can ‘group’ citations based on criteria of interest, e.g. the journal they belong to. For example:

  • source_title

  • publisher

  • year

  • doi

NOTE Since we can have lots of publications to go through, the list of IDs is chunked into smaller groups, to ensure the resulting API query never gets too long.
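For instance, dimcli's chunks_of helper (already imported above via dimcli.utils) simply yields successive fixed-size slices of a list:

[ ]:
# chunks_of yields successive fixed-size slices of a list
list(chunks_of(list(range(5)), 2))
# expected output: [[0, 1], [2, 3], [4]]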

[7]:

#
# get a list of citation IDs
pubids = list(citations['reference_ids'])


#
# DSL query - NB: change the return statement to extract different metadata of interest
query_template = """search publications
                    where id in {}
                    return publications[id+doi+journal+year+publisher+type+issn]
                    limit 1000"""


#
# loop through all references-publications IDs in chunks and query Dimensions
print(f"===\nExtracting publications data for {len(pubids)} citations...")
results = []
BATCHSIZE = 400
VERBOSE = False # set to True to see extraction logs

for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
    query = query_template.format(json.dumps(chunk))
    data = dsl.query(query, verbose=VERBOSE)
    results += data.publications
    time.sleep(0.5)

#
# save the cited-publications data into a dataframe
pubs_cited = pd.DataFrame(results)
print("===\nCited Publications found: ", len(pubs_cited))


#
# flatten the 'journal' column, because it contains nested data
temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id",
                                                              "title": "journal.title"}).drop([0], axis=1)
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1).sort_values('type')
pubs_cited.head(10)


===
Extracting publications data ...
===
Cited Publications found:  74810
[7]:
doi issn type year id publisher journal.id journal.title
0 10.1126/scirobotics.aau4984 [2470-9476] article 2019.0 pub.1111459810 American Association for the Advancement of Sc... jour.1291981 Science Robotics
62672 10.1103/physrevlett.67.2339 [0031-9007, 1079-7114] article 1991.0 pub.1060803371 American Physical Society (APS) jour.1018277 Physical Review Letters
62671 10.1109/2.116849 [0018-9162, 1558-0814] article 1991.0 pub.1061105026 Institute of Electrical and Electronics Engine... jour.1122253 Computer
62670 10.1088/0954-898x_3_2_006 [0954-898X, 1361-6536] article 1992.0 pub.1059115905 Taylor & Francis jour.1111966 Network Computation in Neural Systems
62669 10.1088/0954-898x_3_2_009 [0954-898X, 1361-6536] article 1992.0 pub.1059115908 Taylor & Francis jour.1111966 Network Computation in Neural Systems
62668 10.1103/revmodphys.64.1045 [0034-6861, 1539-0756] article 1992.0 pub.1060839239 American Physical Society (APS) jour.1018362 Reviews of Modern Physics
62667 10.1109/18.119725 [0018-9448, 1557-9654] article 1992.0 pub.1061098596 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory
62666 10.1109/18.119732 [0018-9448, 1557-9654] article 1992.0 pub.1061098603 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory
62665 10.1109/18.119733 [0018-9448, 1557-9654] article 1992.0 pub.1061098604 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory
62664 10.1109/18.119749 [0018-9448, 1557-9654] article 1992.0 pub.1061098620 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory

3.1 Adding the citations counts

We achieve this by joining this data with the data we extracted before, that is, the citations dataframe.

Note: if there are a lot of publications, this step can take some time.

[8]:
pubs_cited = pubs_cited.merge(citations, left_on='id', right_on='reference_ids')

pubs_cited.head(10)
[8]:
doi issn type year id publisher journal.id journal.title reference_ids size
0 10.1126/scirobotics.aau4984 [2470-9476] article 2019.0 pub.1111459810 American Association for the Advancement of Sc... jour.1291981 Science Robotics pub.1111459810 10
1 10.1103/physrevlett.67.2339 [0031-9007, 1079-7114] article 1991.0 pub.1060803371 American Physical Society (APS) jour.1018277 Physical Review Letters pub.1060803371 1
2 10.1109/2.116849 [0018-9162, 1558-0814] article 1991.0 pub.1061105026 Institute of Electrical and Electronics Engine... jour.1122253 Computer pub.1061105026 1
3 10.1088/0954-898x_3_2_006 [0954-898X, 1361-6536] article 1992.0 pub.1059115905 Taylor & Francis jour.1111966 Network Computation in Neural Systems pub.1059115905 1
4 10.1088/0954-898x_3_2_009 [0954-898X, 1361-6536] article 1992.0 pub.1059115908 Taylor & Francis jour.1111966 Network Computation in Neural Systems pub.1059115908 1
5 10.1103/revmodphys.64.1045 [0034-6861, 1539-0756] article 1992.0 pub.1060839239 American Physical Society (APS) jour.1018362 Reviews of Modern Physics pub.1060839239 1
6 10.1109/18.119725 [0018-9448, 1557-9654] article 1992.0 pub.1061098596 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory pub.1061098596 1
7 10.1109/18.119732 [0018-9448, 1557-9654] article 1992.0 pub.1061098603 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory pub.1061098603 1
8 10.1109/18.119733 [0018-9448, 1557-9654] article 1992.0 pub.1061098604 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory pub.1061098604 1
9 10.1109/18.119749 [0018-9448, 1557-9654] article 1992.0 pub.1061098620 Institute of Electrical and Electronics Engine... jour.1124767 IEEE Transactions on Information Theory pub.1061098620 1

4. Journal Analysis

Finally, we can analyze the cited publications by grouping them by source journal. This can be achieved easily thanks to pandas’ DataFrame methods.

4.1 Number of unique journals

[9]:
pubs_cited['journal.id'].describe()
[9]:
count            50974
unique            6037
top       jour.1031287
freq               586
Name: journal.id, dtype: object
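
The top value above is the journal ID cited most often. Since the titles are already in our dataframe, we can resolve it locally (as the next section confirms, it is ACM Transactions on Graphics):

[ ]:
# resolve the most frequent journal ID to its title, using data we already have
top_id = pubs_cited['journal.id'].describe()['top']
pubs_cited.loc[pubs_cited['journal.id'] == top_id, 'journal.title'].iloc[0]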

4.2 Most frequent journals

[10]:
journals = pubs_cited.value_counts(['journal.title', 'publisher'])
journals = journals.to_frame().reset_index().rename(columns= {0: 'citations', 'journal.title' : 'title'})
journals.index.name = 'index'

#preview
journals.head(100)
[10]:
title publisher citations
index
0 ACM Transactions on Graphics Association for Computing Machinery (ACM) 586
1 IEEE Transactions on Information Theory Institute of Electrical and Electronics Engine... 583
2 Nature Springer Nature 498
3 Proceedings of the National Academy of Science... Proceedings of the National Academy of Sciences 494
4 Science American Association for the Advancement of Sc... 486
... ... ... ...
95 Chemical Physics Letters Elsevier 80
96 2009 IEEE Intelligent Vehicles Symposium Institute of Electrical and Electronics Engine... 80
97 ACM SIGOPS Operating Systems Review Association for Computing Machinery (ACM) 79
98 Cognition Elsevier 79
99 JAMA American Medical Association (AMA) 79

100 rows × 3 columns
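
If you want to keep a copy of this ranking, the hypothetical save_as_csv helper sketched at the top of the notebook can be applied directly:

[ ]:
# export the journals ranking as CSV (save_as_csv is our own sketch, not a Dimcli utility)
save_as_csv(journals, "journals_cited.csv")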

4.3 Top 50 journals chart, by publisher

[14]:
px.bar(journals[:50],
       x="title", y="citations", color="publisher",
       height=900,
       title=f"Top 50 journals cited by {GRIDID} (focus: FoR {FOR_CODE} and time span {YEAR_START}:{YEAR_END})")

4.4 Top 20 journals by year of the cited publication

[15]:

THRESHOLD = 20  #@param {type: "slider", min: 10, max: 100}

# replace empty values, so they show up as their own category in the chart
pubs_cited.fillna("-no value-", inplace=True)

# reduce the publications list by keeping only the top journals
pubs_cited_topjournals = pubs_cited[pubs_cited['journal.title'].isin(list(journals[:THRESHOLD]['title']))].sort_values('journal.title')

# build histogram
px.histogram(pubs_cited_topjournals,
             x="year",
             color="journal.title",
             height=600,
             title=f"Top {THRESHOLD} journals cited by {GRIDID} - by year of the cited publication")

Conclusions

In this notebook we have shown how to use the Dimensions Analytics API to discover which academic journals are most frequently cited by authors affiliated with a selected research organization.

This only scratches the surface of the possible applications of publication data, but hopefully it’ll give you a few basic tools to get started building your own applications. For more background, see the list of fields available via the Publications API.
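
If you want to explore those fields interactively, the DSL describe command returns the schema for a source; a quick way to preview it (the exact shape of the response may vary across DSL versions):

[ ]:
# preview the schema description for the publications source
res = dsl.query("describe source publications")
print(json.dumps(res.json, indent=2)[:500])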



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
