Citation Analysis: an Introduction¶
This notebook shows how to extract citation data using the Dimensions Analytics API.
Two approaches are considered: one that is best suited for smaller analyses, and one that is more query-efficient and hence better suited for analyses involving many publications.
[3]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Aug 08, 2023
==
1. Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[4]:
!pip install dimcli pyvis -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.7
Method: dsl.ini file
Method A: Getting citations for one publication at a time¶
By using the field reference_ids, we can easily look up citations for individual publications (= incoming links). For example, here are the papers citing “pub.1053279155”:
[5]:
%dsldf search publications where reference_ids in [ "pub.1053279155" ] return publications[id+doi+title+year]
Returned Publications: 7 (total = 7)
Time: 6.00s
[5]:
 | id | title | doi | year |
---|---|---|---|---|
0 | pub.1148271626 | Capturing the Semantics of Smell: The Odeuropa... | 10.1007/978-3-031-06981-9_23 | 2022 |
1 | pub.1103275659 | Towards ontology-based multilingual URL filter... | 10.1007/s11227-018-2338-1 | 2018 |
2 | pub.1068603272 | Metody sztucznej inteligencji w digitalizacji ... | 10.18290/rkult.2016.7.1-3 | 2016 |
3 | pub.1005502446 | Challenges for Ontological Engineering in the ... | 10.1007/978-3-319-24129-6_3 | 2015 |
4 | pub.1012651711 | Das Experteninterview als zentrale Methode der... | 10.1515/iwp-2015-0057 | 2015 |
5 | pub.1008922470 | Transforming a Flat Metadata Schema to a Seman... | 10.1007/978-3-642-24809-2_10 | 2012 |
6 | pub.1053157726 | Practice-Based Ontologies: A New Approach to A... | 10.1007/978-3-642-24731-6_38 | 2011 |
Let’s try another paper, i.e. “pub.1103275659” - in this case there are 18 citations.
[6]:
%dsldf search publications where reference_ids in [ "pub.1103275659" ] return publications[id+doi+title+year]
Returned Publications: 18 (total = 18)
Time: 0.97s
[6]:
 | id | title | doi | year |
---|---|---|---|---|
0 | pub.1158592531 | DSpamOnto: An Ontology Modelling for Domain-Sp... | 10.3390/bdcc7020109 | 2023 |
1 | pub.1157326322 | Spam Detection and Fake User Identification in... | 10.48175/ijarsct-9178 | 2023 |
2 | pub.1156046866 | Filtering objectionable information access bas... | 10.1177/01655515231160041 | 2023 |
3 | pub.1155856205 | DeNet_SVM: Product Based Recommendation System... | 10.1109/pdgc56933.2022.10053122 | 2022 |
4 | pub.1152595729 | Ground Truth Dataset: Objectionable Web Content | 10.3390/data7110153 | 2022 |
5 | pub.1140956277 | Towards Aspect Based Components Integration Fr... | 10.32604/cmc.2022.018779 | 2021 |
6 | pub.1139784498 | Requirement prioritization framework using cas... | 10.1111/exsy.12770 | 2021 |
7 | pub.1139789625 | Text Mining in Cybersecurity | 10.1145/3462477 | 2021 |
8 | pub.1136536359 | A Perceptive Fake User Detection and Visualiza... | 10.1007/978-981-15-8685-9_44 | 2021 |
9 | pub.1135354806 | A preliminary study of cyber parental control ... | 10.1109/ains50155.2020.9315134 | 2020 |
10 | pub.1132401390 | Foreground detection using motion histogram th... | 10.1007/s00530-020-00676-3 | 2020 |
11 | pub.1128314811 | Cyber parental control: A bibliometric study | 10.1016/j.childyouth.2020.105134 | 2020 |
12 | pub.1125691748 | OBAC: towards agent-based identification and c... | 10.1007/s11042-020-08764-2 | 2020 |
13 | pub.1125056530 | Calculating Trust Using Multiple Heterogeneous... | 10.1155/2020/8545128 | 2020 |
14 | pub.1113878770 | Perception layer security in Internet of Things | 10.1016/j.future.2019.04.038 | 2019 |
15 | pub.1115224509 | Spammer Detection and Fake User Identification... | 10.1109/access.2019.2918196 | 2019 |
16 | pub.1109815383 | A Fault Tolerant Approach for Malicious URL Fi... | 10.1109/isncc.2018.8530984 | 2018 |
17 | pub.1107354292 | Social Internet of Vehicles: Complexity, Adapt... | 10.1109/access.2018.2872928 | 2018 |
Using this simple approach, if we start with a list of publications (our ‘seed’), we can set up a simple loop that goes through all of them and launches a ‘get-citations’ query for each one.
TIP The json.dumps function easily transforms a list of objects into a string that can be used directly in our query, e.g.
> json.dumps(seed)
'["pub.1053279155", "pub.1103275659"]'
[7]:
seed = [ "pub.1053279155" , "pub.1103275659"]
q = """search publications where reference_ids in [{}] return publications[id+doi+title+year]"""
results = {}
for p in seed:
    data = dsl.query(q.format(json.dumps(p)))
    results[p] = [x['id'] for x in data.publications]
Returned Publications: 7 (total = 7)
Time: 5.96s
Returned Publications: 18 (total = 18)
Time: 0.72s
[8]:
results
[8]:
{'pub.1053279155': ['pub.1148271626',
'pub.1103275659',
'pub.1068603272',
'pub.1005502446',
'pub.1012651711',
'pub.1008922470',
'pub.1053157726'],
'pub.1103275659': ['pub.1158592531',
'pub.1157326322',
'pub.1156046866',
'pub.1155856205',
'pub.1152595729',
'pub.1140956277',
'pub.1139784498',
'pub.1139789625',
'pub.1136536359',
'pub.1135354806',
'pub.1132401390',
'pub.1128314811',
'pub.1125691748',
'pub.1125056530',
'pub.1113878770',
'pub.1115224509',
'pub.1109815383',
'pub.1107354292']}
Comments about this method¶
- this approach is straightforward and quick, but it is best used with small datasets
- we create one query per publication (and so on, for an N-degree network; see the sketch after this list)
- if you have lots of publications, this leads to lots of queries, which may not be very efficient
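As an illustration, here is a minimal sketch of how Method A extends to a second level, reusing the q template and the results dictionary from the cells above; the names level1_papers and results_level2 are introduced here just for the example.

```python
# a minimal sketch: extend Method A to a second level by re-running the same
# per-publication query on every citing paper found at level 1
# (reuses `q`, `results`, `dsl`, `json` and `time` from the cells above)
level1_papers = sorted({pid for citing in results.values() for pid in citing})

results_level2 = {}
for p in level1_papers:
    data = dsl.query(q.format(json.dumps(p)))
    results_level2[p] = [x['id'] for x in data.publications]
    time.sleep(1)  # be gentle with the API when issuing many queries in a row
```

Since the number of queries grows with every level, this is precisely the situation where Method B below becomes preferable.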
Method B: Getting citations for multiple publications via a single query¶
We can use the same query template, but instead of looking for a single publication ID we can put multiple IDs in a list.
So if we combine the two citation lists for “pub.1053279155” and “pub.1103275659”, we will get 7 + 18 = 25 results in total.
However, it is then down to us to figure out which paper is citing which!
[9]:
%dsldf search publications where reference_ids in [ "pub.1053279155" , "pub.1103275659"] return publications[id+doi+title+year]
Returned Publications: 20 (total = 25)
Time: 5.26s
[9]:
 | id | title | doi | year |
---|---|---|---|---|
0 | pub.1158592531 | DSpamOnto: An Ontology Modelling for Domain-Sp... | 10.3390/bdcc7020109 | 2023 |
1 | pub.1157326322 | Spam Detection and Fake User Identification in... | 10.48175/ijarsct-9178 | 2023 |
2 | pub.1156046866 | Filtering objectionable information access bas... | 10.1177/01655515231160041 | 2023 |
3 | pub.1155856205 | DeNet_SVM: Product Based Recommendation System... | 10.1109/pdgc56933.2022.10053122 | 2022 |
4 | pub.1152595729 | Ground Truth Dataset: Objectionable Web Content | 10.3390/data7110153 | 2022 |
5 | pub.1148271626 | Capturing the Semantics of Smell: The Odeuropa... | 10.1007/978-3-031-06981-9_23 | 2022 |
6 | pub.1140956277 | Towards Aspect Based Components Integration Fr... | 10.32604/cmc.2022.018779 | 2021 |
7 | pub.1139784498 | Requirement prioritization framework using cas... | 10.1111/exsy.12770 | 2021 |
8 | pub.1139789625 | Text Mining in Cybersecurity | 10.1145/3462477 | 2021 |
9 | pub.1136536359 | A Perceptive Fake User Detection and Visualiza... | 10.1007/978-981-15-8685-9_44 | 2021 |
10 | pub.1135354806 | A preliminary study of cyber parental control ... | 10.1109/ains50155.2020.9315134 | 2020 |
11 | pub.1132401390 | Foreground detection using motion histogram th... | 10.1007/s00530-020-00676-3 | 2020 |
12 | pub.1128314811 | Cyber parental control: A bibliometric study | 10.1016/j.childyouth.2020.105134 | 2020 |
13 | pub.1125691748 | OBAC: towards agent-based identification and c... | 10.1007/s11042-020-08764-2 | 2020 |
14 | pub.1125056530 | Calculating Trust Using Multiple Heterogeneous... | 10.1155/2020/8545128 | 2020 |
15 | pub.1113878770 | Perception layer security in Internet of Things | 10.1016/j.future.2019.04.038 | 2019 |
16 | pub.1115224509 | Spammer Detection and Fake User Identification... | 10.1109/access.2019.2918196 | 2019 |
17 | pub.1109815383 | A Fault Tolerant Approach for Malicious URL Fi... | 10.1109/isncc.2018.8530984 | 2018 |
18 | pub.1103275659 | Towards ontology-based multilingual URL filter... | 10.1007/s11227-018-2338-1 | 2018 |
19 | pub.1107354292 | Social Internet of Vehicles: Complexity, Adapt... | 10.1109/access.2018.2872928 | 2018 |
In order to resolve the citation data we got above, we must also extract the full references of each citing paper (by including reference_ids in the results) and then recreate the citation graph programmatically. E.g.:
[10]:
seed = [ "pub.1053279155" , "pub.1103275659"]
[11]:
data = dsl.query(f"""search publications where reference_ids in {json.dumps(seed)} return publications[id+doi+title+year+reference_ids]""")
Returned Publications: 20 (total = 25)
Time: 0.96s
[12]:
def build_network_dict(seed, pubs_list):
    network = {x: [] for x in seed}  # initialize the dictionary with the seed publications
    for pub in pubs_list:
        for key in network:
            if pub.get('reference_ids') and key in pub['reference_ids']:
                network[key].append(pub['id'])
    return network
A simple way to represent the citation network is a dictionary mapping each cited paper to the list of papers citing it, i.e. 'cited_paper' : [citing_papers]
[13]:
network1 = build_network_dict(seed, data.publications)
network1
[13]:
{'pub.1053279155': ['pub.1148271626', 'pub.1103275659'],
'pub.1103275659': ['pub.1158592531',
'pub.1157326322',
'pub.1156046866',
'pub.1155856205',
'pub.1152595729',
'pub.1140956277',
'pub.1139784498',
'pub.1139789625',
'pub.1136536359',
'pub.1135354806',
'pub.1132401390',
'pub.1128314811',
'pub.1125691748',
'pub.1125056530',
'pub.1113878770',
'pub.1115224509',
'pub.1109815383',
'pub.1107354292']}
Creating a second-level citation network¶
Let’s now create a second-level citation network!
This means going through all the publications citing the two seed papers and getting all of their citing publications as well.
[14]:
all_citing_papers = []
for x in network1.values():
    all_citing_papers += x
all_citing_papers = list(set(all_citing_papers))
[15]:
all_citing_papers
[15]:
['pub.1113878770',
'pub.1109815383',
'pub.1132401390',
'pub.1148271626',
'pub.1115224509',
'pub.1125691748',
'pub.1136536359',
'pub.1139784498',
'pub.1107354292',
'pub.1155856205',
'pub.1158592531',
'pub.1156046866',
'pub.1125056530',
'pub.1139789625',
'pub.1152595729',
'pub.1157326322',
'pub.1103275659',
'pub.1140956277',
'pub.1135354806',
'pub.1128314811']
Now let’s extract the network structure as previously done
[16]:
data2 = dsl.query(f"""search publications where reference_ids in {json.dumps(all_citing_papers)} return publications[id+doi+title+year+reference_ids]""")
network2 = build_network_dict(all_citing_papers, data2.publications)
network2
Returned Publications: 20 (total = 276)
Time: 4.97s
[16]:
{'pub.1113878770': ['pub.1154834771',
'pub.1160605234',
'pub.1156453470',
'pub.1156130427',
'pub.1159827875',
'pub.1159316950'],
'pub.1109815383': [],
'pub.1132401390': [],
'pub.1148271626': [],
'pub.1115224509': ['pub.1162838256',
'pub.1160759729',
'pub.1162965229',
'pub.1160136432',
'pub.1160002777',
'pub.1159966406'],
'pub.1125691748': [],
'pub.1136536359': [],
'pub.1139784498': ['pub.1160542696'],
'pub.1107354292': [],
'pub.1155856205': ['pub.1160502736'],
'pub.1158592531': [],
'pub.1156046866': [],
'pub.1125056530': [],
'pub.1139789625': ['pub.1159839140', 'pub.1159700355'],
'pub.1152595729': ['pub.1160624422'],
'pub.1157326322': [],
'pub.1103275659': ['pub.1158592531'],
'pub.1140956277': [],
'pub.1135354806': [],
'pub.1128314811': ['pub.1160022332', 'pub.1158432245']}
Finally, we can merge the two levels into one single dataset. Note that dict(network1, **network2) keeps the values from network2 whenever a key appears in both dictionaries (e.g. 'pub.1103275659' below); a union-style merge is sketched after the output. Also note that the query above returned only the first 20 of 276 records, so the second-level network shown here is partial (see the pagination notes in the final considerations).
[17]:
final = dict(network1, **network2 )
final
[17]:
{'pub.1053279155': ['pub.1148271626', 'pub.1103275659'],
'pub.1103275659': ['pub.1158592531'],
'pub.1113878770': ['pub.1154834771',
'pub.1160605234',
'pub.1156453470',
'pub.1156130427',
'pub.1159827875',
'pub.1159316950'],
'pub.1109815383': [],
'pub.1132401390': [],
'pub.1148271626': [],
'pub.1115224509': ['pub.1162838256',
'pub.1160759729',
'pub.1162965229',
'pub.1160136432',
'pub.1160002777',
'pub.1159966406'],
'pub.1125691748': [],
'pub.1136536359': [],
'pub.1139784498': ['pub.1160542696'],
'pub.1107354292': [],
'pub.1155856205': ['pub.1160502736'],
'pub.1158592531': [],
'pub.1156046866': [],
'pub.1125056530': [],
'pub.1139789625': ['pub.1159839140', 'pub.1159700355'],
'pub.1152595729': ['pub.1160624422'],
'pub.1157326322': [],
'pub.1140956277': [],
'pub.1135354806': [],
'pub.1128314811': ['pub.1160022332', 'pub.1158432245']}
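If you would rather keep the citing papers from both levels when a key appears in both dictionaries (instead of letting network2 overwrite network1, as dict(network1, **network2) does), a minimal union-style merge could look like this; final_merged is a hypothetical name introduced for the example:

```python
from collections import defaultdict

# merge the two levels, taking the union of the citing lists for keys that
# appear in both dictionaries (e.g. 'pub.1103275659')
merged = defaultdict(list)
for level in (network1, network2):
    for cited, citing in level.items():
        merged[cited] = sorted(set(merged[cited]) | set(citing))

final_merged = dict(merged)
```

Either dictionary can then be fed to the visualization step below.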
Building a Simple Dataviz¶
We can build a simple visualization using the excellent pyvis library. (A custom version of pyvis, called NetworkViz, is also available in dimcli.core.extras; it only fixes a bug that prevents pyvis graphs from being displayed in Google Colab. In this notebook we use the standard pyvis library.)
[18]:
# load the pyvis Network class
from pyvis.network import Network
[19]:
net = Network(notebook=True, width="100%", height="800px", cdn_resources="remote",
              neighborhood_highlight=True,
              select_menu=True)
nodes = []
for x in final:
    nodes.append(x)
    nodes += final[x]
nodes = list(set(nodes))
net.add_nodes(nodes)  # node ids and labels are the publication IDs
for x in final:
    for target in final[x]:
        net.add_edge(x, target)
net.show("citation.html")
citation.html
[19]:
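If the inline rendering does not show up in your environment (this can vary across Jupyter, JupyterLab and Colab setups), a simple workaround is to write the graph to a standalone HTML file and open it in a browser; the filename below is just an example:

```python
# write the interactive graph to a standalone HTML file (no notebook required)
net.save_graph("citation_network.html")
print("Open citation_network.html in your browser to explore the graph")
```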
Final considerations¶
Querying for more than 1000 results¶
Each API query can return a maximum of 1000 records, so you must use the limit/skip syntax to get more.
See the paginating results section in the docs for more info.
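As a minimal sketch, the loop below pages through the citing publications of the two seed papers 1000 records at a time (1000 being the DSL page-size maximum); q_paged, all_pubs and skip are names introduced here for the example.

```python
# a minimal sketch: page through all citing publications using 'limit ... skip ...'
seed = ["pub.1053279155", "pub.1103275659"]
q_paged = """search publications where reference_ids in {}
             return publications[id+doi+title+year+reference_ids] limit 1000 skip {}"""

all_pubs, skip = [], 0
while True:
    batch = dsl.query(q_paged.format(json.dumps(seed), skip))
    all_pubs += batch.publications
    if len(batch.publications) < 1000:
        break
    skip += 1000

print(len(all_pubs), "citing publications retrieved")
```

Dimcli also offers a query_iterative helper that performs this kind of paging for you; see the dimcli documentation for details.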
Querying for more than 50K results¶
Even with limit/skip, one can only download 50k records per single query.
So if your list of publication IDs is getting too long (e.g. > 300), consider splitting the list into chunks and creating an extra loop to go through all of them without hitting the upper limit (see the sketch below).
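For instance, a minimal chunking sketch (the chunk size of 300 is an arbitrary choice, and each chunked query may itself need the limit/skip loop shown above):

```python
# a minimal sketch: split a long list of publication IDs into chunks and run
# one citation query per chunk, accumulating the results
def chunks(lst, size=300):
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

all_results = []
for chunk in chunks(all_citing_papers, 300):
    data = dsl.query(f"""search publications where reference_ids in {json.dumps(chunk)}
                         return publications[id+doi+title+year+reference_ids] limit 1000""")
    all_results += data.publications
    time.sleep(1)  # avoid hammering the API
```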
Dealing with highly cited publications¶
Some publications can have lots of citations: for example, here is a single paper with 200K+ citations: https://app.dimensions.ai/details/publication/pub.1076750128
That’s quite an exceptional case, but there are several publications with more than 10k citations each. When you encounter such cases, you will hit the 50k limit pretty quickly, so you need to keep an eye out for them and possibly ‘slice’ the data in different ways, e.g. by year or journal, so as to get fewer results per query (see the sketch below).
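One way to do this is to add a year filter to the citation query, so that each slice stays well below the limit. A minimal sketch, using the highly cited paper mentioned above (the year range is illustrative, and each yearly slice may still need pagination):

```python
# a minimal sketch: retrieve citations of a highly cited paper one year at a time
highly_cited = "pub.1076750128"
citing_by_year = {}
for year in range(2018, 2024):
    data = dsl.query(f"""search publications
                         where reference_ids in ["{highly_cited}"] and year={year}
                         return publications[id+year] limit 1000""")
    citing_by_year[year] = data.publications
    time.sleep(1)
```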
Pre-checking citation counts¶
The times_cited and recent_citations fields of publications can be used to check how many citations a paper has (note: recent_citations counts only citations received in the last two years).
So, by using these aggregated figures, it is possible to get a feel for the amount of citation data we’ll have to deal with before setting up a proper data-extraction pipeline.
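For example, here is a quick pre-check on the two seed papers used in this notebook (a minimal sketch; as_dataframe simply renders the results as a table):

```python
# a minimal sketch: check citation counts before setting up a full extraction pipeline
seed = ["pub.1053279155", "pub.1103275659"]
counts = dsl.query(f"""search publications where id in {json.dumps(seed)}
                       return publications[id+times_cited+recent_citations]""")
counts.as_dataframe()
```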
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.