Building a concepts co-occurrence network¶
This Python notebook shows how to use the Dimensions Analytics API to extract concepts from publications and use them to generate a ‘topics map’ based on co-occurrence information. For more background on concepts, see also the Working with concepts in the Dimensions API tutorial.
Note this tutorial is best experienced using Google Colab.
[5]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial. The networkx and pyvis libraries are used for generating and visualizing the network, respectively.
[1]:
!pip install dimcli plotly networkx pyvis jsonpickle -U --quiet
import dimcli
from dimcli.utils import *
from dimcli.utils.networkviz import NetworkViz # custom version of pyvis - colab-compatible
import json
import sys
import pandas as pd
import networkx as nx
import plotly.express as px
import itertools
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
Step 1: Creating a dataset¶
We start by creating a dataset consisting of the 1000 most cited publications matching a chosen keyword.
The API query below will return a list of documents including all the related concepts.
Tip: try changing the query keyword in order to experiment with different results.
[2]:
#@markdown Enter the keyword used to seed the search
KEYWORD = "Semantic Web" #@param {type:"string"}
q = f"""search publications
for "\\"{KEYWORD}\\""
return publications[id+title+concepts_scores]
sort by times_cited limit 1000"""
data = dsl.query(q)
concepts = data.as_dataframe_concepts()
print("Total concepts:", len(concepts))
print("Concepts score average", concepts['score_avg'].mean())
concepts.head()
Returned Publications: 1000 (total = 166253)
Time: 3.55s
Total concepts: 50150
Concepts score average 0.3988755936191426
[2]:
 | id | title | concepts_count | concept | score | frequency | score_avg |
---|---|---|---|---|---|---|---|
0 | pub.1056527616 | The Semantic Web | 2 | Web | 0.069 | 159 | 0.46508 |
1 | pub.1056527616 | The Semantic Web | 2 | Semantic Web | 0.006 | 81 | 0.65509 |
2 | pub.1010449058 | Social Network Sites: Definition, History, and... | 31 | special theme section | 0.283 | 1 | 0.28300 |
3 | pub.1010449058 | Social Network Sites: Definition, History, and... | 31 | scholarship | 0.262 | 2 | 0.43800 |
4 | pub.1010449058 | Social Network Sites: Definition, History, and... | 31 | theme section | 0.261 | 1 | 0.26100 |
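Before building the network, it can help to glance at which concepts are most frequent once low-score, low-frequency ones are filtered out. The snippet below is an optional sanity check (the thresholds are illustrative and mirror the ones used in Step 2):

# Optional: preview the most frequent concepts that would pass the Step 2 thresholds
preview = (
    concepts.query("score_avg >= 0.6 & frequency >= 4")
            .drop_duplicates("concept")
            .sort_values("frequency", ascending=False)
)
print(preview[["concept", "frequency", "score_avg"]].head(10))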
Step 2: Building a concepts co-occurrence network¶
Each publication in our dataset includes a list of related concepts. In order to build a concepts co-occurrence network, we simply add an edge between concepts that appear in the same document.
Edges get a default weight of 1; each time the same two concepts appear together in another document, we increase that weight by 1 (the small sketch after the note below illustrates the counting).
Note: the resulting network can be very large, so in order to make it smaller (and more relevant), we can filter out less interesting concepts in two ways:
- by setting a frequency and score_avg threshold (as shown in the Working with concepts in the Dimensions API tutorial)
- by keeping only nodes that have at least one edge with weight >= MIN_EDGE_WEIGHT (that is, concepts that co-occur with others more often)
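To make the weight logic concrete, here is a minimal, self-contained sketch of the same counting idea applied to three made-up concept lists (the documents and concepts are hypothetical, purely for illustration):

import itertools
from collections import Counter

# hypothetical concept lists for three documents
docs = [
    ["Semantic Web", "ontology", "RDF"],
    ["Semantic Web", "ontology"],
    ["ontology", "RDF"],
]

pair_counts = Counter()
for doc in docs:
    # every unordered pair of concepts in the same document counts as one co-occurrence
    for a, b in itertools.combinations(sorted(doc), 2):
        pair_counts[(a, b)] += 1

print(pair_counts)
# expected weights: ('Semantic Web', 'ontology') -> 2, ('RDF', 'ontology') -> 2, ('RDF', 'Semantic Web') -> 1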
[3]:
G = nx.Graph() # networkX instance
#
# TIP play with these parameters in order to generate different types of networks
#
MIN_CONCEPT_SCORE = 0.6
MIN_CONCEPT_FREQUENCY = 4
MIN_EDGE_WEIGHT = 2
CONCEPTS_SET = concepts.query(f"score_avg >= {MIN_CONCEPT_SCORE} & frequency >= {MIN_CONCEPT_FREQUENCY}")
#
# build nodes from concepts, including score_avg and frequency
# -- NOTE: score_bucket indicates whether the concept's score_avg is above or below the mean score
# -- this value is used in the visualization below to color-code nodes
#
mean_score = CONCEPTS_SET['score_avg'].mean()
for index, row in CONCEPTS_SET.drop_duplicates("concept").iterrows():
    score_bucket = 1 if row['score_avg'] > mean_score else 2
    G.add_node(row['concept'], frequency=row['frequency'], score_avg=row['score_avg'], score_bucket=score_bucket)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
#
# build edges, based on concepts co-occurrence within pubs
# -- calculate a 'weight' based on how often two concepts co-occur
#
print(f".. adding edges from pubs cooccurrence...")
pubs_list = CONCEPTS_SET.drop_duplicates("id")['id'].to_list()
for p in pubs_list:
    concepts_for_this_pub = CONCEPTS_SET[CONCEPTS_SET['id'] == p]['concept'].to_list()
    for a, b in itertools.combinations(concepts_for_this_pub, 2): # all unordered concept pairs
        try:
            G.edges[a, b]['weight'] = G.edges[a, b]['weight'] + 1
        except KeyError:
            G.add_edge(a, b, weight=1)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
#
# this extra step is useful to remove low-weight connections
#
print(f".. cleaning up edges with weight < {MIN_EDGE_WEIGHT}...")
for a, b, w in list(G.edges(data='weight')):
    if w < MIN_EDGE_WEIGHT:
        G.remove_edge(a, b)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
print(f".. removing isolated nodes...")
G.remove_nodes_from(list(nx.isolates(G)))
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
Nodes: 239 Edges: 0
.. adding edges from pubs cooccurrence...
Nodes: 239 Edges: 2115
.. cleaning up edges with weight < 2...
Nodes: 239 Edges: 490
.. removing isolated nodes...
Nodes: 201 Edges: 490
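Before moving on to the visualization, a quick optional check of the filtered network can be useful, e.g. listing the heaviest edges (the concept pairs that co-occur most often) and the most connected nodes. A small sketch, assuming G was built as above:

# strongest co-occurrence links in the filtered network
top_edges = sorted(G.edges(data="weight"), key=lambda e: e[2], reverse=True)[:10]
for a, b, w in top_edges:
    print(f"{a} <-> {b} (weight={w})")

# most connected concepts, usually the central topics of the map
top_nodes = sorted(G.degree(), key=lambda x: x[1], reverse=True)[:10]
print(top_nodes)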
Step 3: Visualizing the network¶
Note
- We’re using a custom version of pyvis, NetworkViz, included in dimcli (imported above from dimcli.utils.networkviz). This custom version fixes a bug that prevents pyvis graphs from being displayed in Google Colab.
- The pyvis from_nx method doesn’t carry through the weight (or any other attribute) from our networkx data, so we set these values manually in a second pass.
- Using score_bucket (see above), we mark the higher-score concepts with a brighter color.
[4]:
viznet = NetworkViz(notebook=True, width="100%", height="800px")
viznet.toggle_hide_edges_on_drag(True)
viznet.barnes_hut()
viznet.repulsion(300)
viznet.heading = f"Concepts co-occurrence for '{KEYWORD}' publications"
# reuse plotly color palette
palette = px.colors.diverging.Temps # 7 colors
viznet.from_nx(G)
# update visual features
for node in viznet.nodes:
    freq = G.nodes[node['label']]['frequency']
    score_avg = G.nodes[node['label']]['score_avg']
    score_bucket = G.nodes[node['label']]['score_bucket'] # get from original network
    node['size'] = freq * 2
    node['color'] = palette[3*score_bucket] # color based on score_bucket (1 or 2)
    node['borderWidthSelected'] = 5
    node['title'] = f"<h4>Concept: '{node['label']}'</h4><hr>Frequency: {freq}<br>Score avg: {score_avg}"
    # print(node)
for edge in viznet.edges:
    # take the edge value from the main network's weight attribute
    edge['value'] = G.edges[edge['from'], edge['to']]['weight']
    # print(edge)
viznet.show("concepts_network.html")
[4]:
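If you prefer to explore or restyle the network in an external tool such as Gephi or Cytoscape, the underlying networkx graph can also be exported; a minimal sketch (the file names are arbitrary):

# export the co-occurrence network for use outside the notebook
nx.write_gexf(G, "concepts_network.gexf")        # e.g. for Gephi
nx.write_graphml(G, "concepts_network.graphml")  # e.g. for yEd / Cytoscape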
Conclusions¶
In this tutorial we have demonstrated how to generate a concepts ‘co-occurrence network diagram’ using the Dimensions API.
The resulting visualization is also available as a standalone file here.
For more information on this topic, see also the official documentation on concepts and the Working with concepts in the Dimensions API tutorial.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.