Building a concepts co-occurrence network¶
This Python notebook shows how to use the Dimensions Analytics API to extract concepts from publications and use them to generate a ‘topics map’ based on co-occurrence information.
The resulting visualization is also available as a standalone file.
For more background on concepts, see also the Working with concepts in the Dimensions API tutorial.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Apr 20, 2023
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial. The networkx and pyvis libraries are used for generating and visualizing the network, respectively.
[2]:
!pip install dimcli plotly networkx pyvis jsonpickle -U --quiet
import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
import networkx as nx
import plotly.express as px
import itertools
from pyvis.network import Network
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file
Step 1: Creating a dataset¶
We start by creating a dataset consisting of the 1000 most-cited publications matching a chosen keyword.
The API query below will return a list of documents including all the related concepts.
Tip: try changing the query keyword in order to experiment with different results.
[3]:
#@markdown Enter the keyword used to seed the search
KEYWORD = "Semantic Web" #@param {type:"string"}
q = f"""search publications
for "\\"{KEYWORD}\\""
return publications[id+title+concepts_scores]
sort by times_cited limit 1000"""
data = dsl.query(q)
concepts = data.as_dataframe_concepts()
print("Total concepts:", len(concepts))
print("Concepts score average", concepts['score_avg'].mean())
concepts.head()
Returned Publications: 1000 (total = 188034)
Time: 3.92s
Total concepts: 51192
Concepts score average 0.3960016883497422
[3]:
| | id | title | concepts_count | concept | score | frequency | score_avg |
|---|---|---|---|---|---|---|---|
| 0 | pub.1056527616 | The Semantic Web | 2 | Web | 0.072 | 136 | 0.45477 |
| 1 | pub.1056527616 | The Semantic Web | 2 | Semantic Web | 0.006 | 66 | 0.65579 |
| 2 | pub.1010449058 | Social Network Sites: Definition, History, and... | 31 | special theme section | 0.280 | 1 | 0.28000 |
| 3 | pub.1010449058 | Social Network Sites: Definition, History, and... | 31 | scholarship | 0.262 | 4 | 0.35000 |
| 4 | pub.1010449058 | Social Network Sites: Definition, History, and... | 31 | theme section | 0.260 | 1 | 0.26000 |
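Before building the network, it can help to eyeball the extracted concepts. A minimal sketch, using only the concepts dataframe returned above:

# List the 10 most frequent concepts, with their average scores,
# to get a feel for the dataset before filtering
top_concepts = concepts.drop_duplicates("concept").sort_values("frequency", ascending=False)
top_concepts[["concept", "frequency", "score_avg"]].head(10)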
Step 2: Building a concepts co-occurrence network¶
Each publication in our dataset includes a list of related concepts. In order to build a concepts co-occurrence network, we simply add an edge between concepts that appear in the same document.
Edges have a default weight of 1; each time the same two concepts appear together in another document, we increase the weight by 1.
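To make the weighting rule concrete, here is a toy sketch with two made-up ‘documents’ (plain concept lists), independent of the real dataset:

import itertools
import networkx as nx

docs = [["web", "semantic web", "ontology"], # made-up toy documents
        ["web", "ontology"]]
toy = nx.Graph()
for doc in docs:
    for a, b in itertools.combinations(doc, 2):
        if toy.has_edge(a, b):
            toy.edges[a, b]["weight"] += 1 # pair seen together again: bump the weight
        else:
            toy.add_edge(a, b, weight=1) # first co-occurrence: weight 1
print(list(toy.edges(data="weight")))
# ('web', 'ontology') ends up with weight 2; the other pairs keep weight 1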
Note: the resulting network can be very large, so in order to make it smaller (and more relevant) we can filter out less interesting concepts in two ways:

- by setting a frequency and score_avg threshold (as shown in the Working with concepts in the Dimensions API tutorial)
- by keeping only nodes that retain at least one edge with weight >= MIN_EDGE_WEIGHT (that is, nodes with stronger connections)
[4]:
G = nx.Graph() # networkX instance
#
# TIP play with these parameters in order to generate different types of networks
#
MIN_CONCEPT_SCORE = 0.6
MIN_CONCEPT_FREQUENCY = 4
MIN_EDGE_WEIGHT = 2
CONCEPTS_SET = concepts.query(f"score_avg >= {MIN_CONCEPT_SCORE} & frequency >= {MIN_CONCEPT_FREQUENCY}")
#
# build nodes from concepts, including score_avg and frequency
# -- NOTE: score_bucket indicates if the concept is above or below the mean_score
# -- this value is used in the visualization below to color-code nodes
#
mean_score = CONCEPTS_SET['score_avg'].mean()
for index, row in CONCEPTS_SET.drop_duplicates("concept").iterrows():
    score_bucket = 1 if row['score_avg'] > mean_score else 2
    G.add_node(row['concept'], frequency=row['frequency'], score_avg=row['score_avg'], score_bucket=score_bucket)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
#
# build edges, based on concepts co-occurrence within pubs
# -- calculate a 'weight' based on how often two concepts co-occur
#
print(".. adding edges from pubs cooccurrence...")
pubs_list = CONCEPTS_SET.drop_duplicates("id")['id'].to_list()
for p in pubs_list:
    concepts_for_this_pub = CONCEPTS_SET[CONCEPTS_SET['id'] == p]['concept'].to_list()
    for a, b in itertools.combinations(concepts_for_this_pub, 2): # all unique concept pairs
        try:
            G.edges[a, b]['weight'] += 1 # pair already seen: bump the weight
        except KeyError:
            G.add_edge(a, b, weight=1) # first co-occurrence
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
#
# this extra step is useful to remove low-weight connections
#
print(f".. cleaning up edges with weight < {MIN_EDGE_WEIGHT}...")
for a, b, w in list(G.edges(data='weight')):
    if w < MIN_EDGE_WEIGHT:
        G.remove_edge(a, b)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
print(".. removing isolated nodes...")
G.remove_nodes_from(list(nx.isolates(G)))
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))
Nodes: 231 Edges: 0
.. adding edges from pubs cooccurrence...
Nodes: 231 Edges: 2006
.. cleaning up edges with weight < 2...
Nodes: 231 Edges: 503
.. removing isolated nodes...
Nodes: 198 Edges: 503
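Before visualizing, a quick sanity check can be useful: listing the heaviest edges shows which concept pairs co-occur most often. A small sketch using the G graph built above:

# Show the 10 strongest co-occurrence links in the filtered network
strongest = sorted(G.edges(data="weight"), key=lambda e: e[2], reverse=True)[:10]
for a, b, w in strongest:
    print(f"{a} <-> {b} (weight: {w})")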
Step 3: Visualizing the network¶
Note
The pyvis library lets us quickly generate network visualizations from a networkx object.

- The pyvis from_nx method doesn't automatically convert node weights or any other value from our network data, so we need to set them manually in a second pass.
- By using the score_bucket value (see above), we can mark the higher-score concepts with a brighter color.
The resulting visualization file might not render inline in all notebook environments; if that's the case, just open the concepts_network.html file in a new browser window (or see an example here).
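If the inline rendering fails, a common workaround (a sketch, assuming a Jupyter-compatible environment) is to embed the generated file explicitly once the cell below has created it:

from IPython.display import IFrame

# Render the pyvis output file inside the notebook
IFrame("concepts_network.html", width="100%", height=900)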
[5]:
viznet = Network(notebook=True,
                 width="100%",
                 height="900px",
                 cdn_resources="remote",
                 neighborhood_highlight=True,
                 select_menu=True,
                 )
viznet.toggle_hide_edges_on_drag(True)
viznet.barnes_hut()
viznet.repulsion(300)
# Double heading bug: https://github.com/WestHealth/pyvis/issues/190
viznet.heading = f"Concepts co-occurrence for '{KEYWORD}' publications"
# reuse plotly color palette
palette = px.colors.diverging.Temps # 7 colors
viznet.from_nx(G)
# update visual features
for node in viznet.nodes:
    freq = G.nodes[node['label']]['frequency']
    score_avg = G.nodes[node['label']]['score_avg']
    score_bucket = G.nodes[node['label']]['score_bucket'] # get from original network
    node['size'] = freq * 2
    node['color'] = palette[3*score_bucket] # get color based on score_bucket (1 or 2)
    node['borderWidthSelected'] = 5
    node['title'] = f"<h4>Concept: '{node['label']}'</h4><hr>Frequency: {freq}<br>Score avg: {score_avg}"
    # print(node)
for edge in viznet.edges:
    # scale each edge by its co-occurrence 'weight' from the networkx graph
    edge['value'] = G.edges[edge['from'], edge['to']]['weight']
    # print(edge)
viznet.show("concepts_network.html")
concepts_network.html
[5]:
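Optionally, the graph itself can be exported for reuse in desktop network tools such as Gephi; a minimal sketch (the GEXF filename is arbitrary):

# Save the networkx graph in GEXF format (readable by Gephi and other tools)
nx.write_gexf(G, "concepts_network.gexf")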
Conclusions¶
In this tutorial we have demonstrated how to generate a concepts ‘co-occurrence network diagram’ using the Dimensions API.
The resulting visualization is also available as a standalone file here.
For more information on this topic, see also the official documentation on concepts and the Working with concepts in the Dimensions API tutorial.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated Github repository for examples, the source code of these tutorials, and much more.