
Building a concepts co-occurrence network

This Python notebook shows how to use the Dimensions Analytics API to extract concepts from publications and use them to generate a ‘topics map’ based on co-occurrence information. For more background on concepts, see also the Working with concepts in the Dimensions API tutorial.

Note: this tutorial is best experienced using Google Colab.

[5]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial. The networkx and pyvis libraries are used for generating and visualizing the network, respectively.

[1]:
!pip install dimcli plotly networkx pyvis jsonpickle  -U --quiet

import dimcli
from dimcli.utils import *
from dimcli.utils.networkviz import NetworkViz # custom version of pyvis - colab-compatible

import json
import sys
import pandas as pd
import networkx as nx
import plotly.express as px
import itertools

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Step 1: Creating a dataset

We start by creating a dataset consisting of the 1,000 most-cited publications matching a chosen keyword.

The API query below will return a list of documents including all the related concepts.

Tip: try changing the query keyword in order to experiment with different results.

[2]:
#@markdown Enter the keyword used to seed the search
KEYWORD = "Semantic Web" #@param {type:"string"}

q = f"""search publications
            for "\\"{KEYWORD}\\""
        return publications[id+title+concepts_scores]
        sort by times_cited limit 1000"""



data = dsl.query(q)
concepts = data.as_dataframe_concepts()
print("Total concepts:", len(concepts))
print("Concepts score average", concepts['score_avg'].mean())
concepts.head()
Returned Publications: 1000 (total = 166253)
Time: 3.55s
Total concepts: 50150
Concepts score average 0.3988755936191426
[2]:
   id              title                                              concepts_count  concept                score  frequency  score_avg
0  pub.1056527616  The Semantic Web                                   2               Web                    0.069  159        0.46508
1  pub.1056527616  The Semantic Web                                   2               Semantic Web           0.006  81         0.65509
2  pub.1010449058  Social Network Sites: Definition, History, and...  31              special theme section  0.283  1          0.28300
3  pub.1010449058  Social Network Sites: Definition, History, and...  31              scholarship            0.262  2          0.43800
4  pub.1010449058  Social Network Sites: Definition, History, and...  31              theme section          0.261  1          0.26100
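
The frequency and score_avg columns describe a concept across the whole dataset rather than within a single publication. The sketch below shows one way such values could be derived, assuming that frequency is the number of publications containing a concept and score_avg is the mean of its per-publication scores. That reading matches the output above, but it is an assumption about what Dimcli computes, not its actual implementation. The publication IDs, concepts, and scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-publication concept scores (IDs and values are made up).
pubs = {
    "pub.A": {"web": 0.2, "ontology": 0.7},
    "pub.B": {"web": 0.6, "ontology": 0.9, "reasoning": 0.5},
    "pub.C": {"web": 0.4},
}

def concept_stats(pubs):
    """Return {concept: (frequency, score_avg)} across all publications."""
    scores = defaultdict(list)
    for concept_scores in pubs.values():
        for concept, score in concept_scores.items():
            scores[concept].append(score)
    # frequency = number of publications, score_avg = mean score (assumed definitions)
    return {c: (len(v), round(mean(v), 5)) for c, v in scores.items()}

stats = concept_stats(pubs)
for concept, (frequency, score_avg) in sorted(stats.items()):
    print(concept, frequency, score_avg)
```

A concept such as "web" in this toy data is frequent but scores low on average, which is the kind of generic term the thresholds in Step 2 are meant to filter out.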

Step 2: Building a concepts co-occurrence network

Each publication in our dataset includes a list of related concepts. In order to build a concepts co-occurrence network, we simply add an edge between concepts that appear in the same document.

An edge starts with a weight of 1. Each additional document in which the same two concepts appear together increases that weight by 1, so the final weight equals the number of documents the two concepts share.
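
As a minimal sketch of this weighting rule, the snippet below counts co-occurrences with the standard library only, independently of networkx. The publication IDs and concept lists are hypothetical, and the function name cooccurrence_weights is introduced here purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publications, each with the concepts attached to it.
pub_concepts = {
    "pub.A": ["ontology", "web", "reasoning"],
    "pub.B": ["web", "ontology"],
    "pub.C": ["web", "linked data"],
}

def cooccurrence_weights(pub_concepts):
    """Count, for each unordered concept pair, the publications containing both."""
    weights = Counter()
    for concepts in pub_concepts.values():
        # sorted(set(...)) drops duplicates and gives every pair one canonical order,
        # so ("web", "ontology") and ("ontology", "web") land on the same key.
        for a, b in combinations(sorted(set(concepts)), 2):
            weights[(a, b)] += 1
    return weights

weights = cooccurrence_weights(pub_concepts)
for pair, weight in weights.most_common():
    print(pair, weight)
```

Here "ontology" and "web" appear together in two publications, so their edge ends up with weight 2, while every other pair keeps the initial weight of 1.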

Note: the resulting network can be very large, so in order to make our network smaller (and more relevant), we can filter out less interesting concepts in two ways:

  • by setting a frequency and score_avg threshold (as shown in the Working with concepts in the Dimensions API tutorial)

  • by keeping only edges whose weight is at least MIN_EDGE_WEIGHT and then dropping the nodes left without any edge (that is, keeping concepts that co-occur with another concept in at least that many documents)
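
The second filter can be sketched on a plain dictionary of edge weights, again without networkx. The edge weights below are made up, and prune is a hypothetical helper name; the notebook code performs the equivalent steps on the graph object itself.

```python
def prune(weights, min_edge_weight):
    """Keep edges at or above the threshold, and only the nodes they touch."""
    kept_edges = {pair: w for pair, w in weights.items() if w >= min_edge_weight}
    # Any concept that no longer appears in a kept edge is isolated and is dropped.
    kept_nodes = {concept for pair in kept_edges for concept in pair}
    return kept_edges, kept_nodes

# Hypothetical co-occurrence weights between concept pairs.
weights = {
    ("ontology", "web"): 3,
    ("ontology", "reasoning"): 2,
    ("reasoning", "web"): 1,
    ("linked data", "sparql"): 1,
}

edges, nodes = prune(weights, min_edge_weight=2)
print(sorted(edges))
print(sorted(nodes))
```

With a threshold of 2, the two weight-1 edges disappear, and "linked data" and "sparql" vanish with them because they have no remaining connection.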

[3]:
G = nx.Graph() # networkX instance

#
# TIP play with these parameters in order to generate different types of networks
#
MIN_CONCEPT_SCORE = 0.6
MIN_CONCEPT_FREQUENCY = 4
MIN_EDGE_WEIGHT = 2

CONCEPTS_SET = concepts.query(f"score_avg >= {MIN_CONCEPT_SCORE} & frequency >=  {MIN_CONCEPT_FREQUENCY}")


#
# build nodes from concepts, including score_avg and frequency
# -- NOTE: score_bucket indicates whether the concept is above or below the mean_score
# -- this value is used in the visualization below to color-code nodes
#
mean_score = CONCEPTS_SET['score_avg'].mean()
for index, row in CONCEPTS_SET.drop_duplicates("concept").iterrows():
    score_bucket = 1 if row['score_avg'] > mean_score else 2
    G.add_node(row['concept'],frequency=row['frequency'], score_avg=row['score_avg'], score_bucket=score_bucket)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

#
# build edges, based on concepts co-occurrence within pubs
# -- calculate a 'weight' based on how often two concepts co-occur
#
print(f".. adding edges from pubs cooccurrence...")
pubs_list = CONCEPTS_SET.drop_duplicates("id")['id'].to_list()

for p in pubs_list:
    concepts_for_this_pub = CONCEPTS_SET[CONCEPTS_SET['id'] == p]['concept'].to_list()
    for a, b in itertools.combinations(concepts_for_this_pub, 2):  # all unordered pairs of concepts
        if G.has_edge(a, b):
            G.edges[a, b]['weight'] += 1  # pair seen before: increase the weight
        else:
            G.add_edge(a, b, weight=1)  # first co-occurrence of this pair

print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

#
# this extra step is useful to remove low-weight connections
#

print(f".. cleaning up edges with weight < {MIN_EDGE_WEIGHT}...")

for a, b, w in list(G.edges(data='weight')):
    if w < MIN_EDGE_WEIGHT:
        G.remove_edge(a, b)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

print(f".. removing isolated nodes...")

G.remove_nodes_from(list(nx.isolates(G)))
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

Nodes: 239 Edges: 0
.. adding edges from pubs cooccurrence...
Nodes: 239 Edges: 2115
.. cleaning up edges with weight < 2...
Nodes: 239 Edges: 490
.. removing isolated nodes...
Nodes: 201 Edges: 490

Step 3: Visualizing the network

Note

  • We’re using a custom version of pyvis called NetworkViz, which is imported from dimcli.utils.networkviz (see the Prerequisites cell). This custom version fixes a bug that prevents pyvis graphs from being displayed in Google Colab.

  • The pyvis from_nx method doesn’t carry over the edge weight or any other attribute from our network data, so we need to set these values manually in a second pass.

  • Using score_bucket (see above), we can mark the higher-score concepts with a brighter color.
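
The second pass mentioned above amounts to looking up each visual node in the original data and copying the attributes across. The sketch below mimics that with plain dictionaries: graph_attrs stands in for the networkx node attributes, viz_nodes for the node dictionaries held by the visualization, and BUCKET_COLORS for the palette. All three names, and the values in them, are hypothetical.

```python
# Stand-in for the attributes stored on the graph nodes (values are made up).
graph_attrs = {
    "ontology": {"frequency": 12, "score_avg": 0.71, "score_bucket": 1},
    "web": {"frequency": 30, "score_avg": 0.62, "score_bucket": 2},
}

# Stand-in for the node dictionaries the visualization holds after the import.
viz_nodes = [
    {"id": "ontology", "label": "ontology"},
    {"id": "web", "label": "web"},
]

# Hypothetical colors: bucket 1 = above the mean score, bucket 2 = at or below it.
BUCKET_COLORS = {1: "#f6c85f", 2: "#6f4e7c"}

def decorate(viz_nodes, graph_attrs):
    """Copy size, color and tooltip text onto each visual node."""
    for node in viz_nodes:
        attrs = graph_attrs[node["label"]]
        node["size"] = attrs["frequency"] * 2
        node["color"] = BUCKET_COLORS[attrs["score_bucket"]]
        node["title"] = (
            f"{node['label']}: frequency {attrs['frequency']}, "
            f"score avg {attrs['score_avg']}"
        )
    return viz_nodes

for node in decorate(viz_nodes, graph_attrs):
    print(node)
```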

[4]:
viznet = NetworkViz(notebook=True, width="100%", height="800px")
viznet.toggle_hide_edges_on_drag(True)
viznet.barnes_hut()
viznet.repulsion(300)
viznet.heading = f"Concepts co-occurrence for '{KEYWORD}' publications"


# reuse plotly color palette
palette = px.colors.diverging.Temps  # 7 colors

viznet.from_nx(G)


# update visual features

for node in viznet.nodes:
    freq = G.nodes[node['label']]['frequency']
    score_avg = G.nodes[node['label']]['score_avg']
    score_bucket = G.nodes[node['label']]['score_bucket'] # get from original network

    node['size'] = freq * 2
    node['color'] = palette[3*score_bucket]  # get color based on score_bucket (1 or 2)
    node['borderWidthSelected'] = 5
    node['title'] = f"<h4>Concept: '{node['label']}'</h4><hr>Frequency: {freq}<br>Score avg: {score_avg}"  # no trailing comma, so the title stays a string
    # print(node)
for edge in viznet.edges:
    # get value from main Network weight
    edge['value'] = G.edges[edge['from'], edge['to']]['weight']
    # print(edge)

viznet.show("concepts_network.html")


[4]:

Conclusions

In this tutorial we have demonstrated how to generate a concepts ‘co-occurrence network diagram’ using the Dimensions API.

The resulting visualization is also available as a standalone file here.

For more information on this topic, see also the official documentation on concepts and the Working with concepts in the Dimensions API tutorial.



Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. See also the associated GitHub repository for examples, the source code of these tutorials and much more.
