
Building a concepts co-occurrence network

This Python notebook shows how to use the Dimensions Analytics API to extract concepts from publications and use them to generate a ‘topics map’ based on co-occurrence information. For more background on concepts, see also the Working with concepts in the Dimensions API tutorial.

Note this tutorial is best experienced using Google Colab.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial. The networkx and pyvis libraries are used for generating and visualizing the network, respectively.

[1]:
!pip install dimcli plotly networkx pyvis jsonpickle  -U --quiet

import dimcli
from dimcli.shortcuts import *
from dimcli.core.extras import NetworkViz # custom version of pyvis - colab-compatible

import json
import sys
import pandas as pd
import networkx as nx
import plotly.express as px
import itertools

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7.4.2)
Connected to: https://app.dimensions.ai - DSL v1.27
Method: dsl.ini file

Step 1: Creating a dataset

We start by creating a dataset consisting of the 1,000 most cited publications matching a chosen keyword.

The API query below will return a list of documents including all the related concepts.

Tip: try changing the query keyword in order to experiment with different results.

[2]:
#@markdown Enter the keyword used to seed the search
KEYWORD = "Semantic Web" #@param {type:"string"}

q = f"""search publications
            for "\\"{KEYWORD}\\""
        return publications[id+title+concepts_scores]
        sort by times_cited limit 1000"""



data = dsl.query(q)
concepts = data.as_dataframe_concepts()
print("Total concepts:", len(concepts))
print("Concepts score average", concepts['score_avg'].mean())
concepts.head()
Returned Publications: 1000 (total = 143549)
Time: 4.57s
Total concepts: 46176
Concepts score average 0.37900926845114347
[2]:
    title                      id              concepts_count  concept            score    frequency  score_avg
0   Building better batteries  pub.1007137639  67              materials science  0.06888  4          0.20473
1   Building better batteries  pub.1007137639  67              new series         0.06877  3          0.05128
2   Building better batteries  pub.1007137639  67              better batteries   0.06703  1          0.06703
3   Building better batteries  pub.1007137639  67              batteries          0.06070  3          0.03250
4   Building better batteries  pub.1007137639  67              Murray-Rust        0.05667  3          0.02170
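
As an optional sanity check before building the network, you can rank the most frequent concepts in the dataframe. This is a quick exploratory sketch (not part of the original notebook), assuming the concepts dataframe created above:

top_concepts = (concepts
    .drop_duplicates("concept")                    # one row per distinct concept
    .sort_values("frequency", ascending=False))    # most frequent concepts first
print(top_concepts[["concept", "frequency", "score_avg"]].head(10))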

Step 2: Building a concepts co-occurrence network

Each publication in our dataset includes a list of related concepts. In order to build a concepts co-occurrence network, we simply add an edge between concepts that appear in the same document.

Edges start with a weight of 1; each additional document in which the same two concepts co-occur increases that weight by 1.
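
The pair-counting logic is easiest to see on a toy example. The sketch below uses made-up concept lists (hypothetical data, not drawn from the query above), but applies the same counting idea as the cell in this step:

import itertools
from collections import Counter

# Hypothetical input: each inner list holds the concepts of one publication.
toy_docs = [
    ["semantic web", "ontology", "rdf"],
    ["semantic web", "rdf"],
    ["ontology", "semantic web"],
]

pair_weights = Counter()
for doc in toy_docs:
    # sorted() makes each pair order-independent, so (a, b) == (b, a)
    for pair in itertools.combinations(sorted(set(doc)), 2):
        pair_weights[pair] += 1

print(pair_weights.most_common())
# [(('ontology', 'semantic web'), 2), (('rdf', 'semantic web'), 2), (('ontology', 'rdf'), 1)]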

Note: the resulting network can be very large, so in order to make our network smaller (and more relevant), we can filter out less interesting concepts in two ways:

  • by setting a frequency and score_avg threshold (as shown in the Working with concepts in the Dimensions API tutorial)

  • by removing edges with weight below MIN_EDGE_WEIGHT and then dropping the nodes left isolated (that is, keeping only the better-connected concepts)

[3]:
G = nx.Graph() # networkX instance

#
# TIP: play with these parameters in order to generate different types of networks
#
MIN_CONCEPT_SCORE = 0.6
MIN_CONCEPT_FREQUENCY = 4
MIN_EDGE_WEIGHT = 2

CONCEPTS_SET = concepts.query(f"score_avg >= {MIN_CONCEPT_SCORE} & frequency >=  {MIN_CONCEPT_FREQUENCY}")


#
# build nodes from concepts, including score_avg and frequency
# -- NOTE: score_bucket indicates if the concepts is above or below the mean_score
# -- this value is used in the visualization below to color-code nodes
#
mean_score = CONCEPTS_SET['score_avg'].mean()
for index, row in CONCEPTS_SET.drop_duplicates("concept").iterrows():
    score_bucket = 1 if row['score_avg'] > mean_score else 2
    G.add_node(row['concept'],frequency=row['frequency'], score_avg=row['score_avg'], score_bucket=score_bucket)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

#
# build edges, based on concepts co-occurrence within pubs
# -- calculate a 'weight' based on how often two concepts co-occur
#
print(f".. adding edges from pubs cooccurrence...")
pubs_list = CONCEPTS_SET.drop_duplicates("id")['id'].to_list()

for p in pubs_list:
    concepts_for_this_pub = CONCEPTS_SET[CONCEPTS_SET['id'] == p]['concept'].to_list()
    for a, b in itertools.combinations(concepts_for_this_pub, 2):  # all unordered concept pairs
        if G.has_edge(a, b):
            G.edges[a, b]['weight'] += 1
        else:
            G.add_edge(a, b, weight=1)

print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

#
# this extra step is useful to remove low-weight connections
#

print(f".. cleaning up edges with weight < {MIN_EDGE_WEIGHT}...")

for a, b, w in list(G.edges(data='weight')):
    if w < MIN_EDGE_WEIGHT:
        G.remove_edge(a, b)
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

print(f".. removing isolated nodes...")

G.remove_nodes_from(list(nx.isolates(G)))
print("Nodes:", len(G.nodes()), "Edges:", len(G.edges()))

Nodes: 177 Edges: 0
.. adding edges from pubs co-occurrence...
Nodes: 177 Edges: 1080
.. cleaning up edges with weight < 2...
Nodes: 177 Edges: 203
.. removing isolated nodes...
Nodes: 122 Edges: 203
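
Before moving on to the visualization, it can be useful to check which concepts dominate the pruned network. One quick way to do this (an optional check, not part of the original notebook) is to rank nodes by weighted degree, i.e. the sum of the weights of their edges:

# Rank concepts by weighted degree (sum of edge weights per node).
hubs = sorted(G.degree(weight="weight"), key=lambda pair: pair[1], reverse=True)
for concept, weighted_degree in hubs[:10]:
    print(concept, weighted_degree)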

Step 3: Visualizing the network

Note

  • We’re using a custom version of pyvis which is included in dimcli.core.extras and is called NetworkViz. This custom version fixes a bug that prevents pyvis graphs from being displayed in Google Colab.

  • The pyvis from_nx method doesn’t carry through the weight or any other attribute from our network data, so we need to set these manually in a second pass

  • Using score_bucket (see above), we can mark higher-score concepts with a brighter color

[4]:
viznet = NetworkViz(notebook=True, width="100%", height="800px")
viznet.toggle_hide_edges_on_drag(True)
viznet.barnes_hut()
viznet.repulsion(300)
viznet.heading = f"Concepts co-occurrence for '{KEYWORD}' publications"


# reuse plotly color palette
palette = px.colors.diverging.Temps  # 7 colors

viznet.from_nx(G)


# update visual features

for node in viznet.nodes:
    freq = G.nodes[node['label']]['frequency']
    score_avg = G.nodes[node['label']]['score_avg']
    score_bucket = G.nodes[node['label']]['score_bucket'] # get from original network

    node['size'] = freq * 2
    node['color'] = palette[3*score_bucket]  # get color based on score_bucket (1 or 2)
    node['borderWidthSelected'] = 5
    node['title'] = f"<h4>Concept: '{node['label']}'</h4><hr>Frequency: {freq}<br>Score avg: {score_avg}",
    # print(node)
for edge in viznet.edges:
    # get value from main Network weight
    edge['value'] = G.edges[edge['from'], edge['to']]['weight']
    # print(edge)

viznet.show("concepts_network.html")


[4]:
(The interactive network visualization renders here; a standalone HTML version is linked below.)

Conclusions

In this tutorial we have demonstrated how to generate a concepts ‘co-occurrence network diagram’ using the Dimensions API.

The resulting visualization is also available as a standalone file here.

For more information on this topic, see also the official documentation on concepts and the Working with concepts in the Dimensions API tutorial.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
