Building an Organizations Collaboration Network Diagram

This notebook shows how to analyse collaborations between organizations using the Organizations data source available via the Dimensions Analytics API.

Starting from a research organization, we will extract information about other organizations that collaborated with it, based on shared publications data.

To make the analysis more focused, we will also select a topic and a time frame. Applying these extra constraints reduces the number of shared publications to process and makes the overall extraction faster.

At the end of the tutorial we will generate a ‘collaborations network diagram’. The diagram nodes represent the organizations working together, while the edges represent the number of publications they have in common. An example of the resulting network diagram can be seen here.

[1]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Aug 22, 2023
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:

!pip install dimcli plotly tqdm pyvis -U --quiet

#
# load libraries
import dimcli
from dimcli.utils import *

import json, sys, time
import pandas as pd
from tqdm.notebook import tqdm as progress
import plotly.express as px  # plotly>=4.8.1
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()


Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v1.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.7
Method: dsl.ini file

1. Choose an Organization and a keyword (topic)

For the purpose of this exercise, we will use grid.412125.1 (King Abdulaziz University, Saudi Arabia).

You can try using a different GRID ID to see how results change, e.g. by browsing for another GRID organization.

[3]:

GRIDID = "grid.412125.1" #@param {type:"string"}

#@markdown The start/end year of publications used to extract collaborators
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

#@markdown ---
#@markdown A keyword used to filter publications search
TOPIC = "nanotechnology" #@param {type:"string"}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # fall back to an empty name if the lookup fails
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {}'.format(YEAR_START, YEAR_END)))
display(HTML('Topic: "{}" <br /><br />'.format(TOPIC)))

GRID: grid.412125.1 - King Abdulaziz University ⧉

Time period: 2000 to 2016

Topic: "nanotechnology"

2. Building a one-degree network of collaborating institutions

We can use the publications API to find the top 10 collaborating institutions based on the parameters above, via a single query.

The get_collaborators function below fills out a templated query with the relevant bits and runs it. Then it transforms the results into a pandas dataframe, which will make it easier to process the data later on.

[4]:

query_template = """search publications {}
                   where year in [{}:{}]
                   and research_orgs.id="{}"
                return research_orgs limit 11"""

def get_collaborators(orgid, level=1, printquery=False):
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    # fill in the blanks in the query_template
    query_full = query_template.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid)
    if printquery: print(query_full)
    df = dsl.query(query_full, verbose=False).as_dataframe()
    # add extra columns
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df

Note:

Extra columns. The resulting dataframe contains two extra columns: a) id_from, which is the ‘seed’ institution we start from; b) level, an optional parameter representing the network depth of the query (we’ll see later how it is used with recursive querying).
Self-collaboration. The query returns 11 records - that’s because the first one is normally the seed GRID (due to internal collaborations) which we will omit from the results.
Custom changes. Lastly, it’s important to remember that this step can be easily customised by changing the query_template structure. For example, we could focus on specific research areas (using FOR codes), or set a threshold based on citation counts. The possibilities are endless!
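As a rough illustration of the customisation idea above, the sketch below assembles the query string from a list of filters, so that extra constraints can be switched on or off. The build_query helper and its min_citations parameter are hypothetical (they are not part of the notebook), and the times_cited field name is an assumption that should be checked against the publications data source documentation before use.

```python
def build_query(orgid, year_start, year_end, topic="", min_citations=None, limit=11):
    """Assemble a DSL query string similar to the query_template above (illustrative only)."""
    topic_clause = f'for "{topic}" ' if topic else ""
    filters = [
        f"year in [{year_start}:{year_end}]",
        f'research_orgs.id="{orgid}"',
    ]
    if min_citations is not None:
        # assumption: `times_cited` is the citation-count field on publications
        filters.append(f"times_cited >= {min_citations}")
    where_clause = "\n   and ".join(filters)
    return (
        f"search publications {topic_clause}\n"
        f"   where {where_clause}\n"
        f"return research_orgs limit {limit}"
    )


# only builds and prints the string; nothing is sent to the API
print(build_query("grid.412125.1", 2000, 2016, topic="nanotechnology", min_citations=10))
```

Keeping the filters in a list makes it easy to add or drop a constraint without touching the rest of the template.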

For example, let’s try it out with our GRID ID:

[5]:

get_collaborators(GRIDID, printquery=True)

search publications for "nanotechnology"
                   where year in [2000:2016]
                   and research_orgs.id="grid.412125.1"
                return research_orgs limit 11

[5]:

	id	name	acronym	city_name	count	country_name	latitude	linkout	longitude	types	state_name	id_from	level
0	grid.412125.1	King Abdulaziz University	KAU	Jeddah	1444	Saudi Arabia	21.493889	[http://www.kau.edu.sa/home_english.aspx]	39.250280	[Education]	NaN	grid.412125.1	1
1	grid.261112.7	Northeastern University	NU	Boston	106	United States	42.339830	[http://www.northeastern.edu/]	-71.089180	[Education]	Massachusetts	grid.412125.1	1
2	grid.38142.3c	Harvard University	NaN	Cambridge	98	United States	42.377052	[http://www.harvard.edu/]	-71.116650	[Education]	Massachusetts	grid.412125.1	1
3	grid.116068.8	Massachusetts Institute of Technology	MIT	Cambridge	73	United States	42.359820	[http://web.mit.edu/]	-71.092110	[Education]	Massachusetts	grid.412125.1	1
4	grid.16753.36	Northwestern University	NU	Evanston	59	United States	42.054850	[http://www.northwestern.edu/]	-87.673940	[Education]	Illinois	grid.412125.1	1
5	grid.413735.7	Harvard–MIT Division of Health Sciences and Te...	HST	Cambridge	58	United States	42.361780	[http://hst.mit.edu/]	-71.086914	[Education]	Massachusetts	grid.412125.1	1
6	grid.411340.3	Aligarh Muslim University	AMU	Aligarh	47	India	27.917370	[http://www.amu.ac.in/]	78.077850	[Education]	Uttar Pradesh	grid.412125.1	1
7	grid.412621.2	Quaid-i-Azam University	QAU	Islamabad	47	Pakistan	33.747223	[http://www.qau.edu.pk/]	73.138885	[Education]	NaN	grid.412125.1	1
8	grid.33003.33	Suez Canal University	NaN	Ismailia	42	Egypt	30.622778	[http://scuegypt.edu.eg/ar/]	32.275000	[Education]	NaN	grid.412125.1	1
9	grid.411818.5	Jamia Millia Islamia	JMI	New Delhi	42	India	28.561607	[http://jmi.ac.in/]	77.280150	[Education]	NaN	grid.412125.1	1
10	grid.56302.32	King Saud University	KSU	Riyadh	42	Saudi Arabia	24.723982	[http://ksu.edu.sa/en/]	46.645840	[Education]	NaN	grid.412125.1	1

3. Building a network of any size

What if we want to retrieve the collaborators of the collaborators? In other words, what if we want to generate a larger network?

If we think of our collaboration data as a graph structure with nodes and edges, we can see that the get_collaborators function defined above is limited: it only returns the organizations directly linked to the ‘seed’ GRID organization.

We would like to run the same collaborators-extraction step iteratively for any GRID ID in our results, so as to generate an N-degree network where N is chosen by us.

For this purpose, we can set up a recursive function. This function essentially repeats the get_collaborators function as many times as needed. Here’s what it looks like:

[6]:

def recursive_network(seed, maxlevel=1, thislevel=1):
    "Recursive function for building an organization collaboration network"
    results = get_collaborators(seed, thislevel)
    time.sleep(1)
    print("--" * thislevel, seed, " :: level =", thislevel)
    if thislevel < maxlevel:
        # remove the originating grid-id and the current seed, so neither is queried again
        gridslist = list(results[~results['id'].isin([GRIDID, seed])]['id'])
        next_level_results = [recursive_network(x, maxlevel, thislevel+1) for x in gridslist]
        next_level_results = pd.concat(next_level_results)
        results = pd.concat([results, next_level_results])
        return results
    else:
        # finally
        return results

A few key points to note:

Recursion depth. The maxlevel parameter determines how big our network should be (1 = neighbours only, 2 = collaborators of neighbours, etc.)
API quota. We pause 1 second after each iteration to avoid hitting the normal Analytics API quota (~30 requests per minute)
Data size. The function can generate lots of data! E.g. calling this function with maxlevel=5 will lead to more than 10k queries! (note: you can get a rough estimate of the queries via the formula 10 to the power of maxlevel-1. That’s because 10 is the number of orgs we extract per query, and each level beyond the first multiplies the number of queries by 10).
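The growth described above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every query yields exactly 10 new organizations (an upper bound, since results often overlap); estimate_queries is a hypothetical helper, and the time estimate simply divides by the ~30 requests per minute quota mentioned above.

```python
def estimate_queries(maxlevel, per_node=10):
    """Upper bound on the API calls made by a recursive crawl of depth `maxlevel`."""
    # level 1 is a single query (the seed); each further level multiplies by `per_node`
    per_level = [per_node ** depth for depth in range(maxlevel)]
    return per_level, sum(per_level)


for maxlevel in range(1, 6):
    per_level, total = estimate_queries(maxlevel)
    minutes = total / 30  # assumes the ~30 requests/minute quota
    print(f"maxlevel={maxlevel}: deepest level={per_level[-1]:>6}, "
          f"total={total:>6}, at least {minutes:.1f} min")
```

The deepest level dominates the total, which is why 10 to the power of maxlevel-1 works as a quick estimate.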

Let’s try this out.

We can construct a 2-degrees collaboration network starting from King Abdulaziz University. We are extracting 10 organizations per node so our network will have ~100 nodes at the end!

[7]:

collaborators = recursive_network(GRIDID, maxlevel=2)
# change column order for readability purposes
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'state_name', 'country_name', 'latitude', 'longitude', 'linkout',  'types' ]]
collaborators.head()

-- grid.412125.1  :: level = 1
---- grid.261112.7  :: level = 2
---- grid.38142.3c  :: level = 2
---- grid.116068.8  :: level = 2
---- grid.16753.36  :: level = 2
---- grid.413735.7  :: level = 2
---- grid.411340.3  :: level = 2
---- grid.412621.2  :: level = 2
---- grid.33003.33  :: level = 2
---- grid.411818.5  :: level = 2
---- grid.56302.32  :: level = 2

[7]:

	id_from	id_to	level	count	name	acronym	city_name	state_name	country_name	latitude	longitude	linkout	types
0	grid.412125.1	grid.412125.1	1	1444	King Abdulaziz University	KAU	Jeddah	NaN	Saudi Arabia	21.493889	39.25028	[http://www.kau.edu.sa/home_english.aspx]	[Education]
1	grid.412125.1	grid.261112.7	1	106	Northeastern University	NU	Boston	Massachusetts	United States	42.339830	-71.08918	[http://www.northeastern.edu/]	[Education]
2	grid.412125.1	grid.38142.3c	1	98	Harvard University	NaN	Cambridge	Massachusetts	United States	42.377052	-71.11665	[http://www.harvard.edu/]	[Education]
3	grid.412125.1	grid.116068.8	1	73	Massachusetts Institute of Technology	MIT	Cambridge	Massachusetts	United States	42.359820	-71.09211	[http://web.mit.edu/]	[Education]
4	grid.412125.1	grid.16753.36	1	59	Northwestern University	NU	Evanston	Illinois	United States	42.054850	-87.67394	[http://www.northwestern.edu/]	[Education]
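Because each organization is queried separately, the same pair may appear twice in the results: once as A to B and once as B to A. The sketch below shows one possible way to collapse such rows into a single undirected edge, using plain tuples and toy data rather than the real dataframe. The merge_reciprocal_edges helper is hypothetical, and keeping the larger of the two counts is an arbitrary choice.

```python
def merge_reciprocal_edges(edges):
    """Collapse (a, b, count) and (b, a, count) rows into one undirected edge."""
    merged = {}
    for id_from, id_to, count in edges:
        if id_from == id_to:
            continue  # skip self-collaboration rows
        key = tuple(sorted((id_from, id_to)))
        # keep the larger count if the two directions disagree
        merged[key] = max(count, merged.get(key, 0))
    return merged


# toy data, not real API results
toy_edges = [
    ("grid.A", "grid.A", 50),
    ("grid.A", "grid.B", 12),
    ("grid.B", "grid.A", 12),
    ("grid.B", "grid.C", 7),
]
print(merge_reciprocal_edges(toy_edges))
```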

4. Visualizing the network

In order to get an overview of the network data we can visualize it using the Python pyvis library. A custom version of pyvis is already included in dimcli.core.extras and is called NetworkViz (note: this custom version only fixes a bug that prevents pyvis graphs from being displayed online with Google Colab).

Network visualizations can be very complex, but to begin with we can focus on representing two key aspects:

Core collaborators. The size of the nodes should be proportional to the proximity to our ‘seed’ organization. This will make it easier to quickly identify the key players in the network
Number of publications. The thickness of the edges should be proportional to the strength of the collaboration (= how many publications two orgs have in common)

In a nutshell, this is what the code below does:

After creating a pyvis Network object, we add nodes and edges from our dataset using the add_node and add_edge methods.
The Network repulsion parameter is set to 300, but for bigger charts you may want to increase that.
Nodes and edges in pyvis can have a number of attributes. The full list of attributes can be found in the pyvis documentation.
In order to have some nice colors, we take advantage of the built-in plotly color scales. Try changing them!
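The sizing, colouring and edge-weighting rules can be isolated from pyvis and tried on their own. The sketch below mirrors the logic of the notebook cell under stated assumptions: PALETTE is a placeholder list of seven colours standing in for a plotly colour scale, node_style and edge_weights are hypothetical helper names, and the clamp on the palette index is an addition that avoids running past the end of the palette at deeper levels.

```python
# placeholder palette with 7 entries (assumption: stands in for a plotly colour scale)
PALETTE = ["#1b7837", "#5aae61", "#a6dba0", "#f7f7f7", "#fdb863", "#e08214", "#b35806"]


def node_style(level, max_level, is_seed, palette=PALETTE):
    """Return (size, color): the seed is largest, deeper levels get smaller nodes."""
    max_size = max_level + 1
    size = max_size if is_seed else max_size - level
    # the seed always takes the first colour; other nodes step through the palette
    index = 0 if is_seed else min(level * 2, len(palette) - 1)
    return size, palette[index]


def edge_weights(counts):
    """Scale publication counts to the 0-1 range relative to the strongest link."""
    top = max(counts)
    return [count / top for count in counts]


print(node_style(1, 2, is_seed=True))   # seed node
print(node_style(2, 2, is_seed=False))  # second-level collaborator
print(edge_weights([106, 53]))
```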

[8]:

# load pyvis
from pyvis.network import Network


def build_visualization(collaborator_df):
    """
    Return a network visualization object from a collaborators dataframe
    The object can be then displayed/saved with `g.show(f"network.html")`
    """

    # set up dataviz
    g = Network(notebook=True, width="100%", height="800px",cdn_resources="remote",
            neighborhood_highlight=True,
            select_menu=True)
    g.toggle_hide_edges_on_drag(False)
    g.barnes_hut()
    g.repulsion(300)
    # reuse plotly color palette
    palette = px.colors.diverging.Temps
    # g.show_buttons() # in html-standalone mode, this command shows viz controls


    #
    # create nodes and edges
    #

    # remove duplicates from nodes
    nodes = collaborator_df.drop_duplicates(subset ="id_to", keep = 'first')
    # remove internal collaborations stats
    edges = collaborator_df[(collaborator_df['id_to'] != collaborator_df['id_from'])]


    #
    # add nodes
    #

    for index, row in nodes.iterrows():

        # calc size based on level
        maxsize = int(nodes['level'].max()) + 1
        if row['id_to'] == GRIDID:
            size = maxsize
        else:
            size = maxsize - row['level']

        # calc color based on level
        if row['id_to'] == GRIDID:
            color = palette[0]
        else:
            color = palette[row['level'] * 2]

        g.add_node(
            n_id = row['id_to'],
            label = row['name'],
            title = f"<h4>{row['name']}<br>{row['city_name']}, {row['country_name']}<br> - {row['id_to']}</h4>",
            value = size,
            color = color,
            borderWidthSelected = 5,
            shape = "dot",
        )


    #
    # add edges
    #

    edges_maxcount = edges['count'].max()

    for index, row in edges.iterrows():
      g.add_edge(row['id_from'], row['id_to'],
                 value = float(row['count']) / edges_maxcount,
                 label=int(row['count']),
                 arrows="none"
                )

    # add tooltips with adjacent links info
    neighbor_map = g.get_adj_list()
    for node in g.nodes:
        neigh = neighbor_map[node["id"]]
        labels = [nodes[nodes['id_to'] == x].iloc[0]['name'] for x in neigh]
        node["title"] += "Links:<li>" + "</li><li>".join(labels)

    return g

#
# finally, run the viz builder
#
g = build_visualization(collaborators)
g.show(f"network_{GRIDID}.html")

network_grid.412125.1.html

5. Addendum: showing only ‘Government’ collaborators

What if we want to show a collaboration network focusing only on ‘government’ organizations?

That’s pretty easy to do, since the GRID database includes information about organization types. We can easily see what types are available using the API and a facet query:

[9]:

%dsldf search organizations return types

Returned Types: 9
Time: 1.00s

[9]:

	id	count
0	Company	30742
1	Education	20761
2	Nonprofit	17573
3	Healthcare	13926
4	Facility	10168
5	Government	6580
6	Other	4017
7	Archive	2926
8	Education,Company	1

The steps are the following:

New query filter. We rewrite the get_collaborators function we created in section 2 above, so that the API query includes a filter for organizations with the selected type only: .. and research_orgs.types in ["Government"]...
Get more results. We increase the number of results returned: ..return research_orgs limit 50. This is to ensure we still have enough results after removing the ones that don’t have the chosen ‘type’
Remove unwanted data. The new query filter research_orgs.types in ["{}"] also matches publications with multiple authors/affiliations even when only one of them has the desired ‘type’, so the results can still include organizations of other types. An extra step is therefore required, which is handled by the keep_only_type function below: it simply filters out all unwanted organizations after they are retrieved from the API.
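The post-filtering step can be tried offline on plain dictionaries before running it against the API. This is a minimal sketch with made-up records: filter_orgs_by_type is a hypothetical stand-in for the notebook's own function, and the use of .get() to tolerate records without a types key is an assumption about what the data might look like.

```python
def filter_orgs_by_type(orgs, a_type, orgid):
    """Keep the seed organization plus every organization of the requested type."""
    return [
        org for org in orgs
        # the seed is kept regardless of its type so the chart stays connected
        if org["id"] == orgid or a_type in org.get("types", [])
    ]


# made-up records, not real API results
toy_orgs = [
    {"id": "grid.seed", "types": ["Education"]},
    {"id": "grid.gov", "types": ["Government"]},
    {"id": "grid.edu", "types": ["Education"]},
    {"id": "grid.unknown"},
]
kept = filter_orgs_by_type(toy_orgs, "Government", "grid.seed")
print([org["id"] for org in kept])
```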

That’s it! Run the cell below to generate a new visualization showing only “Government” collaborators. Or try changing the value of GRID_TYPE to see different results.

[10]:

#@markdown Try using one of the organization types from the list above

GRID_TYPE = "Government" #@param {type:"string"}

query = """search publications {}
               where year in [{}:{}]
               and research_orgs.id="{}"
               and research_orgs.types in ["{}"]
            return research_orgs limit 50"""

def keep_only_type(data, a_type, orgid):
    clean_list = []
    for x in data.research_orgs:
        # include also originating GRID to ensure chart is complete
        if x['id'] == orgid or a_type in x['types']:
            clean_list.append(x)
    data.json['research_orgs'] = clean_list
    return data


def get_collaborators(orgid, level=1, printquery=False):
    "New version that filters using org types as well"
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    # include also the GRID_TYPE
    query_full = query.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid, GRID_TYPE)
    if printquery: print(query_full)
    data = dsl.query(query_full, verbose=False)
    # remove results with unwanted types (uses the type selected above)
    data = keep_only_type(data, GRID_TYPE, orgid)
    df = data.as_dataframe()
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df


#
# RUN THE RECURSIVE QUERY (same code as above)
#
collaborators = recursive_network(GRIDID, maxlevel=2)
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'country_name', 'latitude', 'longitude', 'linkout',  'types' ]]

#
# BUILD VIZ
#
g = build_visualization(collaborators)
g.show(f"network_{GRIDID}_{GRID_TYPE}.html")

-- grid.412125.1  :: level = 1
---- grid.7327.1  :: level = 2
---- grid.9227.e  :: level = 2
---- grid.20256.33  :: level = 2
---- grid.1089.0  :: level = 2
---- grid.14467.30  :: level = 2
network_grid.412125.1_Government.html

Conclusions

In this tutorial we have demonstrated how to generate an organization ‘collaborations network diagram’ using the Dimensions API. Starting from a research organization, we extracted information about other collaborating organizations, based on shared publications data, a topic and a time-frame.

An example of the resulting network diagram can be seen here.

Here are some ideas for further experimentation:

try changing the initial publications query so as to include other parameters. The publications API is rich, so there are many ways to fine-tune your analysis
try increasing the number of iterations using the maxlevel parameter
try customizing the resulting network diagram, e.g. to highlight nodes and edges based on different criteria like countries or years.
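When increasing the number of iterations, it may help to avoid querying the same organization more than once. The sketch below is one possible approach, not the notebook's method: a breadth-first crawl that tracks visited organizations. It runs against a toy in-memory graph, and the crawl and fake_fetch names are hypothetical; in practice fetch would wrap a call such as get_collaborators.

```python
from collections import deque


def crawl(seed, fetch, maxlevel=2):
    """Breadth-first crawl that queries each organization at most once."""
    visited = {seed}            # organizations already queued for a query
    queue = deque([(seed, 1)])  # (organization, level) pairs still to query
    edges = []
    while queue:
        org, level = queue.popleft()
        for other in fetch(org):
            if other == org:
                continue  # ignore self-collaboration
            edges.append((org, other, level))
            if level < maxlevel and other not in visited:
                visited.add(other)
                queue.append((other, level + 1))
    return edges


# toy graph standing in for API results (each list starts with the org itself)
TOY_GRAPH = {"A": ["A", "B", "C"], "B": ["B", "A", "C"], "C": ["C", "A"]}
calls = []


def fake_fetch(org):
    calls.append(org)  # record every simulated query
    return TOY_GRAPH.get(org, [])


edges = crawl("A", fake_fetch, maxlevel=3)
print(calls)
print(edges)
```

Even with maxlevel=3, the toy crawl issues one simulated query per organization, since already-visited nodes are never queued again.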

Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.