
Building an Organizations Collaboration Network Diagram

This notebook shows how to analyse organizations collaboration data using the Organizations data source available via the Dimensions Analytics API.

Starting from a research organization, we will extract information about other organizations that collaborated with it, based on shared publications data.

In order to make the analysis more focused, we will also select a topic and a time-frame. By applying these extra constraints we reduce the number of shared publications and make the overall extraction faster.

At the end of the tutorial we will generate a ‘collaborations network diagram’. The diagram nodes represent the organizations working together, while the edges represent the number of publications they have in common. An example of the resulting network diagram can be seen here.
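Before diving into the API, it helps to keep the underlying data structure in mind: an undirected weighted graph. A minimal sketch in plain Python (the GRID-style IDs and counts are made up for illustration):

```python
from collections import defaultdict

# Toy collaboration data: pairs of organizations plus the number of
# publications they share. The IDs below are illustrative only.
collaborations = [
    ("grid.A", "grid.B", 84),
    ("grid.A", "grid.C", 62),
    ("grid.B", "grid.C", 12),
]

# Undirected weighted graph as a dict-of-dicts: nodes are organizations,
# edge weights are shared publication counts.
graph = defaultdict(dict)
for org_a, org_b, n_pubs in collaborations:
    graph[org_a][org_b] = n_pubs
    graph[org_b][org_a] = n_pubs

print(len(graph), graph["grid.A"]["grid.B"])  # 3 84
```

Later in the tutorial we build exactly this structure, but with real GRID IDs and with networkx/pyvis doing the heavy lifting.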

1. Prerequisites: load libraries and log in

Please install the latest versions of these libraries to run this notebook.

[58]:
# @markdown # Get the API library and login
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}


!pip install dimcli plotly tqdm pyvis -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

#
# load common libraries
import time
import sys
import json
import pandas as pd
from pandas import json_normalize  # moved out of pandas.io.json in pandas 1.0
from tqdm.notebook import tqdm as progress
import networkx as nx

#
# charts libs
# import plotly_express as px
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
DimCli v0.6.6.5 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)

2. Choose an Organization and a keyword (topic)

For the purpose of this exercise, we will use grid.412125.1 (King Abdulaziz University, Saudi Arabia).

You can try using a different GRID ID to see how results change, e.g. by browsing for another GRID organization.

[59]:
GRIDID = "grid.412125.1" #@param {type:"string"}

#@markdown The start/end year of publications used to extract patents
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

#@markdown ---
#@markdown A keyword used to filter publications search
TOPIC = "nanotechnology" #@param {type:"string"}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {}'.format(YEAR_START, YEAR_END)))
display(HTML('Topic: "{}" <br /><br />'.format(TOPIC)))

Time period: 2000 to 2016
Topic: "nanotechnology"

3. Building a one-degree network of collaborating institutions

We can use the publications API to find the top 10 collaborating institutions based on the parameters above, via a single query.

The get_collaborators function below fills out a templated query with the relevant bits and runs it. Then it transforms the results into a pandas dataframe, which will make it easier to process the data later on.

A couple of things to note:

  • The resulting dataframe contains two extra columns: a) id_from, which is the ‘seed’ institution we start from; b) level, an optional parameter representing the network depth of the query (we’ll see later how it is used with recursive querying).

  • The query returns 11 records - that’s because the first one is normally the seed GRID (due to internal collaborations) which we will omit from the results.

  • Lastly, it’s important to note that one could easily add more constraints to the query, e.g. filtering by research areas via FOR codes, or setting a threshold based on citation counts. The possibilities are endless!
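For instance, a citations threshold could be slotted into the same query template. This is only a sketch: the `times_cited` filter name is an assumption based on the Dimensions DSL publications source, so check the DSL docs before relying on it.

```python
# Hypothetical variant of the templated query with an extra constraint.
base_query = """search publications for "{topic}"
               where year in [{start}:{end}]
               and research_orgs.id="{grid}"
               and times_cited >= {min_citations}
            return research_orgs limit 11"""

q = base_query.format(topic="nanotechnology", start=2000, end=2016,
                      grid="grid.412125.1", min_citations=10)
print(q)
```

The rest of the tutorial would work unchanged, since `get_collaborators` only cares about the `research_orgs` facet returned by the query.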

[60]:
query = """search publications {}
               where year in [{}:{}]
               and research_orgs.id="{}"
            return research_orgs limit 11"""

def get_collaborators(orgid, level=1, printquery=False):
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    searchstring = query.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid)
    if printquery: print(searchstring)
    df = dsl.query(searchstring, verbose=False).as_dataframe()
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df

For example, let’s try it out with our GRID ID:

[61]:
get_collaborators(GRIDID, printquery=True)
search publications for "nanotechnology"
               where year in [2000:2016]
               and research_orgs.id="grid.412125.1"
            return research_orgs limit 11
[61]:
id count types acronym longitude country_name name city_name linkout latitude state_name id_from level
0 grid.412125.1 1186 [Education] KAU 39.250280 Saudi Arabia King Abdulaziz University Jeddah [http://www.kau.edu.sa/home_english.aspx] 21.493889 NaN grid.412125.1 1
1 grid.261112.7 84 [Education] NU -71.089180 United States Northeastern University Boston [http://www.northeastern.edu/] 42.339830 Massachusetts grid.412125.1 1
2 grid.116068.8 62 [Education] MIT -71.092110 United States Massachusetts Institute of Technology Cambridge [http://web.mit.edu/] 42.359820 Massachusetts grid.412125.1 1
3 grid.38142.3c 60 [Education] NaN -71.116650 United States Harvard University Cambridge [http://www.harvard.edu/] 42.377052 Massachusetts grid.412125.1 1
4 grid.412621.2 40 [Education] QAU 73.138885 Pakistan Quaid-i-Azam University Islamabad [http://www.qau.edu.pk/] 33.747223 NaN grid.412125.1 1
5 grid.411340.3 39 [Education] AMU 78.077850 India Aligarh Muslim University Aligarh [http://www.amu.ac.in/] 27.917370 Uttar Pradesh grid.412125.1 1
6 grid.56302.32 36 [Education] KSU 46.645840 Saudi Arabia King Saud University Riyadh [http://ksu.edu.sa/en/] 24.723982 NaN grid.412125.1 1
7 grid.33003.33 35 [Education] NaN 32.275000 Egypt Suez Canal University Ismailia [http://scuegypt.edu.eg/ar/] 30.622778 NaN grid.412125.1 1
8 grid.411818.5 34 [Education] NaN 77.280150 India National Islamic University New Delhi [http://jmi.ac.in/] 28.561607 NaN grid.412125.1 1
9 grid.411320.5 33 [Education] NaN 39.202843 Turkey Fırat University Elâzığ [https://yeni.firat.edu.tr/] 38.679900 NaN grid.412125.1 1
10 grid.412144.6 31 [Education] KKU 42.559700 Saudi Arabia King Khalid University Abhā [http://www.kku.edu.sa/] 18.249500 NaN grid.412125.1 1

4. Building a network of any size

What if we want to retrieve the collaborators of the collaborators?

In other words, what if we want to generate a larger network, which includes the institutions linked to the collaborating institutions of King Abdulaziz University? If we think of our collaboration data as a graph structure with nodes and edges, we can see that the get_collaborators function defined above is limited: it only retrieves the objects directly linked to the ‘seed’ GRID. Instead, we want to run the same analysis for any GRID ID in our results, iteratively, so as to generate an N-degree network where N is chosen by us.

To this purpose, we can set up a recursive function that essentially repeats the get_collaborators call as many times as needed. A few key points to note:

  • The maxlevel parameter determines how big our network should be (1 = neighbours only, 2 = collaborators of neighbours, etc.).

  • We pause 1 second after each iteration to avoid hitting the normal Analytics API quota (~30 requests per minute).

  • The function can generate lots of data! E.g. calling this function with maxlevel=5 will lead to ~10k queries. (You can get a rough estimate via the formula 10 to the power of maxlevel-1: 10 is the number of orgs we extract per iteration, and maxlevel is the number of iterations, minus the first one, which generates no extra queries.)
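The query-count estimate can be made concrete with a small back-of-the-envelope helper (not part of the tutorial code): the seed triggers 1 query, and each node at a given level triggers 10 new queries at the next one, ignoring any duplicate organizations across branches.

```python
def estimate_queries(maxlevel, fanout=10):
    """Rough upper bound on the number of API calls made by the
    recursive crawl: 1 for the seed, then `fanout` new calls per
    node at each deeper level (duplicates across branches ignored)."""
    return sum(fanout ** level for level in range(maxlevel))

for n in (1, 2, 3, 5):
    # with the 1-second pause, this is also a minimum runtime in seconds
    print(n, estimate_queries(n))  # 1 1 / 2 11 / 3 111 / 5 11111
```

So maxlevel=2 costs about 11 queries and a dozen seconds, while maxlevel=5 costs roughly 10k queries and over three hours of sleep time alone.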

[62]:
def looper(seed, maxlevel=1, thislevel=1):
    "Recursive function for building an organization collaboration network"
    collaborators = get_collaborators(seed, thislevel)
    time.sleep(1)
    print("--" * thislevel, seed, " :: level =", thislevel)
    if thislevel < maxlevel:
        # first remove the originating grid
        gridslist = list(collaborators[collaborators['id'] != GRIDID]['id'])
        extra_data = [looper(x, maxlevel, thislevel+1) for x in gridslist]
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        return pd.concat([collaborators] + extra_data)
    else:
        # finally
        return collaborators

Let’s try this out.

We can construct a two-degree collaboration network starting from King Abdulaziz University. We extract 10 organizations per node, so our network will have ~100 nodes at the end!

[63]:
collaborators = looper(GRIDID, maxlevel=2)
# change column order for readability purposes
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'state_name', 'country_name', 'latitude', 'longitude', 'linkout',  'types' ]]
collaborators.head()
-- grid.412125.1  :: level = 1
---- grid.261112.7  :: level = 2
---- grid.116068.8  :: level = 2
---- grid.38142.3c  :: level = 2
---- grid.412621.2  :: level = 2
---- grid.411340.3  :: level = 2
---- grid.56302.32  :: level = 2
---- grid.33003.33  :: level = 2
---- grid.411818.5  :: level = 2
---- grid.411320.5  :: level = 2
---- grid.412144.6  :: level = 2
[63]:
id_from id_to level count name acronym city_name state_name country_name latitude longitude linkout types
0 grid.412125.1 grid.412125.1 1 1186 King Abdulaziz University KAU Jeddah NaN Saudi Arabia 21.493889 39.250280 [http://www.kau.edu.sa/home_english.aspx] [Education]
1 grid.412125.1 grid.261112.7 1 84 Northeastern University NU Boston Massachusetts United States 42.339830 -71.089180 [http://www.northeastern.edu/] [Education]
2 grid.412125.1 grid.116068.8 1 62 Massachusetts Institute of Technology MIT Cambridge Massachusetts United States 42.359820 -71.092110 [http://web.mit.edu/] [Education]
3 grid.412125.1 grid.38142.3c 1 60 Harvard University NaN Cambridge Massachusetts United States 42.377052 -71.116650 [http://www.harvard.edu/] [Education]
4 grid.412125.1 grid.412621.2 1 40 Quaid-i-Azam University QAU Islamabad NaN Pakistan 33.747223 73.138885 [http://www.qau.edu.pk/] [Education]

5. Visualizing the network

In order to get an overview of the network data we can build a visualization using the pyvis library. In particular, to quickly identify the key players in the network, we can make the size of each node proportional to its proximity to our ‘seed’ organization, and the width of each edge proportional to the strength of the collaboration (= how many publications two orgs have in common).

A custom version of pyvis is already included in dimcli.core.extras and is called NetworkViz (note: this custom version only fixes a bug that prevents pyvis graphs from being displayed in Google Colab).

This is what the code below does:

  • After creating a NetworkViz object, we fill it in with the add_node and add_edge methods. The full list of attributes for nodes and edges is described in the pyvis documentation.

  • We generate colors for the chart, using the built-in plotly color scales. Try changing them!

  • The repulsion parameter is set to 300, but for bigger charts you may want to increase it.

  • Tip: by experimenting with the way node sizes/colors are derived from the underlying data, it is possible to highlight different dimensions, e.g. the countries or types of the organizations.
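As a sketch of that tip, node colors could be derived from countries instead of levels. The helper below is hypothetical (not part of the tutorial code) and uses a plain tuple of hex colors; with plotly loaded you could substitute e.g. px.colors.qualitative.Set2.

```python
def country_color_map(countries, palette=("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")):
    """Assign each distinct country a color, cycling through the palette."""
    mapping = {}
    for country in countries:
        if country not in mapping:
            mapping[country] = palette[len(mapping) % len(palette)]
    return mapping

colors = country_color_map(["Saudi Arabia", "United States", "United States", "Pakistan"])
print(colors)  # {'Saudi Arabia': '#1f77b4', 'United States': '#ff7f0e', 'Pakistan': '#2ca02c'}
```

Inside build_visualization you would then build the map from `nodes['country_name']` and pass `color = colors[row['country_name']]` to add_node instead of the level-based color.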

[64]:
# load custom version of pyvis
from dimcli.core.extras import NetworkViz


def build_visualization(collaborator_df):
    """
    Return a network visualization object from a collaborators dataframe
    The object can be then displayed/saved with `g.show(f"network.html")`
    """

    # set up dataviz
    g = NetworkViz(notebook=True, width="100%", height="800px")
    g.toggle_hide_edges_on_drag(False)
    g.barnes_hut()
    g.repulsion(300)
    # g.show_buttons() # in html-standalone mode, this command show viz controls

    #
    # create nodes and edges
    #

    # remove duplicates from nodes
    nodes = collaborator_df.drop_duplicates(subset ="id_to", keep = 'first')
    # remove internal collaborations stats
    edges = collaborator_df[(collaborator_df['id_to'] != collaborator_df['id_from'])]

    # reuse plotly color palette
    palette = px.colors.diverging.Temps

    #
    # add nodes
    #

    for index, row in nodes.iterrows():

        # calc size based on level
        maxsize = int(nodes['level'].max()) + 1
        if row['id_to'] == GRIDID:
            size = maxsize
        else:
            size = maxsize - row['level']

        # calc color based on level
        if row['id_to'] == GRIDID:
            color = palette[0]
        else:
            color = palette[row['level'] * 2]

        g.add_node(
            n_id = row['id_to'],
            label = row['name'],
            title = f"<h4>{row['name']}<br>{row['city_name']}, {row['country_name']}<br> - {row['id_to']}</h4>",
            value = size,
            color = color,
            borderWidthSelected = 5,
            shape = "dot",
        )

    # store the max value for normalization operations later
    edges_maxcount = edges['count'].max()

    #
    # add edges
    #

    for index, row in edges.iterrows():
      g.add_edge(row['id_from'], row['id_to'],
                 value = float(row['count']) / edges_maxcount,
                 label=int(row['count']),
                 arrows="none"
                )

    # add tooltips with adjacent links info
    neighbor_map = g.get_adj_list()
    for node in g.nodes:
        neigh = neighbor_map[node["id"]]
        labels = [nodes[nodes['id_to'] == x].iloc[0]['name'] for x in neigh]
        node["title"] += "Links:<li>" + "</li><li>".join(labels)

    return g

# run the function
g = build_visualization(collaborators)
g.show(f"network_{GRIDID}.html")
[64]:

6. Addendum: showing only ‘Government’ collaborators

What if we want to show a collaboration network only for a specific GRID organization type? A simple facet query will show us what types are available:

[65]:
%dsldf search organizations return types
Returned Types: 13
[65]:
id count
0 Company 27922
1 Education 19431
2 Healthcare 12370
3 Nonprofit 11991
4 Facility 8420
5 Other 7832
6 Government 5694
7 Archive 2699
8 Education,Company 2
9 Education,Facility 2
10 Education,Healthcare 2
11 Archive,Nonprofit 1
12 Education,Other 1

Now we can pick one of those types and prefilter the results of the get_collaborators function above.

In order to do this we can modify the API query so that:

  • organizations with the selected type are included: research_orgs.types in ["{}"]

  • more results get returned: return research_orgs limit 50 to ensure we still have enough results after removing the ones that don’t have the chosen ‘type’

One final step

The query by itself is not enough to get what we want though, because the filter research_orgs.types in ["{}"] will also return publications with multiple authors/affiliations where only one of them has the desired ‘type’.

So an extra step is required and this is achieved via the keep_type function below. This function simply filters out all unwanted organizations data after they’re retrieved from the API.
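The filtering logic is easy to verify on mock data, without any API call. The records below mimic the research_orgs payload (the GRID IDs are taken from the result tables in this notebook):

```python
seed = "grid.412125.1"
mock_orgs = [
    {"id": "grid.412125.1", "types": ["Education"]},   # the seed: always kept
    {"id": "grid.4886.2",   "types": ["Government"]},  # kept: matching type
    {"id": "grid.38142.3c", "types": ["Education"]},   # dropped: wrong type
]

# same condition used inside keep_type below
kept = [o for o in mock_orgs if o["id"] == seed or "Government" in o["types"]]
print([o["id"] for o in kept])  # ['grid.412125.1', 'grid.4886.2']
```

Note that the seed organization is kept regardless of its type, so the chart always has its central node.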

That’s it! Run the cell below to generate a new visualization.

[66]:
#@markdown Try using one of the organization types from the list above

GRID_TYPE = "Government" #@param {type:"string"}

query = """search publications {}
               where year in [{}:{}]
               and research_orgs.id="{}"
               and research_orgs.types in ["{}"]
            return research_orgs limit 50"""

def keep_type(data, a_type, orgid):
    clean_list = []
    for x in data.research_orgs:
        # include also originating GRID to ensure chart is complete
        if x['id'] == orgid or a_type in x['types']:
            clean_list.append(x)
    data.json['research_orgs'] = clean_list
    return data

def get_collaborators(orgid, level=1, printquery=False):
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    searchstring = query.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid, GRID_TYPE)
    if printquery: print(searchstring)
    data = dsl.query(searchstring, verbose=False)
    # new bit for including only types we want
    data = keep_type(data, GRID_TYPE, orgid)
    df = data.as_dataframe()
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df


# RUN THE RECURSIVE QUERY (same code as above)

collaborators = looper(GRIDID, maxlevel=2)
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'country_name', 'latitude', 'longitude', 'linkout',  'types' ]]

# BUILD VIZ

g = build_visualization(collaborators)
g.show(f"network_{GRIDID}_{GRID_TYPE}.html")


-- grid.412125.1  :: level = 1
---- grid.4886.2  :: level = 2
---- grid.9227.e  :: level = 2
---- grid.20256.33  :: level = 2
---- grid.7327.1  :: level = 2
---- grid.1016.6  :: level = 2
---- grid.14467.30  :: level = 2
[66]:

7. Conclusions

In this tutorial we have demonstrated how to generate an organization ‘collaborations network diagram’ using the Dimensions API. Starting from a research organization, we extracted information about other collaborating organizations, based on shared publications data, a topic and a time-frame.

An example of the resulting network diagram can be seen here.

Here are some ideas for further experimentation:

  • try changing the initial publications query so as to include other parameters. The publications API is rich, so there are many ways to fine-tune your analysis

  • try increasing the number of iterations using the maxlevel parameter

  • try customizing the resulting network diagram, e.g. to highlight nodes and edges based on different criteria like countries or years.



Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
