Building an Organizations Collaboration Network Diagram

This notebook shows how to analyse organizations collaboration data using the Organizations data source available via the Dimensions Analytics API.

Starting from a research organization, we will extract information about other organizations that collaborated with it, based on shared publications data.

To make the analysis more focused, we will also select a topic and a time frame. Applying these extra constraints reduces the number of shared publications to process and makes the overall extraction faster.

At the end of the tutorial we will generate a ‘collaborations network diagram’. The diagram nodes represent the organizations working together, while the edges represent the number of publications they have in common. An example of the resulting network diagram can be seen here.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Aug 22, 2023
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli plotly tqdm pyvis -U --quiet

#
# load libraries
import dimcli
from dimcli.utils import *

import json, sys, time
import pandas as pd
from tqdm.notebook import tqdm as progress
import plotly.express as px  # plotly>=4.8.1
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.7
Method: dsl.ini file

1. Choose an Organization and a keyword (topic)

For the purpose of this exercise, we will use grid.412125.1 (King Abdulaziz University, Saudi Arabia).

You can try using a different GRID ID to see how results change, e.g. by browsing for another GRID organization.

[3]:
GRIDID = "grid.412125.1" #@param {type:"string"}

#@markdown The start/end year of publications used to extract collaborators
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

#@markdown ---
#@markdown A keyword used to filter publications search
TOPIC = "nanotechnology" #@param {type:"string"}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # fall back to an empty name if the lookup fails
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {}'.format(YEAR_START, YEAR_END)))
display(HTML('Topic: "{}" <br /><br />'.format(TOPIC)))

Time period: 2000 to 2016
Topic: "nanotechnology"

2. Building a one-degree network of collaborating institutions

We can use the publications API to find the top 10 collaborating institutions based on the parameters above, via a single query.

The get_collaborators function below fills out a templated query with the relevant bits and runs it. Then it transforms the results into a pandas dataframe, which will make it easier to process the data later on.

[4]:
query_template = """search publications {}
                   where year in [{}:{}]
                   and research_orgs.id="{}"
                return research_orgs limit 11"""

def get_collaborators(orgid, level=1, printquery=False):
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    # fill in the blanks in the query_template
    query_full = query_template.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid)
    if printquery: print(query_full)
    df = dsl.query(query_full, verbose=False).as_dataframe()
    # add extra columns
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df

Note:

  • Extra columns. The resulting dataframe contains two extra columns: a) id_from, which is the ‘seed’ institution we start from; b) level, an optional parameter representing the network depth of the query (we’ll see later how it is used with recursive querying).

  • Self-collaboration. The query returns 11 records - that’s because the first one is normally the seed GRID (due to internal collaborations) which we will omit from the results.

  • Custom changes. Lastly, it’s important to remember that this step can be easily customised by changing the query_template structure. For example, we could focus on specific research areas (using FOR codes), or set a threshold based on citation counts. The possibilities are endless!
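
As a rough illustration of the self-collaboration point, the snippet below drops the seed row from a list of result records and keeps the remaining top collaborators. It works on plain dictionaries rather than on the dataframe returned by get_collaborators. The drop_seed helper is hypothetical (it is not part of dimcli), and the three sample rows are copied from the example output for grid.412125.1 shown in this tutorial.

```python
def drop_seed(rows, seed_id, top_n=10):
    """Return up to top_n collaborator records, excluding the seed organization."""
    others = [r for r in rows if r["id"] != seed_id]
    return others[:top_n]


# sample rows taken from the tutorial output (id, name, shared publications count)
rows = [
    {"id": "grid.412125.1", "name": "King Abdulaziz University", "count": 1444},
    {"id": "grid.261112.7", "name": "Northeastern University", "count": 106},
    {"id": "grid.38142.3c", "name": "Harvard University", "count": 98},
]

collaborators_only = drop_seed(rows, "grid.412125.1")
print([r["name"] for r in collaborators_only])
```

With a dataframe, the same effect is obtained by filtering on the id column, which is what the visualization code later does when it removes internal collaborations.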

For example, let’s try it out with our GRID ID:

[5]:
get_collaborators(GRIDID, printquery=True)
search publications for "nanotechnology"
                   where year in [2000:2016]
                   and research_orgs.id="grid.412125.1"
                return research_orgs limit 11
[5]:
id name acronym city_name count country_name latitude linkout longitude types state_name id_from level
0 grid.412125.1 King Abdulaziz University KAU Jeddah 1444 Saudi Arabia 21.493889 [http://www.kau.edu.sa/home_english.aspx] 39.250280 [Education] NaN grid.412125.1 1
1 grid.261112.7 Northeastern University NU Boston 106 United States 42.339830 [http://www.northeastern.edu/] -71.089180 [Education] Massachusetts grid.412125.1 1
2 grid.38142.3c Harvard University NaN Cambridge 98 United States 42.377052 [http://www.harvard.edu/] -71.116650 [Education] Massachusetts grid.412125.1 1
3 grid.116068.8 Massachusetts Institute of Technology MIT Cambridge 73 United States 42.359820 [http://web.mit.edu/] -71.092110 [Education] Massachusetts grid.412125.1 1
4 grid.16753.36 Northwestern University NU Evanston 59 United States 42.054850 [http://www.northwestern.edu/] -87.673940 [Education] Illinois grid.412125.1 1
5 grid.413735.7 Harvard–MIT Division of Health Sciences and Te... HST Cambridge 58 United States 42.361780 [http://hst.mit.edu/] -71.086914 [Education] Massachusetts grid.412125.1 1
6 grid.411340.3 Aligarh Muslim University AMU Aligarh 47 India 27.917370 [http://www.amu.ac.in/] 78.077850 [Education] Uttar Pradesh grid.412125.1 1
7 grid.412621.2 Quaid-i-Azam University QAU Islamabad 47 Pakistan 33.747223 [http://www.qau.edu.pk/] 73.138885 [Education] NaN grid.412125.1 1
8 grid.33003.33 Suez Canal University NaN Ismailia 42 Egypt 30.622778 [http://scuegypt.edu.eg/ar/] 32.275000 [Education] NaN grid.412125.1 1
9 grid.411818.5 Jamia Millia Islamia JMI New Delhi 42 India 28.561607 [http://jmi.ac.in/] 77.280150 [Education] NaN grid.412125.1 1
10 grid.56302.32 King Saud University KSU Riyadh 42 Saudi Arabia 24.723982 [http://ksu.edu.sa/en/] 46.645840 [Education] NaN grid.412125.1 1
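
The "custom changes" idea mentioned in the notes above can be sketched as a small query builder that adds an optional citation threshold. This is only an illustration: build_query and its parameters are hypothetical names, and the snippet assumes that times_cited is a filterable field of the publications source, so please check the API documentation before relying on it. The function only assembles a string and does not call the API.

```python
def build_query(orgid, year_start, year_end, topic="", min_citations=None, limit=11):
    """Assemble a DSL query string similar to query_template, with an optional citation threshold."""
    topic_clause = f'for "{topic}" ' if topic else ""
    filters = [
        f"year in [{year_start}:{year_end}]",
        f'research_orgs.id="{orgid}"',
    ]
    if min_citations is not None:
        # ASSUMPTION: `times_cited` is a filterable publications field
        filters.append(f"times_cited >= {min_citations}")
    where_clause = "\n   and ".join(filters)
    return (
        f"search publications {topic_clause}\n"
        f" where {where_clause}\n"
        f"return research_orgs limit {limit}"
    )


example_query = build_query("grid.412125.1", 2000, 2016,
                            topic="nanotechnology", min_citations=10)
print(example_query)
```

The resulting string could be passed to dsl.query in place of the filled-in query_template.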

3. Building a network of any size

What if we want to retrieve the collaborators of the collaborators? In other words, what if we want to generate a larger network?

If we think of our collaboration data as a graph structure with nodes and edges, we can see that the get_collaborators function defined above is limited. That’s because it only returns the objects directly linked to the ‘seed’ GRID organization.

We would like to run the same collaborators-extraction step iteratively for any GRID ID in our results, so as to generate an N-degrees network where N is chosen by us.

For this purpose, we can set up a recursive function. This function essentially repeats the get_collaborators function as many times as needed. Here’s what it looks like:

[6]:
def recursive_network(seed, maxlevel=1, thislevel=1):
    "Recursive function for building an organization collaboration network"
    results = get_collaborators(seed, thislevel)
    time.sleep(1)
    print("--" * thislevel, seed, " :: level =", thislevel)
    if thislevel < maxlevel and len(results) > 0:
        # remove the originating grid-id and the current seed, so neither is queried again
        gridslist = list(results[~results['id'].isin([GRIDID, seed])]['id'])
        next_level_results = [recursive_network(x, maxlevel, thislevel+1) for x in gridslist]
        # pd.concat raises a ValueError on an empty list, so merge only when there is data
        if next_level_results:
            results = pd.concat([results] + next_level_results)
    return results

A few key points to note:

  • Recursion depth. The maxlevel parameter determines how big our network should be (1 = neighbours only, 2 = collaborators of neighbours, etc.)

  • API quota. We pause 1 second after each iteration to avoid hitting the normal Analytics API quota (~30 requests per minute)

  • Data size. The function can generate lots of data! E.g. calling this function with maxlevel=5 will lead to roughly 10k queries at the deepest level alone (note: you can get a rough estimate of the number of queries via the formula 10 to the power of maxlevel-1. That’s because 10 is the number of orgs we extract per iteration, and the exponent is maxlevel-1 because the first level consists of the single seed query).
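
To make the data-size estimate more concrete, here is a small sketch that counts the queries level by level, assuming that every queried organization yields exactly 10 new organizations (in practice there may be fewer). The helper names estimate_queries and estimate_minutes are hypothetical. The time estimate only accounts for the 1-second pause, so it is a lower bound.

```python
def estimate_queries(maxlevel, per_node=10):
    """Total API calls if every queried organization yields per_node new ones."""
    # level 1 is the single seed query, level 2 adds per_node queries, and so on
    return sum(per_node ** depth for depth in range(maxlevel))


def estimate_minutes(maxlevel, per_node=10, seconds_per_query=1.0):
    """Lower bound on the running time, counting only the pause between queries."""
    return estimate_queries(maxlevel, per_node) * seconds_per_query / 60


for level in range(1, 6):
    print(level, estimate_queries(level), round(estimate_minutes(level), 1))
```

The total is dominated by the deepest level, which is why 10 to the power of maxlevel-1 works as a quick approximation.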

Let’s try this out.

We can construct a 2-degrees collaboration network starting from King Abdulaziz University. We are extracting 10 organizations per node, so our network will have up to ~100 nodes at the end (fewer when the same organization appears under several parents).

[7]:
collaborators = recursive_network(GRIDID, maxlevel=2)
# change column order for readability purposes
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'state_name', 'country_name', 'latitude', 'longitude', 'linkout',  'types' ]]
collaborators.head()
-- grid.412125.1  :: level = 1
---- grid.261112.7  :: level = 2
---- grid.38142.3c  :: level = 2
---- grid.116068.8  :: level = 2
---- grid.16753.36  :: level = 2
---- grid.413735.7  :: level = 2
---- grid.411340.3  :: level = 2
---- grid.412621.2  :: level = 2
---- grid.33003.33  :: level = 2
---- grid.411818.5  :: level = 2
---- grid.56302.32  :: level = 2
[7]:
id_from id_to level count name acronym city_name state_name country_name latitude longitude linkout types
0 grid.412125.1 grid.412125.1 1 1444 King Abdulaziz University KAU Jeddah NaN Saudi Arabia 21.493889 39.25028 [http://www.kau.edu.sa/home_english.aspx] [Education]
1 grid.412125.1 grid.261112.7 1 106 Northeastern University NU Boston Massachusetts United States 42.339830 -71.08918 [http://www.northeastern.edu/] [Education]
2 grid.412125.1 grid.38142.3c 1 98 Harvard University NaN Cambridge Massachusetts United States 42.377052 -71.11665 [http://www.harvard.edu/] [Education]
3 grid.412125.1 grid.116068.8 1 73 Massachusetts Institute of Technology MIT Cambridge Massachusetts United States 42.359820 -71.09211 [http://web.mit.edu/] [Education]
4 grid.412125.1 grid.16753.36 1 59 Northwestern University NU Evanston Illinois United States 42.054850 -87.67394 [http://www.northwestern.edu/] [Education]
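
With deeper networks, the same organization can show up under several parents and would then be queried more than once. The sketch below shows one possible alternative to the recursive approach: a breadth-first traversal that remembers which organizations were already queried. It is not the method used in this notebook. The fetch argument is a hypothetical stand-in for get_collaborators that returns a list of ids, and the toy data is made up so that the example runs without API access.

```python
from collections import deque


def build_network(seed, fetch, maxlevel=1):
    """Breadth-first variant: each organization is queried at most once."""
    edges = []
    visited = {seed}
    queue = deque([(seed, 1)])
    while queue:
        org, level = queue.popleft()
        for other in fetch(org):
            if other == org:
                continue  # skip self-collaboration
            edges.append((org, other, level))
            # only expand organizations that have not been queried yet
            if level < maxlevel and other not in visited:
                visited.add(other)
                queue.append((other, level + 1))
    return edges


# hypothetical toy data standing in for API results
TOY = {"A": ["A", "B", "C"], "B": ["B", "A", "C"], "C": ["C", "A"]}
calls = []


def fake_fetch(org):
    calls.append(org)  # record each simulated API call
    return TOY.get(org, [])


edges = build_network("A", fake_fetch, maxlevel=2)
unique_nodes = {node for edge in edges for node in edge[:2]}
print(len(edges), "edges,", len(unique_nodes), "unique nodes,", len(calls), "queries")
```

The trade-off is that the level recorded for an organization is the first one at which it was reached, while the recursive version keeps one row per parent and level.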

4. Visualizing the network

To get an overview of the network data, we can visualize it using the Python pyvis library. A custom version of pyvis is already included in dimcli.core.extras and is called NetworkViz (note: this custom version only fixes a bug that prevents pyvis graphs from being displayed online with Google Colab).

Network visualizations can be very complex, but to begin with we can focus on representing two key aspects:

  • Core collaborators. The size of the nodes should be proportional to the proximity to our ‘seed’ organization. This will make it easier to quickly identify the key players in the network

  • Number of publications. The thickness of the edges should be proportional to the strength of the collaboration (= how many publications two orgs have in common)
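
The two aspects above boil down to two small calculations, sketched here in isolation. The helper names node_size and edge_weights are hypothetical; the logic mirrors what the visualization code in this section does when it sets the value of nodes and edges. The sample counts are the first three collaborator counts from the output shown earlier.

```python
def node_size(level, max_level, is_seed=False):
    """Nodes closer to the seed get a larger size; the seed is the largest."""
    maxsize = max_level + 1
    return maxsize if is_seed else maxsize - level


def edge_weights(counts):
    """Scale publication counts to the 0..1 range, relative to the largest count."""
    top = max(counts)
    return [count / top for count in counts]


print(node_size(1, max_level=2, is_seed=True))  # seed organization
print(node_size(1, max_level=2))                # direct collaborator
print(node_size(2, max_level=2))                # collaborator of a collaborator
print(edge_weights([106, 98, 73]))
```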

In a nutshell, this is what the code below does:

  • After creating a Network object (the pyvis class imported in the cell below), we add nodes and edges from our dataset using the add_node and add_edge methods.

  • The network repulsion parameter is set to 300 via g.repulsion(300), but for bigger charts you may want to increase it.

  • Nodes and edges in pyvis can have a number of attributes. The full list of attributes can be found in the pyvis documentation.

  • In order to have some nice colors, we take advantage of the built-in plotly color scales. Try changing them!

[8]:
# load pyvis
from pyvis.network import Network


def build_visualization(collaborator_df):
    """
    Return a network visualization object from a collaborators dataframe
    The object can be then displayed/saved with `g.show(f"network.html")`
    """

    # set up dataviz
    g = Network(notebook=True, width="100%", height="800px",cdn_resources="remote",
            neighborhood_highlight=True,
            select_menu=True)
    g.toggle_hide_edges_on_drag(False)
    g.barnes_hut()
    g.repulsion(300)
    # reuse plotly color palette
    palette = px.colors.diverging.Temps
    # g.show_buttons() # in html-standalone mode, this command shows viz controls


    #
    # create nodes and edges
    #

    # remove duplicates from nodes
    nodes = collaborator_df.drop_duplicates(subset ="id_to", keep = 'first')
    # remove internal collaborations stats
    edges = collaborator_df[(collaborator_df['id_to'] != collaborator_df['id_from'])]


    #
    # add nodes
    #

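    # the largest node size is one more than the deepest level in the data
    maxsize = int(nodes['level'].max()) + 1

    for index, row in nodes.iterrows():

        # calc size based on level
        if row['id_to'] == GRIDID:
            size = maxsize
        else:
            size = maxsize - row['level']

        # calc color based on level (clamped, so deep levels don't overrun the palette)
        if row['id_to'] == GRIDID:
            color = palette[0]
        else:
            color = palette[min(int(row['level']) * 2, len(palette) - 1)]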

        g.add_node(
            n_id = row['id_to'],
            label = row['name'],
            title = f"<h4>{row['name']}<br>{row['city_name']}, {row['country_name']}<br> - {row['id_to']}</h4>",
            value = size,
            color = color,
            borderWidthSelected = 5,
            shape = "dot",
        )


    #
    # add edges
    #

    edges_maxcount = edges['count'].max()

    for index, row in edges.iterrows():
      g.add_edge(row['id_from'], row['id_to'],
                 value = float(row['count']) / edges_maxcount,
                 label=int(row['count']),
                 arrows="none"
                )

    # add tooltips with adjacent links info
    neighbor_map = g.get_adj_list()
    for node in g.nodes:
        neigh = neighbor_map[node["id"]]
        labels = [nodes[nodes['id_to'] == x].iloc[0]['name'] for x in neigh]
        node["title"] += "Links:<li>" + "</li><li>".join(labels) + "</li>"

    return g

#
# finally, run the viz builder
#
g = build_visualization(collaborators)
g.show(f"network_{GRIDID}.html")
network_grid.412125.1.html
[8]:

5. Addendum: showing only ‘Government’ collaborators

What if we want to show a collaboration network focusing only on ‘government’ organizations?

That’s pretty easy to do, since the GRID database includes information about organization types. We can easily see what types are available using the API and a facet query:

[9]:
%dsldf search organizations return types
Returned Types: 9
Time: 1.00s
[9]:
   id                 count
0  Company            30742
1  Education          20761
2  Nonprofit          17573
3  Healthcare         13926
4  Facility           10168
5  Government          6580
6  Other               4017
7  Archive             2926
8  Education,Company      1

The steps are the following:

  • New query filter. We rewrite the get_collaborators function we created in section 2 above, so that the API query includes a filter for organizations with the selected type only: .. and research_orgs.types in ["Government"]...

  • Get more results. We increase the number of results returned: ..return research_orgs limit 50. This is to ensure we still have enough results after removing the ones that don’t have the chosen ‘type’

  • Remove unwanted data. The new query filter research_orgs.types in ["{}"] will also return publications with multiple authors/affiliations, even when only one of them has the desired ‘type’. So an extra step is required, and this is achieved via the keep_only_type function below. This function simply filters out all unwanted organizations data after they’re retrieved from the API.
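
The filtering step can be illustrated on plain dictionaries, without calling the API. This is a minimal sketch: keep_orgs_of_type is a hypothetical helper that follows the same idea as the function used in the notebook, and the records below are made-up placeholders, not real GRID entries.

```python
def keep_orgs_of_type(orgs, wanted_type, seed_id):
    """Keep the seed organization plus every organization tagged with wanted_type."""
    return [
        org for org in orgs
        if org["id"] == seed_id or wanted_type in org.get("types", [])
    ]


# hypothetical records; ids and names are placeholders
orgs = [
    {"id": "org.seed", "name": "Seed University", "types": ["Education"]},
    {"id": "org.gov", "name": "National Lab", "types": ["Government"]},
    {"id": "org.edu", "name": "Other University", "types": ["Education"]},
    {"id": "org.untyped", "name": "Unknown Institute"},
]

kept = keep_orgs_of_type(orgs, "Government", "org.seed")
print([org["id"] for org in kept])
```

The seed is kept regardless of its type so that the chart still has a central node to attach the edges to.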

That’s it! Run the cell below to generate a new visualization showing only “Government” collaborators. Or try changing the value of GRID_TYPE to see different results.

[10]:
#@markdown Try using one of the organization types from the list above

GRID_TYPE = "Government" #@param {type:"string"}

query = """search publications {}
               where year in [{}:{}]
               and research_orgs.id="{}"
               and research_orgs.types in ["{}"]
            return research_orgs limit 50"""

def keep_only_type(data, a_type, orgid):
    clean_list = []
    for x in data.research_orgs:
        # include also originating GRID to ensure chart is complete
        if x['id'] == orgid or a_type in (x.get('types') or []):
            clean_list.append(x)
    data.json['research_orgs'] = clean_list
    return data


def get_collaborators(orgid, level=1, printquery=False):
    "New version that filters using org types as well"
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    # include also the GRID_TYPE
    query_full = query.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid, GRID_TYPE)
    if printquery: print(query_full)
    data = dsl.query(query_full, verbose=False)
    # remove results with unwanted types (use the selected GRID_TYPE, not a hard-coded value)
    data = keep_only_type(data, GRID_TYPE, orgid)
    df = data.as_dataframe()
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df


#
# RUN THE RECURSIVE QUERY (same code as above)
#
collaborators = recursive_network(GRIDID, maxlevel=2)
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'country_name', 'latitude', 'longitude', 'linkout',  'types' ]]

#
# BUILD VIZ
#
g = build_visualization(collaborators)
g.show(f"network_{GRIDID}_{GRID_TYPE}.html")


-- grid.412125.1  :: level = 1
---- grid.7327.1  :: level = 2
---- grid.9227.e  :: level = 2
---- grid.20256.33  :: level = 2
---- grid.1089.0  :: level = 2
---- grid.14467.30  :: level = 2
network_grid.412125.1_Government.html
[10]:

Conclusions

In this tutorial we have demonstrated how to generate an organization ‘collaborations network diagram’ using the Dimensions API. Starting from a research organization, we extracted information about other collaborating organizations, based on shared publications data, a topic and a time-frame.

An example of the resulting network diagram can be seen here.

Here are some ideas for further experimentation:

  • try changing the initial publications query to include other parameters. The publications API is rich, so there are many ways to fine-tune your analysis

  • try increasing the number of iterations using the maxlevel parameter

  • try customizing the resulting network diagram, e.g. to highlight nodes and edges based on different criteria like countries or years.
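
As a starting point for the last idea, here is a minimal sketch that assigns one colour per country. The colour_by_country helper and the hex codes are hypothetical; the country names come from the example output in this tutorial. The resulting mapping could be used for the color argument of g.add_node instead of the level-based colour.

```python
from itertools import cycle


def colour_by_country(countries, palette):
    """Map each distinct country to a colour, reusing colours if the palette runs out."""
    colours = cycle(palette)
    mapping = {}
    for country in countries:
        if country not in mapping:
            mapping[country] = next(colours)
    return mapping


# hypothetical palette; any list of colour strings works
palette = ["#1b9e77", "#d95f02", "#7570b3"]
countries = ["Saudi Arabia", "United States", "United States", "India", "Pakistan"]

country_colours = colour_by_country(countries, palette)
print(country_colours)
```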



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
