Building an Organizations Collaboration Network Diagram
This notebook shows how to analyse collaboration data between organizations using the Organizations data source available via the Dimensions Analytics API.
Starting from a research organization, we will extract information about the other organizations that collaborated with it, based on shared publications data.
To make the analysis more focused, we will also select a topic and a time-frame. These extra constraints reduce the number of shared publications to process and make the overall extraction faster.
At the end of the tutorial we will generate a ‘collaborations network diagram’. The nodes of the diagram represent the organizations working together, while the edges represent the number of publications they have in common. An example of the resulting network diagram can be seen here.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Aug 22, 2023
==
Prerequisites
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli plotly tqdm pyvis -U --quiet
#
# load libraries
import dimcli
from dimcli.utils import *
import json, sys, time
import pandas as pd
from tqdm.notebook import tqdm as progress
import plotly.express as px # plotly>=4.8.1
if 'google.colab' not in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.1)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.7
Method: dsl.ini file
1. Choose an Organization and a keyword (topic)
For the purpose of this exercise, we will use grid.412125.1 (King Abdulaziz University, Saudi Arabia).
You can try using a different GRID ID to see how results change, e.g. by browsing for another GRID organization.
[3]:
GRIDID = "grid.412125.1" #@param {type:"string"}
#@markdown The start/end year of publications used to extract collaborators
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}
#@markdown ---
#@markdown A keyword used to filter publications search
TOPIC = "nanotechnology" #@param {type:"string"}
if YEAR_END < YEAR_START:
    YEAR_END = YEAR_START
#
# gen link to Dimensions
#
try:
    gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
    # fall back to an empty name if the lookup fails or returns nothing
    gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} ⧉</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {}'.format(YEAR_START, YEAR_END)))
display(HTML('Topic: "{}" <br /><br />'.format(TOPIC)))
2. Building a one-degree network of collaborating institutions
We can use the publications API to find the top 10 collaborating institutions based on the parameters above, via a single query.
The get_collaborators function below fills out a templated query with the relevant bits and runs it. Then it transforms the results into a pandas dataframe, which will make it easier to process the data later on.
[4]:
query_template = """search publications {}
where year in [{}:{}]
and research_orgs.id="{}"
return research_orgs limit 11"""
def get_collaborators(orgid, level=1, printquery=False):
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    # fill in the blanks in the query_template
    query_full = query_template.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid)
    if printquery: print(query_full)
    df = dsl.query(query_full, verbose=False).as_dataframe()
    # add extra columns
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df
Note:
Extra columns. The resulting dataframe contains two extra columns: a) id_from, which is the ‘seed’ institution we start from; b) level, an optional parameter representing the network depth of the query (we’ll see later how it is used with recursive querying).
Self-collaboration. The query returns 11 records because the first one is normally the seed GRID (due to internal collaborations), which we will omit from the results.
Custom changes. Lastly, it’s important to remember that this step can be easily customised by changing the query_template structure. For example, we could focus on specific research areas (using FOR codes), or set a threshold based on citation counts. The possibilities are endless!
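As a rough illustration of the citation-threshold idea, the sketch below builds a variant of the query without calling the API. It assumes that publications can be filtered on a `times_cited` field in the `where` clause; `query_template_cited`, `MIN_CITATIONS` and `build_query` are hypothetical names introduced only for this example and are not part of Dimcli.

```python
# Hypothetical variant of query_template with an extra citation filter.
# Assumption: publications expose a `times_cited` field usable in `where`.
query_template_cited = """search publications {}
where year in [{}:{}]
and research_orgs.id="{}"
and times_cited >= {}
return research_orgs limit 11"""

MIN_CITATIONS = 10  # illustrative threshold, not taken from the tutorial


def build_query(orgid, topic, year_start, year_end, min_citations=MIN_CITATIONS):
    """Fill in the template and return the query string (no API call)."""
    topic_clause = f'for "{topic}" ' if topic else ""
    return query_template_cited.format(
        topic_clause, year_start, year_end, orgid, min_citations
    )


print(build_query("grid.412125.1", "nanotechnology", 2000, 2016))
```

Printing the query before running it (as `printquery=True` does in the notebook) is a cheap way to check that the extra clause ended up where expected.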
For example, let’s try it out with our GRID ID:
[5]:
get_collaborators(GRIDID, printquery=True)
search publications for "nanotechnology"
where year in [2000:2016]
and research_orgs.id="grid.412125.1"
return research_orgs limit 11
[5]:
 | id | name | acronym | city_name | count | country_name | latitude | linkout | longitude | types | state_name | id_from | level |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | grid.412125.1 | King Abdulaziz University | KAU | Jeddah | 1444 | Saudi Arabia | 21.493889 | [http://www.kau.edu.sa/home_english.aspx] | 39.250280 | [Education] | NaN | grid.412125.1 | 1 |
1 | grid.261112.7 | Northeastern University | NU | Boston | 106 | United States | 42.339830 | [http://www.northeastern.edu/] | -71.089180 | [Education] | Massachusetts | grid.412125.1 | 1 |
2 | grid.38142.3c | Harvard University | NaN | Cambridge | 98 | United States | 42.377052 | [http://www.harvard.edu/] | -71.116650 | [Education] | Massachusetts | grid.412125.1 | 1 |
3 | grid.116068.8 | Massachusetts Institute of Technology | MIT | Cambridge | 73 | United States | 42.359820 | [http://web.mit.edu/] | -71.092110 | [Education] | Massachusetts | grid.412125.1 | 1 |
4 | grid.16753.36 | Northwestern University | NU | Evanston | 59 | United States | 42.054850 | [http://www.northwestern.edu/] | -87.673940 | [Education] | Illinois | grid.412125.1 | 1 |
5 | grid.413735.7 | Harvard–MIT Division of Health Sciences and Te... | HST | Cambridge | 58 | United States | 42.361780 | [http://hst.mit.edu/] | -71.086914 | [Education] | Massachusetts | grid.412125.1 | 1 |
6 | grid.411340.3 | Aligarh Muslim University | AMU | Aligarh | 47 | India | 27.917370 | [http://www.amu.ac.in/] | 78.077850 | [Education] | Uttar Pradesh | grid.412125.1 | 1 |
7 | grid.412621.2 | Quaid-i-Azam University | QAU | Islamabad | 47 | Pakistan | 33.747223 | [http://www.qau.edu.pk/] | 73.138885 | [Education] | NaN | grid.412125.1 | 1 |
8 | grid.33003.33 | Suez Canal University | NaN | Ismailia | 42 | Egypt | 30.622778 | [http://scuegypt.edu.eg/ar/] | 32.275000 | [Education] | NaN | grid.412125.1 | 1 |
9 | grid.411818.5 | Jamia Millia Islamia | JMI | New Delhi | 42 | India | 28.561607 | [http://jmi.ac.in/] | 77.280150 | [Education] | NaN | grid.412125.1 | 1 |
10 | grid.56302.32 | King Saud University | KSU | Riyadh | 42 | Saudi Arabia | 24.723982 | [http://ksu.edu.sa/en/] | 46.645840 | [Education] | NaN | grid.412125.1 | 1 |
3. Building a network of any size
What if we want to retrieve the collaborators of the collaborators? In other words, what if we want to generate a larger network?
If we think of our collaboration data as a graph structure with nodes and edges, we can see that the get_collaborators function defined above is limited: it only returns the organizations directly linked to the ‘seed’ GRID organization.
We would like to run the same collaborators-extraction step iteratively for any GRID ID in our results, so as to generate an N-degrees network where N is chosen by us.
For this purpose, we can set up a recursive function. This function essentially repeats the get_collaborators function as many times as needed. Here’s what it looks like:
[6]:
def recursive_network(seed, maxlevel=1, thislevel=1):
    "Recursive function for building an organization collaboration network"
    results = get_collaborators(seed, thislevel)
    time.sleep(1)
    print("--" * thislevel, seed, " :: level =", thislevel)
    if thislevel < maxlevel:
        # remove the originating grid-id and the current seed, so that
        # self-collaboration records do not trigger redundant queries
        gridslist = list(results[~results['id'].isin([GRIDID, seed])]['id'])
        next_level_results = [recursive_network(x, maxlevel, thislevel+1) for x in gridslist]
        # concatenating in one step also works when gridslist is empty
        results = pd.concat([results] + next_level_results)
        return results
    else:
        # finally
        return results
A few key points to note:
Recursion depth. The maxlevel parameter determines how big our network should be (1 = neighbours only, 2 = collaborators of neighbours, etc.).
API quota. We pause 1 second after each iteration to avoid hitting the normal Analytics API quota (~30 requests per minute).
Data size. The function can generate lots of data! E.g. calling this function with maxlevel=5 will lead to roughly 10k queries! (Note: you can get a rough estimate of the queries via the formula 10 to the power of maxlevel-1. That’s because 10 is the number of orgs we extract per iteration, and maxlevel is the number of iterations, minus the first one which generates no extra queries.)
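The estimate above can be sketched in a few lines. This is only a back-of-the-envelope calculation that assumes every organization yields exactly 10 new collaborators; `estimate_queries` is a hypothetical helper, not part of the notebook. It returns both the number of queries at the deepest level (the 10 to the power of maxlevel-1 figure) and the total across all levels.

```python
def estimate_queries(maxlevel, per_node=10):
    """Return (queries at the deepest level, total queries) for a full tree.

    Assumes each organization contributes exactly `per_node` collaborators,
    which is an idealisation of what the API actually returns.
    """
    per_level = [per_node ** depth for depth in range(maxlevel)]
    return per_level[-1], sum(per_level)


for n in range(1, 6):
    deepest, total = estimate_queries(n)
    # with a 1-second pause per query, `total` is also a lower bound in seconds
    print(f"maxlevel={n}: {deepest} queries at the last level, {total} in total")
```

With the one-second pause between calls, the total also gives a lower bound on the running time in seconds, which is worth checking before raising maxlevel.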
Let’s try this out.
We can construct a 2-degrees collaboration network starting from King Abdulaziz University. We are extracting 10 organizations per node so our network will have ~100 nodes at the end!
[7]:
collaborators = recursive_network(GRIDID, maxlevel=2)
# change column order for readability purposes
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'state_name', 'country_name', 'latitude', 'longitude', 'linkout', 'types' ]]
collaborators.head()
-- grid.412125.1 :: level = 1
---- grid.261112.7 :: level = 2
---- grid.38142.3c :: level = 2
---- grid.116068.8 :: level = 2
---- grid.16753.36 :: level = 2
---- grid.413735.7 :: level = 2
---- grid.411340.3 :: level = 2
---- grid.412621.2 :: level = 2
---- grid.33003.33 :: level = 2
---- grid.411818.5 :: level = 2
---- grid.56302.32 :: level = 2
[7]:
 | id_from | id_to | level | count | name | acronym | city_name | state_name | country_name | latitude | longitude | linkout | types |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | grid.412125.1 | grid.412125.1 | 1 | 1444 | King Abdulaziz University | KAU | Jeddah | NaN | Saudi Arabia | 21.493889 | 39.25028 | [http://www.kau.edu.sa/home_english.aspx] | [Education] |
1 | grid.412125.1 | grid.261112.7 | 1 | 106 | Northeastern University | NU | Boston | Massachusetts | United States | 42.339830 | -71.08918 | [http://www.northeastern.edu/] | [Education] |
2 | grid.412125.1 | grid.38142.3c | 1 | 98 | Harvard University | NaN | Cambridge | Massachusetts | United States | 42.377052 | -71.11665 | [http://www.harvard.edu/] | [Education] |
3 | grid.412125.1 | grid.116068.8 | 1 | 73 | Massachusetts Institute of Technology | MIT | Cambridge | Massachusetts | United States | 42.359820 | -71.09211 | [http://web.mit.edu/] | [Education] |
4 | grid.412125.1 | grid.16753.36 | 1 | 59 | Northwestern University | NU | Evanston | Illinois | United States | 42.054850 | -87.67394 | [http://www.northwestern.edu/] | [Education] |
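To check how many nodes and edges the extraction actually produced, a small helper along these lines may be useful. It is a sketch that works on plain dictionaries with `id_from` and `id_to` keys (for a dataframe, `collaborators.to_dict("records")` would produce that shape); `network_summary` and the sample records are made up for illustration.

```python
def network_summary(rows):
    """Count distinct nodes and distinct non-self edges in collaborator records."""
    nodes = {r["id_from"] for r in rows} | {r["id_to"] for r in rows}
    # treat edges as undirected: (a, b) and (b, a) count once; skip self-links
    edges = {
        frozenset((r["id_from"], r["id_to"]))
        for r in rows
        if r["id_from"] != r["id_to"]
    }
    return len(nodes), len(edges)


# Toy records, not real API output
sample = [
    {"id_from": "seed", "id_to": "seed"},    # internal collaboration
    {"id_from": "seed", "id_to": "org_a"},
    {"id_from": "org_a", "id_to": "seed"},   # same link seen from the other side
    {"id_from": "org_a", "id_to": "org_b"},
]
print(network_summary(sample))  # (3, 2)
```

Because second-level organizations often collaborate with each other, the number of distinct nodes is usually lower than the number of rows.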
4. Visualizing the network
In order to get an overview of the network data we can visualize it using the Python pyvis library. A custom version of pyvis is already included in dimcli.core.extras and is called NetworkViz (note: this custom version only fixes a bug that prevents pyvis graphs from being displayed online with Google Colab).
Network visualizations can be very complex, but to begin with we can focus on representing two key aspects:
Core collaborators. The size of the nodes should be proportional to the proximity to our ‘seed’ organization. This will make it easier to quickly identify the key players in the network.
Number of publications. The width of the edges should be proportional to the strength of the collaboration (= how many publications two orgs have in common).
In a nutshell, this is what the code below does:
After creating a Network object, we add nodes and edges from our dataset using the add_node and add_edge methods.
The Network repulsion parameter is set to 300, but for bigger charts you may want to increase that.
Nodes and edges in pyvis can have a number of attributes. The full list of attributes can be found in the pyvis documentation.
In order to have some nice colors, we take advantage of the built-in plotly color scales. Try changing them!
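The sizing and coloring rule can be sketched on its own as follows. This is an illustrative helper, not the notebook's code: `node_style` is a hypothetical name, and the hex values are arbitrary stand-ins for a plotly color scale. Unlike a direct lookup by level, the index is clamped so that networks deeper than the palette still get a color.

```python
# Stand-in palette (arbitrary hex values); in the notebook a plotly
# color scale such as px.colors.diverging.Temps would be used instead.
palette = ["#009392", "#39b185", "#9ccb86", "#e9e29c", "#eeb479", "#e88471", "#cf597e"]


def node_style(level, max_level, is_seed=False, palette=palette):
    """Return (size, color) for a node: closer to the seed means bigger."""
    if is_seed:
        return max_level + 1, palette[0]
    size = max_level + 1 - level
    # clamp the index so deep levels reuse the last color instead of failing
    color = palette[min(level * 2, len(palette) - 1)]
    return size, color


print(node_style(1, 2, is_seed=True))
print(node_style(1, 2))
print(node_style(2, 2))
```

Swapping in a different palette only requires replacing the list, since the helper never assumes a particular length.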
[8]:
# load pyvis
from pyvis.network import Network
def build_visualization(collaborator_df):
    """
    Return a network visualization object from a collaborators dataframe
    The object can be then displayed/saved with `g.show("network.html")`
    """
    # set up dataviz
    g = Network(notebook=True, width="100%", height="800px", cdn_resources="remote",
                neighborhood_highlight=True,
                select_menu=True)
    g.toggle_hide_edges_on_drag(False)
    g.barnes_hut()
    g.repulsion(300)
    # reuse plotly color palette
    palette = px.colors.diverging.Temps
    # g.show_buttons() # in html-standalone mode, this command shows viz controls
    #
    # create nodes and edges
    #
    # remove duplicates from nodes
    nodes = collaborator_df.drop_duplicates(subset="id_to", keep='first')
    # remove internal collaborations stats
    edges = collaborator_df[(collaborator_df['id_to'] != collaborator_df['id_from'])]
    #
    # add nodes
    #
    # the biggest node size is one more than the deepest level
    maxsize = int(nodes['level'].max()) + 1
    for index, row in nodes.iterrows():
        # calc size based on level
        if row['id_to'] == GRIDID:
            size = maxsize
        else:
            size = maxsize - row['level']
        # calc color based on level
        if row['id_to'] == GRIDID:
            color = palette[0]
        else:
            color = palette[row['level'] * 2]
        g.add_node(
            n_id = row['id_to'],
            label = row['name'],
            title = f"<h4>{row['name']}<br>{row['city_name']}, {row['country_name']}<br> - {row['id_to']}</h4>",
            value = size,
            color = color,
            borderWidthSelected = 5,
            shape = "dot",
        )
    #
    # add edges
    #
    edges_maxcount = edges['count'].max()
    for index, row in edges.iterrows():
        g.add_edge(row['id_from'], row['id_to'],
                   value = float(row['count']) / edges_maxcount,
                   label=int(row['count']),
                   arrows="none"
                   )
    # add tooltips with adjacent links info
    neighbor_map = g.get_adj_list()
    for node in g.nodes:
        neigh = neighbor_map[node["id"]]
        labels = [nodes[nodes['id_to'] == x].iloc[0]['name'] for x in neigh]
        node["title"] += "Links:<li>" + "</li><li>".join(labels)
    return g

#
# finally, run the viz builder
#
g = build_visualization(collaborators)
g.show(f"network_{GRIDID}.html")
network_grid.412125.1.html
[8]:
5. Addendum: showing only ‘Government’ collaborators
What if we want to show a collaboration network focusing only on ‘government’ organizations?
That’s pretty easy to do, since the GRID database includes information about organization types. We can easily see what types are available using the API and a facet query:
[9]:
%dsldf search organizations return types
Returned Types: 9
Time: 1.00s
[9]:
 | id | count |
---|---|---|
0 | Company | 30742 |
1 | Education | 20761 |
2 | Nonprofit | 17573 |
3 | Healthcare | 13926 |
4 | Facility | 10168 |
5 | Government | 6580 |
6 | Other | 4017 |
7 | Archive | 2926 |
8 | Education,Company | 1 |
The steps are the following:
New query filter. We rewrite the get_collaborators function we created in section 2 above, so that the API query includes a filter for organizations with the selected type only: .. and research_orgs.types in ["Government"]...
Get more results. We increase the number of results returned: ..return research_orgs limit 50. This is to ensure we still have enough results after removing the ones that don’t have the chosen ‘type’.
Remove unwanted data. The new query filter research_orgs.types in ["{}"] also returns publications with multiple authors/affiliations where only one of them has the desired ‘type’. So an extra step is required, and this is achieved via the keep_only_type function below. This function simply filters out all unwanted organizations data after they’re retrieved from the API.
That’s it! Run the cell below to generate a new visualization showing only “Government” collaborators. Or try changing the value of GRID_TYPE to see different results.
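The post-filtering step can be illustrated in isolation on plain dictionaries. The sketch below is a simplified stand-in for that step: `filter_orgs` is a hypothetical name, the records are toy data rather than API output, and it additionally tolerates records that carry no `types` key at all, which is an assumption about how incomplete records might look.

```python
def filter_orgs(orgs, wanted_type, seed_id):
    """Keep the seed organization plus any organization of the wanted type."""
    return [
        o for o in orgs
        if o["id"] == seed_id or wanted_type in o.get("types", [])
    ]


# Toy records standing in for the `research_orgs` list returned by the API
orgs = [
    {"id": "seed", "types": ["Education"]},
    {"id": "org_a", "types": ["Government"]},
    {"id": "org_b", "types": ["Education"]},
    {"id": "org_c"},  # no type information
]
kept = filter_orgs(orgs, "Government", "seed")
print([o["id"] for o in kept])  # ['seed', 'org_a']
```

The seed is kept regardless of its type so that the resulting diagram still has a central node to attach the edges to.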
[10]:
#@markdown Try using one of the organization types from the list above
GRID_TYPE = "Government" #@param {type:"string"}
query = """search publications {}
where year in [{}:{}]
and research_orgs.id="{}"
and research_orgs.types in ["{}"]
return research_orgs limit 50"""
def keep_only_type(data, a_type, orgid):
    clean_list = []
    for x in data.research_orgs:
        # include also originating GRID to ensure chart is complete
        if x['id'] == orgid or a_type in x['types']:
            clean_list.append(x)
    data.json['research_orgs'] = clean_list
    return data

def get_collaborators(orgid, level=1, printquery=False):
    "New version that filters using org types as well"
    if TOPIC:
        TOPIC_CLAUSE = f"""for "{TOPIC}" """
    else:
        TOPIC_CLAUSE = ""
    # include also the GRID_TYPE
    query_full = query.format(TOPIC_CLAUSE, YEAR_START, YEAR_END, orgid, GRID_TYPE)
    if printquery: print(query_full)
    data = dsl.query(query_full, verbose=False)
    # remove results with unwanted types (use the selected GRID_TYPE, not a hardcoded value)
    data = keep_only_type(data, GRID_TYPE, orgid)
    df = data.as_dataframe()
    df['id_from'] = [orgid] * len(df)
    df['level'] = [level] * len(df)
    return df
#
# RUN THE RECURSIVE QUERY (same code as above)
#
collaborators = recursive_network(GRIDID, maxlevel=2)
collaborators.rename(columns={"id": "id_to"}, inplace=True)
collaborators = collaborators[['id_from', 'id_to', 'level', 'count', 'name', 'acronym', 'city_name', 'country_name', 'latitude', 'longitude', 'linkout', 'types' ]]
#
# BUILD VIZ
#
g = build_visualization(collaborators)
g.show(f"network_{GRIDID}_{GRID_TYPE}.html")
-- grid.412125.1 :: level = 1
---- grid.7327.1 :: level = 2
---- grid.9227.e :: level = 2
---- grid.20256.33 :: level = 2
---- grid.1089.0 :: level = 2
---- grid.14467.30 :: level = 2
network_grid.412125.1_Government.html
[10]:
Conclusions
In this tutorial we have demonstrated how to generate an organization ‘collaborations network diagram’ using the Dimensions API. Starting from a research organization, we extracted information about other collaborating organizations, based on shared publications data, a topic and a time-frame.
An example of the resulting network diagram can be seen here.
Here are some ideas for further experimentation:
Try changing the initial publications query so as to include other parameters. The publications API is rich, so there are many ways to fine-tune your analysis.
Try increasing the number of iterations using the maxlevel parameter.
Try customizing the resulting network diagram, e.g. to highlight nodes and edges based on different criteria like countries or years.
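As one possible starting point for the last idea, the sketch below assigns a color to each country found in the collaborator records. It is a minimal example under stated assumptions: `color_by_country` is a hypothetical helper, the rows are toy records with a `country_name` key like the one in the collaborators table, and the color names are arbitrary.

```python
from itertools import cycle


def color_by_country(rows, colors):
    """Map each country to a color, cycling through `colors` if there are more countries."""
    countries = sorted({r["country_name"] for r in rows})
    return dict(zip(countries, cycle(colors)))


# Toy records; only the country_name key matters here
rows = [
    {"name": "Org A", "country_name": "Saudi Arabia"},
    {"name": "Org B", "country_name": "India"},
    {"name": "Org C", "country_name": "Egypt"},
    {"name": "Org D", "country_name": "India"},
]
mapping = color_by_country(rows, ["red", "blue"])
print(mapping)
```

The resulting value for a row's country could then be passed as the `color` argument when adding that node to the network, instead of the level-based color.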
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.