Identifying the Industry Collaborators of an Academic Institution

Dimensions uses GRID identifiers for institutions, so you can take advantage of GRID metadata in Dimensions queries.

In this tutorial we first identify all organizations whose GRID type is 'Company'.

This list of organizations is then used to identify the industry collaborators of a chosen academic institution.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.shortcuts import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7.4.2)
Connected to: https://app.dimensions.ai - DSL v1.27
Method: dsl.ini file

1. Selecting an academic institution

For the purpose of this exercise, we will use University of Trento, Italy (grid.11696.39) as a starting point. You can pick any other GRID organization of course. Just use a DSL query or the GRID website to discover the ID of an organization that interests you.

[2]:
#@markdown The main organization we are interested in:
GRIDID = "grid.11696.39" #@param {type:"string"}

#@markdown The start/end year of publications used to extract industry collaborations:
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # fall back to an empty name if the lookup fails
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {} <br /><br />'.format(YEAR_START, YEAR_END)))

Time period: 2000 to 2016

2. Extracting publications from industry collaborations

First, we extract all GRID organizations with type='Company' using the API. We then use this list of organizations to identify industry collaborators for our chosen institution.

  • We can use the query_iterative method of the Dimcli Dsl object to automatically retrieve ‘company’ GRID orgs in batches of 1000.

  • NOTE: this step retrieves several thousand records from the API, so it may take a few minutes to complete.

[3]:
# get GRID IDs
company_grids = dsl.query_iterative("""search organizations where types="Company" return organizations[id]""")
Starting iteration with limit=1000 skip=0 ...
0-1000 / 28136 (0.55s)
1000-2000 / 28136 (0.55s)
2000-3000 / 28136 (0.54s)
3000-4000 / 28136 (0.79s)
4000-5000 / 28136 (0.53s)
5000-6000 / 28136 (0.55s)
6000-7000 / 28136 (0.52s)
7000-8000 / 28136 (0.62s)
8000-9000 / 28136 (0.54s)
9000-10000 / 28136 (0.53s)
10000-11000 / 28136 (0.52s)
11000-12000 / 28136 (0.54s)
12000-13000 / 28136 (0.54s)
13000-14000 / 28136 (0.53s)
14000-15000 / 28136 (0.56s)
15000-16000 / 28136 (0.52s)
16000-17000 / 28136 (0.53s)
17000-18000 / 28136 (0.55s)
18000-19000 / 28136 (0.52s)
19000-20000 / 28136 (0.67s)
20000-21000 / 28136 (0.52s)
21000-22000 / 28136 (0.52s)
22000-23000 / 28136 (0.51s)
23000-24000 / 28136 (0.50s)
24000-25000 / 28136 (0.99s)
25000-26000 / 28136 (0.81s)
26000-27000 / 28136 (0.51s)
27000-28000 / 28136 (0.60s)
28000-28136 / 28136 (0.51s)
===
Records extracted: 28136
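
The log above shows the records arriving in pages controlled by limit and skip. As a rough illustration of that paging pattern (not Dimcli's actual implementation), the loop below collects every record from a stand-in data source. The names fetch_all, fake_fetch_page and FAKE_ORGS are hypothetical and exist only for this sketch.

```python
from typing import Callable, Dict, List


def fetch_all(fetch_page: Callable[[int, int], Dict], limit: int = 1000) -> List[dict]:
    """Request pages of `limit` records until the reported total is reached."""
    records: List[dict] = []
    skip = 0
    while True:
        page = fetch_page(limit, skip)   # one request per batch
        records.extend(page["records"])
        skip += limit
        # stop when we have passed the total, or the source returned nothing
        if skip >= page["total"] or not page["records"]:
            break
    return records


# Hypothetical stand-in for the API: 2,500 fake organizations.
FAKE_ORGS = [{"id": f"grid.{i}"} for i in range(2500)]


def fake_fetch_page(limit: int, skip: int) -> Dict:
    """Return one page of fake records plus the overall total."""
    return {"total": len(FAKE_ORGS), "records": FAKE_ORGS[skip:skip + limit]}


all_orgs = fetch_all(fake_fetch_page, limit=1000)
print(len(all_orgs))
```

With a real API, each page costs one request, which is why retrieving tens of thousands of organizations takes a noticeable amount of time.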

We can now set up a parametrized query that pulls the Dimensions publications resulting from industry collaborations.

Together with IDs, types and DOIs, the publication records should include citation counts and author information, so that we can draw up some useful statistics from this metadata later on.
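
As a sketch of how such a parametrized query gets filled in, the snippet below formats a template with the same four placeholders as the one defined in the next cell, using a small sample of company IDs. The helper split_into_chunks is a hypothetical stand-in for the chunks_of shortcut used later in the tutorial, and the three IDs are taken from the affiliations preview further down.

```python
import json

# Same shape as the tutorial's template, repeated so this sketch runs on its own.
query_template = """
    search publications
       where
        research_orgs.id = "{}"
        and research_orgs.id in {}
        and year in [{}:{}]
    return publications[id+doi+type+times_cited+year+authors]
    """


def split_into_chunks(items, size):
    """Hypothetical helper: yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


company_ids = ["grid.14587.3f", "grid.410308.e", "grid.424032.3"]

# json.dumps renders each chunk as a double-quoted list, which the query accepts.
queries = [
    query_template.format("grid.11696.39", json.dumps(chunk), 2000, 2016)
    for chunk in split_into_chunks(company_ids, 2)
]

print(len(queries))
print(queries[0])
```

Splitting the company list into chunks keeps each query to a manageable size. The tutorial uses 200 IDs per query.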

[4]:
query_template = """
    search publications
       where
        research_orgs.id = "{}"
        and research_orgs.id in {}
        and year in [{}:{}]
    return publications[id+doi+type+times_cited+year+authors]
    """
[5]:
gridis = list(company_grids.as_dataframe()['id'])

#
# loop through all grids

ITERATION_RECORDS = 1000  # Publication records per query iteration
GRID_RECORDS = 200       # grid IDs per query
VERBOSE = False          # set to True to view full extraction logs
print(f"===\nExtracting {GRIDID} publications with industry collaborators ...")
print("Records per query : ", ITERATION_RECORDS)
print("GRID IDs per query: ", GRID_RECORDS)
results = []


for chunk in progress(list(chunks_of(gridis, GRID_RECORDS))):
    query = query_template.format(GRIDID, json.dumps(chunk), YEAR_START, YEAR_END)
#     print(query)
    data = dsl.query_iterative(query, verbose=VERBOSE, limit=ITERATION_RECORDS)
    if data.errors:
        print("==\nIteration failed: due to an error, no data was extracted for this iteration. \nTry adjusting the ITERATION_RECORDS or GRID_RECORDS parameters and rerun the extraction.")
    else:
        results += data.publications
    time.sleep(0.5)

#
# put the publication data into a dataframe, remove duplicates and save

pubs = pd.DataFrame(results)
# print("===\nIndustry Publications found: ", len(pubs))
pubs.drop_duplicates(subset='id', inplace=True)
print("Unique Industry Publications found: ", len(pubs))

#
# preview the data
print("===\nPreview:")
pubs.head(10)
===
Extracting grid.11696.39 publications with industry collaborators ...
Records per query :  1000
GRID IDs per query:  200

Unique Industry Publications found:  357
===
Preview:
[5]:
times_cited authors doi year id type
0 0 [{'first_name': 'LUCA DALLA', 'last_name': 'VA... 10.2495/sdp-v12-n3-552-558/020 2016 pub.1087081818 chapter
1 7 [{'first_name': 'M', 'last_name': 'Armano', 'c... 10.1088/0264-9381/33/23/235015 2016 pub.1059063534 article
2 20 [{'first_name': 'Andrey', 'last_name': 'Bogomo... 10.1140/epjds/s13688-016-0075-3 2016 pub.1049398390 article
3 12 [{'first_name': 'Simone', 'last_name': 'Centel... 10.1140/epjds/s13688-016-0064-6 2016 pub.1033140941 article
4 24 [{'first_name': 'Timofei', 'last_name': 'Istom... 10.1145/2994551.2994558 2016 pub.1001093903 proceeding
5 480 [{'first_name': 'Menachem', 'last_name': 'From... 10.1038/nn.4399 2016 pub.1022499785 article
6 1 [{'first_name': 'Irena', 'last_name': 'Zurnic'... 10.1186/s12977-016-0294-5 2016 pub.1015046696 article
7 10 [{'first_name': 'Julia', 'last_name': 'Leibing... 10.1016/j.apnum.2016.02.001 2016 pub.1038596770 article
8 0 [{'first_name': 'Moira', 'last_name': 'Marizzo... 10.1016/j.jalz.2016.06.1894 2016 pub.1014099327 article
9 0 [{'first_name': 'Michela', 'last_name': 'Pieva... 10.1016/j.jalz.2016.06.1013 2016 pub.1049823866 article
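
The deduplication step matters because a publication co-authored with companies that fall into different ID chunks is returned by more than one query. A minimal sketch with made-up records (chunk_a, chunk_b and their values are illustrative only):

```python
import pandas as pd

# Toy results: pub.2 is returned by two different chunk queries.
chunk_a = [{"id": "pub.1", "times_cited": 3}, {"id": "pub.2", "times_cited": 5}]
chunk_b = [{"id": "pub.2", "times_cited": 5}, {"id": "pub.3", "times_cited": 0}]

combined = pd.DataFrame(chunk_a + chunk_b)

# Keep the first occurrence of each publication ID.
unique_pubs = combined.drop_duplicates(subset="id").reset_index(drop=True)

print(len(combined), len(unique_pubs))
```

Without this step, publications involving several companies would be over-counted in the charts that follow.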

3. Analyses

In this section we will build some visualizations that help us understand the data we extracted.

3.1 Count of Publications per year from Industry Collaborations

A simple histogram shows the number of publications per year, broken down by publication type.

[6]:
px.histogram(pubs,
             x="year",
             color="type",
             title=f"Publications per year with industry collaborations for {GRIDID}")

3.2 Citations from Industry Collaboration

[7]:
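# sum only the citation counts; the other columns (authors, doi, ...) are not numeric
pubs_grouped = pubs.groupby('year', as_index=False)['times_cited'].sum()
px.bar(pubs_grouped,
       x="year",
       y="times_cited",
       title=f"Total citations per year for publications with industry collaborations for {GRIDID}")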

3.3 Top Industry Collaborators

In order to dig deeper into the industry affiliations we have to process the nested JSON data in the ‘authors’ column. By doing so, we can process authors & affiliations information and identify the ones belonging to the ‘industry’ set defined above.

For example, if we extract the authors data for the first publication/row (pubs.iloc[0]['authors']), this is what it looks like:

[{'first_name': 'LUCA DALLA',
  'last_name': 'VALLE',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.013645226073.38',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'ELENA CRISTINA',
  'last_name': 'RADA',
  'corresponding': '',
  'orcid': "['0000-0003-0807-1826']",
  'current_organization_id': 'grid.18147.3b',
  'researcher_id': 'ur.01344320306.26',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MARCO',
  'last_name': 'RAGAZZI',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.0655652202.53',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MICHELE',
  'last_name': 'CARAVIELLO',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.14587.3f',
  'researcher_id': 'ur.016015622301.36',
  'affiliations': [{'id': 'grid.14587.3f',
    'name': 'Telecom Italia (Italy)',
    'city': 'Rome',
    'city_id': 3169070,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]}]

NOTE: Instead of iterating through the authors/affiliations data by building a new function, we can just take advantage of the DslDataset class in the Dimcli library. This class abstracts the notion of a Dimensions ‘results list’ and provides useful methods to quickly process authors and affiliations.
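
For comparison, a manual pass over the nested data might look like the sketch below. It uses a trimmed toy record with the same shape as the authors data shown above. The function flatten_affiliations and the set industry_ids are hypothetical names for this illustration and are not part of Dimcli.

```python
# Toy record mirroring the nested 'authors' structure (fields trimmed).
publications = [
    {
        "id": "pub.1087081818",
        "authors": [
            {"first_name": "MARCO", "last_name": "RAGAZZI",
             "affiliations": [{"id": "grid.11696.39",
                               "name": "University of Trento",
                               "country": "Italy"}]},
            {"first_name": "MICHELE", "last_name": "CARAVIELLO",
             "affiliations": [{"id": "grid.14587.3f",
                               "name": "Telecom Italia (Italy)",
                               "country": "Italy"}]},
        ],
    }
]

# Stand-in for the full list of company GRID IDs.
industry_ids = {"grid.14587.3f"}


def flatten_affiliations(pubs):
    """Return one flat row per (publication, author, affiliation) combination."""
    rows = []
    for pub in pubs:
        for author in pub.get("authors") or []:
            for aff in author.get("affiliations") or []:
                rows.append({
                    "pub_id": pub["id"],
                    "first_name": author.get("first_name"),
                    "last_name": author.get("last_name"),
                    "aff_id": aff.get("id"),
                    "aff_name": aff.get("name"),
                    "aff_country": aff.get("country"),
                })
    return rows


rows = flatten_affiliations(publications)
industry_rows = [r for r in rows if r["aff_id"] in industry_ids]
print(industry_rows)
```

The three nested loops are the work that the DslDataset helper takes care of.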

[8]:
# create a new DslDataset instance
pubsnew = DslDataset.from_publications_list(pubs)
# extract affiliations as a dataframe
affiliations = pubsnew.as_dataframe_authors_affiliations()
# focus only on affiliations including a grid from the industry set created above
affiliations = affiliations[affiliations['aff_id'].isin(gridis)]
# preview the data
affiliations.head(5)
[8]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
3 grid.14587.3f Telecom Italia (Italy) Rome 3.16907e+06 Italy IT pub.1087081818 MICHELE CARAVIELLO
11 grid.410308.e Airbus (Germany) Hamburg 2.9113e+06 Germany DE pub.1059063534 N Brandt
12 grid.424032.3 Orbitale Hochtechnologie Bremen (Italy) Milan 3.17344e+06 Italy IT pub.1059063534 A Bursi
19 grid.424032.3 Orbitale Hochtechnologie Bremen (Italy) Milan 3.17344e+06 Italy IT pub.1059063534 D Desiderio
20 grid.424032.3 Orbitale Hochtechnologie Bremen (Italy) Milan 3.17344e+06 Italy IT pub.1059063534 E Piersanti

Let’s now count how often each organization appears and chart the top industry collaborators.

TIP: Try zooming in on the left-hand side to focus on the organizations that appear most frequently.

[9]:
px.histogram(affiliations,
             x="aff_name",
             height=900,
             title=f"Top Industry collaborators for {GRIDID}").update_xaxes(categoryorder="total descending")
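
Note that a histogram over the affiliations table counts one row per author affiliation, so an organization with several authors on the same paper is counted several times. If you would rather rank collaborators by number of distinct publications, a sketch along these lines could be used. The toy frame below reuses names from the preview above, and toy_affiliations is an illustrative variable rather than the tutorial's dataset.

```python
import pandas as pd

# Toy data: three authors from the same company on a single publication.
toy_affiliations = pd.DataFrame({
    "aff_name": [
        "Telecom Italia (Italy)",
        "Airbus (Germany)",
        "Orbitale Hochtechnologie Bremen (Italy)",
        "Orbitale Hochtechnologie Bremen (Italy)",
        "Orbitale Hochtechnologie Bremen (Italy)",
    ],
    "pub_id": [
        "pub.1087081818",
        "pub.1059063534",
        "pub.1059063534",
        "pub.1059063534",
        "pub.1059063534",
    ],
})

# What the histogram counts: rows (author affiliations) per organization.
rows_per_org = toy_affiliations["aff_name"].value_counts()

# Alternative: distinct publications per organization.
pubs_per_org = (
    toy_affiliations.groupby("aff_name")["pub_id"]
    .nunique()
    .sort_values(ascending=False)
)

print(rows_per_org)
print(pubs_per_org)
```

Which measure is more appropriate depends on whether you care about the number of people involved or the number of joint papers.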

3.4 Countries of Industry Collaborators

We can use the same dataset to segment the data by country.

[10]:
px.pie(affiliations,
       names="aff_country",
       height=600,
       title=f"Countries of collaborators for {GRIDID}")
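
If you want the figures behind the pie chart as a table, the shares can be computed directly. This is a minimal sketch on made-up rows, where toy_countries is an illustrative stand-in for the aff_country column:

```python
import pandas as pd

# Toy stand-in for affiliations["aff_country"].
toy_countries = pd.Series(["Italy", "Germany", "Italy", "Italy", "Italy"])

# Fraction of affiliation rows per country, largest first.
country_share = toy_countries.value_counts(normalize=True)

print(country_share)
```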

3.5 Putting Countries and Collaborators together

TIP: Click on the entries in the right-hand panel to turn specific countries on or off.

[11]:
px.histogram(affiliations,
             x="aff_name",
             height=900,
             color="aff_country",
             title=f"Top Countries and Industry collaborators for {gridname}-{GRIDID}",
             color_discrete_sequence=px.colors.diverging.Spectral)


Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
