Identifying the Industry Collaborators of an Academic Institution

Dimensions identifies institutions with GRID identifiers, so you can take advantage of GRID metadata in Dimensions queries.

In this tutorial we identify all organizations that have an industry type.

This list of organizations is then used to identify industry collaborations for a chosen academic institution.

[1]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:

!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *

import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. Selecting an academic institution

For the purpose of this exercise, we will use the University of Trento, Italy (grid.11696.39) as a starting point. You can of course pick any other GRID organization: use a DSL query or the GRID website to find the ID of an organization that interests you.

[3]:

#@markdown The main organization we are interested in:
GRIDID = "grid.11696.39" #@param {type:"string"}

#@markdown The start/end year of publications used to extract industry collaborations:
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # fall back to an empty name if the lookup fails
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {} <br /><br />'.format(YEAR_START, YEAR_END)))

GRID: grid.11696.39 - University of Trento ⧉

Time period: 2000 to 2016

2. Extracting publications from industry collaborations

First of all we want to extract all GRID orgs with type='Company' using the API. Then we will use this list of organizations to identify industry collaborators for our chosen institution.

We can use Dimcli's query_iterative method (called on the dsl object created above) to automatically retrieve ‘Company’ GRID orgs in batches of 1000.
NOTE: this step retrieves roughly 30,000 records from the API, so it may take a few minutes to complete.

[4]:

# get GRID IDs
company_grids = dsl.query_iterative("""search organizations where types="Company" return organizations[id]""")

Starting iteration with limit=1000 skip=0 ...
0-1000 / 30088 (0.63s)
1000-2000 / 30088 (0.57s)
2000-3000 / 30088 (0.69s)
3000-4000 / 30088 (0.52s)
4000-5000 / 30088 (0.51s)
5000-6000 / 30088 (0.61s)
6000-7000 / 30088 (0.52s)
7000-8000 / 30088 (0.56s)
8000-9000 / 30088 (2.24s)
9000-10000 / 30088 (0.56s)
10000-11000 / 30088 (0.57s)
11000-12000 / 30088 (0.58s)
12000-13000 / 30088 (0.62s)
13000-14000 / 30088 (1.74s)
14000-15000 / 30088 (0.58s)
15000-16000 / 30088 (0.49s)
16000-17000 / 30088 (0.58s)
17000-18000 / 30088 (0.53s)
18000-19000 / 30088 (0.57s)
19000-20000 / 30088 (0.50s)
20000-21000 / 30088 (0.51s)
21000-22000 / 30088 (0.51s)
22000-23000 / 30088 (0.54s)
23000-24000 / 30088 (0.50s)
24000-25000 / 30088 (0.53s)
25000-26000 / 30088 (0.62s)
26000-27000 / 30088 (0.49s)
27000-28000 / 30088 (0.48s)
28000-29000 / 30088 (0.56s)
29000-30000 / 30088 (0.90s)
30000-30088 / 30088 (0.61s)
===
Records extracted: 30088
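
To make the batching in the log above more concrete, here is a minimal sketch of limit/skip pagination. It is not Dimcli's actual implementation: fetch_page is a hypothetical stand-in for a single API call, and the organization IDs are made up so the example runs without a connection.

```python
def fetch_page(records, limit, skip):
    """Hypothetical stand-in for one API call: one page plus the total count."""
    return {"items": records[skip:skip + limit], "total": len(records)}


def fetch_all(records, limit=1000):
    """Request successive pages until every record has been collected."""
    collected, skip = [], 0
    while True:
        page = fetch_page(records, limit, skip)
        collected.extend(page["items"])
        skip += limit
        # stop once the next offset would be past the last record
        if skip >= page["total"]:
            return collected


# Made-up data: 2,500 fake organization IDs, so three pages are needed.
fake_orgs = [f"grid.{n}" for n in range(2500)]
all_orgs = fetch_all(fake_orgs, limit=1000)
print(len(all_orgs))
```

The real method also handles API errors and logging, which this sketch leaves out.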

We can now set up a parametrized query that pulls Dimensions publications resulting from industry collaborations.

Together with IDs and DOIs, the query returns the publication type, year, citation counts and author information, so that we can later draw up some useful statistics from this metadata.

[5]:

query_template = """
    search publications
       where
        research_orgs.id = "{}"
        and research_orgs.id in {}
        and year in [{}:{}]
    return publications[id+doi+type+times_cited+year+authors]
    """

[6]:

gridis = list(company_grids.as_dataframe()['id'])

#
# loop through all grids

ITERATION_RECORDS = 1000  # Publication records per query iteration
GRID_RECORDS = 200       # grid IDs per query
VERBOSE = False          # set to True to view full extraction logs
print(f"===\nExtracting {GRIDID} publications with industry collaborators ...")
print("Records per query : ", ITERATION_RECORDS)
print("GRID IDs per query: ", GRID_RECORDS)
results = []


for chunk in progress(list(chunks_of(gridis, GRID_RECORDS))):
    query = query_template.format(GRIDID, json.dumps(chunk), YEAR_START, YEAR_END)
#     print(query)
    data = dsl.query_iterative(query, verbose=VERBOSE, limit=ITERATION_RECORDS)
    if data.errors:
        print("==\nIteration failed: due to an error, no data was extracted for this iteration. \nTry adjusting the ITERATION_RECORDS or GRID_RECORDS parameters and rerun the extraction.")
    else:
        results += data.publications
    time.sleep(0.5)

#
# put the publication data into a dataframe, remove duplicates and save

pubs = pd.DataFrame(results)
# print("===\nIndustry Publications found: ", len(pubs))
pubs.drop_duplicates(subset='id', inplace=True)
print("Unique Industry Publications found: ", len(pubs))

#
# preview the data
print("===\nPreview:")
pubs.head(10)

===
Extracting grid.11696.39 publications with industry collaborators ...
Records per query :  1000
GRID IDs per query:  200

Unique Industry Publications found:  375
===
Preview:

[6]:

	authors	doi	id	times_cited	type	year
0	[{'affiliations': [{'city': 'Madrid', 'city_id...	10.1088/0264-9381/33/23/235015	pub.1059063534	7	article	2016
1	[{'affiliations': [{'city': 'Dublin', 'city_id...	10.1145/2984356.2984363	pub.1001653422	14	proceeding	2016
2	[{'affiliations': [{'city': 'Stuttgart', 'city...	10.1016/j.apnum.2016.02.001	pub.1038596770	12	article	2016
3	[{'affiliations': [{'city': 'Madrid', 'city_id...	10.1103/physrevlett.116.231101	pub.1001053038	313	article	2016
4	[{'affiliations': [{'city': 'Dublin', 'city_id...	10.1109/eucnc.2016.7561056	pub.1094950798	22	proceeding	2016
5	[{'affiliations': [{'city': 'Trento', 'city_id...	10.1140/epjds/s13688-016-0064-6	pub.1033140941	15	article	2016
6	[{'affiliations': [{'city': 'Trento', 'city_id...	10.1089/big.2014.0054	pub.1018945654	48	article	2015
7	[{'affiliations': [{'city': 'Madrid', 'city_id...	10.1088/1742-6596/610/1/012027	pub.1031150191	1	article	2015
8	[{'affiliations': [{'city': 'Madrid', 'city_id...	10.1088/1742-6596/610/1/012005	pub.1052522882	17	article	2015
9	[{'affiliations': [{'city': 'Madrid', 'city_id...	10.1088/1742-6596/610/1/012026	pub.1033837350	2	article	2015
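
The extraction loop above splits the company IDs into chunks, fills the query template once per chunk, and then removes duplicates. Duplicates can arise because a publication whose industry co-authors fall into different chunks is returned by more than one query. The sketch below illustrates this with plain Python. The helpers chunked and unique_by_id are hypothetical stand-ins (for chunks_of and drop_duplicates respectively), and the batch results are invented for illustration.

```python
import json

QUERY_TEMPLATE = """search publications
    where research_orgs.id = "{}"
    and research_orgs.id in {}
    and year in [{}:{}]
return publications[id+doi+type+times_cited+year+authors]"""


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def unique_by_id(records):
    """Keep the first record seen for each publication ID, preserving order."""
    seen = {}
    for record in records:
        seen.setdefault(record["id"], record)
    return list(seen.values())


# Three company IDs taken from the outputs of this tutorial, two per query.
company_ids = ["grid.410308.e", "grid.424032.3", "grid.7546.0"]
queries = [
    QUERY_TEMPLATE.format("grid.11696.39", json.dumps(batch), 2000, 2016)
    for batch in chunked(company_ids, 2)
]
print(queries[0])

# Invented results: the same paper is returned by both queries.
batch_results = [
    [{"id": "pub.1059063534"}, {"id": "pub.1001653422"}],
    [{"id": "pub.1059063534"}],
]
merged = unique_by_id([rec for batch in batch_results for rec in batch])
print(len(merged))  # 2 unique publications out of 3 returned records
```

json.dumps is what turns each Python list into the double-quoted list literal that the DSL expects after the in operator.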

3. Analyses

In this section we will build some visualizations that help us understand the data we extracted.

3.1 Count of Publications per year from Industry Collaborations

A simple histogram chart can tell us the rate of publications per year.

[7]:

px.histogram(pubs,
             x="year",
             color="type",
             title=f"Publications per year with industry collaborations for {GRIDID}")

3.2 Citations from Industry Collaboration

[8]:

# sum only the citation counts; the other columns hold strings and nested lists
pubs_grouped = pubs.groupby('year', as_index=False)['times_cited'].sum()
px.bar(pubs_grouped,
       x="year",
       y="times_cited",
       title=f"Total citations per year for publications with industry collaborations for {GRIDID}")

3.3 Top Industry Collaborators

In order to dig deeper into the industry affiliations we have to process the nested JSON data in the ‘authors’ column. By doing so, we can process authors & affiliations information and identify the ones belonging to the ‘industry’ set defined above.

For example, if we extract the authors data for the first publication/row (pubs.iloc[0]['authors']), this is what it’d look like:

[{'first_name': 'LUCA DALLA',
  'last_name': 'VALLE',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.013645226073.38',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'ELENA CRISTINA',
  'last_name': 'RADA',
  'corresponding': '',
  'orcid': "['0000-0003-0807-1826']",
  'current_organization_id': 'grid.18147.3b',
  'researcher_id': 'ur.01344320306.26',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MARCO',
  'last_name': 'RAGAZZI',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.0655652202.53',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MICHELE',
  'last_name': 'CARAVIELLO',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.14587.3f',
  'researcher_id': 'ur.016015622301.36',
  'affiliations': [{'id': 'grid.14587.3f',
    'name': 'Telecom Italia (Italy)',
    'city': 'Rome',
    'city_id': 3169070,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]}]

NOTE: Instead of iterating through the authors/affiliations data by building a new function, we can just take advantage of the DslDataset class in the Dimcli library. This class abstracts the notion of a Dimensions ‘results list’ and provides useful methods to quickly process authors and affiliations.
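
For comparison, this is roughly what a hand-written flattening of the nested authors data could look like. It is only a sketch: the function name flatten_affiliations, the selection of output fields and the "pub.example" ID are assumptions made for illustration, and the two authors are a trimmed copy of the sample shown above.

```python
def flatten_affiliations(pub_id, authors):
    """Return one row (dict) per author/affiliation pair of a publication."""
    rows = []
    for author in authors:
        # an author may have no affiliations at all
        for aff in author.get("affiliations") or []:
            rows.append({
                "pub_id": pub_id,
                "first_name": author.get("first_name"),
                "last_name": author.get("last_name"),
                "aff_id": aff.get("id"),
                "aff_name": aff.get("name"),
                "aff_country": aff.get("country"),
            })
    return rows


# Trimmed copy of two authors from the sample above.
authors = [
    {"first_name": "MARCO", "last_name": "RAGAZZI",
     "affiliations": [{"id": "grid.11696.39", "name": "University of Trento",
                       "country": "Italy"}]},
    {"first_name": "MICHELE", "last_name": "CARAVIELLO",
     "affiliations": [{"id": "grid.14587.3f", "name": "Telecom Italia (Italy)",
                       "country": "Italy"}]},
]

industry_ids = {"grid.14587.3f"}  # tiny stand-in for the full company list
rows = flatten_affiliations("pub.example", authors)  # placeholder publication ID
industry_rows = [row for row in rows if row["aff_id"] in industry_ids]
print(industry_rows)
```

Using a set for the industry IDs keeps each membership test fast, which matters when the company list holds tens of thousands of entries.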

[10]:

from dimcli import DslDataset

# create a new DslDataset instance
pubsnew = DslDataset.from_publications_list(pubs)
# extract affiliations as a dataframe
affiliations = pubsnew.as_dataframe_authors_affiliations()
# focus only on affiliations including a grid from the industry set created above
affiliations = affiliations[affiliations['aff_id'].isin(gridis)]
# preview the data
affiliations.head(5)

[10]:

	aff_city	aff_city_id	aff_country	aff_country_code	aff_id	aff_name	aff_raw_affiliation	pub_id	researcher_id	first_name	last_name
7	Hamburg	2911298.0	Germany	DE	grid.410308.e	Airbus (Germany)	Airbus Defence and Space, Claude-Dornier-Stras...	pub.1059063534		N	Brandt
8	Milan	3173435.0	Italy	IT	grid.424032.3	OHB (Italy)	CGS S.p.A, Compagnia Generale per lo Spazio, V...	pub.1059063534	ur.014542047336.90	A	Bursi
15	Milan	3173435.0	Italy	IT	grid.424032.3	OHB (Italy)	CGS S.p.A, Compagnia Generale per lo Spazio, V...	pub.1059063534		D	Desiderio
16	Milan	3173435.0	Italy	IT	grid.424032.3	OHB (Italy)	CGS S.p.A, Compagnia Generale per lo Spazio, V...	pub.1059063534		E	Piersanti
19	Bristol	2654675.0	United Kingdom	GB	grid.7546.0	Airbus (United Kingdom)	Airbus Defence and Space, Gunnels Wood Road, S...	pub.1059063534	ur.010504106037.54	N	Dunbar

Let’s now count frequency and create a nice chart summing up the top industry collaborators.

TIP: Try zooming in on the left-hand side to bring the most frequent organizations into focus.

[11]:

px.histogram(affiliations,
             x="aff_name",
             height=900,
             title=f"Top Industry collaborators for {GRIDID}").update_xaxes(categoryorder="total descending")
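
Note that the affiliations table holds one row per author and affiliation, so a histogram over aff_name counts authorships rather than publications. If you prefer to rank organizations by the number of distinct publications, something like the following sketch could be used. It works on plain dictionaries (such as those produced by affiliations.to_dict("records")), and the sample rows reproduce the aff_name and pub_id values of the five preview rows shown earlier.

```python
from collections import defaultdict

# aff_name / pub_id values of the five preview rows shown earlier
rows = [
    {"aff_name": "Airbus (Germany)", "pub_id": "pub.1059063534"},
    {"aff_name": "OHB (Italy)", "pub_id": "pub.1059063534"},
    {"aff_name": "OHB (Italy)", "pub_id": "pub.1059063534"},
    {"aff_name": "OHB (Italy)", "pub_id": "pub.1059063534"},
    {"aff_name": "Airbus (United Kingdom)", "pub_id": "pub.1059063534"},
]

row_counts = defaultdict(int)    # what a plain histogram counts
pubs_per_org = defaultdict(set)  # distinct publication IDs per organization

for row in rows:
    row_counts[row["aff_name"]] += 1
    pubs_per_org[row["aff_name"]].add(row["pub_id"])

pub_counts = {name: len(ids) for name, ids in pubs_per_org.items()}

for name in sorted(pub_counts, key=pub_counts.get, reverse=True):
    print(f"{name}: {row_counts[name]} rows, {pub_counts[name]} publications")
```

Both measures are useful: row counts reflect how many researchers are involved, while publication counts reflect how many joint outputs exist.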

3.4 Countries of Industry Collaborators

We can use the same dataset to segment the data by country.

[12]:

px.pie(affiliations,
       names="aff_country",
       height=600,
       title=f"Countries of collaborators for {GRIDID}")

3.5 Putting Countries and Collaborators together

TIP: Click the entries in the legend on the right to turn specific countries on or off.

[13]:

px.histogram(affiliations,
             x="aff_name",
             height=900,
             color="aff_country",
             title=f"Top Countries and Industry collaborators for {gridname}-{GRIDID}",
             color_discrete_sequence=px.colors.diverging.Spectral)

Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.