
Identifying the Industry Collaborators of an Academic Institution

Dimensions uses GRID identifiers for institutions, so you can take advantage of GRID metadata in Dimensions queries.

In this tutorial we identify all organizations that have an industry type.

This list of organizations is then used to identify industry collaborations for a chosen academic institution.

Prerequisites

Please install the latest versions of these libraries to run this notebook.

[1]:
# @markdown # Get the API library and login
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}


!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

#
# load common libraries
import time
import sys
import json
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated
from tqdm.notebook import tqdm as progress

#
# charts libs
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
Dimcli - Dimensions API Client (v0.6.9)
Connected to endpoint: https://app.dimensions.ai - DSL version: 1.24
Method: dsl.ini file

1. Selecting an academic institution

For the purpose of this exercise, we will use University of Trento, Italy (grid.11696.39) as a starting point. You can pick any other GRID organization of course. Just use a DSL query or the GRID website to discover the ID of an organization that interests you.

[2]:
#@markdown The main organization we are interested in:
GRIDID = "grid.11696.39" #@param {type:"string"}

#@markdown The start/end year of publications used to extract industry collaborations:
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

#
# gen link to Dimensions
#
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # fall back to an empty name if the lookup fails
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {} <br /><br />'.format(YEAR_START, YEAR_END)))

Time period: 2000 to 2016

2. Extracting publications from industry collaborations

First of all we want to extract all GRID orgs with type='Company' using the API. Then we will use this list of organizations to identify industry collaborators for our chosen institution.

  • We can use the query_iterative method of the Dimcli Dsl object to automatically retrieve ‘company’ GRID orgs in batches of 1000.

  • NOTE: this step retrieves several thousand records from the API, so it may take a few minutes to complete.
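
To make the batching concrete, here is a minimal sketch of the kind of paging loop that query_iterative automates for us. It is not Dimcli's actual implementation: `fetch_page` is a hypothetical stand-in for a single API call, the paging is simulated on a local list, and the record IDs are made up.

```python
def fetch_page(records, limit, skip):
    """Hypothetical stand-in for one API call: return one page plus the total count."""
    return records[skip:skip + limit], len(records)


def fetch_all(records, limit=1000):
    """Collect every record by requesting successive pages of `limit` items."""
    collected = []
    skip = 0
    while True:
        page, total = fetch_page(records, limit, skip)
        collected.extend(page)
        print(f"{len(collected)} / {total}")  # progress line, one per page
        skip += limit
        if not page or len(collected) >= total:
            break
    return collected


# simulate 2,500 organization records (made-up IDs, for illustration only)
fake_orgs = [{"id": f"grid.{i}"} for i in range(2500)]
all_orgs = fetch_all(fake_orgs, limit=1000)
```

Each pass asks for the next page until the running total reaches the count reported by the server, which is why the log of the real extraction grows in steps of 1000.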

[3]:
# get GRID IDs
company_grids = dsl.query_iterative("""search organizations where types="Company" return organizations[id]""")
1000 / ...
1000 / 27940
2000 / 27940
3000 / 27940
4000 / 27940
5000 / 27940
6000 / 27940
7000 / 27940
8000 / 27940
9000 / 27940
10000 / 27940
11000 / 27940
12000 / 27940
13000 / 27940
14000 / 27940
15000 / 27940
16000 / 27940
17000 / 27940
18000 / 27940
19000 / 27940
20000 / 27940
21000 / 27940
22000 / 27940
23000 / 27940
24000 / 27940
25000 / 27940
26000 / 27940
27000 / 27940
27940 / 27940
===
Records extracted: 27940

We can now set up a parametrized query that pulls Dimensions publications resulting from industry collaborations.

Together with IDs, DOIs and publication types, the publications returned should include citation counts and author information, so that we can later draw up some useful statistics based on this metadata.
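
As a rough illustration of how a parametrized query of this kind gets filled in, the sketch below formats a shortened template for small batches of company IDs. The template text and the `split_into_chunks` helper are illustrative assumptions (the notebook itself uses its own template and the chunking helper shipped with Dimcli). The three company IDs are taken from the affiliations shown later in this tutorial.

```python
import json

# shortened, illustrative template: institution ID, list of company IDs, year range
example_template = """
    search publications
       where research_orgs.id = "{}"
         and research_orgs.id in {}
         and year in [{}:{}]
    return publications[id+doi]
    """


def split_into_chunks(items, size):
    """Illustrative helper: yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


company_ids = ["grid.14587.3f", "grid.410308.e", "grid.424032.3"]

# json.dumps turns a Python list into a double-quoted list literal
queries = [
    example_template.format("grid.11696.39", json.dumps(chunk), 2000, 2016)
    for chunk in split_into_chunks(company_ids, 2)
]
print(queries[0])
```

Splitting the company list into batches keeps each query string to a manageable size, at the cost of running one query per batch.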

[4]:
query_template = """
    search publications
       where
        research_orgs.id = "{}"
        and research_orgs.id in {}
        and year in [{}:{}]
    return publications[id+doi+type+times_cited+year+authors]
    """
[5]:
gridis = list(company_grids.as_dataframe()['id'])

#
# loop through all grids

ITERATION_RECORDS = 1000  # Publication records per query iteration
GRID_RECORDS = 200       # grid IDs per query
VERBOSE = False          # set to True to view full extraction logs
print(f"===\nExtracting {GRIDID} publications with industry collaborators ...")
print("Records per query : ", ITERATION_RECORDS)
print("GRID IDs per query: ", GRID_RECORDS)
results = []


for chunk in progress(list(chunks_of(gridis, GRID_RECORDS))):
    query = query_template.format(GRIDID, json.dumps(chunk), YEAR_START, YEAR_END)
#     print(query)
    data = dsl.query_iterative(query, verbose=VERBOSE, limit=ITERATION_RECORDS)
    if data.errors:
        print("==\nIteration failed: due to an error no data was extracted for this iteration. \nTry adjusting the ITERATION_RECORDS or GRID_RECORDS parameters and rerun the extraction.")
    else:
        results += data.publications
    time.sleep(0.5)

#
# put the publication data into a dataframe, remove duplicates and save

pubs = pd.DataFrame(results)
# print("===\nIndustry Publications found: ", len(pubs))
pubs.drop_duplicates(subset='id', inplace=True)
print("Unique Industry Publications found: ", len(pubs))

#
# preview the data
print("===\nPreview:")
pubs.head(10)
===
Extracting grid.11696.39 publications with industry collaborators ...
Records per query :  1000
GRID IDs per query:  200

Unique Industry Publications found:  368
===
Preview:
[5]:
id year doi type authors times_cited
0 pub.1087081818 2016 10.2495/sdp-v12-n3-552-558/020 chapter [{'first_name': 'LUCA DALLA', 'last_name': 'VA... 0
1 pub.1059063534 2016 10.1088/0264-9381/33/23/235015 article [{'first_name': 'M', 'last_name': 'Armano', 'i... 7
2 pub.1049398390 2016 10.1140/epjds/s13688-016-0075-3 article [{'first_name': 'Andrey', 'last_name': 'Bogomo... 17
3 pub.1033140941 2016 10.1140/epjds/s13688-016-0064-6 article [{'first_name': 'Simone', 'last_name': 'Centel... 11
4 pub.1001093903 2016 10.1145/2994551.2994558 proceeding [{'first_name': 'Timofei', 'last_name': 'Istom... 23
5 pub.1022499785 2016 10.1038/nn.4399 article [{'first_name': 'Menachem', 'last_name': 'From... 434
6 pub.1015046696 2016 10.1186/s12977-016-0294-5 article [{'first_name': 'Irena', 'last_name': 'Zurnic'... 1
7 pub.1014099327 2016 10.1016/j.jalz.2016.06.1894 article [{'first_name': 'Moira', 'last_name': 'Marizzo... 0
8 pub.1038596770 2016 10.1016/j.apnum.2016.02.001 article [{'first_name': 'Julia', 'last_name': 'Leibing... 10
9 pub.1049823866 2016 10.1016/j.jalz.2016.06.1013 article [{'first_name': 'Michela', 'last_name': 'Pieva... 0
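
Duplicates can appear because each batch of GRID IDs is queried separately: a publication whose industry co-authors fall into two different batches is returned once per batch. The sketch below shows the same de-duplication idea in plain Python. The records are a hand-made sample, and we only pretend that the second publication matched two batches.

```python
# hand-made sample: pretend pub.1059063534 was returned by two different batches
results = [
    {"id": "pub.1087081818", "times_cited": 0},
    {"id": "pub.1059063534", "times_cited": 7},
    {"id": "pub.1059063534", "times_cited": 7},
]

unique = {}
for record in results:
    # keep only the first record seen for each publication ID
    unique.setdefault(record["id"], record)

unique_pubs = list(unique.values())
print(len(results), "->", len(unique_pubs))
```

This mirrors what dropping duplicates on the 'id' column does for the dataframe above.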

3. Analyses

In this section we will build some visualizations that help us understand the data we extracted.

3.1 Count of Publications per year from Industry Collaborations

A simple histogram shows the number of publications per year, broken down by publication type.

[6]:
px.histogram(pubs, x="year", y="id", color="type",
                title=f"Publications per year with industry collaborations for {GRIDID}")

3.2 Citations from Industry Collaboration

[7]:
# sum only the citation counts; the other columns are not numeric
pubs_grouped = pubs.groupby('year', as_index=False)['times_cited'].sum()
px.bar(pubs_grouped, x="year", y="times_cited",
       title=f"Total citations per year for publications with industry collaborations for {GRIDID}")

3.3 Top Industry Collaborators

In order to dig deeper into the industry affiliations we have to process the nested JSON data in the ‘authors’ column. By doing so, we can process authors & affiliations information and identify the ones belonging to the ‘industry’ set defined above.

For example, if we extract the authors data for the first publication/row (pubs.iloc[0]['authors']), this is what it’d look like:

[{'first_name': 'LUCA DALLA',
  'last_name': 'VALLE',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.013645226073.38',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'ELENA CRISTINA',
  'last_name': 'RADA',
  'corresponding': '',
  'orcid': "['0000-0003-0807-1826']",
  'current_organization_id': 'grid.18147.3b',
  'researcher_id': 'ur.01344320306.26',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MARCO',
  'last_name': 'RAGAZZI',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.11696.39',
  'researcher_id': 'ur.0655652202.53',
  'affiliations': [{'id': 'grid.11696.39',
    'name': 'University of Trento',
    'city': 'Trento',
    'city_id': 3165243,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]},
 {'first_name': 'MICHELE',
  'last_name': 'CARAVIELLO',
  'corresponding': '',
  'orcid': '',
  'current_organization_id': 'grid.14587.3f',
  'researcher_id': 'ur.016015622301.36',
  'affiliations': [{'id': 'grid.14587.3f',
    'name': 'Telecom Italia (Italy)',
    'city': 'Rome',
    'city_id': 3169070,
    'country': 'Italy',
    'country_code': 'IT',
    'state': None,
    'state_code': None}]}]

NOTE: Instead of iterating through the authors/affiliations data by building a new function, we can just take advantage of the DslDataset class in the Dimcli library. This class abstracts the notion of a Dimensions ‘results list’ and provides useful methods to quickly process authors and affiliations.
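
For comparison, a hand-written version of that iteration might look like the sketch below. It is only an illustration of the nested structure shown above, not how DslDataset is implemented; the function name and the selection of output fields are our own assumptions.

```python
def industry_affiliations(publications, industry_ids):
    """Return one row per (publication, author, affiliation) whose affiliation ID is in industry_ids."""
    industry_ids = set(industry_ids)  # set membership is fast for thousands of IDs
    rows = []
    for pub in publications:
        for author in pub.get("authors") or []:
            for aff in author.get("affiliations") or []:
                if aff.get("id") in industry_ids:
                    rows.append({
                        "pub_id": pub["id"],
                        "last_name": author.get("last_name"),
                        "aff_id": aff["id"],
                        "aff_name": aff.get("name"),
                        "aff_country": aff.get("country"),
                    })
    return rows


# trimmed-down copy of the first publication shown above
sample = [{
    "id": "pub.1087081818",
    "authors": [
        {"last_name": "VALLE",
         "affiliations": [{"id": "grid.11696.39", "name": "University of Trento", "country": "Italy"}]},
        {"last_name": "CARAVIELLO",
         "affiliations": [{"id": "grid.14587.3f", "name": "Telecom Italia (Italy)", "country": "Italy"}]},
    ],
}]

rows = industry_affiliations(sample, ["grid.14587.3f"])
print(rows)
```

Only the Telecom Italia affiliation survives the filter, since the University of Trento is not in the industry set.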

[8]:
# create a new DslDataset instance
pubsnew = DslDataset.from_publications_list(pubs)
# extract affiliations as a dataframe
affiliations = pubsnew.as_dataframe_authors_affiliations()
# focus only on affiliations including a grid from the industry set created above
affiliations = affiliations[affiliations['aff_id'].isin(gridis)]
# preview the data
affiliations.head(5)
[8]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
3 grid.14587.3f Telecom Italia (Italy) Rome 3.16907e+06 Italy IT pub.1087081818 ur.016015622301.36 MICHELE CARAVIELLO
11 grid.410308.e Airbus (Germany) Hamburg 2.9113e+06 Germany DE pub.1059063534 N Brandt
12 grid.424032.3 Orbitale Hochtechnologie Bremen (Italy) Milan 3.17344e+06 Italy IT pub.1059063534 ur.014542047336.90 A Bursi
19 grid.424032.3 Orbitale Hochtechnologie Bremen (Italy) Milan 3.17344e+06 Italy IT pub.1059063534 D Desiderio
20 grid.424032.3 Orbitale Hochtechnologie Bremen (Italy) Milan 3.17344e+06 Italy IT pub.1059063534 E Piersanti

Let’s now count how often each organization appears and chart the top industry collaborators.

TIP Try zooming in on the left-hand side to put into focus the organizations that appear most frequently.

[9]:
px.histogram(affiliations, x="aff_name", height=900,
             title=f"Top Industry collaborators for {GRIDID}").update_xaxes(categoryorder="total descending")
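
Note that the affiliations table has one row per author affiliation, so a company with several authors on the same paper is counted several times in this chart. If we wanted to rank collaborators by distinct publications instead, a minimal sketch could look like this. The rows are copied from the preview above, and the variable names are our own.

```python
from collections import defaultdict

# rows copied from the affiliations preview (one row per author affiliation)
rows = [
    {"aff_name": "Telecom Italia (Italy)", "pub_id": "pub.1087081818"},
    {"aff_name": "Airbus (Germany)", "pub_id": "pub.1059063534"},
    {"aff_name": "Orbitale Hochtechnologie Bremen (Italy)", "pub_id": "pub.1059063534"},
    {"aff_name": "Orbitale Hochtechnologie Bremen (Italy)", "pub_id": "pub.1059063534"},
    {"aff_name": "Orbitale Hochtechnologie Bremen (Italy)", "pub_id": "pub.1059063534"},
]

# collect the set of distinct publication IDs for each organization
pubs_per_org = defaultdict(set)
for row in rows:
    pubs_per_org[row["aff_name"]].add(row["pub_id"])

# sort by number of distinct publications (descending), then by name
ranking = sorted(((name, len(ids)) for name, ids in pubs_per_org.items()),
                 key=lambda item: (-item[1], item[0]))
for name, count in ranking:
    print(f"{count}  {name}")
```

With the dataframe itself, the same figure should be obtainable with pandas via `affiliations.groupby("aff_name")["pub_id"].nunique()`.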

3.4 Countries of Industry Collaborators

We can use the same dataset to segment the data by country.

[10]:
px.pie(affiliations, names="aff_country", height=600,
      title=f"Countries of collaborators for {GRIDID}")

3.5 Putting Countries and Collaborators together

TIP Click on the legend in the right panel to turn specific countries on or off.

[11]:
px.histogram(affiliations, x="aff_name", height=900, color="aff_country",
             title=f"Top Countries and Industry collaborators for {gridname}-{GRIDID}",
             color_discrete_sequence=px.colors.diverging.Spectral)


Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
