Identifying the Industry Collaborators of an Academic Institution
Dimensions uses GRID identifiers for institutions, so you can combine GRID metadata with Dimensions queries.
In this tutorial we first identify all organizations that have an industry type.
This list of organizations is then used to find the industry collaborations of a chosen academic institution.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==
Prerequisites
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
if 'google.colab' not in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
1. Selecting an academic institution
For the purpose of this exercise, we will use University of Trento, Italy (grid.11696.39) as a starting point. You can pick any other GRID organization of course. Just use a DSL query or the GRID website to discover the ID of an organization that interests you.
[3]:
#@markdown The main organization we are interested in:
GRIDID = "grid.11696.39" #@param {type:"string"}
#@markdown The start/end year of publications used to extract industry collaborations:
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}
if YEAR_END < YEAR_START:
    YEAR_END = YEAR_START
#
# gen link to Dimensions
#
try:
    gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
    # fall back to an empty name if the lookup fails
    gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} ⧉</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {} <br /><br />'.format(YEAR_START, YEAR_END)))
2. Extracting publications from industry collaborations
First, we extract all GRID organizations with type='Company' using the API. We will then use this list of organizations to identify industry collaborators for our chosen institution.
We can use the dsl.query_iterative method to retrieve the ‘company’ GRID organizations automatically in batches of 1000.
NOTE: this step retrieves tens of thousands of records from the API (about 30,000 in the run shown below), so it may take a few minutes to complete.
[4]:
# get GRID IDs
company_grids = dsl.query_iterative("""search organizations where types="Company" return organizations[id]""")
Starting iteration with limit=1000 skip=0 ...
0-1000 / 30088 (0.63s)
1000-2000 / 30088 (0.57s)
2000-3000 / 30088 (0.69s)
3000-4000 / 30088 (0.52s)
4000-5000 / 30088 (0.51s)
5000-6000 / 30088 (0.61s)
6000-7000 / 30088 (0.52s)
7000-8000 / 30088 (0.56s)
8000-9000 / 30088 (2.24s)
9000-10000 / 30088 (0.56s)
10000-11000 / 30088 (0.57s)
11000-12000 / 30088 (0.58s)
12000-13000 / 30088 (0.62s)
13000-14000 / 30088 (1.74s)
14000-15000 / 30088 (0.58s)
15000-16000 / 30088 (0.49s)
16000-17000 / 30088 (0.58s)
17000-18000 / 30088 (0.53s)
18000-19000 / 30088 (0.57s)
19000-20000 / 30088 (0.50s)
20000-21000 / 30088 (0.51s)
21000-22000 / 30088 (0.51s)
22000-23000 / 30088 (0.54s)
23000-24000 / 30088 (0.50s)
24000-25000 / 30088 (0.53s)
25000-26000 / 30088 (0.62s)
26000-27000 / 30088 (0.49s)
27000-28000 / 30088 (0.48s)
28000-29000 / 30088 (0.56s)
29000-30000 / 30088 (0.90s)
30000-30088 / 30088 (0.61s)
===
Records extracted: 30088
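The log above shows the pattern behind the iteration: the same query is issued repeatedly with a fixed limit and a growing skip until the reported total is covered. The following is a minimal plain-Python sketch of that loop, not Dimcli's actual implementation; `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake function stands in for a single API call.

```python
from typing import Callable, Dict, List


def fetch_all(fetch_page: Callable[[int, int], Dict], limit: int = 1000) -> List[dict]:
    """Request pages of `limit` records, advancing `skip`, until all are collected."""
    records: List[dict] = []
    skip = 0
    while True:
        page = fetch_page(limit, skip)
        records.extend(page["records"])
        skip += limit
        # stop when the reported total is covered or a page comes back empty
        if skip >= page["total"] or not page["records"]:
            break
    return records


# Hypothetical stand-in for the API: 2,500 made-up organization records
FAKE_ORGS = [{"id": f"org-{i}"} for i in range(2500)]


def fake_fetch_page(limit: int, skip: int) -> Dict:
    """Return one page of fake records plus the overall total."""
    return {"total": len(FAKE_ORGS), "records": FAKE_ORGS[skip:skip + limit]}


all_orgs = fetch_all(fake_fetch_page, limit=1000)
print(f"Records extracted: {len(all_orgs)}")
```

With 2,500 fake records and a limit of 1000, the loop makes three requests (skip 0, 1000 and 2000), mirroring the shape of the log above.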
We can now set up a parametrized query that pulls Dimensions publications resulting from industry collaborations.
Together with IDs, DOIs and publication types, the records should include citation counts and author information, so that we can later draw up some useful statistics based on this metadata.
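The query template defined in the next cell has four positional placeholders: the institution's GRID ID, a list of company GRID IDs, and the start and end years. The sketch below repeats the template and fills it with two company IDs that appear in the sample data later in this tutorial, to show the text that would be sent. `json.dumps` renders the Python list with double-quoted strings, matching the list syntax used in the query; `sample_companies` is a purely illustrative name.

```python
import json

# same template as in the tutorial cell
query_template = """
search publications
where
research_orgs.id = "{}"
and research_orgs.id in {}
and year in [{}:{}]
return publications[id+doi+type+times_cited+year+authors]
"""

# illustrative two-item list; the real loop passes up to 200 IDs per query
sample_companies = ["grid.14587.3f", "grid.424032.3"]

query = query_template.format("grid.11696.39", json.dumps(sample_companies), 2000, 2016)
print(query)
```

The two `research_orgs.id` conditions together require that a publication lists both the chosen institution and at least one of the companies in the list.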
[5]:
query_template = """
search publications
where
research_orgs.id = "{}"
and research_orgs.id in {}
and year in [{}:{}]
return publications[id+doi+type+times_cited+year+authors]
"""
[6]:
gridis = list(company_grids.as_dataframe()['id'])
#
# loop through all grids
ITERATION_RECORDS = 1000 # Publication records per query iteration
GRID_RECORDS = 200 # grid IDs per query
VERBOSE = False # set to True to view full extraction logs
print(f"===\nExtracting {GRIDID} publications with industry collaborators ...")
print("Records per query : ", ITERATION_RECORDS)
print("GRID IDs per query: ", GRID_RECORDS)
results = []
for chunk in progress(list(chunks_of(gridis, GRID_RECORDS))):
    query = query_template.format(GRIDID, json.dumps(chunk), YEAR_START, YEAR_END)
    # print(query)
    data = dsl.query_iterative(query, verbose=VERBOSE, limit=ITERATION_RECORDS)
    if data.errors:
        print("==\nIteration failed: due to an error no data was extracted for this iteration. \nTry adjusting the ITERATION_RECORDS or GRID_RECORDS parameters and rerun the extraction.")
    else:
        results += data.publications
    time.sleep(0.5)
#
# put the publication data into a dataframe, remove duplicates and save
pubs = pd.DataFrame.from_dict(results)
# print("===\nIndustry Publications found: ", len(pubs))
pubs.drop_duplicates(subset='id', inplace=True)
print("Unique Industry Publications found: ", len(pubs))
#
# preview the data
print("===\nPreview:")
pubs.head(10)
===
Extracting grid.11696.39 publications with industry collaborators ...
Records per query : 1000
GRID IDs per query: 200
Unique Industry Publications found: 375
===
Preview:
[6]:
| | authors | doi | id | times_cited | type | year |
|---|---|---|---|---|---|---|
| 0 | [{'affiliations': [{'city': 'Madrid', 'city_id... | 10.1088/0264-9381/33/23/235015 | pub.1059063534 | 7 | article | 2016 |
| 1 | [{'affiliations': [{'city': 'Dublin', 'city_id... | 10.1145/2984356.2984363 | pub.1001653422 | 14 | proceeding | 2016 |
| 2 | [{'affiliations': [{'city': 'Stuttgart', 'city... | 10.1016/j.apnum.2016.02.001 | pub.1038596770 | 12 | article | 2016 |
| 3 | [{'affiliations': [{'city': 'Madrid', 'city_id... | 10.1103/physrevlett.116.231101 | pub.1001053038 | 313 | article | 2016 |
| 4 | [{'affiliations': [{'city': 'Dublin', 'city_id... | 10.1109/eucnc.2016.7561056 | pub.1094950798 | 22 | proceeding | 2016 |
| 5 | [{'affiliations': [{'city': 'Trento', 'city_id... | 10.1140/epjds/s13688-016-0064-6 | pub.1033140941 | 15 | article | 2016 |
| 6 | [{'affiliations': [{'city': 'Trento', 'city_id... | 10.1089/big.2014.0054 | pub.1018945654 | 48 | article | 2015 |
| 7 | [{'affiliations': [{'city': 'Madrid', 'city_id... | 10.1088/1742-6596/610/1/012027 | pub.1031150191 | 1 | article | 2015 |
| 8 | [{'affiliations': [{'city': 'Madrid', 'city_id... | 10.1088/1742-6596/610/1/012005 | pub.1052522882 | 17 | article | 2015 |
| 9 | [{'affiliations': [{'city': 'Madrid', 'city_id... | 10.1088/1742-6596/610/1/012026 | pub.1033837350 | 2 | article | 2015 |
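Two details of the extraction loop above are worth spelling out. The company IDs are split into fixed-size chunks, presumably to keep each query to a manageable size, and a publication co-authored with companies from different chunks can be returned more than once, hence the de-duplication step. A rough plain-Python equivalent is sketched below; `chunked` and `dedupe_by_id` are hypothetical helpers written for illustration, not the `chunks_of` function from Dimcli or the pandas call used above, and all IDs are made up.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def dedupe_by_id(records: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each record, keyed on its 'id' field."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique


company_ids = [f"company-{i}" for i in range(450)]  # made-up IDs
batches = list(chunked(company_ids, 200))
print([len(batch) for batch in batches])  # sizes of the batches

# the same publication can come back from two different batches
raw_results = [{"id": "pub.1"}, {"id": "pub.2"}, {"id": "pub.1"}]
print(dedupe_by_id(raw_results))
```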
3. Analyses
In this section we will build some visualizations that help us understand the data we extracted.
3.1 Count of Publications per year from Industry Collaborations
A simple histogram shows the number of publications per year, broken down by publication type.
[7]:
px.histogram(pubs,
             x="year",
             color="type",
             title=f"Publications per year with industry collaborations for {GRIDID}")
3.2 Citations from Industry Collaboration
[8]:
# sum only the citation counts; the other columns (e.g. 'authors') are not numeric
pubs_grouped = pubs.groupby('year', as_index=False)['times_cited'].sum()
px.bar(pubs_grouped,
       x="year",
       y="times_cited",
       title=f"Total citations per year for publications with industry collaborations for {GRIDID}")
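Both charts so far are simple aggregations by year: a count of records for the histogram and a sum of `times_cited` for the bar chart. As a sanity check, the sketch below reproduces the two aggregations in plain Python on the ten rows of the preview table shown earlier, so the numbers cover only that sample and not the full result set.

```python
from collections import Counter, defaultdict

# (year, times_cited) pairs copied from the ten-row preview table
preview_rows = [
    (2016, 7), (2016, 14), (2016, 12), (2016, 313), (2016, 22), (2016, 15),
    (2015, 48), (2015, 1), (2015, 17), (2015, 2),
]

# number of publications per year (what the histogram counts)
pubs_per_year = Counter(year for year, _ in preview_rows)

# total citations per year (what the bar chart sums)
citations_per_year = defaultdict(int)
for year, times_cited in preview_rows:
    citations_per_year[year] += times_cited

for year in sorted(pubs_per_year):
    print(year, pubs_per_year[year], citations_per_year[year])
```

A single highly cited paper (313 citations in this sample) can dominate a year's total, which is worth keeping in mind when reading the bar chart.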
3.3 Top Industry Collaborators
To dig deeper into the industry affiliations we have to process the nested JSON data in the ‘authors’ column. This lets us go through the author and affiliation information and pick out the affiliations that belong to the ‘industry’ set defined above.
For example, if we extract the authors data for the first publication/row (pubs.iloc[0]['authors']), it would look something like this:
[{'first_name': 'LUCA DALLA',
'last_name': 'VALLE',
'corresponding': '',
'orcid': '',
'current_organization_id': 'grid.11696.39',
'researcher_id': 'ur.013645226073.38',
'affiliations': [{'id': 'grid.11696.39',
'name': 'University of Trento',
'city': 'Trento',
'city_id': 3165243,
'country': 'Italy',
'country_code': 'IT',
'state': None,
'state_code': None}]},
{'first_name': 'ELENA CRISTINA',
'last_name': 'RADA',
'corresponding': '',
'orcid': "['0000-0003-0807-1826']",
'current_organization_id': 'grid.18147.3b',
'researcher_id': 'ur.01344320306.26',
'affiliations': [{'id': 'grid.11696.39',
'name': 'University of Trento',
'city': 'Trento',
'city_id': 3165243,
'country': 'Italy',
'country_code': 'IT',
'state': None,
'state_code': None}]},
{'first_name': 'MARCO',
'last_name': 'RAGAZZI',
'corresponding': '',
'orcid': '',
'current_organization_id': 'grid.11696.39',
'researcher_id': 'ur.0655652202.53',
'affiliations': [{'id': 'grid.11696.39',
'name': 'University of Trento',
'city': 'Trento',
'city_id': 3165243,
'country': 'Italy',
'country_code': 'IT',
'state': None,
'state_code': None}]},
{'first_name': 'MICHELE',
'last_name': 'CARAVIELLO',
'corresponding': '',
'orcid': '',
'current_organization_id': 'grid.14587.3f',
'researcher_id': 'ur.016015622301.36',
'affiliations': [{'id': 'grid.14587.3f',
'name': 'Telecom Italia (Italy)',
'city': 'Rome',
'city_id': 3169070,
'country': 'Italy',
'country_code': 'IT',
'state': None,
'state_code': None}]}]
NOTE: Instead of writing a new function to iterate through the authors/affiliations data, we can take advantage of the DslDataset class in the Dimcli library. This class abstracts the notion of a Dimensions ‘results list’ and provides useful methods to quickly process authors and affiliations.
[10]:
from dimcli import DslDataset
# create a new DslDataset instance
pubsnew = DslDataset.from_publications_list(pubs)
# extract affiliations as a dataframe
affiliations = pubsnew.as_dataframe_authors_affiliations()
# focus only on affiliations including a grid from the industry set created above
affiliations = affiliations[affiliations['aff_id'].isin(gridis)]
# preview the data
affiliations.head(5)
[10]:
| | aff_city | aff_city_id | aff_country | aff_country_code | aff_id | aff_name | aff_raw_affiliation | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | Hamburg | 2911298.0 | Germany | DE | grid.410308.e | Airbus (Germany) | Airbus Defence and Space, Claude-Dornier-Stras... | | | pub.1059063534 | | N | Brandt |
| 8 | Milan | 3173435.0 | Italy | IT | grid.424032.3 | OHB (Italy) | CGS S.p.A, Compagnia Generale per lo Spazio, V... | | | pub.1059063534 | ur.014542047336.90 | A | Bursi |
| 15 | Milan | 3173435.0 | Italy | IT | grid.424032.3 | OHB (Italy) | CGS S.p.A, Compagnia Generale per lo Spazio, V... | | | pub.1059063534 | | D | Desiderio |
| 16 | Milan | 3173435.0 | Italy | IT | grid.424032.3 | OHB (Italy) | CGS S.p.A, Compagnia Generale per lo Spazio, V... | | | pub.1059063534 | | E | Piersanti |
| 19 | Bristol | 2654675.0 | United Kingdom | GB | grid.7546.0 | Airbus (United Kingdom) | Airbus Defence and Space, Gunnels Wood Road, S... | | | pub.1059063534 | ur.010504106037.54 | N | Dunbar |
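To make the flattening less of a black box, here is a rough plain-Python sketch of the idea: emit one row per author-affiliation pair, then keep the rows whose affiliation ID is in the industry set. This is an illustration only and not how `DslDataset` is implemented; `flatten_affiliations` is a hypothetical helper, the row keys merely mimic some of the column names shown above, and the sample record is a shortened version of the authors data printed earlier.

```python
from typing import Dict, List


def flatten_affiliations(pub: Dict) -> List[Dict]:
    """Return one row per (author, affiliation) pair of a publication record."""
    rows = []
    for author in pub.get("authors") or []:
        for aff in author.get("affiliations") or []:
            rows.append({
                "pub_id": pub["id"],
                "first_name": author.get("first_name"),
                "last_name": author.get("last_name"),
                "aff_id": aff.get("id"),
                "aff_name": aff.get("name"),
                "aff_country": aff.get("country"),
            })
    return rows


# shortened sample: two of the four authors from the example above;
# "pub.example" is a placeholder ID, not a real publication ID
sample_pub = {
    "id": "pub.example",
    "authors": [
        {"first_name": "MARCO", "last_name": "RAGAZZI",
         "affiliations": [{"id": "grid.11696.39", "name": "University of Trento",
                           "country": "Italy"}]},
        {"first_name": "MICHELE", "last_name": "CARAVIELLO",
         "affiliations": [{"id": "grid.14587.3f", "name": "Telecom Italia (Italy)",
                           "country": "Italy"}]},
    ],
}

industry_ids = {"grid.14587.3f"}  # stand-in for the full list of company GRID IDs
rows = flatten_affiliations(sample_pub)
industry_rows = [row for row in rows if row["aff_id"] in industry_ids]
print(industry_rows)
```

Using a set for `industry_ids` keeps the membership test fast even with tens of thousands of company IDs.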
Let’s now count how often each organization appears and chart the top industry collaborators.
TIP: Try zooming in on the left-hand side to focus on the organizations that appear most frequently.
[11]:
px.histogram(affiliations,
             x="aff_name",
             height=900,
             title=f"Top Industry collaborators for {GRIDID}").update_xaxes(categoryorder="total descending")
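The ranking behind this chart is a frequency count of the `aff_name` values. A minimal sketch of that count, using only the five preview rows shown above (so the numbers reflect that sample, not the full dataset):

```python
from collections import Counter

# aff_name values of the five preview rows
aff_names = [
    "Airbus (Germany)",
    "OHB (Italy)",
    "OHB (Italy)",
    "OHB (Italy)",
    "Airbus (United Kingdom)",
]

collaborator_counts = Counter(aff_names)

# most_common() sorts by descending count, like the chart's x-axis ordering
for name, count in collaborator_counts.most_common():
    print(f"{count:>3}  {name}")
```

Note that each row is one author-affiliation pair, so an organization with several co-authors on the same paper is counted several times. In the preview, all three OHB (Italy) rows belong to one publication.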
3.4 Countries of Industry Collaborators
We can use the same dataset to segment the data by country.
[12]:
px.pie(affiliations,
       names="aff_country",
       height=600,
       title=f"Countries of collaborators for {GRIDID}")
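Each slice of the pie is a country's share of the affiliation rows. The sketch below computes those shares as percentages for the `aff_country` values of the five preview rows shown in section 3.3; the figures therefore describe that small sample only.

```python
from collections import Counter

# aff_country values of the five preview rows
aff_countries = ["Germany", "Italy", "Italy", "Italy", "United Kingdom"]

country_counts = Counter(aff_countries)
total = sum(country_counts.values())

# percentage of rows per country, rounded to one decimal place
shares = {country: round(100 * n / total, 1) for country, n in country_counts.items()}
print(shares)
```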
3.5 Putting Countries and Collaborators together
TIP: Click the entries in the legend on the right to turn specific countries on or off.
[13]:
px.histogram(affiliations,
             x="aff_name",
             height=900,
             color="aff_country",
             title=f"Top Countries and Industry collaborators for {gridname}-{GRIDID}",
             color_discrete_sequence=px.colors.diverging.Spectral)
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.