../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

Measuring the Innovation Impact of an Organization using Patents Citations

This notebook shows how to use the Dimensions Analytics API to analyze the innovation impact of an organization.

There are two ways to achieve this. The first method is counting how many patents have inventors (assignees) affiliated to the GRID organization we are interested in. This method returns the direct innovation impact of an organization, which can be easily achieved via a single patents API query:

> search patents where assignees in ["grid.89170.37"] return patents limit 500
Returned Patents: 192
---
[1] METHODS AND SYSTEMS FOR OBJECT IDENTIFICATION AND FOR AUTHENTICATION (id: https://app.dimensions.ai/details/patent/WO-2007149621-A3 )
[2] ADENOVIRAL VECTOR-BASED MALARIA VACCINES (id: https://app.dimensions.ai/details/patent/EP-1929021-A2 )
[3] MULTIPLE BAND SHORT WAVE INFRARED MOSAIC ARRAY FILTER (id: https://app.dimensions.ai/details/patent/WO-2016040755-A1 )
[4] ARMOR PLATE (id: https://app.dimensions.ai/details/patent/WO-2011142867-A3 )
..etc..

The second method - which is what we demonstrate in this tutorial - permits instead to gain some insights into the indirect innovation impact of a research organization. This can be achieved by extracting patents that cite publications from that research organization.

Since the various content-types included in the Dimensions database are deeply interlinked, the Dimensions APIs allow to perform this analysis via a few simple steps:

  • We start from a GRID identifier (representing a research organization in Dimensions)

  • We use the publications API to extract all publications where at least one author is/as affiliated to the GRID organization, for a selected time-period

  • We then use the patents API to discover patents that include citations to any of those publications

  • Finally, we analyse the patents data to highlight trends e.g. about countries, inventors etc..

Prerequisites

Please install the latest versions of these libraries to run this notebook.

[1]:
# @markdown # Get the API library and login
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}

# import all libraries and login
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
#
import os
import sys
import time
import json
import pandas as pd

from tqdm.notebook import tqdm as progressbar
#
# charts lib
import plotly.express as px
if not 'google.colab' in sys.modules:
    # make js dependecies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
DimCli v0.6.7.2 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)

1. Choosing a GRID Research Organization

For the purpose of this exercise, we are going to use grid.4305.2 (The Open University, UK). Feel free to change the parameters below as you want, eg by choosing another GRID organization.

[2]:

GRIDID = "grid.10837.3d" #@param {type:"string"}

#@markdown The start/end year of publications used to extract patents
YEAR_START = 2005 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2015 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Open in Dimensions &#x29c9;</a>'.format(dimensions_url(GRIDID))))

#@markdown ---

2. Extracting Publications Data

By looking at the Dimensions API data model, we can see that Patents and Publications are connected by a property called publication_ids, which goes from Patents to Publications. This property represents the publications citations found in patents.

Hence, we need to 1. query for all publications with authors affiliated to our GRID ID 2. query for patents citing these publications

[3]:
# Get full list of publications linked to this organization for the selected time frame

q = f"""search publications
        where research_orgs.id="{GRIDID}"
        and year in [{YEAR_START}:{YEAR_END}]
        return publications[id+doi+title+type+journal+year+research_orgs+researchers+category_for+times_cited]"""
print("===\n", q, "\n===")
pubs_json = dsl.query_iterative(q, limit=1000)
pubs = pubs_json.as_dataframe()
===
 search publications
        where research_orgs.id="grid.10837.3d"
        and year in [2005:2015]
        return publications[id+doi+title+type+journal+year+research_orgs+researchers+category_for+times_cited] ===
1000 / ...
1000 / 8983
2000 / 8983
3000 / 8983
4000 / 8983
5000 / 8983
6000 / 8983
7000 / 8983
8000 / 8983
8983 / 8983
===
Records extracted: 8983
[4]:
pubs.head()
[4]:
id doi times_cited title research_orgs type category_for year researchers journal.id journal.title
0 pub.1028750505 10.1108/hcs-07-2015-0013 2 Do people choose to be homeless? An existentia... [{'id': 'grid.10837.3d', 'linkout': ['http://w... article [{'id': '3448', 'name': '1608 Sociology'}, {'i... 2015 [{'id': 'ur.014121733420.41', 'orcid_id': ['00... jour.1046873 Housing Care and Support
1 pub.1038820736 10.5194/gmdd-8-10677-2015 0 PLASIM-GENIE: a new intermediate complexity AOGCM [{'id': 'grid.450268.d', 'linkout': ['http://w... article [{'id': '2204', 'name': '04 Earth Sciences'}, ... 2015 [{'id': 'ur.015722150315.42', 'orcid_id': ['00... jour.1371971 Geoscientific Model Development Discussions
2 pub.1014068242 10.5194/esurf-3-587-2015 6 Perspective – synthetic DEMs: A vital underpin... [{'id': 'grid.5608.b', 'state_name': 'Veneto',... article [{'id': '2204', 'name': '04 Earth Sciences'}, ... 2015 [{'id': 'ur.014255575633.30', 'first_name': 'J... jour.1146353 Earth Surface Dynamics
3 pub.1046891576 10.1108/jap-11-2014-0035 0 Bruising in older adults: what do social worke... [{'id': 'grid.10837.3d', 'linkout': ['http://w... article [{'id': '3177', 'name': '1117 Public Health an... 2015 [{'id': 'ur.07357206671.63', 'orcid_id': ['000... jour.1023876 The Journal of Adult Protection
4 pub.1022169569 10.1117/1.jatis.2.1.011007 42 Technology advancement of the CCD201-20 EMCCD ... [{'id': 'grid.10837.3d', 'linkout': ['http://w... article [{'id': '2377', 'name': '0201 Astronomical and... 2015 [{'id': 'ur.010230156453.49', 'first_name': 'L... jour.1052851 Journal of Astronomical Telescopes Instruments...

Quick look at publications statistics

[5]:
px.histogram(pubs,
             x="year", y="id",
             color="type",
             barmode="group",
             title=f"Publication distribution by year - {GRIDID}")

What are the main subject areas?

We can use the Field of Research categories information in publications to obtain a breakdown of the publications by subject areas.

This can be achieved by ‘exploding’ the category_for data into a separate table, since there can be more than one category per publication. The new categories table also retains some basic info about the publications it relates to eg journal, title, publication id etc.. so to make it easier to analyse the data.

[6]:
pubs_categories = pubs.explode('category_for')
pubs_categories.dropna(subset=["category_for"], inplace=True)

def for_nice_name(for_dict):
    "transforms a category JSON into a nice looking title"
    if type(for_dict) == dict:
        name = for_dict['name']
        return ''.join([i for i in name if not i.isdigit()])
    else:
        return ""

# new col for nice name
pubs_categories["category_for_name"] = pubs_categories['category_for'].apply(lambda x: for_nice_name(x))
# new col for tot-pubs count
pubs_categories['count_pubs'] = pubs_categories.groupby("category_for_name")['id'].transform('count')

Let’s view the top categories using a pie chart.

[7]:
categories = pubs_categories.drop_duplicates(subset="category_for_name").sort_values("count_pubs", ascending=False)[['category_for_name', 'count_pubs']]

px.pie(categories[:20],
       names="category_for_name", # the dimension for the slices
       values="count_pubs",  # the metric
       color_discrete_sequence=px.colors.sequential.Bluyl,
       title=f"Top FOR categories")

3. Extracting Patents linked to Publications

In this section we extract all patents linked to the publications dataset previously created. The steps are the following:

  • we loop over the publication IDs and create patents queries, via the referencing publication_ids field of patents

  • we collate all patens data, remove duplicates from patents and save the results

  • finally, we count patents per publication and enrich the original publication dataset with these numbers

[8]:
#
# the main patents query
#
q = """search patents
        where publication_ids in {}
      return patents[basics+publication_ids+category_for]"""

BATCHSIZE = 400
VERBOSE = False # set to True to see patents extraction logs


#
# loop through all pub IDs in chunks and query patents
#
print("===\nExtracting patents data ...")
patents_json = []
pubsids = pubs['id']

for chunk in progressbar(list(chunks_of(list(pubsids), 400))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=VERBOSE)
    patents_json += data.patents
    time.sleep(1)


patents = pd.DataFrame().from_dict(patents_json)
patents.drop_duplicates(subset='id', inplace=True)
print("Patents found: ", len(patents))
===
Extracting patents data ...

Patents found:  175

Let’s preview the data

[9]:
patents.head(5)
[9]:
assignees assignee_names filing_status publication_date year id times_cited inventor_names title category_for publication_ids granted_year
0 [{'id': 'grid.10837.3d', 'name': 'The Open Uni... [UNIV OPEN] Application 2015-02-05 2014 WO-2015015185-A1 0.0 [PHILLIPS, JAMES, GEORGIOU, Melanie] ENGINEERED NEURAL TISSUE [{'id': '2581', 'name': '0601 Biochemistry and... [pub.1011179655, pub.1022718673, pub.100775934... NaN
1 [{'id': 'grid.83440.3b', 'name': 'University C... [UCL Business Ltd, THE OPEN UNIV, UCL BUSINESS... Grant 2018-06-12 2014 US-9993581-B2 NaN [James Phillips, Melanie Georgiou] Engineered neural tissue [{'id': '2581', 'name': '0601 Biochemistry and... [pub.1038537221, pub.1022718673, pub.100069165... 2018.0
2 [{'id': 'grid.482406.c', 'name': 'Prothena (Ir... [PROTHENA BIOSCIENCES LTD] Application 2017-09-08 2017 WO-2017149513-A1 NaN [NESS, DANIEL KEITH, FLANAGAN, KENNETH] ANTI-MCAM ANTIBODIES AND ASSOCIATED METHODS OF... NaN [pub.1036841477, pub.1014162397, pub.104598330... NaN
3 [{'id': 'grid.457334.2', 'name': 'CEA Saclay',... [Commissariat a lEnergie Atomique et aux Energ... Application 2016-06-02 2015 DE-102015223347-A1 NaN [Didier Landru, Capucine Delage, Franck Fourne... PROCESS FOR CONNECTING TWO SUBSTRATES [{'id': '2921', 'name': '0912 Materials Engine... [pub.1039579399] NaN
5 [{'id': 'grid.471325.6', 'name': 'Semiconducto... [Semiconductor Manufacturing International (Sh... Grant 2018-08-28 2016 US-10062704-B2 0.0 [Tzu Yin Chiu, Clifford Ian Drowley, Leong Tee... Buried-channel MOSFET and a surface-channel MO... [{'id': '2471', 'name': '0306 Physical Chemist... [pub.1061595798, pub.1061593545, pub.1061462502] 2018.0

Enriching publications with patents citations metrics

Each patent record contains all the publication_ids it cites, so we can take this metric so to enrich the original publications dataset we created above.

[10]:
def count_patents_per_pub(pubid):
  global patents
  return len(patents[patents['publication_ids'].str.contains(pubid)])

# turn lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
patents['publication_ids'] = patents['publication_ids'].apply(lambda x: ','.join(map(str, x)))

progressbar.pandas()
pubs['patents'] = pubs['id'].progress_apply(lambda x: count_patents_per_pub(x))
/Users/michele.pasin/Envs/jupyterlab/lib/python3.7/site-packages/tqdm/std.py:668: FutureWarning:

The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version


Now the patents column gives us the top publications by number of citing patents

[11]:
pubs.sort_values("patents", ascending=False).head()
[11]:
id doi times_cited title research_orgs type category_for year researchers journal.id journal.title patents
8449 pub.1002742563 10.1096/fj.04-3458fje 817 Blood‐brain barrier‐specific properties of a h... [{'id': 'grid.83440.3b', 'linkout': ['http://w... article [{'id': '3120', 'name': '1109 Neurosciences'},... 2005 [{'id': 'ur.01351646547.75', 'orcid_id': ['000... jour.1017429 The FASEB Journal 16
7988 pub.1061641397 10.1109/tip.2006.871114 83 Recovery of Surface Orientation From Diffuse P... [{'id': 'grid.10837.3d', 'linkout': ['http://w... article [{'id': '2921', 'name': '0912 Materials Engine... 2006 [{'id': 'ur.015423613355.47', 'orcid_id': ['00... jour.1129825 IEEE Transactions on Image Processing 15
5163 pub.1055155906 10.1021/bc900397s 37 Modification of thiol functionalized aptamers ... [{'id': 'grid.10837.3d', 'longitude': -0.70955... article [{'id': '2203', 'name': '03 Chemical Sciences'... 2010 [{'id': 'ur.01124615125.34', 'research_orgs': ... jour.1100499 Bioconjugate Chemistry 14
2886 pub.1033473743 10.1096/fj.11-201384 82 Cell‐penetrating anti‐GFAP VHH and correspondi... [{'id': 'grid.10992.33', 'city_name': 'Paris',... article [{'id': '3120', 'name': '1109 Neurosciences'},... 2012 [{'id': 'ur.01027626121.41', 'last_name': 'Li'... jour.1017429 The FASEB Journal 9
8322 pub.1014463103 10.1039/b600709k 111 An EXAFS study of the formation of a nanoporou... [{'id': 'grid.493299.c', 'linkout': ['http://w... article [{'id': '2203', 'name': '03 Chemical Sciences'... 2006 [{'id': 'ur.01040541162.25', 'orcid_id': ['000... jour.1297159 Chemical Communications 5

4. Patents Data Analysis

Now that we have extracted all the data we need, let’s start exploring them by building a few visualizations.

How many patents per year?

[12]:
px.histogram(patents, x="year", y="id",
             color="filing_status",
             barmode="group",
             title=f"Patents referencing publications from {GRIDID} - by year")

Who is filing the patents?

This can be done by looking at the field assigness of patent. Since the field contains nested information, first we need to extract it into its own table (similarly to what we’ve done above with publications categories).

[13]:
# ensure the key exists in all rows (even if empty)
normalize_key('assignees', patents_json)
# explode assigness into separate table
patents_assignees = pd.json_normalize(patents_json,
                                   record_path=['assignees'],
                                   meta=['id', 'year', 'title'],
                                   meta_prefix="patent_")
top_assignees = patents_assignees.groupby(['name', 'country_name'],
                                          as_index=False).count().sort_values(by="patent_id", ascending=False)
# preview the data: ps the patent_id column is the COUNT of patents
top_assignees[['name', 'country_name', 'patent_id']].head()
[13]:
name country_name patent_id
54 Pasteur Institute France 11
23 French National Centre for Scientific Research France 10
22 French Institute of Health and Medical Research France 10
52 Panasonic (Japan) Japan 8
71 The Open University United Kingdom 8
[14]:
px.bar(top_assignees,
       x="name", y="patent_id",
       hover_name="name", color="country_name",
       height=900,
       title=f"Top Assignees for patents referencing publications from {GRIDID}")
[15]:
px.scatter(patents_assignees,
           x="name", y="country_name",
           color="patent_year", hover_name="name",
           height = 1000,
           hover_data=["id", "patent_id"],  marginal_y="histogram",
           title=f"Assignees for patents referencing publications from {GRIDID} - Yearly breakdown")

What are the publications most frequenlty referenced in patents?

[16]:
pubs_cited = pubs.query("patents > 0 ").sort_values('patents', ascending=False).copy()
pubs_cited.head()
[16]:
id doi times_cited title research_orgs type category_for year researchers journal.id journal.title patents
8449 pub.1002742563 10.1096/fj.04-3458fje 817 Blood‐brain barrier‐specific properties of a h... [{'id': 'grid.83440.3b', 'linkout': ['http://w... article [{'id': '3120', 'name': '1109 Neurosciences'},... 2005 [{'id': 'ur.01351646547.75', 'orcid_id': ['000... jour.1017429 The FASEB Journal 16
7988 pub.1061641397 10.1109/tip.2006.871114 83 Recovery of Surface Orientation From Diffuse P... [{'id': 'grid.10837.3d', 'linkout': ['http://w... article [{'id': '2921', 'name': '0912 Materials Engine... 2006 [{'id': 'ur.015423613355.47', 'orcid_id': ['00... jour.1129825 IEEE Transactions on Image Processing 15
5163 pub.1055155906 10.1021/bc900397s 37 Modification of thiol functionalized aptamers ... [{'id': 'grid.10837.3d', 'longitude': -0.70955... article [{'id': '2203', 'name': '03 Chemical Sciences'... 2010 [{'id': 'ur.01124615125.34', 'research_orgs': ... jour.1100499 Bioconjugate Chemistry 14
2886 pub.1033473743 10.1096/fj.11-201384 82 Cell‐penetrating anti‐GFAP VHH and correspondi... [{'id': 'grid.10992.33', 'city_name': 'Paris',... article [{'id': '3120', 'name': '1109 Neurosciences'},... 2012 [{'id': 'ur.01027626121.41', 'last_name': 'Li'... jour.1017429 The FASEB Journal 9
8656 pub.1004641182 10.1196/annals.1342.014 11 Amyloid Precursor Protein: From Synaptic Plast... [{'id': 'grid.10837.3d', 'linkout': ['http://w... article [{'id': '3120', 'name': '1109 Neurosciences'},... 2005 [{'id': 'ur.01333733704.35', 'first_name': 'Ra... jour.1002487 Annals of the New York Academy of Sciences 5
[17]:
px.bar(pubs_cited[:1000],
       color="type",
       x="year", y="patents",
       hover_name="title",  hover_data=["journal.title"],
       title=f"Top Publications from {GRIDID} mentioned in patents, by year of publication")

What are the main subject areas of referenced publications?

[18]:
THRESHOLD_PUBS = 1000
citedids = list(pubs_cited[:THRESHOLD_PUBS]['id'])
pubs_categories_cited = pubs_categories[pubs_categories['id'].isin(citedids)]
[19]:
px.scatter(pubs_categories_cited, x="year", y="category_for_name", color="type",
           hover_name="title",
           hover_data=["doi", "year", "journal.title"],
           height=800,
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top {THRESHOLD_PUBS} {GRIDID} publications cited by patents - by subject area")

Is there a correlation between publication citations and patents citations?

Note: if the points on a scatterplot graph produce a lower-left-to-upper-right pattern (see below), that is indicative of a positive correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be high as well, and vice versa.

[20]:
px.scatter(pubs, x="patents", y="times_cited",
           title=f"Patents citations VS Publication citations")

Where to go from here

In this Dimensions Analytics API tutorial we have seen how, starting from a GRID organization, it is possible to extract a) publications from authors associated to this organization, b) patents citing those publications (from any organization).

This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it’ll give you a few basic tools to get started building your own application. Here are some ideas for customizing this notebook:

  • Change the GRID ID to another one you are more familiar with

  • Use a different way to select publications: e.g. not using an organizations segment, but a category, a country, a funder or a combination of the many Publication filters available

  • Do a more in-depth analysis of the patents inventors: do they have other publications linking to the publications they cite in patents? Are there patterns of collaborations with the institutions they cite?



Note

The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.

../../_images/badge-dimensions-api.svg