
Extracting Patents that cite Publications from a chosen Research Organization

This tutorial shows how to extract and analyse patent information linked to a selected research organization, using the Dimensions Analytics API.

Load libraries and log in

# @markdown # Get the API library and login
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}

# import all libraries and login
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
import os
import sys
import time
import json
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
from tqdm.notebook import tqdm as progressbar
# charts lib
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
DimCli v0.6.2.4 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)

A couple of utility functions to simplify exporting CSV files to a selected folder

# data-saving utils
DATAFOLDER = "extraction1"
if not os.path.exists(DATAFOLDER):
  !mkdir $DATAFOLDER
  print(f"==\nCreated data folder:", DATAFOLDER + "/")
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")

Choose a GRID Research Organization

For the purpose of this exercise, we are going to use grid.89170.37. Feel free to change the parameters below as you wish, e.g. by choosing another GRID organization.


GRIDID = "grid.89170.37" #@param {type:"string"}

#@markdown The start/end year of publications used to extract patents
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}


from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Open in Dimensions &#x29c9;</a>'.format(dimensions_url(GRIDID))))

#@markdown ---

1 - Prerequisite: Extracting Publications Data

By looking at the Dimensions API data model, we can see that the connection between Patents and Publications is represented by a directed arrow going from Patents to Publications: this means we should look for patent records whose publication_ids field contains references to the publications of the GRID organization we are interested in.

Hence, we need to:

* a) extract all publications linked to one (or more) GRID IDs, and
* b) use these publications to extract patents referencing those publications.
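Step b) is carried out later, but the shape of the patents query can already be sketched: given a batch of publication IDs, we filter the patents source on its publication_ids field. The helper below is a hypothetical sketch (the function name and IDs are placeholders; a real extraction would batch the thousands of IDs returned in step a):

```python
# sketch: build a DSL query for patents citing a batch of publications
# (function name and the IDs below are illustrative, not from the notebook)
def patents_citing_query(pub_ids):
    ids = ",".join(f'"{p}"' for p in pub_ids)
    return f"""search patents
        where publication_ids in [{ids}]
        return patents[basics]"""

q = patents_citing_query(["pub.1000000001", "pub.1000000002"])
print(q)
```

The resulting string can then be passed to `dsl.query()` as usual.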

# Get full list of publications linked to this organization for the selected time frame

q = f"""search publications
        where research_orgs.id="{GRIDID}"
        and year in [{YEAR_START}:{YEAR_END}]
        return publications[basics+category_for+times_cited]"""

pubs_json = dsl.query_iterative(q)
pubs = pubs_json.as_dataframe()

# save the data
save_as_csv(pubs, f"pubs_{GRIDID}")
1000 / 17204
...
17204 / 17204
Saved:  extraction1/pubs_grid.89170.37.csv

How many publications per year?

Let’s have a quick look at the publication volume per year.

px.histogram(pubs, x="year", y="id", color="type", barmode="group", title=f"Publications by year from {GRIDID}")

What are the main subject areas?

We can use the Field of Research categories information in publications to obtain a breakdown of the publications by subject areas.

This can be achieved by ‘exploding’ the category_for data into a separate table, since there can be more than one category per publication. The new categories table also retains some basic information about the publication each row relates to (e.g. journal, title, publication id), so as to make the data easier to analyse.

# ensure key exists in all rows (even if empty)
normalize_key("category_for", pubs_json.publications)
normalize_key("journal", pubs_json.publications)
# explode subjects into separate table
pubs_subjects = json_normalize(pubs_json.publications, record_path=['category_for'],
                               meta=["id", "type", ["journal", "title"], "year"],
                               errors='ignore', record_prefix='for_')
# add a new column: category name without digits for better readability
pubs_subjects['topic'] = pubs_subjects['for_name'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
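To see concretely what the explode step does, here is the same json_normalize call applied to a toy record (the field names mirror the real ones above; the data itself is made up):

```python
import pandas as pd
from pandas import json_normalize

# one publication with two FOR categories: exploding yields two rows
toy = [{
    "id": "pub.1",
    "year": 2010,
    "category_for": [
        {"name": "01 Mathematical Sciences"},
        {"name": "08 Information and Computing Sciences"},
    ],
}]
rows = json_normalize(toy, record_path=["category_for"],
                      meta=["id", "year"], record_prefix="for_")
# strip the leading FOR-code digits, as done above for readability
rows["topic"] = rows["for_name"].apply(
    lambda x: "".join(i for i in x if not i.isdigit()))
print(rows)
```

Each category becomes its own row, while the meta columns (id, year) are repeated, which is exactly what the per-subject charts below need.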

Now we can build a scatter plot that shows the amount and distribution of categories over the years.

px.scatter(pubs_subjects, x="year", y="topic", color="type",
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top publication subjects for {GRIDID} (marginal subplots = X/Y totals)")