Patent publication references, for an entire patent family

This notebook shows how to use the Dimensions Analytics API to identify all the publications referenced by patents, for all the patents that belong to the same patent family.

These are the steps:

  1. We start from a specific patent Dimensions ID and obtain its family ID

  2. Using the family ID, we query the patents API to search for all related patents and return the publication IDs they reference

  3. Finally, we query the publications API to obtain other useful publication metadata, e.g. title, publisher, journal.

These sample results can be explored in Google Sheets.


This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

!pip install dimcli tqdm -U --quiet

import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
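if 'google.colab' in sys.modules:
  # on Colab, ask for the API key interactively
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  # elsewhere, an empty key makes dimcli fall back to the local dsl.ini file
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)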
dsl = dimcli.Dsl()
Logging in..
Dimcli - Dimensions API Client (v0.9.1)
Connected to: https://app.dimensions.ai - DSL v1.31
Method: dsl.ini file

1. Search for the patent ID and return the family ID.

As a starting point, let’s take patent ID US-20210108231-A1.

patent_id = "US-20210108231-A1" #@param {type:"string"}

q_family_id = dsl.query(f"""
    search patents
    where id = "{patent_id}"
    return family_id
""")

try:
    family_id = q_family_id['family_id'][0]['id']
    print("Found family_id:",  family_id)
except (KeyError, IndexError, TypeError):
    print("No family ID found. \nFull API results:\n", str(q_family_id.json))
Returned Family_id: 1
Time: 0.55s
Found family_id: 49624232
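
Step 2 of the outline (finding all patents in the family and the publication IDs they reference) could be sketched roughly as follows. This is a minimal illustration, not the notebook's own cell: the query text is an assumed shape, and the sample records and their IDs are made up. Only the layout of the resulting `references_list` table (a `publication_ids` column plus a `size` count column) is taken from the later steps.

```python
import pandas as pd

family_id = 49624232  # value found in step 1

# Assumed query shape (not executed here). In the notebook it would be sent
# to the API, e.g. with dsl.query_iterative(query).as_dataframe()
query = f"""search patents
            where family_id = {family_id}
            return patents[id+publication_ids]"""

# Hypothetical patent records standing in for the API results
patents = pd.DataFrame([
    {"id": "PAT-A", "publication_ids": ["pub.1", "pub.2"]},
    {"id": "PAT-B", "publication_ids": ["pub.2", "pub.3"]},
    {"id": "PAT-C", "publication_ids": None},  # a patent citing no publications
])

# One row per (patent, publication) pair, then count patents per publication
references_list = (
    patents.explode("publication_ids")
    .dropna(subset=["publication_ids"])
    .groupby("publication_ids", as_index=False)
    .size()
)
print(references_list)
```

The `size` column holds the number of patents in the family that reference each publication; it is the column later renamed to `patents_citations`.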

3. Enriching the publication IDs with additional metadata

In this step we query the publications API, using the referenced Dimensions IDs extracted previously in order to retrieve further metadata about publications.

Since there can be many publications to go through, the list of IDs is chunked into smaller groups to ensure the resulting API query is never too long (more info here).

PS: change the return statement in the query template to customise the metadata returned.
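
The chunking idea can be sketched in plain Python. Here `chunk_list` is a hypothetical stand-in for the `chunks_of` helper used in the cell below, and the batch size, the IDs and the shortened query template are made up for illustration.

```python
import json

def chunk_list(items, size):
    """Yield successive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

BATCHSIZE = 2  # illustrative only; a real run would use a much larger batch
pubids = ["pub.1", "pub.2", "pub.3", "pub.4", "pub.5"]  # made-up IDs

query_template = "search publications where id in {} return publications[id+title] limit 1000"

# json.dumps renders each chunk as a double-quoted list literal for the query
queries = [query_template.format(json.dumps(chunk))
           for chunk in chunk_list(pubids, BATCHSIZE)]

for q in queries:
    print(q)
```

Each query carries at most `BATCHSIZE` IDs, so its length stays bounded however many publications are referenced.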

pubids = list(references_list['publication_ids'])

query_template = """search publications
                    where id in {}
                    return publications[id+doi+pmid+title+journal+year+publisher+type+dimensions_url]
                    limit 1000"""

# loop through all references-publications IDs in chunks and query Dimensions

print("===\nExtracting publications data ...")
results = []
VERBOSE = False # set to True to see extraction logs

for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
    query = query_template.format(json.dumps(chunk))
    data = dsl.query(query, verbose=VERBOSE)
    results += data.publications

# put the cited publication data into a dataframe

pubs_cited = pd.DataFrame.from_dict(results)
print("===\nCited Publications found: ", len(pubs_cited))

# flatten the 'journal' column because it contains nested data;
# records without a journal produce a spurious column named 0, which is dropped

temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id",
                                                            "title": "journal.title"}).drop(columns=[0], errors="ignore")
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1)
pubs_cited.head(5)


Extracting publications data ...
Cited Publications found:  363
    | publisher | dimensions_url | title | doi | type | pmid | year | id | journal.id | journal.title
0   | American Chemical Society (ACS) | https://app.dimensions.ai/details/publication/... | Molecular Mechanism and Evolution of Nuclear P... | 10.1021/acs.chemrev.7b00499 | article | 29377672 | 2018 | pub.1100683468 | NaN | NaN
1   | Cold Spring Harbor Laboratory | https://app.dimensions.ai/details/publication/... | P53 toxicity is a hurdle to CRISPR/CAS9 screen... | 10.1101/168443 | preprint | NaN | 2017 | pub.1091918134 | jour.1293558 | bioRxiv
2   | Elsevier | https://app.dimensions.ai/details/publication/... | Editing DNA Methylation in the Mammalian Genome | 10.1016/j.cell.2016.08.056 | article | 27662091 | 2016 | pub.1035697575 | jour.1019114 | Cell
3   | Impact Journals, LLC | https://app.dimensions.ai/details/publication/... | CRISPR-dCas9 mediated TET1 targeting for selec... | 10.18632/oncotarget.10234 | article | 27356740 | 2016 | pub.1017128844 | jour.1043645 | Oncotarget
4   | Springer Nature | https://app.dimensions.ai/details/publication/... | Programmable editing of a target base in genom... | 10.1038/nature17946 | article | 27096365 | 2016 | pub.1009172001 | jour.1018957 | Nature

4. Combine the publication metadata with the patent citations information

In this step we take the results of the patents query from step 2 and merge them with the publication query from step 3.

The goal is simply to retain the total count of patent citations per publication in the final dataset containing detailed publications metadata.
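
A small self-contained sketch of this merge, using made-up rows in place of the real query results (the IDs, titles and counts below are illustrative only):

```python
import pandas as pd

# Hypothetical stand-ins for the step 3 and step 2 results
pubs_cited = pd.DataFrame({
    "id": ["pub.1", "pub.2", "pub.3"],
    "title": ["Paper A", "Paper B", "Paper C"],
})
references_list = pd.DataFrame({
    "publication_ids": ["pub.1", "pub.2", "pub.4"],
    "size": [3, 7, 1],
})

# merge() defaults to an inner join: only IDs present in both tables survive
final_data = pubs_cited.merge(references_list, left_on="id", right_on="publication_ids")
final_data = final_data.rename(columns={"size": "patents_citations"})
final_data = final_data.sort_values("patents_citations", ascending=False)
print(final_data)
```

Because the join is an inner one, `pub.3` (no citation count) and `pub.4` (no metadata) both drop out. Passing `how="right"` to `merge` would instead keep every referenced publication, with empty metadata where the publications query returned nothing.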

# merge two datasets using 'publication id' as key
final_data = pubs_cited.merge(references_list, left_on='id', right_on='publication_ids')

# rename 'size' column
final_data.rename(columns = {"size" : "patents_citations"}, inplace = True)

# show top 5 cited publications
final_data.sort_values("patents_citations", ascending=False, inplace=True)
final_data.head(5)
    | publisher | dimensions_url | title | doi | type | pmid | year | id | journal.id | journal.title | publication_ids | patents_citations
134 | Springer Nature | https://app.dimensions.ai/details/publication/... | CRISPR RNA maturation by trans-encoded small R... | 10.1038/nature09886 | article | 21455174 | 2011 | pub.1030591890 | jour.1018957 | Nature | pub.1030591890 | 50
92  | American Association for the Advancement of Sc... | https://app.dimensions.ai/details/publication/... | A Programmable Dual-RNA–Guided DNA Endonucleas... | 10.1126/science.1225829 | article | 22745249 | 2012 | pub.1041850060 | jour.1346339 | Science | pub.1041850060 | 47
126 | Oxford University Press (OUP) | https://app.dimensions.ai/details/publication/... | The Streptococcus thermophilus CRISPR/Cas syst... | 10.1093/nar/gkr606 | article | 21813460 | 2011 | pub.1052438070 | jour.1018982 | Nucleic Acids Research | pub.1052438070 | 47
78  | American Association for the Advancement of Sc... | https://app.dimensions.ai/details/publication/... | RNA-Guided Human Genome Engineering via Cas9 | 10.1126/science.1232033 | article | 23287722 | 2013 | pub.1022072971 | jour.1346339 | Science | pub.1022072971 | 44
79  | American Association for the Advancement of Sc... | https://app.dimensions.ai/details/publication/... | Multiplex Genome Engineering Using CRISPR/Cas ... | 10.1126/science.1231143 | article | 23287718 | 2013 | pub.1019873131 | jour.1346339 | Science | pub.1019873131 | 44

4.1 Optional: exporting the data to google sheets

NOTE: this will work only in Google Colab, or in another Jupyter environment if you have previously enabled the required Google credentials (more info here).

export_as_gsheets(final_data)

..authorizing with google..
..creating a google sheet..


In this notebook we have shown how to use the Dimensions Analytics API to identify all the publications referenced by patents, for all the patents that belong to the same patent family.

This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it’ll give you a few basic tools to get started building your own application. For more background, see the list of fields available via the Patents API.


The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.