
Patent publication references, for an entire patent family

This notebook shows how to use the Dimensions Analytics API to identify all the publications referenced by patents, for all the patents that belong to the same patent family.

These are the steps:

  1. We start from a specific patent Dimensions ID and obtain its family ID

  2. Using the family ID, we query the patents API to search for all related patents and return the publication IDs they reference


  3. Finally, we query the publications API to obtain further useful publication metadata, e.g. title, publisher, journal, etc.

These sample results can be explored in Google Sheets.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli tqdm -U --quiet

import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. Search for the patent ID and return the family ID

As a starting point, let’s take patent ID US-20210108231-A1.

[3]:
patent_id = "US-20210108231-A1" #@param {type:"string"}

q_family_id = dsl.query(f"""
    search patents
    where id = "{patent_id}"
    return family_id
""")

try:
    family_id = q_family_id['family_id'][0]['id']
    print("Found family_id:", family_id)
except Exception:
    print("No family ID found. \nFull API results:\n", str(q_family_id.json))
Returned Family_id: 1
Time: 0.60s
Found family_id: 49624232
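
2. Finding the publications referenced by the whole patent family

The original cell for this step is not reproduced here. The code below is only a minimal sketch of one way to build the references_list dataframe used in the next steps, assuming the patents' publication_ids field and a pandas groupby to count how many patents in the family cite each publication (q_family_patents and exploded are illustrative names; references_list and its size column are kept because the later cells rely on them).

[ ]:
# Sketch only: fetch every patent in the family together with the publications it
# references, then count how many family patents cite each publication.
q_family_patents = dsl.query_iterative(f"""
    search patents
    where family_id = {family_id}
    return patents[id+publication_ids]
""")

family_patents = q_family_patents.as_dataframe()

# one row per (patent, referenced publication) pair
exploded = family_patents.explode("publication_ids").dropna(subset=["publication_ids"])

# count citing patents per publication; the resulting 'size' column is renamed later on
references_list = exploded.groupby("publication_ids", as_index=False).size()
print("Referenced publications found:", len(references_list))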

3. Enriching the publication IDs with additional metadata

In this step we query the publications API, using the publication IDs extracted in the previous step, in order to retrieve further metadata about those publications.

Since there can be lots of publications to go through, the list of IDs is chunked into smaller groups to ensure the resulting API query never gets too long (more info here).

PS: change the query template's return statement to customise the metadata returned.
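
For reference, chunks_of (imported above from dimcli.utils) is the helper used below to do the chunking; a quick illustration with made-up IDs:

[ ]:
# chunks_of splits a list into consecutive batches of at most the given size
sample_ids = ["pub.1", "pub.2", "pub.3", "pub.4", "pub.5"]
list(chunks_of(sample_ids, 2))
# expected: [['pub.1', 'pub.2'], ['pub.3', 'pub.4'], ['pub.5']]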

[6]:
pubids = list(references_list['publication_ids'])


query_template = """search publications
                    where id in {}
                    return publications[id+doi+pmid+title+journal+year+publisher+type+dimensions_url]
                    limit 1000"""


#
# loop through all references-publications IDs in chunks and query Dimensions

print("===\nExtracting publications data ...")
results = []
BATCHSIZE = 300
VERBOSE = False # set to True to see extraction logs

for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
    query = query_template.format(json.dumps(chunk))
    data = dsl.query(query, verbose=VERBOSE)
    results += data.publications
    time.sleep(0.5)

#
# put the cited publication data into a dataframe

pubs_cited = pd.DataFrame.from_dict(results)
print("===\nCited Publications found: ", len(pubs_cited))



#
# flatten the 'journal' column because it contains nested data
# (drop the spurious 0 column created for rows where 'journal' is missing)

temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id",
                                                              "title": "journal.title"}).drop([0], axis=1, errors="ignore")
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1)

pubs_cited.head()

===
Extracting publications data ...
===
Cited Publications found:  363
[6]:
dimensions_url doi id pmid publisher title type year journal.id journal.title
0 https://app.dimensions.ai/details/publication/... 10.1021/acs.chemrev.7b00499 pub.1100683468 29377672 American Chemical Society (ACS) Molecular Mechanism and Evolution of Nuclear P... article 2018 NaN NaN
1 https://app.dimensions.ai/details/publication/... 10.1101/168443 pub.1091918134 NaN Cold Spring Harbor Laboratory P53 toxicity is a hurdle to CRISPR/CAS9 screen... preprint 2017 jour.1293558 bioRxiv
2 https://app.dimensions.ai/details/publication/... 10.1016/j.cell.2016.08.056 pub.1035697575 27662091 Elsevier Editing DNA Methylation in the Mammalian Genome article 2016 jour.1019114 Cell
3 https://app.dimensions.ai/details/publication/... 10.18632/oncotarget.10234 pub.1017128844 27356740 Impact Journals, LLC CRISPR-dCas9 mediated TET1 targeting for selec... article 2016 jour.1043645 Oncotarget
4 https://app.dimensions.ai/details/publication/... 10.1038/nature17946 pub.1009172001 27096365 Springer Nature Programmable editing of a target base in genom... article 2016 jour.1018957 Nature

4. Combine the publication metadata with the patent citations information

In this step we take the results of the patents query from step 2 and merge them with the results of the publications query from step 3.

The goal is simply to retain the total count of patent citations per publication in the final dataset containing the detailed publication metadata.

[7]:
# merge two datasets using 'publication id' as key
final_data = pubs_cited.merge(references_list, left_on='id', right_on='publication_ids')

# rename 'size' column
final_data.rename(columns = {"size" : "patents_citations"}, inplace = True)

# show top 5 cited publications
final_data.sort_values("patents_citations", ascending=False, inplace=True)
final_data.head(5)
[7]:
dimensions_url doi id pmid publisher title type year journal.id journal.title publication_ids patents_citations
134 https://app.dimensions.ai/details/publication/... 10.1038/nature09886 pub.1030591890 21455174 Springer Nature CRISPR RNA maturation by trans-encoded small R... article 2011 jour.1018957 Nature pub.1030591890 50
92 https://app.dimensions.ai/details/publication/... 10.1126/science.1225829 pub.1041850060 22745249 American Association for the Advancement of Sc... A Programmable Dual-RNA–Guided DNA Endonucleas... article 2012 jour.1346339 Science pub.1041850060 47
126 https://app.dimensions.ai/details/publication/... 10.1093/nar/gkr606 pub.1052438070 21813460 Oxford University Press (OUP) The Streptococcus thermophilus CRISPR/Cas syst... article 2011 jour.1018982 Nucleic Acids Research pub.1052438070 47
78 https://app.dimensions.ai/details/publication/... 10.1126/science.1232033 pub.1022072971 23287722 American Association for the Advancement of Sc... RNA-Guided Human Genome Engineering via Cas9 article 2013 jour.1346339 Science pub.1022072971 44
79 https://app.dimensions.ai/details/publication/... 10.1126/science.1231143 pub.1019873131 23287718 American Association for the Advancement of Sc... Multiplex Genome Engineering Using CRISPR/Cas ... article 2013 jour.1346339 Science pub.1019873131 44

4.1 Optional: exporting the data to Google Sheets

NOTE: this will work only on Google Colab, or in other Jupyter environments if you have previously enabled the required Google credentials (more info here).

[ ]:
export_as_gsheets(final_data)

Conclusions

In this notebook we have shown how to use the Dimensions Analytics API to identify all the publications referenced by patents, for all the patents that belong to the same patent family.

This only scratches the surface of the possible applications of publication-patent linkage data, but hopefully it'll give you a few basic tools to get started building your own application. For more background, see the list of fields available via the Patents API.
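
As a purely illustrative way to list those fields programmatically, the DSL describe command can also be sent through Dimcli (the fields key is assumed here from the describe response format):

[ ]:
# inspect the patents source schema via the DSL 'describe' command
schema = dsl.query("describe source patents", verbose=False)
print(sorted(schema.json.get("fields", {}).keys()))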



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
