Patent publication references, for an entire patent family¶
This notebook shows how to use the Dimensions Analytics API to identify all the publications referenced by the patents belonging to a single patent family.
These are the steps:
We start from a specific patent Dimensions ID and obtain its family ID
Using the family ID, we query the patents API to search for all related patents and return the publication IDs they reference
Finally, we query the publications API to obtain other useful publication metadata, e.g. title, publisher, journal, etc.
These sample results can be explored in Google Sheets.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli tqdm -U --quiet
import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
1. Search for the patent ID and return the family ID.¶
As a starting point, let’s take patent ID US-20210108231-A1.
View this patent record in Dimensions: Methods and compositions for rna-directed target dna modification and for rna-directed modulation of transcription
[3]:
patent_id = "US-20210108231-A1" #@param {type:"string"}
q_family_id = dsl.query(f"""
search patents
where id = "{patent_id}"
return family_id
""")
try:
    family_id = q_family_id['family_id'][0]['id']
    print("Found family_id:", family_id)
except Exception:
    print("No family ID found.\nFull API results:\n", str(q_family_id.json))
Returned Family_id: 1
Time: 0.60s
Found family_id: 49624232
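2. Extract the publication references of all patents in the family¶
Using the family ID, we can query the patents API for every patent in the family and collect the publication IDs each one references. A minimal sketch of this step (reusing the `publication_ids` patents field that also appears in the next section), building the `references_list` table consumed below:
[ ]:
# fetch every patent in the family, together with the publications it references
q_patents = dsl.query_iterative(f"""
search patents
    where family_id = {family_id}
return patents[id+publication_ids]
""")
patents = q_patents.as_dataframe()
#
# one row per (patent, referenced publication) pair
exploded = patents.explode('publication_ids').dropna(subset=['publication_ids'])
#
# count how many patents in the family cite each publication:
# this produces the 'publication_ids' and 'size' columns used below
references_list = exploded.groupby('publication_ids', as_index=False).size()
print("Unique publications referenced:", len(references_list))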
3. Enriching the publication IDs with additional metadata¶
In this step we query the publications API, using the referenced publication IDs extracted previously, in order to retrieve further metadata about those publications.
Since there can be lots of publications to go through, the IDs list is chunked into smaller groups, to ensure the resulting API query never gets too long (more info here); the `chunks_of` helper used for this is illustrated below.
PS: Change the query template’s return statement to customise the metadata returned.
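The chunking relies on the `chunks_of` helper from `dimcli.utils`, which yields consecutive slices of at most a given size from a list. A quick illustration:
[ ]:
from dimcli.utils import chunks_of
# split a list of 7 elements into slices of at most 3 elements
list(chunks_of(list(range(7)), 3))
# -> [[0, 1, 2], [3, 4, 5], [6]]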
[6]:
pubids = list(references_list['publication_ids'])
query_template = """search publications
where id in {}
return publications[id+doi+pmid+title+journal+year+publisher+type+dimensions_url]
limit 1000"""
#
# loop through all references-publications IDs in chunks and query Dimensions
print("===\nExtracting publications data ...")
results = []
BATCHSIZE = 300
VERBOSE = False # set to True to see extraction logs
for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
query = query_template.format(json.dumps(chunk))
data = dsl.query(query, verbose=VERBOSE)
results += data.publications
time.sleep(0.5)
#
# put the cited publication data into a dataframe
pubs_cited = pd.DataFrame.from_dict(results)
print("===\nCited Publications found: ", len(pubs_cited))
#
# flatten the nested 'journal' column into separate columns;
# drop the spurious '0' column created for records with no journal data
temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id",
                                                              "title": "journal.title"}).drop(columns=[0], errors="ignore")
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1)
pubs_cited.head()
===
Extracting publications data ...
===
Cited Publications found: 363
[6]:
 | dimensions_url | doi | id | pmid | publisher | title | type | year | journal.id | journal.title
---|---|---|---|---|---|---|---|---|---|---
0 | https://app.dimensions.ai/details/publication/... | 10.1021/acs.chemrev.7b00499 | pub.1100683468 | 29377672 | American Chemical Society (ACS) | Molecular Mechanism and Evolution of Nuclear P... | article | 2018 | NaN | NaN |
1 | https://app.dimensions.ai/details/publication/... | 10.1101/168443 | pub.1091918134 | NaN | Cold Spring Harbor Laboratory | P53 toxicity is a hurdle to CRISPR/CAS9 screen... | preprint | 2017 | jour.1293558 | bioRxiv |
2 | https://app.dimensions.ai/details/publication/... | 10.1016/j.cell.2016.08.056 | pub.1035697575 | 27662091 | Elsevier | Editing DNA Methylation in the Mammalian Genome | article | 2016 | jour.1019114 | Cell |
3 | https://app.dimensions.ai/details/publication/... | 10.18632/oncotarget.10234 | pub.1017128844 | 27356740 | Impact Journals, LLC | CRISPR-dCas9 mediated TET1 targeting for selec... | article | 2016 | jour.1043645 | Oncotarget |
4 | https://app.dimensions.ai/details/publication/... | 10.1038/nature17946 | pub.1009172001 | 27096365 | Springer Nature | Programmable editing of a target base in genom... | article | 2016 | jour.1018957 | Nature |
4. Combine the publication metadata with the patent citations information¶
In this step we take the results of the patents query from step 2 and merge them with the publication metadata from step 3.
The goal is simply to retain the total count of patent citations per publication in the final dataset containing detailed publication metadata.
[7]:
# merge two datasets using 'publication id' as key
final_data = pubs_cited.merge(references_list, left_on='id', right_on='publication_ids')
# rename 'size' column
final_data.rename(columns={"size": "patents_citations"}, inplace=True)
# show top 5 cited publications
final_data.sort_values("patents_citations", ascending=False, inplace=True)
final_data.head(5)
[7]:
 | dimensions_url | doi | id | pmid | publisher | title | type | year | journal.id | journal.title | publication_ids | patents_citations
---|---|---|---|---|---|---|---|---|---|---|---|---
134 | https://app.dimensions.ai/details/publication/... | 10.1038/nature09886 | pub.1030591890 | 21455174 | Springer Nature | CRISPR RNA maturation by trans-encoded small R... | article | 2011 | jour.1018957 | Nature | pub.1030591890 | 50 |
92 | https://app.dimensions.ai/details/publication/... | 10.1126/science.1225829 | pub.1041850060 | 22745249 | American Association for the Advancement of Sc... | A Programmable Dual-RNA–Guided DNA Endonucleas... | article | 2012 | jour.1346339 | Science | pub.1041850060 | 47 |
126 | https://app.dimensions.ai/details/publication/... | 10.1093/nar/gkr606 | pub.1052438070 | 21813460 | Oxford University Press (OUP) | The Streptococcus thermophilus CRISPR/Cas syst... | article | 2011 | jour.1018982 | Nucleic Acids Research | pub.1052438070 | 47 |
78 | https://app.dimensions.ai/details/publication/... | 10.1126/science.1232033 | pub.1022072971 | 23287722 | American Association for the Advancement of Sc... | RNA-Guided Human Genome Engineering via Cas9 | article | 2013 | jour.1346339 | Science | pub.1022072971 | 44 |
79 | https://app.dimensions.ai/details/publication/... | 10.1126/science.1231143 | pub.1019873131 | 23287718 | American Association for the Advancement of Sc... | Multiplex Genome Engineering Using CRISPR/Cas ... | article | 2013 | jour.1346339 | Science | pub.1019873131 | 44 |
4.1 Optional: exporting the data to Google Sheets¶
NOTE: this will work only on Google Colab, or in other Jupyter environments if you have previously enabled the required Google credentials (more info here).
[ ]:
export_as_gsheets(final_data)
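If you are not running on Google Colab, a simple alternative is to save the final dataframe locally, e.g. as a CSV file (the file name below is just an example):
[ ]:
# plain CSV export, as an alternative to Google Sheets
final_data.to_csv("patent_family_cited_publications.csv", index=False)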
Conclusions¶
In this notebook we have shown how to use the Dimensions Analytics API to identify all the publications referenced by the patents belonging to a single patent family.
This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it’ll give you a few basic tools to get started building your own application. For more background, see the list of fields available via the Patents API.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.