Patent publication references, for an entire patent family¶
This notebook shows how to use the Dimensions Analytics API to identify all the publications referenced by the patents belonging to a single patent family.
These are the steps:
We start from a specific patent Dimensions ID and obtain its family ID
Using the family ID, we query the patents API to search for all related patents and return the publication IDs they reference
Finally, we query the publications API to obtain other useful publication metadata, e.g. title, publisher, journal, etc.
These sample results can be explored in Google Sheets.
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli tqdm -U --quiet
import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
#
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
1. Search for the patent ID and return the family ID.¶
As a starting point, let’s take patent ID US-20210108231-A1.
View this patent record in Dimensions: Methods and compositions for rna-directed target dna modification and for rna-directed modulation of transcription
[3]:
patent_id = "US-20210108231-A1" #@param {type:"string"}
q_family_id = dsl.query(f"""
search patents
where id = "{patent_id}"
return family_id
""")
try:
    family_id = q_family_id['family_id'][0]['id']
    print("Found family_id:", family_id)
except Exception:
    print("No family ID found.\nFull API results:\n", str(q_family_id.json))
Returned Family_id: 1
Time: 0.60s
Found family_id: 49624232
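2. Extract the publication references of all patents in the family¶
Using the family ID, we can query the patents API for every patent in the family and collect the publication IDs each one references. A minimal sketch of this step (reusing the `publication_ids` patents field that also appears in the next section), building the `references_list` table consumed below:
[ ]:
# fetch every patent in the family, together with the publications it references
q_patents = dsl.query_iterative(f"""
search patents
    where family_id = {family_id}
return patents[id+publication_ids]
""")
patents = q_patents.as_dataframe()
#
# one row per (patent, referenced publication) pair
exploded = patents.explode('publication_ids').dropna(subset=['publication_ids'])
#
# count how many patents in the family cite each publication:
# this produces the 'publication_ids' and 'size' columns used below
references_list = exploded.groupby('publication_ids', as_index=False).size()
print("Unique publications referenced:", len(references_list))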
3. Enriching the publication IDs with additional metadata¶
In this step we query the publications API, using the referenced publication IDs extracted previously, in order to retrieve further metadata about those publications.
Since there can be lots of publications to go through, the IDs list is chunked into smaller groups, to ensure the resulting API query never gets too long (more info here); the `chunks_of` helper used for this is illustrated below.
PS: Change the query template’s return statement to customise the metadata returned.
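The chunking relies on the `chunks_of` helper from `dimcli.utils`, which yields consecutive slices of at most a given size from a list. A quick illustration:
[ ]:
from dimcli.utils import chunks_of
# split a list of 7 elements into slices of at most 3 elements
list(chunks_of(list(range(7)), 3))
# -> [[0, 1, 2], [3, 4, 5], [6]]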
[6]:
pubids = list(references_list['publication_ids'])
query_template = """search publications
where id in {}
return publications[id+doi+pmid+title+journal+year+publisher+type+dimensions_url]
limit 1000"""
#
# loop through all references-publications IDs in chunks and query Dimensions
print("===\nExtracting publications data ...")
results = []
BATCHSIZE = 300
VERBOSE = False # set to True to see extraction logs
for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
query = query_template.format(json.dumps(chunk))
data = dsl.query(query, verbose=VERBOSE)
results += data.publications
time.sleep(0.5)
#
# put the cited publication data into a dataframe
pubs_cited = pd.DataFrame.from_dict(results)
print("===\nCited Publications found: ", len(pubs_cited))
#
# flatten the nested 'journal' column into separate columns;
# drop the spurious '0' column created for records with no journal data
temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id",
                                                              "title": "journal.title"}).drop(columns=[0], errors="ignore")
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1)
pubs_cited.head()
===
Extracting publications data ...
===
Cited Publications found: 363
[6]:
 | dimensions_url | doi | id | pmid | publisher | title | type | year | journal.id | journal.title
---|---|---|---|---|---|---|---|---|---|---
0 | https://app.dimensions.ai/details/publication/... | 10.1021/acs.chemrev.7b00499 | pub.1100683468 | 29377672 | American Chemical Society (ACS) | Molecular Mechanism and Evolution of Nuclear P... | article | 2018 | NaN | NaN |
1 | https://app.dimensions.ai/details/publication/... | 10.1101/168443 | pub.1091918134 | NaN | Cold Spring Harbor Laboratory | P53 toxicity is a hurdle to CRISPR/CAS9 screen... | preprint | 2017 | jour.1293558 | bioRxiv |
2 | https://app.dimensions.ai/details/publication/... | 10.1016/j.cell.2016.08.056 | pub.1035697575 | 27662091 | Elsevier | Editing DNA Methylation in the Mammalian Genome | article | 2016 | jour.1019114 | Cell |
3 | https://app.dimensions.ai/details/publication/... | 10.18632/oncotarget.10234 | pub.1017128844 | 27356740 | Impact Journals, LLC | CRISPR-dCas9 mediated TET1 targeting for selec... | article | 2016 | jour.1043645 | Oncotarget |
4 | https://app.dimensions.ai/details/publication/... | 10.1038/nature17946 | pub.1009172001 | 27096365 | Springer Nature | Programmable editing of a target base in genom... | article | 2016 | jour.1018957 | Nature |
4. Combine the publication metadata with the patent citations information¶
In this step we take the results of the patents query from step 2 and merge them with the publication metadata from step 3.
The goal is simply to retain the total count of patent citations per publication in the final dataset containing detailed publication metadata.
[7]:
# merge two datasets using 'publication id' as key
final_data = pubs_cited.merge(references_list, left_on='id', right_on='publication_ids')
# rename 'size' column
final_data.rename(columns={"size": "patents_citations"}, inplace=True)
# show top 5 cited publications
final_data.sort_values("patents_citations", ascending=False, inplace=True)
final_data.head(5)
[7]:
 | dimensions_url | doi | id | pmid | publisher | title | type | year | journal.id | journal.title | publication_ids | patents_citations
---|---|---|---|---|---|---|---|---|---|---|---|---
134 | https://app.dimensions.ai/details/publication/... | 10.1038/nature09886 | pub.1030591890 | 21455174 | Springer Nature | CRISPR RNA maturation by trans-encoded small R... | article | 2011 | jour.1018957 | Nature | pub.1030591890 | 50 |
92 | https://app.dimensions.ai/details/publication/... | 10.1126/science.1225829 | pub.1041850060 | 22745249 | American Association for the Advancement of Sc... | A Programmable Dual-RNA–Guided DNA Endonucleas... | article | 2012 | jour.1346339 | Science | pub.1041850060 | 47 |
126 | https://app.dimensions.ai/details/publication/... | 10.1093/nar/gkr606 | pub.1052438070 | 21813460 | Oxford University Press (OUP) | The Streptococcus thermophilus CRISPR/Cas syst... | article | 2011 | jour.1018982 | Nucleic Acids Research | pub.1052438070 | 47 |
78 | https://app.dimensions.ai/details/publication/... | 10.1126/science.1232033 | pub.1022072971 | 23287722 | American Association for the Advancement of Sc... | RNA-Guided Human Genome Engineering via Cas9 | article | 2013 | jour.1346339 | Science | pub.1022072971 | 44 |
79 | https://app.dimensions.ai/details/publication/... | 10.1126/science.1231143 | pub.1019873131 | 23287718 | American Association for the Advancement of Sc... | Multiplex Genome Engineering Using CRISPR/Cas ... | article | 2013 | jour.1346339 | Science | pub.1019873131 | 44 |
4.1 Optional: exporting the data to Google Sheets¶
NOTE: this will work only on Google Colab, or in other Jupyter environments if you have previously enabled the required Google credentials (more info here).
[ ]:
export_as_gsheets(final_data)
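If you are not running on Google Colab, a simple alternative is to save the final dataframe locally, e.g. as a CSV file (the file name below is just an example):
[ ]:
# plain CSV export, as an alternative to Google Sheets
final_data.to_csv("patent_family_cited_publications.csv", index=False)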
Conclusions¶
In this notebook we have shown how to use the Dimensions Analytics API to identify all the publications referenced by the patents belonging to a single patent family.
This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it’ll give you a few basic tools to get started building your own application. For more background, see the list of fields available via the Patents API.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.