Rejected Article tracker¶
This Python notebook shows how publishers can use the Dimensions Analytics API to identify whether articles they chose not to publish were ultimately published somewhere else.
In this notebook we will:
1. Import a .csv file containing rejected articles
2. Search for publications similar to the rejected articles
3. Measure the strength of the matches and provide ideas for validation
[ ]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[1]:
# install extra dependencies used in this notebook
!pip install dimcli pandasql levenshtein -U --quiet
[2]:
import dimcli
from dimcli.utils import *
import json, sys
import requests
import pandas as pd
import numpy as np
from pandasql import sqldf
import pandasql as ps
from uuid import uuid4
from tqdm.notebook import tqdm
import string
#
pd.set_option('display.max_columns', None)
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.4)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.10
Method: dsl.ini file
1. Get an example data set¶
For this tutorial, we are going to use a sample data set of preprints and pretend that the preprints are articles we have rejected. This makes a good proof of concept for finding a similar article that has been published in a peer-reviewed journal: preprints often reappear in journals and might have subtly different titles or abstracts.
In this simplified example, we’ll just use the author names and titles for matching and we’ll add a unique (made up) submission ID, as real data is likely to have this. We’ll use the preprint publishing date as our rejection date.
Here is the query to get our example data set as a pandas data frame, and some code to make it look more like a data set of rejected articles. You don’t necessarily need to understand this part: assuming you have your own data, you just need to know what the table looks like at the end (it is shown below the cell).
[3]:
rejected_publications = []
preprints = dsl.query(
# This is quite a specific search for preprints published on 2020-01-22
"""
search publications
where type = "preprint" and date = "2020-01-22" and abstract is not empty
return publications[date+title+abstract+authors]
limit 10
"""
)
for p in preprints.json["publications"]:
    # This will be a row of our data:
    rejected_article_data_row = {
        "rejected_date": None,  # Initialising the rows with null values
        "first_author": None,
        "title": None,
        "abstract": None
    }
    rejected_article_data_row['rejected_date'] = p['date']
    rejected_article_data_row['title'] = p['title']
    rejected_article_data_row['abstract'] = p['abstract']
    for order, a in enumerate(p["authors"]):
        if order == 0:  # i.e. first author
            rejected_article_data_row['first_author'] = a['last_name']
    rejected_publications.append(rejected_article_data_row)
rejected_publication_data = pd.DataFrame(rejected_publications)
# generate a unique ID for each row to keep things tidy
rejected_publication_data['submission_id'] = [
    str(uuid4()) for _ in range(len(rejected_publication_data))
]
rejected_publication_data
Returned Publications: 10 (total = 730)
Time: 0.61s
WARNINGS [1]
Field current_organization_id of the authors field is deprecated and will be removed in the next major release.
[3]:
| | rejected_date | first_author | title | abstract | submission_id |
---|---|---|---|---|---|
0 | 2020-01-22 | Kong | Predicting Prolonged Length of Hospital Stay f... | <sec>\n BACKGROUND\n ... | d9239d95-c10f-48df-9092-08851146312b |
1 | 2020-01-22 | Bowman | OSF Prereg Template | <p>Preregistration is the act of submitting a ... | fb127b82-6fc7-47fe-a26f-2577864992c8 |
2 | 2020-01-22 | Di Sia | On the Concept of Time in everyday Life and be... | <p>In this paper I consider the concept of tim... | 82973136-b462-4274-99cb-4d46548acf9c |
3 | 2020-01-22 | Di Sia | Birth and development of quantum physics: a tr... | <p>The last century has been a period of extre... | fe794675-8f8b-4d72-a2b6-e749d094d9d1 |
4 | 2020-01-22 | Bedoya | Fabricación de capas antirreflejantes y absorb... | <p>Se prepararon películas delgadas de SiO2 en... | 99c9217d-11c2-4457-83cf-bbc8354e2731 |
5 | 2020-01-22 | Coretta | Open Science in phonetics and phonology | <p>Open Science is a movement that stresses th... | 74eb9ad6-151d-4ee0-9da8-4b28c1eb6bd9 |
6 | 2020-01-22 | Wekke | Merumuskan Masalah Penelitian dengan Metode MAIL | <p>Ringkasan kuliah di pascasarjana STAIN Soro... | 8c833b8a-f388-4083-83c4-69e63f782f60 |
7 | 2020-01-22 | Hernández-Caballero | Epigenética en cáncer | <p>Las células contienen información determina... | cb4e7fdd-6279-466e-8281-1169b6e13861 |
8 | 2020-01-22 | Joyce | Scientific Racism 2.0 (SR2.0): An erroneous ar... | <p>SR2.0 refers to a prominent argument made b... | e4ecee6a-1532-4ab9-a6e5-145a6ee1438d |
9 | 2020-01-22 | Sinar | Functional Features of Forensic Corruption Cas... | <p>This study examines the multimodal use of l... | d9d7af18-32cd-4c39-9704-b0e222bc8b06 |
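If you are starting from your own data rather than the sample preprints, the rows need the same columns as the table above. The following is a minimal sketch of loading them from a CSV file using only the standard library. The function name `load_rejected_articles` and the assumption that your CSV headers already match the column names above are illustrative, not part of the original notebook.

```python
import csv
import io
from uuid import uuid4

# Column names taken from the table above; your own CSV headers are
# assumed (hypothetically) to match them exactly.
REQUIRED_COLUMNS = ("rejected_date", "first_author", "title", "abstract")


def load_rejected_articles(handle):
    """Read rejected-article rows from an open CSV file handle."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    rows = []
    for record in reader:
        row = {c: (record[c] or "").strip() for c in REQUIRED_COLUMNS}
        # Keep an existing submission ID, otherwise make one up as the tutorial does.
        row["submission_id"] = record.get("submission_id") or str(uuid4())
        rows.append(row)
    return rows


# In practice you would pass open("your_file.csv", newline="", encoding="utf-8");
# an in-memory file keeps this sketch self-contained.
sample_csv = io.StringIO(
    "rejected_date,first_author,title,abstract\n"
    "2020-01-22,Coretta,Open Science in phonetics and phonology,Open Science is a movement\n"
)
rows = load_rejected_articles(sample_csv)
print(rows[0]["first_author"], "|", rows[0]["title"])
```

Passing the resulting list of dictionaries to `pd.DataFrame(rows)` gives a data frame with the same columns as `rejected_publication_data`.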
2. Define the search template¶
Python joins adjacent string literals inside parentheses into a single string, so we have written the query out piece by piece below so that we can add comments to it. This format isn’t necessary, but hopefully it’s helpful!
[4]:
template = (
    'search publications '
    'in title_abstract_only '  # Search the title and abstract only
    'for "{title}" '  # Stop words will be automatically excluded
    'where date > "{rejected_date}" '
    'and ('
    'authors = "{first_author}"'
    # The line below gives an example of how you could also search for
    # the surname of the corresponding author if you have it:
    # ' or authors = "{corresponding_author}"'
    ') '
    'return publications['
    'date'  # Published date
    '+'
    'doi'  # DOI of the published article
    '+'
    'title'  # Title of the published article
    '+'
    'abstract'  # Abstract of the published article
    '] '
    'limit 1'  # Get the most relevant result only
)
template
[4]:
'search publications in title_abstract_only for "{title}" where date > "{rejected_date}" and (authors = "{first_author}") return publications[date+doi+title+abstract] limit 1'
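To see what a finished query looks like, the template can be filled in with one row of data. This sketch repeats the template string shown above and uses values from row 5 of the sample table; the shortened abstract and the `submission_id` placeholder are made up for illustration. Note that `str.format` ignores keys the template does not use.

```python
# Same text as the template output shown above.
template = (
    'search publications in title_abstract_only for "{title}" '
    'where date > "{rejected_date}" and (authors = "{first_author}") '
    'return publications[date+doi+title+abstract] limit 1'
)

# One row of the rejected-articles table as a dictionary.
row = {
    "rejected_date": "2020-01-22",
    "first_author": "Coretta",
    "title": "Open Science in phonetics and phonology",
    "abstract": "Open Science is a movement",  # shortened, illustrative only
    "submission_id": "example-id",  # hypothetical placeholder
}

# Extra keys (abstract, submission_id) are simply ignored by format().
query = template.format(**row)
print(query)
```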
3. Iteratively Query the Dimensions API for the rejected articles¶
[5]:
def no_punctuation(s: str) -> str:
    """
    Remove punctuation from a python string
    """
    return s.translate(str.maketrans('', '', string.punctuation))
# We'll store all our results in this list as we iterate, then join them together at the end...
results = []
# For each row in the data set as a python dictionary:
for row in tqdm(rejected_publication_data.to_dict(orient="records")):
    row['title'] = no_punctuation(row['title'])
    query = template.format(**row)
    best = dsl.query(query, verbose=False).as_dataframe()
    best['submission_id'] = row['submission_id']
    results.append(best)
# Join results together
output = pd.concat(results)
output.head() # .head() shows just a few rows
[5]:
| | submission_id | title | abstract | date | doi |
---|---|---|---|---|---|
0 | 82973136-b462-4274-99cb-4d46548acf9c | On the Concept of Time in Everyday Life and be... | In this paper I consider the concept of time i... | 2021-01-01 | 10.23880/eoij-16000268 |
0 | d9d7af18-32cd-4c39-9704-b0e222bc8b06 | Functional Features of Forensic Corruption Cas... | <p>Functional Features of Forensic Corruption ... | 2020-02-28 | 10.31228/osf.io/m3xa6 |
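The `no_punctuation` helper matters because a double quote inside a title would otherwise end the quoted search phrase early. The sketch below shows what it does to a made-up title, next to a hypothetical variant, `punctuation_to_spaces`, that replaces punctuation with spaces so that hyphenated words are split rather than fused. Which behaviour gives better search results is an open question; this only illustrates the difference.

```python
import string


def no_punctuation(s: str) -> str:
    """Delete ASCII punctuation, as in the notebook cell above."""
    return s.translate(str.maketrans('', '', string.punctuation))


def punctuation_to_spaces(s: str) -> str:
    """Hypothetical variant: turn punctuation into spaces, then tidy whitespace."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(s.translate(table).split())


title = 'Time-dependent models (v2.0): a "re-analysis"'  # made-up example title
print(no_punctuation(title))         # Timedependent models v20 a reanalysis
print(punctuation_to_spaces(title))  # Time dependent models v2 0 a re analysis
```

Both functions only touch the ASCII characters in `string.punctuation`, so typographic quotes and other non-ASCII punctuation pass through unchanged.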
4. Join together the input and output data¶
[6]:
merged_results = pd.merge(
    rejected_publication_data,
    output,
    left_on='submission_id',
    right_on='submission_id',
    how='left')
merged_results.head()
[6]:
| | rejected_date | first_author | title_x | abstract_x | submission_id | title_y | abstract_y | date | doi |
---|---|---|---|---|---|---|---|---|---|
0 | 2020-01-22 | Kong | Predicting Prolonged Length of Hospital Stay f... | <sec>\n BACKGROUND\n ... | d9239d95-c10f-48df-9092-08851146312b | NaN | NaN | NaN | NaN |
1 | 2020-01-22 | Bowman | OSF Prereg Template | <p>Preregistration is the act of submitting a ... | fb127b82-6fc7-47fe-a26f-2577864992c8 | NaN | NaN | NaN | NaN |
2 | 2020-01-22 | Di Sia | On the Concept of Time in everyday Life and be... | <p>In this paper I consider the concept of tim... | 82973136-b462-4274-99cb-4d46548acf9c | On the Concept of Time in Everyday Life and be... | In this paper I consider the concept of time i... | 2021-01-01 | 10.23880/eoij-16000268 |
3 | 2020-01-22 | Di Sia | Birth and development of quantum physics: a tr... | <p>The last century has been a period of extre... | fe794675-8f8b-4d72-a2b6-e749d094d9d1 | NaN | NaN | NaN | NaN |
4 | 2020-01-22 | Bedoya | Fabricación de capas antirreflejantes y absorb... | <p>Se prepararon películas delgadas de SiO2 en... | 99c9217d-11c2-4457-83cf-bbc8354e2731 | NaN | NaN | NaN | NaN |
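In the merged table, the rejected abstracts (`abstract_x`) carry markup such as `<p>` and `<sec>`, while some matched abstracts (`abstract_y`) do not. If you compare abstracts as text, that markup may lower the similarity for reasons unrelated to the content. The following is a minimal sketch of a clean-up step; `strip_markup` is a hypothetical helper, not part of the original notebook, and a regular expression is only a rough way to remove tags.

```python
import html
import re

# Matches anything that looks like an opening or closing tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_markup(text) -> str:
    """Remove tags, decode HTML entities and collapse whitespace."""
    if not isinstance(text, str):  # e.g. None or NaN for unmatched rows
        return ""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(html.unescape(without_tags).split())


sample = "<sec>\n  <title>BACKGROUND</title>\n  <p>Length of stay &amp; cost</p></sec>"
print(strip_markup(sample))  # BACKGROUND Length of stay & cost
```

It could be applied to both abstract columns before scoring, for example with `merged_results['abstract_x'].map(strip_markup)`.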
5. Add Matching Score¶
We have found some publications that might match our rejected articles. Now we need to score them to see whether they are good matches.
In this case we’ll measure the edit distance between the abstracts. The most commonly used edit distance between strings is Levenshtein distance, which is nicely implemented in Python in the Levenshtein package.
The Levenshtein package has a function “ratio” which uses Levenshtein distance to get a similarity (not distance) score between 0 (dissimilar) and 1 (identical). We will use this to compare the abstracts converted to lowercase.
Sorting the results by score descending (from highest to lowest) we can see that one match stands out, with a score of about 0.71. If we wanted to make the matching more automatic, we could choose to filter out everything with a score below a chosen threshold. The threshold needs tuning: a cut-off of e.g. 0.75 would exclude even the best match in this example.
[7]:
from Levenshtein import ratio
def similarity(string1: str, string2: str) -> float:
    """
    Case-insensitive similarity score made by subtracting the normalised
    Levenshtein distance from 1.
    """
    if pd.isna(string1) or pd.isna(string2):
        return 0.
    else:
        return ratio(string1.lower(), string2.lower())
print(similarity('The cat sat on the mat', 'The dog sat on the frog'))
print(similarity('The cat sat on the mat', 'The mat sat on the cat'))
0.7111111111111111
0.9090909090909091
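To make these two numbers less of a black box, here is a pure-Python sketch that reproduces them. It assumes that `ratio` is the normalised insertion/deletion ("indel") similarity, in which a substitution counts as one deletion plus one insertion; that is equivalent to twice the length of the longest common subsequence divided by the combined length of the two strings. The function names `lcs_length` and `indel_ratio` are hypothetical, and the Levenshtein package remains the practical choice for real data.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * lcs_length(a, b) / total


first = indel_ratio('the cat sat on the mat', 'the dog sat on the frog')
second = indel_ratio('the cat sat on the mat', 'the mat sat on the cat')
print(first)   # 32/45, about 0.711
print(second)  # 40/44, about 0.909
```

In the first pair the shared characters are "the " and " sat on the " (16 in total, out of 22 + 23), and in the second pair only the "c" and "m" of "cat" and "mat" differ (20 shared, out of 22 + 22).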
[8]:
merged_results['score'] = merged_results.apply(
    lambda row: similarity(row['abstract_x'], row['abstract_y']),
    axis=1
)
merged_results = merged_results.sort_values("score", ascending=False)
final_output = merged_results[[
    'submission_id',
    'rejected_date',
    'title_x',
    'title_y',
    'abstract_x',
    'abstract_y',
    'doi',
    'score'
]]
final_output.columns = [
    'submission_id',
    'rejected_date',
    'original_title',
    'published_title',
    'abstract_x',
    'abstract_y',
    'doi',
    'score'
]
final_output
[8]:
| | submission_id | rejected_date | original_title | published_title | abstract_x | abstract_y | doi | score |
---|---|---|---|---|---|---|---|---|
2 | 82973136-b462-4274-99cb-4d46548acf9c | 2020-01-22 | On the Concept of Time in everyday Life and be... | On the Concept of Time in Everyday Life and be... | <p>In this paper I consider the concept of tim... | In this paper I consider the concept of time i... | 10.23880/eoij-16000268 | 0.705539 |
9 | d9d7af18-32cd-4c39-9704-b0e222bc8b06 | 2020-01-22 | Functional Features of Forensic Corruption Cas... | Functional Features of Forensic Corruption Cas... | <p>This study examines the multimodal use of l... | <p>Functional Features of Forensic Corruption ... | 10.31228/osf.io/m3xa6 | 0.156177 |
0 | d9239d95-c10f-48df-9092-08851146312b | 2020-01-22 | Predicting Prolonged Length of Hospital Stay f... | NaN | <sec>\n BACKGROUND\n ... | NaN | NaN | 0.000000 |
1 | fb127b82-6fc7-47fe-a26f-2577864992c8 | 2020-01-22 | OSF Prereg Template | NaN | <p>Preregistration is the act of submitting a ... | NaN | NaN | 0.000000 |
3 | fe794675-8f8b-4d72-a2b6-e749d094d9d1 | 2020-01-22 | Birth and development of quantum physics: a tr... | NaN | <p>The last century has been a period of extre... | NaN | NaN | 0.000000 |
4 | 99c9217d-11c2-4457-83cf-bbc8354e2731 | 2020-01-22 | Fabricación de capas antirreflejantes y absorb... | NaN | <p>Se prepararon películas delgadas de SiO2 en... | NaN | NaN | 0.000000 |
5 | 74eb9ad6-151d-4ee0-9da8-4b28c1eb6bd9 | 2020-01-22 | Open Science in phonetics and phonology | NaN | <p>Open Science is a movement that stresses th... | NaN | NaN | 0.000000 |
6 | 8c833b8a-f388-4083-83c4-69e63f782f60 | 2020-01-22 | Merumuskan Masalah Penelitian dengan Metode MAIL | NaN | <p>Ringkasan kuliah di pascasarjana STAIN Soro... | NaN | NaN | 0.000000 |
7 | cb4e7fdd-6279-466e-8281-1169b6e13861 | 2020-01-22 | Epigenética en cáncer | NaN | <p>Las células contienen información determina... | NaN | NaN | 0.000000 |
8 | e4ecee6a-1532-4ab9-a6e5-145a6ee1438d | 2020-01-22 | Scientific Racism 2.0 (SR2.0): An erroneous ar... | NaN | <p>SR2.0 refers to a prominent argument made b... | NaN | NaN | 0.000000 |
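The threshold filter mentioned earlier in this section can be sketched with plain Python, using the three highest-scoring rows copied from the table above. The helper name `accept` and the alternative cut-off of 0.70 are illustrative assumptions; any threshold should be checked against matches you have validated by hand.

```python
# Rows copied from the scored table above (remaining rows all score 0.0).
scored = [
    {"submission_id": "82973136-b462-4274-99cb-4d46548acf9c",
     "doi": "10.23880/eoij-16000268", "score": 0.705539},
    {"submission_id": "d9d7af18-32cd-4c39-9704-b0e222bc8b06",
     "doi": "10.31228/osf.io/m3xa6", "score": 0.156177},
    {"submission_id": "d9239d95-c10f-48df-9092-08851146312b",
     "doi": None, "score": 0.0},
]


def accept(rows, threshold):
    """Keep only rows whose score reaches the threshold."""
    return [r for r in rows if r["score"] >= threshold]


strict = accept(scored, 0.75)   # excludes every row in this sample
relaxed = accept(scored, 0.70)  # keeps only the best match
print(len(strict), len(relaxed))
print([r["doi"] for r in relaxed])
```

On the data frame itself, the equivalent filter would be `final_output[final_output['score'] >= 0.75]`.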
6. Conclusion¶
In this tutorial we have shown how to use the Dimensions API to search for articles with titles and abstracts that contain similar terms to the titles of articles that have been rejected in the past.
In terms of next steps, we might choose to do some bibliometric analysis of the articles we rejected. We could also try to improve our search process by extracting keywords from our article abstracts and searching for those too.
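As a starting point for the keyword idea, here is a deliberately crude sketch that picks the most frequent non-trivial words from an abstract. The stop word list, the function name `top_keywords` and the sample text are all made up for illustration; a real implementation would more likely use a proper keyword-extraction method and handle languages other than English.

```python
import re
from collections import Counter

# A tiny, hypothetical stop word list; extend it for real use.
STOP_WORDS = {"the", "and", "that", "for", "with", "this", "are", "from", "its"}


def top_keywords(text: str, n: int = 5) -> list:
    """Return the n most frequent words, ignoring stop words and short words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


abstract = "Open science and open data: open science methods for phonetics"
keywords = top_keywords(abstract, n=2)
print(keywords)            # ['open', 'science']
print(" ".join(keywords))  # could be used as an extra search phrase
```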
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.