Rejected Article tracker¶

This Python notebook shows how publishers can use the Dimensions Analytics API to identify whether articles they chose not to publish were ultimately published somewhere else.

In this notebook we will:

1. Import a .csv file containing rejected articles
2. Search for publications similar to the rejected articles
3. Measure the strength of the matches and provide ideas for validation

[ ]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

Prerequisites¶

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:

# install extra dependencies used in this notebook
!pip install dimcli pandasql levenshtein -U --quiet

[2]:

import dimcli
from dimcli.utils import *

import json, sys
import requests
import pandas as pd
import numpy as np
from pandasql import sqldf
import pandasql as ps
from uuid import uuid4
from tqdm.notebook import tqdm
import string
#
pd.set_option('display.max_columns', None)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v1.4)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.10
Method: dsl.ini file

1. Get an example data set¶

For this tutorial, we are going to use a sample data set of preprints and pretend that the preprints are articles we have rejected. This is a good proof of the concept of finding a similar article that has been published in a peer-reviewed journal: preprints often reappear published in journals and might have subtly different titles or abstracts.

In this simplified example, we’ll just use the author names and titles for matching and we’ll add a unique (made up) submission ID, as real data is likely to have this. We’ll use the preprint publishing date as our rejection date.

Here is the query to get our example data set as a pandas data frame, and some code to make it look more like a data set of rejected articles. You don’t necessarily need to understand this part: assuming you have your own data, you just need to know what the table looks like at the end (shown below the code).

[3]:

rejected_publications = []

preprints = dsl.query(
  # This is quite a specific search for preprints published on 2020-01-22
  """
  search publications
  where type = "preprint" and date = "2020-01-22" and abstract is not empty
  return publications[date+title+abstract+authors]
  limit 10
  """
)

for p in preprints.json["publications"]:
  # This will be a row of our data:
  rejected_article_data_row = {
    "rejected_date": None, # Initialising the rows with null values
    "first_author": None,
    "title": None,
    "abstract": None
  }
  rejected_article_data_row['rejected_date'] = p['date']
  rejected_article_data_row['title'] = p['title']
  rejected_article_data_row['abstract'] = p['abstract']
  for order, a in enumerate(p["authors"]):
    if order == 0: # i.e. first author
      rejected_article_data_row['first_author'] = a['last_name']
  rejected_publications.append(rejected_article_data_row)

rejected_publication_data = pd.DataFrame(rejected_publications)

# generate a unique ID for each row to keep things tidy
rejected_publication_data['submission_id'] = [
    str(uuid4()) for _ in range(len(rejected_publication_data))
]

rejected_publication_data

Returned Publications: 10 (total = 730)
Time: 0.61s
WARNINGS [1]
Field current_organization_id of the authors field is deprecated and will be removed in the next major release.

[3]:

	rejected_date	first_author	title	abstract	submission_id
0	2020-01-22	Kong	Predicting Prolonged Length of Hospital Stay f...	<sec>\n BACKGROUND\n ...	d9239d95-c10f-48df-9092-08851146312b
1	2020-01-22	Bowman	OSF Prereg Template	<p>Preregistration is the act of submitting a ...	fb127b82-6fc7-47fe-a26f-2577864992c8
2	2020-01-22	Di Sia	On the Concept of Time in everyday Life and be...	<p>In this paper I consider the concept of tim...	82973136-b462-4274-99cb-4d46548acf9c
3	2020-01-22	Di Sia	Birth and development of quantum physics: a tr...	<p>The last century has been a period of extre...	fe794675-8f8b-4d72-a2b6-e749d094d9d1
4	2020-01-22	Bedoya	Fabricación de capas antirreflejantes y absorb...	<p>Se prepararon películas delgadas de SiO2 en...	99c9217d-11c2-4457-83cf-bbc8354e2731
5	2020-01-22	Coretta	Open Science in phonetics and phonology	<p>Open Science is a movement that stresses th...	74eb9ad6-151d-4ee0-9da8-4b28c1eb6bd9
6	2020-01-22	Wekke	Merumuskan Masalah Penelitian dengan Metode MAIL	<p>Ringkasan kuliah di pascasarjana STAIN Soro...	8c833b8a-f388-4083-83c4-69e63f782f60
7	2020-01-22	Hernández-Caballero	Epigenética en cáncer	<p>Las células contienen información determina...	cb4e7fdd-6279-466e-8281-1169b6e13861
8	2020-01-22	Joyce	Scientific Racism 2.0 (SR2.0): An erroneous ar...	<p>SR2.0 refers to a prominent argument made b...	e4ecee6a-1532-4ab9-a6e5-145a6ee1438d
9	2020-01-22	Sinar	Functional Features of Forensic Corruption Cas...	<p>This study examines the multimodal use of l...	d9d7af18-32cd-4c39-9704-b0e222bc8b06
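If you start from your own .csv file rather than the preprint sample, a table with the same columns could be loaded along the following lines. This is only a sketch: the CSV content, the file name mentioned in the comment and the `required` column list are assumptions made for illustration, and the rows are placeholder data.

```python
import io
import pandas as pd

# Placeholder CSV content standing in for a real export of rejected submissions.
csv_text = """submission_id,rejected_date,first_author,title,abstract
S-001,2020-01-22,Smith,Placeholder title one,Placeholder abstract one
S-002,2020-01-22,Jones,Placeholder title two,Placeholder abstract two
"""

# With a real file this would be something like:
#   pd.read_csv("rejected_articles.csv", dtype=str)   # hypothetical file name
rejected_from_csv = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Columns the rest of the notebook relies on.
required = ["rejected_date", "first_author", "title", "abstract", "submission_id"]
missing = [c for c in required if c not in rejected_from_csv.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Keep the same column order as the example table above.
rejected_from_csv = rejected_from_csv[required]
print(rejected_from_csv)
```

Reading every column as a string (`dtype=str`) keeps dates and IDs exactly as they appear in the file, which is convenient when they are later pasted into a query.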

2. Define the search template¶

Python automatically concatenates adjacent string literals placed inside parentheses, so we have written the query out as shown below in order to add a comment to each part. This format isn’t necessary, but hopefully it’s helpful!

[4]:

template = (
  'search publications '
  'in title_abstract_only ' # Search titles and abstracts only (not the full text)
  'for "{title}" ' # Stop words will be automatically excluded
  'where date > "{rejected_date}" '
  'and ('
    'authors = "{first_author}"'
    # The line below gives an example of how you could also search for
    # the surname of the corresponding author if you have it:
    # ' or authors = "{corresponding_author}"'
  ') '
  'return publications['
    'date' # Published date
    '+'
    'doi' # DOI of the published article
    '+'
    'title' # Title of the published article
    '+'
    'abstract' # Abstract of the published article
  '] '
  'limit 1' # Get the most relevant result only
)

template

[4]:

'search publications in title_abstract_only for "{title}" where date > "{rejected_date}" and (authors = "{first_author}") return publications[date+doi+title+abstract] limit 1'
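To make the role of the placeholders concrete, here is a small sketch of how one row of the table is substituted into the template with `str.format`. The row values are illustrative, and the template is repeated so the snippet runs on its own.

```python
# The same template as above, written as one string for brevity.
template = (
    'search publications in title_abstract_only for "{title}" '
    'where date > "{rejected_date}" and (authors = "{first_author}") '
    'return publications[date+doi+title+abstract] limit 1'
)

# One row of the rejected-articles table as a dictionary (illustrative values).
row = {
    "rejected_date": "2020-01-22",
    "first_author": "Coretta",
    "title": "Open Science in phonetics and phonology",
    "abstract": "...",              # keys not used by the template are ignored
    "submission_id": "example-id",
}

query = template.format(**row)
print(query)
```

Because the title is wrapped in double quotes inside the query, any double quote in the title itself would end the phrase early, which is one reason to strip punctuation from titles before formatting.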

3. Iteratively query the Dimensions API for the rejected articles¶

[5]:

def no_punctuation(s: str) -> str:
  """
  Remove punctuation from a python string
  """
  return s.translate(str.maketrans('', '', string.punctuation))

# We'll store all our results in this list as we iterate, then join them together at the end...
results = []

# For each row in the data set as a python dictionary:
for row in tqdm(rejected_publication_data.to_dict(orient="records")):
  row['title'] = no_punctuation(row['title'])
  query = template.format(**row)
  best = dsl.query(query, verbose=False).as_dataframe()
  best['submission_id'] = row['submission_id']
  results.append(best)

# Join results together
output = pd.concat(results)

output.head() # .head() shows just a few rows

[5]:

	submission_id	title	abstract	date	doi
0	82973136-b462-4274-99cb-4d46548acf9c	On the Concept of Time in Everyday Life and be...	In this paper I consider the concept of time i...	2021-01-01	10.23880/eoij-16000268
0	d9d7af18-32cd-4c39-9704-b0e222bc8b06	Functional Features of Forensic Corruption Cas...	<p>Functional Features of Forensic Corruption ...	2020-02-28	10.31228/osf.io/m3xa6
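Only two of the ten queries returned a result, so most iterations produce an empty data frame. A possible variation of the loop, sketched below, skips empty results and resets the index so that rows are not all labelled 0. The `collect_matches` helper and the `fake_query` function are hypothetical: `fake_query` stands in for the call to the API so that the sketch runs offline.

```python
import pandas as pd

def collect_matches(rows, run_query):
    """Run one query per row and stack the non-empty results."""
    frames = []
    for row in rows:
        best = run_query(row)            # expected to return a DataFrame
        if best.empty:                   # no candidate found for this row
            continue
        frames.append(best.assign(submission_id=row["submission_id"]))
    if not frames:
        # Nothing matched at all: return an empty table with the join key.
        return pd.DataFrame(columns=["submission_id"])
    return pd.concat(frames, ignore_index=True)

# Hypothetical stand-in for the API call, returning placeholder data.
def fake_query(row):
    if row["first_author"] == "Di Sia":
        return pd.DataFrame([{"title": "A published title", "doi": "10.0000/example"}])
    return pd.DataFrame()

rows = [
    {"submission_id": "a", "first_author": "Di Sia"},
    {"submission_id": "b", "first_author": "Kong"},
]
matches = collect_matches(rows, fake_query)
print(matches)
```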

4. Join together the input and output data¶

[6]:

merged_results = pd.merge(
    rejected_publication_data,
    output,
    left_on='submission_id',
    right_on='submission_id',
    how='left')

merged_results.head()

[6]:

	rejected_date	first_author	title_x	abstract_x	submission_id	title_y	abstract_y	date	doi
0	2020-01-22	Kong	Predicting Prolonged Length of Hospital Stay f...	<sec>\n BACKGROUND\n ...	d9239d95-c10f-48df-9092-08851146312b	NaN	NaN	NaN	NaN
1	2020-01-22	Bowman	OSF Prereg Template	<p>Preregistration is the act of submitting a ...	fb127b82-6fc7-47fe-a26f-2577864992c8	NaN	NaN	NaN	NaN
2	2020-01-22	Di Sia	On the Concept of Time in everyday Life and be...	<p>In this paper I consider the concept of tim...	82973136-b462-4274-99cb-4d46548acf9c	On the Concept of Time in Everyday Life and be...	In this paper I consider the concept of time i...	2021-01-01	10.23880/eoij-16000268
3	2020-01-22	Di Sia	Birth and development of quantum physics: a tr...	<p>The last century has been a period of extre...	fe794675-8f8b-4d72-a2b6-e749d094d9d1	NaN	NaN	NaN	NaN
4	2020-01-22	Bedoya	Fabricación de capas antirreflejantes y absorb...	<p>Se prepararon películas delgadas de SiO2 en...	99c9217d-11c2-4457-83cf-bbc8354e2731	NaN	NaN	NaN	NaN
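The `_x` and `_y` suffixes are pandas defaults for columns that exist in both tables. As a sketch of an alternative, the merge below uses explicit suffixes and the `indicator` option to count how many rejected articles have a candidate match. The two small tables contain placeholder data.

```python
import pandas as pd

rejected = pd.DataFrame({
    "submission_id": ["a", "b", "c"],
    "title": ["Title A", "Title B", "Title C"],
})
found = pd.DataFrame({
    "submission_id": ["b"],
    "title": ["Title B, revised"],
    "doi": ["10.0000/example"],
})

merged = pd.merge(
    rejected,
    found,
    on="submission_id",
    how="left",                              # keep every rejected article
    suffixes=("_rejected", "_published"),    # instead of the default _x / _y
    indicator=True,                          # adds a "_merge" column
)

# "both" marks rows found in both tables, i.e. rejected articles with a candidate.
n_matched = (merged["_merge"] == "both").sum()
print(merged)
print(f"{n_matched} of {len(merged)} rejected articles have a candidate match")
```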

5. Add Matching Score¶

We have found some publications that might match our rejected articles. Now we need to score them to see whether they are good matches.

In this case we’ll measure the edit distance between the abstract of each rejected article and the abstract of its candidate match. The most commonly used edit distance between strings is Levenshtein distance, which is implemented in Python in the Levenshtein package.

The Levenshtein package has a function “ratio”, which turns an edit distance into a similarity (not distance) score between 0 (dissimilar) and 1 (identical). We will use this to compare abstracts converted to lowercase.

Sorting the results by score descending (from highest to lowest), we can see that there was one good match. If we wanted to make the matching more automatic, we could choose to filter out everything with a score below a chosen threshold, e.g. 0.75. The threshold needs tuning against manually checked matches, since the best match in this example scores about 0.71.

[7]:

from Levenshtein import ratio

def similarity(string1: str, string2: str) -> float:
  """
  Case-insensitive similarity score made by subtracting the normalised
    Levenshtein distance from 1.
  """
  if pd.isna(string1) or pd.isna(string2):
    return 0.
  else:
    return ratio(string1.lower(), string2.lower())

print(similarity('The cat sat on the mat', 'The dog sat on the frog'))
print(similarity('The cat sat on the mat', 'The mat sat on the cat'))

0.7111111111111111
0.9090909090909091
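To see what this score measures, the two values printed above can be reproduced without the Levenshtein package. The sketch below assumes that the score equals twice the length of the longest common subsequence divided by the combined length of the two strings, which is consistent with both printed values; the function names are ours and this is not the library's implementation.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity: 2 * LCS / (len(a) + len(b))."""
    a, b = a.lower(), b.lower()
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return 2 * lcs_length(a, b) / total

print(round(indel_similarity('The cat sat on the mat', 'The dog sat on the frog'), 4))  # 0.7111
print(round(indel_similarity('The cat sat on the mat', 'The mat sat on the cat'), 4))   # 0.9091
```

Under this reading, replacing a character costs as much as deleting one and inserting another, so long shared passages raise the score while substitutions lower it more than in plain Levenshtein distance.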

[8]:

merged_results['score'] = merged_results.apply(
    lambda row: similarity(row['abstract_x'], row['abstract_y']),
    axis=1
)

merged_results = merged_results.sort_values("score", ascending=False)

final_output = merged_results[[
    'submission_id',
    'rejected_date',
    'title_x',
    'title_y',
    'abstract_x',
    'abstract_y',
    'doi',
    'score'
]]

final_output.columns = [
    'submission_id',
    'rejected_date',
    'original_title',
    'published_title',
    'abstract_x',
    'abstract_y',
    'doi',
    'score'
]

final_output

[8]:

	submission_id	rejected_date	original_title	published_title	abstract_x	abstract_y	doi	score
2	82973136-b462-4274-99cb-4d46548acf9c	2020-01-22	On the Concept of Time in everyday Life and be...	On the Concept of Time in Everyday Life and be...	<p>In this paper I consider the concept of tim...	In this paper I consider the concept of time i...	10.23880/eoij-16000268	0.705539
9	d9d7af18-32cd-4c39-9704-b0e222bc8b06	2020-01-22	Functional Features of Forensic Corruption Cas...	Functional Features of Forensic Corruption Cas...	<p>This study examines the multimodal use of l...	<p>Functional Features of Forensic Corruption ...	10.31228/osf.io/m3xa6	0.156177
0	d9239d95-c10f-48df-9092-08851146312b	2020-01-22	Predicting Prolonged Length of Hospital Stay f...	NaN	<sec>\n BACKGROUND\n ...	NaN	NaN	0.000000
1	fb127b82-6fc7-47fe-a26f-2577864992c8	2020-01-22	OSF Prereg Template	NaN	<p>Preregistration is the act of submitting a ...	NaN	NaN	0.000000
3	fe794675-8f8b-4d72-a2b6-e749d094d9d1	2020-01-22	Birth and development of quantum physics: a tr...	NaN	<p>The last century has been a period of extre...	NaN	NaN	0.000000
4	99c9217d-11c2-4457-83cf-bbc8354e2731	2020-01-22	Fabricación de capas antirreflejantes y absorb...	NaN	<p>Se prepararon películas delgadas de SiO2 en...	NaN	NaN	0.000000
5	74eb9ad6-151d-4ee0-9da8-4b28c1eb6bd9	2020-01-22	Open Science in phonetics and phonology	NaN	<p>Open Science is a movement that stresses th...	NaN	NaN	0.000000
6	8c833b8a-f388-4083-83c4-69e63f782f60	2020-01-22	Merumuskan Masalah Penelitian dengan Metode MAIL	NaN	<p>Ringkasan kuliah di pascasarjana STAIN Soro...	NaN	NaN	0.000000
7	cb4e7fdd-6279-466e-8281-1169b6e13861	2020-01-22	Epigenética en cáncer	NaN	<p>Las células contienen información determina...	NaN	NaN	0.000000
8	e4ecee6a-1532-4ab9-a6e5-145a6ee1438d	2020-01-22	Scientific Racism 2.0 (SR2.0): An erroneous ar...	NaN	<p>SR2.0 refers to a prominent argument made b...	NaN	NaN	0.000000
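One way to act on the scores is to split the candidates into likely matches and matches that need a manual check. The sketch below uses a small placeholder table and an assumed cut-off of 0.6; both the cut-off and the variable names are illustrative and would need tuning on real data.

```python
import pandas as pd

# Placeholder scores, loosely modelled on the shape of the table above.
scored = pd.DataFrame({
    "submission_id": ["s1", "s2", "s3"],
    "doi": ["10.0000/example-1", "10.0000/example-2", None],
    "score": [0.71, 0.16, 0.0],
})

THRESHOLD = 0.6  # assumed cut-off; tune it against manually checked matches

has_candidate = scored["doi"].notna()
likely = scored[has_candidate & (scored["score"] >= THRESHOLD)]
needs_review = scored[has_candidate & (scored["score"] < THRESHOLD)]

print("Likely matches:\n", likely)
print("Candidates to check by hand:\n", needs_review)
```

Low-scoring candidates are kept rather than discarded because a published abstract can differ substantially from the submitted one even when the article is the same.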

6. Conclusion¶

In this tutorial we have shown how to use the Dimensions API to search for articles with titles and abstracts that contain similar terms to the titles of articles that have been rejected in the past.

In terms of next steps, we might choose to do some bibliometric analysis of the articles we rejected. We could also try to improve our search process by extracting keywords from our article abstracts and searching for those too.
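As a starting point for the keyword idea, here is a deliberately simple sketch that strips HTML tags from an abstract and returns its most frequent words. The stop-word list, the `top_keywords` name and the three-letter minimum are assumptions; a real project would more likely use a dedicated text-processing library.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption, far from complete).
STOP_WORDS = {"the", "and", "for", "that", "this", "with", "are", "was", "from"}

def top_keywords(abstract: str, n: int = 5) -> list:
    """Return the n most frequent non-stop-words of three letters or more."""
    text = re.sub(r"<[^>]+>", " ", abstract)          # drop tags such as <p>
    words = re.findall(r"[a-z]{3,}", text.lower())    # ASCII letters only
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

example = "<p>Open science improves science. Open data supports science.</p>"
print(top_keywords(example, n=2))
```

The regular expression only keeps unaccented Latin letters, so abstracts in other languages or scripts would need a different tokenisation step.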

Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.