
Rejected Article tracker

This Python notebook shows how publishers can use the Dimensions Analytics API to identify whether articles they chose not to publish were ultimately published somewhere else.

In this notebook we will:

1. Import a .csv file containing rejected articles
2. Search for publications similar to the rejected articles
3. Measure the strength of the matches and provide ideas for validation

[ ]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[1]:
# install extra dependencies used in this notebook
!pip install dimcli pandasql levenshtein -U --quiet
[2]:
import dimcli
from dimcli.utils import *

import json, sys
import requests
import pandas as pd
import numpy as np
from pandasql import sqldf
import pandasql as ps
from uuid import uuid4
from tqdm.notebook import tqdm
import string
#
pd.set_option('display.max_columns', None)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.4)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.10
Method: dsl.ini file

1. Get an example data set

For this tutorial, we are going to use a sample data set of preprints and pretend that the preprints are articles we have rejected. This makes a good proof of concept for finding a similar article that has later been published in a peer-reviewed journal: preprints often reappear in journals, sometimes with subtly different titles or abstracts.

In this simplified example, we’ll just use the author names and titles for matching and we’ll add a unique (made up) submission ID, as real data is likely to have this. We’ll use the preprint publishing date as our rejection date.

Here is the query to get our example data set as a pandas data frame, and some code to make it look more like a data set of rejected articles. You don’t necessarily need to understand this part: assuming you will have your own data, you just need to know what the table looks like at the end (which is shown below the code).

[3]:
rejected_publications = []

preprints = dsl.query(
  # This is quite a specific search for preprints published on 2020-01-22
  """
  search publications
  where type = "preprint" and date = "2020-01-22" and abstract is not empty
  return publications[date+title+abstract+authors]
  limit 10
  """
)

for p in preprints.json["publications"]:
  # This will be a row of our data:
  rejected_article_data_row = {
    "rejected_date": None, # Initialising the rows with null values
    "first_author": None,
    "title": None,
    "abstract": None
  }
  rejected_article_data_row['rejected_date'] = p['date']
  rejected_article_data_row['title'] = p['title']
  rejected_article_data_row['abstract'] = p['abstract']
  for order, a in enumerate(p["authors"]):
    if order == 0: # i.e. first author
      rejected_article_data_row['first_author'] = a['last_name']
  rejected_publications.append(rejected_article_data_row)

rejected_publication_data = pd.DataFrame(rejected_publications)

# generate a unique ID for each row to keep things tidy
rejected_publication_data['submission_id'] = [
    str(uuid4()) for _ in range(len(rejected_publication_data))
]

rejected_publication_data
Returned Publications: 10 (total = 730)
Time: 0.61s
WARNINGS [1]
Field current_organization_id of the authors field is deprecated and will be removed in the next major release.
[3]:
rejected_date first_author title abstract submission_id
0 2020-01-22 Kong Predicting Prolonged Length of Hospital Stay f... <sec>\n BACKGROUND\n ... d9239d95-c10f-48df-9092-08851146312b
1 2020-01-22 Bowman OSF Prereg Template <p>Preregistration is the act of submitting a ... fb127b82-6fc7-47fe-a26f-2577864992c8
2 2020-01-22 Di Sia On the Concept of Time in everyday Life and be... <p>In this paper I consider the concept of tim... 82973136-b462-4274-99cb-4d46548acf9c
3 2020-01-22 Di Sia Birth and development of quantum physics: a tr... <p>The last century has been a period of extre... fe794675-8f8b-4d72-a2b6-e749d094d9d1
4 2020-01-22 Bedoya Fabricación de capas antirreflejantes y absorb... <p>Se prepararon películas delgadas de SiO2 en... 99c9217d-11c2-4457-83cf-bbc8354e2731
5 2020-01-22 Coretta Open Science in phonetics and phonology <p>Open Science is a movement that stresses th... 74eb9ad6-151d-4ee0-9da8-4b28c1eb6bd9
6 2020-01-22 Wekke Merumuskan Masalah Penelitian dengan Metode MAIL <p>Ringkasan kuliah di pascasarjana STAIN Soro... 8c833b8a-f388-4083-83c4-69e63f782f60
7 2020-01-22 Hernández-Caballero Epigenética en cáncer <p>Las células contienen información determina... cb4e7fdd-6279-466e-8281-1169b6e13861
8 2020-01-22 Joyce Scientific Racism 2.0 (SR2.0): An erroneous ar... <p>SR2.0 refers to a prominent argument made b... e4ecee6a-1532-4ab9-a6e5-145a6ee1438d
9 2020-01-22 Sinar Functional Features of Forensic Corruption Cas... <p>This study examines the multimodal use of l... d9d7af18-32cd-4c39-9704-b0e222bc8b06
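
If you start from your own export instead of the sample preprints, the import step could look roughly like the sketch below. It assumes a CSV with the same columns as the table above. The sample content and the `load_rejected_articles` helper are hypothetical, and the standard-library `csv` module is used only to keep the example self-contained.

```python
import csv
import io

# Columns the rest of this notebook relies on (taken from the table above).
REQUIRED_COLUMNS = ["submission_id", "rejected_date", "first_author", "title", "abstract"]

def load_rejected_articles(handle):
    """Read rejected-article rows from an open CSV file (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    return [dict(row) for row in reader]

# Made-up sample standing in for a real file, e.g. open("rejected.csv", newline="")
sample_csv = io.StringIO(
    "submission_id,rejected_date,first_author,title,abstract\n"
    'sub-001,2020-01-22,Smith,"An example title, with a comma",An example abstract\n'
)
rows = load_rejected_articles(sample_csv)
print(rows[0]["title"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to obtain a data frame with the same columns as `rejected_publication_data`.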

2. Define the search template

Python automatically concatenates adjacent string literals that appear one after another inside parentheses, so we have written the query out as shown below in order to add a comment to each part. This format isn’t necessary, but hopefully it’s helpful!

[4]:
template = (
  'search publications '
  'in title_abstract_only ' # Search titles and abstracts only
  'for "{title}" ' # Stop words will be automatically excluded
  'where date > "{rejected_date}" '
  'and ('
    'authors = "{first_author}"'
    # The line below gives an example of how you could also search for
    # the surname of the corresponding author if you have it:
    # ' or authors = "{corresponding_author}"'
  ') '
  'return publications['
    'date' # Published date
    '+'
    'doi' # DOI of the published article
    '+'
    'title' # Title of the published article
    '+'
    'abstract' # Abstract of the published article
  '] '
  'limit 1' # Get the most relevant result only
)

template
[4]:
'search publications in title_abstract_only for "{title}" where date > "{rejected_date}" and (authors = "{first_author}") return publications[date+doi+title+abstract] limit 1'
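
To see what a filled-in query looks like, the sketch below formats the template with one made-up row. The row values are hypothetical; the template string is the one printed above, written on fewer lines.

```python
# Same query template as printed above, written as a single expression.
template = (
    'search publications in title_abstract_only for "{title}" '
    'where date > "{rejected_date}" and (authors = "{first_author}") '
    'return publications[date+doi+title+abstract] limit 1'
)

# Hypothetical row. Extra keys such as submission_id are simply ignored
# by str.format(**row) because the template has no matching placeholder.
row = {
    "submission_id": "sub-001",
    "rejected_date": "2020-01-22",
    "first_author": "Smith",
    "title": "An example title",
}

query = template.format(**row)
print(query)
```

Because every value is wrapped in double quotes, a double quote inside a title or author name would end the quoted string early. That is presumably one reason the titles are stripped of punctuation before formatting in the next step.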

3. Iteratively query the Dimensions API for the rejected articles

[5]:
def no_punctuation(s: str) -> str:
  """
  Remove punctuation from a python string
  """
  return s.translate(str.maketrans('', '', string.punctuation))

# We'll store all our results in this list as we iterate, then join them together at the end...
results = []

# For each row in the data set as a python dictionary:
for row in tqdm(rejected_publication_data.to_dict(orient="records")):
  row['title'] = no_punctuation(row['title'])
  query = template.format(**row)
  best = dsl.query(query, verbose=False).as_dataframe()
  best['submission_id'] = row['submission_id']
  results.append(best)

# Join results together
output = pd.concat(results)

output.head() # .head() shows just a few rows
[5]:
submission_id title abstract date doi
0 82973136-b462-4274-99cb-4d46548acf9c On the Concept of Time in Everyday Life and be... In this paper I consider the concept of time i... 2021-01-01 10.23880/eoij-16000268
0 d9d7af18-32cd-4c39-9704-b0e222bc8b06 Functional Features of Forensic Corruption Cas... <p>Functional Features of Forensic Corruption ... 2020-02-28 10.31228/osf.io/m3xa6

4. Join together the input and output data

[6]:
merged_results = pd.merge(
    rejected_publication_data,
    output,
    left_on='submission_id',
    right_on='submission_id',
    how='left')

merged_results.head()
[6]:
rejected_date first_author title_x abstract_x submission_id title_y abstract_y date doi
0 2020-01-22 Kong Predicting Prolonged Length of Hospital Stay f... <sec>\n BACKGROUND\n ... d9239d95-c10f-48df-9092-08851146312b NaN NaN NaN NaN
1 2020-01-22 Bowman OSF Prereg Template <p>Preregistration is the act of submitting a ... fb127b82-6fc7-47fe-a26f-2577864992c8 NaN NaN NaN NaN
2 2020-01-22 Di Sia On the Concept of Time in everyday Life and be... <p>In this paper I consider the concept of tim... 82973136-b462-4274-99cb-4d46548acf9c On the Concept of Time in Everyday Life and be... In this paper I consider the concept of time i... 2021-01-01 10.23880/eoij-16000268
3 2020-01-22 Di Sia Birth and development of quantum physics: a tr... <p>The last century has been a period of extre... fe794675-8f8b-4d72-a2b6-e749d094d9d1 NaN NaN NaN NaN
4 2020-01-22 Bedoya Fabricación de capas antirreflejantes y absorb... <p>Se prepararon películas delgadas de SiO2 en... 99c9217d-11c2-4457-83cf-bbc8354e2731 NaN NaN NaN NaN

5. Add Matching Score

We have found some publications that might match our rejected articles. Now we need to score them to see whether they are good matches.

In this case we’ll measure the edit distance between the abstracts. The most commonly used edit distance between strings is Levenshtein distance, which is nicely implemented in Python in the Levenshtein package.

The Levenshtein package has a function “ratio” which uses Levenshtein distance to get a similarity (not distance) score between 0 (dissimilar) and 1 (identical). We will use this to compare abstracts converted to lowercase.

Sorting the results by score in descending order (from highest to lowest), we can see that there was one reasonably good match, with a score of about 0.71. If we wanted to make the matching more automatic, we could filter out everything below a chosen threshold. Note that a cut-off of e.g. 0.75 would exclude this match, so the threshold should be tuned against matches you have verified by hand.

[7]:
from Levenshtein import ratio

def similarity(string1: str, string2: str) -> float:
  """
  Case-insensitive similarity score made by subtracting the normalised
    Levenshtein distance from 1.
  """
  if pd.isna(string1) or pd.isna(string2):
    return 0.
  else:
    return ratio(string1.lower(), string2.lower())

print(similarity('The cat sat on the mat', 'The dog sat on the frog'))
print(similarity('The cat sat on the mat', 'The mat sat on the cat'))
0.7111111111111111
0.9090909090909091
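
The two scores above are consistent with a similarity of the form 2 × LCS / (len(a) + len(b)), where LCS is the length of the longest common subsequence. The pure-Python sketch below reproduces them under that assumption. The equivalence is inferred from these two examples only, not from the package documentation, and the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def indel_similarity(a: str, b: str) -> float:
    """Case-insensitive 2 * LCS / (len(a) + len(b)); 1.0 for two empty strings."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * lcs_length(a.lower(), b.lower()) / total

print(round(indel_similarity('The cat sat on the mat', 'The dog sat on the frog'), 4))  # 0.7111
print(round(indel_similarity('The cat sat on the mat', 'The mat sat on the cat'), 4))   # 0.9091
```

This version is far slower than the compiled package and is meant only to make the score easier to interpret: a score of 0.71 means that roughly 71% of the characters in the two strings, taken together, line up in a common subsequence.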
[8]:
merged_results['score'] = merged_results.apply(
    lambda row: similarity(row['abstract_x'], row['abstract_y']),
    axis=1
)

merged_results = merged_results.sort_values("score", ascending=False)

final_output = merged_results[[
    'submission_id',
    'rejected_date',
    'title_x',
    'title_y',
    'abstract_x',
    'abstract_y',
    'doi',
    'score'
]]

final_output.columns = [
    'submission_id',
    'rejected_date',
    'original_title',
    'published_title',
    'abstract_x',
    'abstract_y',
    'doi',
    'score'
]

final_output
[8]:
submission_id rejected_date original_title published_title abstract_x abstract_y doi score
2 82973136-b462-4274-99cb-4d46548acf9c 2020-01-22 On the Concept of Time in everyday Life and be... On the Concept of Time in Everyday Life and be... <p>In this paper I consider the concept of tim... In this paper I consider the concept of time i... 10.23880/eoij-16000268 0.705539
9 d9d7af18-32cd-4c39-9704-b0e222bc8b06 2020-01-22 Functional Features of Forensic Corruption Cas... Functional Features of Forensic Corruption Cas... <p>This study examines the multimodal use of l... <p>Functional Features of Forensic Corruption ... 10.31228/osf.io/m3xa6 0.156177
0 d9239d95-c10f-48df-9092-08851146312b 2020-01-22 Predicting Prolonged Length of Hospital Stay f... NaN <sec>\n BACKGROUND\n ... NaN NaN 0.000000
1 fb127b82-6fc7-47fe-a26f-2577864992c8 2020-01-22 OSF Prereg Template NaN <p>Preregistration is the act of submitting a ... NaN NaN 0.000000
3 fe794675-8f8b-4d72-a2b6-e749d094d9d1 2020-01-22 Birth and development of quantum physics: a tr... NaN <p>The last century has been a period of extre... NaN NaN 0.000000
4 99c9217d-11c2-4457-83cf-bbc8354e2731 2020-01-22 Fabricación de capas antirreflejantes y absorb... NaN <p>Se prepararon películas delgadas de SiO2 en... NaN NaN 0.000000
5 74eb9ad6-151d-4ee0-9da8-4b28c1eb6bd9 2020-01-22 Open Science in phonetics and phonology NaN <p>Open Science is a movement that stresses th... NaN NaN 0.000000
6 8c833b8a-f388-4083-83c4-69e63f782f60 2020-01-22 Merumuskan Masalah Penelitian dengan Metode MAIL NaN <p>Ringkasan kuliah di pascasarjana STAIN Soro... NaN NaN 0.000000
7 cb4e7fdd-6279-466e-8281-1169b6e13861 2020-01-22 Epigenética en cáncer NaN <p>Las células contienen información determina... NaN NaN 0.000000
8 e4ecee6a-1532-4ab9-a6e5-145a6ee1438d 2020-01-22 Scientific Racism 2.0 (SR2.0): An erroneous ar... NaN <p>SR2.0 refers to a prominent argument made b... NaN NaN 0.000000
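
In the table above the original abstracts still carry markup such as `<p>` and `<sec>`, whereas the first matched abstract does not, and that difference alone lowers the similarity score. One possible refinement, which is not part of the workflow above, is to strip tags and normalise whitespace before scoring. The `clean_abstract` helper and the sample text below are hypothetical.

```python
import html
import re

# Matches anything that looks like an HTML/XML tag, e.g. <p>, </sec>, <title>.
TAG_PATTERN = re.compile(r"<[^>]+>")

def clean_abstract(text: str) -> str:
    """Drop tags, unescape entities and collapse whitespace (hypothetical helper)."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(html.unescape(without_tags).split())

# Made-up abstract in the style of the ones returned by the API.
raw = "<sec>\n <title>BACKGROUND</title>\n <p>Length of stay &amp; cost are linked.</p></sec>"
print(clean_abstract(raw))  # BACKGROUND Length of stay & cost are linked.
```

Applying it to `abstract_x` and `abstract_y` before calling `similarity` should make the two sides more comparable. The effect on the scores shown above has not been measured here.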

6. Conclusion

In this tutorial we have shown how to use the Dimensions API to search for articles with titles and abstracts that contain similar terms to the titles of articles that have been rejected in the past.

In terms of next steps, we might choose to do some bibliometric analysis of the articles we rejected. We could also try to improve our search process by extracting keywords from our article abstracts and searching for those too.
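
As a starting point for the keyword idea, a very rough sketch is to count the most frequent non-trivial words in an abstract. The stop-word list, the `top_keywords` helper and the sample text are all hypothetical, and a real pipeline would likely use a proper keyword-extraction library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is",
              "that", "this", "for", "we", "i"}

def top_keywords(text: str, n: int = 5) -> list:
    """Return the n most frequent words, ignoring stop words and very short words."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

# Made-up abstract text.
abstract = "Open science is a movement. Open science stresses open data and open methods in science."
print(top_keywords(abstract, 2))  # ['open', 'science']
```

The keywords could then be added to the search terms in the query template. How much that improves recall or precision would need to be checked against known matches.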



Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
