
Extracting author order from publications data

This Python notebook shows how to use the Dimensions Analytics API, and in particular the publications source, to analyse the order in which authors are listed on publications.

These are the steps:

  • First we extract a dataset of interest from Dimensions’ publications database

  • Second, we process the structured author data so as to turn the implicit authorship order into a number

  • Third, we mark first, penultimate and last authors via a new ‘author category’ column
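
The second step can be sketched on toy data before touching the API. This is a minimal illustration, assuming one row per author in the order the records were returned; the publication IDs and names below are made up, and `transform("max")` is just one way to obtain the per-publication total.

```python
import pandas as pd

# Toy data (hypothetical): one row per author, in the order they were returned
toy = pd.DataFrame({
    "pub_id": ["pub.A", "pub.A", "pub.A", "pub.B"],
    "author_name": ["Rossi, Ada", "Chen, Bo", "Okafor, Chi", "Singh, Dev"],
})

# 1-based position of each author within their publication.
# This relies on the rows not having been re-sorted.
toy["author_number"] = toy.groupby("pub_id").cumcount() + 1

# Total number of authors per publication, repeated on every row
toy["authors_tot"] = toy.groupby("pub_id")["author_number"].transform("max")

print(toy)
```

With both numbers on each row, marking first and last authors becomes a simple comparison between the two columns.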

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Apr 20, 2023
==

Prerequisites

This notebook assumes you have installed the Dimcli library and have followed the steps in the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *

import os, sys, time, json
from tqdm.notebook import tqdm as progressbar

import pandas as pd
import numpy as np

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file

1. Extracting a dataset from Dimensions

We use three different queries in order to extract

  • authors information

  • publications metadata

  • research organizations information

NOTE: other approaches are also possible, e.g. extracting all data via a single query and then using Python to select only the fields of interest. For the purpose of this tutorial, using separate queries is the most straightforward way to achieve our goal.
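
As a rough illustration of the single-query alternative mentioned in the note, the sketch below starts from a small list of hand-written records and splits it into a publications table and an authors table with pandas. The record shape is an assumption made for this example and is not meant to reproduce the exact API payload.

```python
import pandas as pd

# Hypothetical records, loosely shaped like the result of one combined query
records = [
    {"id": "pub.A", "title": "First paper", "year": 2022,
     "authors": [{"first_name": "Ada", "last_name": "Rossi"},
                 {"first_name": "Bo", "last_name": "Chen"}]},
    {"id": "pub.B", "title": "Second paper", "year": 2022,
     "authors": [{"first_name": "Dev", "last_name": "Singh"}]},
]

# Publication-level fields only
pubs = pd.DataFrame(records)[["id", "title", "year"]]

# One row per author, carrying the publication id along as metadata
authors = pd.json_normalize(records, record_path="authors", meta=["id"])
authors = authors.rename(columns={"id": "pub_id"})

print(pubs)
print(authors)
```

The trade-off is that all the reshaping moves into Python, whereas separate queries return each table ready to use.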

[3]:
#
# the main query string selects publications based on a) pub year, b) specific organization IDs and c) concept
# you can update this query based on your preferences
#

main_query = """
search publications
    where year in [2022:2022]
    and research_orgs in ["grid.21925.3d","grid.147455.6","grid.25879.31","grid.29857.31"]
    and concepts = "oncology"
 return publications
"""
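
If you expect to change the year, organizations or concept often, the query string can be assembled from Python values. The helper below is hypothetical (it is not part of Dimcli) and only fills in the same template used in the cell above; `json.dumps` is used here simply to render double-quoted strings and lists.

```python
import json

def build_query(year, org_ids, concept):
    """Hypothetical helper: fill the tutorial's query template with new values."""
    orgs = json.dumps(org_ids)        # renders as ["grid.x", "grid.y"]
    concept_q = json.dumps(concept)   # renders as "oncology"
    return f"""
search publications
    where year in [{year}:{year}]
    and research_orgs in {orgs}
    and concepts = {concept_q}
 return publications
"""

example_query = build_query(2022, ["grid.21925.3d", "grid.25879.31"], "oncology")
print(example_query)
```

The resulting string can be used in place of `main_query` in the cells that follow.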
[4]:
# use the main query but extract only authors infos
Authors = dsl.query_iterative(main_query + "[id+authors]").as_dataframe_authors()  ##researcher_id, pub_id, current_organization_ID
Authors.head()
Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (4.36s)
===
Records extracted: 120
[4]:
affiliations corresponding current_organization_id first_name last_name orcid raw_affiliation researcher_id pub_id
0 [{'city': 'Philadelphia', 'city_id': 4560349, ... grid.25879.31 Andrew Schlafly None [Perelman School of Medicine, University of Pe... ur.012676303143.43 pub.1154094821
1 [{'city': 'Jacksonville', 'city_id': 4160021, ... True grid.25879.31 Ronnie Sebro None [Center for Augmented Intelligence, Mayo Clini... ur.0660765735.77 pub.1154094821
2 [{'city': 'Madison', 'city_id': 5261457, 'coun... grid.14003.36 Jessica R. Schumacher [0000-0002-6740-9498] [Department of Surgery, University of Wisconsi... ur.0661627033.29 pub.1153677611
3 [{'city': 'Madison', 'city_id': 5261457, 'coun... grid.14003.36 Alyssa A. Wiener None [Department of Surgery, University of Wisconsi... ur.015612367333.32 pub.1153677611
4 [{'city': 'Madison', 'city_id': 5261457, 'coun... grid.410427.4 Caprice C. Greenberg None [Department of Surgery, University of Wisconsi... ur.012326542557.13 pub.1153677611
[5]:
# use the main query but extract only pubs metadata
Pubs = dsl.query_iterative(main_query + "[id+title+year+times_cited]").as_dataframe()
Pubs.head()
Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (1.88s)
===
Records extracted: 120
[5]:
id title times_cited year
0 pub.1154094821 Does NIH funding differ between medical specia... 0 2022
1 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022
2 pub.1153575321 Quality and Safety Considerations in Intensity... 0 2022
3 pub.1153525111 Data standards in pediatric oncology: Past, pr... 0 2022
4 pub.1153522196 Assessments of Somatic Variant Classification ... 0 2022
[6]:
# use the main query but extract only research orgs infos
RORGS = dsl.query_iterative(main_query + "[unnest(research_orgs)]").as_dataframe()
RORGS.head()
Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (1.03s)
120-120 / 120 (4.07s)
===
Records extracted: 599
[6]:
research_orgs.city_name research_orgs.country_name research_orgs.id research_orgs.latitude research_orgs.linkout research_orgs.longitude research_orgs.name research_orgs.state_name research_orgs.types research_orgs.acronym
0 Philadelphia United States grid.25879.31 39.952457 [http://www.upenn.edu/] -75.193220 University of Pennsylvania Pennsylvania [Education] NaN
1 Jacksonville United States grid.417467.7 30.289337 [https://www.mayoclinic.org/patient-visitor-gu... -81.437775 Mayo Clinic Florida [Healthcare] NaN
2 Madison United States grid.14003.36 43.076694 [http://www.wisc.edu/] -89.412440 University of Wisconsin–Madison Wisconsin [Education] UW
3 Rochester United States grid.66875.3a 44.024070 [http://www.mayoclinic.org/patient-visitor-gui... -92.466310 Mayo Clinic Minnesota [Healthcare] NaN
4 Madison United States grid.412639.b 43.076946 [https://cancer.wisc.edu/] -89.431470 UW Carbone Cancer Center Wisconsin [Healthcare] UWCCC

2. Combining the results

We merge the results from the queries above into a single table containing only the columns we want.

Additionally, we calculate each author’s position in the authorship order and add a category for ‘first’, ‘last’ and ‘penultimate’ authors.
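
Because the tables are combined with left merges, authors whose organization ID has no match keep their row but get empty organization columns. The sketch below, on made-up data, shows one way to make such rows visible using the `indicator` and `validate` options of pandas' `merge`; the column values are assumptions for illustration only.

```python
import pandas as pd

# Toy tables (hypothetical values); the second author has no organization ID
authors = pd.DataFrame({
    "pub_id": ["pub.A", "pub.A", "pub.B"],
    "current_organization_id": ["grid.1", None, "grid.2"],
})
orgs = pd.DataFrame({
    "rorg_id": ["grid.1", "grid.2"],
    "name": ["Org One", "Org Two"],
})

merged = authors.merge(
    orgs,
    left_on="current_organization_id",
    right_on="rorg_id",
    how="left",              # keep every author row
    indicator=True,          # adds a "_merge" column describing each row's origin
    validate="many_to_one",  # raises if an organization ID is duplicated in orgs
)

# Authors that did not match any organization
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

The `validate` check is also a reason to de-duplicate the organizations table before merging: duplicate IDs on the right-hand side would otherwise multiply author rows and distort the author numbering.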

[7]:
#
# Authors becomes the "main table" because it has both the PubID and the ResearcherID
# Then use Authors->Pubs to lookup title, year, times cited      on authors.pub_id = Pubs.id
# Then use Authors->RORGS to lookup rorg name, type and country  on authors.current_organization_ID = RORGS.id
#


##prep RORGS for merge

RORGS = RORGS.dropna(subset = ['research_orgs.id'])
RORGS = RORGS.rename(columns = {'research_orgs.id':'rorg_id'})
RORGS = RORGS.drop_duplicates(subset=['rorg_id', 'research_orgs.name'], keep='last')


##Combine all three dataframes into one

AutPub = pd.merge(
    left=Authors,
    right=Pubs,
    left_on='pub_id',
    right_on='id',
    how='left'
)

final = pd.merge(
    left=AutPub,
    right=RORGS,
    left_on='current_organization_id',
    right_on='rorg_id',
    how='left'
)

final["author_name"] = final["last_name"] + ", " + final["first_name"]
final['author_number'] = final.groupby(['pub_id']).cumcount() + 1  # only valid if the dataframe has not been re-sorted
final = final.drop(columns=['affiliations', 'corresponding', 'raw_affiliation', 'id', 'first_name', 'last_name','research_orgs.latitude','research_orgs.longitude','research_orgs.acronym'])

# Get the total number of authors per publication and join it back onto the combined table

AuthorCount = final.groupby(['pub_id'])['author_number'].max()

final = pd.merge(
    left=final,
    right=AuthorCount,
    left_on='pub_id',
    right_on='pub_id',
    how='left'
)

final = final.rename(columns = {'author_number_x':'author_number', 'author_number_y':'authors_tot', })


# Assign a category to first, last and penultimate authors

final['AuthorCategory'] = np.where(
     final['author_number']==1, 'FirstAuthor',
         np.where(
            final['author_number']==final['authors_tot'],"LastAuthor",
             np.where(
                (final['authors_tot']-final['author_number'])==1,"Penultimate",""
             )
         )
)

final.head(20)

[7]:
current_organization_id orcid researcher_id pub_id title times_cited year research_orgs.city_name research_orgs.country_name rorg_id research_orgs.linkout research_orgs.name research_orgs.state_name research_orgs.types author_name author_number authors_tot AuthorCategory
0 grid.25879.31 None ur.012676303143.43 pub.1154094821 Does NIH funding differ between medical specia... 0 2022 Philadelphia United States grid.25879.31 [http://www.upenn.edu/] University of Pennsylvania Pennsylvania [Education] Schlafly, Andrew 1 2 FirstAuthor
1 grid.25879.31 None ur.0660765735.77 pub.1154094821 Does NIH funding differ between medical specia... 0 2022 Philadelphia United States grid.25879.31 [http://www.upenn.edu/] University of Pennsylvania Pennsylvania [Education] Sebro, Ronnie 2 2 LastAuthor
2 grid.14003.36 [0000-0002-6740-9498] ur.0661627033.29 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Madison United States grid.14003.36 [http://www.wisc.edu/] University of Wisconsin–Madison Wisconsin [Education] Schumacher, Jessica R. 1 14 FirstAuthor
3 grid.14003.36 None ur.015612367333.32 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Madison United States grid.14003.36 [http://www.wisc.edu/] University of Wisconsin–Madison Wisconsin [Education] Wiener, Alyssa A. 2 14
4 grid.410427.4 None ur.012326542557.13 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Augusta United States grid.410427.4 [http://www.augusta.edu/] Augusta University Georgia [Education] Greenberg, Caprice C. 3 14
5 grid.14003.36 [0000-0002-4517-1204] ur.0632670166.10 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Madison United States grid.14003.36 [http://www.wisc.edu/] University of Wisconsin–Madison Wisconsin [Education] Hanlon, Bret 4 14
6 grid.240614.5 None ur.0671641425.86 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Buffalo United States grid.240614.5 [https://www.roswellpark.org/] Roswell Park Comprehensive Cancer Center New York [Healthcare] Edge, Stephen B. 5 14
7 grid.66875.3a None ur.01264057027.05 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Rochester United States grid.66875.3a [http://www.mayoclinic.org/patient-visitor-gui... Mayo Clinic Minnesota [Healthcare] Ruddy, Kathryn J. 6 14
8 grid.65499.37 [0000-0002-4722-4824] ur.012333143317.98 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Boston United States grid.65499.37 [http://www.dana-farber.org/] Dana-Farber Cancer Institute Massachusetts [Facility] Partridge, Ann H. 7 14
9 grid.66875.3a [0000-0002-2234-7430] ur.0654547635.88 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Rochester United States grid.66875.3a [http://www.mayoclinic.org/patient-visitor-gui... Mayo Clinic Minnesota [Healthcare] Le-Rademacher, Jennifer G. 8 14
10 grid.14003.36 None ur.016365762407.99 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Madison United States grid.14003.36 [http://www.wisc.edu/] University of Wisconsin–Madison Wisconsin [Education] Yu, Menggang 9 14
11 grid.29857.31 [0000-0002-9790-2988] ur.07542517775.28 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 State College United States grid.29857.31 [http://www.psu.edu/] Pennsylvania State University Pennsylvania [Education] Vanness, David J. 10 14
12 grid.14003.36 None ur.012527107034.48 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Madison United States grid.14003.36 [http://www.wisc.edu/] University of Wisconsin–Madison Wisconsin [Education] Yang, Dou-Yan 11 14
13 grid.14003.36 [0000-0001-8796-4328] ur.01224567375.46 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Madison United States grid.14003.36 [http://www.wisc.edu/] University of Wisconsin–Madison Wisconsin [Education] Havlena, Jeffrey 12 14
14 grid.66875.3a None ur.014436171657.16 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Rochester United States grid.66875.3a [http://www.mayoclinic.org/patient-visitor-gui... Mayo Clinic Minnesota [Healthcare] Strand, Carrie 13 14 Penultimate
15 grid.14003.36 None ur.01333351663.72 pub.1153677611 Local/Regional Recurrence Rates After Breast-C... 0 2022 Madison United States grid.14003.36 [http://www.wisc.edu/] University of Wisconsin–Madison Wisconsin [Education] Neuman, Heather B. 14 14 LastAuthor
16 [] None pub.1153575321 Quality and Safety Considerations in Intensity... 0 2022 NaN NaN NaN NaN NaN NaN NaN Moran, Jean M 1 9 FirstAuthor
17 [] None pub.1153575321 Quality and Safety Considerations in Intensity... 0 2022 NaN NaN NaN NaN NaN NaN NaN Bazan, Jose G 2 9
18 grid.478397.6 None ur.016200147053.28 pub.1153575321 Quality and Safety Considerations in Intensity... 0 2022 Arlington United States grid.478397.6 [https://www.astro.org/home/] American Society for Radiation Oncology Virginia [Nonprofit] Dawes, Samantha L 3 9
19 grid.478397.6 None ur.010477224250.27 pub.1153575321 Quality and Safety Considerations in Intensity... 0 2022 Arlington United States grid.478397.6 [https://www.astro.org/home/] American Society for Radiation Oncology Virginia [Nonprofit] Kujundzic, Ksenija 4 9
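
The nested `np.where` calls used for the author category can also be written with `np.select`, which some readers may find easier to extend. This is an alternative sketch on toy data, not the notebook's own code; it keeps the same precedence (first, then last, then penultimate), so a sole author is labelled ‘FirstAuthor’.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a four-author paper and a single-author paper
toy = pd.DataFrame({
    "pub_id": ["pub.A"] * 4 + ["pub.B"],
    "author_number": [1, 2, 3, 4, 1],
    "authors_tot": [4, 4, 4, 4, 1],
})

# Conditions are checked in order; the first one that is True wins
conditions = [
    toy["author_number"] == 1,
    toy["author_number"] == toy["authors_tot"],
    toy["authors_tot"] - toy["author_number"] == 1,
]
labels = ["FirstAuthor", "LastAuthor", "Penultimate"]

toy["AuthorCategory"] = np.select(conditions, labels, default="")
print(toy)
```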

Where to go from here

In this Dimensions Analytics API tutorial we have seen how the publications source can be used to extract and analyse information about authors and their order of authorship.

This only scratches the surface of the possible applications of publications data, but hopefully it’ll give you a few basic tools to get started building your own application.
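
As one possible next step, the author categories can be summarised per organization. The sketch below uses a small made-up table with the same column names as the final dataframe; the counts are illustrative only.

```python
import pandas as pd

# Toy rows (hypothetical) using the column names from the final table
toy = pd.DataFrame({
    "research_orgs.name": ["Org One", "Org One", "Org Two", "Org One", "Org One"],
    "AuthorCategory": ["FirstAuthor", "", "LastAuthor", "FirstAuthor", "LastAuthor"],
})

# Keep only first and last authors, then count them per organization
lead = toy[toy["AuthorCategory"].isin(["FirstAuthor", "LastAuthor"])]
counts = pd.crosstab(lead["research_orgs.name"], lead["AuthorCategory"])
print(counts)
```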

For more tutorials, see the API LAB homepage.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
