Extracting Author Order from Publications Data
This Python notebook shows how to use the publications source of the Dimensions Analytics API to analyse the order in which authors are listed on publications.
These are the steps:
First, we extract a dataset of interest from the Dimensions publications database.
Second, we process the structured author data so as to turn the implicit authorship order into a number.
Third, we mark first, penultimate and last authors via a new ‘author category’ column.
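As a rough illustration of the second and third steps, the sketch below uses plain Python on a hand-made list of names rather than API output. The helper name `label_author_positions` and the sample names are hypothetical; the notebook itself performs this step with pandas in section 2.

```python
def label_author_positions(author_names):
    """Return (position, name, category) tuples for one publication.

    Position is 1-based and follows the order of the input list, which is
    assumed to be the order in which authors appear on the paper.
    """
    total = len(author_names)
    labelled = []
    for position, name in enumerate(author_names, start=1):
        if position == 1:
            category = "FirstAuthor"
        elif position == total:
            category = "LastAuthor"
        elif position == total - 1:
            category = "Penultimate"
        else:
            category = ""
        labelled.append((position, name, category))
    return labelled


# Hypothetical author list, in the order given on the paper
example = label_author_positions(["Author A", "Author B", "Author C", "Author D"])
for row in example:
    print(row)
```

The checks run in the same order as in the notebook, so a single-author paper is labelled as a first author and the second author of a two-author paper is labelled as a last author.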
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Apr 20, 2023
==
Prerequisites
This notebook assumes you have installed the Dimcli library and have followed the steps in the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progressbar
import pandas as pd
import numpy as np
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file
1. Extracting a dataset from Dimensions
We use three separate queries to extract:
authors information
publications metadata
research organizations information
NOTE: Other approaches are also possible, e.g. extracting all data via a single query and then using Python to select only the fields of interest. For the purposes of this tutorial, using separate queries is the most straightforward way to achieve our goal.
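To give an idea of what the single-query alternative involves, here is a minimal sketch that flattens nested publication records into one row per author using plain Python. The field names (`id`, `title`, `authors`, `first_name`, `last_name`) appear in this notebook's outputs, but the exact shape of the records (a list of dicts with a nested `authors` list) is an assumption, and the sample data and the helper `flatten_authors` are made up for illustration.

```python
def flatten_authors(publications):
    """Turn nested publication records into one flat row per author."""
    rows = []
    for pub in publications:
        # `authors` may be missing or None for some records
        for position, author in enumerate(pub.get("authors") or [], start=1):
            rows.append(
                {
                    "pub_id": pub["id"],
                    "title": pub.get("title"),
                    "author_number": position,
                    "first_name": author.get("first_name"),
                    "last_name": author.get("last_name"),
                }
            )
    return rows


# Hypothetical records, shaped like publications with a nested authors list
sample_publications = [
    {
        "id": "pub.0000000001",
        "title": "A made-up title",
        "year": 2022,
        "times_cited": 0,
        "authors": [
            {"first_name": "Ada", "last_name": "Example"},
            {"first_name": "Bo", "last_name": "Sample"},
        ],
    },
    {"id": "pub.0000000002", "title": "No authors listed", "authors": None},
]

flat_rows = flatten_authors(sample_publications)
for row in flat_rows:
    print(row["pub_id"], row["author_number"], row["last_name"])
```

With this approach the author position falls out of the loop directly, at the cost of writing the flattening logic by hand instead of relying on the dataframe helpers used in the cells below.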
[3]:
#
# the main query string selects publications based on a) pub year, b) specific organization IDs and c) concept
# you can update this query based on your preferences
#
main_query = """
search publications
where year in [2022:2022]
and research_orgs in ["grid.21925.3d","grid.147455.6","grid.25879.31","grid.29857.31"]
and concepts = "oncology"
return publications
"""
[4]:
# use the main query but extract only author information
Authors = dsl.query_iterative(main_query + "[id+authors]").as_dataframe_authors()  # key columns: researcher_id, pub_id, current_organization_id
Authors.head()
Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (4.36s)
===
Records extracted: 120
[4]:
 | affiliations | corresponding | current_organization_id | first_name | last_name | orcid | raw_affiliation | researcher_id | pub_id |
---|---|---|---|---|---|---|---|---|---|
0 | [{'city': 'Philadelphia', 'city_id': 4560349, ... | | grid.25879.31 | Andrew | Schlafly | None | [Perelman School of Medicine, University of Pe... | ur.012676303143.43 | pub.1154094821 |
1 | [{'city': 'Jacksonville', 'city_id': 4160021, ... | True | grid.25879.31 | Ronnie | Sebro | None | [Center for Augmented Intelligence, Mayo Clini... | ur.0660765735.77 | pub.1154094821 |
2 | [{'city': 'Madison', 'city_id': 5261457, 'coun... | | grid.14003.36 | Jessica R. | Schumacher | [0000-0002-6740-9498] | [Department of Surgery, University of Wisconsi... | ur.0661627033.29 | pub.1153677611 |
3 | [{'city': 'Madison', 'city_id': 5261457, 'coun... | | grid.14003.36 | Alyssa A. | Wiener | None | [Department of Surgery, University of Wisconsi... | ur.015612367333.32 | pub.1153677611 |
4 | [{'city': 'Madison', 'city_id': 5261457, 'coun... | | grid.410427.4 | Caprice C. | Greenberg | None | [Department of Surgery, University of Wisconsi... | ur.012326542557.13 | pub.1153677611 |
[5]:
# use the main query but extract only pubs metadata
Pubs = dsl.query_iterative(main_query + "[id+title+year+times_cited]").as_dataframe()
Pubs.head()
Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (1.88s)
===
Records extracted: 120
[5]:
 | id | title | times_cited | year |
---|---|---|---|---|
0 | pub.1154094821 | Does NIH funding differ between medical specia... | 0 | 2022 |
1 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 |
2 | pub.1153575321 | Quality and Safety Considerations in Intensity... | 0 | 2022 |
3 | pub.1153525111 | Data standards in pediatric oncology: Past, pr... | 0 | 2022 |
4 | pub.1153522196 | Assessments of Somatic Variant Classification ... | 0 | 2022 |
[6]:
# use the main query but extract only research organization information
RORGS = dsl.query_iterative(main_query + "[unnest(research_orgs)]").as_dataframe()
RORGS.head()
Starting iteration with limit=1000 skip=0 ...
0-120 / 120 (1.03s)
120-120 / 120 (4.07s)
===
Records extracted: 599
[6]:
 | research_orgs.city_name | research_orgs.country_name | research_orgs.id | research_orgs.latitude | research_orgs.linkout | research_orgs.longitude | research_orgs.name | research_orgs.state_name | research_orgs.types | research_orgs.acronym |
---|---|---|---|---|---|---|---|---|---|---|
0 | Philadelphia | United States | grid.25879.31 | 39.952457 | [http://www.upenn.edu/] | -75.193220 | University of Pennsylvania | Pennsylvania | [Education] | NaN |
1 | Jacksonville | United States | grid.417467.7 | 30.289337 | [https://www.mayoclinic.org/patient-visitor-gu... | -81.437775 | Mayo Clinic | Florida | [Healthcare] | NaN |
2 | Madison | United States | grid.14003.36 | 43.076694 | [http://www.wisc.edu/] | -89.412440 | University of Wisconsin–Madison | Wisconsin | [Education] | UW |
3 | Rochester | United States | grid.66875.3a | 44.024070 | [http://www.mayoclinic.org/patient-visitor-gui... | -92.466310 | Mayo Clinic | Minnesota | [Healthcare] | NaN |
4 | Madison | United States | grid.412639.b | 43.076946 | [https://cancer.wisc.edu/] | -89.431470 | UW Carbone Cancer Center | Wisconsin | [Healthcare] | UWCCC |
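The 599 records extracted for 120 publications suggest that the unnest query returns one row per publication/organization pair, so the same organization can appear many times. Before this table can serve as a lookup it has to be reduced to one row per organization. A minimal sketch of that step on a small hand-made dataframe (two organizations taken from the output above, one repeated on purpose, plus a row with no ID):

```python
import pandas as pd

orgs = pd.DataFrame(
    {
        "research_orgs.id": ["grid.25879.31", "grid.417467.7", "grid.25879.31", None],
        "research_orgs.name": [
            "University of Pennsylvania",
            "Mayo Clinic",
            "University of Pennsylvania",
            None,
        ],
    }
)

unique_orgs = (
    orgs.dropna(subset=["research_orgs.id"])            # rows without an ID cannot be matched
    .rename(columns={"research_orgs.id": "rorg_id"})
    .drop_duplicates(subset=["rorg_id"], keep="last")   # one row per organization ID
    .reset_index(drop=True)
)

# A lookup table is only safe to left-merge on if its key is unique
assert unique_orgs["rorg_id"].is_unique
print(unique_orgs)
```

This sketch deduplicates on the ID alone, which is a deliberate variation: if two rows shared an ID but differed in name, deduplicating on the ID and name together would keep both, and a later left merge would duplicate the matching author rows.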
2. Combining the results
We merge the results of the three queries above into a single table containing only the columns we want.
Additionally, for each author we calculate their position in the authorship order and add a category marking ‘first’, ‘penultimate’ and ‘last’ authors.
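The position calculation relies on the rows of each publication already being in authorship order. The toy example below (made-up publication IDs and names, not API data) shows how a grouped running count turns that implicit order into a number, and how the per-publication total can be attached without a second merge. Using `transform("size")` for the total is an alternative to the max-and-merge approach in the notebook cell, not the notebook's own method.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "pub_id": ["pub.A", "pub.A", "pub.A", "pub.B", "pub.B"],
        "author_name": ["One", "Two", "Three", "Four", "Five"],
    }
)

# cumcount() numbers rows 0, 1, 2, ... within each group, in current row order
toy["author_number"] = toy.groupby("pub_id").cumcount() + 1

# transform("size") broadcasts each group's row count back onto its rows
toy["authors_tot"] = toy.groupby("pub_id")["pub_id"].transform("size")

print(toy)
```

Because the numbering follows row order, sorting the dataframe by anything other than its original order before this step would silently produce wrong positions.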
[7]:
#
# Authors becomes the "main table" because it has both the PubID and the ResearcherID
# Then use Authors->Pubs to look up title, year, times cited on Authors.pub_id = Pubs.id
# Then use Authors->RORGS to look up rorg name, type and country on Authors.current_organization_id = RORGS.rorg_id
#
## prep RORGS for merge
RORGS = RORGS.dropna(subset=['research_orgs.id'])
RORGS = RORGS.rename(columns={'research_orgs.id': 'rorg_id'})
RORGS = RORGS.drop_duplicates(subset=['rorg_id', 'research_orgs.name'], keep='last')
## combine all three dataframes into one
AutPub = pd.merge(
    left=Authors,
    right=Pubs,
    left_on='pub_id',
    right_on='id',
    how='left'
)
final = pd.merge(
    left=AutPub,
    right=RORGS,
    left_on='current_organization_id',
    right_on='rorg_id',
    how='left'
)
final["author_name"] = final["last_name"] + ", " + final["first_name"]
# number the authors within each publication; this only works if you haven't sorted the dataframe
final['author_number'] = final.groupby(['pub_id']).cumcount() + 1
final = final.drop(columns=['affiliations', 'corresponding', 'raw_affiliation', 'id', 'first_name', 'last_name', 'research_orgs.latitude', 'research_orgs.longitude', 'research_orgs.acronym'])
# get the author count per pub ID and join it back to the combined table
AuthorCount = final.groupby(['pub_id'])['author_number'].max()
final = pd.merge(
    left=final,
    right=AuthorCount,
    left_on='pub_id',
    right_on='pub_id',
    how='left'
)
final = final.rename(columns={'author_number_x': 'author_number', 'author_number_y': 'authors_tot'})
# assign a category to first, last and penultimate authors
final['AuthorCategory'] = np.where(
    final['author_number'] == 1, 'FirstAuthor',
    np.where(
        final['author_number'] == final['authors_tot'], "LastAuthor",
        np.where(
            (final['authors_tot'] - final['author_number']) == 1, "Penultimate", ""
        )
    )
)
final.head(20)
[7]:
 | current_organization_id | orcid | researcher_id | pub_id | title | times_cited | year | research_orgs.city_name | research_orgs.country_name | rorg_id | research_orgs.linkout | research_orgs.name | research_orgs.state_name | research_orgs.types | author_name | author_number | authors_tot | AuthorCategory |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | grid.25879.31 | None | ur.012676303143.43 | pub.1154094821 | Does NIH funding differ between medical specia... | 0 | 2022 | Philadelphia | United States | grid.25879.31 | [http://www.upenn.edu/] | University of Pennsylvania | Pennsylvania | [Education] | Schlafly, Andrew | 1 | 2 | FirstAuthor |
1 | grid.25879.31 | None | ur.0660765735.77 | pub.1154094821 | Does NIH funding differ between medical specia... | 0 | 2022 | Philadelphia | United States | grid.25879.31 | [http://www.upenn.edu/] | University of Pennsylvania | Pennsylvania | [Education] | Sebro, Ronnie | 2 | 2 | LastAuthor |
2 | grid.14003.36 | [0000-0002-6740-9498] | ur.0661627033.29 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Schumacher, Jessica R. | 1 | 14 | FirstAuthor |
3 | grid.14003.36 | None | ur.015612367333.32 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Wiener, Alyssa A. | 2 | 14 | |
4 | grid.410427.4 | None | ur.012326542557.13 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Augusta | United States | grid.410427.4 | [http://www.augusta.edu/] | Augusta University | Georgia | [Education] | Greenberg, Caprice C. | 3 | 14 | |
5 | grid.14003.36 | [0000-0002-4517-1204] | ur.0632670166.10 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Hanlon, Bret | 4 | 14 | |
6 | grid.240614.5 | None | ur.0671641425.86 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Buffalo | United States | grid.240614.5 | [https://www.roswellpark.org/] | Roswell Park Comprehensive Cancer Center | New York | [Healthcare] | Edge, Stephen B. | 5 | 14 | |
7 | grid.66875.3a | None | ur.01264057027.05 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Rochester | United States | grid.66875.3a | [http://www.mayoclinic.org/patient-visitor-gui... | Mayo Clinic | Minnesota | [Healthcare] | Ruddy, Kathryn J. | 6 | 14 | |
8 | grid.65499.37 | [0000-0002-4722-4824] | ur.012333143317.98 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Boston | United States | grid.65499.37 | [http://www.dana-farber.org/] | Dana-Farber Cancer Institute | Massachusetts | [Facility] | Partridge, Ann H. | 7 | 14 | |
9 | grid.66875.3a | [0000-0002-2234-7430] | ur.0654547635.88 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Rochester | United States | grid.66875.3a | [http://www.mayoclinic.org/patient-visitor-gui... | Mayo Clinic | Minnesota | [Healthcare] | Le-Rademacher, Jennifer G. | 8 | 14 | |
10 | grid.14003.36 | None | ur.016365762407.99 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Yu, Menggang | 9 | 14 | |
11 | grid.29857.31 | [0000-0002-9790-2988] | ur.07542517775.28 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | State College | United States | grid.29857.31 | [http://www.psu.edu/] | Pennsylvania State University | Pennsylvania | [Education] | Vanness, David J. | 10 | 14 | |
12 | grid.14003.36 | None | ur.012527107034.48 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Yang, Dou-Yan | 11 | 14 | |
13 | grid.14003.36 | [0000-0001-8796-4328] | ur.01224567375.46 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Havlena, Jeffrey | 12 | 14 | |
14 | grid.66875.3a | None | ur.014436171657.16 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Rochester | United States | grid.66875.3a | [http://www.mayoclinic.org/patient-visitor-gui... | Mayo Clinic | Minnesota | [Healthcare] | Strand, Carrie | 13 | 14 | Penultimate |
15 | grid.14003.36 | None | ur.01333351663.72 | pub.1153677611 | Local/Regional Recurrence Rates After Breast-C... | 0 | 2022 | Madison | United States | grid.14003.36 | [http://www.wisc.edu/] | University of Wisconsin–Madison | Wisconsin | [Education] | Neuman, Heather B. | 14 | 14 | LastAuthor |
16 | | [] | None | pub.1153575321 | Quality and Safety Considerations in Intensity... | 0 | 2022 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Moran, Jean M | 1 | 9 | FirstAuthor |
17 | | [] | None | pub.1153575321 | Quality and Safety Considerations in Intensity... | 0 | 2022 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bazan, Jose G | 2 | 9 | |
18 | grid.478397.6 | None | ur.016200147053.28 | pub.1153575321 | Quality and Safety Considerations in Intensity... | 0 | 2022 | Arlington | United States | grid.478397.6 | [https://www.astro.org/home/] | American Society for Radiation Oncology | Virginia | [Nonprofit] | Dawes, Samantha L | 3 | 9 | |
19 | grid.478397.6 | None | ur.010477224250.27 | pub.1153575321 | Quality and Safety Considerations in Intensity... | 0 | 2022 | Arlington | United States | grid.478397.6 | [https://www.astro.org/home/] | American Society for Radiation Oncology | Virginia | [Nonprofit] | Kujundzic, Ksenija | 4 | 9 |
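The nested `np.where` calls used to build the AuthorCategory column can also be expressed with `np.select`, which checks a list of conditions in order and takes the label of the first one that matches. The sketch below is an equivalent formulation on toy numbers, not the notebook's own code; the small dataframe is made up to cover a four-author paper and a single-author paper.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "author_number": [1, 2, 3, 4, 1],
        "authors_tot": [4, 4, 4, 4, 1],
    }
)

# Conditions are evaluated in order; the first match wins,
# mirroring the precedence of the nested np.where calls
conditions = [
    toy["author_number"] == 1,
    toy["author_number"] == toy["authors_tot"],
    toy["authors_tot"] - toy["author_number"] == 1,
]
labels = ["FirstAuthor", "LastAuthor", "Penultimate"]

toy["AuthorCategory"] = np.select(conditions, labels, default="")
print(toy)
```

Keeping the conditions and labels in two parallel lists makes it easier to add further categories later without deepening the nesting.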
Where to go from here
In this Dimensions Analytics API tutorial we have seen how to use the publications source to extract and analyse information about authors and their authorship order.
This only scratches the surface of the possible applications of publications data, but hopefully it’ll give you a few basic tools to get started building your own application.
For more tutorials, see the API LAB homepage.
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.