
Journal Profiling Part 4: Institutions

This Python notebook shows how to use the Dimensions Analytics API to extract author affiliation data linked to the publications of a specific journal.

This tutorial is the fourth in a series that uses the extracted data to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to

  • Load the publications data extracted in part 1

  • Focus on the institutions linked to a journal: measure how often they appear, how many affiliated authors they have, etc.

  • Build some simple visualizations to get a quick overview of the results

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.shortcuts import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if not 'google.colab' in sys.modules:
  # make js dependencies local (needed by HTML exports)
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7.4.2)
Connected to: https://app.dimensions.ai - DSL v1.27
Method: dsl.ini file

Finally, let’s set up a folder to store the data we are going to extract:

[2]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not os.path.exists(FOLDER_NAME):
    os.mkdir(FOLDER_NAME)

def save(df, filename_dot_csv):
    df.to_csv(os.path.join(FOLDER_NAME, filename_dot_csv), index=False)

Institutions Contributing to a Journal

From our original publications dataset, we now want to look at institutions, i.e.:

  • getting the full list of institutions linked to the journal (including those without a GRID ID, for subsequent analysis)

  • publications count

  • authors count

Load previously saved affiliations data

Let’s reload the affiliations data from Part-1 of this tutorial series.

NOTE If you are using Google Colab or don’t have the data available, do the following:

  • open the ‘Files’ panel in Google Colab and create a new folder journal-profile-data

  • grab this file, unzip it, open the enclosed folder and upload the file called 1_publications_affiliations.csv to Google Colab (via the ‘Upload’ menu, or by dragging it inside the panel window)

  • move the file inside the journal-profile-data folder you just created

[3]:
affiliations = pd.read_csv(FOLDER_NAME+"/1_publications_affiliations.csv")
affiliations.head(10)
[3]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
0 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5128581.0 United States US New York US-NY pub.1130757641 NaN Anton G. Henssen
1 grid.6363.0 Charité Berlin 2950159.0 Germany DE NaN NaN pub.1130757641 NaN Anton G. Henssen
2 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5128581.0 United States US New York US-NY pub.1130757641 NaN Richard Koche
3 grid.168645.8 University of Massachusetts Medical School Worcester 4956184.0 United States US Massachusetts US-MA pub.1130757641 NaN Jiali Zhuang
4 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5128581.0 United States US New York US-NY pub.1130757641 NaN Eileen Jiang
5 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5128581.0 United States US New York US-NY pub.1130757641 NaN Casie Reed
6 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5128581.0 United States US New York US-NY pub.1130757641 NaN Amy Eisenberg
7 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5128581.0 United States US New York US-NY pub.1130757641 NaN Eric Still
8 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5128581.0 United States US New York US-NY pub.1130757641 NaN Ian C. MacArthur
9 grid.10097.3f Barcelona Supercomputing Center Barcelona 3128760.0 Spain ES NaN NaN pub.1130757641 NaN Elias Rodríguez-Fos

Basic stats about affiliations

  • count how many affiliation statements there are in total

  • count how many affiliations have a GRID ID

  • count how many unique GRID IDs we have in total
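
The three counts above can be sketched on a toy dataframe (hypothetical values, not the journal’s real data): rows with an empty aff_id are affiliation statements without a GRID ID, and nunique gives the distinct GRID IDs.

```python
import pandas as pd

# Toy affiliations table (hypothetical data, for illustration only):
# two rows share a GRID ID, one row has none.
toy = pd.DataFrame({
    "aff_id": ["grid.1.1", "grid.1.1", ""],
    "pub_id": ["pub.1", "pub.2", "pub.3"],
})

total = len(toy)                                # all affiliation statements
with_grid = toy.query("aff_id != ''")           # statements with a GRID ID
n_with_grid = len(with_grid)
n_unique_grids = with_grid["aff_id"].nunique()  # distinct GRID IDs

print(total, n_with_grid, n_unique_grids)  # 3 2 1
```

The same query/len/nunique calls are applied to the real affiliations dataframe in the next cells.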

[4]:
#
# segment the affiliations dataset
affiliations = affiliations.fillna('')
affiliations_with_grid = affiliations.query("aff_id != ''")
affiliations_without_grid = affiliations.query("aff_id == ''")
#
# save
save(affiliations_without_grid, "4_institutions_without_grid.csv")
[5]:
# build a summary barchart

df = pd.DataFrame({
    'measure' : ['Affiliations in total (non unique)', 'Affiliations with a GRID ID', 'Affiliations with a GRID ID (unique)'],
    'count' : [len(affiliations), len(affiliations_with_grid), affiliations_with_grid['aff_id'].nunique()],
})
px.bar(df, x="measure", y="count", title="Affiliations stats")

Enriching the unique affiliations (GRIDs list) with pubs count and authors count

We want a table with the following columns

  • grid ID

  • city

  • country

  • country code

  • name

  • tot_pubs

  • tot_affiliations

NOTE: tot_affiliations is a count of ‘authorships’ (i.e., authors in the context of each publication).

For our analysis we can start from the affiliations_with_grid dataframe.
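
The groupby/transform pattern used in the next cells can be previewed on a tiny hypothetical dataframe: 'count' yields the number of authorship rows per GRID ID, while 'nunique' yields the number of distinct publications.

```python
import pandas as pd

# Hypothetical mini-dataset: one institution appears three times
# (two co-authors on pub.A, one author on pub.B), another appears once.
toy = pd.DataFrame({
    "aff_id": ["grid.1.1", "grid.1.1", "grid.1.1", "grid.2.2"],
    "pub_id": ["pub.A", "pub.A", "pub.B", "pub.A"],
})

# 'count' tallies authorship rows per GRID ID;
# 'nunique' tallies distinct publications per GRID ID
toy["tot_affiliations"] = toy.groupby("aff_id")["aff_id"].transform("count")
toy["tot_pubs"] = toy.groupby("aff_id")["pub_id"].transform("nunique")

print(toy.drop_duplicates("aff_id")[["aff_id", "tot_affiliations", "tot_pubs"]])
```

Here grid.1.1 gets tot_affiliations=3 but tot_pubs=2, which is exactly why the two metrics differ in the real table below.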

[6]:
affiliations_with_grid.head(5)
[6]:
aff_id aff_name aff_city aff_city_id aff_country aff_country_code aff_state aff_state_code pub_id researcher_id first_name last_name
0 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5.12858e+06 United States US New York US-NY pub.1130757641 Anton G. Henssen
1 grid.6363.0 Charité Berlin 2.95016e+06 Germany DE pub.1130757641 Anton G. Henssen
2 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5.12858e+06 United States US New York US-NY pub.1130757641 Richard Koche
3 grid.168645.8 University of Massachusetts Medical School Worcester 4.95618e+06 United States US Massachusetts US-MA pub.1130757641 Jiali Zhuang
4 grid.51462.34 Memorial Sloan Kettering Cancer Center New York 5.12858e+06 United States US New York US-NY pub.1130757641 Eileen Jiang
[7]:
gridaffiliations = affiliations_with_grid.copy()
#
# group by GRIDID and add new column with affiliations count
gridaffiliations["tot_affiliations"] = gridaffiliations.groupby('aff_id')['aff_id'].transform('count')
#
# add new column with publications count, for each GRID
gridaffiliations["tot_pubs"] = gridaffiliations.groupby(['aff_id'])['pub_id'].transform('nunique')
#
# remove unnecessary columns
gridaffiliations = gridaffiliations.drop(['aff_city_id', 'pub_id', 'researcher_id', 'first_name', 'last_name'], axis=1).reset_index(drop=True)
#
# remove duplicate rows
gridaffiliations.drop_duplicates(inplace=True)
#
# update columns order
gridaffiliations = gridaffiliations[[ 'aff_id', 'aff_name','aff_city',
                                     'aff_country', 'aff_country_code',  'aff_state',
                                     'aff_state_code', 'tot_affiliations',  'tot_pubs']]
#
# sort
gridaffiliations = gridaffiliations.sort_values(['tot_affiliations', 'tot_pubs'], ascending=False)
#
#
# That's it! Let's see the result
gridaffiliations.head()
[7]:
aff_id aff_name aff_city aff_country aff_country_code aff_state aff_state_code tot_affiliations tot_pubs
157 grid.66859.34 Broad Institute Cambridge United States US Massachusetts US-MA 1438 235
151 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 1273 279
27 grid.5335.0 University of Cambridge Cambridge United Kingdom GB 973 167
304 grid.32224.35 Massachusetts General Hospital Boston United States US Massachusetts US-MA 724 144
28 grid.10306.34 Wellcome Sanger Institute Cambridge United Kingdom GB 723 147
[8]:
# save the data
save(gridaffiliations, "4_institutions_with_grid_with_metrics.csv")

A Couple of Data Visualizations

[9]:
threshold = 500

px.scatter(gridaffiliations[:threshold],
           x="aff_name", y="tot_pubs",
           color="aff_country",
           size='tot_affiliations',
           hover_name="aff_name",
           height=800,
           hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
           title=f"Top {threshold} Institutions by number of publications (country segmentation)")
[10]:
threshold = 100

px.scatter(gridaffiliations[:threshold],
           x="tot_affiliations", y="tot_pubs",
           color="aff_country",
           hover_name="aff_name",
           hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
           title=f"Top {threshold} Institutions: Number of Publications vs Number of Authors")


Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
