
Journal Profiling Part 4: Institutions

This Python notebook shows how to use the Dimensions Analytics API to extract author affiliation data linked to the publications of a specific journal.

This tutorial is the fourth in a series that uses the extracted data to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to:

  • Load the publications data extracted in part 1

  • Focus on the institutions linked to a journal: measure how often they appear, how many affiliated authors they have, etc.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if not 'google.colab' in sys.modules:
  # make JS dependencies local / needed by HTML exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Finally, let’s set up a folder to store the data we are going to extract:

[3]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
    os.mkdir(FOLDER_NAME)

def save(df,filename_dot_csv):
    df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)

Institutions Contributing to a Journal

From our original publications dataset, we now want to look at institutions, i.e.:

  • getting the full list of institutions linked to the journal (including ones without a GRID ID, for subsequent analysis)

  • publications count

  • authors count

Load previously saved affiliations data

Let’s reload the affiliations data from Part-1 of this tutorial series.

NOTE If you are using Google Colab or don’t have the data available, just do the following:

  • open up the ‘Files’ panel in Google Colab and create a new folder journal-profile-data

  • grab this file, unzip it, open the enclosed folder and upload the file called 1_publications_affiliations.csv to Google Colab (via the ‘Upload’ menu, or by dragging it into the panel window)

  • move the file inside the journal-profile-data folder you just created
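Alternatively, here is a minimal sketch for uploading the file programmatically in Colab (it assumes you have already downloaded and unzipped 1_publications_affiliations.csv on your machine):

import os
from google.colab import files  # Colab-only helper

os.makedirs("journal-profile-data", exist_ok=True)
uploaded = files.upload()  # pick 1_publications_affiliations.csv in the upload dialog
for name in uploaded:
    # uploaded files land in the current working directory: move them into the data folder
    os.rename(name, os.path.join("journal-profile-data", name))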

[4]:
affiliations = pd.read_csv(FOLDER_NAME+"/1_publications_affiliations.csv")
affiliations.head(10)
[4]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
0 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Center for Disease Neurogenomics, Icahn School... New York US-NY pub.1144816179 NaN Biao Zeng
1 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Pamela Sklar Division of Psychiatric Genomics,... New York US-NY pub.1144816179 NaN Biao Zeng
2 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Genetics and Genomic Sciences, I... New York US-NY pub.1144816179 NaN Biao Zeng
3 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Icahn Institute for Data Science and Genomic T... New York US-NY pub.1144816179 NaN Biao Zeng
4 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Psychiatry, Icahn School of Medi... New York US-NY pub.1144816179 NaN Biao Zeng
5 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Center for Disease Neurogenomics, Icahn School... New York US-NY pub.1144816179 NaN Jaroslav Bendl
6 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Pamela Sklar Division of Psychiatric Genomics,... New York US-NY pub.1144816179 NaN Jaroslav Bendl
7 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Genetics and Genomic Sciences, I... New York US-NY pub.1144816179 NaN Jaroslav Bendl
8 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Icahn Institute for Data Science and Genomic T... New York US-NY pub.1144816179 NaN Jaroslav Bendl
9 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Psychiatry, Icahn School of Medi... New York US-NY pub.1144816179 NaN Jaroslav Bendl

Basic stats about affiliations

  • count how many affiliation statements there are in total

  • count how many affiliations have a GRID ID

  • count how many unique GRID IDs we have in total

[5]:
#
# segment the affiliations dataset
affiliations = affiliations.fillna('')
affiliations_with_grid = affiliations.query("aff_id != ''")
affiliations_without_grid = affiliations.query("aff_id == ''")
#
# save
save(affiliations_without_grid, "4_institutions_without_grid.csv")
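Optionally, before building the chart, you can print the three counts as a quick sanity check (a minimal sketch reusing the dataframes defined above):

# quick sanity check: the same three numbers the barchart below visualises
print("Affiliations in total (non unique):", len(affiliations))
print("Affiliations with a GRID ID:", len(affiliations_with_grid))
print("Affiliations with a GRID ID (unique):", affiliations_with_grid['aff_id'].nunique())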
[6]:
# build a summary barchart

df = pd.DataFrame({
    'measure' : ['Affiliations in total (non unique)', 'Affiliations with a GRID ID', 'Affiliations with a GRID ID (unique)'],
    'count' : [len(affiliations), len(affiliations_with_grid), affiliations_with_grid['aff_id'].nunique()],
})
px.bar(df, x="measure", y="count", title=f"Affiliations stats")

Enriching the unique affiliations (GRIDs list) with pubs count and authors count

We want a table with the following columns:

  • grid ID

  • city

  • country

  • country code

  • name

  • tot_pubs

  • tot_affiliations

NOTE: tot_affiliations counts ‘authorships’ (i.e. authors in the context of each publication).

For our analysis we can start from the affiliations_with_grid dataframe.

[7]:
affiliations_with_grid.head(5)
[7]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
0 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Center for Disease Neurogenomics, Icahn School... New York US-NY pub.1144816179 Biao Zeng
1 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Pamela Sklar Division of Psychiatric Genomics,... New York US-NY pub.1144816179 Biao Zeng
2 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Genetics and Genomic Sciences, I... New York US-NY pub.1144816179 Biao Zeng
3 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Icahn Institute for Data Science and Genomic T... New York US-NY pub.1144816179 Biao Zeng
4 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Psychiatry, Icahn School of Medi... New York US-NY pub.1144816179 Biao Zeng
[8]:
gridaffiliations = affiliations_with_grid.copy()
#
# group by GRIDID and add new column with affiliations count
gridaffiliations["tot_affiliations"] = gridaffiliations.groupby('aff_id')['aff_id'].transform('count')
#
# add new column with publications count, for each GRID
gridaffiliations["tot_pubs"] = gridaffiliations.groupby(['aff_id'])['pub_id'].transform('nunique')
#
# remove unnecessary columns
gridaffiliations = gridaffiliations.drop(['aff_city_id', 'pub_id', 'researcher_id', 'first_name', 'last_name'], axis=1).reset_index(drop=True)
#
# remove duplicate rows
gridaffiliations.drop_duplicates(inplace=True)
#
# update columns order
gridaffiliations = gridaffiliations[[ 'aff_id', 'aff_name','aff_city',
                                     'aff_country', 'aff_country_code',  'aff_state',
                                     'aff_state_code', 'tot_affiliations',  'tot_pubs']]
#
# sort
gridaffiliations = gridaffiliations.sort_values(['tot_affiliations', 'tot_pubs'], ascending=False)
#
#
# That's it! Let's see the result
gridaffiliations.head()
[8]:
aff_id aff_name aff_city aff_country aff_country_code aff_state aff_state_code tot_affiliations tot_pubs
94 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
267 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
1071 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
1108 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
1110 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
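Note that drop_duplicates above ran while the aff_raw_affiliation column was still in the dataframe, so rows sharing a GRID ID but carrying different raw affiliation strings survive (hence the repeated Harvard rows in the preview). If you prefer one row per GRID ID before saving, a minimal sketch:

# collapse to one row per GRID ID: the per-GRID metrics are identical
# across these duplicates, so keeping the first occurrence is enough
gridaffiliations = gridaffiliations.drop_duplicates(subset="aff_id", keep="first")
gridaffiliations.head()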
[9]:
# save the data
save(gridaffiliations, "4_institutions_with_grid_with_metrics.csv")

A Couple of Dataviz

[12]:
threshold = 5000

px.scatter(gridaffiliations[:threshold],
           x="aff_name", y="tot_pubs",
           color="aff_country",
           size='tot_affiliations',
           hover_name="aff_name",
           height=800,
           hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
           title=f"Top {threshold} affiliations by number of publications (country segmentation)")
[14]:
threshold = 5000

px.scatter(gridaffiliations[:threshold],
           x="tot_affiliations", y="tot_pubs",
           color="aff_country",
           hover_name="aff_name",
           hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
           title=f"Top {threshold} affiliations: Number of Publications vs Number of Authors")