
Journal Profiling Part 4: Institutions

This Python notebook shows how to use the Dimensions Analytics API to extract information about the author affiliations linked to the publications of a specific journal.

This tutorial is the fourth in a series that uses the extracted data to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.

In this notebook we are going to

  • Load the publications data extracted in part 1

  • Focus on the institutions linked to a journal: measure how often they appear, how many affiliated authors they have, etc.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if not 'google.colab' in sys.modules:
  # make JS dependencies local (needed by HTML exports)
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Finally, let’s set up a folder to store the data we are going to extract:

[3]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
    os.mkdir(FOLDER_NAME)

def save(df,filename_dot_csv):
    df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)

Institutions Contributing to a Journal

From our original publications dataset, we now want to look at institutions, i.e.:

  • getting the full list of institutions linked to the journal (including ones without a GRID ID, for subsequent analysis)

  • publications count

  • authors count

Load previously saved affiliations data

Let’s reload the affiliations data from Part-1 of this tutorial series.

NOTE: If you are using Google Colab or don’t have the data available, do the following:

  • open the ‘Files’ panel in Google Colab and create a new folder called journal-profile-data

  • grab this file, unzip it, open the enclosed folder and upload the file called 1_publications_affiliations.csv to Google Colab (via the ‘Upload’ menu, or by dragging it into the panel window)

  • move the file into the journal-profile-data folder you just created
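As an optional sanity check (a small sketch; FOLDER_NAME and the file name are the ones used earlier in this tutorial), you can verify that the file is in place before loading it:

[ ]:
# Optional: check that the affiliations CSV is where we expect it
csv_path = FOLDER_NAME + "/1_publications_affiliations.csv"
if not os.path.exists(csv_path):
    print("Please upload 1_publications_affiliations.csv into the '%s' folder first" % FOLDER_NAME)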

[4]:
affiliations = pd.read_csv(FOLDER_NAME+"/1_publications_affiliations.csv")
affiliations.head(10)
[4]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
0 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Center for Disease Neurogenomics, Icahn School... New York US-NY pub.1144816179 NaN Biao Zeng
1 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Pamela Sklar Division of Psychiatric Genomics,... New York US-NY pub.1144816179 NaN Biao Zeng
2 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Genetics and Genomic Sciences, I... New York US-NY pub.1144816179 NaN Biao Zeng
3 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Icahn Institute for Data Science and Genomic T... New York US-NY pub.1144816179 NaN Biao Zeng
4 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Psychiatry, Icahn School of Medi... New York US-NY pub.1144816179 NaN Biao Zeng
5 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Center for Disease Neurogenomics, Icahn School... New York US-NY pub.1144816179 NaN Jaroslav Bendl
6 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Pamela Sklar Division of Psychiatric Genomics,... New York US-NY pub.1144816179 NaN Jaroslav Bendl
7 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Genetics and Genomic Sciences, I... New York US-NY pub.1144816179 NaN Jaroslav Bendl
8 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Icahn Institute for Data Science and Genomic T... New York US-NY pub.1144816179 NaN Jaroslav Bendl
9 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Psychiatry, Icahn School of Medi... New York US-NY pub.1144816179 NaN Jaroslav Bendl

Basic stats about affiliations

  • count how many affiliation statements there are in total

  • count how many affiliations have a GRID ID

  • count how many unique GRID IDs we have in total

[5]:
#
# segment the affiliations dataset
affiliations = affiliations.fillna('')
affiliations_with_grid = affiliations.query("aff_id != ''")
affiliations_without_grid = affiliations.query("aff_id == ''")
#
# save
save(affiliations_without_grid, "4_institutions_without_grid.csv")
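Optionally (a minimal sketch using the dataframes defined above), we can print the three counts as plain numbers before charting them:

[ ]:
# Optional: print the raw counts
print("Affiliations in total (non unique):", len(affiliations))
print("Affiliations with a GRID ID:", len(affiliations_with_grid))
print("Unique GRID IDs:", affiliations_with_grid['aff_id'].nunique())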
[6]:
# build a summary barchart

df = pd.DataFrame({
    'measure' : ['Affiliations in total (non unique)', 'Affiliations with a GRID ID', 'Affiliations with a GRID ID (unique)'],
    'count' : [len(affiliations), len(affiliations_with_grid), affiliations_with_grid['aff_id'].nunique()],
})
px.bar(df, x="measure", y="count", title=f"Affiliations stats")

Enriching the unique affiliations (GRIDs list) with pubs count and authors count

We want a table with the following columns

  • grid ID

  • city

  • country

  • country code

  • name

  • tot_pubs

  • tot_affiliations

NOTE: tot_affiliations counts ‘authorships’, i.e. authors in the context of each publication.

For our analysis we can start from the affiliations_with_grid dataframe.

[7]:
affiliations_with_grid.head(5)
[7]:
aff_city aff_city_id aff_country aff_country_code aff_id aff_name aff_raw_affiliation aff_state aff_state_code pub_id researcher_id first_name last_name
0 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Center for Disease Neurogenomics, Icahn School... New York US-NY pub.1144816179 Biao Zeng
1 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Pamela Sklar Division of Psychiatric Genomics,... New York US-NY pub.1144816179 Biao Zeng
2 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Genetics and Genomic Sciences, I... New York US-NY pub.1144816179 Biao Zeng
3 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Icahn Institute for Data Science and Genomic T... New York US-NY pub.1144816179 Biao Zeng
4 New York 5128581.0 United States US grid.59734.3c Icahn School of Medicine at Mount Sinai Department of Psychiatry, Icahn School of Medi... New York US-NY pub.1144816179 Biao Zeng
[8]:
gridaffiliations = affiliations_with_grid.copy()
#
# group by GRID ID and add new column with affiliations count
gridaffiliations["tot_affiliations"] = gridaffiliations.groupby('aff_id')['aff_id'].transform('count')
#
# add new column with publications count, for each GRID
gridaffiliations["tot_pubs"] = gridaffiliations.groupby(['aff_id'])['pub_id'].transform('nunique')
#
# remove unnecessary columns (including the raw affiliation strings, so the duplicate removal below leaves one row per GRID ID)
gridaffiliations = gridaffiliations.drop(['aff_city_id', 'aff_raw_affiliation', 'pub_id', 'researcher_id', 'first_name', 'last_name'], axis=1).reset_index(drop=True)
#
# remove duplicate rows
gridaffiliations.drop_duplicates(inplace=True)
#
# update columns order
gridaffiliations = gridaffiliations[[ 'aff_id', 'aff_name','aff_city',
                                     'aff_country', 'aff_country_code',  'aff_state',
                                     'aff_state_code', 'tot_affiliations',  'tot_pubs']]
#
# sort
gridaffiliations = gridaffiliations.sort_values(['tot_affiliations', 'tot_pubs'], ascending=False)
#
#
# That's it! Let's see the result
gridaffiliations.head()
[8]:
aff_id aff_name aff_city aff_country aff_country_code aff_state aff_state_code tot_affiliations tot_pubs
94 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
267 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
1071 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
1108 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
1110 grid.38142.3c Harvard University Cambridge United States US Massachusetts US-MA 2291 370
[9]:
# save the data
save(gridaffiliations, "4_institutions_with_grid_with_metrics.csv")
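As an optional sanity check (a small sketch reusing the gridaffiliations dataframe built above), we can verify that the enriched table contains one row per GRID ID:

[ ]:
# Optional: the enriched table should have one row per GRID ID
print("Rows in table:  ", len(gridaffiliations))
print("Unique GRID IDs:", gridaffiliations['aff_id'].nunique())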

A Couple of Data Visualizations

[12]:
threshold = 5000

px.scatter(gridaffiliations[:threshold],
           x="aff_name", y="tot_pubs",
           color="aff_country",
           size='tot_affiliations',
           hover_name="aff_name",
           height=800,
           hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
           title=f"Top {threshold} affiliations by number of publications (country segmentation)")
[14]:
threshold = 5000

px.scatter(gridaffiliations[:threshold],
           x="tot_affiliations", y="tot_pubs",
           color="aff_country",
           hover_name="aff_name",
           hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
           title=f"Top {threshold} affiliations: number of publications vs number of authors")


Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
