Journal Profiling Part 4: Institutions¶
This Python notebook shows how to use the Dimensions Analytics API to extract information about the authors affiliations data linked to publications of a specific journal.
This tutorial is the fourth of a series that uses the data extracted in order to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.
In this notebook we are going to
Load the publications data extracted in part 1
Focus on institutions linked to a journal: measure how often do they appear, how many affiliated authors they have etc..
[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[2]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if not 'google.colab' in sys.modules:
# make js dependecies local / needed by html exports
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
import getpass
KEY = getpass.getpass(prompt='API Key: ')
dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
KEY = ""
dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file
Finally, let’s set up a folder to store the data we are going to extract:
[3]:
# create output data folder
FOLDER_NAME = "journal-profile-data"
if not(os.path.exists(FOLDER_NAME)):
os.mkdir(FOLDER_NAME)
def save(df,filename_dot_csv):
df.to_csv(FOLDER_NAME+"/"+filename_dot_csv, index=False)
Institutions Contributing to a Journal¶
From our original publications dataset, we now want to look at institutions i.e.
getting the full list of institutions (also ones without a GRID, for subsequent analysis) linked to the journal
publications count
authors count
Load previously saved affiliations data¶
Let’s reload the affiliations data from Part-1 of this tutorial series.
NOTE If you are using Google Colab or don’t have the data available, just do the following: * open up the ‘Files’ panel in Google Colab and create a new folder journal-profile-data
* grab this file, unzip it, open the enclosed folder and upload the file called 1_publications_affiliations.csv
to Google Colab (‘Upload’ menu or also by dragging then inside the
panel window) * move the file inside the journal-profile-data
folder you just created
[4]:
affiliations = pd.read_csv(FOLDER_NAME+"/1_publications_affiliations.csv")
affiliations.head(10)
[4]:
aff_city | aff_city_id | aff_country | aff_country_code | aff_id | aff_name | aff_raw_affiliation | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Center for Disease Neurogenomics, Icahn School... | New York | US-NY | pub.1144816179 | NaN | Biao | Zeng |
1 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Pamela Sklar Division of Psychiatric Genomics,... | New York | US-NY | pub.1144816179 | NaN | Biao | Zeng |
2 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Department of Genetics and Genomic Sciences, I... | New York | US-NY | pub.1144816179 | NaN | Biao | Zeng |
3 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Icahn Institute for Data Science and Genomic T... | New York | US-NY | pub.1144816179 | NaN | Biao | Zeng |
4 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Department of Psychiatry, Icahn School of Medi... | New York | US-NY | pub.1144816179 | NaN | Biao | Zeng |
5 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Center for Disease Neurogenomics, Icahn School... | New York | US-NY | pub.1144816179 | NaN | Jaroslav | Bendl |
6 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Pamela Sklar Division of Psychiatric Genomics,... | New York | US-NY | pub.1144816179 | NaN | Jaroslav | Bendl |
7 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Department of Genetics and Genomic Sciences, I... | New York | US-NY | pub.1144816179 | NaN | Jaroslav | Bendl |
8 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Icahn Institute for Data Science and Genomic T... | New York | US-NY | pub.1144816179 | NaN | Jaroslav | Bendl |
9 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Department of Psychiatry, Icahn School of Medi... | New York | US-NY | pub.1144816179 | NaN | Jaroslav | Bendl |
Basic stats about affiliations¶
count how many affiliations statements in total
count how many affiliations have a GRID ID
count how many unique GRID IDs we have in total
[5]:
#
# segment the affiliations dataset
affiliations = affiliations.fillna('')
affiliations_with_grid = affiliations.query("aff_id != ''")
affiliations_without_grid = affiliations.query("aff_id == ''")
#
# save
save(affiliations_without_grid, "4_institutions_without_grid.csv")
[6]:
# build a summary barchart
df = pd.DataFrame({
'measure' : ['Affiliations in total (non unique)', 'Affiliations with a GRID ID', 'Affiliations with a GRID ID (unique)'],
'count' : [len(affiliations), len(affiliations_with_grid), affiliations_with_grid['aff_id'].nunique()],
})
px.bar(df, x="measure", y="count", title=f"Affiliations stats")
Enriching the unique affiliations (GRIDs list) with pubs count and authors count¶
We want a table with the following columns
grid ID
city
country
country code
name
tot_pubs
tot_affiliations
NOTE: tot_affiliations is a list of ‘authorships’ (ie authors in the context of each publication).
For out analysis we can start from the gridaffiliations
dataframe.
[7]:
affiliations_with_grid.head(5)
[7]:
aff_city | aff_city_id | aff_country | aff_country_code | aff_id | aff_name | aff_raw_affiliation | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Center for Disease Neurogenomics, Icahn School... | New York | US-NY | pub.1144816179 | Biao | Zeng | |
1 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Pamela Sklar Division of Psychiatric Genomics,... | New York | US-NY | pub.1144816179 | Biao | Zeng | |
2 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Department of Genetics and Genomic Sciences, I... | New York | US-NY | pub.1144816179 | Biao | Zeng | |
3 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Icahn Institute for Data Science and Genomic T... | New York | US-NY | pub.1144816179 | Biao | Zeng | |
4 | New York | 5128581.0 | United States | US | grid.59734.3c | Icahn School of Medicine at Mount Sinai | Department of Psychiatry, Icahn School of Medi... | New York | US-NY | pub.1144816179 | Biao | Zeng |
[8]:
gridaffiliations = affiliations_with_grid.copy()
#
# group by GRIDID and add new column with affiliations count
gridaffiliations["tot_affiliations"] = gridaffiliations.groupby('aff_id')['aff_id'].transform('count')
#
# add new column with publications count, for each GRID
gridaffiliations["tot_pubs"] = gridaffiliations.groupby(['aff_id'])['pub_id'].transform('nunique')
#
# remove unnecessary columns
gridaffiliations = gridaffiliations.drop(['aff_city_id', 'pub_id', 'researcher_id', 'first_name', 'last_name'], axis=1).reset_index(drop=True)
#
# remove duplicate rows
gridaffiliations.drop_duplicates(inplace=True)
#
# update columns order
gridaffiliations = gridaffiliations[[ 'aff_id', 'aff_name','aff_city',
'aff_country', 'aff_country_code', 'aff_state',
'aff_state_code', 'tot_affiliations', 'tot_pubs']]
#
# sort
gridaffiliations = gridaffiliations.sort_values(['tot_affiliations', 'tot_pubs'], ascending=False)
#
#
# That's it! Let's see the result
gridaffiliations.head()
[8]:
aff_id | aff_name | aff_city | aff_country | aff_country_code | aff_state | aff_state_code | tot_affiliations | tot_pubs | |
---|---|---|---|---|---|---|---|---|---|
94 | grid.38142.3c | Harvard University | Cambridge | United States | US | Massachusetts | US-MA | 2291 | 370 |
267 | grid.38142.3c | Harvard University | Cambridge | United States | US | Massachusetts | US-MA | 2291 | 370 |
1071 | grid.38142.3c | Harvard University | Cambridge | United States | US | Massachusetts | US-MA | 2291 | 370 |
1108 | grid.38142.3c | Harvard University | Cambridge | United States | US | Massachusetts | US-MA | 2291 | 370 |
1110 | grid.38142.3c | Harvard University | Cambridge | United States | US | Massachusetts | US-MA | 2291 | 370 |
[9]:
# save the data
save(gridaffiliations, "4_institutions_with_grid_with_metrics.csv")
Couple of Dataviz¶
[12]:
treshold = 5000
px.scatter(gridaffiliations[:treshold],
x="aff_name", y="tot_pubs",
color="aff_country",
size='tot_affiliations',
hover_name="aff_name",
height=800,
hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
title=f"Top {treshold} affiliations by number of publications (country segmentation)")
[14]:
treshold = 5000
px.scatter(gridaffiliations[:treshold],
x="tot_affiliations", y="tot_pubs",
color="aff_country",
hover_name="aff_name",
hover_data=['aff_id', 'aff_name', 'aff_city', 'aff_country', 'tot_affiliations', 'tot_pubs'],
title=f"Top {treshold} affiliations: Number of Publications VS Number of Authors")
Note
The Dimensions Analytics API allows to carry out sophisticated research data analytics tasks like the ones described on this website. Check out also the associated Github repository for examples, the source code of these tutorials and much more.