Journal Profiling Part 3: Funding¶
This Python notebook shows how to use the Dimensions Analytics API to extract grants data for a list of researchers, in order to enrich this list with funding-related metrics.
This tutorial is the third of a series that uses the data extracted in order to generate a ‘journal profile’ report. See the API Lab homepage for the other tutorials in this series.
In this notebook we are going to
Load the researchers impact metrics dataset previously extracted
Query the grants API using the researchers IDs, to obtain their total grants and funding amount
Enrich the original dataset with these new metrics
Build some simple visualizations in order to have a quick overview of the results
[20]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 24, 2022
==
Prerequisites¶
This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.
[1]:
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.utils import *
import os, sys, time, json
from tqdm.notebook import tqdm as progress
import pandas as pd
import plotly.express as px
from plotly.offline import plot
if 'google.colab' not in sys.modules:
    # make JS dependencies local / needed by HTML exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
    import getpass
    KEY = getpass.getpass(prompt='API Key: ')
    dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
    KEY = ""
    dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: https://app.dimensions.ai/api/dsl - DSL v2.0
Method: dsl.ini file
Finally, let’s set up a folder to store the data we are going to extract:
[2]:
# create output data folder
FOLDER_NAME = "journal-profile-data"

if not os.path.exists(FOLDER_NAME):
    os.mkdir(FOLDER_NAME)

def save(df, filename_dot_csv):
    df.to_csv(FOLDER_NAME + "/" + filename_dot_csv, index=False)
Load previously saved researchers data¶
NOTE If you are using Google Colab or don’t have the data available, just do the following:

* open up the ‘Files’ panel in Google Colab and create a new folder journal-profile-data
* grab this file, unzip it, open the enclosed folder and upload the file called 2_researchers_impact_metrics.csv to Google Colab (‘Upload’ menu or by dragging it inside the panel window)
* move the file inside the journal-profile-data folder you just created
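If you prefer to script these steps, here is a minimal sketch; the download address is a hypothetical placeholder, so replace DATA_ZIP_URL with the link to the zip file mentioned above:

import io, zipfile, urllib.request

# hypothetical placeholder: set this to the zip file linked above
DATA_ZIP_URL = "<url-of-the-zip-archive>"

# download the archive into memory
with urllib.request.urlopen(DATA_ZIP_URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

# extract just the CSV we need into the data folder created earlier
for name in archive.namelist():
    if name.endswith("2_researchers_impact_metrics.csv"):
        with open(FOLDER_NAME + "/2_researchers_impact_metrics.csv", "wb") as f:
            f.write(archive.read(name))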
[3]:
researchers = pd.read_csv(FOLDER_NAME+"/2_researchers_impact_metrics.csv")
[4]:
# TIP to speed this up I'm taking only the top 2000 researchers!
# for a full analysis, just comment out the next line
researchers = researchers[:2000]
[5]:
# note the extra column will be dropped after re-running
researchers.head(5)
[5]:
|   | researcher_id | pubs | full_name | citations_mean | altmetric_mean | last_pub_year | url |
|---|---|---|---|---|---|---|---|
| 0 | ur.0723426172.10 | 77 | Kari Stefansson | 201.493506 | 248.818182 | 2021 | NaN |
| 1 | ur.01277776417.51 | 51 | Unnur Thorsteinsdottir | 131.862745 | 201.666667 | 2021 | NaN |
| 2 | ur.0641525362.39 | 42 | Gonçalo R. Abecasis | 223.714286 | 167.809524 | 2021 | NaN |
| 3 | ur.011675454737.09 | 39 | Cornelia M van Duijn | 216.102564 | 303.179487 | 2021 | NaN |
| 4 | ur.01264737414.70 | 39 | Tõnu Esko | 244.564103 | 339.923077 | 2021 | NaN |
Adding another impact measure: funding¶
We want to enhance the researchers dataset by adding some funding information:
total amount of funding for each researcher
total number of grants
funding end date (useful to understand who is going to publish soon)
We’ll have to do it in two steps¶
1. Total grants and last grant year are easy to extract using the ‘researchers’ source. For example:
[6]:
%%dsldf
search researchers
where id in ["ur.0723426172.10", "ur.01277776417.51"]
return researchers[id+last_grant_year+total_grants]
Returned Researchers: 2 (total = 2)
Time: 0.52s
[6]:
|   | id | last_grant_year | total_grants |
|---|---|---|---|
| 0 | ur.0723426172.10 | 2018 | 8 |
| 1 | ur.01277776417.51 | 2020 | 1 |
2. Aggregated funding needs to be extracted from the ‘grants’ database. For example:
NOTE this kind of aggregate query will not return any data for a researcher that has no grants! (see the sketch after the example below)
[7]:
%%dsldf
search grants
where researchers.id in ["ur.0723426172.10", "ur.01277776417.51"]
return researchers[id] aggregate funding
Returned Researchers: 14
Time: 0.57s
[7]:
|   | count | funding | id |
|---|---|---|---|
| 0 | 8 | 17760368.0 | ur.0723426172.10 |
| 1 | 2 | 0.0 | ur.012662217132.90 |
| 2 | 1 | 0.0 | ur.010520247252.54 |
| 3 | 1 | 0.0 | ur.01121653260.31 |
| 4 | 1 | 0.0 | ur.01127672147.84 |
| 5 | 1 | 0.0 | ur.011316554452.18 |
| 6 | 1 | 0.0 | ur.012167132327.75 |
| 7 | 1 | 0.0 | ur.012237141052.77 |
| 8 | 1 | 0.0 | ur.012455520474.57 |
| 9 | 1 | 2592940.0 | ur.01274135317.46 |
| 10 | 1 | 47946.0 | ur.01277776417.51 |
| 11 | 1 | 0.0 | ur.013362317204.29 |
| 12 | 1 | 0.0 | ur.016620004052.08 |
| 13 | 1 | 0.0 | ur.057767065.52 |
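As the NOTE above implies, a researcher with zero grants is simply absent from the aggregate results. A minimal sketch for spotting such IDs, using dsl.query directly as elsewhere in this notebook:

# researchers with no grants won't appear in the aggregate results at all
ids = ["ur.0723426172.10", "ur.01277776417.51"]
res = dsl.query(f"""search grants
where researchers.id in {json.dumps(ids)}
return researchers[id] aggregate funding""")
returned = {r["id"] for r in res.researchers}
missing = [i for i in ids if i not in returned]  # IDs with no grants data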
Next: full data for step 1¶
What we’re gonna do:
loop over all researchers (400 at a time)
extract the total grants and last grant year information
collect all data into a single dataframe
finally, add the data to our ‘researchers’ spreadsheet
[8]:
llist = list(researchers['researcher_id'])

VERBOSE = False
#
#
query = """search researchers
where id in {}
return researchers[id+last_grant_year+total_grants] limit 1000"""
#
#
out = []
for chunk in progress(list(chunks_of(llist, 400))):
    q = dsl.query(query.format(json.dumps(chunk)), verbose=VERBOSE)
    out += q.researchers
    time.sleep(1)

# save to a df
df1 = pd.DataFrame().from_dict(out)

print("======\nResearchers used to query: ", len(llist))
print("======\nResearchers returned: ", len(df1))
======
Researchers used to query: 2000
======
Researchers returned: 2000
Save the data so that we can use it later, and preview the results.
[9]:
df1.head(10)
[9]:
|   | id | last_grant_year | total_grants |
|---|---|---|---|
| 0 | ur.0777517431.61 | 2015.0 | 10 |
| 1 | ur.07773763167.28 | NaN | 0 |
| 2 | ur.0776707354.45 | NaN | 0 |
| 3 | ur.07746151362.14 | 2017.0 | 7 |
| 4 | ur.0773127601.34 | NaN | 0 |
| 5 | ur.0772101217.51 | NaN | 0 |
| 6 | ur.0770340363.50 | 2023.0 | 11 |
| 7 | ur.0767751267.74 | 2025.0 | 10 |
| 8 | ur.0767750151.59 | 2014.0 | 1 |
| 9 | ur.0765447455.52 | 2017.0 | 8 |
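To actually persist this intermediate table, we can reuse the save helper defined at the top (the filename here is just a suggestion):

# optional: persist the step-1 results (hypothetical filename)
save(df1, "3_researchers_grants_info.csv")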
Next: full data for step 2¶
For this part we will do the following steps:
loop over all researchers, chunked in groups of 50
for each group, query grants matching those researchers IDs. Instead of returning grants data though, we facet on researchers and aggregate funding information
process the researchers data we got, so as to keep only the journal authors we are interested in (a sketch of this filtering appears after the bulk extraction below)
NOTE since we are querying for grants, each query can return many more researchers than the ones we are asking for, as the co-authors of a grant are also matched
Example query:
[10]:
%%dsldf
search grants
where researchers.id in ["ur.0723426172.10", "ur.01277776417.51"]
return researchers[id] aggregate funding
Returned Researchers: 14
Time: 0.57s
[10]:
|   | count | funding | id |
|---|---|---|---|
| 0 | 8 | 17760368.0 | ur.0723426172.10 |
| 1 | 2 | 0.0 | ur.012662217132.90 |
| 2 | 1 | 0.0 | ur.010520247252.54 |
| 3 | 1 | 0.0 | ur.01121653260.31 |
| 4 | 1 | 0.0 | ur.01127672147.84 |
| 5 | 1 | 0.0 | ur.011316554452.18 |
| 6 | 1 | 0.0 | ur.012167132327.75 |
| 7 | 1 | 0.0 | ur.012237141052.77 |
| 8 | 1 | 0.0 | ur.012455520474.57 |
| 9 | 1 | 2592940.0 | ur.01274135317.46 |
| 10 | 1 | 47946.0 | ur.01277776417.51 |
| 11 | 1 | 0.0 | ur.013362317204.29 |
| 12 | 1 | 0.0 | ur.016620004052.08 |
| 13 | 1 | 0.0 | ur.057767065.52 |
Here we ‘chunk’ using a smaller batch size because each query will return more researchers than the ones we ask for (since we are querying ‘grants’ this time).
[11]:
llist = list(researchers['researcher_id'])

VERBOSE = False
#
#
query = """search grants
where researchers.id in {}
return researchers[id] aggregate funding limit 1000"""
#
#
out = []
for batch in progress(list(chunks_of(llist, 50))):
    q = dsl.query(query.format(json.dumps(batch)), verbose=VERBOSE)
    out += q.researchers
    time.sleep(1)

# save to a df
df2 = pd.DataFrame().from_dict(out)

print("======\nResearchers used to query: ", len(llist))
print("======\nResearchers returned: ", len(df2))
======
Researchers used to query: 2000
======
Researchers returned: 38726
Save to csv and preview the data
[12]:
df2.head(10)
[12]:
|   | count | funding | id |
|---|---|---|---|
| 0 | 43 | 123777299.0 | ur.011026120037.74 |
| 1 | 38 | 35429387.0 | ur.010700366647.17 |
| 2 | 38 | 114643773.0 | ur.013723776517.34 |
| 3 | 38 | 26808630.0 | ur.015704744537.99 |
| 4 | 34 | 79909972.0 | ur.01367711150.31 |
| 5 | 33 | 57010534.0 | ur.016704245502.43 |
| 6 | 29 | 15551744.0 | ur.01313145634.66 |
| 7 | 27 | 18617422.0 | ur.010065171457.69 |
| 8 | 25 | 269879039.0 | ur.01104262534.26 |
| 9 | 24 | 193563223.0 | ur.011517303117.07 |
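As noted earlier, these results include many co-authors we never asked about. A minimal sketch for trimming the table down to our journal authors, plus an optional save with the helper from the top (the filename is just a suggestion):

# keep only rows for the researchers we actually queried for;
# co-authors matched via shared grants are dropped
wanted = set(researchers['researcher_id'])
df2_only_ours = df2[df2['id'].isin(wanted)]
# optional: persist the step-2 results (hypothetical filename)
save(df2_only_ours, "3_researchers_funding_aggregated.csv")

Note that the merge step below achieves the same effect implicitly, by looking up only the IDs from the original researchers table.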
Finally: let’s merge the new data into the ‘researchers-impact’ table¶
[13]:
#
# first let's replace all empty values with zeros
#
df1 = df1.fillna(0)
df2 = df2.fillna(0)
#
# helper functions
#
def grants_and_year_from_id(researcher_id):
    "try/except to prevent some parsing errors."
    try:
        x = int(df1[df1['id'] == researcher_id]['total_grants'])
    except:
        x = 0
    try:
        y = int(df1[df1['id'] == researcher_id]['last_grant_year'])
    except:
        y = 0
    return (x, y)

def total_funding_from_id(researcher_id):
    """Since the bulk querying returned several rows for the same researcher (due to the
    overlapping researcher ID lists used in the query filters), we take the max value."""
    return df2[df2['id'] == researcher_id]['funding'].max()
#
# merge the results found into the original researchers dataset
#
total_grants, last_grant_year, total_funding = [], [], []

for i, row in progress(researchers.iterrows(), total=researchers.shape[0]):
    res_id = row['researcher_id']
    data = grants_and_year_from_id(res_id)
    total_grants.append(data[0])
    last_grant_year.append(data[1])
    total_funding.append(total_funding_from_id(res_id))

researchers['total_grants'] = total_grants
researchers['last_grant_year'] = last_grant_year
researchers['total_funding'] = total_funding
#
# finally..
#
print("=======\nResearchers total:", len(researchers))
researchers.head(10)
=======
Researchers total: 2000
[13]:
|   | researcher_id | pubs | full_name | citations_mean | altmetric_mean | last_pub_year | url | total_grants | last_grant_year | total_funding |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ur.0723426172.10 | 77 | Kari Stefansson | 201.493506 | 248.818182 | 2021 | NaN | 8 | 2018 | 17760368.0 |
| 1 | ur.01277776417.51 | 51 | Unnur Thorsteinsdottir | 131.862745 | 201.666667 | 2021 | NaN | 1 | 2020 | NaN |
| 2 | ur.0641525362.39 | 42 | Gonçalo R. Abecasis | 223.714286 | 167.809524 | 2021 | NaN | 21 | 2023 | 78903893.0 |
| 3 | ur.011675454737.09 | 39 | Cornelia M van Duijn | 216.102564 | 303.179487 | 2021 | NaN | 6 | 2023 | 11051128.0 |
| 4 | ur.01264737414.70 | 39 | Tõnu Esko | 244.564103 | 339.923077 | 2021 | NaN | 10 | 2025 | 2073171.0 |
| 5 | ur.012264440652.05 | 36 | Jerome I. Rotter | 176.555556 | 175.638889 | 2021 | NaN | 16 | 2025 | 161922027.0 |
| 6 | ur.01247426430.47 | 36 | Gudmar Thorleifsson | 142.805556 | 164.694444 | 2021 | NaN | 0 | 0 | NaN |
| 7 | ur.01317433110.75 | 36 | Caroline Hayward | 256.166667 | 311.166667 | 2021 | NaN | 4 | 2024 | 9246651.0 |
| 8 | ur.0637651205.48 | 35 | Daniel F Gudbjartsson | 124.057143 | 211.142857 | 2021 | NaN | 0 | 0 | NaN |
| 9 | ur.01174076626.46 | 34 | André G Uitterlinden | 228.058824 | 274.558824 | 2021 | NaN | 1 | 2014 | 0.0 |
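For larger researcher lists the row-by-row lookups above can get slow. An equivalent vectorized sketch using pandas merges (an alternative approach, assuming the same df1 and df2 and a researchers table without the three new columns):

# vectorized alternative to the iterrows() loop above
grants_info = df1.rename(columns={"id": "researcher_id"})
funding_info = (df2.groupby("id", as_index=False)["funding"].max()
                .rename(columns={"id": "researcher_id", "funding": "total_funding"}))
merged = (researchers
          .merge(grants_info, on="researcher_id", how="left")
          .merge(funding_info, on="researcher_id", how="left"))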
Save the data / download it
[14]:
save(researchers, "3_researchers_impact_metrics_and_funding.csv")
Couple of Dataviz¶
[15]:
temp1 = researchers.sort_values(by=["total_funding"], ascending=False)[:100]
temp2 = researchers.sort_values(by=["last_grant_year"], ascending=False)[:200]
[16]:
px.scatter(temp1,
           x="full_name", y="total_funding",
           hover_name="full_name",
           size="total_grants",
           color="total_grants",
           hover_data=['total_funding', 'total_grants', 'last_grant_year', 'citations_mean', 'altmetric_mean', 'last_pub_year'],
           height=600,
           title="Researchers By Total Funding and Number of Grants")
[17]:
px.scatter(temp2,
           x="full_name", y="last_grant_year",
           hover_name="full_name",
           size="total_grants",
           color="total_grants",
           hover_data=['total_funding', 'total_grants', 'last_grant_year', 'citations_mean', 'altmetric_mean', 'last_pub_year'],
           height=600,
           marginal_y="histogram", title="Researchers By Grant End Year")
[18]:
px.scatter(temp2.sort_values("last_pub_year", ascending=False),
           x="full_name", y="last_grant_year",
           hover_name="full_name",
           size="total_grants",
           color="total_grants",
           hover_data=['total_funding', 'total_grants', 'last_grant_year', 'citations_mean', 'altmetric_mean', 'last_pub_year'],
           facet_col="last_pub_year",
           height=600,
           title="Researchers By Grant End Year & Last Publications Year")
[19]:
px.density_heatmap(temp2,
                   x="last_grant_year", y="last_pub_year",
                   marginal_x="histogram", marginal_y="histogram",
                   height=600,
                   title="Distribution of Grant End Year VS Last Publications Year")
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.