Calculating the H-index of a researcher

This notebook shows how to use Python and the Dimensions Analytics API to calculate the H-index of a researcher.

Background

The h-index is an author-level metric that attempts to measure both the productivity and citation impact of the publications of a scientist or scholar. The index is based on the set of the scientist’s most cited papers and the number of citations that they have received in other publications.

A more precise definition:

The h-index is defined as the maximum value of h such that the given author/journal has published h papers that have each been cited at least h times.

How to calculate it:

Formally, if f is the function that gives the number of citations for each publication, we compute the h-index as follows. First, we order the values of f from largest to smallest. Then we look for the last position at which f is greater than or equal to the position; we call this position h. For example, if a researcher has 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h-index is 4 because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, the h-index is 3 because the 4th paper has only 3 citations (source: Wikipedia).
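The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration rather than part of the original notebook, and the function name `h_index_by_rank` is hypothetical. It sorts the citation counts from largest to smallest and keeps the last position whose count is at least as large as the position itself.

```python
def h_index_by_rank(citations):
    """Return the h-index for an iterable of citation counts (any order)."""
    ranked = sorted(citations, reverse=True)  # largest to smallest
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position  # this position still satisfies f >= position
        else:
            break  # counts only decrease from here on
    return h


# The two worked examples from the text
print(h_index_by_rank([10, 8, 5, 4, 3]))  # 4
print(h_index_by_rank([25, 8, 5, 3, 3]))  # 3
```

Sorting inside the function means the caller does not have to worry about the order of the input.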

[1]:

import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

[2]:

!pip install dimcli -U --quiet

import dimcli
import pandas as pd
import sys

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..

==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

Selecting a researcher

Let’s take a researcher, e.g. Michael Boutros (ur.01357111535.49), and save the researcher ID in a variable that can be referenced later.

Try modifying the researcher ID below to get different results!

[3]:

RESEARCHER = "ur.01357111535.49"

The H-Index function

The h-index function takes a list of citation counts, sorted from largest to smallest, and returns the h-index value as explained above:

[4]:

def the_H_function(sorted_citations_list, n=1):
    """Return the h-index for a list of citation counts [n1, n2, ..]
    sorted from largest to smallest: the last 1-based position whose
    value is greater than or equal to that position.

    eg
    >>> the_H_function([10, 8, 5, 4, 3])
    4
    >>> the_H_function([25, 8, 5, 3, 3])
    3
    >>> the_H_function([1000, 20])
    2
    """
    # Iterate instead of recursing so that long lists cannot hit
    # Python's recursion limit; n is the first position to test.
    for citations in sorted_citations_list:
        if citations < n:
            break
        n += 1
    return n - 1

The h-index function is generic: it accepts any list of citation counts, as long as the list is sorted from largest to smallest.
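Because the function stops at the first position whose count falls below the position, the order of the input matters. The sketch below illustrates this with a hypothetical stand-in, `scan_h`, which follows the same scanning logic without sorting anything. It is only meant to show the effect of input order and is not the notebook's own function.

```python
def scan_h(counts):
    """Walk the list in the given order and stop at the first position
    whose count is smaller than the position (no sorting is done here)."""
    h = 0
    for position, count in enumerate(counts, start=1):
        if count < position:
            break
        h = position
    return h


# Same counts as the first worked example, but in ascending order
unsorted_counts = [3, 4, 5, 8, 10]

print(scan_h(unsorted_counts))                        # 5 (too high)
print(scan_h(sorted(unsorted_counts, reverse=True)))  # 4 (correct)
```

If the citation counts come from a source that does not guarantee descending order, sorting them first with `sorted(counts, reverse=True)` avoids this problem.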

Getting citations data from Dimensions

To pass real-world data to the h-index function, we can use the Dimensions API to extract the citation counts of a researcher's publications (up to 1000, most cited first), like this:

[5]:

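def get_pubs_citations(researcher_id):
    # Results come back most-cited first, which is the order the_H_function expects.
    # A single query returns at most 1000 records.
    q = """search publications where researchers.id = "{}" return publications[times_cited] sort by times_cited limit 1000"""
    pubs = dsl.query(q.format(researcher_id))
    # Publications without a citation count are treated as having 0 citations
    return list(pubs.as_dataframe().fillna(0)['times_cited'])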
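The function above replaces missing `times_cited` values with 0 before returning the list. The same cleaning step can be sketched in plain Python on a hand-written list of records, without calling the API. The record shape used here is an assumption made for illustration, and the helper name `citations_from_records` is hypothetical.

```python
def citations_from_records(records):
    """Return citation counts sorted from largest to smallest,
    treating a missing or null 'times_cited' value as 0."""
    counts = [int(rec.get("times_cited") or 0) for rec in records]
    return sorted(counts, reverse=True)


# Hand-written sample data (not real API output)
sample_records = [
    {"times_cited": 12},
    {"times_cited": None},  # null value
    {},                     # field missing altogether
    {"times_cited": 40},
]

print(citations_from_records(sample_records))  # [40, 12, 0, 0]
```

Sorting explicitly here keeps the output in the order the h-index calculation expects, even if the records arrive in a different order.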

Wrapping things up

Finally, we combine the two functions to calculate the h-index for a specific researcher:

[6]:

print("H_index is:", the_H_function(get_pubs_citations(RESEARCHER)))

Returned Publications: 283 (total = 283)
Time: 0.60s
H_index is: 63

Where to find out more

Please have a look at the official documentation on searching for researchers for more information on this topic.

Note

The Dimensions Analytics API lets you carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials, and much more.