
Identification of reviewers for funders: Globally and among Panels

This notebook shows how to use the expert identification workflow available via the Dimensions Analytics API, with a focus on the use case of funders identifying reviewers.

For more general expert identification using the Dimensions Analytics API, see Expert Identification with the Dimensions API - An Introduction.

[1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Jan 25, 2022
==

Introduction: Use Cases For Reviewer Identification

A very common workflow for research funders is to solicit grant applications, then to have members of the research community critique and score those applications to determine what is ultimately funded.

A challenge for this workflow revolves around finding and assigning the appropriate reviewers to each application. This task is most often handled by relying on past experience and insider knowledge of a specific research area. This approach is not always reliable, is difficult to impart when program staff changes, and risks forming an insular subset of the research community.

To help address this problem, Dimensions has developed a suite of tools for identifying researchers relevant to scientific research proposals. These include a GUI available inside the Dimensions app, as well as a programmatic approach using the Dimensions Analytics API, which is described here.


The overall workflow to identify reviewers is described in more detail in Expert Identification with the Dimensions API - An Introduction. At its core, the process has two steps:

  1. Use the Dimensions Analytics API to extract key concepts from grant application text

  2. Use those concepts to search the Dimensions Analytics API for relevant researchers

These two steps will be performed and only commented on lightly in this notebook. For more details, see the example linked above.
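As a compact preview (the full, runnable versions appear in the cells below), the two steps map onto two API calls. This is a minimal sketch only: the placeholder text is an assumption, and the dsl client and dsl_escape helper are set up in the Prerequisites section.

# Step 1: extract key concepts from the grant application text
res = dsl.query('extract_concepts("<grant application text here>")')
concepts = res['extracted_concepts']

# Step 2: search for researchers matching those concepts
concepts_string = " OR ".join('"%s"' % c for c in concepts)
experts = dsl.query(f"""
    identify experts
        from concepts "{dsl_escape(concepts_string)}"
    return experts[basics]
""")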

Kinds of identification

There are two “kinds” of reviewer identification that can be performed:

  1. GLOBAL IDENTIFICATION - Where reviewers are identified from the body of all researchers - for example, find the best reviewers for an application worldwide, or from a specific country

Example 1: A funder is trying to find reviewers for a set of applications, but only wishes to use people they have already funded (e.g. people they already know, and do not need to worry about vetting). In that case, the body of researchers they are looking for is all researchers who have been funded by them in the past.

Example 2: A funder is working in partnership with a collection of universities to perform a body of research. As part of that partnership, personnel from these universities will review grant applications. In that case, the body of researchers they are looking for is all researchers currently affiliated with any of a list of universities.

  2. PANEL IDENTIFICATION - Where reviewers are identified from a list of known researchers - for example, find the best reviewers for an application from a panel of reviewers

Example: A funder has a pre-assembled panel of a dozen researchers which it intends to use to review a body of a dozen applications, each of which needs three reviewers. A total of 12 x 3 = 36 assignments need to be made out of 12 x 12 = 144 possible reviewer-application pairings. These assignments should align application topic and reviewer expertise, which is a very arduous task for a human comparing blocks of text (e.g. application abstract vs. reviewer CV) across all 144 combinations.

Each of these will be described in subsections below.

NOTE In neither of these scenarios is this utility meant to supplant the need for human work and produce a fully automated workflow. Instead, this approach is meant to assist human selection of reviewers, reducing workload by narrowing the field to a subset of relevant reviewers. Final selections can then be made by program staff.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the ‘Getting Started’ tutorial.

To generate an API key from the Dimensions webapp, go to “My Account”. Under “General Settings” there is an “API key” section where there is a “Create API key” button. More information on this can be found here.

[2]:
!pip install dimcli -U --quiet

import dimcli
from dimcli.utils import *

import sys, json, time, os
import pandas as pd

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v0.9.6)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0
Method: dsl.ini file

1. Loading and preprocessing text data

Regardless of whether you are doing Global or Panel identification, the first step is to load the grant application text data. A few examples are hard-coded here and loaded into a pandas dataframe. It should be trivial to alter this load step to read in data from an external source instead.

1.1 Loading from File (placeholder code)

The general data structure needs some kind of document identifier and at least one text field. The code below relies on having the columns:

  • doc_id

  • title

  • abstract

[3]:
## Placeholder cell for uploading data from a file. Uncomment lines below and run if uploading file
# from google.colab import files
# uploaded = files.upload()
[4]:
## Uncomment below and point to uploaded file to load into pandas dataframe
# import io
# texts = pd.read_excel(io.BytesIO(uploaded['your_uploaded_excel.xlsx']))
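If you are running locally rather than in Colab, a minimal alternative sketch is to read a file straight from disk; the filename applications.csv is a hypothetical placeholder for a file containing the three columns listed above.

## Alternative for local environments: load from a CSV file instead
# texts = pd.read_csv('applications.csv', usecols=['doc_id', 'title', 'abstract'])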

1.2 Loading Sample Data

The sample data below is used for example purposes only. If using your own data, delete or comment out this cell.

[5]:
grant1_identifier="A001"
grant1_title="Computational design of sustainable hydrogenation systems via a novel combination of data science, optimization, and ab initio methods"
grant1_abstract="""Sustainable, safe, and process-intensified hydrogenation technologies are essential for distributed, small-scale, and on-demand manufacturing of chemicals and fuels from shale gas and biomass, upgrading carbon dioxide to useful organic chemicals, and upcycling plastic waste. New technological developments in this area would contribute to increasing international competitiveness of the U.S. chemical manufacturing industries and meeting relevant U.N. goals on sustainable development. A promising chemistry to this end is catalytic transfer hydrogenation (CTH), a process that is carried out using hydrogen donors instead of pure molecular H2, thereby offering a safe, H2- and potentially CO2-free hydrogenation technology. A critical step towards deploying CTH is to optimally design the underlying process, a challenging task because atomic-scale information such as reaction thermodynamics, pathways, and rates have implications at the microscopic (e.g., product yield) and macroscopic levels (e.g., process economics). The research vision of this project is to develop and apply novel computational tools, in synergy with experiments, to design CTH processes by integrating information and decisions across the different size scales. In parallel with this research, the educational vision of this project is to promote computational thinking and programming literacy at various levels of STEM education. These two skills are well-recognized as being essential for the next generation of science and engineering innovators to tackle emerging grand challenges in the energy, health, and environmental spheres. This CAREER proposal specifically aims to computationally design a vapor-phase transition-metal catalyzed CTH reaction system of a model oxygenate, viz. acrolein, which is the smallest molecule having both C-C and C-O unsaturation; as such, it can be considered a model representative of biomass-derived molecules and functionalized intermediates in the chemical industry. Designing the acrolein CTH reaction system ultimately requires identifying the optimal donor-catalyst combination that maximizes the yield of a desired product, e.g., hydrogenation selectivity of acrolein to propanal versus propenol. To this end, a novel computational framework that integrates density functional theory (DFT), informatics, machine learning, and several other process systems engineering computational methods including nonlinear optimization and advanced data sampling via reinforcement and transfer learning, will be developed as part of this research to (i) build Gaussian Process surrogate models, (ii) formulate and solve coverage-cognizant microkinetic models, and (iii) solve reaction system optimization problems. This framework will allow the PI to address a critical gap in the fundamental mechanistic elucidation and multiscale design of acrolein CTH reaction systems and thereby identify the optimal donor-catalyst combination from a representative subset of donors and transition metal catalysts. A well-integrated educational program will be developed to target different age groups at Lehigh University and the broader Lehigh valley. 
This includes engaging high-school and undergraduate students in cutting-edge research at the intersection of data science and catalysis, developing online interactive visualization-based modules to explain high-school science and undergraduate engineering concepts via enquiry-based learning, and developing and offering an interdisciplinary elective to train chemical engineers in the burgeoning area of data science and machine learning. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."""

grant2_identifier="A002"
grant2_title="Simulating catalysis: Multiscale embedding of machine learning potentials"
grant2_abstract="""In the recent decades, computer simulations have become an essential part of the molecular scientist's toolbox. However, as for any computational method, molecular simulations require a compromise between speed and precision. The most precise techniques apply principles of quantum mechanics (QM) to the molecular systems and can precisely describe processes involving changes in the electronic structure, such as the breaking and forming of chemical bonds. However, they require tremendous computer resources, being prohibitively costly even for systems containing only several hundreds of atoms. On the other extreme are highly simplified \"Molecular Mechanics\" (MM) methods that ignore the quantum nature of molecules and instead describe the atoms as charged "balls" of certain size connected with springs representing the chemical bonds. The core limitation of MM is its inability to describe breaking/forming of chemical bonds, therefore making it unsuitable for simulating chemical reactions. This drawback motivated the invention of combined "multiscale" models that rely on precise but expensive QM calculations to describe the part of the simulation system where the chemical reaction takes place, while treating the rest of the system with an efficient MM method. This "Quantum Mechanics/Molecular Mechanics" approach (QM/MM), honoured by the Nobel Prize in Chemistry in 2013, is now the state-of-the-art simulation technique for reactions in complex environments, such as those happening inside living organisms. Such simulations are important to understand and design catalysts, which increase the rate of chemical reactions (and can thereby reduce the amount of energy and resources required to produce molecules). However, QM/MM calculations are still only as fast as the QM method used, limiting dramatically the precision and timescale of the simulations. A completely different approach is to employ techniques from the rapidly evolving field of machine learning (ML) and construct a method that can learn and then predict the outcome of a QM calculation. Once properly trained, an ML model can provide results with QM quality, but several orders of magnitude faster. However, ML models are still significantly slower than MM ones. Therefore, a multiscale "ML/MM" model would still offer huge savings of computer time compared to pure ML simulations. Unfortunately, however, existing ML training schemes are only suitable for calculations in gas phase and cannot take into account the presence of an MM environment. The goal of the proposed research project is to develop a novel multiscale embedding approach that will allow the use of ML models as part of a ML/MM scheme. This will enable molecular simulations of unprecedented precision on processes with high complexity without limiting the detailed exploration of molecular conformations. To achieve this goal, we will take advantage of recent advances in machine learning and understanding of intermolecular interactions to develop a specialised ML workflow that predicts the interaction energy between the molecule described by ML and the MM environment. The workflow will be implemented as an open, publicly available software package that allows to train ML/MM models and run ML/MM molecular dynamics simulations of complex chemical processes, such as catalysed reactions. 
We expect this package to be readily adopted by a wide community of computational chemists working on enzymatic reactions, homo/heterogeneous catalysis and generally on processes in condensed phases, aided by specific training materials and workshops that we will provide. This will allow, for example, the development efficient computational workflows to understand and help design catalysts for more environmentally friendly production of desired molecules."""

texts=pd.DataFrame(data=[[grant1_identifier,grant1_title,grant1_abstract],
                         [grant2_identifier,grant2_title,grant2_abstract]],
                   columns=['doc_id','title','abstract'])


1.3 Preprocess data and extract concepts

Additional text cleaning steps may be required if input data is messy. Details of text cleaning are beyond the scope of this tutorial.

It is worth mentioning that certain special characters may cause the Dimensions Analytics API to fail. When in doubt, remove all non-alphanumeric characters and ensure text is utf-8 encoded.

[6]:
# It is easiest to extract concepts from a single text field. Advanced use cases
# might handle title and abstract concepts separately to produce different results.
texts['text']=texts['title'].fillna('')+' '+texts['abstract'].fillna('')

#simple regex to remove most non-alphanumeric, non-space, non-punctuation characters
texts['text']=texts['text'].str.replace(r'[^a-zA-Z0-9 .,]', ' ', regex=True)

#force text encoding to utf-8 (decode back to str, so the query below receives
#text rather than a bytes literal)
texts['text']=texts['text'].str.encode('utf-8', errors='ignore').str.decode('utf-8')

#query for concepts
for idx,series in texts.iterrows():
  res = dsl.query(f"""extract_concepts("{series['text']}")""")
  concepts=  res['extracted_concepts']
  texts.loc[idx,'concepts']=';'.join(concepts)


#format concepts into python list
texts['concepts']=texts['concepts'].str.split(';')

#it is generally desirable to narrow the list of concepts to the top concepts to reduce noise.
#Concepts are returned ranked by relevance, so the top N most relevant concepts are used
topN_concepts=20
texts['concepts']=texts['concepts'].str[:topN_concepts]

#drop the now superfluous text column
texts=texts.drop(columns=['text'])

#display data preview
texts.head()
[6]:
doc_id title abstract concepts
0 A001 Computational design of sustainable hydrogenat... Sustainable, safe, and process-intensified hyd... [hydrogenation technology, reaction system, ca...
1 A002 Simulating catalysis: Multiscale embedding of ... In the recent decades, computer simulations ha... [chemical bonds, chemical reactions, QM calcul...

With concepts in hand, we can now begin to identify reviewers.

2. The Two Use Cases - Recap

2.1 Global identification use case

This approach is most commonly used when potential reviewers aren’t limited to a pre-determined group of people. In this case, any researcher listed in Dimensions is a potential match. This is the only mode available in the GUI described above. While a global search casts a very wide net, it is possible to use additional data attached to researcher profiles in Dimensions to filter results to a narrower subset, for example:

  • Only researchers from a particular country

  • Only researchers from a particular organization

  • Only researchers that have been previously funded by a particular funder

A hedged sketch of one such filter appears at the end of section 3.

2.2 Panel identification use case

Sometimes reviewers must be identified from a specific list of researchers. This can be thought of as similar to the global search above, but attempting to identify reviewers from only a very particular subset of researchers.

Another common use case is to rank the best reviewers for a grant application among a preselected panel in order to assign reviewers to grant applications. If a body of multiple grants and review panelists is known, a “reviewer matrix” can be generated to visualize the best reviewers for each application, as well as any potential gaps in coverage.

NOTE (May 2021): Use of the Dimensions Analytics API to generate a reviewer matrix is still in beta and undergoing review. The overall workflow is likely to remain the same, but changes under the hood may produce slightly different results in the future.

3. Global identification

The basic query structure used to identify researchers based on concepts is laid out below.

By default the query returns only the top 20 researchers. Up to 200 researchers can be returned by paginating queries. It is generally desirable to filter these results in some way, which is covered in the advanced use cases laid out below.
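For instance, a hedged pagination sketch, assuming identify experts honours the same limit/skip syntax as other DSL queries (an assumption worth verifying against the current API documentation) and reusing the concepts_string built in the loop below:

# Hedged sketch: paginate beyond the default 20 experts using limit/skip
all_experts = pd.DataFrame()
for skip in (0, 100):
    q = f"""
        identify experts
            from concepts "{dsl_escape(concepts_string)}"
        return experts[basics] limit 100 skip {skip}
        """
    all_experts = pd.concat([all_experts, dsl.query(q).as_dataframe()])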

For more on Dimensions researcher data and what the returned fields mean, see the documentation for the researchers data source.

[7]:
# instantiate dataframe to hold results
results=pd.DataFrame()

#Loop through texts and identify researchers
for idx,series in texts.iterrows():
  concepts_string = " OR ".join(['"%s"' % x for x in series['concepts']])

  q = f"""
          identify experts
              from concepts "{dsl_escape(concepts_string)}"
          return experts[basics+extras]
          """

  res=dsl.query(q).as_dataframe()
  res['match_for_doc_id']=series['doc_id']
  results=pd.concat([results,res])  # DataFrame.append was removed in pandas 2.0

#rename 'id' column for clarity
results=results.rename(columns={'id':'researcher_id'})

#display data preview
results.head()
[7]:
current_research_org docs_found first_grant_year first_name first_publication_year researcher_id last_grant_year last_name last_publication_year orcid_id research_orgs score total_grants total_publications match_for_doc_id
0 grid.412392.f 9 2007.0 Patrick 2000 ur.013717027613.00 2025.0 Linke 2021 [0000-0003-0105-9947] [grid.452146.0, grid.412603.2, grid.8647.d, gr... 326.619473 11 150 A001
1 grid.19006.3e 8 1998.0 Panagiotis D 1994 ur.01332115004.71 2022.0 Christofides 2021 NaN [grid.19006.3e, grid.11047.33, grid.9909.9, gr... 309.619233 12 556 A001
2 grid.264756.4 8 1992.0 Mahmoud M 1986 ur.01301461257.28 2021.0 El-Halwagi 2021 [0000-0002-0020-2281] [grid.453681.d, grid.264756.4, grid.55460.32, ... 287.066961 24 528 A001
3 grid.147455.6 6 1980.0 Ignacio E 1978 ur.011034041563.31 2021.0 Grossmann 2021 [0000-0002-7210-084X] [grid.37172.30, grid.187073.a, grid.7445.2, gr... 228.951728 37 778 A001
4 grid.264756.4 5 2007.0 Efstratios N 1988 ur.011111004073.75 2023.0 Pistikopoulos 2021 NaN [grid.461183.9, grid.169077.e, grid.89336.37, ... 177.012200 13 608 A001
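The query above searches globally. As mentioned in section 2.1, results can also be narrowed. The sketch below is hedged: the where clause of identify experts filters the source documents (here, publications), so the research_orgs field is assumed to behave as it does in the publications data source, and the GRID identifier is a placeholder reused from the sample output in section 4 purely for illustration. It also reuses the concepts_string built in the loop above.

# Hedged sketch: restrict the global search to researchers publishing at one organization.
# The research_orgs field name and the GRID ID are illustrative assumptions.
q = f"""
    identify experts
        from concepts "{dsl_escape(concepts_string)}"
        using publications
        where research_orgs = "grid.34477.33"
    return experts[basics]
    """
res = dsl.query(q).as_dataframe()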

4. Panel Identification

The use of panels of researchers to review grant applications is widespread. Once a panel of researchers is identified, assignments must be made between individual researchers and individual applications. It is often the case that applications span a range of topics and subtopics, and that panelists may only have expertise in some of these areas. To conduct a fair review, it is necessary to align panelist expertise and application topic as closely as possible.

The Dimensions Analytics API includes utilities to make these assignments easier. It is possible to calculate a numeric score for how well a reviewer’s research profile aligns with a grant application’s text. By building a matrix of reviewer-application scores, it becomes easy to visualize the best assignments, as well as to identify possible gap areas.


This approach relies on Dimensions researcher profiles. For more on Dimensions researcher profiles generally and how to search for them, see here.

For more on how to find particular researcher profiles using the API, see this link or the tutorial Extracting researchers based on affiliations and publications history
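As a quick, hedged example of finding a panelist’s researcher ID by name (a sketch only; real names often need disambiguation using the affiliation fields returned, as covered in the tutorial linked above):

# Hedged sketch: find candidate researcher IDs for a known panelist name.
# Exact match on last_name only; inspect the results to disambiguate.
q = """search researchers
           where last_name = "Nørskov"
       return researchers[id+first_name+last_name+current_research_org]"""
candidates = dsl.query(q).as_dataframe()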

[10]:
#Curated list of panelists
panelist_researcher_ids = ["ur.011441227347.89",
                           "ur.01300337437.48",
                           "ur.01367255211.93",
                           "ur.01050122660.53",
                           "ur.01215263003.24"]

# instantiate dataframe to hold results
results=pd.DataFrame()

#iterate through all texts
for idx,series in texts.iterrows():
  concepts_string = " OR ".join(['"%s"' % x for x in series['concepts']])
  q= f"""
          identify experts
              from concepts "{dsl_escape(concepts_string)}"
              using publications
              where researchers in {json.dumps(panelist_researcher_ids)}
          return experts[basics+extras]
          """

  res=dsl.query(q).as_dataframe()
  res['match_for_doc_id']=series['doc_id']
  results=pd.concat([results,res])  # DataFrame.append was removed in pandas 2.0

#filter to panelist IDs only
results=results[results['id'].isin(panelist_researcher_ids)]

#rename columns for clarity
results=results.rename(columns={'id':'researcher_id',
                        'match_for_doc_id':'doc_id'})

#store researcher profile metadata separately
researcher_data=results.drop(columns=['score','doc_id']).drop_duplicates(['researcher_id']).set_index('researcher_id')

#pivot resulting scores into researcher-document matrix
matrix = results.pivot(index='researcher_id',columns='doc_id',values="score")

#normalize scores for each application so that the best match is always scored 1
for col in matrix.columns:
  matrix[col]=matrix[col]/matrix[col].max()

#fill missing values with 0
matrix=matrix.fillna(0)

#store grant column names to make pretty display
grant_cols=matrix.columns

#rejoin researcher metadata
matrix=matrix.join(researcher_data)

#Display matrix with colors
matrix.style.background_gradient(subset=grant_cols,cmap='Blues')
[10]:
  A001 A002 current_research_org docs_found first_grant_year first_name first_publication_year last_grant_year last_name last_publication_year orcid_id research_orgs total_grants total_publications
researcher_id                            
ur.01050122660.53 0.332839 0.240871 grid.34477.33 80 1983.000000 Charles T 1978 2023.000000 Campbell 2021 nan ['grid.253692.9', 'grid.411377.7', 'grid.148313.c', 'grid.5379.8', 'grid.451303.0', 'grid.5335.0', 'grid.89336.37', 'grid.34477.33'] 12 329
ur.011441227347.89 0.078677 0.113787 grid.410356.5 23 2000.000000 Cathleen M 1990 2025.000000 Crudden 2021 nan ['grid.431983.4', 'grid.27476.30', 'grid.5337.2', 'grid.28046.38', 'grid.17063.33', 'grid.23856.3a', 'grid.35403.31', 'grid.266820.8', 'grid.438548.6', 'grid.292836.4', 'grid.420469.d', 'grid.410356.5'] 60 218
ur.01215263003.24 1.000000 1.000000 grid.5170.3 281 2012.000000 Jens Kehlet 1977 2015.000000 Nørskov 2021 ['0000-0002-4427-7728'] ['grid.7491.b', 'grid.445003.6', 'grid.424590.e', 'grid.7048.b', 'grid.10548.38', 'grid.12082.39', 'grid.168010.e', 'grid.5170.3', 'grid.410387.9', 'grid.5117.2', 'grid.5371.0', 'grid.133342.4', 'grid.418028.7', 'grid.481554.9', 'grid.253264.4'] 2 756
ur.01300337437.48 0.086336 0.181127 grid.1957.a 32 2010.000000 Franziska 2005 2025.000000 Schoenebeck 2021 ['0000-0003-0047-0929'] ['grid.8547.e', 'grid.19006.3e', 'grid.11984.35', 'grid.417815.e', 'grid.5801.c', 'grid.1957.a'] 7 171
ur.01367255211.93 0.095684 0.053079 grid.411461.7 26 2018.000000 Konstantinos D 2009 2022.000000 Vogiatzis 2021 ['0000-0002-7439-3850'] ['grid.8127.c', 'grid.17635.36', 'grid.7892.4', 'grid.267305.5', 'grid.411461.7'] 1 63
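With the matrix in hand, a small follow-up sketch shows how to pull the top-scoring panelist for each application, or a full ranking, from the normalized score columns computed above:

# top-scoring panelist (researcher_id) for each application column
print(matrix[grant_cols].idxmax())

# full ranking of panelists per application, best match first
for col in grant_cols:
    print(col, matrix[col].sort_values(ascending=False).index.tolist())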
[11]:

# FINALLY - if you are using Google Colab, download the results files:

if 'google.colab' in sys.modules:
    from google.colab import files

    # temporarily save pandas dataframe as file in colab environment
    results.to_csv('results.csv')
    # download file to local machine
    files.download('results.csv')

Conclusions

In this notebook we have shown how to use the Dimensions Analytics API to identify researchers to serve as reviewers for grant funders.

Two general approaches have been presented: Global Identification and Panel Identification. In both cases, the final output is a list of researchers for each application. Additional outputs, like the reviewer matrix, can also be exported, as sketched below.
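For example, a minimal export sketch (the filename is a placeholder):

# save the reviewer-application score matrix built in section 4
matrix.to_csv('reviewer_matrix.csv')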

The selection of reviewers may be an iterative process, and it may be desirable to produce different results by tuning the number of concepts used, or by using a different pool of publications or grants to identify researchers. The tutorial Expert Identification with the Dimensions API - An Introduction contains many more examples of ways to fine-tune expert identification using the Dimensions Analytics API.

Finally, it is worth stressing that the outputs of reviewer identification should be taken with a grain of salt and reviewed by human eyes. The whole process is intended to aid and speed up pre-existing workflows, rather than replace them entirely.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out the associated GitHub repository for examples, the source code of these tutorials, and much more.
