
Enrich text with Field of Research (FoR) codes

In this tutorial, we show how to use the Dimensions Analytics API classify function to retrieve suggested Field of Research (FoR) codes for a set of documents.

FoR classification is a component of the Australian and New Zealand Standard Research Classification system. It allows all R&D activity to be categorized using a single system. The system is hierarchical, with major fields subdivided into minor fields.

For more information on FoR classification, please see this article. For a complete list of all FoR categories in Dimensions, please visit this link.

The Dimensions API classifier suggests category classifications based on input title and abstract text. Category classifications allow analysts to gain insight into the area(s) of focus of a set of documents. For example, given a set of documents, how many of the documents relate to ‘Artificial Intelligence and Image Processing’ (FoR code 0801)? How does this compare to the number of documents related to ‘Statistics’ (FoR code 0104)?

A sample set of publications

Our starting point is a sample set of 100 titles/abstracts belonging to publications that were submitted to arxiv.org on June 7th 2021. At the time of writing, these publications have not yet been indexed by Dimensions, and thus have not yet been assigned categories.

Below, we show how to enrich this dataset with FoR codes.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[10]:
!pip install dimcli tqdm -U --quiet

import dimcli
from dimcli.utils import *

import sys, json, time, os
import pandas as pd
from tqdm.notebook import tqdm as pbar

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.9.1)
Connected to: https://app.dimensions.ai - DSL v1.31
Method: dsl.ini file

1. Loading the sample text

First, we are going to load the sample dataset ‘arxiv_june7.csv’.

[2]:
df = pd.read_csv('http://api-sample-data.dimensions.ai/data/arxiv_june7.csv')

Let’s preview the contents of the file:

[3]:
df.head()
[3]:
title abstract
0 SIMONe View Invariant Temporally Abstracted ... To help agents reason about scenes in terms ...
1 How planets grow by pebble accretion IV Envel... The amount of nebular gas that a planet can ...
2 GAN Cocktail mixing GANs without dataset access Today s generative models are capable of syn...
3 it CosmoPower emulating cosmological p... We present it CosmoPower a suite of neu...
4 A Matrix Trickle Down Theorem on Simplicial Co... We show that the natural Glauber dynamics mi...

As we see above, each document is represented by title text and abstract text. We will use the Dimensions API’s classify function to retrieve suggested FoR codes for this text.

2. FoR Classification

The classify function has three inputs: title, abstract, and system.

  • title: the document’s title text

  • abstract: the document’s abstract text

  • system: the desired classification system for output

In our case, we’re going to use the FoR classification system. For details on other available classification schemes, please see this article.
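As an illustration of how the three inputs fit together, here is a minimal sketch of a helper that assembles a classify query string for a single document. The helper name `build_classify_query` is hypothetical, and the escaping rule (backslash-escaping quotes and backslashes, collapsing line breaks) is an assumption rather than something confirmed in this tutorial; check the DSL documentation before relying on it.

```python
def build_classify_query(title: str, abstract: str, system: str = "FOR") -> str:
    """Return a DSL classify() query string for one document (hypothetical helper)."""

    def escape(text: str) -> str:
        # Assumption: backslashes and double quotes need a backslash escape
        # inside a double-quoted DSL string; whitespace runs become one space.
        text = text.replace("\\", "\\\\").replace('"', '\\"')
        return " ".join(text.split())

    return (
        f'classify(title="{escape(title)}", '
        f'abstract="{escape(abstract)}", system="{system}")'
    )


query = build_classify_query(
    title='A "simple" recipe',
    abstract="Line one.\nLine two.",
)
print(query)
```

The resulting string could be passed to `dsl.query(...)` in the same way as the query built in the loop below. Escaping matters mainly for raw text that still contains quotation marks; the sample dataset used here appears to have had its punctuation stripped already.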

To classify each document, we iterate through the dataframe one row at a time and input the title and abstract text for each document. A list of suggested FoR codes is saved into a column called ‘FoR_Categories’.

We pause for one second after each iteration, which helps us stay under the maximum query quota (~30 queries per minute).

[4]:
df['FoR_Categories'] = ''
[13]:
for index, row in pbar(df.iterrows(), total=df.shape[0]):
    # Build the classify query from this row's title and abstract
    search_string = f"""
                    classify(title="{row.title}", abstract="{row.abstract}", system="FOR")
            """
    a = dsl.query(search_string, verbose=False)
    # Collect the names of the suggested FoR categories
    list_of_categories = [x['name'] for x in a.json['FOR']]
    # Write the list into a single cell with .at (avoids chained assignment)
    df.at[index, 'FoR_Categories'] = list_of_categories
    time.sleep(1)
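
The fixed one-second pause is the simplest way to space out queries. As a hedged alternative, the sketch below sleeps only for the time still needed to respect a per-minute limit. The `Throttle` class is a hypothetical helper, not part of Dimcli, and the figure of 30 queries per minute is taken from the approximate quota quoted above.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, max_per_minute: float):
        # Minimum number of seconds between two consecutive calls
        self.min_interval = 60.0 / max_per_minute
        self._last_call = None

    def wait(self) -> float:
        """Sleep only as long as needed; return the number of seconds slept."""
        now = time.monotonic()
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


throttle = Throttle(max_per_minute=30)  # approximate quota, an assumption
print(throttle.min_interval)  # seconds between consecutive queries
```

Calling `throttle.wait()` at the start of each loop iteration, in place of `time.sleep(1)`, would account for the time each query already takes.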

Now that we have classified our documents, let’s take a look at the updated dataframe:

[14]:
df.head(20)
[14]:
title abstract FoR_Categories Counts
0 SIMONe View Invariant Temporally Abstracted ... To help agents reason about scenes in terms ... [0801 Artificial Intelligence and Image Proces... 1
1 How planets grow by pebble accretion IV Envel... The amount of nebular gas that a planet can ... [] 0
2 GAN Cocktail mixing GANs without dataset access Today s generative models are capable of syn... [] 0
3 it CosmoPower emulating cosmological p... We present it CosmoPower a suite of neu... [0104 Statistics] 1
4 A Matrix Trickle Down Theorem on Simplicial Co... We show that the natural Glauber dynamics mi... [] 0
5 Mean Shifted Contrastive Loss for Anomaly Dete... Deep anomaly detection methods learn represe... [] 0
6 Equivariant Graph Neural Networks for D Macro... Representing and reasoning about D structur... [0801 Artificial Intelligence and Image Proces... 1
7 Non Abelian Hybrid Fracton Orders We introduce lattice gauge theories which de... [] 0
8 A Helix Down the Throat Internal Tidal Effects Tidal effects in capped geometries computed ... [] 0
9 Balancing Garbage Collection vs I O Amplificat... Key value KV separation is a technique tha... [] 0
10 NTIRE Challenge on Burst Super Resolution... This paper reviews the NTIRE challenge o... [] 0
11 The quantum p spin glass model A user manua... We study a large N bosonic quantum mechani... [0105 Mathematical Physics, 0206 Quantum Physics] 2
12 Tunable Trajectory Planner Using G Curves Trajectory planning is commonly used as part... [] 0
13 MemStream Memory Based Anomaly Detection in M... Given a stream of entries over time in a mul... [0801 Artificial Intelligence and Image Proces... 2
14 Negative times of the Davey Stewartson integr... We use example of the Davey Stewartson hier... [] 0
15 Khovanov homology for links in thickened multi... We define a variant of Khovanov homology for... [] 0
16 Learning without Knowing Unobserved Context i... In this paper we consider a transfer Reinfo... [0801 Artificial Intelligence and Image Proces... 1
17 Pattern Recognition on Oriented Matroids Symm... We consider decompositions of topes of the o... [] 0
18 Counterfactual Maximum Likelihood Estimation f... Although deep learning models have driven st... [0801 Artificial Intelligence and Image Proces... 2
19 A Simple Recipe for Multilingual Grammatical E... This paper presents a simple recipe to train... [2004 Linguistics, 1702 Cognitive Sciences] 2

Above, we see that some document texts did not receive any suggested FoR codes, while others received multiple codes. The classifier is programmed to assign each document 0-4 FoR codes. It may fail to classify or produce unexpected results when working with longer texts.

3. Number of FoR categories per document

Below, we count the categories assigned to each document and plot how often each count occurs, using pandas' built-in plotting (which relies on matplotlib):

[15]:
# Count the number of suggested FoR categories for each document
df['Counts'] = df['FoR_Categories'].apply(len)

df['Counts'].value_counts().plot.bar(rot=0,
                                     title='Frequency of FoR counts',
                                     ylabel='Occurrences',
                                     xlabel='Number of FoR categories')
[15]:
<AxesSubplot:title={'center':'Frequency of FoR counts'}, xlabel='Number of FoR categories', ylabel='Occurrences'>
[Figure: bar chart titled 'Frequency of FoR counts' (x-axis: number of FoR categories; y-axis: occurrences)]

Here, we see that many of the documents were not assigned to any FoR categories.

Of the documents that were successfully classified, the majority received only one FoR assignment.
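
To put numbers on these observations, the following sketch tallies how many documents received each number of categories and what share received none. It uses only the standard library and toy data; the function name `summarize_counts` is hypothetical, and in the notebook one would pass `df['FoR_Categories']` instead of the sample list.

```python
from collections import Counter


def summarize_counts(categories_per_doc):
    """Return (histogram of list lengths, share of documents with no category)."""
    histogram = Counter(len(categories) for categories in categories_per_doc)
    total = sum(histogram.values())
    unclassified_share = histogram[0] / total if total else 0.0
    return dict(sorted(histogram.items())), unclassified_share


# Toy data for illustration only, not results from the tutorial dataset
sample = [
    ["0801 Artificial Intelligence and Image Processing"],
    [],
    [],
    ["0105 Mathematical Physics", "0206 Quantum Physics"],
]
histogram, unclassified_share = summarize_counts(sample)
print(histogram)
print(f"{unclassified_share:.0%} of documents received no FoR code")
```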

4. Top FoR categories by document count

Below, we plot the top 10 FoR categories by document count.

[16]:
all_codes = pd.Series([category for item in df.FoR_Categories for category in item])
code_counts = all_codes.value_counts()
[20]:
code_counts[:10].plot.barh(rot=0,
                           title='Top FoR categories',
                           ylabel='Category',
                           xlabel='Number of documents')
[20]:
<AxesSubplot:title={'center':'Top FoR categories'}, ylabel='Number of documents'>
[Figure: horizontal bar chart titled 'Top FoR categories', showing the number of documents per category]

‘Artificial Intelligence and Image Processing’ is the most common FoR category, followed by ‘Statistics’.
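
Because the FoR system is hierarchical, the same results can also be rolled up to the major-field level. The sketch below assumes that each label has the form `<4-digit code> <name>`, as in the output shown earlier, and that the first two digits of a four-digit code identify its parent field. The helper names are hypothetical and the data is a toy sample.

```python
from collections import Counter


def split_for_label(label: str):
    """Split '0104 Statistics' into ('0104', 'Statistics')."""
    code, _, name = label.partition(" ")
    return code, name


def division_counts(categories_per_doc):
    """Count documents per two-digit major field."""
    counts = Counter()
    for categories in categories_per_doc:
        # A set ensures each document counts once per major field
        divisions = {split_for_label(label)[0][:2] for label in categories}
        counts.update(divisions)
    return counts


# Toy data for illustration only, not results from the tutorial dataset
sample = [
    ["0801 Artificial Intelligence and Image Processing", "0104 Statistics"],
    ["0105 Mathematical Physics", "0206 Quantum Physics"],
    [],
]
counts = division_counts(sample)
print(counts.most_common())
```

In the notebook, `division_counts(df['FoR_Categories'])` would produce the equivalent tally for the classified sample.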


Conclusions

In this notebook we have shown how to use the Dimensions Analytics API classify function to retrieve suggested Field of Research (FoR) codes for a set of documents.

For more background, see the classify function documentation, as well as the other functions available via the Dimensions API.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
