
Topic Modeling Analysis for a Set of Publications: Basic Workflow

This Python notebook shows how to use the Dimensions Analytics API to extract concepts from publications and use them as the basis for more advanced topic analysis tasks.

Note this tutorial is best experienced using Google Colab.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[1]:
!pip install dimcli plotly -U --quiet

import dimcli
from dimcli.shortcuts import *
import json
import sys
import pandas as pd
import plotly.express as px  # plotly>=4.8.1
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7.1)
Connected to: https://app.dimensions.ai - DSL v1.25
Method: dsl.ini file

1. Creating a Dataset: e.g. Publications from a Research Organization

The dataset we’ll be using in this tutorial contains publications from the University of Toronto.

By using the parameters below, we can restrict the dataset further by date (publication year) and by subject area (Field of Research code).

After extracting the publication records, we will use the as_dataframe_concepts method to unpack the concepts/scores data into a format that is easier to analyse. For more background, see also the Working with concepts in the Dimensions API tutorial.

[2]:
#@markdown ## Define the criteria for extracting publications

#@markdown A GRID organization from https://grid.ac/institutes
GRIDID = "grid.17063.33" #@param {type:"string"}

#@markdown The start/end year of publications used to extract concepts
YEAR_START = 1996 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2020 #@param {type: "slider", min: 1950, max: 2020}

#@markdown The Field Of Research (FOR) category (None = all)
CATEGORY = "2203 Philosophy"  #@param ['None', '0101 Pure Mathematics', '0102 Applied Mathematics', '0103 Numerical and Computational Mathematics', '0104 Statistics', '0105 Mathematical Physics', '0201 Astronomical and Space Sciences', '0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics', '0203 Classical Physics', '0204 Condensed Matter Physics', '0205 Optical Physics', '0206 Quantum Physics', '0299 Other Physical Sciences', '0301 Analytical Chemistry', '0302 Inorganic Chemistry', '0303 Macromolecular and Materials Chemistry', '0304 Medicinal and Biomolecular Chemistry', '0305 Organic Chemistry', '0306 Physical Chemistry (incl. Structural)', '0307 Theoretical and Computational Chemistry', '0399 Other Chemical Sciences', '0401 Atmospheric Sciences', '0402 Geochemistry', '0403 Geology', '0404 Geophysics', '0405 Oceanography', '0406 Physical Geography and Environmental Geoscience', '0499 Other Earth Sciences', '0501 Ecological Applications', '0502 Environmental Science and Management', '0503 Soil Sciences', '0599 Other Environmental Sciences', '0601 Biochemistry and Cell Biology', '0602 Ecology', '0603 Evolutionary Biology', '0604 Genetics', '0605 Microbiology', '0606 Physiology', '0607 Plant Biology', '0608 Zoology', '0699 Other Biological Sciences', '0701 Agriculture, Land and Farm Management', '0702 Animal Production', '0703 Crop and Pasture Production', '0704 Fisheries Sciences', '0705 Forestry Sciences', '0706 Horticultural Production', '0707 Veterinary Sciences', '0799 Other Agricultural and Veterinary Sciences', '0801 Artificial Intelligence and Image Processing', '0802 Computation Theory and Mathematics', '0803 Computer Software', '0804 Data Format', '0805 Distributed Computing', '0806 Information Systems', '0807 Library and Information Studies', '0899 Other Information and Computing Sciences', '0901 Aerospace Engineering', '0902 Automotive Engineering', '0903 Biomedical Engineering', '0904 Chemical Engineering', '0905 Civil Engineering', '0906 Electrical and Electronic Engineering', '0907 Environmental Engineering', '0908 Food Sciences', '0909 Geomatic Engineering', '0910 Manufacturing Engineering', '0911 Maritime Engineering', '0912 Materials Engineering', '0913 Mechanical Engineering', '0914 Resources Engineering and Extractive Metallurgy', '0915 Interdisciplinary Engineering', '0999 Other Engineering', '1001 Agricultural Biotechnology', '1002 Environmental Biotechnology', '1003 Industrial Biotechnology', '1004 Medical Biotechnology', '1005 Communications Technologies', '1006 Computer Hardware', '1007 Nanotechnology', '1099 Other Technology', '1101 Medical Biochemistry and Metabolomics', '1102 Cardiorespiratory Medicine and Haematology', '1103 Clinical Sciences', '1104 Complementary and Alternative Medicine', '1105 Dentistry', '1106 Human Movement and Sports Science', '1107 Immunology', '1108 Medical Microbiology', '1109 Neurosciences', '1110 Nursing', '1111 Nutrition and Dietetics', '1112 Oncology and Carcinogenesis', '1113 Ophthalmology and Optometry', '1114 Paediatrics and Reproductive Medicine', '1115 Pharmacology and Pharmaceutical Sciences', '1116 Medical Physiology', '1117 Public Health and Health Services', '1199 Other Medical and Health Sciences', '1201 Architecture', '1202 Building', '1203 Design Practice and Management', '1205 Urban and Regional Planning', '1299 Other Built Environment and Design', '1301 Education Systems', '1302 Curriculum and Pedagogy', '1303 Specialist Studies In Education', '1399 Other Education', '1401 Economic Theory', '1402 Applied 
Economics', '1403 Econometrics', '1499 Other Economics', '1501 Accounting, Auditing and Accountability', '1502 Banking, Finance and Investment', '1503 Business and Management', '1504 Commercial Services', '1505 Marketing', '1506 Tourism', '1507 Transportation and Freight Services', '1601 Anthropology', '1602 Criminology', '1603 Demography', '1604 Human Geography', '1605 Policy and Administration', '1606 Political Science', '1607 Social Work', '1608 Sociology', '1699 Other Studies In Human Society', '1701 Psychology', '1702 Cognitive Sciences', '1799 Other Psychology and Cognitive Sciences', '1801 Law', '1899 Other Law and Legal Studies', '1901 Art Theory and Criticism', '1902 Film, Television and Digital Media', '1903 Journalism and Professional Writing', '1904 Performing Arts and Creative Writing', '1905 Visual Arts and Crafts', '1999 Other Studies In Creative Arts and Writing', '2001 Communication and Media Studies', '2002 Cultural Studies', '2003 Language Studies', '2004 Linguistics', '2005 Literary Studies', '2099 Other Language, Communication and Culture', '2101 Archaeology', '2102 Curatorial and Related Studies', '2103 Historical Studies', '2199 Other History and Archaeology', '2201 Applied Ethics', '2202 History and Philosophy of Specific Fields', '2203 Philosophy', '2204 Religion and Religious Studies', '2299 Other Philosophy and Religious Studies']

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

print(f"""You selected:\n->{GRIDID}\n->{YEAR_START}-{YEAR_END}\n->{CATEGORY}""")

from IPython.core.display import display, HTML
display(HTML('-><br /><a href="{}">Preview {} in Dimensions.ai &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID)))


# subject area query component
if CATEGORY == "None":
  subject_area_q =  ""
else:
  subject_area_q = f""" and category_for.name="{CATEGORY}" """

query = f"""
search publications
    where research_orgs.id = "{GRIDID}"
    and year in [{YEAR_START}:{YEAR_END}]
    {subject_area_q}
    return publications[id+concepts_scores+year]
"""


print("===\nQuery:\n", query)
print("===\nRetrieving Publications.. ")
data = dsl.query_iterative(query)

print("===\nExtracting Concepts.. ")
concepts = data.as_dataframe_concepts()
concepts_unique = concepts.drop_duplicates("concept")[['concept', 'frequency', 'score_avg']]
print("===\nConcepts Found (total):", len(concepts))
print("===\nUnique Concepts Found:", len(concepts_unique))
print("===\nConcepts with frequency major than 1:", len(concepts_unique.query("frequency > 1")))
You selected:
->grid.17063.33
->1996-2020
->2203 Philosophy
===
Query:

search publications
    where research_orgs.id = "grid.17063.33"
    and year in [1996:2020]
     and category_for.name="2203 Philosophy"
    return publications[id+concepts_scores+year]

===
Retrieving Publications..
1000 / ...
1000 / 1246
1246 / 1246
===
Records extracted: 1246
===
Extracting Concepts..
===
Concepts Found (total): 32949
===
Unique Concepts Found: 13586
===
Concepts with frequency greater than 1: 2994
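
Before moving on, it can help to peek at the unpacked dataframe itself. The snippet below is an optional sanity check, not part of the original workflow; it assumes the column names used later in this notebook (year, concept, score, frequency, score_avg) are present, which is how the rest of the tutorial uses them.

# Optional sanity check (a sketch, not part of the original workflow):
# each row is one publication/concept pair; 'score' is the per-document
# relevancy, while 'frequency' and 'score_avg' are aggregated over the
# whole dataset.
concepts[['year', 'concept', 'score', 'frequency', 'score_avg']].head(10)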

2. Concepts: exploratory analysis

The goal of this section is to understand which settings work best for extracting useful concepts.
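
As a starting point, it can be useful to look at the distribution of the frequency and score_avg values, to get a feel for sensible cut-offs before setting the parameters below. This is just an optional sketch reusing the concepts_unique dataframe built in section 1.

# Optional: summary statistics to help choose FREQ_MIN / FREQ_MAX / SCORE_MIN
# (a sketch, not part of the original workflow)
concepts_unique[['frequency', 'score_avg']].describe(percentiles=[.5, .75, .9, .99])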

[3]:
#@markdown ## Define the best parameters to isolate 'interesting' concepts
#@markdown Frequency: how many documents include a concept (100 = no upper limit).
#@markdown Tip: concepts with very high frequencies tend to be common words, so it is useful to exclude them.
FREQ_MIN = 2 #@param {type: "slider", min: 1, max: 10, step:1}
FREQ_MAX = 70 #@param {type: "slider", min: 10, max: 100, step:10}
#@markdown ---
#@markdown Score: the average relevancy score of concepts, for the dataset we extracted above.
#@markdown This value tends to be a good indicator of 'interesting' concepts.
SCORE_MIN = 0.4  #@param {type: "slider", min: 0, max: 1, step:0.1}
#@markdown ---
#@markdown Select how many concepts to include in the visualization
MAX_CONCEPTS = 200 #@param {type: "slider", min: 20, max: 1000, step:10}

if FREQ_MAX == 100:
  FREQ_MAX = 100000000
print(f"""You selected:\n->{MAX_CONCEPTS}\n->{FREQ_MIN}-{FREQ_MAX}\n->{SCORE_MIN}""")

filtered_concepts = concepts_unique.query(f"""frequency >= {FREQ_MIN} & frequency <= {FREQ_MAX} & score_avg >= {SCORE_MIN} """)\
                    .sort_values(["score_avg", "frequency"], ascending=False)[:MAX_CONCEPTS]

px.scatter(filtered_concepts,
           x="concept",
           y="frequency",
           height=700,
           color="score_avg",
           size="score_avg")
You selected:
->200
->2-70
->0.4

2.1 Building a word cloud for exploratory analysis

Word clouds are often an effective way of summarising the relative importance of keywords.

The wordcloud Python library can be used to turn our concepts list into an image.

[4]:
!pip install wordcloud --quiet
import matplotlib.pyplot as plt
from wordcloud import WordCloud
[5]:
# take top N concepts
topn = filtered_concepts[['concept', 'frequency']][:300]

# generate_from_frequencies expects a dictionary mapping words to frequencies, eg
# {
#    'study': 5707,
#    'patients': 3579,
#    ... etc...
# }
# so we reshape the dataframe records into that format
d = {x['concept']: x['frequency'] for x in topn.to_dict(orient="records")}

wordcloud = WordCloud(width=900, height=600).generate_from_frequencies(d)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig("topics_wordcloud.png", dpi=300)
[Output: word cloud image generated from the top concepts]

3. Trend analysis - ‘emerging’ new topics each year

In this section we will look at the concept distribution over the years. In particular, we will focus on the ‘new’ concepts in each year - i.e. concepts that have never appeared in previous years.

This can be easily achieved by adding a new column first_year, marking the earliest publication year a concept was found in.

[6]:
# enrich initial dataframe by adding 'first_year'
concepts["first_year"] = concepts.groupby('concept')['year'].transform('min')
concepts_unique = concepts.drop_duplicates("concept")[['concept', 'frequency', 'score_avg', 'first_year']]

#@markdown ## Emerging concepts: parameters
#@markdown Start year of the publications/concepts to include
YEAR_START = 2010 #@param {type:"integer"}
#@markdown How many 'emerging' concepts to retain, for each year
CONCEPTS_PER_YEAR = 5 #@param {type:"integer"}


#
# note: we reuse the frequency/score thresholds above in order to focus on concepts of interest
emerging_concepts = concepts_unique.query(f"""first_year >= {YEAR_START}""")\
                    .query(f"""frequency >= {FREQ_MIN} & frequency <= {FREQ_MAX} & score_avg >= {SCORE_MIN} """)\
                    .sort_values(["score_avg", "frequency"], ascending=False)\
                    .groupby("first_year").head(CONCEPTS_PER_YEAR)

px.treemap(emerging_concepts,
           path=['first_year', 'concept'],
           values='frequency',
           color="concept",
           color_discrete_sequence=px.colors.diverging.Spectral,
           height=600,
          )

4. Trend Analysis: topics growth

If we focus on concepts that appear across multiple years, we can examine their growth or decline by comparing their frequency (or score) across those years.

[7]:
#
# create dataframe with statistics for all years
#

concepts_by_year = concepts.copy()
concepts_by_year['frequency_year'] = concepts_by_year.groupby(['year', 'concept'])['concept'].transform('count')
concepts_by_year['score_avg_year'] = concepts_by_year.groupby(['year', 'concept'])['score'].transform('mean').round(5)

concepts_by_year = concepts_by_year.drop_duplicates(subset=['concept', 'year'])\
                    [['concept', 'frequency', 'score_avg', 'year', 'frequency_year', 'score_avg_year']]


#
# select some concepts of interest and plot their frequency over the years
#

FOCUS = ["history", "theory", "political philosophy", "mind"]

data = concepts_by_year[concepts_by_year['concept'].isin(FOCUS)].sort_values("year")

px.line(data,
        x="year",
        y="frequency_year",
        color="concept",
       )

4.1 A more generalised method

We can generalise the method above to look at the topics that grew the most between any two chosen years.

[8]:
YEAR_START = 2014 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2019 #@param {type: "slider", min: 1950, max: 2020}


#
# from two years, find all common concepts, calculate growth for all of them
#

first_year = concepts_by_year.query(f"year == {YEAR_START}")
second_year = concepts_by_year.query(f"year == {YEAR_END}")
two_years = pd.concat([first_year, second_year])  # DataFrame.append is deprecated in recent pandas
common_concepts = two_years[two_years.duplicated('concept')].reset_index()[['concept', 'frequency', 'score_avg']]

def calculate_difference(concept, column="frequency_year"):
    """Compare a numeric col value across two years
    eg `calculate_difference("mind", "frequency_year")`
    """
    return second_year.query(f"concept == '{concept}'")[column].values[0] - first_year.query(f"concept == '{concept}'")[column].values[0]

common_concepts['freq_growth'] = common_concepts.apply(lambda x: calculate_difference(x['concept'], "frequency_year"), axis=1)
common_concepts['score_growth'] = common_concepts.apply(lambda x: calculate_difference(x['concept'], "score_avg_year"), axis=1)

common_concepts.sort_values("freq_growth", ascending=False).head()

[8]:
concept frequency score_avg freq_growth score_growth
15 conditions 56 0.11374 11 0.03175
20 law 61 0.36898 8 -0.21461
27 questions 130 0.32561 7 -0.08450
18 theory 220 0.30251 7 -0.08608
1 nature 85 0.29454 6 0.04134

We can build a simple scatter plot using the freq_growth and score_growth values just calculated.

  • by using the MIN_SCORE parameter we can filter out concepts whose overall score_avg is too low (note: this value is the average across all years, see section 1 above)

  • by mapping score_avg to the dots color, we can quickly visually identify concepts with a high score in the graph

[9]:
MIN_SCORE = 0.2

px.scatter(common_concepts.query(f"score_avg >= {MIN_SCORE}"),
           x="freq_growth", y="score_growth",
           color="score_avg",
           hover_name="concept", hover_data=['concept', 'frequency', 'score_avg'],
           height=600, title=f"Concepts growth from {YEAR_START} to {YEAR_END}"
          )

5. Conclusion

In this tutorial we have demonstrated how to process publication concepts obtained using the Dimensions Analytics API.

Concepts are normalised keywords found in documents. By playing with parameters like frequency and average score, it is possible to highlight interesting concepts programmatically and use them to carry out semantic analyses of the documents dataset they derive from.

For more information on this topic, see also the official documentation on concepts and the Working with concepts in the Dimensions API tutorial.



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out the associated GitHub repository for examples, the source code of these tutorials and much more.
