
Working with concepts in the Dimensions API

This Python notebook shows how to use the Dimensions Analytics API in order to extract concepts from documents and use them as the basis for more advanced topic-analysis tasks.

Prerequisites

This notebook assumes you have installed the Dimcli library and are familiar with the Getting Started tutorial.

[23]:
!pip install dimcli plotly -U --quiet

import dimcli
from dimcli.shortcuts import *
import json
import sys
import pandas as pd
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
#

print("==\nLogging in..")
# https://github.com/digital-science/dimcli#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  USERNAME = getpass.getpass(prompt='Username: ')
  PASSWORD = getpass.getpass(prompt='Password: ')
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
else:
  USERNAME, PASSWORD  = "", ""
  dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
==
Logging in..
Dimcli - Dimensions API Client (v0.7)
Connected to: https://app.dimensions.ai - DSL v1.25
Method: dsl.ini file

1. Background: What are Dimensions concepts?

Concepts are normalized noun phrases describing the main topics of a document (see also the official documentation). Concepts are automatically derived from document abstracts using machine learning techniques, and are ranked based on their relevance.

In the JSON data, concepts are available as an ordered list (=first items are the most relevant), including a relevance score. E.g. for the publication with ID ‘pub.1122072646’:

{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
                    {'concept': 'conversion', 'relevance': 0.055053872555463006},
                    {'concept': 'formic acid', 'relevance': 0.048144671935356},
                    {'concept': 'CO2', 'relevance': 0.032150964737607}
                    [........]
                    ],
 }

Please note that (as of version 1.25 of the DSL API) it is possible to return either concepts_scores or concepts with Publications queries, but only concepts with Grants queries.
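For instance, a quick way to inspect this data is to request the concepts_scores field for the publication shown above. The query below is a minimal sketch, assuming the dsl client created in the setup cell:

res = dsl.query("""search publications
                       where id = "pub.1122072646"
                   return publications[id+concepts_scores]""")
# preview the structure of the returned record
print(json.dumps(res.publications[0], indent=2)[:400])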

Extracting Concepts with Dimcli: the as_dataframe_concepts method

Since each publication has an associated list of concepts, in order to analyse them we need to ‘expand’ that list, so as to have a new table with one row per concept.

Dimcli provides a method that does exactly that: as_dataframe_concepts().

[24]:
q = """search publications for "graphene"
            where year=2019
       return publications[id+title+year+concepts_scores] limit 100"""
concepts = dsl.query(q).as_dataframe_concepts()
concepts.head(5)
Returned Publications: 100 (total = 100863)
[24]:
title id year concepts_count concept score frequency score_avg
0 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 non-woven fiber mats 0.78424 1 0.78424
1 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 polymer matrix 0.72761 3 0.64380
2 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 atom transfer radical polymerization 0.72668 1 0.72668
3 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 transfer radical polymerization 0.70781 1 0.70781
4 Smart Non-Woven Fiber Mats with Light-Induced ... pub.1123764889 2019 63 ray photoelectron spectroscopy 0.69869 3 0.64896

The as_dataframe_concepts() method internally uses pandas to explode the concepts list, and adds some extra metrics that are handy for further analyses:

  1. concepts_count: the total number of concepts for each single document. E.g., if a document has 35 concepts, concepts_count=35.

  2. frequency: how often a concept occurs within a dataset, i.e. how many documents include that concept. E.g., if a concept appears in 5 documents, frequency=5.

  3. score: the relevancy of a concept in the context of the document it is extracted from. Concept scores go from 0 (= not relevant) to 1 (= very relevant). NOTE: if concepts are returned without scores, these are generated automatically by normalizing each concept’s ranking against the total number of concepts for a single document. E.g., if a document has 10 concepts in total, the first concept gets score=1, the second score=0.9, etc.

  4. score_avg: the average (mean) value of all scores of a concept across multiple documents, within a given dataset.

As we will see, by sorting and segmenting data using these parameters, it is possible to filter out overly common concepts and highlight more interesting ones.
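As a rough illustration of how these metrics fit together, here is a pandas sketch over the concepts dataframe returned above (this is not Dimcli’s internal code, just an illustration of how similar values could be derived):

# frequency: in how many rows (i.e. documents) each concept appears
freq_check = concepts.groupby('concept')['concept'].transform('count')

# score_avg: mean relevance score of each concept across the whole dataset
score_avg_check = concepts.groupby('concept')['score'].transform('mean').round(5)

# fallback score when scores are not returned: normalize the rank against the
# document's concepts_count, e.g. with 10 concepts the first gets 1.0, the
# second 0.9, and so on (assuming a hypothetical 0-based rank_in_doc column):
# score = (concepts_count - rank_in_doc) / concepts_count

concepts[['concept', 'frequency', 'score_avg']].head()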

2. Data acquisition: querying publications with all the associated concepts

Let’s pull all publications from University College London classified with the FOR code “16 Studies in Human Society”.

Tip: you can experiment by changing the parameters below as you wish, e.g. by choosing another GRID organization.

[25]:
GRIDID = "grid.83440.3b" #@param {type:"string"}
FOR = "16 Studies in Human Society" #@param {type:"string"}

query = f"""
search publications
    where research_orgs.id = "{GRIDID}"
    and category_for.name= "{FOR}"
    return publications[id+doi+concepts_scores+year]
"""

print("===\nQuery:\n", query)
print("===\nRetrieving Publications.. ")
data = dsl.query_iterative(query)
===
Query:

search publications
    where research_orgs.id = "grid.83440.3b"
    and category_for.name= "16 Studies in Human Society"
    return publications[id+doi+concepts_scores+year]

===
Retrieving Publications..
1000 / ...
1000 / 8267
2000 / 8267
3000 / 8267
4000 / 8267
5000 / 8267
6000 / 8267
7000 / 8267
8000 / 8267
8267 / 8267
===
Records extracted: 8267

Let’s turn the results into a dataframe and have a quick look at the data. You’ll see a column concepts_scores that contains a list of concepts for each of the publications retrieved.

[26]:
pubs = data.as_dataframe()
pubs.head(5)
[26]:
doi concepts_scores id year
0 10.1080/01419870.2018.1544651 [{'concept': 'second-generation migrants', 're... pub.1110346939 2020
1 10.1007/s41109-019-0246-9 [{'concept': 'specific industrial sectors', 'r... pub.1123927393 2020
2 10.1140/epjds/s13688-020-00225-y [{'concept': 'development indicators', 'releva... pub.1126594457 2020
3 10.1186/s40163-020-00117-6 [{'concept': 'non-residential burglaries', 're... pub.1127684774 2020
4 10.1186/s40545-020-00215-5 [{'concept': 'gender equity', 'relevance': 0.7... pub.1127643349 2020

2.1 Extracting concepts

Now it’s time to start digging into the ‘concepts’ column of publications.

Each publication has an associated list of concepts, so in order to analyse them we need to ‘explode’ that list so as to have a new table with one row per concept.

[27]:
concepts = data.as_dataframe_concepts()

# for the unique-concepts table we don't need the year, publication ID etc. columns
concepts_unique = concepts.drop_duplicates("concept")[['concept', 'frequency', 'score_avg']]

print("===\nConcepts Found (total):", len(concepts))
print("===\nPreview:")
display(concepts)

print("===\nUnique Concepts Found:", len(concepts_unique))
print("===\nPreview:")
display(concepts_unique)
===
Concepts Found (total): 316998
===
Preview:
doi id year concepts_count concept score frequency score_avg
0 10.1080/01419870.2018.1544651 pub.1110346939 2020 50 second-generation migrants 0.69108 1 0.69108
1 10.1080/01419870.2018.1544651 pub.1110346939 2020 50 children of migrants 0.68700 1 0.68700
2 10.1080/01419870.2018.1544651 pub.1110346939 2020 50 Donald Trump’s election 0.68565 1 0.68565
3 10.1080/01419870.2018.1544651 pub.1110346939 2020 50 second-generation group 0.68345 1 0.68345
4 10.1080/01419870.2018.1544651 pub.1110346939 2020 50 feelings of exclusion 0.67114 1 0.67114
... ... ... ... ... ... ... ... ...
316993 10.1038/002374c0 pub.1032460376 1870 28 importance 0.01678 352 0.28769
316994 10.1038/002374c0 pub.1032460376 1870 28 matter 0.01657 124 0.26206
316995 10.1038/002374c0 pub.1032460376 1870 28 additional remarks 0.01414 1 0.01414
316996 10.1038/002374c0 pub.1032460376 1870 28 letter 0.01358 16 0.19310
316997 10.1038/002374c0 pub.1032460376 1870 28 engineers 0.01144 17 0.15923

316998 rows × 8 columns

===
Unique Concepts Found: 85647
===
Preview:
concept frequency score_avg
0 second-generation migrants 1 0.69108
1 children of migrants 1 0.68700
2 Donald Trump’s election 1 0.68565
3 second-generation group 1 0.68345
4 feelings of exclusion 1 0.67114
... ... ... ...
316977 same subjects 1 0.02285
316978 personal opinions 1 0.02257
316979 engineering colleges 1 0.02195
316990 serious objections 1 0.01875
316995 additional remarks 1 0.01414

85647 rows × 3 columns

3. Basic statistics about Publications / Concepts

In this section we’ll show how to get an overview of the concepts data we obtained.

These statistics are important because they will help us contextualize more in-depth analyses of the concepts data we’ll do later on.

3.1 Documents With concepts VS Without

You’ll soon discover that not all documents have associated concepts (e.g. because in some cases there is no abstract text to extract them from).

Let’s see how many:

[28]:
CONCEPTS_FIELD = "concepts_scores"

df = pd.DataFrame({
    'type': ['with_concepts', 'without_concepts'] ,
    'count': [pubs[CONCEPTS_FIELD].notnull().sum(), pubs[CONCEPTS_FIELD].isnull().sum()]
             })

px.pie(df,
       names='type', values="count",
      title = "How many documents have concepts?")

3.2 Yearly breakdown of Documents With concepts VS Without

It’s also useful to look at whether the ratio of with/without concepts is stable across the years.

To this end we can use:

  • the publications id column to count the total number of publications per year

  • the concepts column to count the ones that have concepts

[29]:
temp1 = pubs.groupby('year', as_index=False).count()[['year', 'id', CONCEPTS_FIELD]]
temp1.rename(columns={'id': "documents", CONCEPTS_FIELD: "with_concepts"}, inplace=True)

# reorder cols/rows
temp1 = temp1.melt(id_vars=["year"],
         var_name="type",
        value_name="count")

px.bar(temp1, title="How many documents have concepts? Yearly breakdown.",
       x="year", y="count",
       color="type",
       barmode="group")

3.3 Concepts frequency

It is useful to look at how many concepts appear more than once in our dataset. As you’ll discover, it is often the case that only a subset of concepts appear more than once. That is because documents tend to be highly specialised, hence a large number of the extracted noun phrases are not very common.

By looking at these basic frequency statistics we can determine a useful frequency threshold for our analysis - i.e. to screen out concepts that are not representative of the overall dataset we have.

Tip: change the value of THRESHOLD to explore the data.

[30]:
THRESHOLD = 2

df = pd.DataFrame({
    'type': [f'freq<{THRESHOLD}',
             f'freq={THRESHOLD}',
             f'freq>{THRESHOLD}'] ,
    'count': [concepts_unique.query(f"frequency < {THRESHOLD}")['concept'].count(),
              concepts_unique.query(f"frequency == {THRESHOLD}")['concept'].count(),
              concepts_unique.query(f"frequency > {THRESHOLD}")['concept'].count()]
             })

px.pie(df,
       names='type', values="count",
      title = f"Concepts with a frequency major than: {THRESHOLD}")
[31]:
temp = concepts_unique.groupby('frequency', as_index=False)['concept'].count()
temp.rename(columns={'concept' : 'concepts with this frequency'}, inplace=True)
px.scatter(temp,
           x="frequency",
           y="concepts with this frequency",
          title="Distribution of concepts frequencies")

3.4 Yearly breakdown: unique VS repeated concepts

It is also useful to look at the total number of concepts per year VS the number of unique concepts. That will give us a sense of whether the distribution of repeated concepts is stable across the years.

[32]:
series1 = concepts.groupby("year")['concept'].count().rename("All concepts")
series2 = concepts.groupby("year")['concept'].nunique().rename("Unique concepts")
temp2 = pd.concat([series1, series2], axis=1).reset_index()
temp2 = temp2.melt(id_vars=["year"],
         var_name="type",
        value_name="count")

px.bar(temp2,
       title="Yearly breakdown: Tot concepts VS Unique concepts",
       x="year", y="count",
       color="type", barmode="group",
       color_discrete_sequence=px.colors.carto.Antique)

4. Making sense of concepts information

In this section we will take a deep dive into the concepts themselves, in particular by using the two metrics obtained above: frequency and score_avg.

These metrics will let us filter out concepts that are either too common or very infrequent, and instead highlight concepts that are semantically representative of our dataset.

The main thing to keep in mind is that only the combination of these two metrics leads to interesting results. For example, using only frequency would surface common keywords that are not very relevant; on the other hand, using only relevancy would surface concepts that are important to just one or two documents.

For example, let’s see what happens if we get the top concepts based on frequency only:

[33]:
top = concepts_unique.sort_values("frequency", ascending=False)[:20]

px.bar(top,
       title="Concepts sorted by frequency",
       x="concept", y="frequency",
       color="score_avg")

Not very interesting at all! Those keywords are obviously very common in the scientific literature (e.g. study or development), but of very little semantic interest.

4.1 [Method 1] prefiltering by score_avg and sorting by frequency

I.e. by doing so, we aim to extract concepts that are both frequent and tend to be highly relevant (within their documents).

[34]:
temp = concepts_unique.query("score_avg > 0.6").sort_values("frequency", ascending=False)

px.bar(temp[:50],
       title="Concepts with high average score, sorted by frequency",
       x="concept", y="frequency",
       color="score_avg")

4.2 [Method 2] prefiltering by frequency and sorting by score_avg

This method also allows us to isolate interesting concepts, even if they do not appear very frequently in our dataset.

[35]:
temp = concepts_unique.query("frequency > 10 & frequency < 100").sort_values(["score_avg", "frequency"], ascending=False)

px.bar(temp[:100],
       title="Concepts with medium frequency, sorted by score_avg",
       x="concept", y="score_avg",
       height=600,
       color="frequency")

Another visualization of the same data: plotting the frequency on the y axis makes it easier to isolate the most common concepts.

[36]:
temp = concepts_unique.query("frequency > 10 & frequency < 100").sort_values(["score_avg", "frequency"], ascending=False)

px.scatter(temp[:100],
           x="concept",
           y="frequency",
           height=600,
           color="score_avg",
           size="score_avg")

5. Analyses By Year

In this section we will show how to use the methods above together with a yearly segmentation of the documents data. This will allow us to draw up some cool comparisons of concepts/topics across years.

5.1 Adding year-based metrics to the concepts dataframe

These are the steps:

  • recalculate frequency and score_avg for each year, using the original concepts dataset from section 2.1

  • note that this will result in duplicate rows (one per appearance of a concept within the same year), which we then remove

[41]:
concepts['frequency_year'] = concepts.groupby(['year', 'concept'])['concept'].transform('count')
concepts['score_avg_year'] = concepts.groupby(['year', 'concept'])['score'].transform('mean').round(5)

concepts_by_year = concepts.copy().drop_duplicates(subset=['concept', 'year'])\
                    [['year', 'concept', 'frequency_year', 'score_avg_year']]
concepts_by_year.head()
[41]:
year concept frequency_year score_avg_year
0 2020 second-generation migrants 1 0.69108
1 2020 children of migrants 1 0.68700
2 2020 Donald Trump’s election 1 0.68565
3 2020 second-generation group 1 0.68345
4 2020 feelings of exclusion 1 0.67114

For example, let’s look at the yearly-distribution of a specific concept: migrants

[42]:
concepts_by_year[concepts_by_year['concept'] == "migrants"]
[42]:
year concept frequency_year score_avg_year
15 2020 migrants 13 0.57374
24802 2019 migrants 20 0.54701
58574 2018 migrants 19 0.56283
91568 2017 migrants 10 0.45743
116217 2016 migrants 11 0.48838
133627 2015 migrants 3 0.50477
153586 2014 migrants 4 0.57564
167388 2013 migrants 3 0.58176
185555 2012 migrants 2 0.52287
197645 2011 migrants 3 0.51742
204038 2010 migrants 4 0.35890
219759 2009 migrants 2 0.56712
223898 2008 migrants 5 0.47707
234404 2007 migrants 4 0.60750
244544 2006 migrants 4 0.52949
274089 2000 migrants 1 0.49055
276538 1999 migrants 2 0.55131
283362 1997 migrants 1 0.65821
288184 1996 migrants 1 0.24544
291173 1994 migrants 1 0.59848
296980 1991 migrants 1 0.52219
308946 1980 migrants 1 0.53460
311505 1976 migrants 1 0.03168

5.2 Charting the variation: multi-year visualization

We can use Plotly’s ‘facets’ to have subsections that show the variation across years. Plotly will plot all the values retrieved - which allows us to spot upward and downward trends.

  • tip: to have an equal representation for each year, we take the top N concepts across a chosen year span and then look at their frequency distribution over the years

In order to isolate interesting concepts, we can use the same formula from above (filter by score, then sort by frequency). Only this time using yearly values of course!

[75]:
MAX_CONCEPTS = 50
YEAR_START = 2015
YEAR_END = 2019
SCORE_MIN = 0.4

segment = concepts_by_year.query(f"year >= {YEAR_START} & year <= {YEAR_END}").copy()

# add dataset-level frequency / score_avg metrics (values align via the shared index)
segment['frequency'] = concepts.groupby('concept')['concept'].transform('count')
segment['score_avg'] = concepts.groupby('concept')['score'].transform('mean').round(5)

# get top N concepts for the dataviz
top_concepts = segment.drop_duplicates('concept')\
        .query(f"score_avg > {SCORE_MIN}")\
        .sort_values("frequency", ascending=False)[:MAX_CONCEPTS]

# use yearly data only for top N concepts
segment_subset = segment[segment['concept'].isin(top_concepts['concept'].tolist())]

px.bar(segment_subset,
       x="concept",
       y="frequency_year",
       facet_row="year",
       title=f"Top concepts {YEAR_START}-{YEAR_END} with score_avg > {SCORE_MIN}, sorted by frequency",
       height=1000,
       color="frequency_year")

Lastly, let’s represent the same data using a treemap.

This visualization is less suitable for comparing the years, but it is definitely quite fun to navigate around :-)

[92]:
px.treemap(segment_subset,
           path=['year', 'concept'],
           values='frequency_year',
           color="concept",
           height=600,
          )

6. Conclusion

In this tutorial we have demonstrated how to query for concepts using the Dimensions Analytics API.

Here are the main takeaways:

  • concepts can be easily extracted by using the as_dataframe_concepts() method

  • concepts have an implicit score relative to the document they belong to - but we can create more absolute metrics by normalizing these scores

  • it is useful to look at the frequency of concepts in the context of the entire dataset we have

  • there can be a long tail of concepts that are very infrequent, hence it’s useful to filter those out

  • by using a combination of frequency and score_avg metrics, we can filter out uninteresting concepts

Where to go from here

Using these methods, you can take advantage of concepts data in a number of real-world scenarios. Here are some ideas:

  • you can segment concepts using other dimensions: e.g. by journal or by field of research, in order to identify more specific trends;

  • concepts extracted can be used to create new DSL searches - using the in concepts search syntax;

  • concepts data can be grouped further using semantic similarity or clustering techniques;

  • you can look at the co-occurrence of concepts within the same document, in order to build a semantic network (see the sketch below).
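For example, here is a rough sketch of the co-occurrence idea, reusing the concepts dataframe from section 2.1 (the score_avg filter is just an arbitrary choice to keep the pairs meaningful):

from itertools import combinations
from collections import Counter

# sketch only: count how often two reasonably relevant concepts
# appear within the same publication
relevant = concepts.query("score_avg > 0.6")

pair_counts = Counter()
for pub_id, group in relevant.groupby('id'):
    for a, b in combinations(sorted(set(group['concept'])), 2):
        pair_counts[(a, b)] += 1

# the most frequent pairs could become the edges of a semantic network
pair_counts.most_common(10)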

Happy data analyses!



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.
