../../_images/badge-colab.svg ../../_images/badge-github-custom.svg

Measuring the policy impact of an organization using policy document citations

This notebook shows how to use the Dimensions Analytics API to analyze the policy impact of an organization.

We will do this by counting the indirect links via publications, hence it permits to gain some insight into the indirect policy impact of a research organization. The goal therefore is to extract and inspect policy documents that cite publications from the research organization in question.

Since the various content types included in the Dimensions database are deeply interlinked, the Dimensions APIs allow to perform this analysis via a few simple steps:

  • We start from a GRID identifier (representing a research organization in Dimensions).

  • We use the publications API to extract all publications where at least one author is/was affiliated to the GRID organization, for a selected time period.

  • We then use the policy documents API to discover policy documents that include citations to any of those publications.

  • Finally, we analyse the policy documents data to highlight trends.

[2]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))
==
CHANGELOG
This notebook was last run on Apr 20, 2023
==

Prerequisites

This notebook assumes you have installed the Dimcli library and have followed the steps in the ‘Getting Started’ tutorial.

[3]:
!pip install dimcli plotly tqdm -U --quiet

import dimcli
from dimcli.utils import *

import os, sys, time, json
from tqdm.notebook import tqdm as progressbar

import pandas as pd

import plotly.express as px
from plotly.offline import plot
if not 'google.colab' in sys.modules:
  # make js dependecies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()
Searching config file credentials for 'https://app.dimensions.ai' endpoint..
==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file

1. Choosing a research organization’s GRID identifier

For the purpose of this exercise, we are going to use GRID ID grid.1005.4 (UNSW Sydney, Australia). Feel free to change the parameters below as you want, for example by choosing another organization’s GRID ID, or doing a fuzzy search by organization name, below. (The name you enter will appear in the graphs produced below.)

[4]:
ORG = "UNSW Sydney" #@param {type:"string"}
dsl.query(f"""search organizations where name ~ "{ORG}" return organizations""").as_dataframe()
Returned Organizations: 1 (total = 1)
Time: 6.71s
[4]:
id name acronym city_name country_name latitude linkout longitude state_name types
0 grid.1005.4 UNSW Sydney UNSW Sydney Australia -33.917732 [https://www.unsw.edu.au/] 151.23096 New South Wales [Education]
[5]:

GRIDID = "grid.1005.4" #@param {type:"string"}

#@markdown The start/end year of publications used to extract policy documents
YEAR_START = 2017 #@param {type: "slider", min: 1950, max: 2022}
YEAR_END = 2019 #@param {type: "slider", min: 1950, max: 2022}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

from IPython.display import display, HTML
display(HTML(f'---<br /><a href="{dimensions_url(GRIDID)}">View {GRIDID} in Dimensions &#x29c9;</a>'.format(dimensions_url(GRIDID))))

#@markdown ---

2. Extracting publications data

By looking at the Dimensions API data model, we can see that policy documents and publications are connected by a property called publication_ids, which goes from policy documents to publications. This property represents the publication citations found in policy documents.

Hence, we need to… 1. …query for all publications with authors affiliated to our GRID ID; 2. …query for policy documents citing these publications.

[5]:
# Get full list of publications linked to this organization for the selected time frame

q = f"""search publications
        where research_orgs.id="{GRIDID}"
        and year in [{YEAR_START}:{YEAR_END}]
        return publications[id+doi+title+type+journal+year+research_orgs+researchers+category_for+times_cited]"""
print("===\n", q, "\n===")
pubs_json = dsl.query_iterative(q, limit=1000)
pubs = pubs_json.as_dataframe()
Starting iteration with limit=1000 skip=0 ...
===
 search publications
        where research_orgs.id="grid.1005.4"
        and year in [2017:2019]
        return publications[id+doi+title+type+journal+year+research_orgs+researchers+category_for+times_cited]
===
0-1000 / 29175 (6.10s)
1000-2000 / 29175 (8.72s)
2000-3000 / 29175 (7.96s)
3000-4000 / 29175 (5.99s)
4000-5000 / 29175 (5.46s)
5000-6000 / 29175 (5.45s)
6000-7000 / 29175 (5.11s)
7000-8000 / 29175 (5.59s)
8000-9000 / 29175 (5.73s)
9000-10000 / 29175 (6.26s)
10000-11000 / 29175 (5.76s)
11000-12000 / 29175 (5.31s)
12000-13000 / 29175 (7.22s)
13000-14000 / 29175 (5.68s)
14000-15000 / 29175 (5.38s)
15000-16000 / 29175 (5.15s)
16000-17000 / 29175 (4.97s)
17000-18000 / 29175 (5.62s)
18000-19000 / 29175 (5.40s)
19000-20000 / 29175 (6.45s)
20000-21000 / 29175 (4.25s)
21000-22000 / 29175 (4.82s)
22000-23000 / 29175 (6.50s)
23000-24000 / 29175 (5.86s)
24000-25000 / 29175 (6.04s)
25000-26000 / 29175 (5.70s)
26000-27000 / 29175 (5.46s)
27000-28000 / 29175 (6.11s)
28000-29000 / 29175 (5.46s)
29000-29175 / 29175 (2.61s)
===
Records extracted: 29175

3. Extracting policy documents linked to publications

In this section, we extract all policy documents linked to the publications dataset previously created. The steps are the following:

  • we loop over the publication IDs and create policy document queries, via the referencing publication_ids field of policy documents;

  • we collate all policy documents data, remove duplicates from policy documents and save the results;

  • finally, we count policy documents per publication and enrich the original publications dataset with these numbers.

[6]:
#
# the main policy documents query
#
q = """search policy_documents
        where publication_ids in {}
      return policy_documents[basics+publication_ids+category_for]"""

BATCHSIZE = 250
VERBOSE = False # set to True to see policy documents extraction logs


#
# loop through all pub IDs in chunks and query policy documents
#
print("===\nExtracting policy documents data ...")
policy_documents_json = []
pubsids = pubs['id']

for chunk in progressbar(list(chunks_of(list(pubsids), BATCHSIZE))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=VERBOSE)
    policy_documents_json += data.policy_documents
    time.sleep(1)


policy_documents = pd.DataFrame().from_dict(policy_documents_json)
policy_documents.drop_duplicates(subset='id', inplace=True)
print("Policy documents found: ", len(policy_documents))
===
Extracting policy documents data ...
Policy documents found:  1073

Enriching publications with policy document citation metrics

Each policy document record contains all the publication_ids it cites, so we can take this metric to enrich the original publications dataset we created above.

[7]:
def count_policy_documents_per_pub(pubid):
  global policy_documents
  return len(policy_documents[policy_documents['publication_ids'].str.contains(pubid)])

# turn lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
policy_documents['publication_ids'] = policy_documents['publication_ids'].apply(lambda x: ','.join(map(str, x)))

progressbar.pandas()
pubs['policy_documents'] = pubs['id'].progress_apply(lambda x: count_policy_documents_per_pub(x))

Now the policy_documents column gives us the top publications by number of citing policy documents.

[8]:
pubs.sort_values("policy_documents", ascending=False).head(5)
[8]:
id title category_for doi research_orgs researchers times_cited type year journal.id journal.title policy_documents
26407 pub.1084771026 Climate change impacts in the Middle East and ... [{'id': '80154', 'name': '4406 Human Geography... 10.1007/s10113-017-1144-2 [{'acronym': 'ODI', 'city_name': 'London', 'co... [{'first_name': 'Katharina', 'id': 'ur.0132307... 111 article 2017 jour.1051354 Regional Environmental Change 20
20622 pub.1099661937 Estimates of global seasonal influenza-associa... [{'id': '80003', 'name': '32 Biomedical and Cl... 10.1016/s0140-6736(17)33293-2 [{'acronym': 'CDC', 'city_name': 'Nairobi', 'c... [{'first_name': 'Angela Danielle', 'id': 'ur.0... 1639 article 2017 jour.1077219 The Lancet 16
28292 pub.1003731534 Surviving Sepsis Campaign: International Guide... [{'id': '80003', 'name': '32 Biomedical and Cl... 10.1007/s00134-017-4683-6 [{'city_name': 'Houston', 'country_name': 'Uni... [{'first_name': 'Andrew', 'id': 'ur.0104066565... 4572 article 2017 jour.1313639 Intensive Care Medicine 14
26159 pub.1085284420 Mortality risk during and after opioid substit... [{'id': '80003', 'name': '32 Biomedical and Cl... 10.1136/bmj.j1550 [{'acronym': 'ENS', 'city_name': 'Madrid', 'co... [{'first_name': 'Luis', 'id': 'ur.01346441735.... 1045 article 2017 jour.1017377 The BMJ 14
17519 pub.1103160874 Investing in non-communicable disease preventi... [{'id': '80141', 'name': '4206 Public Health'}... 10.1016/s0140-6736(18)30667-6 [{'acronym': 'WHO', 'city_name': 'Geneva', 'co... [{'first_name': 'Rachel A', 'id': 'ur.01013221... 214 article 2018 jour.1077219 The Lancet 13
[9]:
pubs_cited = pubs.query("policy_documents > 0 ").sort_values('policy_documents', ascending=False).copy()
pubs_cited.head(5)
[9]:
id title category_for doi research_orgs researchers times_cited type year journal.id journal.title policy_documents
26407 pub.1084771026 Climate change impacts in the Middle East and ... [{'id': '80154', 'name': '4406 Human Geography... 10.1007/s10113-017-1144-2 [{'acronym': 'ODI', 'city_name': 'London', 'co... [{'first_name': 'Katharina', 'id': 'ur.0132307... 111 article 2017 jour.1051354 Regional Environmental Change 20
20622 pub.1099661937 Estimates of global seasonal influenza-associa... [{'id': '80003', 'name': '32 Biomedical and Cl... 10.1016/s0140-6736(17)33293-2 [{'acronym': 'CDC', 'city_name': 'Nairobi', 'c... [{'first_name': 'Angela Danielle', 'id': 'ur.0... 1639 article 2017 jour.1077219 The Lancet 16
26159 pub.1085284420 Mortality risk during and after opioid substit... [{'id': '80003', 'name': '32 Biomedical and Cl... 10.1136/bmj.j1550 [{'acronym': 'ENS', 'city_name': 'Madrid', 'co... [{'first_name': 'Luis', 'id': 'ur.01346441735.... 1045 article 2017 jour.1017377 The BMJ 14
28292 pub.1003731534 Surviving Sepsis Campaign: International Guide... [{'id': '80003', 'name': '32 Biomedical and Cl... 10.1007/s00134-017-4683-6 [{'city_name': 'Houston', 'country_name': 'Uni... [{'first_name': 'Andrew', 'id': 'ur.0104066565... 4572 article 2017 jour.1313639 Intensive Care Medicine 14
17519 pub.1103160874 Investing in non-communicable disease preventi... [{'id': '80141', 'name': '4206 Public Health'}... 10.1016/s0140-6736(18)30667-6 [{'acronym': 'WHO', 'city_name': 'Geneva', 'co... [{'first_name': 'Rachel A', 'id': 'ur.01013221... 214 article 2018 jour.1077219 The Lancet 13
[10]:
px.bar(pubs_cited[:1000],
       color="type",
       x="year", y="policy_documents",
       hover_name="title",  hover_data=["journal.title"],
       title=f"Top publications from {ORG} mentioned in policy documents, by year of publication")

4. Policy documents data analysis

Now that we have extracted all the data we need, let’s start exploring it by building a few visualizations.

How many policy documents per year?

[11]:
# unpack publisher_org dictionary
temp = policy_documents.assign(**json_normalize(policy_documents['publisher_org'], errors="ignore"))

px.histogram(temp.dropna(subset=['name']), x="year",
             color="name",
             barmode="stack",
             title=f"Policy documents referencing publications from {ORG} - by year")

Who is publishing the policy documents?

This can be done by looking at the field publisher_org of policy documents. Since the field contains nested information, we first need to extract it into its own table (similarly to what we’ve done above with publications categories).

[12]:
temp = policy_documents.assign(**json_normalize(policy_documents['publisher_org'], errors="ignore"))

top_publisher_org = temp.groupby(['name', 'country_name'],
                                          as_index=False).count().sort_values(by="id", ascending=False)
# preview the data: ps the policy_documents_id column is the COUNT of policy documents
top_publisher_org[['name', 'country_name', 'id']].head()
[12]:
name country_name id
61 World Health Organization Switzerland 265
5 Canadian Agency for Drugs and Technologies in ... Canada 94
42 Publications Office of the European Union Luxembourg 60
31 National Institute for Health and Care Excellence United Kingdom 59
11 Food and Agriculture Organization of the Unite... Italy 46
[13]:
px.bar(top_publisher_org,
       x="name", y="id",
       hover_name="name", color="country_name",
       height=900,
       title=f"Top publishing organizations for policy documents referencing publications from {ORG}")

What are the main subject areas of the publications most frequently referenced in policy documents?

We can use the Fields of Research categories information in publications to obtain a breakdown of the publications by subject area.

This can be achieved by ‘exploding’ the category_for data into a separate table, since there can be more than one category per publication. The new categories table also retains some basic info about the publications it relates to, e.g. journal, title, publication id, etc., to make it easier to analyse the data.

[14]:
pubs_categories = pubs_cited.explode('category_for')
pubs_categories.dropna(subset=["category_for"], inplace=True)

def for_nice_name(for_dict):
    "transforms a category JSON into a nice looking title"
    if type(for_dict) == dict:
        name = for_dict['name']
        return ''.join([i for i in name if not i.isdigit()])
    else:
        return ""

# new col for nice name
pubs_categories["category_for_name"] = pubs_categories['category_for'].apply(lambda x: for_nice_name(x))
# new col for tot-pubs count
pubs_categories['count_pubs'] = pubs_categories.groupby("category_for_name")['id'].transform('count')

# Visualize the top categories using a pie chart...
categories = pubs_categories.drop_duplicates(subset="category_for_name").sort_values("count_pubs", ascending=False)[['category_for_name', 'count_pubs']]

px.pie(categories[:20],
       names="category_for_name", # the dimension for the slices
       values="count_pubs",  # the metric
       color_discrete_sequence=px.colors.sequential.Bluyl,
       title=f"Top FOR categories")
[15]:
THRESHOLD_PUBS = 1000
citedids = list(pubs_cited[:THRESHOLD_PUBS]['id'])
pubs_categories_cited = pubs_categories[pubs_categories['id'].isin(citedids)]
[16]:
px.scatter(pubs_categories_cited, x="year", y="category_for_name", color="type",
           hover_name="title",
           hover_data=["doi", "year", "journal.title"],
           height=800,
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top {THRESHOLD_PUBS} {ORG} publications cited by policy documents - by subject area")

Is there a correlation between publication citations and policy document citations?

Note: if the points on a scatterplot graph produce a lower-left-to-upper-right pattern (see below), this is indicative of a positive correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be high as well, and vice versa.

[17]:
px.scatter(pubs, x="policy_documents", y="times_cited",
           title=f"Policy document citations vs. publication citations")