Dictionary search • textpress

search_dict() matches a vocabulary list against a sentence corpus and returns hits with standardized term names. This vignette uses textpress’s built-in dict_generations and dict_political – surface variants like “Zoomers” or “MAGA” mapped to canonical labels – across the same generational-politics corpus, then computes PMI-weighted co-occurrence to see which generation and political terms travel together in the text.

Load packages.

library(textpress)
library(dplyr)
library(DT)

Search terms and web URLs

Four queries, one per generation, each targeting that generation’s political alignment in 2026.

search_terms <- c(
  "US Gen Z voters 2026",
  "US Millennial political party 2026",
  "US Gen X politics forgotten generation 2026",
  "US Baby Boomer Republican Democrat 2026"
)

Fetch candidate URLs for each query and deduplicate.

web_urls <- lapply(search_terms, function(x)
  textpress::fetch_urls(query = x,
                        n_pages = 3,
                        date_filter = "m")) |>
  bind_rows() |>
  unique()

Web text and sentence split

read_urls() scrapes and parses the article text; nlp_split_sentences() segments each document into analysis-ready sentence rows.

web_text_list <- web_urls |>
  filter(path_depth > 0) |>
  pull(url) |>
  textpress::read_urls(cores = 4)

web_ss <- web_text_list$text |>
  mutate(doc_id = match(url, unique(url))) |>
  relocate(doc_id, .before = 1) |>
  textpress::nlp_split_sentences(by = c("doc_id", "node_id"))
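
The doc_id step above relies on match(url, unique(url)) to give every distinct URL a stable integer id. A minimal base-R sketch of that idiom follows; the URLs are hypothetical placeholders, not part of the corpus.

```r
# Hypothetical URLs for illustration; one of them appears twice
url <- c("https://example.com/a",
         "https://example.com/b",
         "https://example.com/a",
         "https://example.com/c")

# match() returns each url's position in the de-duplicated vector,
# so repeated URLs share an id and ids follow order of first appearance
doc_id <- match(url, unique(url))
doc_id
```

Rows from the same article end up with the same doc_id, which is what lets doc_id serve as a grouping key in the sentence split.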

Dictionaries and search

textpress ships with two curated dictionaries: dict_generations maps surface variants (e.g., “Gen Z”, “Zoomers”, “iGen”) to a canonical TermName; dict_political does the same for political identity labels (“Democrat”, “progressive”, “MAGA”, etc.). We show the political dictionary first, then stack both and run a single dictionary search.

DT::datatable(
  textpress::dict_political,
  options = list(pageLength = 10),
  rownames = FALSE
)

Stack both dictionaries into one table, tagging each row with a category (cat) and a lower-cased copy of the variant (variant_lc) to use as a join key. With one stacked table, a single search covers both generation and political terms.

dict_gen <- textpress::dict_generations |>
  mutate(variant_lc = tolower(variant), cat = "gen") |>
  select(variant, variant_lc, term_name = TermName, cat)
dict_pol <- textpress::dict_political |>
  mutate(variant_lc = tolower(variant), cat = "affil") |>
  select(variant, variant_lc, term_name = TermName, cat)
stacked_dict <- bind_rows(dict_gen, dict_pol)

Run search_dict() with all variants, then join back to the stacked dictionary to attach canonical term_name and category.

matches <- web_ss |>
  textpress::search_dict(
    by   = c("doc_id", "sentence_id"),
    terms = stacked_dict$variant
  ) |>
  left_join(
    stacked_dict |> select(variant_lc, term_name, cat),
    by = c("term" = "variant_lc")
  )
matches |> arrange(id) |> DT::datatable(rownames = FALSE)
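
The join-back step can be illustrated without running a search. The sketch below uses base R and made-up data: toy_dict, toy_hits, and their canonical labels are hypothetical stand-ins, not the contents of dict_generations or dict_political. The vignette's join on variant_lc suggests that hits come back lower-cased, but that is an inference; lower-casing the hit before the lookup, as done here, makes the mapping work either way.

```r
# Hypothetical mini-dictionary: several surface variants per canonical term
toy_dict <- data.frame(
  variant   = c("Gen Z", "Zoomers", "MAGA", "progressive"),
  term_name = c("Gen Z", "Gen Z", "MAGA", "Progressive"),
  cat       = c("gen", "gen", "affil", "affil"),
  stringsAsFactors = FALSE
)
toy_dict$variant_lc <- tolower(toy_dict$variant)

# Hypothetical search hits: a sentence id plus the matched surface form
toy_hits <- data.frame(
  id   = c("1.1", "1.1", "2.3"),
  term = c("zoomers", "maga", "Gen Z"),
  stringsAsFactors = FALSE
)

# Look up each hit by its lower-cased form and attach the canonical
# label and category from the dictionary
idx <- match(tolower(toy_hits$term), toy_dict$variant_lc)
toy_matches <- cbind(toy_hits, toy_dict[idx, c("term_name", "cat")])
rownames(toy_matches) <- NULL
toy_matches
```

Both "zoomers" and "Gen Z" resolve to the same canonical label, which is what allows counts to be grouped by concept and not by spelling.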

Co-occurrence: counts and PMI (gen × affiliation)

A sentence matched by at least one generation term and at least one affiliation term contributes one (gen, affil) pair for every combination of distinct canonical terms it contains. We count pairs and compute PMI:

$\text{PMI}(x, y) = \log \frac{N \cdot n_{xy}}{n_x \cdot n_y}$

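where $N$ is the total number of pairs, $n_{xy}$ is the joint count of the pair $(x, y)$, and $n_x$, $n_y$ are the marginal counts of each term across all pairs. The code below uses the natural logarithm.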

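# One row per (sentence, canonical term) on each side
ggs   <- matches |> filter(cat == "gen")   |> distinct(id, gen = term_name)
affil <- matches |> filter(cat == "affil") |> distinct(id, affil = term_name)
# Sentences that mention both a generation and an affiliation
ids_both <- intersect(ggs$id, affil$id)

# Cross generation and affiliation terms within each such sentence
pairs <- ggs |> filter(id %in% ids_both) |>
  inner_join(affil |> filter(id %in% ids_both), by = "id")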

n_total  <- nrow(pairs)
count_xy <- pairs |> count(gen, affil, name = "n_xy")
count_x  <- pairs |> count(gen,   name = "n_gen")
count_y  <- pairs |> count(affil, name = "n_affil")

cooccur <- count_xy |>
  left_join(count_x, by = "gen") |>
  left_join(count_y, by = "affil") |>
  mutate(pmi = round(log(n_total * n_xy / (n_gen * n_affil)), 2)) |>
  select(gen, affil, n_xy, n_gen, n_affil, pmi) |>
  arrange(desc(n_xy))

DT::datatable(cooccur, options = list(pageLength = 15), rownames = FALSE)
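
To sanity-check the PMI arithmetic, here is a small worked example in base R. The six pairs in toy_pairs are invented for illustration and say nothing about the real corpus; the counting logic mirrors the formula above, with table() standing in for the dplyr count() calls.

```r
# Hypothetical (gen, affil) pairs, one row per co-occurrence
toy_pairs <- data.frame(
  gen   = c("Gen Z", "Gen Z", "Gen Z", "Boomers", "Boomers", "Millennials"),
  affil = c("Progressive", "Progressive", "MAGA", "MAGA", "MAGA", "Progressive"),
  stringsAsFactors = FALSE
)

n_total <- nrow(toy_pairs)  # N = 6

# Joint counts n_xy, dropping combinations that never occur
joint <- as.data.frame(
  table(gen = toy_pairs$gen, affil = toy_pairs$affil),
  responseName = "n_xy", stringsAsFactors = FALSE
)
joint <- joint[joint$n_xy > 0, ]

# Marginal counts n_x and n_y, looked up by term name
n_gen   <- table(toy_pairs$gen)
n_affil <- table(toy_pairs$affil)
joint$n_gen   <- as.integer(n_gen[joint$gen])
joint$n_affil <- as.integer(n_affil[joint$affil])

# PMI with the natural log, rounded as in the vignette
joint$pmi <- round(log(n_total * joint$n_xy / (joint$n_gen * joint$n_affil)), 2)
joint[order(-joint$n_xy), ]
```

Here Boomers with MAGA gives log(6 · 2 / (2 · 3)) = log 2 ≈ 0.69, while Gen Z with MAGA gives log(6 · 1 / (3 · 3)) ≈ -0.41. Millennials with Progressive also reaches 0.69 from a single pair, so PMI is best read alongside n_xy and not on its own.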

Summary

search_dict() handles vocabulary breadth that is awkward to maintain as hand-written regex: every surface form of a concept is matched in a single pass and returned with a canonical label for grouping. Stacking dict_generations and dict_political into one search gives co-labeled hits across both dimensions; PMI over the sentences that contain both surfaces which generation and political terms co-occur more often than their individual frequencies would predict in this corpus.