Pattern matching works when you know what you’re looking for; dictionaries work when you want to cast a wider net systematically. This vignette uses search_dict() with textpress’s built-in dict_generations and dict_political, which map surface variants like “Zoomers” or “MAGA” to canonical term names, to match whole vocabulary sets at once across a web corpus. It then uses PMI-weighted co-occurrence to surface which generation and political terms actually travel together in the text.

Search terms and web URLs

library(dplyr)

# One search query per generation.
search_terms <- c(
  "US Gen Z voters 2026",
  "US Millennial political party 2026",
  "US Gen X politics forgotten generation 2026",
  "US Baby Boomer Republican Democrat 2026"
)

# Fetch result URLs for every query, stack them, and drop duplicate rows.
web_urls <- lapply(search_terms, function(x)
  textpress::fetch_urls(query = x,
                        n_pages = 3,
                        date_filter = "m")) |>
  bind_rows() |>
  unique()

Web text and sentence split

read_urls() scrapes and parses the article text; nlp_split_sentences() segments each document into analysis-ready sentence rows.

web_text_list <- web_urls |>
  filter(path_depth > 0) |>
  pull(url) |>
  textpress::read_urls(cores = 4)

web_ss <- web_text_list$text |>
  mutate(doc_id = match(url, unique(url))) |>
  relocate(doc_id, .before = 1) |>
  textpress::nlp_split_sentences(by = c("doc_id", "node_id"))

textpress ships with two curated dictionaries: dict_generations maps surface variants (e.g., “Gen Z”, “Zoomers”, “iGen”) to a canonical TermName; dict_political does the same for political identity labels (“Democrat”, “progressive”, “MAGA”, etc.). We show the political dictionary first, then stack both and run a single dictionary search.

DT::datatable(
  textpress::dict_political,
  options = list(pageLength = 10),
  rownames = FALSE
)

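Stack both dictionaries, tagging each row with a category, and pass all variants to search_dict() in a single call. Then join the hits back to the stacked dictionary to recover the standardized term name (TermName, renamed term_name here) and the category. One search thus covers both generation and political terms.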

# Normalize each dictionary to a shared schema and tag it with a category.
dict_gen <- textpress::dict_generations |>
  mutate(variant_lc = tolower(variant), cat = "gen") |>
  select(variant, variant_lc, term_name = TermName, cat)
dict_pol <- textpress::dict_political |>
  mutate(variant_lc = tolower(variant), cat = "affil") |>
  select(variant, variant_lc, term_name = TermName, cat)
stacked_dict <- bind_rows(dict_gen, dict_pol)

# Search all variants in one pass, then map each hit back to its canonical
# term. distinct() guards against duplicated rows in the join if two variants
# collapse to the same lowercase key.
matches <- web_ss |>
  textpress::search_dict(
    by    = c("doc_id", "sentence_id"),
    terms = stacked_dict$variant
  ) |>
  left_join(
    stacked_dict |> select(variant_lc, term_name, cat) |> distinct(),
    by = c("term" = "variant_lc")
  )
matches |> arrange(id) |> DT::datatable(rownames = FALSE)
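
The join-back step can be seen in isolation with a toy example. This is a minimal base-R sketch, not textpress output: toy_dict, toy_hits, and the canonical names in them are hypothetical, and it assumes (as the join above implies) that search hits come back lowercased in a term column keyed by a sentence id.

```r
# Hypothetical mini-dictionary: several surface variants per canonical term.
toy_dict <- data.frame(
  variant   = c("Gen Z", "Zoomers", "iGen", "MAGA", "progressive"),
  term_name = c("Generation Z", "Generation Z", "Generation Z",
                "MAGA", "Progressive"),
  cat       = c("gen", "gen", "gen", "affil", "affil")
)
toy_dict$variant_lc <- tolower(toy_dict$variant)

# Hypothetical search hits: one row per matched variant per sentence.
toy_hits <- data.frame(
  id   = c("1.1", "1.1", "2.3"),
  term = c("zoomers", "maga", "igen")
)

# Left join: every hit keeps its row and gains a canonical name and category.
toy_matches <- merge(
  toy_hits,
  toy_dict[, c("variant_lc", "term_name", "cat")],
  by.x = "term", by.y = "variant_lc",
  all.x = TRUE
)
toy_matches
```

Here "zoomers" and "igen" both resolve to the same canonical generation term, so downstream counts treat them as one category rather than two.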

Co-occurrence: counts and PMI (gen × affiliation)

Sentences that match at least one generation term and at least one affiliation term yield (gen, affil) pairs, one for each combination found in the sentence. We count pairs and compute PMI:

$$\text{PMI}(x, y) = \log \frac{N \cdot n_{xy}}{n_x \cdot n_y}$$

where $N$ is the total pair count, $n_{xy}$ is the joint count of the pair, and $n_x$, $n_y$ are the marginal counts.
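
A small worked example makes the formula concrete. The sketch below uses hypothetical toy pairs (pairs_toy and its values are invented for illustration) and base R only; it computes the same quantity as the formula above for every cell of a 2 × 2 table.

```r
# Toy (gen, affil) pairs; hypothetical data, one row per co-occurring pair.
pairs_toy <- data.frame(
  gen   = c("Gen Z", "Gen Z", "Gen Z", "Boomers", "Boomers", "Boomers"),
  affil = c("Democrat", "Democrat", "Republican",
            "Republican", "Republican", "Democrat")
)

n_total <- nrow(pairs_toy)                                # N
n_xy    <- unclass(table(pairs_toy$gen, pairs_toy$affil)) # joint counts
n_x     <- rowSums(n_xy)                                  # marginals per generation
n_y     <- colSums(n_xy)                                  # marginals per affiliation

# PMI = log(N * n_xy / (n_x * n_y)), natural log, for every cell at once.
pmi_toy <- log(n_total * n_xy / outer(n_x, n_y))
round(pmi_toy, 2)
```

With N = 6 and every marginal equal to 3, the pair (Gen Z, Democrat) has a joint count of 2, so its PMI is log(6 · 2 / 9) ≈ 0.29; (Gen Z, Republican) has a joint count of 1 and a PMI of log(6 / 9) ≈ -0.41. Positive values mark pairs that co-occur more often than the marginals alone would predict, negative values less often.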

# One row per (sentence id, canonical term) within each category.
ggs   <- matches |> filter(cat == "gen")   |> distinct(id, gen = term_name)
affil <- matches |> filter(cat == "affil") |> distinct(id, affil = term_name)
ids_both <- intersect(ggs$id, affil$id)

# Pair every generation term with every affiliation term in the same sentence.
pairs <- ggs |> filter(id %in% ids_both) |>
  inner_join(affil |> filter(id %in% ids_both), by = "id")

# N, joint counts, and marginal counts, all taken over the pair table.
n_total  <- nrow(pairs)
count_xy <- pairs |> count(gen, affil, name = "n_xy")
count_x  <- pairs |> count(gen,   name = "n_gen")
count_y  <- pairs |> count(affil, name = "n_affil")

# PMI with the natural log, rounded to two decimals.
cooccur <- count_xy |>
  left_join(count_x, by = "gen") |>
  left_join(count_y, by = "affil") |>
  mutate(pmi = round(log(n_total * n_xy / (n_gen * n_affil)), 2)) |>
  select(gen, affil, n_xy, n_gen, n_affil, pmi) |>
  arrange(desc(n_xy))

DT::datatable(cooccur, options = list(pageLength = 15), rownames = FALSE)