Pattern matching works when you know what you’re looking for;
dictionaries work when you want to cast a wider net systematically. This
vignette uses search_dict() with two of textpress’s built-in
dictionaries, dict_generations and dict_political, which map surface
variants like “Zoomers” or “MAGA” to canonical term names. A single
search matches both vocabulary sets across a web corpus; pointwise
mutual information (PMI) over sentence-level co-occurrences then shows
which generation and political terms appear together more often than
chance would predict.
library(dplyr)

# One search query per generation
search_terms <- c(
  "US Gen Z voters 2026",
  "US Millennial political party 2026",
  "US Gen X politics forgotten generation 2026",
  "US Baby Boomer Republican Democrat 2026"
)

# Fetch candidate URLs for each query, then stack and de-duplicate
web_urls <- lapply(search_terms, function(x) {
  textpress::fetch_urls(query = x,
                        n_pages = 3,
                        date_filter = "m")
}) |>
  bind_rows() |>
  unique()

read_urls() scrapes and parses the article text;
nlp_split_sentences() segments each document into
analysis-ready sentence rows.
textpress ships with two curated dictionaries:
dict_generations maps surface variants (e.g., “Gen
Z”, “Zoomers”, “iGen”) to a canonical
TermName; dict_political does the same for
political identity labels (“Democrat”, “progressive”,
“MAGA”, etc.). We show the political dictionary first, then
stack both and run a single dictionary search.
DT::datatable(
  textpress::dict_political,
  options = list(pageLength = 10),
  rownames = FALSE
)

Stack both dictionaries (with a category), pass all variants to
search_dict() once, then join back to get standardized
TermName and category. One search covers generation and
political terms.
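
The stack, search, join-back flow can be illustrated on toy data. The sketch below is only a stand-in: it does not call search_dict(), and the dictionary rows, sentences, and the names toy_dict, toy_sentences, and toy_matches are all made up for illustration. It uses plain case-insensitive substring matching from base R, which is cruder than a real dictionary search (no word boundaries), and assumes dplyr 1.1.0 or later for cross_join().

```r
library(dplyr)

# Hypothetical stacked dictionary: several variants can share one term_name
toy_dict <- data.frame(
  variant   = c("Gen Z", "Zoomers", "MAGA", "progressive"),
  term_name = c("Gen Z", "Gen Z", "MAGA", "Progressive"),
  cat       = c("gen", "gen", "affil", "affil")
) |>
  mutate(variant_lc = tolower(variant))

# Hypothetical sentence table with one row per sentence
toy_sentences <- data.frame(
  id   = c("1.1", "1.2", "2.1"),
  text = c("Zoomers are drifting away from MAGA.",
           "Turnout was low.",
           "Many Gen Z voters call themselves progressive.")
)

# Pair every sentence with every variant, keep literal lower-case hits,
# then collapse variants to one row per (sentence, canonical term)
toy_matches <- cross_join(toy_sentences, toy_dict) |>
  filter(mapply(grepl, variant_lc, tolower(text),
                MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)) |>
  distinct(id, term_name, cat)

toy_matches
```

Sentence 1.2 contains no variant and drops out. "Zoomers" in sentence 1.1 and "Gen Z" in sentence 2.1 both resolve to the same canonical term, which is the point of joining back to term_name before counting.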
dict_gen <- textpress::dict_generations |>
  mutate(variant_lc = tolower(variant), cat = "gen") |>
  select(variant, variant_lc, term_name = TermName, cat)

dict_pol <- textpress::dict_political |>
  mutate(variant_lc = tolower(variant), cat = "affil") |>
  select(variant, variant_lc, term_name = TermName, cat)
stacked_dict <- bind_rows(dict_gen, dict_pol)

Sentences that appear in both generation and affiliation matches form (gen, affil) pairs. We count pairs and compute PMI:

$$\mathrm{PMI}(x, y) = \log \frac{N \, n_{xy}}{n_x \, n_y}$$

where $N$ is the total pair count, $n_{xy}$ is the joint count, and $n_x$, $n_y$ are the marginal counts.
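
As a worked example with made-up counts (not taken from the corpus), suppose there are $N = 100$ pairs in total, a generation term $x$ appears in $n_x = 40$ of them, and an affiliation term $y$ appears in $n_y = 25$. If the two were independent we would expect $n_x n_y / N = 10$ joint occurrences, so PMI is the natural log of the ratio of observed to expected joint counts:

$$
\begin{aligned}
n_{xy} = 20: &\quad \log\frac{100 \times 20}{40 \times 25} = \log 2 \approx 0.69 \\
n_{xy} = 10: &\quad \log\frac{100 \times 10}{40 \times 25} = \log 1 = 0 \\
n_{xy} = 5:  &\quad \log\frac{100 \times 5}{40 \times 25} = \log 0.5 \approx -0.69
\end{aligned}
$$

Positive values mean the pair co-occurs more often than independence would predict, zero means exactly as often, and negative values mean less often.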
# Distinct (sentence id, term) rows for each category
ggs <- matches |> filter(cat == "gen") |> distinct(id, gen = term_name)
affil <- matches |> filter(cat == "affil") |> distinct(id, affil = term_name)

# Keep sentences matched by both categories and pair the terms within each
ids_both <- intersect(ggs$id, affil$id)
pairs <- ggs |>
  filter(id %in% ids_both) |>
  inner_join(affil |> filter(id %in% ids_both), by = "id")

# Total, joint, and marginal pair counts
n_total <- nrow(pairs)
count_xy <- pairs |> count(gen, affil, name = "n_xy")
count_x <- pairs |> count(gen, name = "n_gen")
count_y <- pairs |> count(affil, name = "n_affil")

# PMI (natural log) for each (gen, affil) pair, most frequent pairs first
cooccur <- count_xy |>
  left_join(count_x, by = "gen") |>
  left_join(count_y, by = "affil") |>
  mutate(pmi = round(log(n_total * n_xy / (n_gen * n_affil)), 2)) |>
  select(gen, affil, n_xy, n_gen, n_affil, pmi) |>
  arrange(desc(n_xy))

DT::datatable(cooccur, options = list(pageLength = 15), rownames = FALSE)