search_regex() matches patterns against a text corpus
and returns results in KWIC format with inline highlighting. This
vignette applies it to a generational-politics corpus with three
patterns – age ranges, numeric change expressions, and
voter-mobilization language – each match returned at the sentence level,
ready for close reading or aggregation.
The corpus is built from four web queries, one per generation, each about political alignment in 2026.
search_terms <- c(
  "US Gen Z voters 2026",
  "US Millennial political party 2026",
  "US Gen X politics forgotten generation 2026",
  "US Baby Boomer Republican Democrat 2026"
)
Fetch candidate URLs for each query and deduplicate.
read_urls() scrapes and parses the article text;
nlp_split_sentences() segments each document into
analysis-ready sentence rows.
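library(dplyr)  # filter(), pull(), mutate(), relocate(), distinct(), select()
# web_urls is the deduplicated URL table produced by the fetch step
web_text_list <- web_urls |>
  filter(path_depth > 0) |>  # keep URLs with at least one path segment
  pull(url) |>
  textpress::read_urls(cores = 6)
# the parsed article text is in the $text element
web_text <- web_text_list$text |>
  mutate(doc_id = match(url, unique(url))) |>  # one integer id per distinct URL
  relocate(doc_id, .before = 1)
Segment each document into sentence rows.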
web_ss <- web_text |>
  textpress::nlp_split_sentences(by = c('doc_id', 'node_id'))
Three patterns: age_range matches explicit age
constructions; from_to captures numeric change expressions;
energized catches voter-mobilization language.
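patterns <- list(
  # mobilization and demobilization vocabulary, US and UK spellings
  energized = "\\b(?:energi[zs]ed|motivated|mobili[zs]ed|fired\\s+up|disillusioned|apathetic)\\b",
  age_range = paste(
    # "aged 18 to 29", "ages 18-29", "aged 65 and older"
    "\\b(?:aged?(?:\\s+between)?|ages?)\\s+\\d{2,3}(?:(?:\\s*(?:[-–]|to|and)\\s*\\d{2,3})|(?:\\s+and\\s+(?:older|younger|over|under)))\\b",
    # "under 30", but not "under 30%" or "under 30,000"
    "\\bunder\\s+\\d{2}(?!\\d|,|%)\\b",
    # "65+", "65 and older"; no word boundary after "+", which would block "65+" before a space
    "\\b\\d{2,3}(?!\\d|,|%)(?:\\s*\\+|\\s+and\\s+(?:older|younger|over|under)\\b)",
    sep = "|"
  ),
  # "from 42 to 51"
  from_to = "\\bfrom\\s+\\d+\\s+to\\s+\\d+\\b"
)
age_range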
Sentences containing explicit age ranges or constructions (e.g. “adults aged 18 to 29”, “under 30”, “65+”).
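The example phrases above can be checked against a pattern before running it over the whole corpus. The sketch below uses base R only and a simplified stand-in pattern; `age_rx` and `examples` are hypothetical names introduced for illustration, and the stand-in is shorter than the vignette's age_range pattern. It assumes PCRE semantics (`perl = TRUE`), which the negative lookahead requires in base R.

```r
# Simplified stand-in for the age_range pattern (illustration only).
age_rx <- paste(
  "\\bage[ds]?\\s+\\d{2,3}\\s*(?:-|to|and)\\s*\\d{2,3}\\b",  # "aged 18 to 29"
  "\\bunder\\s+\\d{2}(?![\\d,%])",                           # "under 30", not "under 30%"
  "\\b\\d{2,3}(?:\\s*\\+|\\s+and\\s+(?:older|over))",        # "65+", "65 and older"
  sep = "|"
)

# Hypothetical test sentences: three that should match, two that should not.
examples <- c(
  "adults aged 18 to 29 turned out",
  "voters under 30 shifted",
  "support among those 65+ held steady",
  "under 30% of voters said so",
  "turnout rose from 42 to 51 percent"
)

# perl = TRUE is needed for the (?!...) lookahead.
hits <- grepl(age_rx, examples, perl = TRUE)
print(hits)
# TRUE TRUE TRUE FALSE FALSE
```

The fourth sentence shows what the lookahead is for: a percentage such as "under 30%" is not an age and is rejected.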
fs <- web_ss |>
  textpress::search_regex(
    query = patterns$age_range,
    by = c('doc_id', 'node_id'),
    highlight = c('<span style="background:#a6cbe1;">', '</span>')
  ) |>
  distinct(text, .keep_all = TRUE) |>
  select(doc_id, node_id, pattern, text)
if (!is.null(fs) && nrow(fs) > 0) fs |> DT::datatable(rownames = FALSE, escape = FALSE)
from_to
Sentences with numeric change expressions (e.g. “from 42 to 51 percent”) – useful for tracking shifts in polling or support figures.
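To turn a matched expression into a number that can be aggregated, the two figures can be captured and subtracted. This is a minimal base R sketch, not part of the vignette's pipeline: the capture groups are added here for illustration (the from_to pattern in this vignette has none), and `from_to_rx`, `x`, and `change` are hypothetical names.

```r
# from_to pattern with two capture groups added (illustration only).
from_to_rx <- "\\bfrom\\s+(\\d+)\\s+to\\s+(\\d+)\\b"

# Hypothetical sentence.
x <- "Support rose from 42 to 51 percent among young voters."

# regexec() returns the full match followed by each captured group.
m <- regmatches(x, regexec(from_to_rx, x, perl = TRUE))[[1]]

# m[1] is the full match; m[2] and m[3] are the start and end values.
change <- as.integer(m[3]) - as.integer(m[2])
print(change)
# 9
```

The unit (percent, points, seats) is not captured, so the surrounding sentence still needs to be read before the difference is interpreted.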
fs1 <- web_ss |>
  textpress::search_regex(
    query = patterns$from_to,
    by = c('doc_id', 'node_id'),
    highlight = c('<span style="background:#fdc4a8;">', '</span>')
  ) |>
  distinct(text, .keep_all = TRUE) |>
  select(doc_id, node_id, pattern, text)
if (!is.null(fs1) && nrow(fs1) > 0) fs1 |> DT::datatable(rownames = FALSE, escape = FALSE)
energized
Sentences with voter-mobilization language – terms like “energized”, “motivated”, “mobilized”, “disillusioned”, “apathetic”.
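For aggregation, it can help to see which of the alternated terms actually fired in each sentence. The sketch below does this in base R on three made-up sentences; `sents` and `terms` are hypothetical names, and the approach is a stand-in for illustration, not a description of how search_regex() works internally.

```r
# Same alternation as the energized pattern in this vignette.
energized_rx <- "\\b(?:energi[zs]ed|motivated|mobili[zs]ed|fired\\s+up|disillusioned|apathetic)\\b"

# Hypothetical sentences.
sents <- c(
  "Young voters are energised and fired up.",
  "Many remain apathetic.",
  "Turnout was flat."
)

# gregexpr() finds every match per sentence; regmatches() extracts the text.
terms <- regmatches(sents, gregexpr(energized_rx, sents, perl = TRUE))

# Number of matched terms per sentence.
print(lengths(terms))
# 2 1 0

# Frequency of each matched term across all sentences.
print(table(tolower(unlist(terms))))
```

The pattern is case-sensitive as written, so a sentence-initial "Energized" would be missed unless the search is run case-insensitively.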
fs2 <- web_ss |>
  textpress::search_regex(
    query = patterns$energized,
    by = c('doc_id', 'node_id'),
    highlight = c('<span style="background:#c8e6c9;">', '</span>')
  ) |>
  distinct(text, .keep_all = TRUE) |>
  select(doc_id, node_id, pattern, text)
if (!is.null(fs2) && nrow(fs2) > 0) fs2 |> DT::datatable(rownames = FALSE, escape = FALSE)
search_regex() is best for precise, expressive patterns
– age ranges, numeric constructions, multi-word sequences – where you
control the match exactly. Results come back sentence-level with inline
highlighting. For broader vocabulary coverage without hand-crafting
every variant, see the dictionary search vignette.
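Inline highlighting of this kind can be approximated in base R by wrapping each match in the opening and closing tags. The sketch below is an assumption about the general technique, not the implementation behind search_regex(); `rx`, `x`, and `out` are hypothetical names, and the pattern is wrapped in a capture group so the match can be reused as `\\1`.

```r
# Opening and closing tags, in the same form as the highlight argument used above.
highlight <- c('<span style="background:#c8e6c9;">', '</span>')

# Small pattern with one capture group around the whole alternation.
rx <- "\\b(energi[zs]ed|motivated)\\b"

# Hypothetical sentence.
x <- "Gen Z voters are energized and motivated."

# Replace every match with tag + match + tag.
out <- gsub(rx, paste0(highlight[1], "\\1", highlight[2]), x, perl = TRUE)
print(out)
```

Because the result contains raw HTML, it only renders as highlighting when the display layer does not escape it, which is why the tables above are drawn with `escape = FALSE`.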