Regex search

search_regex() matches patterns against a text corpus and returns results in KWIC (keyword-in-context) format with inline highlighting. This vignette applies it to a generational-politics corpus with three patterns: age ranges, numeric change expressions, and voter-mobilization language. Each match is returned at the sentence level, ready for close reading or aggregation.

Search terms and web URLs

Four queries, one per generation, each targeting that generation’s political alignment in 2026.

search_terms <- c(
  "US Gen Z voters 2026",
  "US Millennial political party 2026",
  "US Gen X politics forgotten generation 2026",
  "US Baby Boomer Republican Democrat 2026"
)

Fetch candidate URLs for each query and deduplicate.

library(textpress)
library(dplyr)
library(DT)

web_urls <- lapply(search_terms, function(x) {
  textpress::fetch_urls(query = x,
                        n_pages = 3,
                        date_filter = 'm')
}) |>
  bind_rows() |>
  # keep one row per URL, even when several queries return the same page
  distinct(url, .keep_all = TRUE)

Web text and sentence split

read_urls() scrapes and parses the article text. URLs with a path_depth of 0 are filtered out first, and each remaining URL is then assigned an integer doc_id.

web_text_list <- web_urls |>
  filter(path_depth > 0) |>
  pull(url) |>
  textpress::read_urls(cores = 6) 

web_text <- web_text_list$text |>
  mutate(doc_id = match(url, unique(url))) |>
  relocate(doc_id, .before = 1)
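
The doc_id line relies on a base R idiom: match(url, unique(url)) numbers each distinct URL in order of first appearance, so all rows from the same page share one id. A minimal sketch with made-up URLs (the example.com addresses are placeholders, not corpus data):

```r
# Placeholder URLs for illustration only; several rows can share a URL
# when a page is parsed into multiple text nodes.
url <- c(
  "https://example.com/a",
  "https://example.com/a",
  "https://example.com/b",
  "https://example.com/a",
  "https://example.com/c"
)

# unique(url) keeps first-appearance order; match() returns each row's
# position in that de-duplicated vector.
doc_id <- match(url, unique(url))

print(data.frame(doc_id = doc_id, url = url))
```

The ids depend on row order, so they are stable only as long as the scraped table is not re-sorted before this step.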

Segment each document into sentence rows.

web_ss <- web_text |>
  textpress::nlp_split_sentences(by = c('doc_id', 'node_id'))

Regex patterns

Three patterns: age_range matches explicit age constructions; from_to captures numeric change expressions; energized catches voter-mobilization language.

patterns <- list(
  energized = "\\b(?:energi[zs]ed|motivated|mobili[zs]ed|fired\\s+up|disillusioned|apathetic)\\b",
  age_range = paste(
    # "aged 18 to 29", "ages 18-29", "aged 65 and older"
    "\\b(?:aged?(?:\\s+between)?|ages?)\\s+\\d{2,3}(?:(?:\\s*(?:[-–]|to|and)\\s*\\d{2,3})|(?:\\s+and\\s+(?:older|younger|over|under)))\\b",
    # "under 30", but not "under 30%" or "under 30,000"
    "\\bunder\\s+\\d{2}(?!\\d|,|%)\\b",
    # "65+" or "65 and older"; no \b after the "+", which is not a word character
    "\\b\\d{2,3}(?!\\d|,|%)\\s*(?:\\+|and\\s+(?:older|younger|over|under)\\b)",
    sep = "|"
  ),
  from_to   = "\\bfrom\\s+\\d+\\s+to\\s+\\d+\\b"
)
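
Before running patterns over a scraped corpus, it can help to probe them against a few hand-written sentences. The sketch below does this in base R for from_to and energized, copied from the list above. The probe sentences are invented, and grepl(perl = TRUE) is used as a stand-in: the regex engine behind search_regex() may differ, so treat this as a rough check rather than a guarantee.

```r
# Patterns copied from the patterns list.
energized <- "\\b(?:energi[zs]ed|motivated|mobili[zs]ed|fired\\s+up|disillusioned|apathetic)\\b"
from_to   <- "\\bfrom\\s+\\d+\\s+to\\s+\\d+\\b"

# Hypothetical probe sentences, written for this check only.
probes <- c(
  "Support rose from 42 to 51 percent among young voters.",
  "Younger voters are fired up about housing costs.",
  "Many Gen X respondents said they felt disillusioned.",
  "Turnout moved from roughly 40 to 45 percent.",
  "The campaign has energised first-time voters."
)

# One logical column per pattern: TRUE where the sentence matches.
check <- data.frame(
  text      = probes,
  from_to   = grepl(from_to, probes, perl = TRUE),
  energized = grepl(energized, probes, perl = TRUE)
)

print(check)
```

The fourth probe is a deliberate near miss: "from roughly 40 to 45" is not caught because from_to requires the number to follow "from" directly. The same approach works for age_range.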

Search results

Search ~ `age_range`

Sentences containing explicit age ranges or constructions (e.g. “adults aged 18 to 29”, “under 30”, “65+”).

fs <- web_ss |> 
  
  textpress::search_regex(
    query = patterns$age_range,
    by = c('doc_id', 'node_id'),
    highlight = c('<span style="background:#a6cbe1;">', '</span>')
    ) |>
  
  distinct(text, .keep_all = TRUE) |>
  select(doc_id, node_id, pattern, text)

if (!is.null(fs) && nrow(fs) > 0) fs |> DT::datatable(rownames = FALSE, escape = FALSE)
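
To make the inline highlighting concrete, here is a rough base R sketch of the idea: each match is wrapped in the opening and closing tags passed as highlight. This is an illustration only, not how textpress implements it. The sentence is invented, and only the "under NN" branch of age_range is used.

```r
# The "under NN" branch of age_range, copied from the patterns list.
under_nn <- "\\bunder\\s+\\d{2}(?!\\d|,|%)\\b"

# Same opening/closing tags as the highlight argument above.
tags <- c('<span style="background:#a6cbe1;">', '</span>')

# Hypothetical sentence: one true age expression and one percentage.
sentence <- "Turnout among voters under 30 rose, while under 30% of seniors stayed home."

# Wrap the whole pattern in a capture group and re-emit it between the tags.
highlighted <- gsub(
  paste0("(", under_nn, ")"),
  paste0(tags[1], "\\1", tags[2]),
  sentence,
  perl = TRUE
)

cat(highlighted, "\n")
```

Only "under 30" is wrapped; the negative lookahead leaves "under 30%" alone. Because the highlighted text is raw HTML, the datatable() call sets escape = FALSE so the span renders as a background color instead of appearing as literal tags.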

Search ~ `from_to`

Sentences with numeric change expressions (e.g. “from 42 to 51 percent”) – useful for tracking shifts in polling or support figures.

fs1 <- web_ss |> 
  
  textpress::search_regex(
    query = patterns$from_to,
    by = c('doc_id', 'node_id'),
    highlight = c('<span style="background:#fdc4a8;">', '</span>')
    ) |>
  
  distinct(text, .keep_all = TRUE) |>
  select(doc_id, node_id, pattern, text)

if (!is.null(fs1) && nrow(fs1) > 0) fs1 |> DT::datatable(rownames = FALSE, escape = FALSE)
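
To go from matched sentences to trackable figures, one option is to add capture groups to from_to and pull out the two numbers. The sketch below is an assumption about a possible follow-up step, not part of search_regex(). It uses base R on invented sentences, and the name from_to_groups is hypothetical.

```r
# from_to with capture groups around the two numbers (hypothetical variant).
from_to_groups <- "\\bfrom\\s+(\\d+)\\s+to\\s+(\\d+)\\b"

# Invented sentences standing in for rows of the from_to results.
sentences <- c(
  "Support among voters under 30 rose from 42 to 51 percent.",
  "Approval fell from 48 to 39 over the same period."
)

# regexec() returns the full match plus each group; regmatches() extracts them.
m <- regmatches(sentences, regexec(from_to_groups, sentences, perl = TRUE))

# Element 2 is the first group, element 3 the second; NA when nothing matched.
shifts <- data.frame(
  from = as.numeric(vapply(m, function(x) x[2], character(1))),
  to   = as.numeric(vapply(m, function(x) x[3], character(1)))
)
shifts$change <- shifts$to - shifts$from

print(shifts)
```

This captures only the first from/to pair in each sentence, and it says nothing about units: whether the numbers are percentages, ages, or counts still has to be read from the sentence.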

Search ~ `energized`

Sentences with voter-mobilization language: terms like “energized”, “motivated”, and “mobilized”, along with their opposites, “disillusioned” and “apathetic”.

fs2 <- web_ss |>

  textpress::search_regex(
    query = patterns$energized,
    by = c('doc_id', 'node_id'),
    highlight = c('<span style="background:#c8e6c9;">', '</span>')
    ) |>

  distinct(text, .keep_all = TRUE) |>
  select(doc_id, node_id, pattern, text)

if (!is.null(fs2) && nrow(fs2) > 0) fs2 |> DT::datatable(rownames = FALSE, escape = FALSE)
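
Because every result row carries a doc_id, sentence-level hits can be rolled up per document. A minimal sketch of that aggregation in base R, using a toy table in place of real search output (the rows and sentence text are made up; only the doc_id, node_id, and text column names come from the select() calls above):

```r
# Toy stand-in for a search result table; values are invented.
hits <- data.frame(
  doc_id  = c(1, 1, 2, 4, 4, 4),
  node_id = c(3, 7, 2, 1, 5, 9),
  text    = paste("toy sentence", 1:6)
)

# Count matched sentences per document.
per_doc <- aggregate(text ~ doc_id, data = hits, FUN = length)
names(per_doc)[2] <- "n_sentences"

# Documents with the most matching sentences first.
per_doc <- per_doc[order(-per_doc$n_sentences), ]

print(per_doc)
```

Note that the searches above call distinct(text, .keep_all = TRUE) first, so a sentence repeated verbatim across documents is counted only once.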

Summary

search_regex() is best for precise, expressive patterns – age ranges, numeric constructions, multi-word sequences – where you control the match exactly. Results come back sentence-level with inline highlighting. For broader vocabulary coverage without hand-crafting every variant, see the dictionary search vignette.