textpress is organized around four actions: fetch, read, process, and search. This vignette covers the first two. fetch_urls() runs a search query and returns candidate URLs with metadata; read_urls() scrapes their content into a node-level data frame. Together they turn a search term into an analysis-ready corpus in a few lines, which the remaining vignettes build on.
fetch_urls() retrieves candidate URLs for a query; read_urls() scrapes and parses them. We then build a snippet from the first 15 words of each article and display the metadata alongside each text preview.
```r
library(textpress)
library(dplyr)
library(DT)

web_urls <- textpress::fetch_urls(
  query = "us polling on immigration",
  n_pages = 4,
  date_filter = "m"
)
```

Next, scrape and parse the URLs returned above into a node-level data frame with $text and $meta components.
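The scraping call itself is a minimal sketch: it assumes read_urls() accepts the url column returned by fetch_urls(); check the function's help page for its exact signature.

```r
# Assumed call: read_urls() takes a character vector of URLs and returns a
# list with $text (one row per node) and $meta (one row per URL).
web_text_list <- textpress::read_urls(web_urls$url)
```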
Build a text snippet from the first 15 words of each article, join to metadata, and display as an interactive table.
```r
# Collapse each article's nodes and keep the first 15 words as a preview
snippets <- web_text_list$text |>
  group_by(doc_id) |>
  summarise(text = {
    words <- unlist(strsplit(paste(text, collapse = " "), "\\s+"))
    paste(paste(words[seq_len(min(15, length(words)))], collapse = " "), "...")
  }, .groups = "drop")

# Join previews to document metadata and build clickable titles
metas_dt <- web_text_list$meta |>
  filter(!is.na(h1_title) & nzchar(trimws(h1_title))) |>
  left_join(snippets, by = "doc_id") |>
  arrange(desc(date)) |>
  mutate(
    title_link = paste0(
      '<a href="', url, '" target="_blank">', h1_title, '</a>'
    )
  )
```
```r
DT::datatable(
  metas_dt |> select(date, source, title_link, text),
  options = list(columnDefs = list(
    list(targets = 2, orderable = FALSE)  # title_link column holds HTML; disable sorting
  )),
  escape = FALSE,  # render the <a> tags rather than displaying them as text
  rownames = FALSE
)
```

fetch_urls() and read_urls() are the entry points for any textpress pipeline, taking you from search query to node-level corpus in a few lines. fetch_urls() returns candidate URLs with metadata; read_urls() scrapes and parses them into $text (one row per node) and $meta (one row per URL). The remaining vignettes take this output as their starting point.
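To orient yourself in that output, the two components can be inspected directly; only the columns already used above (doc_id, text, url, date, source, h1_title) are assumed here:

```r
# The scraped object is a list of two data frames
str(web_text_list, max.level = 1)    # $text and $meta components
head(web_text_list$text)             # node-level rows, keyed by doc_id
head(web_text_list$meta)             # one row per URL: date, source, h1_title, ...
```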