textpress is organized around four actions: fetch, read, process, and search. This vignette covers the first two. fetch_urls() runs a search query and returns candidate URLs with metadata; read_urls() scrapes their content into a node-level data frame. Together they turn a search term into an analysis-ready corpus in a few lines, which the remaining vignettes build on.
fetch_urls() retrieves candidate URLs for a query; read_urls() scrapes and parses them. We then build a snippet from the first 15 words of each article and display the metadata alongside each text preview.
```r
library(textpress)
library(dplyr)
library(DT)

web_urls <- textpress::fetch_urls(
  query = "us polling on immigration",
  n_pages = 4,
  date_filter = "m"
)
```

Next, scrape and parse the URLs returned above into a node-level data frame with $text and $meta components.
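The scraping call itself is a minimal sketch: it assumes read_urls() accepts the url column returned by fetch_urls(); check the function's help page for its exact signature.

```r
# Assumed call: read_urls() takes a character vector of URLs and returns a
# list with $text (one row per node) and $meta (one row per URL).
web_text_list <- textpress::read_urls(web_urls$url)
```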
Build a text snippet from the first 15 words of each article, join to metadata, and display as an interactive table.
```r
# Collapse each article's nodes and keep the first 15 words as a preview
snippets <- web_text_list$text |>
  group_by(doc_id) |>
  summarise(text = {
    words <- unlist(strsplit(paste(text, collapse = " "), "\\s+"))
    paste(paste(words[seq_len(min(15, length(words)))], collapse = " "), "...")
  }, .groups = "drop")

# Join previews to document metadata and build clickable titles
metas_dt <- web_text_list$meta |>
  filter(!is.na(h1_title) & nzchar(trimws(h1_title))) |>
  left_join(snippets, by = "doc_id") |>
  arrange(desc(date)) |>
  mutate(
    title_link = paste0(
      '<a href="', url, '" target="_blank">', h1_title, '</a>'
    )
  )
```
```r
DT::datatable(
  metas_dt |> select(date, source, title_link, text),
  options = list(columnDefs = list(
    list(targets = 2, orderable = FALSE)  # title_link column holds HTML; disable sorting
  )),
  escape = FALSE,  # render the <a> tags rather than displaying them as text
  rownames = FALSE
)
```

fetch_urls() and read_urls() are the entry points for any textpress pipeline, taking you from search query to node-level corpus in a few lines. fetch_urls() returns candidate URLs with metadata; read_urls() scrapes and parses them into $text (one row per node) and $meta (one row per URL). The remaining vignettes take this output as their starting point.
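To orient yourself in that output, the two components can be inspected directly; only the columns already used above (doc_id, text, url, date, source, h1_title) are assumed here:

```r
# The scraped object is a list of two data frames
str(web_text_list, max.level = 1)    # $text and $meta components
head(web_text_list$text)             # node-level rows, keyed by doc_id
head(web_text_list$meta)             # one row per URL: date, source, h1_title, ...
```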