Same fetch-then-read pipeline as the web-data vignette, but starting from Wikipedia rather than a search engine. fetch_wiki_urls() finds article URLs by topic; fetch_wiki_refs() follows each article’s References section to collect cited external URLs — primary sources, reports, news. The result is a richer seed set than a single search query gives you, and read_urls() handles it the same way.

Expand via Wikipedia

Start from a topic. fetch_wiki_urls(query, limit) returns Wikipedia article URLs matching a search phrase. Use that as your seed set.

library(textpress)
library(dplyr)

wiki_urls <- textpress::fetch_wiki_urls("January 6 Capitol attack", limit = 5)
wiki_urls
## [1] "https://en.wikipedia.org/wiki/January_6_United_States_Capitol_attack"                            
## [2] "https://en.wikipedia.org/wiki/Pardon_of_January_6_United_States_Capitol_attack_defendants"       
## [3] "https://en.wikipedia.org/wiki/Aftermath_of_the_January_6_United_States_Capitol_attack"           
## [4] "https://en.wikipedia.org/wiki/Criminal_proceedings_in_the_January_6_United_States_Capitol_attack"
## [5] "https://en.wikipedia.org/wiki/List_of_cases_of_the_January_6_United_States_Capitol_attack"

Follow References. Articles cite external sources; those URLs are often high-value (primary sources, reports, news). fetch_wiki_refs(url, n) returns a data.table with source_url (the Wikipedia page) and ref_url (the cited link). Pass one URL for one table, or multiple URLs for a named list of tables. Use n to cap refs per page.

refs_list <- textpress::fetch_wiki_refs(wiki_urls[1:3], n = 15) |>
  bind_rows()

refs_list |> 
  slice(1:10) |> 
  select(-source_url) |> 
  DT::datatable(rownames = FALSE)

So: wiki article URLs + URLs from their References sections = an expanded URL list. You can then read both the articles and the cited pages.
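Combining the two sets is a one-liner. A sketch, assuming the `wiki_urls` vector and the bound `refs_list` table built above:

```r
# Article URLs plus the URLs they cite, deduplicated
all_urls <- unique(c(wiki_urls, refs_list$ref_url))
length(all_urls)
```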

Ingest with read_urls()

A character vector of URLs goes in; a list comes back: $text is a node-level data frame (one row per heading, paragraph, or list item) and $meta has one row per URL. Use read_urls() on any of the URLs you collected above.

Wikipedia has special status in read_urls(). When the URL is a Wikipedia page, the function uses Wikipedia’s main-content selector (div.mw-parser-output) and preserves section structure via parent_heading. Boilerplate detection is off for Wikipedia. exclude_wiki_refs = TRUE (default) drops nodes under References, See also, Bibliography, and Sources so you get article body only; set to FALSE to include those sections.

# Article body only (no References / See also); see wiki_text_list$text and wiki_text_list$meta
wiki_text_list <- textpress::read_urls(wiki_urls, exclude_wiki_refs = TRUE)

wiki_text_list$text |> 
  select(-url) |>
  slice(1:5) |> 
  DT::datatable(rownames = FALSE)
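Because section structure survives in parent_heading, a quick count shows each article's outline. A sketch against the node table returned above:

```r
# Nodes per section, most populous sections first
wiki_text_list$text |>
  count(parent_heading, sort = TRUE) |>
  slice(1:10)
```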

Summary: use fetch_wiki_urls() and fetch_wiki_refs() to follow Wikipedia’s links and build a URL list; read_urls() returns list(text = node-level corpus, meta = one row per URL).
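Put together, the whole pipeline is a few lines. A sketch, under the defaults shown above:

```r
# Topic -> article URLs -> cited references -> node-level corpus
seed   <- textpress::fetch_wiki_urls("January 6 Capitol attack", limit = 5)
refs   <- textpress::fetch_wiki_refs(seed, n = 15) |> dplyr::bind_rows()
urls   <- unique(c(seed, refs$ref_url))
corpus <- textpress::read_urls(urls)  # list(text = nodes, meta = one row per URL)
```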