Same fetch-then-read pipeline as the web-data vignette, but starting
from Wikipedia rather than a search engine.
fetch_wiki_urls() finds article URLs by topic;
fetch_wiki_refs() follows each article’s References section
to collect cited external URLs – primary sources, reports, news. The
result is a richer seed set than a single search query gives you, and
read_urls() handles it the same way.
Start from a topic.
fetch_wiki_urls(query, limit) returns Wikipedia article
URLs matching a search phrase. Use that as your seed set.
library(textpress)
library(dplyr)
wiki_urls <- textpress::fetch_wiki_urls("January 6 Capitol attack", limit = 5)
wiki_urls
## [1] "https://en.wikipedia.org/wiki/January_6_United_States_Capitol_attack"
## [2] "https://en.wikipedia.org/wiki/Pardon_of_January_6_United_States_Capitol_attack_defendants"
## [3] "https://en.wikipedia.org/wiki/Aftermath_of_the_January_6_United_States_Capitol_attack"
## [4] "https://en.wikipedia.org/wiki/Criminal_proceedings_in_the_January_6_United_States_Capitol_attack"
## [5] "https://en.wikipedia.org/wiki/Timeline_of_the_January_6_United_States_Capitol_attack"
Follow References. Articles cite external sources;
those URLs are often high-value (primary sources, reports, news).
fetch_wiki_refs(url, n) returns a data.table
with source_url (the Wikipedia page) and
ref_url (the cited link). Pass one URL for one table, or
multiple URLs for a named list of tables. Use n to cap refs
per page.
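For the single-URL case, a minimal sketch (assuming, per the description above, that one URL returns a single data.table rather than a list):

```r
library(textpress)

# One page in, one table out: columns source_url and ref_url,
# capped at 10 cited links via n
one_refs <- textpress::fetch_wiki_refs(wiki_urls[1], n = 10)
head(one_refs$ref_url, 3)
```

With multiple URLs, as in the chunk below, you get a named list of such tables, which is why `bind_rows()` is used to stack them.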
refs_list <- textpress::fetch_wiki_refs(wiki_urls[1:3], n = 15) |>
bind_rows()
refs_list |>
slice(1:5) |>
select(-source_url) |>
DT::datatable(rownames = FALSE)
So: wiki article URLs + URLs from their References sections = an expanded URL list. You can then read both the articles and the cited pages.
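Assembling that expanded seed set is a one-liner; a sketch, assuming `refs_list` has the `ref_url` column described above:

```r
# Expanded seed set: article URLs plus the external URLs they cite
all_urls <- unique(c(wiki_urls, refs_list$ref_url))

# The same reader handles both kinds of page
corpus <- textpress::read_urls(all_urls)
```

Deduplicating with `unique()` matters here: overlapping articles often cite the same sources, and there is no reason to fetch a page twice.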
Character vector of URLs → data frame (one row per node: headings,
paragraphs, lists). Use read_urls() on any of the URLs you
collected above.
Wikipedia has special status in
read_urls(). When the URL is a Wikipedia page, the
function uses Wikipedia’s main-content selector
(div.mw-parser-output) and preserves section structure via
parent_heading. Boilerplate detection is off for Wikipedia.
exclude_wiki_refs = TRUE (default) drops
nodes under References, See also, Bibliography, and Sources so you get
article body only; set to FALSE to include those
sections.
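To see what the default drops, a sketch of the `FALSE` path (assuming the reference nodes carry their section name in `parent_heading`, as described above):

```r
# Keep References / See also / Bibliography / Sources nodes too
wiki_full <- textpress::read_urls(wiki_urls[1], exclude_wiki_refs = FALSE)

# Reference entries sit under their section heading
subset(wiki_full$text, parent_heading == "References") |> head(3)
```

This is mainly useful for auditing: you can compare the kept nodes against the dropped ones before settling on the default.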
# Article body only (no References / See also); wiki_read$text, wiki_read$meta
wiki_text_list <- textpress::read_urls(wiki_urls, exclude_wiki_refs = TRUE)
wiki_text_list$text |>
select(-url) |>
slice(1:5) |>
mutate(text = {
words <- strsplit(text, "\\s+")
sapply(words, function(w) paste(paste(w[seq_len(min(15, length(w)))], collapse = " "), "..."))
}) |>
DT::datatable(rownames = FALSE)
fetch_wiki_urls() and fetch_wiki_refs()
extend the same fetch-then-read pipeline to Wikipedia – useful when you
want a topic-seeded corpus that includes both article text and cited
primary sources. read_urls() handles the result identically
to any other URL list, returning $text and
$meta.