Same fetch-then-read pipeline as the web-data vignette, but starting
from Wikipedia rather than a search engine.
fetch_wiki_urls() finds article URLs by topic;
fetch_wiki_refs() follows each article’s References section
to collect cited external URLs – primary sources, reports, news. The
result is a richer seed set than a single search query gives you, and
read_urls() handles it the same way.
Start from a topic.
fetch_wiki_urls(query, limit) returns Wikipedia article
URLs matching a search phrase. Use that as your seed set.
library(textpress)
library(dplyr)
wiki_urls <- textpress::fetch_wiki_urls("January 6 Capitol attack", limit = 5)
wiki_urls
## [1] "https://en.wikipedia.org/wiki/January_6_United_States_Capitol_attack"
## [2] "https://en.wikipedia.org/wiki/Pardon_of_January_6_United_States_Capitol_attack_defendants"
## [3] "https://en.wikipedia.org/wiki/Aftermath_of_the_January_6_United_States_Capitol_attack"
## [4] "https://en.wikipedia.org/wiki/Criminal_proceedings_in_the_January_6_United_States_Capitol_attack"
## [5] "https://en.wikipedia.org/wiki/Timeline_of_the_January_6_United_States_Capitol_attack"
Follow References. Articles cite external sources;
those URLs are often high-value (primary sources, reports, news).
fetch_wiki_refs(url, n) returns a data.table
with source_url (the Wikipedia page) and
ref_url (the cited link). Pass one URL for one table, or
multiple URLs for a named list of tables. Use n to cap refs
per page.
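For the single-URL case, a minimal sketch (assuming, per the description above, that one URL returns a single data.table rather than a list):

```r
library(textpress)

# One page in, one table out: columns source_url and ref_url,
# capped at 10 cited links via n
one_refs <- textpress::fetch_wiki_refs(wiki_urls[1], n = 10)
head(one_refs$ref_url, 3)
```

With multiple URLs, as in the chunk below, you get a named list of such tables, which is why `bind_rows()` is used to stack them.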
refs_list <- textpress::fetch_wiki_refs(wiki_urls[1:3], n = 15) |>
bind_rows()
refs_list |>
slice(1:5) |>
select(-source_url) |>
DT::datatable(rownames = FALSE)
So: wiki article URLs + URLs from their References sections = an expanded URL list. You can then read both the articles and the cited pages.
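Assembling that expanded seed set is a one-liner; a sketch, assuming `refs_list` has the `ref_url` column described above:

```r
# Expanded seed set: article URLs plus the external URLs they cite
all_urls <- unique(c(wiki_urls, refs_list$ref_url))

# The same reader handles both kinds of page
corpus <- textpress::read_urls(all_urls)
```

Deduplicating with `unique()` matters here: overlapping articles often cite the same sources, and there is no reason to fetch a page twice.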
Character vector of URLs → data frame (one row per node: headings,
paragraphs, lists). Use read_urls() on any of the URLs you
collected above.
Wikipedia has special status in
read_urls(). When the URL is a Wikipedia page, the
function uses Wikipedia’s main-content selector
(div.mw-parser-output) and preserves section structure via
parent_heading. Boilerplate detection is off for Wikipedia.
exclude_wiki_refs = TRUE (default) drops
nodes under References, See also, Bibliography, and Sources so you get
article body only; set to FALSE to include those
sections.
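To see what the default drops, a sketch of the `FALSE` path (assuming the reference nodes carry their section name in `parent_heading`, as described above):

```r
# Keep References / See also / Bibliography / Sources nodes too
wiki_full <- textpress::read_urls(wiki_urls[1], exclude_wiki_refs = FALSE)

# Reference entries sit under their section heading
subset(wiki_full$text, parent_heading == "References") |> head(3)
```

This is mainly useful for auditing: you can compare the kept nodes against the dropped ones before settling on the default.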
# Article body only (no References / See also); wiki_read$text, wiki_read$meta
wiki_text_list <- textpress::read_urls(wiki_urls, exclude_wiki_refs = TRUE)
wiki_text_list$text |>
select(-url) |>
slice(1:5) |>
mutate(text = {
words <- strsplit(text, "\\s+")
sapply(words, function(w) paste(paste(w[seq_len(min(15, length(w)))], collapse = " "), "..."))
}) |>
DT::datatable(rownames = FALSE)
fetch_wiki_urls() and fetch_wiki_refs()
extend the same fetch-then-read pipeline to Wikipedia – useful when you
want a topic-seeded corpus that includes both article text and cited
primary sources. read_urls() handles the result identically
to any other URL list, returning $text and
$meta.