Same fetch-then-read pipeline as the web-data vignette, but starting
from Wikipedia rather than a search engine.
fetch_wiki_urls() finds article URLs by topic;
fetch_wiki_refs() follows each article’s References section
to collect cited external URLs — primary sources, reports, news. The
result is a richer seed set than a single search query gives you, and
read_urls() handles it the same way.
Start from a topic.
fetch_wiki_urls(query, limit) returns Wikipedia article
URLs matching a search phrase. Use that as your seed set.
library(textpress)
library(dplyr)
wiki_urls <- textpress::fetch_wiki_urls("January 6 Capitol attack", limit = 5)
wiki_urls
## [1] "https://en.wikipedia.org/wiki/January_6_United_States_Capitol_attack"
## [2] "https://en.wikipedia.org/wiki/Pardon_of_January_6_United_States_Capitol_attack_defendants"
## [3] "https://en.wikipedia.org/wiki/Aftermath_of_the_January_6_United_States_Capitol_attack"
## [4] "https://en.wikipedia.org/wiki/Criminal_proceedings_in_the_January_6_United_States_Capitol_attack"
## [5] "https://en.wikipedia.org/wiki/List_of_cases_of_the_January_6_United_States_Capitol_attack"
Follow References. Articles cite external sources;
those URLs are often high-value (primary sources, reports, news).
fetch_wiki_refs(url, n) returns a data.table
with source_url (the Wikipedia page) and
ref_url (the cited link). Pass one URL for one table, or
multiple URLs for a named list of tables. Use n to cap refs
per page.
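The single-URL form can be sketched as follows (a minimal, unevaluated example; `one_page_refs` is an illustrative name, and the column layout is as described above):

```r
# One Wikipedia page in, one data.table out,
# with source_url and ref_url columns.
one_page_refs <- textpress::fetch_wiki_refs(wiki_urls[1], n = 5)
head(one_page_refs)
```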
refs_list <- textpress::fetch_wiki_refs(wiki_urls[1:3], n = 15) |>
  bind_rows()

refs_list |>
  slice(1:10) |>
  select(-source_url) |>
  DT::datatable(rownames = FALSE)

So: wiki article URLs + URLs from their References sections = an expanded URL list. You can then read both the articles and the cited pages.
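Before reading, it can help to deduplicate the cited links and merge them with the article URLs into one seed set. A sketch using only dplyr and base R on the objects built above (`all_urls` is an illustrative name; not run here):

```r
library(dplyr)

# How many cited links each article contributed
refs_list |>
  count(source_url, name = "n_refs")

# Deduplicated cited URLs, merged with the article URLs themselves
all_urls <- union(wiki_urls, unique(refs_list$ref_url))
```

`all_urls` can then be passed straight to `read_urls()`.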
Character vector of URLs → data frame (one row per node: headings,
paragraphs, lists). Use read_urls() on any of the URLs you
collected above.
Wikipedia has special status in
read_urls(). When the URL is a Wikipedia page, the
function uses Wikipedia’s main-content selector
(div.mw-parser-output) and preserves section structure via
parent_heading. Boilerplate detection is off for Wikipedia.
exclude_wiki_refs = TRUE (default) drops
nodes under References, See also, Bibliography, and Sources so you get
article body only; set to FALSE to include those
sections.
# Article body only (no References / See also); wiki_text_list$text, wiki_text_list$meta
wiki_text_list <- textpress::read_urls(wiki_urls, exclude_wiki_refs = TRUE)

wiki_text_list$text |>
  select(-url) |>
  slice(1:5) |>
  DT::datatable(rownames = FALSE)

Summary: use fetch_wiki_urls() and
fetch_wiki_refs() to follow Wikipedia’s links and build
a URL list; read_urls() returns
list(text = node-level corpus, meta = one row per URL).
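Put together, the whole pipeline is roughly the following (an unevaluated sketch; object names are illustrative):

```r
library(textpress)
library(dplyr)

# Topic -> article URLs -> cited external URLs -> node-level corpus
urls  <- fetch_wiki_urls("January 6 Capitol attack", limit = 3)
refs  <- fetch_wiki_refs(urls, n = 10) |> bind_rows()
seeds <- union(urls, unique(refs$ref_url))

corpus <- read_urls(seeds)
corpus$text   # one row per node: headings, paragraphs, lists
corpus$meta   # one row per URL
```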