Input: a character vector of URLs. Output: a list holding node-level data (one
row per node: headings, paragraphs, list items) plus page-level metadata. Like
read_csv() or read_html(), read_urls() brings an external resource into R. It
follows fetch_urls() or fetch_wiki_urls() in the pipeline: fetch gets
locations, read gets text. Wikipedia pages are parsed with high-fidelity
selectors; use parent_heading to see which section each node belongs to.
External links and empty text rows are omitted, and for Wikipedia URLs the
References, See also, Bibliography, and Sources sections can optionally be
excluded.
read_urls(
  x,
  cores = 1,
  detect_boilerplate = TRUE,
  remove_boilerplate = TRUE,
  exclude_wiki_refs = TRUE
)

x: Character vector of URLs.

cores: Number of cores for parallel requests (default 1).

detect_boilerplate: Logical. Detect boilerplate nodes (e.g. sign-up prompts, related-links blocks).

remove_boilerplate: Logical. If detect_boilerplate is TRUE, remove boilerplate rows; if FALSE, keep them and add an is_boilerplate column.

exclude_wiki_refs: Logical. For Wikipedia URLs only, drop nodes whose parent_heading is References, See also, Bibliography, or Sources. Default TRUE.
A list with two elements: text (node-level data: doc_id, url, node_id, parent_heading, text, and optionally type and is_boilerplate) and meta (one row per URL: doc_id, url, h1_title, date, source). doc_id is an integer key (1 to the number of distinct URLs) assigned in first-appearance order of the input vector.
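Because text and meta share the doc_id key, node text can be joined to page metadata with base R. A minimal sketch on mock data in the documented shape (the mock values and the restriction to a few columns are assumptions; real output comes from read_urls(), which needs network access):

```r
# Mock read_urls() output: documented column names, invented values.
out <- list(
  text = data.frame(
    doc_id = c(1L, 1L, 2L),
    url = c("https://a.example", "https://a.example", "https://b.example"),
    node_id = c(1L, 2L, 1L),
    parent_heading = c("Intro", "History", "Intro"),
    text = c("alpha", "beta", "gamma")
  ),
  meta = data.frame(
    doc_id = c(1L, 2L),
    url = c("https://a.example", "https://b.example"),
    h1_title = c("Page A", "Page B")
  )
)

# Join node-level text to its page title via the shared doc_id key.
joined <- merge(out$text, out$meta[, c("doc_id", "h1_title")], by = "doc_id")
```

Each node row now carries its page's h1_title, which is convenient for grouping or labeling downstream.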
if (FALSE) { # \dontrun{
urls <- fetch_urls("R programming", n_pages = 1)$url
out <- read_urls(urls[1:3], cores = 1)
nodes <- out$text
meta <- out$meta
} # }
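With remove_boilerplate = FALSE, flagged rows stay in text and carry is_boilerplate, so they can be inspected and filtered afterwards; parent_heading supports the same kind of row filtering by section. A sketch on a mock node table (documented column names, invented values):

```r
# Mock node-level table as returned with remove_boilerplate = FALSE.
nodes <- data.frame(
  doc_id = 1L,
  node_id = 1:4,
  parent_heading = c("Intro", "Intro", "References", "Sign up"),
  text = c("alpha", "beta", "Smith 2020", "Join our newsletter"),
  is_boilerplate = c(FALSE, FALSE, FALSE, TRUE)
)

# Drop boilerplate rows after inspecting what was flagged.
clean <- nodes[!nodes$is_boilerplate, ]

# Keep only nodes under a given section heading.
intro <- clean[clean$parent_heading == "Intro", ]
```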