Reads a character vector of URLs and returns structured data, one row per node (heading, paragraph, or list item). Like read_csv or read_html, it brings an external resource into R. In a pipeline it follows fetch_urls or fetch_wiki_urls: fetch gets locations, read gets text. Wikipedia pages are parsed with high-fidelity selectors; use parent_heading to see which section each node belongs to. External links and rows with empty text are omitted, and for Wikipedia URLs the References, See also, Bibliography, and Sources sections can optionally be excluded.

read_urls(
  x,
  cores = 1,
  detect_boilerplate = TRUE,
  remove_boilerplate = TRUE,
  exclude_wiki_refs = TRUE
)

Arguments

x

Character vector of URLs.

cores

Number of cores for parallel requests (default 1).

detect_boilerplate

Logical. If TRUE, detect boilerplate nodes (e.g. sign-up prompts, related-links blocks).

remove_boilerplate

Logical. Only applies when detect_boilerplate is TRUE: if TRUE, boilerplate rows are removed; if FALSE, they are kept and flagged in an is_boilerplate column.

exclude_wiki_refs

Logical. For Wikipedia URLs only, drop nodes whose parent_heading is References, See also, Bibliography, or Sources. Default TRUE.

Value

A list with text (node-level data: doc_id, url, node_id, parent_heading, text, and optionally type, is_boilerplate) and meta (one row per URL: doc_id, url, h1_title, date, source). doc_id is an integer key running from 1 to the number of distinct URLs, assigned in first-appearance order of the input vector.
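As a sketch of working with this return shape, the toy data frames below mimic the documented text and meta components (all values are invented, not real output) and show attaching page-level metadata to every node by joining on the doc_id key with base merge():

```r
# Toy stand-ins for out$text and out$meta (invented values, not real output)
nodes <- data.frame(
  doc_id = c(1L, 1L, 2L),
  url = c("https://en.wikipedia.org/wiki/R_(programming_language)",
          "https://en.wikipedia.org/wiki/R_(programming_language)",
          "https://example.org/post"),
  node_id = c(1L, 2L, 1L),
  parent_heading = c("History", "History", NA),
  text = c("R is a language for statistical computing.",
           "It appeared in 1993.",
           "A blog post.")
)
meta <- data.frame(
  doc_id = c(1L, 2L),
  url = unique(nodes$url),
  h1_title = c("R (programming language)", "A post"),
  date = c(NA, "2024-01-01"),
  source = c("wikipedia", "web")
)

# Attach the page title to each node: join node-level and URL-level tables
joined <- merge(nodes, meta[, c("doc_id", "h1_title")], by = "doc_id")
```

Because doc_id is a shared integer key, the join adds one h1_title column without duplicating or dropping node rows.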

Examples

if (FALSE) { # \dontrun{
urls <- fetch_urls("R programming", n_pages = 1)$url
out <- read_urls(urls[1:3], cores = 1)
nodes <- out$text
meta <- out$meta
} # }
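The exclude_wiki_refs filter can also be reproduced by hand on the node table, which is useful if you kept those sections and want to drop them later. The toy data frame below (invented values) stands in for out$text:

```r
# Toy node table (invented values, not real output)
nodes <- data.frame(
  parent_heading = c("History", "References", "See also", "Usage"),
  text = c("R appeared in 1993.", "Smith (2020).", "S (programming language)",
           "R is used for statistics.")
)

# Drop nodes that sit under reference-style section headings
ref_headings <- c("References", "See also", "Bibliography", "Sources")
body <- nodes[!nodes$parent_heading %in% ref_headings, ]
```

This is the same heading list the function checks, applied only to Wikipedia URLs when exclude_wiki_refs = TRUE.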