Three nlp_* functions form the core processing pipeline:
nlp_split_sentences() segments text into sentence rows;
nlp_tokenize_text() tokenizes each sentence into a named
list with optional character spans; nlp_cast_tokens()
flattens that list into a long-format data frame ready for counting,
filtering, or indexing. This vignette walks through each step on a
Wikipedia corpus, then collapses them into a single pipe.
Load packages.
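A minimal setup sketch, assuming dplyr is attached for the bare slice() call used later; textpress and DT are referenced with :: throughout, so they only need to be installed:

```r
library(dplyr)  # for slice(); textpress and DT are called via ::
```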
Fetch Wikipedia articles on the Strauss–Howe generational theory and scrape the article bodies.
wiki_urls <- textpress::fetch_wiki_urls("Strauss Howe generational theory", limit = 5)
wiki_text_list <- textpress::read_urls(wiki_urls, exclude_wiki_refs = TRUE)
corpus <- wiki_text_list$text
nrow(corpus)
#> [1] 144
nlp_split_sentences() segments each text into individual
sentences, adding sentence_id and
start/end character offsets. Known
abbreviations (Mr., U.S., vs.) and single-letter initials are
protected from false boundary detection via a built-in
abbreviations vector, which can be extended as needed.
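The splitting call itself is not shown in this chunk, but it produces the web_ss sentence-level frame used by the tokenization step below. A sketch mirroring the call in the combined pipe at the end of the vignette:

```r
# Split each text into sentence rows, keyed by doc_id / node_id
web_ss <- corpus |>
  textpress::nlp_split_sentences(by = c("doc_id", "node_id"))
```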
nlp_tokenize_text() tokenizes each sentence into a named
list. With include_spans = TRUE the result carries two
parallel lists – tokens and spans – keyed by a
composite uid built from the by columns.
tokens <- web_ss |>
  textpress::nlp_tokenize_text(
    by = c("doc_id", "node_id", "sentence_id"),
    method = "biber",
    include_spans = TRUE
  )
# tokens list: first unit
head(tokens$tokens[[1]])
#> [1] "The"          "Strauss"      "–"            "Howe"         "generational"
#> [6] "theory"
# spans matrix: matching start/end positions
head(tokens$spans[[1]])
#>      start end
#> [1,]     1   3
#> [2,]     5  11
#> [3,]    12  12
#> [4,]    13  16
#> [5,]    18  29
#> [6,]    31  36
nlp_cast_tokens() flattens the token list into a
long-format data table: one row per token, with id,
token, start, and end.
token_df <- tokens |> textpress::nlp_cast_tokens()
token_df |>
  slice(1:20) |>
  DT::datatable(rownames = FALSE)

All three steps as a single pipe from the Wikipedia corpus.
corpus |>
  textpress::nlp_split_sentences(by = c("doc_id", "node_id")) |>
  textpress::nlp_tokenize_text(
    by = c("doc_id", "node_id", "sentence_id"),
    include_spans = TRUE
  ) |>
  textpress::nlp_cast_tokens()
#>            id        token start   end
#>        <char>       <char> <int> <int>
#>     1:  1.1.1          The     1     3
#>     2:  1.1.1      Strauss     5    11
#>     3:  1.1.1            –    12    12
#>     4:  1.1.1         Howe    13    16
#>     5:  1.1.1 generational    18    29
#>    ---
#> 11605:  5.2.4       Bannon    57    62
#> 11606:  5.2.4            .    63    63
#> 11607:  5.2.4            [    64    64
#> 11608:  5.2.4            4    65    65
#> 11609:  5.2.4            ]    66    66
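The long format makes the "ready for counting" claim concrete: standard data-manipulation verbs apply directly to the token column. A sketch, assuming dplyr is available and using the column names shown above:

```r
# Term-frequency table: top tokens across the corpus
token_df |>
  dplyr::count(token, sort = TRUE) |>
  head(10)
```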
nlp_split_sentences() → nlp_tokenize_text()
→ nlp_cast_tokens() is the standard processing path from
raw text to a token-level data frame. The
start/end spans thread through each step,
keeping token positions recoverable relative to the source text.
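Because spans are relative to the sentence text, any token's surface form can be recovered by substring extraction. A sketch; the sentence-text column name `text` in web_ss is an assumption here:

```r
# Recover the first token of the first sentence from its (start, end) span.
# Per the spans matrix above, the first span is (1, 3) -> "The".
sent <- web_ss$text[1]
substr(sent, 1, 3)
```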