Three nlp_* functions form the core processing pipeline:
nlp_split_sentences() segments text into sentence rows;
nlp_tokenize_text() tokenizes each sentence into a named
list with optional character spans; nlp_cast_tokens()
flattens that list into a long-format data frame ready for counting,
filtering, or indexing. This vignette walks through each step on a
Wikipedia corpus, then collapses them into a single pipe.
Load packages.
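A minimal setup sketch, assuming dplyr is attached for the bare slice() call used later; textpress and DT are referenced with :: throughout, so they only need to be installed:

```r
library(dplyr)  # for slice(); textpress and DT are called via ::
```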
Fetch Wikipedia articles on the Strauss–Howe generational theory and scrape the article bodies.
wiki_urls <- textpress::fetch_wiki_urls("Strauss Howe generational theory", limit = 5)
wiki_text_list <- textpress::read_urls(wiki_urls, exclude_wiki_refs = TRUE)
corpus <- wiki_text_list$text
nrow(corpus)
#> [1] 144
nlp_split_sentences() segments each text into individual
sentences, adding sentence_id and
start/end character offsets. Known
abbreviations (Mr., U.S., vs.) and single-letter initials are
protected from false boundary detection via a built-in
abbreviations vector, which can be extended as needed.
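The splitting call itself is not shown in this chunk, but it produces the web_ss sentence-level frame used by the tokenization step below. A sketch mirroring the call in the combined pipe at the end of the vignette:

```r
# Split each text into sentence rows, keyed by doc_id / node_id
web_ss <- corpus |>
  textpress::nlp_split_sentences(by = c("doc_id", "node_id"))
```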
nlp_tokenize_text() tokenizes each sentence into a named
list. With include_spans = TRUE the result carries two
parallel lists – tokens and spans – keyed by a
composite uid built from the by columns.
tokens <- web_ss |>
  textpress::nlp_tokenize_text(
    by = c("doc_id", "node_id", "sentence_id"),
    method = "biber",
    include_spans = TRUE
  )
# tokens list: first unit
head(tokens$tokens[[1]])
#> [1] "The"          "Strauss"      "–"            "Howe"         "generational"
#> [6] "theory"
# spans matrix: matching start/end positions
head(tokens$spans[[1]])
#>      start end
#> [1,]     1   3
#> [2,]     5  11
#> [3,]    12  12
#> [4,]    13  16
#> [5,]    18  29
#> [6,]    31  36
nlp_cast_tokens() flattens the token list into a
long-format data table: one row per token, with id,
token, start, and end.
token_df <- tokens |> textpress::nlp_cast_tokens()
token_df |>
  slice(1:20) |>
  DT::datatable(rownames = FALSE)

All three steps as a single pipe from the Wikipedia corpus.
corpus |>
  textpress::nlp_split_sentences(by = c("doc_id", "node_id")) |>
  textpress::nlp_tokenize_text(
    by = c("doc_id", "node_id", "sentence_id"),
    include_spans = TRUE
  ) |>
  textpress::nlp_cast_tokens()
#>            id        token start   end
#>        <char>       <char> <int> <int>
#>     1:  1.1.1          The     1     3
#>     2:  1.1.1      Strauss     5    11
#>     3:  1.1.1            –    12    12
#>     4:  1.1.1         Howe    13    16
#>     5:  1.1.1 generational    18    29
#>    ---
#> 11605:  5.2.4       Bannon    57    62
#> 11606:  5.2.4            .    63    63
#> 11607:  5.2.4            [    64    64
#> 11608:  5.2.4            4    65    65
#> 11609:  5.2.4            ]    66    66
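The long format makes the "ready for counting" claim concrete: standard data-manipulation verbs apply directly to the token column. A sketch, assuming dplyr is available and using the column names shown above:

```r
# Term-frequency table: top tokens across the corpus
token_df |>
  dplyr::count(token, sort = TRUE) |>
  head(10)
```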
nlp_split_sentences() → nlp_tokenize_text()
→ nlp_cast_tokens() is the standard processing path from
raw text to a token-level data frame. The
start/end spans thread through each step,
keeping token positions recoverable relative to the source text.
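Because spans are relative to the sentence text, any token's surface form can be recovered by substring extraction. A sketch; the sentence-text column name `text` in web_ss is an assumption here:

```r
# Recover the first token of the first sentence from its (start, end) span.
# Per the spans matrix above, the first span is (1, 3) -> "The".
sent <- web_ss$text[1]
substr(sent, 1, 3)
```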