Description:

Normalize text into a clean token stream. Tokenizes corpus text, preserving
structure (capitalization, punctuation). The last column in `by` determines
the tokenization unit.
Usage:

nlp_tokenize_text(
  corpus,
  by = c("doc_id", "paragraph_id", "sentence_id"),
  id_col = "uid",
  include_spans = TRUE,
  method = "word"
)

Arguments:

corpus: Data frame or data.table with a text column and the identifier
  columns specified in `by`.
by: Character vector of identifier columns that define the text unit
  (e.g. "doc_id" or c("url", "node_id")). Default
  c("doc_id", "paragraph_id", "sentence_id"). The last column is the
  finest granularity.

id_col: Character. Name of the column (and list names) used for the
  unit id (default "uid").

include_spans: Logical. Include start/end character spans for each
  token (default TRUE).

method: Character. Tokenization method, either "word" or "biber".
Value:

Named list of tokens, or a list of tokens and spans if
include_spans = TRUE.
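For intuition, the start/end character spans that include_spans = TRUE
refers to can be computed for simple word tokens with base R's gregexpr().
This is a minimal sketch of the span concept only; it is not
nlp_tokenize_text()'s actual implementation, and "word" tokenization in the
package may differ (e.g. punctuation handling).

    # Sketch: word-like tokens with start/end character spans, base R only.
    text   <- "Hello world."
    m      <- gregexpr("\\S+", text)[[1]]       # runs of non-whitespace
    starts <- as.integer(m)
    ends   <- starts + attr(m, "match.length") - 1L
    data.frame(token = substring(text, starts, ends),
               start = starts, end = ends)
    # "Hello" spans characters 1-5; "world." spans 7-12

Spans are 1-based and inclusive here, so substring(text, start, end)
recovers each token verbatim.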
Examples:

corpus <- data.frame(
  doc_id      = c("1", "1", "2"),
  sentence_id = c("1", "2", "1"),
  text        = c("Hello world.",
                  "This is an example.",
                  "This is a party!")
)
tokens <- nlp_tokenize_text(corpus, by = c("doc_id", "sentence_id"))