Normalize text into a clean token stream. Tokenizes corpus text while preserving structure (capitalization and punctuation). The last column in by determines the tokenization unit.

nlp_tokenize_text(
  corpus,
  by = c("doc_id", "paragraph_id", "sentence_id"),
  id_col = "uid",
  include_spans = TRUE,
  method = "word"
)

Arguments

corpus

Data frame or data.table with a text column and the identifier columns specified in by.

by

Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). Default c("doc_id", "paragraph_id", "sentence_id"). The last column is the finest granularity.

id_col

Character. Name used for the unit-id column and for the names of the returned list (default "uid").

include_spans

Logical. Include start/end character spans for each token (default TRUE).

method

Character. Tokenization method, either "word" or "biber".

Value

Named list of tokens, or a named list of tokens and character spans if include_spans = TRUE (the default).

Examples

corpus <- data.frame(doc_id = c('1', '1', '2'),
                     sentence_id = c('1', '2', '1'),
                     text = c("Hello world.",
                              "This is an example.",
                              "This is a party!"))
tokens <- nlp_tokenize_text(corpus, by = c('doc_id', 'sentence_id'))
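
Continuing the example above, the returned list can be inspected, and spans can be dropped via include_spans; a sketch, assuming the return shape described under Value (element names derived from the by columns):

```r
# The result is a named list keyed by the unit id (see id_col);
# inspect the element names and top-level structure.
names(tokens)
str(tokens, max.level = 1)

# Token-only output, without start/end character spans:
tokens_only <- nlp_tokenize_text(corpus,
                                 by = c('doc_id', 'sentence_id'),
                                 include_spans = FALSE)
```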