Split text blocks into individual sentences. Splits each text into sentences, records start and end offsets for every sentence, and protects known abbreviations from being treated as sentence boundaries. Tuned for Wikipedia and web text.

Usage

nlp_split_sentences(
  corpus,
  by = c("doc_id"),
  abbreviations = textpress::abbreviations
)

Arguments

corpus

Data frame or data.table with a text column and the identifier columns specified in by.

by

Character vector of identifier columns that together define the text unit (e.g. "doc_id" or c("url", "node_id")). Defaults to c("doc_id").

abbreviations

Character vector of abbreviations to protect, so that they are not treated as sentence boundaries. Defaults to textpress::abbreviations.

Value

A data.table with the by columns plus sentence_id, text, start, and end, where start and end are the sentence offsets.
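
Examples

A minimal usage sketch based only on the signature and return columns documented above. The corpus contents, the url and node_id values, and the variable names are made up for illustration. The calls are guarded so the snippet still runs where textpress is not installed, and no output is shown because the exact splits and offsets depend on the installed version.

```r
# Hypothetical corpus: one row per document, with an identifier and a text column.
corpus <- data.frame(
  doc_id = c("a", "b"),
  text = c(
    "Dr. Smith arrived at 9 a.m. on Monday. The meeting started late.",
    "Sentence splitting is harder than it looks. Abbreviations are one reason."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical corpus where two columns together identify each text unit.
web_corpus <- data.frame(
  url = c("https://example.org/page", "https://example.org/page"),
  node_id = c(1L, 2L),
  text = c(
    "First block of the page. It has two sentences.",
    "Second block of the page."
  ),
  stringsAsFactors = FALSE
)

if (requireNamespace("textpress", quietly = TRUE)) {
  # Default identifier column and default abbreviation list.
  sentences <- textpress::nlp_split_sentences(corpus, by = "doc_id")
  print(sentences)

  # Composite identifier: by names every column that defines the text unit.
  web_sentences <- textpress::nlp_split_sentences(
    web_corpus,
    by = c("url", "node_id")
  )
  print(web_sentences)
}
```

Per the Value section, each result should carry the by columns alongside sentence_id, text, start, and end, so sentences can be joined back to their source rows on the identifier columns.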