Refine blocks into individual sentences. Splits text into sentences with accurate start/end offsets; handles abbreviations (optimized for Wikipedia and web text).
nlp_split_sentences(
  corpus,
  by = c("doc_id"),
  abbreviations = textpress::abbreviations
)
corpus: Data frame or data.table with a text column and the identifier columns specified in by.
by: Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). Default c("doc_id").
abbreviations: Character vector of abbreviations to protect (default textpress::abbreviations).
Returns a data.table with the by columns plus sentence_id, text, start, and end.
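
A minimal usage sketch, assuming the textpress package is installed. The sample data frames and their contents are made up for illustration; only the argument names and the output columns come from the description above. No expected output is shown because the exact sentence boundaries and offset values depend on the package version and its abbreviation list.

```r
# Assumes textpress is installed; the input data below are hypothetical.
library(textpress)

# A corpus needs a text column plus the identifier columns named in `by`.
corpus <- data.frame(
  doc_id = c("a", "b"),
  text = c(
    "Dr. Smith arrived at 9 a.m. The meeting began shortly after.",
    "Sentence splitting looks simple. Abbreviations make it harder."
  ),
  stringsAsFactors = FALSE
)

# Default call: doc_id alone identifies each text unit.
sentences <- nlp_split_sentences(corpus, by = c("doc_id"))

# Documented output columns: the by columns, sentence_id, text, start, end.
print(names(sentences))

# Composite identifier, following the c("url", "node_id") case mentioned above.
web_corpus <- data.frame(
  url = c("https://example.org/page", "https://example.org/page"),
  node_id = c(1L, 2L),
  text = c(
    "First block of text. It has two sentences.",
    "Second block. Also short."
  ),
  stringsAsFactors = FALSE
)

web_sentences <- nlp_split_sentences(web_corpus, by = c("url", "node_id"))
print(names(web_sentences))
```

Whether start and end are 1-based or inclusive is not stated here, so check a few rows against the input text before relying on the offsets for slicing.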