Break documents into structural blocks (paragraphs). Splits text from the
text column by a paragraph delimiter.
nlp_split_paragraphs(corpus, by = c("doc_id"), paragraph_delim = "\\n+")
corpus: Data frame or data.table with a text column and the identifier columns specified in by.
by: Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). Default c("doc_id").
paragraph_delim: Regular expression used to split text into paragraphs (default "\\n+").
A data.table containing the by columns, paragraph_id, and text, with one row per paragraph.
# Two documents, each with two paragraphs separated by a blank line
corpus <- data.frame(doc_id = c('1', '2'),
                     text = c("Hello world.\n\nMind your business!",
                              "This is an example.\n\nThis is a party!"))
paragraphs <- nlp_split_paragraphs(corpus)
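
To make the shape of the result concrete, here is a minimal base-R sketch of the same idea: split each text on the delimiter regex and emit one row per paragraph. The helper name split_paragraphs_sketch is hypothetical and this is not the package's implementation. It only assumes the behavior described above (by columns, paragraph_id, text) and returns a plain data frame rather than a data.table.

```r
# Illustrative sketch only; split_paragraphs_sketch is a hypothetical helper,
# not part of the package, and may differ from nlp_split_paragraphs in details
# such as whitespace handling or empty paragraphs.
split_paragraphs_sketch <- function(corpus, by = "doc_id", paragraph_delim = "\\n+") {
  corpus <- as.data.frame(corpus, stringsAsFactors = FALSE)
  # Split each document's text on the delimiter regular expression
  pieces <- strsplit(as.character(corpus$text), paragraph_delim, perl = TRUE)
  n <- lengths(pieces)
  # Repeat the identifier columns once per paragraph
  out <- corpus[rep(seq_len(nrow(corpus)), n), by, drop = FALSE]
  # Number paragraphs 1..n within each document
  out$paragraph_id <- sequence(n)
  out$text <- unlist(pieces)
  rownames(out) <- NULL
  out
}

corpus <- data.frame(
  doc_id = c("1", "2"),
  text = c("Hello world.\n\nMind your business!",
           "This is an example.\n\nThis is a party!")
)
split_paragraphs_sketch(corpus)
```

With this corpus the sketch returns four rows, with paragraph_id running 1, 2 within each doc_id. Text that begins with the delimiter would produce an empty first paragraph here, which the sketch does not filter out.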