R/nlp_roll_chunks.R
Roll units (e.g. sentences) into fixed-size chunks with optional context
(RAG-style). Groups consecutive rows at the finest level of by (its last column) into chunks
and optionally adds surrounding context.
nlp_roll_chunks(corpus, by, chunk_size, context_size, id_col = "uid")

corpus: Data frame or data.table with a text column and the identifier columns specified in by.
by: Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). The last column is the level rolled into chunks (e.g. sentences).
chunk_size: Integer. Number of units per chunk.
context_size: Integer. Number of units of context around each chunk.
id_col: Character. Name of the column holding the unique chunk id (default "uid").
A data.table with id_col (pasted grouping + chunk index), grouping columns from by, and text (chunk plus context). Rows are unique on by[1] and text.
corpus <- data.frame(doc_id = c('1', '1', '2'),
                     sentence_id = c('1', '2', '1'),
                     text = c("Hello world.",
                              "This is an example.",
                              "This is a party!"))
chunks <- nlp_roll_chunks(corpus, by = c('doc_id', 'sentence_id'),
                          chunk_size = 2, context_size = 1)
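
To make the chunk and context arithmetic concrete, here is a minimal base R sketch of the rolling step. It is not the package's implementation: roll_chunks_sketch is a hypothetical helper, and it assumes non-overlapping chunks, context_size units added on each side of a chunk, a "." separator in the chunk id, and rows already in reading order. It also skips the de-duplication on by[1] and text mentioned above.

```r
# Hypothetical illustration only; names and details are assumptions.
roll_chunks_sketch <- function(corpus, by, chunk_size, context_size) {
  stopifnot(length(by) >= 2)              # sketch needs a grouping level and a unit level
  group_cols <- by[-length(by)]           # every column in `by` except the finest level
  # One key per group (e.g. per document).
  key <- do.call(paste, c(unname(as.list(corpus[group_cols])), sep = "."))
  idx_by_group <- split(seq_len(nrow(corpus)), factor(key, levels = unique(key)))

  rows <- list()
  for (k in names(idx_by_group)) {
    idx <- idx_by_group[[k]]              # row positions of this group's units
    n <- length(idx)
    starts <- seq(1, n, by = chunk_size)  # first unit of each chunk
    for (i in seq_along(starts)) {
      first <- starts[i]
      last  <- min(first + chunk_size - 1, n)
      lo    <- max(first - context_size, 1)   # extend backwards for context
      hi    <- min(last + context_size, n)    # extend forwards for context
      rows[[length(rows) + 1]] <- data.frame(
        uid = paste(k, i, sep = "."),
        corpus[idx[1], group_cols, drop = FALSE],
        text = paste(corpus$text[idx[lo:hi]], collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

demo <- data.frame(doc_id = rep("d1", 5),
                   sentence_id = as.character(1:5),
                   text = c("A.", "B.", "C.", "D.", "E."),
                   stringsAsFactors = FALSE)
res <- roll_chunks_sketch(demo, by = c("doc_id", "sentence_id"),
                          chunk_size = 2, context_size = 1)
res$text
# "A. B. C."  "B. C. D. E."  "D. E."
```

With five sentences, chunk_size = 2 and context_size = 1, the chunks cover sentences 1-2, 3-4 and 5, and each is widened by one neighbouring sentence where one exists. The real function may order, join, or label chunks differently.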