Roll units into fixed-size chunks with optional context

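Rolls units (e.g. sentences) into fixed-size chunks with optional surrounding context, as used in RAG-style retrieval. Consecutive rows at the finest level of by (its last column) are grouped into chunks of chunk_size units, and each chunk can optionally be extended with context_size units of surrounding context.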

nlp_roll_chunks(corpus, by, chunk_size, context_size, id_col = "uid")

Arguments

corpus: Data frame or data.table with a text column and the identifier columns specified in by.
by: Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). The last column is the level rolled into chunks (e.g. sentences).
chunk_size: Integer. Number of units per chunk.
context_size: Integer. Number of units of context around each chunk.
id_col: Character. Name of the column holding the unique chunk id (default "uid").

Value

A data.table containing the column named by id_col (the grouping values pasted together with the chunk index), the grouping columns from by, and text (the chunk text plus its context). Rows are unique on by[1] and text.

Examples

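# Toy corpus: two documents, one row per sentence
corpus <- data.frame(doc_id = c('1', '1', '2'),
                    sentence_id = c('1', '2', '1'),
                    text = c("Hello world.",
                             "This is an example.",
                             "This is a party!"))
# Roll sentences into chunks of 2 units, with 1 unit of context around each chunk
chunks <- nlp_roll_chunks(corpus, by = c('doc_id', 'sentence_id'),
                          chunk_size = 2, context_size = 1)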
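
To give a feel for how chunk_size and context_size might interact, here is a minimal base R sketch of the rolling idea for a single document. It is not the package's implementation: roll_chunks_sketch is a hypothetical helper, and it assumes that context means up to context_size units on each side of the chunk, clipped at the document boundaries. The actual function may treat context differently.

```r
# Hypothetical helper for illustration only (not part of the package).
# Assumption: context is taken symmetrically, up to `context_size` units
# before and after each chunk, clipped at the ends of the vector.
roll_chunks_sketch <- function(units, chunk_size, context_size = 0L) {
  n <- length(units)
  if (n == 0L) return(character(0))
  starts <- seq(1L, n, by = chunk_size)      # first unit of each chunk
  vapply(starts, function(s) {
    e  <- min(s + chunk_size - 1L, n)        # last unit of the chunk
    lo <- max(s - context_size, 1L)          # extend backwards, clipped at 1
    hi <- min(e + context_size, n)           # extend forwards, clipped at n
    paste(units[lo:hi], collapse = " ")
  }, character(1))
}

sentences <- c("A.", "B.", "C.", "D.", "E.")
rolled <- roll_chunks_sketch(sentences, chunk_size = 2, context_size = 1)
print(rolled)
# [1] "A. B. C."    "B. C. D. E." "D. E."
```

Under these assumptions, five units with chunk_size = 2 give three chunks, and each chunk's text overlaps its neighbors by one unit of context. Setting context_size = 0 in the sketch yields non-overlapping chunks.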