Search a corpus with a regular expression. Suited to finding specific strings or patterns and to producing KWIC-style (keyword-in-context) results. Returns the matches, optionally wrapped in highlight markers.

search_regex(corpus, query, by = c("doc_id"), highlight = c("<b>", "</b>"))

Arguments

corpus

Data frame or data.table with a text column and the identifier columns specified in by.

query

Regular expression to search for.

by

Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). Default c("doc_id").

highlight

Length-two character vector giving the opening and closing strings wrapped around each match (default c("<b>", "</b>")).

Value

A data.table with the columns id, the by columns, text, start, end, and pattern.
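
Examples

The idea behind a highlighted regex search can be sketched in base R. The helper below, kwic_sketch, is a hypothetical stand-in written for illustration only. It is not the implementation of search_regex, it returns a plain data.frame rather than a data.table, and it does not reproduce the id or pattern columns. Its match column (the matched substring) is an assumption of this sketch. It assumes the corpus has a column named text.

```r
# Hypothetical illustration only; NOT the package's implementation of search_regex().
kwic_sketch <- function(corpus, query, by = c("doc_id"),
                        highlight = c("<b>", "</b>")) {
  corpus <- as.data.frame(corpus, stringsAsFactors = FALSE)
  rows <- lapply(seq_len(nrow(corpus)), function(i) {
    txt <- corpus$text[i]
    m <- gregexpr(query, txt, perl = TRUE)[[1]]
    if (m[1] == -1L) return(NULL)              # no match in this text unit
    start <- as.integer(m)                     # 1-based start of each match
    end <- start + attr(m, "match.length") - 1L
    hits <- substring(txt, start, end)         # the matched substrings
    # One copy of the text per match, with that match wrapped in the markers
    marked <- vapply(seq_along(start), function(j) {
      paste0(substr(txt, 1L, start[j] - 1L),
             highlight[1], hits[j], highlight[2],
             substr(txt, end[j] + 1L, nchar(txt)))
    }, character(1))
    # Repeat the identifier columns once per match
    ids <- corpus[rep(i, length(start)), by, drop = FALSE]
    rownames(ids) <- NULL
    cbind(ids, data.frame(text = marked, start = start, end = end,
                          match = hits, stringsAsFactors = FALSE))
  })
  do.call(rbind, rows)                         # NULL if nothing matched
}

corpus <- data.frame(
  doc_id = c("a", "b"),
  text = c("The cat sat on the mat.", "No felines here."),
  stringsAsFactors = FALSE
)
res <- kwic_sketch(corpus, "[cm]at")
print(res)
```

With the package loaded, the corresponding call would presumably be search_regex(corpus, "[cm]at"), using the defaults shown in the usage line above.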