Matches exact phrases and multi-word expressions (MWEs) in a corpus, with no risk of partial matches. The corpus is tokenized, n-grams are built from the tokens, and the n-grams are exact-joined against terms, so word boundaries are respected. The n-gram range is set from the minimum and maximum word count of terms. Useful for deterministic entity extraction (e.g. before an LLM call).

search_dict(corpus, by = c("doc_id"), terms)

Arguments

corpus

Data frame or data.table with a text column and the identifier columns specified in by.

by

Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). Default c("doc_id").

terms

Character vector of terms or phrases to match exactly. The n-gram range is derived from the word counts of these terms.
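
To illustrate how an n-gram range can follow from the word counts of terms, here is a small base R sketch. It is not the package's internal code, and splitting on whitespace is an assumption about how words are counted.

```r
terms <- c("Gen Z", "Millennials", "social media")

# Count words per term by splitting on runs of whitespace (assumed tokenization)
n_words <- lengths(strsplit(trimws(terms), "\\s+"))

# The smallest and largest word counts bound the n-gram sizes worth building
ngram_range <- range(n_words)
ngram_range
```

With these terms only unigrams and bigrams would need to be built, since no term has more than two words.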

Value

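A data.table with columns id, start, end, n, ngram, and term. In the example below, id holds the doc_id value, start and end are 1-based character positions of the match in text, n is the number of words in the match, ngram is the matched text as it appears in the corpus, and term is the matched term in lower case.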

Examples

corpus <- data.frame(doc_id = "1", text = "Gen Z and Millennials use social media.")
search_dict(corpus, by = "doc_id", terms = c("Gen Z", "Millennials", "social media"))
#>        id start   end     n        ngram         term
#>    <char> <num> <num> <num>       <char>       <char>
#> 1:      1     1     5     2        Gen Z        gen z
#> 2:      1    11    21     1  Millennials  millennials
#> 3:      1    27    38     2 social media social media
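
The tokenize, build n-grams, and exact-join approach described for this function can be sketched in base R as follows. This is a hedged illustration under stated assumptions, not the package's implementation: sketch_match is a hypothetical helper, the token pattern (runs of letters, digits, or apostrophes) is assumed, and matching is assumed to be case-insensitive, as the lower-cased term column in the output above suggests.

```r
# Hypothetical helper, not part of the package: exact n-gram matching on one string.
sketch_match <- function(text, terms) {
  empty <- data.frame(start = integer(), end = integer(), n = integer(),
                      ngram = character(), term = character(),
                      stringsAsFactors = FALSE)
  # Locate tokens and their character offsets (assumed token pattern)
  m <- gregexpr("[[:alnum:]']+", text)[[1]]
  if (m[1L] == -1L) return(empty)          # no tokens found
  tokens <- regmatches(text, list(m))[[1]]
  tok_start <- as.integer(m)
  tok_end <- tok_start + attr(m, "match.length") - 1L

  # N-gram range from the min and max word count of the terms
  n_words <- lengths(strsplit(trimws(terms), "\\s+"))
  out <- list(empty)
  for (n in seq(min(n_words), max(n_words))) {
    if (n > length(tokens)) next
    i <- seq_len(length(tokens) - n + 1L)
    # Join key: the n tokens separated by single spaces, lower-cased
    key <- vapply(i, function(j) paste(tokens[j:(j + n - 1L)], collapse = " "),
                  character(1))
    out[[length(out) + 1L]] <- data.frame(
      start = tok_start[i],
      end = tok_end[i + n - 1L],
      n = n,
      ngram = substring(text, tok_start[i], tok_end[i + n - 1L]),
      term = tolower(key),
      stringsAsFactors = FALSE
    )
  }
  grams <- do.call(rbind, out)
  # Exact join: keep only n-grams whose key equals a term
  hits <- grams[grams$term %in% tolower(terms), ]
  hits[order(hits$start), ]
}

hits <- sketch_match("Antisocial media is not social media.", "social media")
hits
```

Because whole tokens are compared, "Antisocial media" does not match "social media"; only the second occurrence is returned. This is the sense in which word boundaries are respected.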