Exact phrase or multi-word expression (MWE) matcher; no partial-match risk.
Tokenizes the corpus, builds n-grams, and exact-joins them against terms. Word
boundaries are respected, so a term never matches inside a longer word. The
n-gram range is set from the minimum and maximum word counts of terms. Good for
deterministic entity extraction (e.g. before an LLM call).
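The tokenize, n-gram, exact-join idea described above can be sketched in base R. This is only an illustration, not the package's implementation: `ngram_match` is a hypothetical helper, the alphanumeric token pattern is an assumption, and lowercasing both sides before the join is inferred from the lowercased `term` column in the example output.

```r
# Hypothetical helper: a minimal sketch of exact n-gram matching on one string.
ngram_match <- function(text, terms) {
  empty <- data.frame(start = integer(), end = integer(), n = integer(),
                      ngram = character(), term = character())
  # Locate word tokens and their 1-based character offsets.
  # Assumption: a token is a run of alphanumeric characters.
  m <- gregexpr("[[:alnum:]]+", text)[[1]]
  if (m[1] == -1) return(empty)
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1L
  tokens <- tolower(substring(text, starts, ends))

  # Assumption: matching is case-insensitive, so lowercase the terms too.
  terms_lc <- tolower(trimws(terms))
  # N-gram range runs from the smallest to the largest word count in terms.
  wc <- lengths(strsplit(terms_lc, "[[:space:]]+"))

  hits <- list()
  for (n in seq(min(wc), max(wc))) {
    if (n > length(tokens)) next
    for (i in seq_len(length(tokens) - n + 1L)) {
      j <- i + n - 1L
      gram <- paste(tokens[i:j], collapse = " ")
      # Exact join: the whole n-gram must equal a term.
      if (gram %in% terms_lc) {
        hits[[length(hits) + 1L]] <- data.frame(
          start = starts[i], end = ends[j], n = n,
          ngram = substring(text, starts[i], ends[j]), term = gram
        )
      }
    }
  }
  if (length(hits) == 0L) return(empty)
  out <- do.call(rbind, hits)
  out <- out[order(out$start), ]
  rownames(out) <- NULL
  out
}

ngram_match("Gen Z and Millennials use social media.",
            c("Gen Z", "Millennials", "social media"))
```

Because whole tokens are compared, a string such as "socialmedia" yields no hit for the term "social media" in this sketch, which is the behavior the phrase "no partial-match risk" describes.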
search_dict(corpus, by = c("doc_id"), terms)
corpus: Data frame or data.table with a text column and the identifier columns specified in by.
by: Character vector of identifier columns that define the text unit (e.g. doc_id or c("url", "node_id")). Default c("doc_id").
terms: Character vector of terms or phrases to match exactly. N-gram range derived from word counts of terms.
Value: A data.table with columns id, start, end, n, ngram, term.
corpus <- data.frame(doc_id = "1", text = "Gen Z and Millennials use social media.")
search_dict(corpus, by = "doc_id", terms = c("Gen Z", "Millennials", "social media"))
#>        id start   end     n        ngram         term
#>    <char> <num> <num> <num>       <char>       <char>
#> 1:      1     1     5     2        Gen Z        gen z
#> 2:      1    11    21     1  Millennials  millennials
#> 3:      1    27    38     2 social media social media
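In the output above, start and end appear to be 1-based, inclusive character offsets into the text column. The sketch below checks that reading against the documented rows and then uses the offsets to tag each match. The `hits` data frame is typed in by hand from the example output rather than produced by `search_dict`, and the bracket-tagging step is a hypothetical downstream use.

```r
text <- "Gen Z and Millennials use social media."

# Offsets and n-grams copied by hand from the example output above.
hits <- data.frame(
  start = c(1, 11, 27),
  end   = c(5, 21, 38),
  ngram = c("Gen Z", "Millennials", "social media")
)

# Slicing the text by start/end should reproduce each matched n-gram.
recovered <- substring(text, hits$start, hits$end)
stopifnot(identical(recovered, hits$ngram))

# Hypothetical downstream step: wrap each match in brackets.
# Work from the last match to the first so earlier offsets stay valid.
tagged <- text
for (i in rev(seq_len(nrow(hits)))) {
  tagged <- paste0(
    substr(tagged, 1, hits$start[i] - 1),
    "[", hits$ngram[i], "]",
    substr(tagged, hits$end[i] + 1, nchar(tagged))
  )
}
tagged
```

Editing right to left avoids recomputing offsets after each insertion, which matters once the tagged text is longer than the original.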