Searches a text corpus for specified patterns, with support for parallel processing.
A data frame or data.table containing the text corpus.
A character vector naming the column(s) by which to group the data, e.g. c('doc_id', 'sentence_id').
The search pattern or query.
Numeric, default 0. The number of sentences of context to include around each match.
Logical, default FALSE. Indicates whether the search should be inline.
A character vector of length two, default c('<b>', '</b>'). The opening and closing strings wrapped around each match in the returned text.
Numeric, default 1. The number of cores to use for parallel processing.
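A data.table with the search results. In the example below, each row holds the grouping columns, the highlighted text, the start and end positions of the match, and the matched pattern.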
tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.",
                           "This is an example.",
                           "This is a party!"))
sem_search_corpus(tif, search = 'This is', text_hierarchy = c('doc_id', 'sentence_id'))
#>    doc_id sentence_id                       text start   end pattern pattern2
#>    <char>      <char>                     <char> <int> <int>  <char>   <lgcl>
#> 1:      1           2 <b>This is</b> an example.     1     7 This is       NA
#> 2:      2           1    <b>This is</b> a party!     1     7 This is       NA
#>       pos
#>    <lgcl>
#> 1:     NA
#> 2:     NA
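
To make the start, end, and highlighted text columns concrete, the following base-R sketch approximates the result above for a single fixed pattern. It is not the package's implementation: the names pattern, tags, and hits are illustrative, it finds only the first match per row, and it ignores context size, inline search, and parallel processing.

```r
# Illustrative base-R approximation; NOT how sem_search_corpus is implemented.
tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.",
                           "This is an example.",
                           "This is a party!"),
                  stringsAsFactors = FALSE)

pattern <- 'This is'          # literal search string (assumed fixed, not a regex)
tags <- c('<b>', '</b>')      # hypothetical stand-in for the highlight argument

# Locate the first match in each row; regexpr() returns -1 where there is none.
m <- regexpr(pattern, tif$text, fixed = TRUE)
pos <- as.integer(m)
len <- attr(m, 'match.length')
keep <- pos > 0

# Keep only matching rows and record 1-based start and end character positions.
hits <- tif[keep, ]
hits$start <- pos[keep]
hits$end <- hits$start + len[keep] - 1L

# Wrap the matched span in the opening and closing tags.
hits$text <- sub(pattern, paste0(tags[1], pattern, tags[2]),
                 hits$text, fixed = TRUE)
hits$pattern <- pattern
hits
```

For this input the sketch keeps the two rows containing 'This is', with start 1 and end 7, which agrees with the corresponding columns in the output above.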