Searches a text corpus for specified patterns, with support for parallel processing.
A data frame or data.table containing the text corpus.
A character vector indicating the column(s) by which to group the data.
The search pattern or query.
Numeric, default 0. The number of sentences of surrounding context to include with each match.
Logical, default FALSE. Indicates if the search should be inline.
A character vector of length two, default c('<b>', '</b>'). The opening and closing tags wrapped around each match in the returned text.
Numeric, default 1. The number of cores to use for parallel processing.
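A data.table of search results: the grouping columns, the text with matches highlighted, the start and end character positions of the match, and the matched pattern.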
tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.",
                           "This is an example.",
                           "This is a party!"))
sem_search_corpus(tif, search = 'This is', text_hierarchy = c('doc_id', 'sentence_id'))
#>    doc_id sentence_id                       text start   end pattern pattern2
#>    <char>      <char>                     <char> <int> <int>  <char>   <lgcl>
#> 1:      1           2 <b>This is</b> an example.     1     7 This is       NA
#> 2:      2           1    <b>This is</b> a party!     1     7 This is       NA
#>       pos
#>    <lgcl>
#> 1:     NA
#> 2:     NA
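
The highlighting and the start/end columns in the output above can be approximated in base R. The following is only a rough sketch under stated assumptions, not the package's implementation: `highlight_matches` is a hypothetical helper, it treats the pattern as a literal string, and it reports only the first match in each sentence.

```r
# Hypothetical helper (not part of the package): literal-string search
# with highlighting, using base R only.
highlight_matches <- function(df, pattern, highlight = c("<b>", "</b>")) {
  m <- regexpr(pattern, df$text, fixed = TRUE)  # -1 where there is no match
  pos <- as.integer(m)                          # start position of first match
  len <- attr(m, "match.length")                # length of first match
  hits <- pos > 0
  out <- df[hits, , drop = FALSE]
  out$start <- pos[hits]
  out$end <- pos[hits] + len[hits] - 1L
  # Wrap the first occurrence of the pattern in the highlight tags
  out$text <- sub(pattern,
                  paste0(highlight[1], pattern, highlight[2]),
                  out$text, fixed = TRUE)
  out$pattern <- pattern
  out
}

tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.",
                           "This is an example.",
                           "This is a party!"),
                  stringsAsFactors = FALSE)

res <- highlight_matches(tif, 'This is')
```

Here `res` keeps only the two sentences that contain 'This is', with the match wrapped in the default tags. Unlike the function documented above, this sketch adds no context sentences, does no grouping, and runs on a single core.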