NLP Search Corpus — sem_search

Searches a text corpus for specified patterns, with support for parallel processing.

sem_search_corpus(
  tif,
  text_hierarchy = c("doc_id", "paragraph_id", "sentence_id"),
  search,
  context_size = 0,
  is_inline = FALSE,
  highlight = c("<b>", "</b>"),
  cores = 1
)

Arguments

tif: A data frame or data.table containing the text corpus.
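text_hierarchy: A character vector naming the column(s) by which to group the data. Default c("doc_id", "paragraph_id", "sentence_id"); pass a shorter vector when tif lacks one of these columns, as in the example.
search: Character. The pattern or query to search for in the text. Required; it has no default.
context_size: Numeric, default 0. The number of sentences of context to include around each match.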
is_inline: Logical, default FALSE. Indicates if the search should be inline.
highlight: A character vector of length two, default c('<b>', '</b>'). Used to highlight the found patterns in the text.
cores: Numeric, default 1. The number of cores to use for parallel processing.

Value

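A data.table with the search results. As the example shows, it carries the text_hierarchy columns, the text with each match wrapped in the highlight strings, the start and end character positions of the match, and the pattern, pattern2, and pos columns.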

Examples

tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.",
                           "This is an example.",
                           "This is a party!"))
sem_search_corpus(tif, search = 'This is', text_hierarchy = c('doc_id', 'sentence_id'))
#>    doc_id sentence_id                       text start   end pattern pattern2
#>    <char>      <char>                     <char> <int> <int>  <char>   <lgcl>
#> 1:      1           2 <b>This is</b> an example.     1     7 This is       NA
#> 2:      2           1    <b>This is</b> a party!     1     7 This is       NA
#>       pos
#>    <lgcl>
#> 1:     NA
#> 2:     NA
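
To make the start, end, and highlighted text columns in the output above easier to read, here is a minimal base-R sketch of how such values can be derived for a literal pattern. It is an illustration only and not the package's implementation: highlight_first() is a hypothetical helper, it handles only the first match per string, and it does not cover context_size, is_inline, or cores.

```r
# Illustrative base-R sketch; NOT how sem_search_corpus() is implemented.
# highlight_first() is a hypothetical helper name used only for this example.
highlight_first <- function(text, pattern, highlight = c("<b>", "</b>")) {
  # Locate the first literal (non-regex) occurrence of the pattern in each string
  m <- regexpr(pattern, text, fixed = TRUE)
  start <- as.integer(m)                          # -1 where there is no match
  end <- start + attr(m, "match.length") - 1L     # inclusive end position
  hit <- start > 0

  # Keep only the strings that contain the pattern
  out <- data.frame(text = text, start = start, end = end,
                    stringsAsFactors = FALSE)[hit, , drop = FALSE]

  # Wrap the matched span in the opening and closing highlight strings
  out$text <- paste0(
    substr(out$text, 1, out$start - 1),
    highlight[1],
    substr(out$text, out$start, out$end),
    highlight[2],
    substr(out$text, out$end + 1, nchar(out$text))
  )
  out$pattern <- rep(pattern, nrow(out))
  rownames(out) <- NULL
  out
}

res <- highlight_first(
  c("Hello world.", "This is an example.", "This is a party!"),
  "This is"
)
print(res)
```

For the three example sentences, this sketch yields start = 1 and end = 7 for both matching sentences, which agrees with the start and end columns shown in the output above.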