Tokenize Data Frame by Specified Column(s) — nlp_melt_tokens

This function collects the tokens held in a specified column of a data frame into groups defined by one or more parent columns, such as document and sentence identifiers.

nlp_melt_tokens(
  df,
  melt_col = "token",
  parent_cols = c("doc_id", "sentence_id")
)

Arguments

df: A data frame containing the tokens to be grouped, with one token per row.
melt_col: The name of the column in `df` that contains the tokens. Defaults to "token".
parent_cols: A character vector naming the column(s) in `df` by which to group the tokens. Defaults to c("doc_id", "sentence_id").

Value

A list of vectors, each containing the tokens of one group defined by the `parent_cols` parameter.

Examples

# One row per token: a single document containing two sentences
dtm <- data.frame(doc_id = as.character(c(1, 1, 1, 1, 1, 1, 1, 1)),
                  sentence_id = as.character(c(1, 1, 1, 2, 2, 2, 2, 2)),
                  token = c("Hello", "world", ".", "This", "is", "an", "example", "."))

# Group the tokens by document and sentence
tokens <- nlp_melt_tokens(dtm, melt_col = "token", parent_cols = c("doc_id", "sentence_id"))
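
To illustrate the kind of grouping described above, the following is a minimal base-R sketch. It is not the package's implementation: the helper name `melt_tokens_sketch`, the "." separator used to build group names, and the ordering of groups by first appearance are all assumptions made for illustration, and the list returned by `nlp_melt_tokens` may be named or ordered differently.

```r
# Toy input mirroring the example above: one row per token
toy <- data.frame(
  doc_id = rep("1", 8),
  sentence_id = c("1", "1", "1", "2", "2", "2", "2", "2"),
  token = c("Hello", "world", ".", "This", "is", "an", "example", "."),
  stringsAsFactors = FALSE
)

# Hypothetical helper (illustration only, not part of the package)
melt_tokens_sketch <- function(df,
                               melt_col = "token",
                               parent_cols = c("doc_id", "sentence_id")) {
  # Build one grouping key per row by pasting the parent columns together
  key <- do.call(paste, c(df[parent_cols], sep = "."))
  # Keep groups in order of first appearance rather than sorted order
  key <- factor(key, levels = unique(key))
  # Split the token column into one vector per group
  split(df[[melt_col]], key)
}

tokens_sketch <- melt_tokens_sketch(toy)
str(tokens_sketch)
```

In this sketch each list element holds the tokens of one document-sentence pair, so passing `parent_cols = "doc_id"` would instead yield a single vector per document.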