MeSH Descriptor Keyness for a Retrieved Corpus

Scores the MeSH descriptors of a retrieved corpus against PubMed-wide descriptor frequencies, identifying the terms that are over- or under-represented relative to PubMed as a whole. This is a local transform of the pubmed_abstracts output and makes no API calls. It is intended to characterise a corpus and to guide search refinement and expansion.

mesh_keyness(
  records,
  frequencies = NULL,
  measure = c("log_odds", "g2"),
  smoothing = 0.5,
  min_count = 1L
)

Arguments

records: A pubmed_abstracts table from get_records(endpoint = "pubmed_abstracts") (with its annotations list-column), or a long data.frame already exposing pmid and DescriptorUI (optionally DescriptorName and a type column, in which case only type == "MeSH" rows are used).
frequencies: Baseline descriptor frequencies. If NULL (the default), the bundled data_mesh_frequencies is used; a user-supplied table must contain DescriptorUI, n_pmids, and prop_total.
measure: Keyness statistic: "log_odds" (default) for a Haldane-corrected log odds ratio with standard error and z-score, or "g2" for the signed Dunning log-likelihood ratio.
smoothing: Positive continuity correction added to each cell of the 2x2 incidence table for measure = "log_odds" (default 0.5, the Haldane-Anscombe correction).
min_count: Drop descriptors indexed in fewer than min_count corpus PMIDs before scoring (default 1).
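
As a rough illustration of the long-format input accepted by records and of what min_count filters on, the base R sketch below builds a small table and counts distinct PMIDs per descriptor. Everything in it is hypothetical: the PMIDs, the placeholder descriptor UIs and names, and the "Other" value of type are invented, and the counting logic is an assumption about the behaviour described above rather than the package's own code.

```r
# Hypothetical long-format input: one row per (pmid, annotation) pair.
# UIs, names, and the "Other" type value are placeholders, not real data.
long_records <- data.frame(
  pmid           = c("101", "101", "102", "102", "103", "103", "103"),
  DescriptorUI   = c("D_A", "D_B", "D_A", "D_C", "D_A", "D_A", "X_1"),
  DescriptorName = c("Descriptor A", "Descriptor B", "Descriptor A",
                     "Descriptor C", "Descriptor A", "Descriptor A",
                     "Non-MeSH annotation"),
  type           = c("MeSH", "MeSH", "MeSH", "MeSH", "MeSH", "MeSH", "Other"),
  stringsAsFactors = FALSE
)

# Keep only MeSH rows, as described for inputs that carry a type column.
mesh_rows <- long_records[long_records$type == "MeSH", ]

# Document incidence: each (pmid, descriptor) pair counts once,
# so the repeated D_A row for PMID 103 is collapsed.
mesh_rows <- unique(mesh_rows[, c("pmid", "DescriptorUI")])

# Number of distinct corpus PMIDs indexed with each descriptor.
corpus_count <- table(mesh_rows$DescriptorUI)

# Assumption: the corpus size is the number of distinct PMIDs in the input.
corpus_total <- length(unique(long_records$pmid))

# Drop descriptors seen in fewer than min_count corpus PMIDs.
min_count <- 2L
kept <- corpus_count[corpus_count >= min_count]
print(kept)
```

Raising min_count in this way removes descriptors that appear in only a handful of corpus records, whose scores tend to be unstable.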

Value

A data.table, one row per scored descriptor, ordered by keyness (descending). Common columns: DescriptorUI, DescriptorName, corpus_count, corpus_total, corpus_prop, baseline_count, baseline_total, baseline_prop, and direction ("over"/"under"). With measure = "log_odds": log_odds, std_error, z. With measure = "g2": g2.

Details

Keyness is computed on document incidence: for each descriptor, the number of distinct corpus PMIDs indexed with it is compared against the number of distinct PubMed PMIDs indexed with it (data_mesh_frequencies).
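
The two statistics can be sketched for a single descriptor as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the counts are invented, the corpus and the baseline are treated as two separate rows of a 2x2 table, and the G2 shown is the four-cell form of the log-likelihood ratio, which may differ from the variant used internally. Variable names follow the columns listed under Value.

```r
# Hypothetical incidence counts for one descriptor.
corpus_count   <- 40       # corpus PMIDs indexed with the descriptor
corpus_total   <- 200      # PMIDs in the corpus
baseline_count <- 5000     # PubMed PMIDs indexed with the descriptor
baseline_total <- 1e6      # PubMed PMIDs in the baseline
smoothing      <- 0.5      # Haldane-Anscombe continuity correction

# --- measure = "log_odds": smoothed log odds ratio, standard error, z ---
n11 <- corpus_count + smoothing                      # corpus, with descriptor
n12 <- corpus_total - corpus_count + smoothing       # corpus, without
n21 <- baseline_count + smoothing                    # baseline, with
n22 <- baseline_total - baseline_count + smoothing   # baseline, without

log_odds  <- log((n11 / n12) / (n21 / n22))
std_error <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
z         <- log_odds / std_error

# --- measure = "g2": signed log-likelihood ratio (four-cell form) ---
observed <- matrix(
  c(corpus_count,   corpus_total - corpus_count,
    baseline_count, baseline_total - baseline_count),
  nrow = 2, byrow = TRUE
)
# Expected counts under independence of row (corpus/baseline) and column.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Cells with zero observations contribute nothing to the sum.
terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
g2_unsigned <- 2 * sum(terms)

# Sign the statistic by the direction of the difference in proportions.
corpus_prop   <- corpus_count / corpus_total
baseline_prop <- baseline_count / baseline_total
g2 <- sign(corpus_prop - baseline_prop) * g2_unsigned

direction <- if (corpus_prop > baseline_prop) "over" else "under"
print(c(log_odds = log_odds, std_error = std_error, z = z, g2 = g2))
```

In this sketch, the smoothing term keeps the log odds ratio finite when a cell of the table is zero, and the sign on G2 distinguishes over-represented from under-represented descriptors, which the unsigned statistic cannot do.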

Examples

if (FALSE) { # \dontrun{
pmids   <- search_pubmed('"doxorubicin"[TiAb] AND "cardiotoxicity"[TiAb]')
records <- get_records(pmids, endpoint = "pubmed_abstracts")

mesh_keyness(records)                       # most over-represented descriptors
mesh_keyness(records, measure = "g2")
} # }