Counts pairs of biomedical entities that co-occur within the same sentence (window = 0) or within window sentences of each other, using the sentence-mapped annotation table returned by pubtator_sentences. Co-occurrence is computed within each pmid/tiab passage: title and abstract are treated separately because their sentence offsets are numbered independently.

pubtator_cooccurrence(
  mapped,
  window = 0L,
  by = c("type", "entity"),
  evidence = FALSE
)

Arguments

mapped

A data.table returned by pubtator_sentences. Must contain pmid, tiab, type, identifier, text, sentence_id, and sentence columns.

window

Non-negative integer sentence distance. 0 (default) counts entities in the same sentence; n counts entities whose sentences are at most n apart within the same passage.

by

One of "type" (default) or "entity". "type" aggregates counts by entity-type pair; "entity" aggregates by the specific (type, identifier, text) pair. Ignored when evidence = TRUE.

evidence

Logical. When FALSE (default), returns aggregated counts. When TRUE, returns the supporting sentence context for each co-occurring pair, so counts can be traced back to concrete text.

Value

A data.table. With evidence = FALSE and by = "type": type_x, type_y, n (co-occurrence instances), and n_pmids (distinct documents), ordered by n. With by = "entity": the same plus identifier_x/text_x/identifier_y/text_y. With evidence = TRUE: one row per distinct context string for an entity pair (identical contexts de-duplicated), with pmid, tiab, the two entities' type/identifier/text, and context.

Details

Entities are de-duplicated to one mention per sentence before pairing. Only self-pairs (identical type, identifier, and text on both sides) are dropped, so same-type pairs between two distinct entities (e.g. two different genes) are retained.
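The de-duplication and self-pair rules can be illustrated on a toy sentence. This is a minimal base-R sketch, not the package's implementation: the data are invented, the helper column key is hypothetical, and it assumes each unordered pair is kept once.

```r
# Toy mentions from a single sentence. Column names follow the `mapped`
# table; the values are made up for illustration.
mentions <- data.frame(
  sentence_id = c(1, 1, 1, 1),
  type        = c("Gene", "Gene", "Gene", "Disease"),
  identifier  = c("7157", "7157", "672", "MESH:D009369"),
  text        = c("TP53", "TP53", "BRCA1", "cancer"),
  stringsAsFactors = FALSE
)

# One mention per entity per sentence: the repeated TP53 row collapses.
dedup <- unique(mentions)

# Hypothetical helper key identifying an entity by (type, identifier, text).
dedup$key <- paste(dedup$type, dedup$identifier, dedup$text, sep = "|")

# Self-join on sentence_id to form all candidate pairs, including self-pairs.
pairs <- merge(dedup, dedup, by = "sentence_id", suffixes = c("_x", "_y"))

# Strict key ordering drops self-pairs and keeps each unordered pair once
# (assumption: the real function may orient or order pairs differently).
pairs <- pairs[pairs$key_x < pairs$key_y, ]

# Three pairs remain, including the same-type TP53/BRCA1 Gene pair.
pairs[, c("type_x", "text_x", "type_y", "text_y")]
```

The Gene-Gene pair survives because TP53 and BRCA1 differ in identifier and text; only a pairing of TP53 with itself would be removed.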

Counting follows windowed-collocation semantics: a pair contributes one instance for each pair of mentions within window sentences of each other. At window = 0 this is simply one instance per shared sentence, but for window > 0 a pair recurring across several sentences yields multiple instances, so n scales with mention frequency. n_pmids counts each document once no matter how many instances it contributes, which makes it the more conservative signal.
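A toy passage shows how the instance count grows with the window while the document count does not. This is a minimal base-R sketch of the counting rule as described here, not the package's code: count_pairs and the single entity column are hypothetical simplifications, and the data are invented.

```r
# One abstract in which TP53 is mentioned in sentences 1-3 and "cancer"
# only in sentence 2 (already de-duplicated to one mention per sentence).
mentions <- data.frame(
  pmid        = "1",
  tiab        = "abstract",
  sentence_id = c(1, 2, 3, 2),
  entity      = c("TP53", "TP53", "TP53", "cancer"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count pair instances within `window` sentences.
count_pairs <- function(m, window = 0L) {
  # Pair mentions only within the same pmid/tiab passage.
  p <- merge(m, m, by = c("pmid", "tiab"), suffixes = c("_x", "_y"))
  # Keep each unordered pair of distinct entities once, within the window.
  keep <- p$entity_x < p$entity_y &
    abs(p$sentence_id_x - p$sentence_id_y) <= window
  p <- p[keep, ]
  c(n = nrow(p), n_pmids = length(unique(p$pmid)))
}

count_pairs(mentions, window = 0L)  # n = 1, n_pmids = 1
count_pairs(mentions, window = 1L)  # n = 3, n_pmids = 1
```

At window = 1 each of the three TP53 mentions falls within one sentence of the single cancer mention, so the pair is counted three times even though only one document supports it.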

Examples

if (FALSE) { # \dontrun{
pmids <- search_pubmed('"biomarker"[TiAb] AND "cancer"[TiAb]')

mapped <- pmids |>
  get_records(endpoint = "pubtations") |>
  pubtator_sentences()

# same-sentence entity-type co-occurrence
mapped |> pubtator_cooccurrence(window = 0, by = "type")

# specific entity pairs within one sentence on either side
mapped |> pubtator_cooccurrence(window = 1, by = "entity")

# traceable evidence: distinct sentence contexts for each co-occurring pair
mapped |> pubtator_cooccurrence(window = 0, evidence = TRUE)
} # }