R/pubtator_cooccurrence.R
pubtator_cooccurrence.Rd
Counts pairs of biomedical entities that co-occur within the same sentence
(window = 0) or within window sentences of each other, using the
sentence-mapped annotation table returned by pubtator_sentences.
Co-occurrence is computed within each pmid/tiab passage: title
and abstract are treated separately because their sentence offsets are
numbered independently.
pubtator_cooccurrence(
  mapped,
  window = 0L,
  by = c("type", "entity"),
  evidence = FALSE
)
mapped
A data.table returned by
pubtator_sentences. Must contain pmid, tiab,
type, identifier, text, sentence_id, and
sentence columns.
window
Non-negative integer sentence distance. 0 (default)
counts entities in the same sentence; n counts entities whose
sentences are at most n apart within the same passage.
by
One of "type" (default) or "entity". "type"
aggregates counts by entity-type pair; "entity" aggregates by pair of
specific entities, each identified by its (type, identifier, text).
Ignored when evidence = TRUE.
evidence
Logical. When FALSE (default), returns aggregated
counts. When TRUE, returns the supporting sentence context for
each co-occurring pair, so counts can be traced back to concrete text.
A data.table. With evidence = FALSE and
by = "type": type_x, type_y, n (co-occurrence
instances), and n_pmids (distinct documents), ordered by n.
With by = "entity": the same plus
identifier_x/text_x/identifier_y/text_y. With
evidence = TRUE: one row per distinct context string for an
entity pair (identical contexts de-duplicated), with pmid,
tiab, the two entities' type/identifier/text,
and context.
Entities are de-duplicated to one mention per sentence before pairing.
Only pairs of an entity with itself (identical type, identifier,
and text) are dropped; same-type pairs between two distinct
entities (e.g. two different genes) are retained.
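The de-duplication and self-pair rules can be illustrated on a toy table. This is a minimal base-R sketch, not the package's implementation: the rows are made-up values, and keeping each unordered pair once (via the `key_x < key_y` comparison) is an assumption made for the illustration.

```r
# Hypothetical stand-in for a sentence-mapped table: TP53 is annotated twice
# in the same sentence, alongside a second gene and a disease.
mapped <- data.frame(
  pmid        = c(1, 1, 1, 1),
  tiab        = c("abstract", "abstract", "abstract", "abstract"),
  sentence_id = c(1, 1, 1, 1),
  type        = c("Gene", "Gene", "Gene", "Disease"),
  identifier  = c("7157", "7157", "672", "MESH:D001943"),
  text        = c("TP53", "TP53", "BRCA1", "breast cancer"),
  stringsAsFactors = FALSE
)

# Step 1: keep one mention of each entity per sentence.
mentions <- unique(mapped)

# Step 2: pair every mention with every other mention in the same sentence.
pairs <- merge(mentions, mentions,
               by = c("pmid", "tiab", "sentence_id"),
               suffixes = c("_x", "_y"))

# Step 3: build one key per entity; the strict comparison drops self-pairs
# and keeps each unordered pair once (an assumption of this sketch).
key_x <- paste(pairs$type_x, pairs$identifier_x, pairs$text_x, sep = "|")
key_y <- paste(pairs$type_y, pairs$identifier_y, pairs$text_y, sep = "|")
pairs <- pairs[key_x < key_y, ]

nrow(pairs)  # three pairs among three distinct entities
```

The duplicate TP53 mention collapses in step 1, and the TP53/BRCA1 pair survives step 3 even though both entities have type "Gene".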
Counting follows windowed-collocation semantics: a pair contributes one
instance for each pair of mentions within window sentences of each
other. At window = 0 this is simply one instance per shared sentence,
but for window > 0 a pair recurring across several sentences yields
multiple instances, so counts scale with mention frequency. n_pmids
(distinct documents) is unaffected and is the more conservative signal.
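The difference between n and n_pmids can be checked by hand on a small case. The sketch below is an illustration in base R under stated assumptions, not the package's code: `count_instances()` is a hypothetical helper, and the mentions table is invented (entity A in sentences 1 and 2 of one abstract, entity B in sentences 2 and 3).

```r
# Hypothetical, already de-duplicated mentions from a single abstract.
mentions <- data.frame(
  pmid        = c(1, 1, 1, 1),
  tiab        = "abstract",
  sentence_id = c(1, 2, 2, 3),
  entity      = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count mention pairs whose sentences are at most
# `window` apart within the same pmid/tiab passage.
count_instances <- function(mentions, window) {
  p <- merge(mentions, mentions, by = c("pmid", "tiab"),
             suffixes = c("_x", "_y"))
  # Keep each pair of distinct entities once and apply the distance limit.
  keep <- p$entity_x < p$entity_y &
    abs(p$sentence_id_x - p$sentence_id_y) <= window
  p <- p[keep, ]
  c(n = nrow(p), n_pmids = length(unique(p$pmid)))
}

count_instances(mentions, window = 0)  # n = 1, n_pmids = 1
count_instances(mentions, window = 1)  # n = 3, n_pmids = 1
```

At window = 1 the mention pairs (1, 2), (2, 2), and (2, 3) qualify while (1, 3) is two sentences apart, so n rises from 1 to 3 although only one document supports the pair.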
if (FALSE) { # \dontrun{
  pmids <- search_pubmed('"biomarker"[TiAb] AND "cancer"[TiAb]')
  mapped <- pmids |>
    get_records(endpoint = "pubtations") |>
    pubtator_sentences()

  # same-sentence entity-type co-occurrence
  mapped |> pubtator_cooccurrence(window = 0, by = "type")

  # specific entity pairs within one sentence on either side
  mapped |> pubtator_cooccurrence(window = 1, by = "entity")

  # traceable evidence: each pair with its distinct sentence contexts
  mapped |> pubtator_cooccurrence(window = 0, evidence = TRUE)
} # }