vignettes/citation-snowball.Rmd
A keyword search only finds papers that use your keywords; it misses related work phrased in different terms. Citation snowballing follows citation links outward from a set of seed papers to surface those neighbors regardless of vocabulary – a standard supplementary search step in evidence synthesis.
This vignette expands a seed corpus with
citation_snowball(), inspects why each paper was admitted,
and characterizes the expansion space (the citation-adjacent
literature the query never returned) with MeSH keyness. Both
citation_snowball() and mesh_keyness() are
part of puremoe’s local analysis layer: transforms over
tables get_records() already returned, with no further API
calls. Snowballing proposes candidates to screen; it does not replace manual
review. Note that iCite links cover PubMed-indexed articles only, so
snowballing inherits PubMed’s coverage.
Search PubMed, then pull iCite records for the hits. Snowballing uses
the icites endpoint specifically: each record carries a
citation_net of that paper’s references and citing
papers.
library(puremoe)
library(dplyr) # count(), filter(), left_join(), arrange(), select(), mutate()

pmids <- search_pubmed('"political ideology"[TiAb]')
length(pmids)
#> [1] 963

seed_icites <- get_records(pmids, endpoint = "icites", cores = 1L, sleep = 0.25)

citation_snowball() walks the links already in the iCite
response (no extra API call). direction = "both" looks
backward (papers the seeds cite) and forward (papers that cite the
seeds); min_links admits a candidate only if it connects to
at least that many seeds.
snowball <- seed_icites |>
  citation_snowball(direction = "both", min_links = 2)

snowball |>
  count(seed) # seeds vs newly discovered candidates
#>      seed     n
#>    <lgcl> <int>
#> 1:  FALSE  1049
#> 2:   TRUE   951

Every row carries its provenance: seed,
cited_links (seeds that cite it), citing_links
(seeds it cites), and link_count (the total used for ranking and for the
min_links gate). Candidates are, by construction, papers
the keyword query did not return.
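
The gate and the audit columns can be illustrated on a toy edge list in base R. This is a minimal sketch of the idea, not the package's implementation: the PMIDs are made up, and it assumes that link_count is simply cited_links plus citing_links.

```r
# Toy citation edges: each row means `from` cites `to` (hypothetical IDs).
edges <- data.frame(
  from = c("s1", "s2", "s3", "a",  "a",  "b",  "s1"),
  to   = c("x",  "x",  "y",  "s1", "s2", "s3", "s2"),
  stringsAsFactors = FALSE
)
seeds <- c("s1", "s2", "s3")

# Backward links: a seed cites a non-seed paper.
cited <- unique(edges[edges$from %in% seeds & !(edges$to %in% seeds), ])
# Forward links: a non-seed paper cites a seed.
citing <- unique(edges[edges$to %in% seeds & !(edges$from %in% seeds), ])

candidates <- sort(unique(c(cited$to, citing$from)))

toy <- data.frame(
  pmid = candidates,
  # seeds that cite the candidate
  cited_links = vapply(candidates, function(p) sum(cited$to == p),
                       integer(1), USE.NAMES = FALSE),
  # seeds the candidate cites
  citing_links = vapply(candidates, function(p) sum(citing$from == p),
                        integer(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Assumption: the ranking total is the sum of both link types.
toy$link_count <- toy$cited_links + toy$citing_links

# Apply the gate: keep candidates linked to at least `min_links` seeds.
min_links <- 2
admitted <- toy[toy$link_count >= min_links, ]
admitted
```

In this toy graph, `x` is admitted because two seeds cite it and `a` because it cites two seeds, while `b` and `y` each touch only one seed and are dropped. The seed-to-seed edge (`s1` cites `s2`) contributes nothing, since neither end is a candidate.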
Fetch metadata – including MeSH – once for the whole snowballed corpus (seeds plus candidates) and reuse it below. Joining the candidates back to their titles shows what the snowball surfaced.
corpus_meta <- snowball$pmid |>
  get_records(endpoint = "pubmed_abstracts", cores = 1L, sleep = 0.25)

corpus_meta |>
  left_join(snowball, by = "pmid") |>
  filter(!seed) |>
  arrange(desc(link_count)) |>
  select(pmid, year, journal, articletitle,
         cited_links, citing_links, link_count) |>
  head(25) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Raw MeSH counts are dominated by terms common everywhere
(Humans). mesh_keyness() compares each
descriptor’s document rate in a corpus to its PubMed-wide rate from
data_mesh_frequencies, returning log-odds or signed G2
scores. Here we use the default log-odds score and inspect the
over-represented descriptors.
The seed corpus alone should surface the obvious topic terms, a sanity check that keyness behaves.
pass1 <- corpus_meta |>
  filter(pmid %in% pmids) |>
  mesh_keyness(min_count = 3L)

pass1 |>
  filter(direction == "over") |>
  arrange(desc(z)) |>
  select(DescriptorName, corpus_count, corpus_prop, baseline_prop, log_odds, z) |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  head(15) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Recent papers may not be MeSH-indexed yet, and the baseline is a snapshot. Keyness describes the corpus; it does not judge relevance.
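
To make the log-odds idea concrete, the sketch below compares a descriptor's document rate in a corpus with its rate in a baseline using a standard 2x2 log-odds ratio. It is an illustration under stated assumptions, not mesh_keyness() itself: the helper `toy_log_odds()` is hypothetical, the 0.5 continuity correction and Wald z-score are common textbook choices that the package may or may not use, and all counts are invented.

```r
# Hypothetical helper: log-odds of a descriptor in a corpus vs a baseline.
# A 0.5 continuity correction keeps the ratio finite for zero counts.
toy_log_odds <- function(term, corpus_count, corpus_n,
                         baseline_count, baseline_n) {
  with_c    <- corpus_count + 0.5               # corpus docs with the term
  without_c <- corpus_n - corpus_count + 0.5    # corpus docs without it
  with_b    <- baseline_count + 0.5             # baseline docs with the term
  without_b <- baseline_n - baseline_count + 0.5

  log_odds <- log(with_c / without_c) - log(with_b / without_b)
  se <- sqrt(1 / with_c + 1 / without_c + 1 / with_b + 1 / without_b)

  data.frame(
    term = term,
    corpus_prop = corpus_count / corpus_n,
    baseline_prop = baseline_count / baseline_n,
    log_odds = log_odds,
    z = log_odds / se,                          # Wald z-score
    direction = ifelse(log_odds > 0, "over", "under"),
    stringsAsFactors = FALSE
  )
}

# Invented counts: a topic-specific term and a ubiquitous one.
toy_keyness <- toy_log_odds(
  term = c("topic-specific term", "ubiquitous term"),
  corpus_count = c(40, 900),
  corpus_n = 1000,
  baseline_count = c(2000, 900000),
  baseline_n = 1e6
)
toy_keyness
```

The ubiquitous term appears in 90% of documents in both the corpus and the baseline, so its log-odds is close to zero even though its raw count is the highest. The topic-specific term is rare in the baseline and comparatively frequent in the corpus, so it gets a large positive score.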
direction matches the search intent:
"cited" finds shared foundational references,
"citing" finds later work building on the corpus, and
"both" casts the widest net.
backward <- citation_snowball(seed_icites, direction = "cited", min_links = 2)
forward <- citation_snowball(seed_icites, direction = "citing", min_links = 2)

data.frame(
  direction = c("cited (foundational)", "citing (downstream)"),
  n_candidates = c(sum(!backward$seed), sum(!forward$seed))
)
#>              direction n_candidates
#> 1 cited (foundational)         1049
#> 2  citing (downstream)         1049

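
Candidate counts alone do not show whether the two directions returned the same papers. A quick set comparison on the pmid column answers that. The sketch below uses two small made-up data frames standing in for `backward` and `forward`; only the pmid and seed columns shown above are assumed.

```r
# Toy stand-ins for the two snowball results (hypothetical PMIDs).
backward_toy <- data.frame(
  pmid = c("1", "2", "3", "4"),
  seed = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
forward_toy <- data.frame(
  pmid = c("1", "3", "5"),
  seed = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only newly discovered candidates from each direction.
cited_ids  <- backward_toy$pmid[!backward_toy$seed]
citing_ids <- forward_toy$pmid[!forward_toy$seed]

# How many candidates are unique to each direction, and how many are shared?
overlap <- c(
  cited_only  = length(setdiff(cited_ids, citing_ids)),
  citing_only = length(setdiff(citing_ids, cited_ids)),
  shared      = length(intersect(cited_ids, citing_ids))
)
overlap
```

Swapping in `backward` and `forward` for the toy frames gives the same breakdown for the real results.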
Re-seed by feeding the expanded PMIDs back through the iCite endpoint and snowballing again; each hop keeps the same audit columns.
hop2 <- snowball$pmid |>
  get_records(endpoint = "icites", cores = 1L, sleep = 0.25) |>
  citation_snowball(direction = "both", min_links = 3, max_nodes = 500)

citation_snowball() turns an iCite response into a
ranked, auditable candidate set: it finds citation-adjacent papers a
keyword query misses, the audit columns document why each was admitted,
and MeSH keyness against data_mesh_frequencies
characterizes the expansion space. It complements keyword search and
manual screening rather than replacing them.