Citation Snowballing for Literature Discovery

A keyword search only finds papers that use your keywords; it misses related work phrased in different terms. Citation snowballing follows citation links outward from a set of seed papers to surface those neighbors regardless of vocabulary – a standard supplementary search step in evidence synthesis.

This vignette expands a seed corpus with citation_snowball(), inspects why each paper was admitted, and characterizes the expansion space (the citation-adjacent literature the query never returned) with MeSH keyness. Both citation_snowball() and mesh_keyness() are part of puremoe’s local analysis layer: they transform tables that get_records() has already returned and make no further API calls. Snowballing proposes candidates to screen; it does not replace manual review. Note that iCite links cover PubMed-indexed articles only, so snowballing inherits PubMed’s coverage.

library(puremoe)
library(dplyr)
library(DT)

Seed corpus

Search PubMed, then pull iCite records for the hits. Snowballing uses the icites endpoint specifically: each record carries a citation_net of that paper’s references and citing papers.

pmids <- search_pubmed('"political ideology"[TiAb]')

length(pmids)

#> [1] 963

seed_icites <- get_records(pmids, endpoint = "icites", cores = 1L, sleep = 0.25)

Expand by snowballing

citation_snowball() walks the links already in the iCite response (no extra API call). direction = "both" looks backward (papers the seeds cite) and forward (papers that cite the seeds); min_links admits a candidate only if it connects to at least that many seeds.

snowball <- seed_icites |>
  citation_snowball(direction = "both", min_links = 2)

snowball |>
  count(seed)   # seeds vs newly discovered candidates

#>      seed     n
#>    <lgcl> <int>
#> 1:  FALSE  1049
#> 2:   TRUE   951
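
To make the min_links gate concrete, here is a minimal base-R sketch on a made-up edge list. It is not puremoe's implementation: toy_snowball(), the edges table, and every ID are hypothetical, and treating link_count as the sum of the two link columns is an assumption made for illustration.

```r
# Toy citation edges: `from` cites `to`. All IDs are invented.
edges <- data.frame(
  from = c("s1", "s2", "s3", "s1", "c9", "c9", "c8"),
  to   = c("c1", "c1", "c1", "c2", "s1", "s2", "s3"),
  stringsAsFactors = FALSE
)
seeds <- c("s1", "s2", "s3")

# Hypothetical helper, not part of puremoe.
toy_snowball <- function(edges, seeds, min_links = 2) {
  # Every paper in the network that is not itself a seed.
  candidates <- setdiff(unique(c(edges$from, edges$to)), seeds)

  # Number of distinct seeds that cite each candidate (backward direction).
  cited_links <- vapply(candidates, function(p) {
    length(unique(edges$from[edges$to == p & edges$from %in% seeds]))
  }, integer(1))

  # Number of distinct seeds that each candidate cites (forward direction).
  citing_links <- vapply(candidates, function(p) {
    length(unique(edges$to[edges$from == p & edges$to %in% seeds]))
  }, integer(1))

  out <- data.frame(
    id           = candidates,
    cited_links  = unname(cited_links),
    citing_links = unname(citing_links),
    stringsAsFactors = FALSE
  )
  # Assumption: the gate and the ranking both use the summed link count.
  out$link_count <- out$cited_links + out$citing_links
  out <- out[out$link_count >= min_links, ]
  out[order(-out$link_count), ]
}

result <- toy_snowball(edges, seeds, min_links = 2)
print(result)
```

In this toy network c1 is cited by three seeds and c9 cites two, so both pass the gate; c2 and c8 each touch only one seed and are dropped.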

The audit trail

Every row carries its provenance: seed (whether the paper was in the original query), cited_links (seeds that cite it), citing_links (seeds it cites), and link_count (the total used both to rank candidates and to apply the min_links gate). Candidates are, by construction, papers the keyword query did not return.

Ranked candidates

candidates <- snowball |>
  filter(!seed) |>
  arrange(desc(link_count))

candidates |>
  head(25) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

The expansion space

Fetch metadata – including MeSH – once for the whole snowballed corpus (seeds plus candidates) and reuse it below. Joining the candidates back to their titles shows what the snowball surfaced.

Snowballed candidates with metadata

corpus_meta <- snowball$pmid |>
  get_records(endpoint = "pubmed_abstracts", cores = 1L, sleep = 0.25)

corpus_meta |>
  left_join(snowball, by = "pmid") |>
  filter(!seed) |>
  arrange(desc(link_count)) |>
  select(pmid, year, journal, articletitle,
         cited_links, citing_links, link_count) |>
  head(25) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Keyness: how the corpus profile shifts

Raw MeSH counts are dominated by terms that are common everywhere (for example, Humans). mesh_keyness() compares each descriptor’s document rate in a corpus to its PubMed-wide rate from data_mesh_frequencies, returning log-odds or signed G² scores. Here we use the default log-odds score and inspect the over-represented descriptors.
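
As a rough illustration of why a rate comparison beats raw counts, the sketch below computes a plain log-odds ratio by hand. The function toy_keyness() and all numbers are made up, and the formula is the textbook log-odds ratio, assumed here for illustration; mesh_keyness() may smooth or scale its scores differently.

```r
# Hypothetical helper, not part of puremoe.
toy_keyness <- function(descriptor, corpus_count, corpus_n, baseline_prop) {
  corpus_prop <- corpus_count / corpus_n
  # Log of (odds in the corpus) over (odds in the baseline).
  # Proportions of exactly 0 or 1 would give infinite scores.
  log_odds <- log(corpus_prop / (1 - corpus_prop)) -
    log(baseline_prop / (1 - baseline_prop))
  data.frame(
    DescriptorName = descriptor,
    corpus_count   = corpus_count,
    corpus_prop    = corpus_prop,
    baseline_prop  = baseline_prop,
    log_odds       = log_odds,
    direction      = ifelse(log_odds > 0, "over", "under"),
    stringsAsFactors = FALSE
  )
}

# Invented counts and baseline rates for a 1,000-document corpus.
toy <- toy_keyness(
  descriptor    = c("Humans", "Politics"),
  corpus_count  = c(900, 300),
  corpus_n      = 1000,
  baseline_prop = c(0.60, 0.002)
)
print(toy[order(-toy$log_odds), ])
```

Humans has three times the raw count, but because it is common in the baseline too, Politics receives the much larger log-odds score.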

The seed corpus alone should surface the obvious topic terms, a sanity check that keyness behaves.

Over-represented seed descriptors

pass1 <- corpus_meta |>
  filter(pmid %in% pmids) |>
  mesh_keyness(min_count = 3L)

pass1 |>
  filter(direction == "over") |>
  arrange(desc(z)) |>
  select(DescriptorName, corpus_count, corpus_prop, baseline_prop, log_odds, z) |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  head(15) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Two caveats apply: recent papers may not be MeSH-indexed yet, and the baseline is a snapshot. Keyness describes the corpus; it does not judge relevance.

Choosing a direction

direction matches the search intent: "cited" finds shared foundational references, "citing" finds later work building on the corpus, and "both" casts the widest net.

backward <- citation_snowball(seed_icites, direction = "cited",  min_links = 2)
forward  <- citation_snowball(seed_icites, direction = "citing", min_links = 2)

data.frame(
  direction    = c("cited (foundational)", "citing (downstream)"),
  n_candidates = c(sum(!backward$seed), sum(!forward$seed))
)

#>              direction n_candidates
#> 1 cited (foundational)         1049
#> 2  citing (downstream)         1049
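
Candidate counts alone do not show whether two directions surfaced the same papers. A small sketch with invented PMIDs shows one way to compare the sets; backward_toy and forward_toy are hypothetical stand-ins that carry only the pmid and seed columns.

```r
# Invented candidate tables standing in for two snowball results.
backward_toy <- data.frame(pmid = c("101", "102", "103"), seed = FALSE,
                           stringsAsFactors = FALSE)
forward_toy  <- data.frame(pmid = c("103", "104"), seed = FALSE,
                           stringsAsFactors = FALSE)

back_ids <- backward_toy$pmid[!backward_toy$seed]
fwd_ids  <- forward_toy$pmid[!forward_toy$seed]

only_backward <- setdiff(back_ids, fwd_ids)    # found only via references
only_forward  <- setdiff(fwd_ids, back_ids)    # found only via citing papers
in_both       <- intersect(back_ids, fwd_ids)  # reachable in both directions

print(list(only_backward = only_backward,
           only_forward  = only_forward,
           in_both       = in_both))
```

The same three set operations can be applied to the pmid columns of real snowball results to check how much the directions overlap.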

Iterating

Re-seed by feeding the expanded PMIDs back through the iCite endpoint and snowballing again; each hop keeps the same audit columns.

hop2 <- snowball$pmid |>
  get_records(endpoint = "icites", cores = 1L, sleep = 0.25) |>
  citation_snowball(direction = "both", min_links = 3, max_nodes = 500)
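
When iterating, it can help to record the hop at which each paper first entered the corpus. The sketch below does this for made-up ID vectors; first_hop is a hypothetical bookkeeping column, not something the package is known to return.

```r
# Invented ID sets: the seeds, the corpus after hop 1, and the corpus after hop 2.
seed_ids <- c("1", "2")
hop1_ids <- c("1", "2", "3", "4")
hop2_ids <- c("2", "3", "4", "5")

all_ids <- unique(c(seed_ids, hop1_ids, hop2_ids))

# 0 = original seed, 1 = first seen at hop 1, 2 = first seen at hop 2.
first_hop <- ifelse(all_ids %in% seed_ids, 0L,
                    ifelse(all_ids %in% hop1_ids, 1L, 2L))

provenance <- data.frame(pmid = all_ids, first_hop = first_hop,
                         stringsAsFactors = FALSE)
print(provenance)
```

Keeping this column alongside the audit columns makes it easier to see how far each candidate sits from the original query.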

Summary

citation_snowball() turns an iCite response into a ranked, auditable candidate set: it finds citation-adjacent papers a keyword query misses, the audit columns document why each was admitted, and MeSH keyness against data_mesh_frequencies characterizes the expansion space. It complements keyword search and manual screening rather than replacing them.