Pubmed Unified REtrieval for Multi-Output Exploration • puremoe

PubMed Unified REtrieval for Multi-Output Exploration. An R package that provides a single interface for accessing a range of NLM/PubMed databases, including:

PubMed abstract records,
iCite bibliometric data,
PubTator3 named entity annotations, and
full-text entries from PubMed Central (PMC).

This unified interface simplifies the data retrieval process, allowing users to interact with multiple PubMed services/APIs/output formats through a single R function.

The package also includes MeSH thesaurus resources as simple data frames, including Descriptor Terms, Descriptor Tree Structures, Supplementary Concept Terms, and Pharmacological Actions; it also includes descriptor-level word embeddings (Noh & Kavuluru 2021). Via the mesh-resources library.

Installation

Get the released version from CRAN:

install.packages('puremoe')

Or the development version from GitHub with:

remotes::install_github("jaytimm/puremoe")

Usage

PubMed search

The package has two basic functions: search_pubmed and get_records. The former fetches PMIDs from the PubMed API based on user search; the latter scrapes PMID records from a user-specified PubMed endpoint – pubmed_abstracts, pubmed_affiliations, pubtations, icites, or pmc_fulltext.

Search syntax is the same as that implemented in standard PubMed search.

pmids <- puremoe::search_pubmed('("political ideology"[TiAb])',
                                 use_pub_years = F)

# pmids <- puremoe::search_pubmed('immunity', 
#                                  use_pub_years = T,
#                                  start_year = 2022,
#                                  end_year = 2024)

Get record-level data

pubmed <- pmids |> 
  puremoe::get_records(endpoint = 'pubmed_abstracts', 
                       cores = 3, 
                       sleep = 1) 

affiliations <- pmids |> 
  puremoe::get_records(endpoint = 'pubmed_affiliations', 
                       cores = 1, 
                       sleep = 0.5)

icites <- pmids |>
  puremoe::get_records(endpoint = 'icites',
                       cores = 3,
                       sleep = 0.25)

pubtations <- pmids |> 
  puremoe::get_records(endpoint = 'pubtations',
                       cores = 2)

When the endpoint is PMC, the get_records() function takes a vector of filepaths (from the PMC Open Access list) instead of PMIDs.

pmclist <- puremoe::data_pmc_list(use_persistent_storage = T)
pmc_pmids <- pmclist[PMID %in% pmids]

pmc_fulltext <- pmc_pmids$fpath[1:5] |> 
  puremoe::get_records(endpoint = 'pmc_fulltext', cores = 1)

Summary

Output	Colname	Description
pubmed_abstracts	pmid	PMID
pubmed_abstracts	year	Publication year
pubmed_abstracts	journal	Journal name
pubmed_abstracts	articletitle	Article title
pubmed_abstracts	abstract	Article abstract
pubmed_abstracts	annotations	Mesh/Chem/Keywords annotations
pubmed_affiliations	pmid	PMID
pubmed_affiliations	Author	Author name
pubmed_affiliations	affiliation	Author affiliation
pubtations	pmid	PMID
pubtations	tiab	Title or abstract
pubtations	id	Entity ID
pubtations	entity	Extracted entity
pubtations	identifier	Knowledge base link (KB link)
pubtations	type	Entity type
pubtations	start	Start position (char)
pubtations	end	End position (char)
pmc_fulltext	pmid	PMID
pmc_fulltext	section	Full text section
pmc_fulltext	text	Full text content
icites	pmid	PMID
icites	is_research_article	Research article indicator
icites	nih_percentile	NIH percentile rank
icites	is_clinical	Clinical article indicator
icites	citation_count	Citation count
icites	ref_count	Reference count
icites	citation_net	Citation network (to/from edgelist)

puremoe

Installation

Usage

PubMed search

Get record-level data

Summary

Links

License

Citation

Developers

Dev status