textpress is an R toolkit for building text corpora and searching them – no custom object classes, just plain data frames from start to finish. It covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all compatible with the native R pipe (|>).
From CRAN:
install.packages("textpress")
Development version:
remotes::install_github("jaytimm/textpress")

textpress API
Conventions: corpus is a data frame with a text column plus one or more identifier columns, named via the by argument (default doc_id). All outputs are plain data frames or data.tables, so every step works with the pipe.
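To make the convention concrete, here is a minimal base-R sketch of what such a corpus could look like. The `paragraph_id` column name and the sample text are illustrative assumptions, not taken from the package.

```r
# A minimal corpus: one row per document, plain data frame, no custom class.
corpus <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("Plain data frames go in.", "Plain data frames come out."),
  stringsAsFactors = FALSE
)

# At finer granularity the identifier can span several columns.
# `paragraph_id` is a hypothetical name; these are the column names
# that would be passed to `by`.
paragraphs <- data.frame(
  doc_id       = c("d1", "d1", "d2"),
  paragraph_id = c(1L, 2L, 1L),
  text         = c("First block.", "Second block.", "Only block."),
  stringsAsFactors = FALSE
)
by_cols <- c("doc_id", "paragraph_id")

# Sanity check: the identifier columns uniquely key each row.
stopifnot(!anyDuplicated(paragraphs[by_cols]))
```

Because both objects are ordinary data frames, they can be filtered, joined, or inspected with base R or any data-frame library before being handed to the next step.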
Fetch (fetch_*)
Find URLs and metadata – not full text. Pass results to read_urls() to get content.
- fetch_urls(query, n_pages, date_filter) – Search engine query; returns candidate URLs with metadata.
- fetch_wiki_urls(query, limit) – Wikipedia article URLs matching a search phrase.
- fetch_wiki_refs(url, n) – External citation URLs from a Wikipedia article’s References section.

Read (read_*)
Scrape and parse URLs into a structured corpus.
- read_urls(urls, ...) – Character vector of URLs → list(text, meta). text is one row per node (headings, paragraphs, lists); meta is one row per URL. For Wikipedia, exclude_wiki_refs = TRUE drops References / See also / Bibliography sections.

Process (nlp_*)
Prepare text for search or indexing.
- nlp_split_paragraphs() – Break documents into structural blocks.
- nlp_split_sentences() – Segment blocks into individual sentences.
- nlp_tokenize_text() – Normalize text into a clean token stream.
- nlp_index_tokens() – Build a weighted BM25 index for ranked retrieval.
- nlp_roll_chunks() – Roll sentences into fixed-size chunks with surrounding context (RAG-style).

Search (search_*)
Four retrieval modes over the same corpus. Data-first, pipe-friendly.
| Function | Query type | Use case |
|---|---|---|
| search_regex(corpus, query) | Regex pattern | Specific strings, KWIC with inline highlighting. |
| search_dict(corpus, terms) | Term vector | Exact phrases and MWEs; built-in dict_generations, dict_political. |
| search_index(index, query) | Keywords | BM25 ranked retrieval over a token index. |
| search_vector(embeddings, query) | Numeric vector | Semantic nearest-neighbor search; use util_fetch_embeddings() to embed. |
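As a rough illustration of what semantic nearest-neighbor search computes, the base-R sketch below ranks rows of an embedding matrix by cosine similarity to a query vector. It is a conceptual sketch only, not the implementation of search_vector(); the helper names `cosine_sim()` and `nearest()` and the toy embedding values are assumptions made for the example.

```r
# Toy 3-dimensional "embeddings", one row per chunk (values are made up).
embeddings <- rbind(
  c1 = c(1.0, 0.0, 0.0),
  c2 = c(0.8, 0.6, 0.0),
  c3 = c(0.0, 0.0, 1.0)
)
query_vec <- c(1, 0.2, 0)

# Cosine similarity between one query vector and every row of a matrix.
cosine_sim <- function(m, q) {
  as.vector(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

# Return the n most similar rows, best first, as a plain data frame.
nearest <- function(m, q, n = 2) {
  sims <- cosine_sim(m, q)
  ord <- order(sims, decreasing = TRUE)[seq_len(min(n, nrow(m)))]
  data.frame(
    id = rownames(m)[ord],
    score = unname(sims[ord]),
    stringsAsFactors = FALSE
  )
}

hits <- nearest(embeddings, query_vec)
```

In practice the matrix rows would be real chunk embeddings and the query vector would come from the same embedding model, so that distances between them are meaningful.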
textpress is designed to compose cleanly into retrieval-augmented generation pipelines.
Hybrid retrieval – run search_index() and search_vector() over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
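The merge step can be sketched in base R as follows. This assumes each result set is a data frame already sorted best-first with a shared `chunk_id` column; the column name, the helper `rrf_merge()`, and the sample ids are hypothetical, and k = 60 is the constant commonly used for RRF rather than anything the package prescribes.

```r
# Two ranked result sets over the same chunks (ids and order are made up).
# Row order is the rank: first row = best hit.
bm25_hits   <- data.frame(chunk_id = c("c3", "c1", "c7", "c2"),
                          stringsAsFactors = FALSE)
vector_hits <- data.frame(chunk_id = c("c1", "c5", "c3", "c9"),
                          stringsAsFactors = FALSE)

# Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d).
rrf_merge <- function(..., id_col = "chunk_id", k = 60) {
  lists <- list(...)
  scored <- do.call(rbind, lapply(lists, function(df) {
    data.frame(
      id  = df[[id_col]],
      rrf = 1 / (k + seq_len(nrow(df))),
      stringsAsFactors = FALSE
    )
  }))
  # Sum the per-list contributions for each chunk.
  fused <- aggregate(rrf ~ id, data = scored, FUN = sum)
  # Highest fused score first; ties broken alphabetically by id.
  fused <- fused[order(-fused$rrf, fused$id), ]
  rownames(fused) <- NULL
  fused
}

fused <- rrf_merge(bm25_hits, vector_hits)
```

Chunks c1 and c3 appear in both lists, so they collect two contributions each and land at the top, ahead of chunks found by only one retriever.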
Context assembly – nlp_roll_chunks() with context_size > 0 gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
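The idea of a focal sentence with surrounding context can be illustrated with a few lines of base R. This is a conceptual sketch for a single document, not the implementation of nlp_roll_chunks(); the helper `roll_context()`, the `sentence_id` column, and the sample sentences are assumptions, and the real function's windowing and output columns may differ.

```r
# Five sentences from one document (contents are placeholders).
sentences <- data.frame(
  doc_id      = "d1",
  sentence_id = 1:5,
  text        = paste0("Sentence ", 1:5, "."),
  stringsAsFactors = FALSE
)

# For each focal sentence, paste it together with up to `context_size`
# neighbours on each side, clipped at the document boundaries.
roll_context <- function(df, context_size = 1) {
  n <- nrow(df)
  df$chunk <- vapply(seq_len(n), function(i) {
    window <- max(1, i - context_size):min(n, i + context_size)
    paste(df$text[window], collapse = " ")
  }, character(1))
  df
}

chunks <- roll_context(sentences, context_size = 1)
```

Each row keeps its focal `text` and gains a `chunk` column holding the wider passage, so a retriever can match on the focal sentence while the LLM receives the self-contained window.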
Agent tool-calling – the consistent API and plain data-frame outputs map naturally to tool use:
| Agent task | Function |
|---|---|
| “Find recent articles on X” | fetch_urls() |
| “Scrape these pages” | read_urls() |
| “Find all mentions of these entities” | search_dict() |
| “Follow citations from this Wikipedia article” | fetch_wiki_refs() |
fetch_urls() + read_urls()
fetch_wiki_urls() + fetch_wiki_refs()
search_regex(), KWIC
search_dict(), PMI co-occurrence

MIT © Jason Timm
citation("textpress")
Report bugs or request features at https://github.com/jaytimm/textpress/issues