Documentation

  • Six vignettes added covering the full pipeline: web data, Wikipedia data, regex search, dictionary search, semantic search (RAG), and basic NLP processing.
  • Basic NLP vignette walks through nlp_split_sentences(), nlp_tokenize_text() (word and Biber methods), and nlp_cast_tokens(), both step by step and as a single pipe.
  • README revamped: tighter intro, API map, RAG/agent positioning, vignette links.
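The vignette's single-pipe form can be sketched as follows. The input data frame `docs` and the `method` argument name are illustrative assumptions, not the package's documented interface.

```r
# Minimal sketch of the basic NLP vignette's single-pipe form.
# `docs` (a data frame with a text column) and the `method` argument
# are illustrative assumptions, not the package's exact signatures.
tokens_long <- docs |>
  nlp_split_sentences() |>                # one row per sentence
  nlp_tokenize_text(method = "word") |>   # adds a list-column of tokens
  nlp_cast_tokens()                       # flatten to one row per token
```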

Changes

  • util_fetch_embeddings() re-added for embedding generation via Hugging Face inference endpoints (reverses the 1.1.0 removal; it now calls the HF Inference API rather than loading models locally).
  • nlp_cast_tokens() documented and surfaced: flattens the token list returned by nlp_tokenize_text() into a long-format data frame with optional character spans.
  • Suggests field trimmed: ellmer and other unused packages removed.
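A hedged sketch of how the re-added helper might be called; the argument names and the model id are assumptions inferred from the description above, not the documented signature.

```r
# Hypothetical call shape for util_fetch_embeddings(); argument names
# and the model id are assumptions. A Hugging Face API token is
# typically expected in the environment.
emb <- util_fetch_embeddings(
  x = c("first document", "second document"),
  model = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative model id
)
# Presumably returns a numeric matrix with one row per input text.
```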

API and naming

Removed

  • In-package embedding generation (e.g. via the Hugging Face API). Use your own embedding pipeline and pass the resulting embedding matrix directly.
  • Legacy names: web_search, wiki_search, wiki_find_references, web_scrape_urls, ner_extract_entities, sem_nearest_neighbors / sem_search_corpus (replaced by search_vector and search_regex).
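A bring-your-own-embeddings workflow might look like the following sketch; `my_embed()` is a placeholder for any user-supplied embedding function, and the `search_vector()` argument names are illustrative, not the documented signature.

```r
# Sketch of an external embedding pipeline feeding search_vector().
# my_embed() is a user-supplied placeholder; the argument names below
# are assumptions for illustration.
doc_emb   <- my_embed(docs$text)        # n x d numeric matrix
query_emb <- my_embed("sample query")   # 1 x d numeric vector/matrix
hits <- search_vector(docs, query = query_emb, embeddings = doc_emb)
```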

Docs

  • README revamped around the API map and a single “golden path” workflow.
  • DESCRIPTION and package help updated for the four-stage pipeline; version set to 1.1.0.

Initial release

  • URL fetching, URL content reading, NLP processing (split, tokenize, index), and corpus/search utilities.