Large language models (LLMs) are developing rapidly, but they often lack real-time, specific information. Retrieval-augmented generation (RAG) addresses this by letting LLMs fetch relevant documents during text generation, instead of just using their internal — and potentially outdated — knowledge.
Now, thanks to the rchroma package, you can bring high-performance, vector-based retrieval into your R workflows, backed by the powerful ChromaDB vector database.
What are RAGs?
Typically, an LLM answers questions using only its training data. While often effective, this approach has key limitations:
- LLMs may not know recent events or updates
- They may “hallucinate” facts when unsure
- You can’t give them custom/private context without fine-tuning
RAG can overcome these issues.
A RAG pipeline works like this:
- A user asks a question: “How does the billing system in our API work?”
- The system retrieves relevant documents (e.g., markdown files, logs, internal wiki pages)
- The LLM is given both the question and those retrieved documents
- The LLM then generates an answer grounded in actual knowledge
It’s like giving the model its own mini search engine — one that only looks through your data.
ChromaDB: A vector database
At the heart of every RAG pipeline is a vector database — a system that stores high-dimensional embeddings of text and lets you search by meaning, not just keywords.
ChromaDB is one of the most powerful and accessible vector databases: It is fast, lightweight, and open-source. It can easily be run locally via Docker and fully supports filtering, metadata, and semantic search. It also integrates well with popular embedding models (OpenAI, Hugging Face, etc.), a key component for a RAG system.
With ChromaDB, you can store tens of thousands or even millions of text chunks and instantly find the most relevant ones for any query, all by comparing their embeddings.
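To make "search by meaning" concrete, here is a toy illustration in base R: we pretend three documents already have (made-up, two-dimensional) embeddings and rank them by cosine similarity to a query embedding. Real embedding models produce vectors with hundreds of dimensions, and ChromaDB performs this ranking for you at scale.

```r
# Toy example: rank documents by cosine similarity of their embeddings.
# The vectors below are made up for illustration only.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

doc_embeddings <- list(
  "Invoices are generated monthly"   = c(0.9, 0.1),
  "The cat sat on the mat"           = c(0.1, 0.9),
  "Billing runs on the first of May" = c(0.8, 0.3)
)

query_embedding <- c(1, 0)  # pretend this encodes a billing question

scores <- vapply(doc_embeddings, cosine_similarity, numeric(1), b = query_embedding)
names(sort(scores, decreasing = TRUE))[1]
```

The two billing-related documents score highest because their (toy) embeddings point in a similar direction to the query, even though they share few exact words with it.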
Introducing rchroma
rchroma provides an R interface to the ChromaDB API. It allows you to connect to a running ChromaDB instance (often via Docker), create collections, add documents with their metadata and embeddings, and query the database using embeddings to find relevant documents.
rchroma includes a convenience function to easily start a ChromaDB Docker container.
library(rchroma)
chroma_docker_run()
The function chroma_docker_run() has several arguments, but the defaults are usually sufficient. You might consider changing volume_host_dir to specify where the database files should be persisted on your host machine.
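For example, to persist the database files in a directory of your choosing (the path below is purely illustrative):

```r
# Persist ChromaDB's data outside the container (illustrative path)
chroma_docker_run(volume_host_dir = "~/chroma-data")
```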
To connect to the running Docker container, simply use chroma_connect():
client <- chroma_connect()
client
<chromadb connection>
Now we’re ready to add data to our database.
Example
In the following example, we use Wikipedia articles about philosophers to build a knowledge base for our experiment. The (folded) code below shows how to retrieve the articles.
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tibble)
philosophers <- c(
# Classical Western
"Plato",
"Aristotle",
"Socrates",
"Epicurus",
"Pythagoras",
# Medieval
"Augustine_of_Hippo",
"Thomas_Aquinas",
"Boethius",
"Avicenna",
"Maimonides",
# Early Modern
"René_Descartes",
"Baruch_Spinoza",
"John_Locke",
"David_Hume",
"Immanuel_Kant",
# 19th Century
"Georg_Wilhelm_Friedrich_Hegel",
"Arthur_Schopenhauer",
"Karl_Marx",
"Friedrich_Nietzsche",
"John_Stuart_Mill",
# 20th Century / Contemporary
"Ludwig_Wittgenstein",
"Bertrand_Russell",
"Martin_Heidegger",
"Jean-Paul_Sartre",
"Simone_de_Beauvoir",
"Michel_Foucault",
"Hannah_Arendt",
"Jacques_Derrida",
"Jürgen_Habermas",
"Richard_Rorty",
# Non-Western Philosophers
"Confucius",
"Laozi",
"Zhuangzi",
"Nagarjuna",
"Adi_Shankara",
"Mencius",
"Al-Farabi",
"Ibn_Rushd",
"Wang_Yangming",
"Dogen"
)
uuid <- function() {
shortuuid::uuid_to_bitcoin58(shortuuid::generate_uuid())
}
get_philosopher_article_with_metadata <- function(title) {
url <- paste0("https://en.wikipedia.org/wiki/", title)
page <- tryCatch(read_html(url), error = function(e) NULL)
if (is.null(page)) {
return(NULL)
}
# Get readable title
readable_title <- str_replace_all(title, "_", " ")
# Extract text content (paragraphs)
content <- page |>
html_elements("#mw-content-text .mw-parser-output > p") |>
html_text2()
content <- content[nchar(content) > 100] # Filter out short/noisy chunks
# Extract infobox rows
infobox_rows <- page |>
html_element(".infobox") |>
html_elements("tr")
# Helper to extract values
extract_row_value <- function(label) {
value <- infobox_rows |>
keep(~ str_detect(html_text2(.x), fixed(label))) |>
html_elements("td") |>
html_text2()
if (length(value) > 0) value[[1]] else NA
}
metadata <- list(
name = readable_title,
birth = extract_row_value("Born"),
died = extract_row_value("Died"),
region = extract_row_value("Region"),
school_tradition = extract_row_value("School"),
main_interests = extract_row_value("Main interests"),
notable_ideas = extract_row_value("Notable ideas")
)
Sys.sleep(runif(1, 0.5, 1))
print(title)
# Return content + metadata for each chunk
tibble(
id = map_chr(seq_along(content), ~ uuid()),
title = readable_title,
chunk = seq_along(content),
content = content,
metadata = list(metadata)
)
}
philosopher_articles <- map_dfr(
philosophers,
get_philosopher_article_with_metadata
)
We also need to calculate an embedding for each of the text chunks. While there are many ways to do this (e.g., using cloud APIs), we'll use a locally deployed Ollama instance with the nomic-embed-text model for this example.
get_embedding <- function(text, model = "nomic-embed-text") {
response <- httr2::request("http://localhost:11434/api/embeddings") |>
httr2::req_body_json(list(
model = model,
prompt = text
)) |>
httr2::req_perform()
content <- httr2::resp_body_json(response)
content$embedding
}
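With this helper in place, we can compute one embedding per chunk. This sketch assumes an Ollama instance is running locally with the nomic-embed-text model already pulled, and it takes a while for roughly 3,000 chunks:

```r
# Compute one embedding per text chunk
# (requires a running local Ollama instance with nomic-embed-text)
philosopher_articles$embedding <- purrr::map(
  philosopher_articles$content,
  get_embedding
)
```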
We now assume that we have a dataset that looks a little like the following.
dplyr::glimpse(philosopher_articles)
Rows: 2,905
Columns: 6
$ id <chr> "UtQx4t2HmGVw3iZjeQCehx", "5x42RR9NkaNbK2QhcdqnP1", "WL4W87J…
$ title <chr> "Plato", "Plato", "Plato", "Plato", "Plato", "Plato", "Plato…
$ chunk <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ content <chr> "Plato (/ˈpleɪtoʊ/ PLAY-toe;[1]Greek: Πλάτων, Plátōn; born c…
$ metadata <list> ["Plato", "428/427 or 424/423 BC\n\nAthens", "348/347 BC\n\…
$ embedding <list> <0.271020502, 1.358236551, -2.413264513, -2.002642870, 1.72…
Before pushing this into a database, we first create a new collection.
create_collection(client, "philosophers")
Now we add all the documents to this collection, including metadata and embeddings.
add_documents(
client,
collection_name = "philosophers",
documents = philosopher_articles$content,
ids = philosopher_articles$id,
metadatas = lapply(seq_len(nrow(philosopher_articles)), function(i) {
list(
title = philosopher_articles$title[i],
chunk = philosopher_articles$chunk[i]
)
}),
embeddings = philosopher_articles$embedding
)
Now we can start asking questions. To do so, we also need to embed the question with the same embedding function as the database.
query_text <- "What is the role of ethics in philosophy?"
query_embedding <- get_embedding(query_text)
Once embedded, we can query our database. Below, we return the three documents whose embeddings are closest to the query embedding.
result <- query(
client,
collection_name = "philosophers",
query_embeddings = list(query_embedding),
n_results = 3
)
purrr::map(result, unlist)$documents
[1] "Schopenhauer asserts that the task of ethics is not to prescribe moral actions that ought to be done, but to investigate moral actions. As such, he states that philosophy is always theoretical: its task to explain what is given.[58]"
[2] "What is relevant for ethics are individuals who can act against their own self-interest. If we take a man who suffers when he sees his fellow men living in poverty and consequently uses a significant part of his income to support their needs instead of his own pleasures, then the simplest way to describe this is that he makes less distinction between himself and others than is usually made.[60]"
[3] "Aristotle considered ethics to be a practical rather than theoretical study, i.e., one aimed at becoming good and doing good rather than knowing for its own sake. He wrote several treatises on ethics, most notably including the Nicomachean Ethics.[139]"
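Since we stored the article title as metadata for every chunk, we can also restrict a query to a single philosopher. ChromaDB supports metadata filters; in rchroma these are passed via a where argument (check ?query for the exact interface of your package version):

```r
# Restrict results to chunks from the Aristotle article via a metadata filter
result_aristotle <- query(
  client,
  collection_name = "philosophers",
  query_embeddings = list(query_embedding),
  n_results = 3,
  where = list(title = "Aristotle")
)
```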
query_text <- "Can we truly know anything?"
query_embedding <- get_embedding(query_text)
result <- query(
client,
collection_name = "philosophers",
query_embeddings = list(query_embedding),
n_results = 3
)
purrr::map(result, unlist)$documents
[1] "The Phenomenology of Spirit shows that the search for an externally objective criterion of truth is a fool's errand. The constraints on knowledge are necessarily internal to spirit itself. Yet, although theories and self-conceptions may always be reevaluated, renegotiated, and revised, this is not a merely imaginative exercise. Claims to knowledge must always prove their own adequacy in real historical experience.[99]"
[2] "Thomas Aquinas believed \"that for the knowledge of any truth whatsoever man needs divine help, that the intellect may be moved by God to its act.\"[162] However, he believed that human beings have the natural capacity to know many things without special divine revelation, even though such revelation occurs from time to time, \"especially in regard to such (truths) as pertain to faith.\"[163] But this is the light that is given to man by God according to man's nature: \"Now every form bestowed on created things by God has power for a determined act[uality], which it can bring about in proportion to its own proper endowment; and beyond which it is powerless, except by a superadded form, as water can only heat when heated by the fire. And thus the human understanding has a form, viz. intelligible light, which of itself is sufficient for knowing certain intelligible things, viz. those we can come to know through the senses.\"[163]"
[3] "For the advancement of science and protection of liberty of expression, Russell advocated The Will to Doubt, the recognition that all human knowledge is at most a best guess, that one should always remember:"
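Retrieval is only half of a RAG pipeline: the retrieved chunks still need to be handed to an LLM together with the question. A minimal way to do that is to paste them into a prompt. The template and helper below are one possible layout, not part of rchroma; how you then send the prompt to a model (e.g., a local Ollama instance or a cloud API) is up to you. The example document strings are shortened stand-ins for real retrieved chunks.

```r
# Build a grounding prompt from the question and the retrieved chunks.
# In the example above, the chunks would come from the query result, e.g.
# docs <- purrr::map(result, unlist)$documents
build_rag_prompt <- function(question, docs) {
  context <- paste0("- ", docs, collapse = "\n")
  paste0(
    "Answer the question using only the context below.\n\n",
    "Context:\n", context, "\n\n",
    "Question: ", question
  )
}

prompt <- build_rag_prompt(
  "Can we truly know anything?",
  c(
    "Russell advocated The Will to Doubt: all knowledge is a best guess.",
    "Aquinas held that humans can know many things without special revelation."
  )
)
cat(prompt)
```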
To stop the Docker container, simply call chroma_docker_stop():
chroma_docker_stop()
✔ Container chromadb has been stopped.
Summary
That was a quick dive into the world of RAGs — and how you can start building them in R using rchroma and ChromaDB. We looked at what RAGs actually are, how they’re different from just prompting a big language model, and why having a vector database like ChromaDB is awesome. Then we got hands-on with a little example using Wikipedia philosopher data — because what better way to test semantic search than with some ancient wisdom?
With rchroma, you can now start plugging in your own documents: support pages, logs, research notes, whatever you want your model to “know.” It’s fast, flexible, and lots of fun to play with.