Large language models (LLMs) are developing rapidly, but they often lack real-time, specific information. Retrieval-augmented generation (RAG) addresses this by letting LLMs fetch relevant documents during text generation, instead of just using their internal — and potentially outdated — knowledge.
Now, thanks to the rchroma package, you can bring high-performance, vector-based retrieval into your R workflows, backed by the powerful ChromaDB vector database.
What are RAGs?
Typically, an LLM answers questions using only its training data. While often effective, this approach has key limitations:
- LLMs may not know recent events or updates
- They may “hallucinate” facts when unsure
- You can’t give them custom/private context without fine-tuning
RAG can overcome these issues.
A RAG pipeline works like this:
- A user asks a question: “How does the billing system in our API work?”
- The system retrieves relevant documents (e.g., markdown files, logs, internal wiki pages)
- The LLM is given both the question and those retrieved documents
- The LLM then generates an answer grounded in actual knowledge
It’s like giving the model its own mini search engine — one that only looks through your data.
ChromaDB: A vector database
At the heart of every RAG pipeline is a vector database — a system that stores high-dimensional embeddings of text and lets you search by meaning, not just keywords.
ChromaDB is one of the most powerful and accessible vector databases: It is fast, lightweight, and open-source. It can easily be run locally via Docker and fully supports filtering, metadata, and semantic search. It also integrates well with popular embedding models (OpenAI, Hugging Face, etc.), a key component for a RAG system.
With ChromaDB, you can store tens of thousands or even millions of text chunks and instantly find the most relevant ones for any query, all by comparing their embeddings.
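To make "search by meaning" concrete, here is a toy illustration in base R: we pretend three documents already have (made-up, two-dimensional) embeddings and rank them by cosine similarity to a query embedding. Real embedding models produce vectors with hundreds of dimensions, and ChromaDB performs this ranking for you at scale.

```r
# Toy example: rank documents by cosine similarity of their embeddings.
# The vectors below are made up for illustration only.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

doc_embeddings <- list(
  "Invoices are generated monthly"   = c(0.9, 0.1),
  "The cat sat on the mat"           = c(0.1, 0.9),
  "Billing runs on the first of May" = c(0.8, 0.3)
)

query_embedding <- c(1, 0)  # pretend this encodes a billing question

scores <- vapply(doc_embeddings, cosine_similarity, numeric(1), b = query_embedding)
names(sort(scores, decreasing = TRUE))[1]
```

The two billing-related documents score highest because their (toy) embeddings point in a similar direction to the query, even though they share few exact words with it.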
Introducing rchroma
rchroma provides an R interface to the ChromaDB API. It allows you to connect to a running ChromaDB instance (often via Docker), create collections, add documents with their metadata and embeddings, and query the database using embeddings to find relevant documents.
rchroma includes a convenience function to easily start a ChromaDB Docker container.
library(rchroma)
chroma_docker_run()
The function chroma_docker_run() has several arguments, but the defaults are usually sufficient. You might consider changing volume_host_dir to specify where the database files should be persisted on your host machine.
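For example, to persist the database files in a directory of your choosing (the path below is purely illustrative):

```r
# Persist ChromaDB's data outside the container (illustrative path)
chroma_docker_run(volume_host_dir = "~/chroma-data")
```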
To connect to the running Docker container, simply use chroma_connect():
client <- chroma_connect()
client
<chromadb connection>
Now we’re ready to add data to our database.
Example
In the following example, we use Wikipedia articles about philosophers to build a knowledge base for our experiment. The (folded) code below shows how to retrieve the articles.
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tibble)
philosophers <- c(
# Classical Western
"Plato",
"Aristotle",
"Socrates",
"Epicurus",
"Pythagoras",
# Medieval
"Augustine_of_Hippo",
"Thomas_Aquinas",
"Boethius",
"Avicenna",
"Maimonides",
# Early Modern
"René_Descartes",
"Baruch_Spinoza",
"John_Locke",
"David_Hume",
"Immanuel_Kant",
# 19th Century
"Georg_Wilhelm_Friedrich_Hegel",
"Arthur_Schopenhauer",
"Karl_Marx",
"Friedrich_Nietzsche",
"John_Stuart_Mill",
# 20th Century / Contemporary
"Ludwig_Wittgenstein",
"Bertrand_Russell",
"Martin_Heidegger",
"Jean-Paul_Sartre",
"Simone_de_Beauvoir",
"Michel_Foucault",
"Hannah_Arendt",
"Jacques_Derrida",
"Jürgen_Habermas",
"Richard_Rorty",
# Non-Western Philosophers
"Confucius",
"Laozi",
"Zhuangzi",
"Nagarjuna",
"Adi_Shankara",
"Mencius",
"Al-Farabi",
"Ibn_Rushd",
"Wang_Yangming",
"Dogen"
)
uuid <- function() {
shortuuid::uuid_to_bitcoin58(shortuuid::generate_uuid())
}
get_philosopher_article_with_metadata <- function(title) {
url <- paste0("https://en.wikipedia.org/wiki/", title)
page <- tryCatch(read_html(url), error = function(e) NULL)
if (is.null(page)) {
return(NULL)
}
# Get readable title
readable_title <- str_replace_all(title, "_", " ")
# Extract text content (paragraphs)
content <- page |>
html_elements("#mw-content-text .mw-parser-output > p") |>
html_text2()
content <- content[nchar(content) > 100] # Filter out short/noisy chunks
# Extract infobox rows
infobox_rows <- page |>
html_element(".infobox") |>
html_elements("tr")
# Helper to extract values
extract_row_value <- function(label) {
value <- infobox_rows |>
keep(~ str_detect(html_text2(.x), fixed(label))) |>
html_elements("td") |>
html_text2()
if (length(value) > 0) value[[1]] else NA
}
metadata <- list(
name = readable_title,
birth = extract_row_value("Born"),
died = extract_row_value("Died"),
region = extract_row_value("Region"),
school_tradition = extract_row_value("School"),
main_interests = extract_row_value("Main interests"),
notable_ideas = extract_row_value("Notable ideas")
)
Sys.sleep(runif(1, 0.5, 1))
print(title)
# Return content + metadata for each chunk
tibble(
id = map_chr(seq_along(content), ~ uuid()),
title = readable_title,
chunk = seq_along(content),
content = content,
metadata = list(metadata)
)
}
philosopher_articles <- map_dfr(
philosophers,
get_philosopher_article_with_metadata
)
We also need to calculate an embedding for each of the text chunks. While there are many ways to do this (e.g., using cloud APIs), we'll use a locally deployed Ollama instance with the nomic-embed-text model for this example.
get_embedding <- function(text, model = "nomic-embed-text") {
response <- httr2::request("http://localhost:11434/api/embeddings") |>
httr2::req_body_json(list(
model = model,
prompt = text
)) |>
httr2::req_perform()
content <- httr2::resp_body_json(response)
content$embedding
}
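With this helper in place, we can compute one embedding per chunk. This sketch assumes an Ollama instance is running locally with the nomic-embed-text model already pulled, and it takes a while for roughly 3,000 chunks:

```r
# Compute one embedding per text chunk
# (requires a running local Ollama instance with nomic-embed-text)
philosopher_articles$embedding <- purrr::map(
  philosopher_articles$content,
  get_embedding
)
```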
We now assume that we have a dataset that looks a little like the following.
dplyr::glimpse(philosopher_articles)
Rows: 2,905
Columns: 6
$ id <chr> "UtQx4t2HmGVw3iZjeQCehx", "5x42RR9NkaNbK2QhcdqnP1", "WL4W87J…
$ title <chr> "Plato", "Plato", "Plato", "Plato", "Plato", "Plato", "Plato…
$ chunk <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ content <chr> "Plato (/ˈpleɪtoʊ/ PLAY-toe;[1]Greek: Πλάτων, Plátōn; born c…
$ metadata <list> ["Plato", "428/427 or 424/423 BC\n\nAthens", "348/347 BC\n\…
$ embedding <list> <0.271020502, 1.358236551, -2.413264513, -2.002642870, 1.72…
Before pushing this into a database, we first create a new collection.
create_collection(client, "philosophers")
Now we add all the documents to this collection, including metadata and embeddings.
add_documents(
client,
collection_name = "philosophers",
documents = philosopher_articles$content,
ids = philosopher_articles$id,
metadatas = lapply(seq_len(nrow(philosopher_articles)), function(i) {
list(
title = philosopher_articles$title[i],
chunk = philosopher_articles$chunk[i]
)
}),
embeddings = philosopher_articles$embedding
)
Now we can start asking questions. To do so, we also need to embed the question with the same embedding function as the database.
query_text <- "What is the role of ethics in philosophy?"
query_embedding <- get_embedding(query_text)
Once embedded, we can query our database. Below, we return the three documents whose embeddings are closest to the query embedding.
result <- query(
client,
collection_name = "philosophers",
query_embeddings = list(query_embedding),
n_results = 3
)
purrr::map(result, unlist)$documents
[1] "Schopenhauer asserts that the task of ethics is not to prescribe moral actions that ought to be done, but to investigate moral actions. As such, he states that philosophy is always theoretical: its task to explain what is given.[58]"
[2] "What is relevant for ethics are individuals who can act against their own self-interest. If we take a man who suffers when he sees his fellow men living in poverty and consequently uses a significant part of his income to support their needs instead of his own pleasures, then the simplest way to describe this is that he makes less distinction between himself and others than is usually made.[60]"
[3] "Aristotle considered ethics to be a practical rather than theoretical study, i.e., one aimed at becoming good and doing good rather than knowing for its own sake. He wrote several treatises on ethics, most notably including the Nicomachean Ethics.[139]"
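Since we stored the article title as metadata for every chunk, we can also restrict a query to a single philosopher. ChromaDB supports metadata filters; in rchroma these are passed via a where argument (check ?query for the exact interface of your package version):

```r
# Restrict results to chunks from the Aristotle article via a metadata filter
result_aristotle <- query(
  client,
  collection_name = "philosophers",
  query_embeddings = list(query_embedding),
  n_results = 3,
  where = list(title = "Aristotle")
)
```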
query_text <- "Can we truly know anything?"
query_embedding <- get_embedding(query_text)
result <- query(
client,
collection_name = "philosophers",
query_embeddings = list(query_embedding),
n_results = 3
)
purrr::map(result, unlist)$documents
[1] "The Phenomenology of Spirit shows that the search for an externally objective criterion of truth is a fool's errand. The constraints on knowledge are necessarily internal to spirit itself. Yet, although theories and self-conceptions may always be reevaluated, renegotiated, and revised, this is not a merely imaginative exercise. Claims to knowledge must always prove their own adequacy in real historical experience.[99]"
[2] "Thomas Aquinas believed \"that for the knowledge of any truth whatsoever man needs divine help, that the intellect may be moved by God to its act.\"[162] However, he believed that human beings have the natural capacity to know many things without special divine revelation, even though such revelation occurs from time to time, \"especially in regard to such (truths) as pertain to faith.\"[163] But this is the light that is given to man by God according to man's nature: \"Now every form bestowed on created things by God has power for a determined act[uality], which it can bring about in proportion to its own proper endowment; and beyond which it is powerless, except by a superadded form, as water can only heat when heated by the fire. And thus the human understanding has a form, viz. intelligible light, which of itself is sufficient for knowing certain intelligible things, viz. those we can come to know through the senses.\"[163]"
[3] "For the advancement of science and protection of liberty of expression, Russell advocated The Will to Doubt, the recognition that all human knowledge is at most a best guess, that one should always remember:"
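Retrieval is only half of a RAG pipeline: the retrieved chunks still need to be handed to an LLM together with the question. A minimal way to do that is to paste them into a prompt. The template and helper below are one possible layout, not part of rchroma; how you then send the prompt to a model (e.g., a local Ollama instance or a cloud API) is up to you. The example document strings are shortened stand-ins for real retrieved chunks.

```r
# Build a grounding prompt from the question and the retrieved chunks.
# In the example above, the chunks would come from the query result, e.g.
# docs <- purrr::map(result, unlist)$documents
build_rag_prompt <- function(question, docs) {
  context <- paste0("- ", docs, collapse = "\n")
  paste0(
    "Answer the question using only the context below.\n\n",
    "Context:\n", context, "\n\n",
    "Question: ", question
  )
}

prompt <- build_rag_prompt(
  "Can we truly know anything?",
  c(
    "Russell advocated The Will to Doubt: all knowledge is a best guess.",
    "Aquinas held that humans can know many things without special revelation."
  )
)
cat(prompt)
```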
To stop the Docker container, simply call chroma_docker_stop():
chroma_docker_stop()
✔ Container chromadb has been stopped.
Summary
That was a quick dive into the world of RAGs — and how you can start building them in R using rchroma and ChromaDB. We looked at what RAGs actually are, how they’re different from just prompting a big language model, and why having a vector database like ChromaDB is awesome. Then we got hands-on with a little example using Wikipedia philosopher data — because what better way to test semantic search than with some ancient wisdom?
With rchroma, you can now start plugging in your own documents: support pages, logs, research notes, whatever you want your model to “know.” It’s fast, flexible, and lots of fun to play with.