Retrieval-Augmented Generation (RAG) is more than just a vector store lookup. Sure, a vector store's interface may be a natural fit for an LLM (large language model) to use, but we shouldn't stop there. There are so many different types of data stores that engineers need and want to use. So why does just about everyone seem to talk about RAG only in the context of vector stores?

The first time I heard about RAG, I thought it was some special "thing" related to AI/ML that I just needed to learn. It's not. It's a function call that returns data for the LLM to process. Once I got over that naivete, I realized how powerful it actually was. This is where agents come in, and I got excited about all of the possibilities of using agents to call tools (functions). But more on that in a bit, because I keep seeing articles and discussions around RAG and vector databases. That's all anyone seems to be talking about (or the content algorithms think that's all I want to read, I guess).

Let me clarify one thing before we continue. I am not a DS, ML, or AI expert. I haven't built a model or neural network myself. What I am is an application developer who has used LLM technology for a variety of practical use cases. That's my frame of reference, and hopefully it's more helpful than any sort of theoretical bike-shedding on RAG.
What is context?
Context is really just text that the LLM can use to inform its response. It helps it predict what words should come next. There are so many ways to add context to an LLM for better results: system prompts, chat history, RAG, fine-tuning, model training, and the list goes on. Any way you can transform information into text can be used as context for the LLM. So how does RAG add context for an LLM? It is just a pattern for fetching data from some external data source to put into the system prompt or user message. Honestly, I'm shocked it became a named pattern at all (and as my co-worker pointed out, it has led to an awful acronym). It's just data fetching.
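To make that concrete, here's a minimal sketch of the pattern in TypeScript, assuming the OpenAI Node SDK. The `fetchOrderStatus` function and the order API URL are hypothetical placeholders; the point is just that "RAG" boils down to fetching some text and putting it in front of the model.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical data-fetching function -- this could be any API or database call.
async function fetchOrderStatus(orderId: string): Promise<string> {
  const res = await fetch(`https://api.example.com/orders/${orderId}`);
  const order = await res.json();
  return `Order ${orderId} is currently "${order.status}".`;
}

async function answerWithContext(orderId: string, question: string): Promise<string> {
  // 1. Retrieval: plain old data fetching.
  const context = await fetchOrderStatus(orderId);

  // 2. Augmentation + generation: put the fetched text into the prompt.
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: `Answer the user's question using this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```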
How RAG works with vectors
As I said earlier, RAG is mostly talked about in the context of vector databases. So before I expand into all of the other possibilities for the RAG pattern, let's dive into how it works with vectors. Vector databases are a special kind of database that, until the latest AI shift, were mostly used for logging and search (e.g. Elasticsearch). They store text as mathematical vectors for more efficient searching. They enable you to do similarity searching and semantic searching that traditional databases don't support; it's baked into the vector data model. Words can be related because their vectors are similar. It's wonderful. Where RAG fits into this is in the embeddings. Embeddings are a way to take text and transform it into vectors so you can find similar objects. They allow machine learning algorithms to better capture how similar objects are in the real world. Take "tree" and "forest": even though those words don't have any pattern-matching similarity, their embeddings will be close together because the concepts are related. So whenever you search a vector database using embeddings, the results contain related information that gives the LLM even more context around what word will likely come next.
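Here's a rough sketch of that retrieval step, assuming the OpenAI embeddings endpoint and using a plain in-memory cosine-similarity search as a stand-in for a real vector database (the document snippets are made up):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Cosine similarity: how "close" two embedding vectors are.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

async function embed(text: string): Promise<number[]> {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// Made-up document snippets standing in for the contents of a vector database.
const documents = [
  "Our return policy allows refunds within 30 days.",
  "The oak tree is the oldest in the forest preserve.",
  "Shipping takes 5-7 business days for standard orders.",
];

async function retrieveMostRelevant(query: string): Promise<string> {
  const queryVector = await embed(query);
  const scored = await Promise.all(
    documents.map(async (doc) => ({
      doc,
      score: cosineSimilarity(queryVector, await embed(doc)),
    }))
  );
  // The highest-scoring snippet becomes extra context for the LLM.
  scored.sort((a, b) => b.score - a.score);
  return scored[0].doc;
}
```

A real vector database would index millions of these vectors and do the nearest-neighbor search for you, but the idea is the same.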
Why vectors are not the end
So vectors sound like a perfect pairing with LLMs. However, don't stop there. Don't paint LLMs and RAG into a corner by only being able to use vector databases. There are so many more possibilities out there. Most of the time, RAG is implemented using the tool/function calling interface of the particular API you are using. This is most powerful when paired with the agent pattern, where the LLM uses its context to determine when to call functions and then uses those results to inform its output. Being able to call arbitrary functions from the LLM enables you to add context from anywhere. That function can be any API call or database call. It isn't limited to vector databases. As long as you return the data in a text format, the LLM can use it. JSON doesn't always work perfectly, and it bloats your token usage, but it does work. It may be less efficient than vectors, but it enables you to do so much more. You can fetch data from relational databases, graph databases, wide-column stores, and the list goes on. All you need to do is give the LLM the interface for calling the function that fetches the data.

Can it be limited by the parameters you can pass? Do you risk security issues if you let it execute arbitrary SQL? Yeah, but as an application developer, you can build your interface in a way that gives the LLM enough guardrails to avoid that. Don't let it write arbitrary SQL; just give it a function that executes SQL you define, with a parameterized query the LLM fills out from the function inputs. You don't need to learn or implement an entirely new database storage mechanism; you can use the knowledge you already have to empower the LLM.
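Here's what that guardrail might look like, assuming the OpenAI tool-calling format and a Postgres client like pg. The table, columns, and tool name are made up; the key point is that the LLM only supplies a parameter and never writes SQL itself.

```typescript
import OpenAI from "openai";
import { Pool } from "pg";

const client = new OpenAI();
const pool = new Pool(); // connection details come from environment variables

// The only SQL that ever runs is this parameterized query you wrote yourself.
// Table and column names are hypothetical.
async function getRecentOrders(customerId: string): Promise<string> {
  const result = await pool.query(
    "SELECT id, status, total FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 5",
    [customerId]
  );
  return JSON.stringify(result.rows);
}

// The tool definition the LLM sees: all it can do is fill in customerId.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_recent_orders",
      description: "Fetch the five most recent orders for a customer",
      parameters: {
        type: "object",
        properties: {
          customerId: { type: "string", description: "The customer's ID" },
        },
        required: ["customerId"],
      },
    },
  },
];

async function run(question: string) {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: question }],
    tools,
  });

  // If the model decided to call the tool, execute our guarded function.
  const call = response.choices[0].message.tool_calls?.[0];
  if (call && call.function.name === "get_recent_orders") {
    const { customerId } = JSON.parse(call.function.arguments);
    return getRecentOrders(customerId);
  }
  return response.choices[0].message.content;
}
```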
Simplify your RAG
You already have functions you call to fetch data. You already have the knowledge. The data is already in place. You just need to provide an interface for the LLM to interact with all of that. You can even put together multi-agent workflows where each agent is responsible for a specific subset of information. That way, the context given to each agent is simple and concise. Software is built around inputs, processing, and outputs. LLMs are just a different way of processing the data and getting an output (a different kind of black box, if you will). Anything you could do before with those inputs will still work with LLMs. You just need to translate it into text (and most things are already text anyway).

To me, having an LLM call a function is just like passing in a callback function, which is common in Node.js (and functional programming languages). You have an interface for how to call the function and provide specific inputs. That function processes the inputs and returns some output for the next part of the program. You still have to handle any errors that occur from invalid input. RAG is just passing a function to the LLM to retrieve some data to augment its generation of output text. Simple as that.
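That callback analogy translates almost directly into code. Here's a sketch of a generic dispatcher that's independent of any particular LLM SDK; the tool names, URLs, and the exact shape of `ToolCall` are hypothetical, since every SDK hands tool calls back a little differently.

```typescript
// The shape of a tool call as most LLM SDKs hand it back: a name plus arguments.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Stand-ins for data-fetching functions you already have (URLs are made up).
async function fetchCustomer(id: string): Promise<string> {
  const res = await fetch(`https://crm.example.com/customers/${id}`);
  return res.text();
}

async function searchKnowledgeBase(query: string): Promise<string> {
  const res = await fetch(`https://kb.example.com/search?q=${encodeURIComponent(query)}`);
  return res.text();
}

// Register existing functions like callbacks, keyed by the tool name the LLM uses.
const toolHandlers: Record<string, (args: any) => Promise<string>> = {
  get_customer: ({ id }) => fetchCustomer(String(id)),
  search_docs: ({ query }) => searchKnowledgeBase(String(query)),
};

// The LLM picks the function and the inputs; you run it and hand the text back,
// just like passing a callback's result to the next step of a program.
async function handleToolCall(call: ToolCall): Promise<string> {
  const handler = toolHandlers[call.name];
  if (!handler) {
    return `Error: unknown tool "${call.name}"`; // still your job to handle bad input
  }
  try {
    return await handler(call.arguments);
  } catch (err) {
    return `Error: ${(err as Error).message}`;
  }
}
```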