RAG is all the RAGe these days. If you don’t know the term, it’s short for “retrieval-augmented generation”, which means that we’re using retrieval (in our case OpenSearch) to augment our friendly chatbot by sending it some information relevant to the context.
Say you have some logs in OpenSearch. If you ask ChatGPT which of your app’s modules were loaded, it will have no idea. But if you search for module-loading logs in OpenSearch first, feed ChatGPT with the results, then ask it which modules were loaded, it will know.
In this post we’ll implement the example above. So let’s get to it!
High-Level View
We’re mostly using OpenSearch’s ml-commons module here. ml-commons is the place for all the Machine Learning functionality in OpenSearch, from Anomaly Detection to Semantic Search. Our RAG example will use different functionality than those two, but the APIs are shared.
To implement RAG, we’ll need to do a few things:
- Enable RAG-related functionality in cluster settings.
- Define a connector to the external LLM. We mentioned ChatGPT, so it will be the OpenAI API.
- Use the ml-commons APIs to define a model that uses this connector and make it available for search.
- Use the defined model in searches via search pipelines.
Cluster Settings
We need to enable two things:
- The RAG pipeline feature that allows us to use the external API.
- The conversation memory feature which stores ongoing conversations. We need this because OpenAI’s Completion API is stateless: we have to send the whole conversation every time, so the GPT model has all the context.
To enable them:
PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.memory_feature_enabled": true,
    "plugins.ml_commons.rag_pipeline_feature_enabled": true
  }
}
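If you want to double-check that both flags took effect, you can read them back (filter_path is optional, it just trims the response to the ml_commons settings):

GET /_cluster/settings?filter_path=persistent.plugins.ml_commons.*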
Add an OpenAI Connector
In the connector we define how to connect to the OpenAI API. Here’s an example:
POST /_plugins/_ml/connectors/_create
{
  "name": "OpenAI Chat Connector",
  "description": "The connector to OpenAI's GPT 3.5",
  "version": 2,
  "protocol": "http",
  "parameters": {
    "endpoint": "https://api.openai.com/",
    "model": "gpt-3.5-turbo",
    "temperature": 0.5
  },
  "credential": {
    "openAIkey": "OPENAI_API_KEY_GOES_HERE"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "${parameters.endpoint}/v1/chat/completions",
      "headers": {
        "Authorization": "Bearer ${credential.openAIkey}"
      },
      "request_body": """{ "model": "${parameters.model}", "messages": ${parameters.messages}, "temperature": ${parameters.temperature} }"""
    }
  ]
}
Let’s go over the elements:
- name, description and version are metadata to define the connector; they’re up to you 🙂
- The protocol has to be http, because it’s what we need for OpenAI. The other option (as of OpenSearch 2.11) is aws_sigv4, which is for the Anthropic model on Amazon Bedrock.
- parameters have to do with your OpenAI model, much as they would in a completion API request. You have to select the endpoint and the model, and you can add other parameters, such as temperature or top_p (which balance GPT’s creativity vs precision). By default, endpoints are limited to OpenAI proper: something like Azure OpenAI doesn’t work until you explicitly allow that endpoint (see the settings example after this list).
- credential is where you fill in authentication info, in this case the OpenAI API key.
- Finally, we’ll have one element in actions, which will be a predict action_type – it’s boilerplate for running the external call. This action describes the completion API request: a POST method to the URL we need, using the headers and the body that we want to send. Notice how all the referenced parameters are already specified, with the exception of ${parameters.messages}, which is built for every request – it represents the conversation so far (from the system prompt to the user question).
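Speaking of allowing other endpoints: the allow-list lives in the plugins.ml_commons.trusted_connector_endpoints_regex cluster setting. Here’s a sketch of what allowing an Azure OpenAI endpoint could look like – the Azure regex is just an illustration (use your own resource name), and keep in mind that setting this value replaces the default list, so include any default patterns (OpenAI, Bedrock, etc.) that you still need:

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.trusted_connector_endpoints_regex": [
      "^https://api\\.openai\\.com/.*$",
      "^https://YOUR_RESOURCE\\.openai\\.azure\\.com/.*$"
    ]
  }
}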
Note the connector ID that you get back; we’ll use it in the model we’re about to define.
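The create call replies with just that ID, something along these lines:

{
  "connector_id": "SOME_CONNECTOR_ID"
}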
Register and Deploy the Model
This is ml-commons boilerplate, really. Our model will go in a model group, which is best declared explicitly. You’d use model groups for access control. Here’s an example:
POST /_plugins/_ml/model_groups/_register
{
  "name": "openai_model_group",
  "description": "The model group we're using now for testing"
}
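If all goes well, the reply should contain the new group’s ID and a status, something like:

{
  "model_group_id": "SOME_MODEL_GROUP_ID",
  "status": "CREATED"
}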
Once again, make note of the ID you get back. Now we can register our model:
POST /_plugins/_ml/models/_register
{
  "name": "test_OpenAI_model",
  "function_name": "remote",
  "model_group_id": "MODEL_GROUP_ID_GOES_HERE",
  "description": "test OpenAI gpt-3.5-turbo model",
  "connector_id": "CONNECTOR_ID_GOES_HERE"
}
Note how we’re using the IDs of the connector and the model group that we got earlier. Besides the name and the description, we also declare that function_name is remote. This is to signal that our model isn’t a typical text embedding model, but a remote model that we call through an API.
Now the model is registered (i.e. defined) and we should get the ID of the registered model. We’ll use this ID for deploying the model:
POST /_plugins/_ml/models/MODEL_ID_GOES_HERE/_deploy
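Deployment should be quick for remote models, but it doesn’t hurt to verify. You can fetch the model and check that model_state says DEPLOYED, and you can run a test prediction straight through the connector. A quick sanity check, assuming the register and deploy calls above succeeded:

GET /_plugins/_ml/models/MODEL_ID_GOES_HERE

POST /_plugins/_ml/models/MODEL_ID_GOES_HERE/_predict
{
  "parameters": {
    "messages": [
      { "role": "user", "content": "Say hello in one short sentence." }
    ]
  }
}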
Now let’s have fun and profit! 🤓
Use the Model in a Search Pipeline
First, we need a search pipeline to tell OpenSearch to use our model in its RAG feature (which we enabled in the beginning) as a search response processor:
PUT /_search/pipeline/openai_pipeline
{
  "response_processors": [
    {
      "retrieval_augmented_generation": {
        "tag": "openai_pipeline_demo",
        "description": "Demo pipeline Using OpenAI Connector",
        "model_id": "MODEL_ID_GOES_HERE",
        "context_field_list": ["severity", "message"],
        "system_prompt": "You are a helpful assistant. Never make up an answer. If you're not sure of something, you MUST say that you don't know.",
        "user_instructions": "I will send you some information that MAY be relevant to the context of the question."
      }
    }
  ]
}
Besides the descriptive fields, like the name in the URL (openai_pipeline), the tag and the description, we also have:
- Our model ID, the same ID that you used while deploying
- The system prompt and the user instructions (effectively, the first message from the user in the conversation) that we give to GPT, along with the search results.
- Search results will be sent to the GPT model in one “user” message per result. context_field_list configures which fields we want to send. Note that (as of 2.11) the fields have to be present in all documents we send; otherwise the whole request fails.
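If you want to double-check what OpenSearch stored, you can read the pipeline back:

GET /_search/pipeline/openai_pipeline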
As sample data, I’ve indexed some OpenSearch logs that I had handy from our OpenSearch logs integration. For this experiment, I’m interested in the message field (which is in every log) and in the severity field (which is in most logs, but not 100%). For example, I might have:
"_source": { ... "message": "loaded module [geo]", "severity": "INFO", ... }
At this point, if I ask OpenAI which modules were loaded, I might make this request:
GET /test-index/_search?search_pipeline=openai_pipeline
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "severity" } }
      ],
      "must": [
        {
          "match": {
            "message": {
              "query": "which modules were loaded?"
            }
          }
        }
      ]
    }
  },
  "ext": {
    "generative_qa_parameters": {
      "llm_model": "gpt-3.5-turbo",
      "llm_question": "which modules were loaded?",
      "conversation_id": "my_first_conversation",
      "interaction_size": 20,
      "context_size": 20,
      "timeout": 30
    }
  }
}
I get back the top 10 results as in a regular query, but I also get this nice reply at the end:
"ext": { "retrieval_augmented_generation": { "answer": "Based on the information provided, the modules that were loaded are geo, percolator, reindex, and systemd.", "interaction_id": "tUiYhosBP1_cixkKNy6D" } }
Now we have our minimal OpenSearch RAG 🎉 Let’s zoom in to see what happened.
Request Details
Let’s start with ext, which is the element from both the request and the reply that deals with the response processor. In the request we put all the parameters under the reserved generative_qa_parameters key:
- llm_model is where we can override the OpenAI model name that we have in the connector definition.
- llm_question is the question or request coming from the user.
- conversation_id is the unique ID of our conversation. Typically, you’d put a session ID in there. Note that in the answer we have interaction_id and we can use these to get conversation details from the Memory API (or by searching the memory index directly, as shown after this list). Even if you don’t need to check the conversation yourself, OpenSearch will need to have access to it when the user wants to ask another question in the same conversation. You’ll reuse the conversation_id in that case.
- interaction_size determines the number of interactions (i.e. request-reply pairs like the one we did above) that are kept for this particular conversation in the “conversation memory”, which is stored in the .plugins-ml-conversation-interactions index. Even if OpenSearch can store a lot of data, you’ll need to make sure you don’t cross the LLM’s token limit for a conversation (e.g. for gpt-3.5-turbo the limit is 4K tokens), otherwise that request will fail.
- context_size determines how many OpenSearch documents (context) will be sent to the LLM. This adds to the number of tokens sent in every interaction. In short, the number of tokens you send is a factor of interaction_size, context_size and the context_field_list that you defined in the search pipeline. And of course, the amount of text in those fields 🙂
- Finally, timeout defines how long to wait for the remote endpoint (in our case the OpenAI API) to respond.
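If you want to peek at what’s stored in the conversation memory, the interactions live in that system index, so a plain search works. A quick sketch – with the Security plugin enabled you may need extra permissions to read system indices, and the exact document fields can vary between versions:

GET /.plugins-ml-conversation-interactions/_search
{
  "size": 10,
  "query": {
    "match_all": {}
  }
}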
Limitations
Speaking of time, you might notice that the took time from the OpenSearch reply only contains the query time, not the time to call the LLM. As of now, if you need to monitor the total time, you’ll have to log it on the application side. Then you can centralize logs somewhere, e.g. in Sematext Logs.
As for the query itself, we need to make sure that it contains all the fields we ask for, hence the exists clause above. The query string doesn’t have to be the user question, but we’d need an automatic way to translate the user question into a query.
It’s also important that the top N most relevant results (where N is context_size) are… relevant 😅 Otherwise we’re giving the LLM the wrong context. Semantic search might help with that, especially in the context of user questions, where the user’s words are less likely to match the words in our index than in a “traditional” text box.
Closing Thoughts
I think this approach to RAG looks promising. You get the conversation memory out of the box, with some control over the number of tokens you store in there. The fact that all fields are mandatory is limiting, but you can work around this limit by throwing all the fields you find into a script field – see the sketch below.
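I haven’t tested that workaround end to end, but the script field part could look like the sketch below: compute one context field with a fallback for missing values, then point context_field_list at it (which also means updating the pipeline definition). Whether the RAG processor reads script fields or only _source may depend on the OpenSearch version, so treat this as a starting point rather than a recipe:

GET /test-index/_search?search_pipeline=openai_pipeline
{
  "query": {
    "match": { "message": "which modules were loaded?" }
  },
  "script_fields": {
    "context": {
      "script": {
        "lang": "painless",
        "source": "def sev = params._source.containsKey('severity') ? params._source.severity : 'UNKNOWN'; return sev + ': ' + params._source.message;"
      }
    }
  },
  "ext": {
    "generative_qa_parameters": {
      "llm_question": "which modules were loaded?",
      "conversation_id": "my_second_conversation"
    }
  }
}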
If you’re using OpenAI, an alternative approach could be function calling. In short, you can let the LLM choose when to get additional context from OpenSearch. It’s up to the LLM (based on your input) to fill in the parameters (query text, for example) which has potential to be smart but also to introduce creative bugs, like a field name typo or the wrong date format. Which means it’s up to you to make sure the interaction between GPT and OpenSearch goes smoothly.
In both approaches, you’ll want to return relevant results from OpenSearch to the LLM, because:
- LLMs have trouble telling whether a result is relevant or not (at least for the moment), so we need to be precise.
- Like humans, LLMs don’t know what they don’t know, so the context needs to be complete. But it also feels like – for the moment – LLMs have a harder time than humans in estimating how much of the context they’re missing.
But don’t worry! If you need help with search relevance, don’t hesitate to reach out to your friendly OpenSearch consultants 😊