Introduction
OpsMx provides continuous delivery solutions built on Spinnaker and is one of the top contributors to open source Spinnaker. Over the years, OpsMx has built a body of knowledge spread across FAQs, blogs, and customer support issues. This knowledge base covers problems and solutions across multiple cloud environments, as well as Spinnaker-specific configurations and fixes.
Because this data was too scattered to be applied quickly to specific problems, we wanted to use LLMs to make the information available as self-service, both for our support personnel and for customers, so they could get better troubleshooting and configuration guidance. Our goal was a chatbot that efficiently draws on historical incident data and other documents to reach a solution, improving both response time and quality.
Overview
Our LLM required a good base model to formulate responses with pre-trained knowledge, and the ability to access our data to search for the correct solutions to the customers’ problems. We followed these steps to build our model, give it access to our data stores, and publish it for general use:
1. Selecting the model
Our first step was determining the base LLM to use and how to train it. We tried two separate methods:
Finetuning selects a subset of a base LLM's weights and updates them with the training data. This would bake our company's data into the model's knowledge so it could be used in any response. We chose the Llama 3 base model with 8 billion parameters from Ollama, which made local training easy. However, we later abandoned this method: the results stayed closer to the model's base knowledge than to the extra training data, meaning it was not weighing our examples enough. We then switched to the Retrieval Augmented Generation (RAG) approach.
In this new method, we create a database of our training data instead of baking it into the model. We create a vector embedding for every document such that similar documents have similar vectors. When the user asks a question, we embed the question and use that embedding to find documents in our database with similar embeddings, which then provide the context for the answer.
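As a simple illustration of this idea (a sketch rather than our production code; the example strings are made up), two related pieces of text produce embedding vectors that sit close together:
import numpy as np
from langchain_google_vertexai import VertexAIEmbeddings

embeddings = VertexAIEmbeddings(model_name="text-embedding-004")

# Embed a stored support document and an incoming user question
doc_vec = np.array(embeddings.embed_query("Spinnaker pipeline stuck in the deploy stage"))
question_vec = np.array(embeddings.embed_query("Why is my Spinnaker deploy stage hanging?"))

# Cosine similarity: values close to 1 mean the two texts are semantically similar
similarity = doc_vec @ question_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(question_vec))
print(similarity)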
For this method we chose a Gemini model from Vertex AI for fast, high-quality responses:
from langchain_google_vertexai import ChatVertexAI

# Low-temperature Gemini model served through Vertex AI
llm = ChatVertexAI(
    temperature=0,
    model_name="gemini-1.0-pro",
    max_output_tokens=2048,
)
2. Data cleaning
Our data was in the form of FAQs in Google Docs, while the previous incident data was in Google Sheets. We used the Google Drive API to access the data and split it into question and answer pairs. For the FAQ data, we simply parsed each document and extracted the questions and answers. For the incidents, we tagged the problem description as the question and the implemented solution as the answer. This produced hundreds of documents, each containing one question and answer pair, to train the model.
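The exact parsing depends on each document's layout, but as a rough sketch of the access step (creds and FAQ_DOC_ID are placeholders for our credentials and file IDs), the Drive API can export a Google Doc as plain text for parsing:
from googleapiclient.discovery import build

# creds is an authorized Google credentials object (setup omitted for brevity)
drive = build("drive", "v3", credentials=creds)

# Export the FAQ Google Doc as plain text so it can be split into question/answer pairs
faq_text = drive.files().export(fileId=FAQ_DOC_ID, mimeType="text/plain").execute()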
The resulting documents were stored as a list of dictionaries containing the title of the problem, the question the user asked, and the expected answer. We used pandas DataFrame methods to get the data ready:
import json
import pandas as pd

# Load the question/answer pairs and keep only the columns we need
documents = pd.read_csv('documents.csv')
documents = documents[['title', 'question', 'answer']]
# Convert the DataFrame into a dictionary of row dictionaries
documents = json.loads(documents.T.to_json())
# Flatten each row into a single text string for embedding
texts = []
for doc in documents:
    texts.append(str(documents[doc]))
3. Training the model
We first tried finetuning the llama3 model on a local machine. We used a PyTorch and Unsloth environment to choose a subset of the model weights and update them with our documents. Each piece of training data contained one question and answer pair from a document, plus a prompt explaining how to format the answer based on the question. We ran the training loop for several hours but were not happy with the results.
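For context, a minimal sketch of that finetuning setup looks roughly like the following, assuming Unsloth's 4-bit Llama 3 base model and a placeholder Hugging Face dataset of formatted question/answer text (qa_dataset):
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a 4-bit quantized Llama 3 base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small subset of the weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

# qa_dataset is a placeholder dataset with the prompt and question/answer pair in a "text" column
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=qa_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(per_device_train_batch_size=2, num_train_epochs=3, output_dir="outputs"),
)
trainer.train()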
We then moved to Google Cloud and built our RAG implementation using Gemini as the LLM and LangChain as the framework. In Vertex AI, we created a Colab notebook to take in our training data of questions and answers. We created a Vector Search index to store the embeddings of each document, along with an endpoint to retrieve the relevant documents for a given question:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Tree-AH index that stores the 768-dimensional document embeddings
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="llm_documents_index",
    dimensions=768,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="Document Storage for LLM",
)

# Public endpoint used to query the index at serving time
index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="llm_documents_endpoint",
    description="Index Endpoint for LLM",
    public_endpoint_enabled=True,
)
We loaded the documents into this data store by first creating a vector store object that handles the embeddings. We also created a retriever to match a query against documents with similar embeddings. From there we uploaded a list of Document objects containing our information, which is the type the vector store expects.
import uuid

from langchain_google_vertexai import VectorSearchVectorStore, VertexAIEmbeddings
from langchain.storage import InMemoryStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document

# vectorstore contains the embeddings of all documents
vectorstore = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    gcs_bucket_name=GCS_BUCKET,
    index_id=index.name,
    endpoint_id=index_endpoint.name,
    embedding=VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME),
    stream_update=False,
)

# docstore maps document IDs to the raw document text
docstore = InMemoryStore()
id_key = "doc_id"

# retriever finds similar embeddings to a given query
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

# give every document a unique ID shared by both stores
doc_ids = [str(uuid.uuid4()) for _ in texts]

# create document objects that can be added to storage
all_docs = [
    Document(page_content=str(s), metadata={id_key: doc_ids[i]})
    for i, s in enumerate(texts)
]
retriever.docstore.mset(list(zip(doc_ids, texts)))
retriever.vectorstore.add_documents(all_docs)
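As a quick sanity check (a sketch; the query string below is only an illustrative example), the retriever can be queried directly to see which stored documents it matches:
similar_docs = retriever.invoke("Spinnaker pipeline stuck in the deploy stage")
for doc in similar_docs:
    print(doc)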
We then created a pipeline in LangChain to retrieve the four most relevant documents for the user's question. A prompt is built that writes these documents in as context, along with an instruction to answer the user's question based on the information in that context. The outputs from this method were much better than those from finetuning.
from langchain_google_vertexai import ChatVertexAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough


def combine_context_question(inputs):
    """
    Combine the context and question to create the prompt for the LLM
    """
    context = inputs.get("context", "")
    question = inputs.get("question", "")
    prompt = f"Context: {context}\n\nQuestion: {question}"
    return prompt


llm_chain = (
    {
        "context": retriever,  # Retrieve similar documents
        "question": RunnablePassthrough(),  # Question from user
    }
    | RunnableLambda(combine_context_question)  # Create prompt
    | ChatVertexAI(  # Ask question to LLM
        temperature=0,
        model_name=MODEL_NAME,
        max_output_tokens=TOKEN_LIMIT,
    )
    | StrOutputParser()  # Return output
)
4. Pushing to production
We tested the model with small changes to parameters such as the number of documents retrieved and the model temperature to find the best possible output. We gathered feedback from internal teams, using their example questions, to ensure high-quality outputs. To push this to production, we created a Docker container with the model's pipeline and all the necessary packages, then uploaded it to Vertex AI's Model Registry to host it for teams to use.
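As a rough sketch of this last step (the image URI, routes, and display name below are placeholders rather than our actual values), registering a custom serving container with the Vertex AI Model Registry looks like this:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Register the containerized RAG pipeline so internal teams can deploy and call it
model = aiplatform.Model.upload(
    display_name="support-rag-chatbot",                                     # hypothetical name
    serving_container_image_uri=f"gcr.io/{PROJECT_ID}/support-rag:latest",  # placeholder image
    serving_container_predict_route="/predict",                             # assumed routes
    serving_container_health_route="/health",
)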
Training Methods
1. Finetuning
Our first idea was to bake the knowledge into the weights of the LLM so it could be used by simply calling the model. This method is easier to deploy and use, since only the model weights are needed to run it. It can be trained locally to save on cost, or on the cloud for better speed, giving more options. Because the data is already embedded in the model, prompts stay simple and easy for anyone to write.
However, we found that the results were closer to the model's base knowledge than to the extra training data, meaning it was not weighing our examples enough. We wanted the model to prioritize the company's data over its pre-training, because that data is more relevant to the problem at hand. In addition, training takes a lot of time and memory, which makes adjustments, such as adding more training data in the future, difficult.
2. RAG
We used RAG to address the finetuning problem of answers not reflecting the training data. Since we instruct the LLM to generate answers based only on similar, retrieved answers, the result is much more likely to be what we want. This approach also saves a lot of training time, since building a vector database is much faster than tuning an LLM. And because we use an off-the-shelf base LLM, we can easily swap between models such as Gemini 1.5 Flash and Gemini 1.0 Pro for different use cases.
However, with this approach we now have to maintain a vector database on Google Cloud, which adds cost. To keep the model running, we also have to keep the database and endpoints up, rather than just shipping the weights. In addition, the database might not contain any documents relevant to a completely new issue, leaving the model unable to give a grounded answer. Despite these limitations, we found this method to be the better one because of its higher-quality answers.
Conclusion
Our final chatbot responded with accurate information and was useful in answering questions about errors that can come up while using OpsMx products. When creating the database of documents for the RAG pipeline, we embedded the documents with text-embedding-004 in Vertex AI, which gave good retrieval results: we could consistently retrieve documents that were relevant to the user's question and useful for forming an answer. We used gemini-1.0-pro as the base model because it was very good at summarizing the retrieved documents into an answer for the given question. By setting the model's temperature to 0, we kept the answers grounded in the documents rather than hallucinating new information. We experimented with the number of documents included in the context and settled on four, which provided enough information without exceeding the token limit or slowing down responses. Overall, our setup gives satisfactory results and can be used in production to supplement our customer service representatives in their responses.
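For reference, the settings we landed on can be summarized in a short configuration sketch (reusing the vector store and docstore built earlier; variable names are illustrative):
# Final settings: text-embedding-004 embeddings, four retrieved documents, temperature 0
embeddings = VertexAIEmbeddings(model_name="text-embedding-004")
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key="doc_id",
    search_kwargs={"k": 4},   # include the four most relevant documents as context
)
llm = ChatVertexAI(
    temperature=0,            # keep answers grounded in the retrieved documents
    model_name="gemini-1.0-pro",
    max_output_tokens=2048,
)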