December 19, 2024
Welcome to Lesson 10 of 11 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready “LLM twin” of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with Lesson 1.
Before jumping into the lesson, let’s walk through a short recap to understand how we got here:
→ In Lesson 8, we focused on common evaluation methods for the various tasks LLMs perform. For our content-generation use case, we used a larger model (GPT-3.5-Turbo) via API to assess the coherence of our LLM generations and quantify other metrics.
→ In Lesson 9, we showcased how to implement and deploy the inference pipeline of the LLM twin system on Qwak [2], iterating on the microservice-based design and separating the ML and business logic into two layers.
In Lesson 10 we’ll focus on the RAG-evaluation logic.
Here, we’ll showcase the evaluation steps we’re performing, and how we structure the evaluation payload step-by-step. We’ll present one of the best RAG evaluation frameworks (RAGAs [5]) and discuss the metrics, implementation, and other nice functionalities it provides.
Ultimately, we’ll learn how to monitor complex chains by designing each chain step individually, attaching metadata to it, and logging to Comet-LLM.
Here’s what we’re going to learn in this lesson:
RAG evaluation involves assessing how well the model integrates retrieved information into its responses. This requires evaluating not just the quality of the generated text, but also the accuracy and relevance of the retrieved information, and how effectively it enhances the final output.
Building a RAG pipeline is fairly simple. You just need a Vector-DB knowledge base, an LLM to process your prompts, and additional logic for the interactions between these modules.
Reaching a satisfying performance level for a RAG pipeline poses its own challenges precisely because of these “separate” components:
When evaluating a RAG pipeline, we must evaluate both components separately and together to understand if and where the pipeline still needs improvement; this helps us identify its “quality”. Additionally, to understand whether its performance is improving, we need to evaluate it quantitatively.
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. Existing tools and frameworks help you build these pipelines (e.g., LlamaIndex), but evaluating them and quantifying their performance can be hard.
This is where Ragas (RAG Assessment) comes in.
The RAGAs [5] framework (5.3k ⭐️) is open-source, maintained by the explodinggradients group, and comes with an accompanying paper: RAGAs Paper [6].
One of the core concepts of RAGAs is Metric-Driven Development (MDD), a product development approach that relies on data to make well-informed decisions. The focus is to leverage powerful LLMs under the hood to conduct targeted evaluation processes, instead of relying on HITL (human-in-the-loop) ground-truth annotations.
Let’s iterate over the metrics that RAGAs exposes (see RAGAs Metrics [4]):
Metrics for Retrieval Stage 🔽:
Metrics for Generation Stage 🔽:
A subset or all of these metrics can be used throughout the evaluation setup. In our LLM-Twin RAG use case, we’ll use 6 metrics that target both the Retrieval and Generation modules: context_utilization, context_relevancy, context_recall, context_entity_recall, answer_similarity, and answer_correctness.
To evaluate the RAG pipeline, RAGAs expects the following dataset format:
question: the user query; this is the input to our RAG pipeline.
answer: the answer generated by the RAG pipeline, given the query + context prompt.
contexts: the context retrieved from the knowledge base (the Vector Database).
ground_truths: the ground-truth answer to the question.
[Note]: The `ground_truths` field is necessary only if the Context Recall metric is used.
📓 All the listed RAGAs metrics use the question, answer, and contexts fields. It is important to note that the only metric that requires the ground_truths field is Context Recall, as it measures whether all the relevant information required to answer the question was retrieved from the Vector DB.
Here’s a quick example of what a dataset setup for RAGAs looks like:
from datasets import Dataset

questions = ["When was the Eiffel Tower built and how tall is it?"]
answers = [
    "As of my last update in April 2023, the Eiffel Tower was built in 1889 and is 324m tall."
]
contexts = [
    # One list of retrieved context chunks per sample.
    [
        "The Eiffel Tower is one of the most attractive monuments to visit when in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair. It stands at 324 meters tall."
    ]
]
ground_truths = [
    ["The Eiffel Tower was built in 1889 and it stands at 324 meters tall."]
]

sample = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths,
}
eval_dataset = Dataset.from_dict(sample)
Here’s what the dataset looks like:
#> print(eval_dataset)
Dataset({
features: ['question', 'answer', 'contexts', 'ground_truths'],
num_rows: 1
})
Once the dataset is created, RAGAs requires a set of metrics to be passed to the evaluation method:
from ragas import evaluate
from ragas.metrics import (
    answer_similarity,
    context_recall,
)

scores = evaluate(
    dataset=eval_dataset,
    metrics=[context_recall, answer_similarity],
)
# Scores will be a dictionary of this format
# scores = {
# "context_recall": 0.95,
# "answer_similarity": 0.98
# }
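In practice, evaluate returns a dict-like result object rather than a plain dict, so it is often handy to convert it for inspection or logging. Here is a minimal sketch, assuming the installed RAGAs version exposes a to_pandas() helper on the result:

# Inspect the raw scores (dict-like access).
print(scores["context_recall"], scores["answer_similarity"])

# Convert to a pandas DataFrame with one row per evaluated sample
# (assumes the installed RAGAs version provides Result.to_pandas()).
scores_df = scores.to_pandas()
print(scores_df.head())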
Now that we’ve gone over the prerequisites necessary to work with RAGAs, let’s see the framework applied to our LLM-Twin RAG evaluation use case.
Within this evaluation stage, we’ll focus on this section of the LLM Twin system design:
Here’s the workflow overview (a condensed sketch of how these steps fit together follows the list):
1. Define the Evaluation Prompt Template
2. Define the user query
3. Retrieve context related to our user query from our Vector Database
4. Format the prompt and pass it to our LLM model
5. Capture the answer, and use the query/context to prepare the evaluation data samples
6. Evaluate with RAGAs
7. Construct the evaluation Chain, append metadata, and log to Comet ML
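Before unpacking each step, here is a rough, condensed sketch of how these stages could fit together in a single function. The helper names build_prompt, call_llm, and log_chain_to_comet are hypothetical placeholders (not the repository’s actual helpers); the real implementations are covered in the sections below:

# Hypothetical end-to-end sketch of the workflow above; helper names are placeholders.
def generate_and_evaluate(query: str) -> dict:
    # Steps 2-3: retrieve and re-rank context for the user query.
    retriever = VectorRetriever(query=query)
    hits = retriever.retrieve_top_k(
        k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
    )
    context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)

    # Steps 1 and 4: format the prompt and call the LLM deployed on Qwak.
    prompt = build_prompt(question=query, context=context)  # hypothetical helper
    answer = call_llm(prompt)  # hypothetical helper wrapping the Qwak inference endpoint

    # Steps 5-6: build the evaluation sample and score it with RAGAs.
    rag_eval_scores = evaluate_w_ragas(query=query, context=context, output=answer)

    # Step 7: log the full chain (query, context, answer, scores) to Comet LLM.
    log_chain_to_comet(query, context, answer, rag_eval_scores)  # hypothetical helper

    return {"answer": answer, "rag_eval_scores": rag_eval_scores}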
🗒 One interesting detail to note before diving into the implementation: we aim to make the LLM Twin replicate our writing style.
For this particular use case, we can use the context we’re retrieving from our Vector Database as the ground_truth itself when evaluating.
Why❓
Since we already store our writings (posts/articles/code) in the Vector DB, they can play a double role: the context we’re passing to the LLM for generation and, at the same time, the ground_truth we’re comparing the RAG response to during evaluation.
With that detail in mind, let’s now go through the implementation, building the query/response/context payloads for evaluation step by step.
The Generation Prompt Template
class InferenceTemplate(BasePromptTemplate):
    simple_prompt: str = """You are an AI language model assistant. Your task is to generate a cohesive and concise response to the user question.
    Question: {question}
    """

    rag_prompt: str = """You are a specialist in technical content writing. Your task is to create technical content based on a user query, given a specific context with additional information consisting of the user's previous writings and their knowledge.

    Here is a list of steps that you need to follow in order to solve this task:
    Step 1: You need to analyze the user-provided query: {question}
    Step 2: You need to analyze the provided context and how the information in it relates to the user question: {context}
    Step 3: Generate the content, keeping in mind that it needs to be as cohesive and concise as possible, related to the subject presented in the query, and similar to the user's writing style and knowledge presented in the context.
    """

    def create_template(self, enable_rag: bool = True) -> PromptTemplate:
        if enable_rag is True:
            return PromptTemplate(
                template=self.rag_prompt, input_variables=["question", "context"]
            )

        return PromptTemplate(template=self.simple_prompt, input_variables=["question"])
Unpacking this template, we’re specifying in the system prompt that our LLM model should analyze the query in Step 1, analyze the retrieved context in Step 2, and comply with the generation instructions in Step 3.
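As a quick illustration of how this template might be used downstream (a small sketch; it assumes PromptTemplate exposes a format() method, as in LangChain, and that query and context are already available):

# Build the RAG prompt from the template and fill in the query and retrieved context.
template = InferenceTemplate().create_template(enable_rag=True)
prompt = template.format(question=query, context=context)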
Preparing the Evaluation Payload
Let’s start by iterating over each module sequentially. We have defined our PromptTemplate and assigned the question field with the input query. Next, we have to retrieve context samples from our Vector Database.
Here’s how the retrieval logic works:
# 1. We instantiate a VectorRetriever that communicates with Vector DB.
retriever = VectorRetriever(query=query)
# 2. Initial fetch of K entries
hits = retriever.retrieve_top_k(
k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
)
# 3. Re-rank entries using post-retrieval augmentation techniques
context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)
# 4. Update context
prompt_template_variables["context"] = context
prompt = prompt_template.format(question=query, context=context)
To get a deeper dive into the re-ranking techniques mentioned at Step 3 in the code above, make sure to check 📓 Lesson 5.
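As a reminder of what such a re-ranking step can look like, here is a minimal sketch using a cross-encoder to score each (query, passage) pair; this is only an illustration of the general technique, not necessarily the exact implementation from Lesson 5:

from sentence_transformers import CrossEncoder

def rerank_passages(query: str, passages: list[str], keep_top_k: int) -> list[str]:
    # Score every (query, passage) pair with a cross-encoder and keep the best ones.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:keep_top_k]]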
After we’ve retrieved the context, it’s time to pass our prompt to the inference pipeline deployed on Qwak [2] and get the LLM generation response.
For a deeper dive into how the inference pipeline was built and deployed, 📓 Lesson 9 covers it in great detail.
Next, we have the evaluation block code:
if enable_evaluation is True:
    if enable_rag:
        st_time = time.time_ns()
        rag_eval_scores = evaluate_w_ragas(
            query=query, output=answer, context=context
        )
        en_time = time.time_ns()
        self._timings["evaluation_rag"] = (en_time - st_time) / 1e9

    st_time = time.time_ns()
    llm_eval = evaluate_llm(query=query, output=answer)
    en_time = time.time_ns()
    self._timings["evaluation_llm"] = (en_time - st_time) / 1e9

    evaluation_result = {
        "llm_evaluation": "" if not llm_eval else llm_eval,
        "rag_evaluation": {} if not rag_eval_scores else rag_eval_scores,
    }
else:
    evaluation_result = None
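As a side note, the repeated time.time_ns() bookkeeping could be factored into a small context manager; a minimal sketch (not part of the original implementation), assuming a _timings dictionary like the one above:

import time
from contextlib import contextmanager

@contextmanager
def timed(timings: dict, key: str):
    # Record the elapsed wall-clock time (in seconds) under `key`.
    start = time.time_ns()
    try:
        yield
    finally:
        timings[key] = (time.time_ns() - start) / 1e9

# Usage (hypothetical):
# with timed(self._timings, "evaluation_rag"):
#     rag_eval_scores = evaluate_w_ragas(query=query, output=answer, context=context)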
Key insights from this implementation:
The plain LLM evaluation (evaluate_llm) works on (query, response) pairs.
The RAG evaluation (evaluate_w_ragas) works on (query, response, context) pairs.
We use the _timings dictionary to track the execution duration for performance profiling purposes.
The core RAGAs evaluation functionality is handled within the evaluate_w_ragas method; here’s what it looks like:
# Note: the exact import paths for ChatOpenAI / HuggingFaceEmbeddings depend on
# the installed LangChain version.
from datasets import Dataset
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from pandas import DataFrame
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_similarity,
    context_entity_recall,
    context_recall,
    context_relevancy,
    context_utilization,
)
METRICS = [
context_utilization,
context_relevancy,
context_recall,
answer_similarity,
context_entity_recall,
answer_correctness,
]
def evaluate_w_ragas(query: str, context: list[str], output: str) -> DataFrame:
"""
Evaluate the RAG (query,context,response) using RAGAS
"""
data_sample = {
"question": [query], # Question as Sequence(str)
"answer": [output], # Answer as Sequence(str)
"contexts": [context], # Context as Sequence(str)
"ground_truth": [context], # Ground Truth as Sequence(str)
}
oai_model = ChatOpenAI(
model=settings.OPENAI_MODEL_ID,
api_key=settings.OPENAI_API_KEY,
)
embd_model = HuggingFaceEmbeddings(model_name=settings.EMBEDDING_MODEL_ID)
dataset = Dataset.from_dict(data_sample)
score = evaluate(
llm=oai_model,
embeddings=embd_model,
dataset=dataset,
metrics=METRICS,
)
return score
What should we note here:
The evaluation samples are wrapped in the data_sample dictionary, with the retrieved context reused as the ground_truth.
The evaluation LLM is the one set in settings.OPENAI_MODEL_ID (gpt-4-1106-preview).
The embedding model is the one set in settings.EMBEDDING_MODEL_ID (sentence-transformers/all-MiniLM-L6-v2).
The dataset, metrics, LLM, and embedding model are all passed to the RAGAs evaluate method.
Once the execution gets to this stage, we might see the following logs section in the console:
Once the evaluation is completed, the score variable will hold a dict of this format:
score = {
"context_utilization": float, # how useful is context to generated answer
"context_relevancy": float, # how relevant is context to given query
"context_recall": float, # proportion of relevant retrieved context
"answer_similarity": float, # semantic similarity
"answer_correctness": float, # factually correctness
"context_entity_recall": float,# recall of relevant entities in context
}
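Before attaching these scores to the monitoring chain in the next section, they can be flattened into plain rounded floats; a small sketch, assuming score behaves like the dict shown above:

# Round the metric values so they log cleanly alongside the chain metadata.
flat_scores = {metric: round(float(value), 4) for metric, value in score.items()}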
In the next section, let’s compose in a step-by-step fashion, the full evaluation chain and log it to Comet LLM [3] for monitoring.
Prompt monitoring is crucial in LLM-based applications for several reasons. It helps ensure the quality and relevance of responses, maintaining accuracy and coherence in user interactions, while also allowing the ML engineers maintaining the project to identify and mitigate bias or hallucinations and fix them early on.
📓 In Lesson 8, we’ve described Prompt Monitoring advantages in more detail.
In this section, we’ll focus solely on how to compose end-to-end Chains and log them to Comet LLM [3]. Let’s dive into the code and describe each component a Chain consists of.
Step 1: Defining the Chain Start
Here we specify the project and workspace from Comet ML where we want to log this chain, and set its inputs to mark the start.
import comet_llm
comet_llm.init([project])
comet_llm.start_chain(
inputs={'user_query' : [our query]},
project=[comet-llm-project],
api_key=[comet-llm-api-key],
workspace=[comet-llm-ws]
)
Step 2: Defining Chain Stages
We’re using multiple Span (comet_llm.Span) objects to define chain stages. Inside a Span object, we have to define:
category: acts as a group key.
name: the name of the current chain step (will appear in the Comet ML UI).
inputs: a dictionary, used to link with previous chain steps (Spans).
outputs: a dictionary where we define the outputs from this chain step.

with comet_llm.Span(
    category="RAG Evaluation",
    name="ragas_eval",
    inputs={"query": [our_query], "context": [our_context], "answers": [llm_answers]},
) as span:
    span.set_outputs(outputs={"rag-eval-scores": [ragas_scores]})
Step 3: Defining the Chain End
The last step, after starting the chain and appending chain stages, is to mark the chain’s end and return the final response.
comet_llm.end_chain(outputs={"response": [our-rag-response]})
Now that we’ve understood the logic behind Comet LLM [3] Chain monitoring, let’s see what the actual implementation looks like:
# == START CHAIN ==
comet_llm.init(project=f"{settings.COMET_PROJECT}-monitoring")
comet_llm.start_chain(
    inputs={"user_query": query},
    project=f"{settings.COMET_PROJECT}-monitoring",
    api_key=settings.COMET_API_KEY,
    workspace=settings.COMET_WORKSPACE,
)

# == CHAINING STEPS ==
with comet_llm.Span(
    category="Vector Retrieval",
    name="retrieval_step",
    inputs={"user_query": query},
) as span:
    span.set_outputs(outputs={"retrieved_context": context})

with comet_llm.Span(
    category="LLM Generation",
    name="generation_step",
    inputs={"user_query": query},
) as span:
    span.set_outputs(outputs={"generation": llm_gen})

with comet_llm.Span(
    category="Evaluation",
    name="llm_eval_step",
    inputs={"query": llm_gen, "user_query": query},
    metadata={"model_used": settings.OPENAI_MODEL_ID},
) as span:
    span.set_outputs(outputs={"llm_eval_result": llm_eval_output})

with comet_llm.Span(
    category="Evaluation",
    name="rag_eval_step",
    inputs={
        "user_query": query,
        "retrieved_context": context,
        "llm_gen": llm_gen,
    },
    metadata={
        "model_used": settings.OPENAI_MODEL_ID,
        "embd_model": settings.EMBEDDING_MODEL_ID,
        "eval_framework": "RAGAS",
    },
) as span:
    span.set_outputs(outputs={"rag_eval_scores": rag_eval_scores})

# == END CHAIN ==
comet_llm.end_chain(outputs={"response": llm_gen})
📓 For the full chain monitoring implementation, check the PromptMonitoringManager class.
You might have noticed that Spans also have a metadata field attached; we’re using it to log additional data that is important solely to the current chain step. For instance, in the rag_eval_step, we’re adding the evaluation framework and the model types used. In the Comet ML UI, we can see the metadata attached.
For a refresher on how we evaluate the LLM model only, make sure to check 📓 Lesson 8, where we’ve described it in detail.
And if we want to see the RAG evaluation scores:
Here we’re wrapping up Lesson 10 of the LLM Twin free course.
We’ve described the LLM-Twin RAG evaluation workflow using a powerful framework called RAGAs. We’ve explained the metrics used, how to implement the evaluation functionality, and how to compose the evaluation dataset.
Additionally, we’ve showcased and exemplified how to effectively monitor chains with multiple execution steps on Comet LLM [3], how to attach metadata, how to group chain steps, and more.
By completing Lesson 10, you’ve gained a good understanding of how you can build a full RAG evaluation pipeline using RAGAs. You’ve learned the Retrieval & Generation specific metrics you could use and all the details required to log large LLM chains to Comet LLM [3].
In Lesson 11, we’ll start our bonus series on improving the RAG feature pipeline to make the RAG system more scalable and accurate. We will also show you how to make the code cleaner and more concise.
🔗 Check out the code on GitHub [1] and support it with a ⭐️
[1] LLM Twin Github Repository, 2024, Decoding ML GitHub Organization
[2] Qwak, 2024, The Qwak.ai Platform landing Page
[3] Comet LLM, The Comet LLM Platform
[4] RAGAs Metrics, The RAGAs Framework Metrics Documentation
[5] RAGAs, The RAGAs Framework Github Repository
[6] RAGAs Paper, 2023, The RAGAs Arxiv Paper