
How to Evaluate Your RAG Using the RAGAs Framework

Welcome to Lesson 10 of 11 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready “LLM twin” of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with Lesson 1.


Before jumping into the lesson, let’s walk through a short recap to understand how we got here:

→ In Lesson 8, we focused on common evaluation methods for the various tasks LLMs perform. Specifically, for our content-generation use case, we used a larger model (GPT-3.5-Turbo) via API to assess coherence and quantify other metrics for our LLM generations.

→ In Lesson 9, we showcased how to implement and deploy the inference pipeline of the LLM twin system on Qwak [2], and iterated on the microservice-based design, separating the ML and business logic into two layers.

In Lesson 10 we’ll focus on the RAG-evaluation logic.

Here, we’ll showcase the evaluation steps we’re performing, and how we structure the evaluation payload step-by-step. We’ll present one of the best RAG evaluation frameworks (RAGAs [5]) and discuss the metrics, implementation, and other nice functionalities it provides.

Ultimately, we’ll learn how to monitor complex chains by designing each chain step individually, attaching metadata to it, and logging to Comet-LLM.

Here’s what we’re going to learn in this lesson:

  • Evaluation techniques for RAG applications.
  • How to use RAGAs to evaluate RAG applications.
  • How to build metadata chains and log them to CometML-LLM.
  • The LLM-Twin RAG evaluation workflow.
technical flow chart showing the logic and processes for evaluating the LLM twin retrieval-augmented generation
The LLM-Twin RAG Evaluation Workflow. Image by Author.

Table of Contents

  1. What is RAG evaluation?
  2. The RAGAs Framework
  3. How Do We Evaluate Our RAG Application?
  4. Advanced Prompt-Chain Monitoring
  5. Conclusion

What is RAG evaluation?

RAG evaluation involves assessing how well the model integrates retrieved information into its responses. This requires evaluating not just the quality of the generated text, but also the accuracy and relevance of the retrieved information, and how effectively it enhances the final output.

Building a RAG pipeline is fairly simple. You just need a Vector-DB knowledge base, an LLM to process your prompts, and additional logic for the interactions between these modules.

Reaching a satisfying performance level for a RAG pipeline poses its own challenges because of the “separate” components it is built from (a minimal sketch of their interplay follows this list):

  • Retriever — which takes care of querying the Knowledge Database and retrieves additional context that matches the user’s query.
  • Generator — which encompasses the LLM module, generating an answer based on the context-augmented prompt.
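To make the split concrete, here is a deliberately minimal sketch of how the two components interact; the names (`retriever.search`, `llm.generate`) are illustrative placeholders, not the project’s actual interfaces:

# Minimal, illustrative RAG loop: names and signatures are placeholders.
def rag_answer(query: str, retriever, llm) -> str:
    # Retriever: query the knowledge base for the chunks closest to the user query.
    context_chunks = retriever.search(query, top_k=3)
    # Generator: answer from the context-augmented prompt.
    prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {query}"
    return llm.generate(prompt)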

When evaluating a RAG pipeline, we must evaluate both components separately and together to understand if and where the pipeline still needs improvement; this helps us identify its “quality”. Additionally, to understand whether its performance is improving over time, we need to evaluate it quantitatively.

The RAGAs Framework

Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. There are existing tools and frameworks that help you build these pipelines (e.g., LlamaIndex), but evaluating them and quantifying your pipeline’s performance can be hard.
This is where Ragas (RAG Assessment) comes in.

The RAGAs [5] framework (5.3k ⭐️) is open-source, developed by the explodinggradients group, and is accompanied by a paper: RAGAs Paper [6].

One of the core concepts of RAGAs is Metric-Driven Development (MDD), a product development approach that relies on data to make well-informed decisions. The focus is on leveraging powerful LLMs under the hood to conduct targeted evaluation processes, instead of relying on HITL (human-in-the-loop) ground-truth annotations.

RAGAs Metrics

Let’s iterate over the metrics that RAGAs [4] exposes:

Metrics for Retrieval Stage 🔽:

  1. Context Precision
    Evaluates the precision of the context used to generate an answer, ensuring relevant information is selected from the context
  2. Context Relevancy
    Measures how relevant the selected context is to the question. Helps improve context selection to enhance answer accuracy.
  3. Context Recall
    Measures if all the relevant information required to answer the question was retrieved.
  4. Context Entities Recall
    Evaluates the recall of entities within the context, ensuring that no important entities are overlooked in context retrieval.

Metrics for Generation Stage 🔽:

  1. Faithfulness
    Measures how accurately the generated answer reflects the source content, ensuring the generated content is truthful and reliable.
  2. Answer Relevance
    Assesses how pertinent the answer is to the given question, validating that the response directly addresses the user’s query.
  3. Answer Semantic Similarity
    Quantifies the semantic similarity between the generated answer and the expected “ideal” answer. Shows that the generated content is semantically aligned with expected responses.
  4. Answer Correctness
    Focuses on fact-checking, assessing the factual accuracy of the generated answer.

A subset or all of these metrics can be used throughout the evaluation setup. In our LLM-Twin RAG use case, we’ll use 6 metrics that target both the Retrieval and Generation modules (see the import sketch after this list):

  • Context Precision, Recall, Relevancy, and Entity Recall — for Retrieval.
  • Answer Relevancy, Answer Semantic Similarity — for Generation.
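For orientation, these metrics map onto `ragas.metrics` names roughly as follows (a minimal sketch, assuming a ragas ~0.1.x release; the exact metric set wired into our pipeline appears later in `evaluate_w_ragas`):

from ragas.metrics import (
    answer_relevancy,
    answer_similarity,
    context_entity_recall,
    context_precision,
    context_recall,
    context_relevancy,
)

# Grouped by the pipeline stage they target.
RETRIEVAL_METRICS = [context_precision, context_relevancy, context_recall, context_entity_recall]
GENERATION_METRICS = [answer_relevancy, answer_similarity]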

RAGAs Evaluation Format

To evaluate the RAG pipeline, RAGAs expects the following dataset format:

question       : The user query, this is the input to our RAG.
answer         : The generated answer from the RAG pipeline, given the query + context prompt
contexts       : Context retrieved from the knowledge base (the Vector Database)
ground_truths  : The ground truth answer to the question.

[Note]: The `ground_truths` field is necessary only if the Context Recall metric is used.

📓 All the listed RAGAs metrics use the question, answer, and contexts fields. It is important to note that the only metric that requires the ground_truths field is Context Recall, as it measures whether all the relevant information required to answer the question was retrieved from the Vector DB.

Here’s a quick example of what a dataset setup for RAGAs looks like:

from datasets import Dataset

questions = ["When was the Eiffel Tower built and how tall is it?"]
answers = ["As of my last update in April 2023, the Eiffel Tower was built in 1889 and is 324m tall."]
contexts = [
    [
        "The Eiffel Tower is one of the most attractive monuments to visit when in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair. It stands at 324 meters tall."
    ]
]
ground_truths = [
    ["The Eiffel Tower was built in 1889 and it stands at 324 meters tall."]
]

sample = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,            # one list of context chunks per question
    "ground_truths": ground_truths,
}

eval_dataset = Dataset.from_dict(sample)

Here’s what the dataset looks like:

#> print(eval_dataset)
Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 1
})

Once the dataset is created, RAGAs requires a set of metrics to be passed to the evaluation method:

from ragas import evaluate
from ragas.metrics import (
    answer_similarity,
    context_recall,
)

scores = evaluate(
    dataset=eval_dataset,
    metrics=[context_recall, answer_similarity],
)

# Scores will be a dictionary of this format
# scores = {
#    "context_recall": 0.95,
#    "answer_similarity": 0.98
# }
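
If you prefer a tabular view, the result can also be converted to a pandas DataFrame (assuming a ragas version whose result object exposes a to_pandas() method):

scores_df = scores.to_pandas()  # one row per evaluated sample, plus one column per metric
print(scores_df[["context_recall", "answer_similarity"]])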

Now that we’ve gone over the prerequisites necessary to work with RAGAs, let’s see the framework applied to our LLM-Twin RAG evaluation use case.

How Do We Evaluate Our RAG Application?

Within this evaluation stage, we’ll focus on this section of the LLM Twin system design:

section of llm twin system design showing evaluation, prompt monitoring, and inference pipeline
Section from LLM Twin’s System Design. Image by the author.

Here’s the workflow overview:
1. Define the evaluation prompt template
2. Define the user query
3. Retrieve context related to the user query from our Vector Database
4. Format the prompt and pass it to our LLM model
5. Capture the answer, and use the query/context to prepare the evaluation data samples
6. Evaluate with RAGAs
7. Construct the evaluation Chain, append metadata, and log it to CometML

detailed diagram of RAG evaluation workflow
The RAG Evaluation workflow. Image by author.

🗒 One interesting detail before diving into the implementation: we should note that we aim to make the LLM-Twin replicate our writing style.

For this particular use case, we could make use of the context that we’re retrieving from our Vector Database as the ground_truth itself when evaluating.

Why❓

Since we already store our writings (posts/articles/code) in the Vector DB, they can play a double role: they are the context we’re passing to the LLM for generation and, at the same time, the ground_truth that we compare the RAG response against during evaluation.
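
Concretely, an evaluation sample could then be assembled like this (illustrative placeholder values only; the pipeline’s actual version appears later in evaluate_w_ragas):

query = "How should I structure a technical article?"                # illustrative user query
rag_answer = "..."                                                    # generation from the LLM Twin
context_chunks = ["<chunk of a past article>", "<another chunk>"]     # writings retrieved from Qdrant

data_sample = {
    "question": [query],
    "answer": [rag_answer],
    "contexts": [context_chunks],      # what the LLM saw at generation time
    "ground_truth": [context_chunks],  # the same writings reused as the reference
}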

With that detail in mind, let’s now go through the implementation, following this blueprint:

  1. We’ll go over the Prompt Templates.
  2. We’ll prepare the query/response/context payloads for evaluation.
  3. We’ll evaluate using RAGAs.
  4. We’ll monitor everything on CometML.

The Generation Prompt Template

# Assumed import: PromptTemplate comes from LangChain; BasePromptTemplate is the project's own base class.
from langchain.prompts import PromptTemplate


class InferenceTemplate(BasePromptTemplate):
    simple_prompt: str = """You are an AI language model assistant. Your task is to generate a cohesive and concise response to the user question.
    Question: {question}
    """

    rag_prompt: str = """You are a specialist in technical content writing. Your task is to create technical content based on a user query given a specific context
    with additional information consisting of the user's previous writings and his knowledge.

    Here is a list of steps that you need to follow in order to solve this task:
    Step 1: You need to analyze the user-provided query: {question}
    Step 2: You need to analyze the provided context and how the information in it relates to the user question: {context}
    Step 3: Generate the content keeping in mind that it needs to be as cohesive and concise as possible, related to the subject presented in the query and similar to the user's writing style and knowledge presented in the context.
    """

    def create_template(self, enable_rag: bool = True) -> PromptTemplate:
        if enable_rag is True:
            return PromptTemplate(
                template=self.rag_prompt, input_variables=["question", "context"]
            )

        return PromptTemplate(template=self.simple_prompt, input_variables=["question"])

Unpacking this template, we’re specifying in the system prompt that our LLM should analyze the query in Step 1, analyze the retrieved context in Step 2, and comply with the generation instructions in Step 3.
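For illustration, the template might be instantiated and formatted roughly like this (the query and context values are placeholders):

query = "How do I design a streaming ingestion pipeline?"   # illustrative user query
context = "<chunks retrieved from the vector DB>"            # illustrative retrieved context

template = InferenceTemplate().create_template(enable_rag=True)
prompt = template.format(question=query, context=context)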

Preparing the Evaluation Payload

Let’s start by iterating over each module sequentially. We have defined our PromptTemplate and assigned the question field with the input query. Next, we have to retrieve context samples from our Vector Database.

Here’s how the retrieval logic works:

# 1. We instantiate a VectorRetriever that communicates with Vector DB.
retriever = VectorRetriever(query=query)
# 2. Initial fetch of K entries
hits = retriever.retrieve_top_k(
    k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
)
# 3. Re-rank entries using post-retrieval augmentation techniques
context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)
# 4. Update context
prompt_template_variables["context"] = context
prompt = prompt_template.format(question=query, context=context)

To get a deeper dive into the Re-Ranking techniques mentioned at Step 3 in the code above, make sure to check 📓 Lesson 5

After we’ve retrieved the context, it’s time to pass our prompt to the inference pipeline deployed on Qwak [2] and get the LLM generation response.

To get a deeper dive into how the inference pipeline was built and deployed,
📓 Lesson 9 covers it in great detail.

Next, we have the evaluation block code:

rag_eval_scores = None  # initialized so the dict below is safe when RAG evaluation is skipped
if enable_evaluation is True:
    if enable_rag:
        st_time = time.time_ns()
        rag_eval_scores = evaluate_w_ragas(
            query=query, output=answer, context=context
        )
        en_time = time.time_ns()
        self._timings["evaluation_rag"] = (en_time - st_time) / 1e9
    st_time = time.time_ns()
    llm_eval = evaluate_llm(query=query, output=answer)
    en_time = time.time_ns()
    self._timings["evaluation_llm"] = (en_time - st_time) / 1e9
    evaluation_result = {
        "llm_evaluation": "" if not llm_eval else llm_eval,
        "rag_evaluation": {} if not rag_eval_scores else rag_eval_scores,
    }
else:
    evaluation_result = None

Key insights from this implementation:

  • We’re applying the LLM evaluation stage described in Lesson 8 to evaluate (query, response) pairs.
  • We’re applying the RAG evaluation stage to evaluate (query, response, context) pairs.
  • We use a _timings dictionary to track the execution duration for performance profiling purposes.

The core RAGAs evaluation functionality is handled within the evaluate_w_ragas method; here’s what it looks like:

# Additional imports assumed for this snippet (adjust the paths to your ragas/langchain versions):
from datasets import Dataset
from langchain_openai import ChatOpenAI
from pandas import DataFrame
from ragas import evaluate
from ragas.embeddings import HuggingfaceEmbeddings
from ragas.metrics import (
    answer_correctness,
    answer_similarity,
    context_entity_recall,
    context_recall,
    context_relevancy,
    context_utilization,
)

METRICS = [
    context_utilization,
    context_relevancy,
    context_recall,
    answer_similarity,
    context_entity_recall,
    answer_correctness,
]

def evaluate_w_ragas(query: str, context: list[str], output: str) -> DataFrame:
    """
    Evaluate the RAG (query,context,response) using RAGAS
    """
    data_sample = {
        "question": [query],  # Question as Sequence(str)
        "answer": [output],  # Answer as Sequence(str)
        "contexts": [context],  # Context as Sequence(str)
        "ground_truth": [context],  # Ground Truth as Sequence(str)
    }

    oai_model = ChatOpenAI(
        model=settings.OPENAI_MODEL_ID,
        api_key=settings.OPENAI_API_KEY,
    )
    embd_model = HuggingfaceEmbeddings(model=settings.EMBEDDING_MODEL_ID)
    dataset = Dataset.from_dict(data_sample)
    score = evaluate(
        llm=oai_model,
        embeddings=embd_model,
        dataset=dataset,
        metrics=METRICS,
    )

    return score

What should we note here:

  • We’re preparing the evaluation dataset using the data_sample dictionary.
  • We’re instantiating a connector to the OpenAI GPT model, which will be used as the underlying LLM to perform the evaluation logic within RAGAs. The model tag from settings is gpt-4-1106-preview.
  • We’re instantiating a connector to a HuggingfaceEmbeddings model.
    We’re using the same embedding model we used to encode our samples before storing them in our Qdrant VectorDB instance.
    The model tag from settings is sentence-transformers/all-MiniLM-L6-v2.
  • We’re composing the payload and passing it to the evaluate method.

Once the execution gets to this stage, we might see the following logs section in the console:

screenshot of RAGAs evaluation process console logs
RAGAs evaluation process console logs.

Once the evaluation is completed, the score variable will hold a dict-like result of this format:

score = {
  "context_utilization": float,   # how useful the context is to the generated answer
  "context_relevancy": float,     # how relevant the context is to the given query
  "context_recall": float,        # proportion of relevant context retrieved
  "answer_similarity": float,     # semantic similarity to the reference
  "answer_correctness": float,    # factual correctness of the answer
  "context_entity_recall": float, # recall of relevant entities in the context
}

In the next section, let’s compose the full evaluation chain step by step and log it to Comet LLM [3] for monitoring.

Advanced Prompt-Chain Monitoring

Prompt monitoring is crucial in LLM-based applications for several reasons. It helps ensure the quality and relevance of responses, maintaining accuracy and coherence in user interactions, and at the same time it allows the ML engineers maintaining the project to identify and mitigate bias or hallucinations and fix them early on.

📓 In Lesson 8, we’ve described Prompt Monitoring advantages in more detail.

In this section, we’ll focus solely on how to compose end-to-end Chains and log them to Comet LLM [3]. Let’s dive into the code and describe each component a Chain consists of.

Step 1: Defining the Chain Start
Here we specify the CometML project and workspace where we want to log this chain, and set its inputs to mark the start.

import comet_llm

comet_llm.init(project=[comet-llm-project])
comet_llm.start_chain(
  inputs={'user_query': [our_query]},
  project=[comet-llm-project],
  api_key=[comet-llm-api-key],
  workspace=[comet-llm-ws]
)

Step 2: Defining Chain Stages
We’re using multiple Span (comet_llm.Span) objects to define chain stages. Inside a Span object, we have to define:

  • category — which acts as a group key.
  • name — the name of the current chain step (will appear in CometML UI)
  • inputs — as a dictionary, used to link with previous chain steps (Spans)
  • outputs — as a dictionary, where we define the outputs from this chain step.
with comet_llm.Span(
  category="RAG Evaluation",
  name="ragas_eval",
  inputs={"query": [our_query], "context": [our_context], "answers": [llm_answers]}
) as span:
  span.set_outputs(outputs={"rag-eval-scores": [ragas_scores]})

Step 3: Defining the Chain End
The last step, after starting the chain and appending chain stages, is to mark the chain’s end and log the returned response.


comet_llm.end_chain(outputs={"response": [our-rag-response]})

Now that we’ve understood the logic behind Comet LLM [3] Chain monitoring, let’s see what the actual implementation looks like:

# == START CHAIN ==
comet_llm.init(project=f"{settings.COMET_PROJECT}-monitoring")
comet_llm.start_chain(
    inputs={"user_query": query},
    project=f"{settings.COMET_PROJECT}-monitoring",
    api_key=settings.COMET_API_KEY,
    workspace=settings.COMET_WORKSPACE,
)

# == CHAINING STEPS ==
with comet_llm.Span(
    category="Vector Retrieval",
    name="retrieval_step",
    inputs={"user_query": query},
) as span:
    span.set_outputs(outputs={"retrieved_context": context})

with comet_llm.Span(
    category="LLM Generation",
    name="generation_step",
    inputs={"user_query": query},
) as span:
    span.set_outputs(outputs={"generation": llm_gen})

with comet_llm.Span(
    category="Evaluation",
    name="llm_eval_step",
    inputs={"query": llm_gen, "user_query": query},
    metadata={"model_used": settings.OPENAI_MODEL_ID},
) as span:
    span.set_outputs(outputs={"llm_eval_result": llm_eval_output})

with comet_llm.Span(
    category="Evaluation",
    name="rag_eval_step",
    inputs={
        "user_query": query,
        "retrieved_context": context,
        "llm_gen": llm_gen,
    },
    metadata={
        "model_used": settings.OPENAI_MODEL_ID,
        "embd_model": settings.EMBEDDING_MODEL_ID,
        "eval_framework": "RAGAS",
    },
) as span:
    span.set_outputs(outputs={"rag_eval_scores": rag_eval_scores})

# == END CHAIN ==
comet_llm.end_chain(outputs={"response": llm_gen})

📓 For the full chain monitoring implementation, check the PromptMonitoringManager class.

You might have noticed that Spans also have a metadata field attached; we’re using it to log additional data that is relevant only to the current chain step.

For instance, in the rag_eval_step , we’re adding the evaluation framework and model types used. In CometML UI, we can see the metadata attached.

Chain Step specific Metadata. Image by Author.

Once the evaluation process is completed, and the chain is logged successfully to Comet LLM [3], this is what we’re expecting to see:

Chain logged on CometML. Focus on the LLM Evaluation Stage only.

For a refresher on how we evaluate the LLM model only, make sure to check
📓Lesson 8 where we’ve described it in detail.

And if we want to see the RAG evaluation scores:

Chain logged on CometML. Focus on the RAG Evaluation Stage only.

Conclusion

Here we’re wrapping up Lesson 10 of the LLM Twin free course.

We’ve described the LLM-Twin RAG evaluation workflow using a powerful framework called RAGAs. We’ve explained the metrics used, how to implement the evaluation functionality, and how to compose the evaluation dataset.

Additionally, we’ve showcased and exemplified how to effectively monitor chains with multiple execution steps on Comet LLM [3], how to attach metadata, how to group chain steps, and more.

By completing Lesson 10, you’ve gained a good understanding of how you can build a full RAG evaluation pipeline using RAGAs. You’ve learned the Retrieval & Generation specific metrics you could use and all the details required to log large LLM chains to Comet LLM [3].

In Lesson 11, we’ll start our bonus series on improving the RAG feature pipeline to make the RAG system more scalable and accurate. We will also show you how to make the code cleaner and more concise.

🔗 Check out the code on GitHub [1] and support it with a ⭐️

References

[1] LLM Twin Github Repository, 2024, Decoding ML GitHub Organization

[2] Qwak, 2024, The Qwak.ai Platform landing Page

[3] Comet LLM, The Comet LLM Platform

[4] RAGAs Metrics, The RAGAs Framework Metrics Documentation

[5] RAGAs, The RAGAs Framework Github Repository

[6] RAGAs Paper, 2023, The RAGAs Arxiv Paper

Alexandru Razvant, Decoding ML
