Using Ragas to evaluate RAG pipelines

In this notebook, we will showcase how to use Opik with Ragas for monitoring and evaluation of RAG (Retrieval-Augmented Generation) pipelines.

There are two main ways to use Opik with Ragas:

  1. Using Ragas metrics to score traces
  2. Using the Ragas evaluate function to score a dataset

Creating an account on Comet.com

Comet provides a hosted version of the Opik platform; simply create an account and grab your API key.

You can also run the Opik platform locally; see the installation guide for more information.

%pip install --quiet --upgrade opik ragas nltk

import opik

opik.configure(use_local=False)

Preparing our environment

First, we will configure the OpenAI API key.

import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Integrating Opik with Ragas

Using Ragas metrics to score traces

Ragas provides a set of metrics that can be used to evaluate the quality of a RAG pipeline, including but not limited to: answer_relevancy, answer_similarity, answer_correctness, context_precision, context_recall, context_entity_recall, summarization_score. You can find a full list of metrics in the Ragas documentation.

These metrics can be computed on the fly and logged to traces or spans in Opik. For this example, we will start by creating a simple RAG pipeline and then scoring it using the answer_relevancy metric.

Create the Ragas metric

In order to use the Ragas metric without using the evaluate function, you need to initialize the metric with a RunConfig object and an LLM provider. For this example, we will use LangChain as the LLM provider with the Opik tracer enabled.

We will first start by initializing the Ragas metric:

# Import the metric
from ragas.metrics import AnswerRelevancy

# Import some additional dependencies
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

answer_relevancy_metric = AnswerRelevancy(llm=llm, embeddings=emb)
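
The other class-based Ragas metrics can be initialized in the same way. As a minimal sketch (assuming you also want ContextPrecision and Faithfulness, which only require an LLM; these extra metrics are not needed for the rest of this notebook):

# Sketch: other Ragas metric classes follow the same initialization pattern
from ragas.metrics import ContextPrecision, Faithfulness

context_precision_metric = ContextPrecision(llm=llm)
faithfulness_metric = Faithfulness(llm=llm)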

Once the metric is initialized, you can use it to score a sample question. Given that the metric scoring is done asynchronously, you need to use the asyncio library to run the scoring function.

# Run this cell first if you are running this in a Jupyter notebook
import nest_asyncio

nest_asyncio.apply()

import asyncio
from ragas.integrations.opik import OpikTracer
from ragas.dataset_schema import SingleTurnSample
import os

os.environ["OPIK_PROJECT_NAME"] = "ragas-integration"


# Define the scoring function
def compute_metric(metric, row):
    row = SingleTurnSample(**row)

    opik_tracer = OpikTracer(tags=["ragas"])

    async def get_score(opik_tracer, metric, row):
        score = await metric.single_turn_ascore(row, callbacks=[opik_tracer])
        return score

    # Run the async function using the current event loop
    loop = asyncio.get_event_loop()

    result = loop.run_until_complete(get_score(opik_tracer, metric, row))
    return result


# Score a simple example
row = {
    "user_input": "What is the capital of France?",
    "response": "Paris",
    "retrieved_contexts": ["Paris is the capital of France.", "Paris is in France."],
}

score = compute_metric(answer_relevancy_metric, row)
print("Answer Relevancy score:", score)
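
The nest_asyncio and event-loop handling above is only needed because Jupyter already runs an event loop. In a plain Python script you could score the sample more directly with asyncio.run; a minimal sketch reusing the answer_relevancy_metric and row defined above:

# Sketch for a regular script (no already-running event loop): asyncio.run is enough
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.integrations.opik import OpikTracer

sample = SingleTurnSample(**row)
tracer = OpikTracer(tags=["ragas"])

script_score = asyncio.run(
    answer_relevancy_metric.single_turn_ascore(sample, callbacks=[tracer])
)
print("Answer Relevancy score:", script_score)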

If you now navigate to Opik, you will be able to see that a new trace has been created in the ragas-integration project (the project name set via OPIK_PROJECT_NAME above).

Score traces

You can score traces by using the update_current_trace function.

The advantage of this approach is that the scoring span is added to the trace allowing for a more fine-grained analysis of the RAG pipeline. It will however run the Ragas metric calculation synchronously and so might not be suitable for production use-cases.

from opik import track, opik_context


@track
def retrieve_contexts(question):
    # Define the retrieval function, in this case we will hard code the contexts
    return ["Paris is the capital of France.", "Paris is in France."]


@track
def answer_question(question, contexts):
    # Define the answer function, in this case we will hard code the answer
    return "Paris"


@track(name="Compute Ragas metric score", capture_input=False)
def compute_rag_score(answer_relevancy_metric, question, answer, contexts):
    # Define the score function
    row = {"user_input": question, "response": answer, "retrieved_contexts": contexts}
    score = compute_metric(answer_relevancy_metric, row)
    return score


@track
def rag_pipeline(question):
    # Define the pipeline
    contexts = retrieve_contexts(question)
    answer = answer_question(question, contexts)

    score = compute_rag_score(answer_relevancy_metric, question, answer, contexts)
    opik_context.update_current_trace(
        feedback_scores=[{"name": "answer_relevancy", "value": round(score, 4)}]
    )

    return answer


rag_pipeline("What is the capital of France?")
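
If you prefer to attach the score to the span that computed it rather than (or in addition to) the whole trace, the Opik SDK also exposes opik_context.update_current_span. A minimal sketch, assuming it accepts the same feedback_scores argument as update_current_trace:

# Hedged sketch: log the feedback score on the current span instead of the whole trace
@track
def answer_and_score(question, contexts):
    answer = answer_question(question, contexts)
    row = {"user_input": question, "response": answer, "retrieved_contexts": contexts}
    score = compute_metric(answer_relevancy_metric, row)
    opik_context.update_current_span(
        feedback_scores=[{"name": "answer_relevancy", "value": round(score, 4)}]
    )
    return answer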

Evaluating datasets using the Opik evaluate function

You can use Ragas metrics with the Opik evaluate function. This will compute the metrics on all the rows of the dataset and return a summary of the results.

Since Ragas metrics only expose an async scoring API, we need to create a small wrapper to be able to use them with the Opik evaluate function.

from datasets import load_dataset
from opik.evaluation.metrics import base_metric, score_result
import opik


opik_client = opik.Opik()

# Create a small dataset
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

# Reformat the dataset to match the schema expected by the Ragas evaluate function
hf_dataset = fiqa_eval["baseline"].select(range(3))
dataset_items = hf_dataset.map(
    lambda x: {
        "user_input": x["question"],
        "reference": x["ground_truths"][0],
        "retrieved_contexts": x["contexts"],
    }
)
dataset = opik_client.get_or_create_dataset("ragas-demo-dataset")
dataset.insert(dataset_items)


# Create an evaluation task
def evaluation_task(x):
    return {
        "user_input": x["question"],
        "response": x["answer"],
        "retrieved_contexts": x["contexts"],
    }


# Create scoring metric wrapper
class AnswerRelevancyWrapper(base_metric.BaseMetric):
    def __init__(self, metric):
        self.name = "answer_relevancy_metric"
        self.metric = metric

    async def get_score(self, row):
        row = SingleTurnSample(**row)
        score = await self.metric.single_turn_ascore(row)
        return score

    def score(self, user_input, response, **ignored_kwargs):
        # Build the sample from the task output
        row = {"user_input": user_input, "response": response}

        # Run the async function using the current event loop
        loop = asyncio.get_event_loop()

        result = loop.run_until_complete(self.get_score(row))

        return score_result.ScoreResult(value=result, name=self.name)


scoring_metric = AnswerRelevancyWrapper(answer_relevancy_metric)
opik.evaluation.evaluate(
    dataset,
    evaluation_task,
    scoring_metrics=[scoring_metric],
    task_threads=1,
)

Evaluating datasets using the Ragas evaluate function

If you are looking to evaluate a dataset, you can use the Ragas evaluate function. When using this function, the Ragas library will compute the metrics on all the rows of the dataset and return a summary of the results.

You can use the OpikTracer callback to log the results of the evaluation to the Opik platform:

from datasets import load_dataset
from ragas.metrics import context_precision, answer_relevancy, faithfulness
from ragas import evaluate

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

# Reformat the dataset to match the schema expected by the Ragas evaluate function
dataset = fiqa_eval["baseline"].select(range(3))

dataset = dataset.map(
    lambda x: {
        "user_input": x["question"],
        "reference": x["ground_truths"][0],
        "retrieved_contexts": x["contexts"],
    }
)

opik_tracer_eval = OpikTracer(tags=["ragas_eval"], metadata={"evaluation_run": True})

result = evaluate(
    dataset,
    metrics=[context_precision, faithfulness, answer_relevancy],
    callbacks=[opik_tracer_eval],
)

print(result)
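
Besides the printed summary, the result object also exposes per-sample scores. A short sketch using Ragas' to_pandas helper to inspect them as a DataFrame, alongside the traces logged to Opik:

# Per-sample scores as a pandas DataFrame
results_df = result.to_pandas()
print(results_df.head())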