The Opik SDK provides a simple way to integrate with Ragas, a framework for evaluating RAG systems.

There are two main ways to use Ragas with Opik:

  1. Using Ragas to score traces or spans.
  2. Using Ragas to evaluate a RAG pipeline.

You can check out the Colab Notebook if you’d like to jump straight to the code:


Getting started

You will first need to install the opik and ragas packages:

```bash
pip install opik ragas
```

In addition, you can configure Opik using the opik configure command, which will prompt you for your local server address or, if you are using the Cloud platform, your API key:

```bash
opik configure
```
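If you prefer a non-interactive setup (for example in CI), the same configuration can be supplied through environment variables instead. A minimal sketch, assuming the standard Opik SDK variable names `OPIK_API_KEY` and `OPIK_WORKSPACE`; the values below are placeholders:

```python
import os

# Non-interactive alternative to `opik configure`
# (values are placeholders; only needed for the Cloud platform)
os.environ["OPIK_API_KEY"] = "your-api-key"
os.environ["OPIK_WORKSPACE"] = "your-workspace"
```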

Using Ragas to score traces or spans

Ragas provides a set of metrics that can be used to evaluate the quality of a RAG pipeline; the full list of supported metrics can be found in the Ragas documentation.

In addition to tracking these feedback scores in Opik, you can use the OpikTracer callback to capture the score calculation itself as a trace.

Due to the asynchronous nature of the score calculation, we will need to define a coroutine to compute the score:

```python
import asyncio

# Import the metric
from ragas.metrics import AnswerRelevancy

# Import some additional dependencies
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.integrations.opik import OpikTracer
from ragas.llms import LangchainLLMWrapper


# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
answer_relevancy_metric = AnswerRelevancy(llm=llm, embeddings=emb)


# Define the scoring function
def compute_metric(metric, row):
    row = SingleTurnSample(**row)

    opik_tracer = OpikTracer()

    async def get_score(opik_tracer, metric, row):
        score = await metric.single_turn_ascore(row, callbacks=[opik_tracer])
        return score

    # Run the async function using the current event loop
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(get_score(opik_tracer, metric, row))
    return result
```
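The sync-over-async pattern used in compute_metric can be seen in isolation with a toy coroutine standing in for metric.single_turn_ascore, with no Ragas or OpenAI setup needed. This sketch creates a fresh event loop rather than calling asyncio.get_event_loop(), which is deprecated outside a running loop:

```python
import asyncio


async def score_async(a, b):
    # Toy coroutine standing in for `metric.single_turn_ascore`
    return a + b


def score_sync(a, b):
    # Run the coroutine to completion from synchronous code,
    # mirroring what `compute_metric` does
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(score_async(a, b))
    finally:
        loop.close()


print(score_sync(2, 3))  # 5
```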

Once the compute_metric function is defined, you can use it to score a trace or span:

```python
from opik import track
from opik.opik_context import update_current_trace


@track
def retrieve_contexts(question):
    # Define the retrieval function, in this case we will hard code the contexts
    return ["Paris is the capital of France.", "Paris is in France."]


@track
def answer_question(question, contexts):
    # Define the answer function, in this case we will hard code the answer
    return "Paris"


@track(name="Compute Ragas metric score", capture_input=False)
def compute_rag_score(answer_relevancy_metric, question, answer, contexts):
    # Define the score function
    row = {"user_input": question, "response": answer, "retrieved_contexts": contexts}
    score = compute_metric(answer_relevancy_metric, row)
    return score


@track
def rag_pipeline(question):
    # Define the pipeline
    contexts = retrieve_contexts(question)
    answer = answer_question(question, contexts)

    score = compute_rag_score(answer_relevancy_metric, question, answer, contexts)
    update_current_trace(
        feedback_scores=[{"name": "answer_relevancy", "value": round(score, 4)}]
    )

    return answer


print(rag_pipeline("What is the capital of France?"))
```

In the Opik UI, you will be able to see the full trace, including the score calculation.

Using Ragas metrics to evaluate a RAG pipeline

In order to use a Ragas metric within the Opik evaluation framework, we will need to wrap it in a custom scoring metric. In the example below we will:

  1. Define the Ragas metric
  2. Create a scoring metric wrapper
  3. Use the scoring metric wrapper within the Opik evaluation framework

1. Define the Ragas metric

We will start by defining the Ragas metric; in this example we will use AnswerRelevancy:

```python
from ragas.metrics import AnswerRelevancy

# Import some additional dependencies
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

ragas_answer_relevancy = AnswerRelevancy(llm=llm, embeddings=emb)
```

2. Create a scoring metric wrapper

Once we have this metric, we will need to create a wrapper so that it can be used with the Opik evaluate function. As Ragas is an async framework, we will need to use asyncio to run the score calculation:

```python
import asyncio

# Create scoring metric wrapper
from opik.evaluation.metrics import base_metric, score_result
from ragas.dataset_schema import SingleTurnSample


class AnswerRelevancyWrapper(base_metric.BaseMetric):
    def __init__(self, metric):
        self.name = "answer_relevancy_metric"
        self.metric = metric

    async def get_score(self, row):
        row = SingleTurnSample(**row)
        score = await self.metric.single_turn_ascore(row)
        return score

    def score(self, user_input, response, **ignored_kwargs):
        # Build the row expected by the Ragas metric
        row = {"user_input": user_input, "response": response}

        # Run the async function using the current event loop
        loop = asyncio.get_event_loop()
        result = loop.run_until_complete(self.get_score(row))

        return score_result.ScoreResult(value=result, name=self.name)


# Create the answer relevancy scoring metric
answer_relevancy = AnswerRelevancyWrapper(ragas_answer_relevancy)
```

If you are running within a Jupyter notebook, you will need to add the following lines to the top of your notebook:

```python
import nest_asyncio

nest_asyncio.apply()
```

3. Use the scoring metric wrapper within the Opik evaluation framework

You can now use the scoring metric wrapper within the Opik evaluation framework:

```python
from opik.evaluation import evaluate

# `dataset` is an Opik dataset and `evaluation_task` is the task function
# that maps each dataset item to the inputs the scoring metric expects
evaluation_results = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[answer_relevancy],
    nb_samples=10,
)
```
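The evaluate call assumes that a dataset and a task function already exist. The task function maps one dataset item to the keyword arguments the scoring metric expects (user_input and response for the wrapper defined earlier). A minimal sketch, with a hard-coded answer standing in for a real RAG call and an assumed "user_input" key in each dataset item:

```python
def evaluation_task(dataset_item: dict) -> dict:
    # `dataset_item` is one row of the Opik dataset; the "user_input"
    # key is assumed to have been set when the dataset was created.
    question = dataset_item["user_input"]

    # Stand-in for a real RAG call such as `rag_pipeline(question)`
    answer = "Paris"

    return {"user_input": question, "response": answer}


print(evaluation_task({"user_input": "What is the capital of France?"}))
```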