Quickstart notebook - Summarization task

In this notebook, we will look at how you can use Opik to track your LLM calls, chains and agents. We will introduce the concept of tracing and how to automate the evaluation of your LLM workflows.

We will be using a technique called Chain of Density Summarization to summarize arXiv papers. You can learn more about this technique in the paper "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting".

Getting started

We will first install the required dependencies and configure both Opik and OpenAI.

%pip install -U opik openai requests PyPDF2 --quiet

Comet provides a hosted version of the Opik platform: simply create an account and grab your API key.

You can also run the Opik platform locally, see the installation guide for more information.

import opik
import os

# Configure Opik
opik.configure()
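By default, opik.configure() walks you through an interactive prompt to collect your credentials. If you prefer a non-interactive setup (for example in CI), the sketch below is one option; it assumes the OPIK_API_KEY and OPIK_WORKSPACE environment variables and the use_local flag supported by recent SDK versions, and the placeholder values must be replaced with your own:

import os
import opik

# Hosted version: provide credentials via environment variables (placeholder values shown)
os.environ["OPIK_API_KEY"] = "YOUR_API_KEY"
os.environ["OPIK_WORKSPACE"] = "YOUR_WORKSPACE"
opik.configure()

# Local deployment: no API key needed, point the SDK at your local instance instead
# opik.configure(use_local=True)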

Implementing Chain of Density Summarization

The idea behind this approach is to first generate a sparse candidate summary and then iteratively refine it by folding in missing information, without making it longer. We will start by defining two prompts:

  1. Iteration summary prompt: This prompt is used to generate and refine a candidate summary.
  2. Final summary prompt: This prompt is used to generate the final summary from the sparse set of candidate summaries.
import opik

ITERATION_SUMMARY_PROMPT = opik.Prompt(
    name="Iteration Summary Prompt",
    prompt="""
Document: {{document}}
Current summary: {{current_summary}}
Instruction to focus on: {{instruction}}

Generate a concise, entity-dense, and highly technical summary from the provided Document that specifically addresses the given Instruction.

Guidelines:
- Make every word count: if there is a current summary, re-write it to improve flow, density and conciseness.
- Remove uninformative phrases like "the article discusses".
- The summary should become highly dense and concise yet self-contained, e.g., easily understood without the Document.
- Make sure that the summary specifically addresses the given Instruction.
""".rstrip().lstrip(),
)

FINAL_SUMMARY_PROMPT = opik.Prompt(
    name="Final Summary Prompt",
    prompt="""
Given this summary: {{current_summary}}
And this instruction to focus on: {{instruction}}
Create an extremely dense, final summary that captures all key technical information in the most concise form possible, while specifically addressing the given instruction.
""".rstrip().lstrip(),
)

We can now define the summarization chain by combining the two prompts. In order to track the LLM calls, we will use Opik's integration with OpenAI through the track_openai function, and we will add the @opik.track decorator to each function so that we track the full chain and not just the individual LLM calls:

from opik.integrations.openai import track_openai
from openai import OpenAI
import opik

# Use a dedicated quickstart endpoint, replace with your own OpenAI API Key in your own code
openai_client = track_openai(
    OpenAI(
        base_url="https://odbrly0rrk.execute-api.us-east-1.amazonaws.com/Prod/",
        api_key="Opik-Quickstart",
    )
)


@opik.track
def summarize_current_summary(
    document: str,
    instruction: str,
    current_summary: str,
    model: str = "gpt-4o-mini",
):
    prompt = ITERATION_SUMMARY_PROMPT.format(
        document=document, current_summary=current_summary, instruction=instruction
    )

    response = openai_client.chat.completions.create(
        model=model, max_tokens=4096, messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content


@opik.track
def iterative_density_summarization(
    document: str,
    instruction: str,
    density_iterations: int,
    model: str = "gpt-4o-mini",
):
    summary = ""
    for iteration in range(1, density_iterations + 1):
        summary = summarize_current_summary(document, instruction, summary, model)
    return summary


@opik.track
def final_summary(instruction: str, current_summary: str, model: str = "gpt-4o-mini"):
    prompt = FINAL_SUMMARY_PROMPT.format(
        current_summary=current_summary, instruction=instruction
    )

    return (
        openai_client.chat.completions.create(
            model=model, max_tokens=4096, messages=[{"role": "user", "content": prompt}]
        )
        .choices[0]
        .message.content
    )


@opik.track(project_name="Chain of Density Summarization")
def chain_of_density_summarization(
    document: str,
    instruction: str,
    model: str = "gpt-4o-mini",
    density_iterations: int = 2,
):
    summary = iterative_density_summarization(
        document, instruction, density_iterations, model
    )
    final_summary_text = final_summary(instruction, summary, model)

    return final_summary_text

Let’s call the summarization chain with a sample document:

import textwrap

document = """
Artificial intelligence (AI) is transforming industries, revolutionizing healthcare, finance, education, and even creative fields. AI systems
today are capable of performing tasks that previously required human intelligence, such as language processing, visual perception, and
decision-making. In healthcare, AI assists in diagnosing diseases, predicting patient outcomes, and even developing personalized treatment plans.
In finance, it helps in fraud detection, algorithmic trading, and risk management. Education systems leverage AI for personalized learning, adaptive
testing, and educational content generation. Despite these advancements, ethical concerns such as data privacy, bias, and the impact of AI on employment
remain. The future of AI holds immense potential, but also significant challenges.
"""

instruction = "Summarize the main contributions of AI to different industries, and highlight both its potential and associated challenges."

summary = chain_of_density_summarization(document, instruction)

print("\n".join(textwrap.wrap(summary, width=80)))

Thanks to the @opik.track decorator and Opik’s integration with OpenAI, we can now track the entire chain and all the LLM calls in the Opik UI:

Trace UI

Automating the evaluation process

Defining a dataset

Now that we have a working chain, we can automate the evaluation process. We will start by defining a dataset of documents and instructions:

import opik

dataset_items = [
    {
        "pdf_url": "https://arxiv.org/pdf/2301.00234",
        "title": "A Survey on In-context Learning",
        "instruction": "Summarize the key findings on the impact of prompt engineering in in-context learning.",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2301.03728",
        "title": "Scaling Laws for Generative Mixed-Modal Language Models",
        "instruction": "How do scaling laws apply to generative mixed-modal models according to the paper?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2308.10792",
        "title": "Instruction Tuning for Large Language Models: A Survey",
        "instruction": "What are the major challenges in instruction tuning for large language models identified in the paper?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2302.08575",
        "title": "Foundation Models in Natural Language Processing: A Survey",
        "instruction": "Explain the role of foundation models in the current natural language processing landscape.",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2306.13398",
        "title": "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey",
        "instruction": "What are the cutting edge techniques used in multi-modal pre-training models?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2103.07492",
        "title": "Continual Learning in Neural Networks: An Empirical Evaluation",
        "instruction": "What are the main challenges of continual learning for neural networks according to the paper?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2304.00685v2",
        "title": "Vision-Language Models for Vision Tasks: A Survey",
        "instruction": "What are the most widely used vision-language models?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2303.08774",
        "title": "GPT-4 Technical Report",
        "instruction": "What are the main differences between GPT-4 and GPT-3.5?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2406.04744",
        "title": "CRAG -- Comprehensive RAG Benchmark",
        "instruction": "What was the approach to experimenting with different data mixtures?",
    },
]

client = opik.Opik()
DATASET_NAME = "arXiv Papers"
dataset = client.get_or_create_dataset(name=DATASET_NAME)
dataset.insert(dataset_items)

Note: Opik automatically deduplicates dataset items to make it easier to iterate on your dataset.
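Because duplicate items are skipped, it is safe to re-run the insert cell while iterating on the dataset; a minimal sketch:

# Re-running the insert does not create duplicate items in the dataset
dataset.insert(dataset_items)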

Defining the evaluation metrics

Opik includes a library of evaluation metrics that you can use to evaluate your chains. For this particular example, we will be using a custom metric that evaluates the relevance, conciseness and technical accuracy of each summary:

from opik.evaluation.metrics import base_metric, score_result
import json

# We will define the response format so the output has the correct schema. You can also use structured outputs with Pydantic models for this.
json_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "summary_evaluation_schema",
        "schema": {
            "type": "object",
            "properties": {
                "relevance": {
                    "type": "object",
                    "properties": {
                        "score": {
                            "type": "integer",
                            "minimum": 1,
                            "maximum": 5,
                            "description": "Score between 1-5 for how well the summary addresses the instruction",
                        },
                        "explanation": {
                            "type": "string",
                            "description": "Brief explanation of the relevance score",
                        },
                    },
                    "required": ["score", "explanation"],
                },
                "conciseness": {
                    "type": "object",
                    "properties": {
                        "score": {
                            "type": "integer",
                            "minimum": 1,
                            "maximum": 5,
                            "description": "Score between 1-5 for how concise the summary is while retaining key information",
                        },
                        "explanation": {
                            "type": "string",
                            "description": "Brief explanation of the conciseness score",
                        },
                    },
                    "required": ["score", "explanation"],
                },
                "technical_accuracy": {
                    "type": "object",
                    "properties": {
                        "score": {
                            "type": "integer",
                            "minimum": 1,
                            "maximum": 5,
                            "description": "Score between 1-5 for how accurately the summary conveys technical details",
                        },
                        "explanation": {
                            "type": "string",
                            "description": "Brief explanation of the technical accuracy score",
                        },
                    },
                    "required": ["score", "explanation"],
                },
            },
            "required": ["relevance", "conciseness", "technical_accuracy"],
            "additionalProperties": False,
        },
    },
}


# Custom metric: one prompt/template to extract 4 scores/results
class EvaluateSummary(base_metric.BaseMetric):
    # Constructor
    def __init__(self, name: str):
        self.name = name

    def score(
        self, summary: str, instruction: str, model: str = "gpt-4o-mini", **kwargs
    ):
        prompt = f"""
            Summary: {summary}
            Instruction: {instruction}

            Evaluate the summary based on the following criteria:
            1. Relevance (1-5): How well does the summary address the given instruction?
            2. Conciseness (1-5): How concise is the summary while retaining key information?
            3. Technical Accuracy (1-5): How accurately does the summary convey technical details?

            Your response MUST be in the following JSON format:
            {{
                "relevance": {{
                    "score": <int>,
                    "explanation": "<string>"
                }},
                "conciseness": {{
                    "score": <int>,
                    "explanation": "<string>"
                }},
                "technical_accuracy": {{
                    "score": <int>,
                    "explanation": "<string>"
                }}
            }}

            Ensure that the scores are integers between 1 and 5, and that the explanations are concise.
        """

        response = openai_client.chat.completions.create(
            model=model,
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
            response_format=json_schema,
        )

        eval_dict = json.loads(response.choices[0].message.content)

        return [
            score_result.ScoreResult(
                name="summary_relevance",
                value=eval_dict["relevance"]["score"],
                reason=eval_dict["relevance"]["explanation"],
            ),
            score_result.ScoreResult(
                name="summary_conciseness",
                value=eval_dict["conciseness"]["score"],
                reason=eval_dict["conciseness"]["explanation"],
            ),
            score_result.ScoreResult(
                name="summary_technical_accuracy",
                value=eval_dict["technical_accuracy"]["score"],
                reason=eval_dict["technical_accuracy"]["explanation"],
            ),
            score_result.ScoreResult(
                name="summary_average_score",
                value=round(sum(eval_dict[k]["score"] for k in eval_dict) / 3, 2),
                reason="The average of the 3 summary evaluation metrics",
            ),
        ]
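Before using the metric in a full evaluation, you can optionally sanity-check it on a hand-written example. The sketch below uses a made-up summary and instruction purely for illustration:

# Optional sanity check of the custom metric on a toy example
toy_summary = "The paper introduces a transformer-based summarization model that improves entity density without increasing summary length."
toy_instruction = "Summarize the main contribution of the paper."

for result in EvaluateSummary(name="summary-metrics").score(
    summary=toy_summary, instruction=toy_instruction
):
    print(f"{result.name}: {result.value} ({result.reason})")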

Create the task we want to evaluate

We can now create the task we want to evaluate. In this case, the task takes a dataset item as an input and returns a dictionary containing the summary; together with the instruction field from the dataset item, this gives the evaluation metric everything it needs:

import requests
import io
from PyPDF2 import PdfReader
from typing import Dict


# Load and extract text from PDFs
@opik.track
def load_pdf(pdf_url: str) -> str:
    # Download the PDF
    response = requests.get(pdf_url)
    pdf_file = io.BytesIO(response.content)

    # Read the PDF
    pdf_reader = PdfReader(pdf_file)

    # Extract text from all pages
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

    # Truncate the text to 100,000 characters to keep the prompt within the model's context window
    text = text[:100000]
    return text


def evaluation_task(x: Dict):
    text = load_pdf(x["pdf_url"])
    instruction = x["instruction"]
    # MODEL and DENSITY_ITERATIONS are defined in the next cell, before evaluate() is called
    model = MODEL
    density_iterations = DENSITY_ITERATIONS

    result = chain_of_density_summarization(
        document=text,
        instruction=instruction,
        model=model,
        density_iterations=density_iterations,
    )

    return {"summary": result}

Run the automated evaluation

We can now use the evaluate method to evaluate the summaries in our dataset:

from opik.evaluation import evaluate

os.environ["OPIK_PROJECT_NAME"] = "summary-evaluation-prompts"

MODEL = "gpt-4o-mini"
DENSITY_ITERATIONS = 2

experiment_config = {
    "iteration_summary_prompt": ITERATION_SUMMARY_PROMPT,
    "final_summary_prompt": FINAL_SUMMARY_PROMPT,
    "model": MODEL,
    "density_iterations": DENSITY_ITERATIONS,
}

res = evaluate(
    dataset=dataset,
    experiment_config=experiment_config,
    task=evaluation_task,
    scoring_metrics=[EvaluateSummary(name="summary-metrics")],
    prompt=ITERATION_SUMMARY_PROMPT,
    project_name="Chain of Density Summarization",
)

The experiment results are now available in the Opik UI:

Trace UI

Comparing prompt templates

We will update the iteration summary prompt and evaluate its impact on the evaluation metrics.

import opik

ITERATION_SUMMARY_PROMPT = opik.Prompt(
    name="Iteration Summary Prompt",
    prompt="""Document: {{document}}
Current summary: {{current_summary}}
Instruction to focus on: {{instruction}}

Generate a concise, entity-dense, and highly technical summary from the provided Document that specifically addresses the given Instruction.

Guidelines:
1. **Maximize Clarity and Density**: Revise the current summary to enhance flow, density, and conciseness.
2. **Eliminate Redundant Language**: Avoid uninformative phrases such as "the article discusses."
3. **Ensure Self-Containment**: The summary should be dense and concise, easily understandable without referring back to the document.
4. **Align with Instruction**: Make sure the summary specifically addresses the given instruction.

""".rstrip().lstrip(),
)
from opik.evaluation import evaluate

os.environ["OPIK_PROJECT_NAME"] = "summary-evaluation-prompts"

MODEL = "gpt-4o-mini"
DENSITY_ITERATIONS = 2

experiment_config = {
    "iteration_summary_prompt": ITERATION_SUMMARY_PROMPT,
    "final_summary_prompt": FINAL_SUMMARY_PROMPT,
    "model": MODEL,
    "density_iterations": DENSITY_ITERATIONS,
}

res = evaluate(
    dataset=dataset,
    experiment_config=experiment_config,
    task=evaluation_task,
    scoring_metrics=[EvaluateSummary(name="summary-metrics")],
    prompt=ITERATION_SUMMARY_PROMPT,
    project_name="Chain of Density Summarization",
)

You can now compare the results between the two experiments in the Opik UI:

Trace UI
