Quickstart notebook - Summarization task

In this notebook, we will look at how you can use Opik to track your LLM calls, chains and agents. We will introduce the concept of tracing and how to automate the evaluation of your LLM workflows.

We will be using a technique called Chain of Density Summarization to summarize arXiv papers. You can learn more about this technique in the paper "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting".

Getting started

We will first install the required dependencies and configure both Opik and OpenAI.

%pip install -U opik openai requests PyPDF2 --quiet

Comet provides a hosted version of the Opik platform: simply create an account and grab your API key.

You can also run the Opik platform locally, see the installation guide for more information.

import opik
import os

# Configure Opik
opik.configure()
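By default, opik.configure() walks you through an interactive prompt to collect your credentials. If you prefer a non-interactive setup (for example in CI), the sketch below is one option; it assumes the OPIK_API_KEY and OPIK_WORKSPACE environment variables and the use_local flag supported by recent SDK versions, and the placeholder values must be replaced with your own:

import os
import opik

# Hosted version: provide credentials via environment variables (placeholder values shown)
os.environ["OPIK_API_KEY"] = "YOUR_API_KEY"
os.environ["OPIK_WORKSPACE"] = "YOUR_WORKSPACE"
opik.configure()

# Local deployment: no API key needed, point the SDK at your local instance instead
# opik.configure(use_local=True)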

Implementing Chain of Density Summarization

The idea behind this approach is to first generate a sparse candidate summary and then iteratively refine it by folding in missing information, without making it longer. We will start by defining two prompts:

  1. Iteration summary prompt: This prompt is used to generate and refine a candidate summary.
  2. Final summary prompt: This prompt is used to generate the final summary from the sparse set of candidate summaries.
import opik

ITERATION_SUMMARY_PROMPT = opik.Prompt(
    name="Iteration Summary Prompt",
    prompt="""
Document: {{document}}
Current summary: {{current_summary}}
Instruction to focus on: {{instruction}}

Generate a concise, entity-dense, and highly technical summary from the provided Document that specifically addresses the given Instruction.

Guidelines:
- Make every word count: if there is a current summary, re-write it to improve flow, density and conciseness.
- Remove uninformative phrases like "the article discusses".
- The summary should become highly dense and concise yet self-contained, e.g., easily understood without the Document.
- Make sure that the summary specifically addresses the given Instruction.
""".rstrip().lstrip(),
)

FINAL_SUMMARY_PROMPT = opik.Prompt(
    name="Final Summary Prompt",
    prompt="""
Given this summary: {{current_summary}}
And this instruction to focus on: {{instruction}}
Create an extremely dense, final summary that captures all key technical information in the most concise form possible, while specifically addressing the given instruction.
""".rstrip().lstrip(),
)

We can now define the summarization chain by combining the two prompts. In order to track the LLM calls, we will use Opik's integration with OpenAI through the track_openai function, and we will add the @opik.track decorator to each function so that we track the full chain and not just the individual LLM calls:

from opik.integrations.openai import track_openai
from openai import OpenAI
import opik

# Use a dedicated quickstart endpoint, replace with your own OpenAI API Key in your own code
openai_client = track_openai(
    OpenAI(
        base_url="https://odbrly0rrk.execute-api.us-east-1.amazonaws.com/Prod/",
        api_key="Opik-Quickstart",
    )
)


@opik.track
def summarize_current_summary(
    document: str,
    instruction: str,
    current_summary: str,
    model: str = "gpt-4o-mini",
):
    prompt = ITERATION_SUMMARY_PROMPT.format(
        document=document, current_summary=current_summary, instruction=instruction
    )

    response = openai_client.chat.completions.create(
        model=model, max_tokens=4096, messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content


@opik.track
def iterative_density_summarization(
    document: str,
    instruction: str,
    density_iterations: int,
    model: str = "gpt-4o-mini",
):
    summary = ""
    for iteration in range(1, density_iterations + 1):
        summary = summarize_current_summary(document, instruction, summary, model)
    return summary


@opik.track
def final_summary(instruction: str, current_summary: str, model: str = "gpt-4o-mini"):
    prompt = FINAL_SUMMARY_PROMPT.format(
        current_summary=current_summary, instruction=instruction
    )

    return (
        openai_client.chat.completions.create(
            model=model, max_tokens=4096, messages=[{"role": "user", "content": prompt}]
        )
        .choices[0]
        .message.content
    )


@opik.track(project_name="Chain of Density Summarization")
def chain_of_density_summarization(
    document: str,
    instruction: str,
    model: str = "gpt-4o-mini",
    density_iterations: int = 2,
):
    summary = iterative_density_summarization(
        document, instruction, density_iterations, model
    )
    final_summary_text = final_summary(instruction, summary, model)

    return final_summary_text

Let’s call the summarization chain with a sample document:

import textwrap

document = """
Artificial intelligence (AI) is transforming industries, revolutionizing healthcare, finance, education, and even creative fields. AI systems
today are capable of performing tasks that previously required human intelligence, such as language processing, visual perception, and
decision-making. In healthcare, AI assists in diagnosing diseases, predicting patient outcomes, and even developing personalized treatment plans.
In finance, it helps in fraud detection, algorithmic trading, and risk management. Education systems leverage AI for personalized learning, adaptive
testing, and educational content generation. Despite these advancements, ethical concerns such as data privacy, bias, and the impact of AI on employment
remain. The future of AI holds immense potential, but also significant challenges.
"""

instruction = "Summarize the main contributions of AI to different industries, and highlight both its potential and associated challenges."

summary = chain_of_density_summarization(document, instruction)

print("\n".join(textwrap.wrap(summary, width=80)))

Thanks to the @opik.track decorator and Opik’s integration with OpenAI, we can now track the entire chain and all the LLM calls in the Opik UI:

Trace UI

Automating the evaluation process

Defining a dataset

Now that we have a working chain, we can automate the evaluation process. We will start by defining a dataset of documents and instructions:

import opik

dataset_items = [
    {
        "pdf_url": "https://arxiv.org/pdf/2301.00234",
        "title": "A Survey on In-context Learning",
        "instruction": "Summarize the key findings on the impact of prompt engineering in in-context learning.",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2301.03728",
        "title": "Scaling Laws for Generative Mixed-Modal Language Models",
        "instruction": "How do scaling laws apply to generative mixed-modal models according to the paper?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2308.10792",
        "title": "Instruction Tuning for Large Language Models: A Survey",
        "instruction": "What are the major challenges in instruction tuning for large language models identified in the paper?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2302.08575",
        "title": "Foundation Models in Natural Language Processing: A Survey",
        "instruction": "Explain the role of foundation models in the current natural language processing landscape.",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2306.13398",
        "title": "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey",
        "instruction": "What are the cutting edge techniques used in multi-modal pre-training models?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2103.07492",
        "title": "Continual Learning in Neural Networks: An Empirical Evaluation",
        "instruction": "What are the main challenges of continual learning for neural networks according to the paper?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2304.00685v2",
        "title": "Vision-Language Models for Vision Tasks: A Survey",
        "instruction": "What are the most widely used vision-language models?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2303.08774",
        "title": "GPT-4 Technical Report",
        "instruction": "What are the main differences between GPT-4 and GPT-3.5?",
    },
    {
        "pdf_url": "https://arxiv.org/pdf/2406.04744",
        "title": "CRAG -- Comprehensive RAG Benchmark",
        "instruction": "What was the approach to experimenting with different data mixtures?",
    },
]

client = opik.Opik()
DATASET_NAME = "arXiv Papers"
dataset = client.get_or_create_dataset(name=DATASET_NAME)
dataset.insert(dataset_items)

Note: Opik automatically deduplicates dataset items to make it easier to iterate on your dataset.
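Because duplicate items are skipped, it is safe to re-run the insert cell while iterating on the dataset; a minimal sketch:

# Re-running the insert does not create duplicate items in the dataset
dataset.insert(dataset_items)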

Defining the evaluation metrics

Opik includes a library of evaluation metrics that you can use to evaluate your chains. For this particular example, we will be using a custom metric that evaluates the relevance, conciseness and technical accuracy of each summary:

from opik.evaluation.metrics import base_metric, score_result
import json

# We will define the response format so the output has the correct schema. You can also use structured outputs with Pydantic models for this.
json_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "summary_evaluation_schema",
        "schema": {
            "type": "object",
            "properties": {
                "relevance": {
                    "type": "object",
                    "properties": {
                        "score": {
                            "type": "integer",
                            "minimum": 1,
                            "maximum": 5,
                            "description": "Score between 1-5 for how well the summary addresses the instruction",
                        },
                        "explanation": {
                            "type": "string",
                            "description": "Brief explanation of the relevance score",
                        },
                    },
                    "required": ["score", "explanation"],
                },
                "conciseness": {
                    "type": "object",
                    "properties": {
                        "score": {
                            "type": "integer",
                            "minimum": 1,
                            "maximum": 5,
                            "description": "Score between 1-5 for how concise the summary is while retaining key information",
                        },
                        "explanation": {
                            "type": "string",
                            "description": "Brief explanation of the conciseness score",
                        },
                    },
                    "required": ["score", "explanation"],
                },
                "technical_accuracy": {
                    "type": "object",
                    "properties": {
                        "score": {
                            "type": "integer",
                            "minimum": 1,
                            "maximum": 5,
                            "description": "Score between 1-5 for how accurately the summary conveys technical details",
                        },
                        "explanation": {
                            "type": "string",
                            "description": "Brief explanation of the technical accuracy score",
                        },
                    },
                    "required": ["score", "explanation"],
                },
            },
            "required": ["relevance", "conciseness", "technical_accuracy"],
            "additionalProperties": False,
        },
    },
}


# Custom metric: one prompt/template to extract 4 scores/results
class EvaluateSummary(base_metric.BaseMetric):
    # Constructor
    def __init__(self, name: str):
        self.name = name

    def score(
        self, summary: str, instruction: str, model: str = "gpt-4o-mini", **kwargs
    ):
        prompt = f"""
            Summary: {summary}
            Instruction: {instruction}

            Evaluate the summary based on the following criteria:
            1. Relevance (1-5): How well does the summary address the given instruction?
            2. Conciseness (1-5): How concise is the summary while retaining key information?
            3. Technical Accuracy (1-5): How accurately does the summary convey technical details?

            Your response MUST be in the following JSON format:
            {{
                "relevance": {{
                    "score": <int>,
                    "explanation": "<string>"
                }},
                "conciseness": {{
                    "score": <int>,
                    "explanation": "<string>"
                }},
                "technical_accuracy": {{
                    "score": <int>,
                    "explanation": "<string>"
                }}
            }}

            Ensure that the scores are integers between 1 and 5, and that the explanations are concise.
        """

        response = openai_client.chat.completions.create(
            model=model,
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
            response_format=json_schema,
        )

        eval_dict = json.loads(response.choices[0].message.content)

        return [
            score_result.ScoreResult(
                name="summary_relevance",
                value=eval_dict["relevance"]["score"],
                reason=eval_dict["relevance"]["explanation"],
            ),
            score_result.ScoreResult(
                name="summary_conciseness",
                value=eval_dict["conciseness"]["score"],
                reason=eval_dict["conciseness"]["explanation"],
            ),
            score_result.ScoreResult(
                name="summary_technical_accuracy",
                value=eval_dict["technical_accuracy"]["score"],
                reason=eval_dict["technical_accuracy"]["explanation"],
            ),
            score_result.ScoreResult(
                name="summary_average_score",
                value=round(sum(eval_dict[k]["score"] for k in eval_dict) / 3, 2),
                reason="The average of the 3 summary evaluation metrics",
            ),
        ]
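Before using the metric in a full evaluation, you can optionally sanity-check it on a hand-written example. The sketch below uses a made-up summary and instruction purely for illustration:

# Optional sanity check of the custom metric on a toy example
toy_summary = "The paper introduces a transformer-based summarization model that improves entity density without increasing summary length."
toy_instruction = "Summarize the main contribution of the paper."

for result in EvaluateSummary(name="summary-metrics").score(
    summary=toy_summary, instruction=toy_instruction
):
    print(f"{result.name}: {result.value} ({result.reason})")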

Create the task we want to evaluate

We can now create the task we want to evaluate. In this case, the task takes a dataset item as an input and returns a dictionary containing the summary; together with the instruction field from the dataset item, this gives the evaluation metric everything it needs:

import requests
import io
from PyPDF2 import PdfReader
from typing import Dict


# Load and extract text from PDFs
@opik.track
def load_pdf(pdf_url: str) -> str:
    # Download the PDF
    response = requests.get(pdf_url)
    pdf_file = io.BytesIO(response.content)

    # Read the PDF
    pdf_reader = PdfReader(pdf_file)

    # Extract text from all pages
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

    # Truncate the text to 100,000 characters to keep the prompt within the model's context window
    text = text[:100000]
    return text


def evaluation_task(x: Dict):
    text = load_pdf(x["pdf_url"])
    instruction = x["instruction"]
    # MODEL and DENSITY_ITERATIONS are defined in the next cell, before evaluate() is called
    model = MODEL
    density_iterations = DENSITY_ITERATIONS

    result = chain_of_density_summarization(
        document=text,
        instruction=instruction,
        model=model,
        density_iterations=density_iterations,
    )

    return {"summary": result}

Run the automated evaluation

We can now use the evaluate method to evaluate the summaries in our dataset:

from opik.evaluation import evaluate

os.environ["OPIK_PROJECT_NAME"] = "summary-evaluation-prompts"

MODEL = "gpt-4o-mini"
DENSITY_ITERATIONS = 2

experiment_config = {
    "iteration_summary_prompt": ITERATION_SUMMARY_PROMPT,
    "final_summary_prompt": FINAL_SUMMARY_PROMPT,
    "model": MODEL,
    "density_iterations": DENSITY_ITERATIONS,
}

res = evaluate(
    dataset=dataset,
    experiment_config=experiment_config,
    task=evaluation_task,
    scoring_metrics=[EvaluateSummary(name="summary-metrics")],
    prompt=ITERATION_SUMMARY_PROMPT,
    project_name="Chain of Density Summarization",
)

The experiment results are now available in the Opik UI:

Trace UI

Comparing prompt templates

We will update the iteration summary prompt and evaluate its impact on the evaluation metrics.

import opik

ITERATION_SUMMARY_PROMPT = opik.Prompt(
    name="Iteration Summary Prompt",
    prompt="""Document: {{document}}
Current summary: {{current_summary}}
Instruction to focus on: {{instruction}}

Generate a concise, entity-dense, and highly technical summary from the provided Document that specifically addresses the given Instruction.

Guidelines:
1. **Maximize Clarity and Density**: Revise the current summary to enhance flow, density, and conciseness.
2. **Eliminate Redundant Language**: Avoid uninformative phrases such as "the article discusses."
3. **Ensure Self-Containment**: The summary should be dense and concise, easily understandable without referring back to the document.
4. **Align with Instruction**: Make sure the summary specifically addresses the given instruction.

""".rstrip().lstrip(),
)
from opik.evaluation import evaluate

os.environ["OPIK_PROJECT_NAME"] = "summary-evaluation-prompts"

MODEL = "gpt-4o-mini"
DENSITY_ITERATIONS = 2

experiment_config = {
    "iteration_summary_prompt": ITERATION_SUMMARY_PROMPT,
    "final_summary_prompt": FINAL_SUMMARY_PROMPT,
    "model": MODEL,
    "density_iterations": DENSITY_ITERATIONS,
}

res = evaluate(
    dataset=dataset,
    experiment_config=experiment_config,
    task=evaluation_task,
    scoring_metrics=[EvaluateSummary(name="summary-metrics")],
    prompt=ITERATION_SUMMARY_PROMPT,
    project_name="Chain of Density Summarization",
)

You can now compare the results between the two experiments in the Opik UI:

Trace UI
