## Prerequisites
Before you begin, you'll need to choose how you want to use Opik:
* **Opik Cloud**: Create a free account at [comet.com/opik](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=quickstart\&utm_campaign=opik)
* **Self-hosting**: Follow the [self-hosting guide](/self-host/overview) to deploy Opik locally or on Kubernetes
## Logging your first LLM calls
Opik integrates easily with your existing LLM application and ships with integrations for the most popular LLM providers and frameworks. A minimal tracing example using the Python SDK is shown below.
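A minimal sketch, assuming the Python SDK is installed (`pip install opik`); the model call is a placeholder you would swap for your own LLM client or one of the integrations:

```python
import opik
from opik import track

opik.configure()  # point the SDK at Opik Cloud or your self-hosted instance


@track  # logs inputs, outputs, and timing for this call as a trace in Opik
def generate_reply(user_message: str) -> str:
    # Placeholder for a real LLM call (OpenAI, LangChain, LiteLLM, ...)
    return f"You said: {user_message}"


generate_reply("Hello, Opik!")
```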
### Using Comet Debugger Mode (UI/Browser)
**Comet Debugger Mode** is a hidden diagnostic feature in the **Opik web application** that displays real-time technical information to help you troubleshoot issues. This mode is particularly useful when investigating connectivity problems, reporting bugs, or verifying your deployment version.
**To toggle Comet Debugger Mode:**
Press `Command + Shift + .` on macOS or `Ctrl + Shift + .` on Windows/Linux
**What it displays:**
* **Network Status**: Real-time connectivity indicator with RTT (Round Trip Time) showing latency to the Opik backend server in seconds
* **Opik Version**: The current version of Opik you're running (click to copy to clipboard)
This information is helpful when:
* Reporting issues to the Opik team (include the version number and RTT)
* Verifying your Opik version matches expected deployment
* Diagnosing connectivity problems between UI and backend (check RTT for latency issues)
* Troubleshooting UI-related issues or unexpected behavior
* Confirming successful updates or deployments
* Monitoring network performance and latency to the backend server
**How it works:**
The keyboard shortcut toggles the debug information overlay on and off. When enabled, a small
status bar appears in the UI showing the network connectivity status and version information.
The mode persists across browser sessions (stored in local storage), so you only need to enable
it once until you toggle it off again.
Our new dashboards engine lets you build fully customizable views to track everything from token usage and cost to latency and quality across projects and experiments.
**Where to find them?**
Dashboards are available in three places inside Opik:
* **Dashboards page** - create and manage all dashboards from the sidebar
* **Project page** - view project-specific metrics under the Dashboards tab
* **Experiment comparison page** - visualize and compare experiment results
**Built-in templates to get started fast**
We ship dashboards with zero-setup pre-built templates, including Performance Overview, Experiment Insights and Project Operational Metrics.
Templates are fully editable and can be saved as new dashboards once customized.
**Flexible widgets**
Dashboards support multiple widget types:
* **Project Metrics** (time-series and bar charts for project data)
* **Project Statistics** (KPI number cards)
* **Experiment Metrics** (line, bar, radar charts for experiment data)
* **Markdown** (notes, documentation, context)
Widgets support filtering, grouping, resizing, drag-and-drop layouts, and global date range controls.
[Documentation](https://www.comet.com/docs/opik/production/dashboards)
## Improved Evaluation Capabilities
**Span-Level Metrics**
Span-level metrics are officially live in Opik, supporting both LLM-as-a-Judge and code-based metrics!
Teams can now easily evaluate the quality of specific steps inside their agent flows with full precision.
Instead of assessing only the final output or top-level trace, you can attach metrics directly to individual call spans or segments of an agent's trajectory.
This unlocks dramatically finer-grained visibility and control. For example:
* Score critical decision points inside task-oriented or tool-using agents
* Measure the performance of sub-tasks independently to pinpoint bottlenecks or regressions
* Compare step-by-step agent behavior across runs, experiments, or versions
**New: support for accessing the full tree, subtree, or leaf nodes in Online Scoring**
This update enhances the online scoring engine to support referencing entire root objects (input, output, metadata) in LLM-as-Judge and code-based evaluators, not just nested fields within them.
Online Scoring previously only exposed leaf-level values from an LLM's structured output. With this update, Opik now supports rendering any subtree: from individual nodes to entire nested structures.
## Tags Support & Metadata Filtering for Prompt Version Management
You can now tag individual prompt versions (not just the prompt!).
This provides a clean, intuitive way to mark best-performing versions, manage lifecycles, and integrate version selection into agent deployments.
## More Multimodal Support: Now Audio!
You can now pass audio as part of your prompts, both in the Playground and in online evaluations, for advanced multimodal scenarios.
## More Insights!
**Thread-level insights**
The threads table now includes thread-level metrics and statistics, providing aggregated insights about full multi-turn agentic interactions:
* Duration percentiles (p50, p90, p99) and averages
* Token usage statistics (total, prompt, completion tokens)
* Cost metrics and aggregations
* Also added filtering support by project, time range, and custom filters
**Experiment insights**
Added percentile aggregation methods (p50, p90, p99) for all numerical metrics in the experiment items table headers, extending the existing duration aggregations to cost, feedback scores, and total tokens.
## Integrations
**Support for GPT-5.2 in Playground and Online Scoring**
Added full support for GPT-5.2 models in both the playground and online scoring features for OpenAI and OpenRouter providers.
**Harbor Integration**
Added a comprehensive Opik integration for Harbor, a benchmark evaluation framework for autonomous LLM agents. The integration enables observability for agent benchmark evaluations (SWE-bench, LiveCodeBench, Terminal-Bench, etc.).
[Harbor Integration Documentation](https://www.comet.com/docs/opik/integrations/harbor)
***
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.9.40...1.9.56)
*Releases*: `1.9.41`, `1.9.42`, `1.9.43`, `1.9.44`, `1.9.45`, `1.9.46`, `1.9.47`, `1.9.48`, `1.9.49`, `1.9.50`, `1.9.51`, `1.9.52`, `1.9.53`, `1.9.54`, `1.9.55`, `1.9.56`
# December 9, 2025
Here are the most relevant improvements we've made since the last release:
## Dataset Improvements
We've enhanced dataset functionality with several key improvements:
* **Edit Dataset Items** - You can now edit dataset items directly from the UI, making it easier to update and refine your evaluation data.
* **Remove Dataset Upload Limit for Self-Hosted** - Self-hosted deployments no longer have dataset upload limits, giving you more flexibility for large-scale evaluations.
* **Dataset Item Tagging Support** - Added comprehensive tagging support for dataset items, enabling better organization and filtering of your evaluation data.
* **Dataset Filtering Capabilities by Any Column** - Filter datasets by any column in both the playground and dataset view, giving you flexible ways to find and work with specific data subsets.
* **Ability to Rename Datasets** - Rename datasets directly from the UI, making it easier to organize and manage your evaluation datasets.
## Experiment Updates
We've made significant improvements to experiment management and analysis:
* **Experiment-Level Metrics** - Compute experiment-level metrics (as opposed to experiment-item-level metrics) for better insights into your evaluation results. Read more in the [experiment-level metrics documentation](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm#computing-experiment-level-metrics).
* **Rename Experiments & Metadata** - Update experiment names and metadata config directly from the dashboard, giving you more control over experiment organization.
* **Token & Cost Columns** - Token usage and cost are now surfaced in the experiment items table for easy scanning and cost visibility.
## Playground Improvements
We've made the Playground more powerful and easier to use for non-technical users:
* **Easy Navigation from Playground to Dataset and Metrics** - Quick navigation links from the playground to related datasets and metrics, streamlining your workflow.
* **Advanced filtering for Playground Datasets** - Filter playground datasets by tags and any other columns, making it easier to find and work with specific dataset items.
* **Pagination for the Playground** - Added pagination support to handle large datasets more efficiently in the playground.
* **Added Experiment Progress Bar in the Playground** - Visual progress indicators for running experiments, giving you real-time feedback on experiment status.
* **Added Model-Specific Throttling and Concurrency Configs in the Playground** - Configure throttling and concurrency settings per model in the playground, giving you fine-grained control over resource usage.
## Enhanced Alerts
We've expanded alert capabilities with threshold support:
* **Added Threshold Support for Trace and Thread Feedback Scores** - Configure thresholds for feedback scores on traces and threads, enabling more precise alerting based on quality metrics.
* **Added Threshold to Trace Error Alerts** - Set thresholds for trace error alerts to get notified only when error rates exceed your configured limits.
* **Trigger Experiment Created Alert from the Playground** - Receive alerts when experiments are created directly from the playground.
## Opik Optimizer Updates
Significant enhancements to the Opik Optimizer:
* **Cost and Latency Optimization Support** - Added support for optimizing both cost and latency metrics simultaneously. Read more in the [optimization metrics documentation](/docs/opik/agent_optimization/optimization/define_metrics#include-cost-and-duration-metrics).
* **Training and Validation Dataset Support** - Introduced support for training and validation dataset splits, enabling better optimization workflows. Learn more in the [dataset documentation](/docs/opik/agent_optimization/optimization/define_datasets#trainvalidation-splits).
* **Example Scripts for Microsoft Agents and CrewAI** - New example scripts demonstrating how to use Opik Optimizer with popular LLM frameworks. Check out the [example scripts](https://github.com/comet-ml/opik/tree/main/sdks/opik_optimizer/scripts/llm_frameworks).
* **UI Enhancements and Optimizer Improvements** - Several UI enhancements and various improvements to Few Shot, MetaPrompt, and GEPA optimizers for better usability and performance.
## User Experience Enhancements
Improved usability across the platform:
* **Added `has_tool_spans` Field to Show Tool Calls in Thread View** - Tool calls are now visible in thread views, providing better visibility into agent tool usage.
* **Added Export Capability (JSON/CSV) Directly from Trace, Thread, and Span Detail Views** - Export data directly from detail views in JSON or CSV format, making it easier to analyze and share your observability data.
## New Models!
Expanded model support:
* **Added Support for Gemini 3 Pro, GPT 5.1, OpenRouter Models** - Added support for the latest model versions including Gemini 3 Pro, GPT 5.1, and OpenRouter models, giving you access to the newest AI capabilities.
***
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.9.17...1.9.40)
*Releases*: `1.9.18`, `1.9.19`, `1.9.20`, `1.9.21`, `1.9.22`, `1.9.23`, `1.9.25`, `1.9.26`, `1.9.27`, `1.9.28`, `1.9.29`, `1.9.31`, `1.9.32`, `1.9.33`, `1.9.34`, `1.9.35`, `1.9.36`, `1.9.37`, `1.9.38`, `1.9.39`, `1.9.40`
# November 18, 2025
Here are the most relevant improvements we've made since the last release:
## More Metrics!
We have shipped **37 new built-in metrics**, faster & more reliable LLM judging, plus robustness fixes.
**New Metrics Added** - We've expanded the evaluation metrics library with a comprehensive set of out-of-the-box metrics including:
* **Classic NLP Heuristics** - BERTScore, Sentiment analysis, Bias detection, Conversation drift, and more
* **Lightweight Heuristics** - Fast, non-LLM based metrics perfect for CI/CD pipelines and large-scale evaluations
* **LLM-as-a-Judge Presets** - More out-of-the-box presets you can use without custom configuration
**LLM-as-a-Judge & G-Eval Improvements**:
* **Compatible with newer models** - Now works seamlessly with the latest model versions
* **Faster default judge** - Default judge is now `gpt-5-nano` for faster, more accurate evals (a usage sketch follows this list)
* **LLM Jury support** - Aggregate scores across multiple models/judges into a single ensemble score for more reliable evaluations
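As an illustration of overriding the judge model on a built-in LLM-as-a-Judge metric, here is a minimal sketch; the metric choice and exact parameter names follow the metrics docs linked below, so treat them as assumptions and adjust for your SDK version:

```python
from opik.evaluation.metrics import Hallucination

# Built-in LLM-as-a-Judge metric; `model` overrides the default judge
metric = Hallucination(model="gpt-5-nano")

result = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
)
print(result.value, result.reason)  # numeric score plus the judge's reasoning
```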
**Enhanced Preprocessing**:
* **Improved English text handling** - Better processing of English text to reduce false negatives
* **Better emoji handling** - Enhanced emoji processing for more accurate evaluations
**Robustness Improvements**:
* **Automatic retries** - LLM judge will retry on transient failures to avoid flaky test results
* **More reliable evaluation runs** - Faster, more consistent evaluation runs for CI and experiments
Access the metrics docs here: [Evaluation Metrics Overview](/docs/opik/evaluation/metrics/overview)
## Anonymizers - PII Redaction
We've added **support for PII (Personally Identifiable Information) redaction** before sending data to Opik. This helps you protect sensitive information while still getting the observability insights you need.
With anonymizers, you can:
* **Automatically redact PII** from traces and spans before they're sent to Opik
* **Configure custom anonymization rules** to match your specific privacy requirements
* **Maintain compliance** with data protection regulations
* **Protect sensitive data** without losing observability
Read the full docs: [Anonymizers](/docs/opik/production/anonymizers)
## New Alert Types
We've expanded our alerting capabilities with new alert types and improved functionality:
* **Experiment Finished Alert** - Get notified when an experiment completes, so you can review results immediately or trigger your CI/CD pipelines.
* **Cost Alerts** - Set thresholds for cost metrics and receive alerts when spending exceeds your limits
* **Latency Alerts** - Monitor response times and get notified when latency exceeds configured thresholds
These new alert types help you stay on top of your LLM application's performance and costs, enabling proactive monitoring and faster response to issues.
Read more: [Alerts Guide](/docs/opik/production/alerts)
## Multimodal Support
We've significantly enhanced multimodal capabilities across the platform:
* **Video LLM-as-a-Judge** - Added support for Video LLM-as-a-Judge, enabling evaluation of video content in your traces
* **Video Cost Tracking** - Added cost tracking for video models, so you can monitor spending on video processing operations
* **Image support in LLM-as-a-Judge** - Both Python and TypeScript SDKs now support image processing in LLM-as-a-Judge evaluations, allowing you to evaluate traces containing images
These enhancements make it easier to build and evaluate multimodal applications that work with images and video content.
## Custom AI Providers
We've improved support for custom AI providers with enhanced configuration options:
* **Multiple Custom Providers** - Set up multiple custom AI providers for use in the Playground and online scoring
* **Custom Headers Support** - Configure custom headers for your custom providers, giving you more flexibility in how you connect to enterprise AI services
## Enhanced Evals & Observability
We've added several improvements to make evaluation and observability more powerful:
* **Trace and Span Metadata in Datasets** - Ability to add trace and span metadata to datasets for advanced agent evaluation, enabling more sophisticated evaluation workflows
* **Tokens Breakdown Display** - Display tokens breakdown (input/output) in the trace view, giving you detailed visibility into token usage for each span and trace
* **Binary (Boolean) Feedback Scores** - New support for binary (Boolean) feedback scores, allowing you to capture simple yes/no or pass/fail evaluations
## UX Improvements
We've made several user experience enhancements across the platform:
* **Improved Pretty Mode** - Enhanced pretty mode for traces, threads, and annotation queues, making it easier to read and understand your data
* **Date Filtering for Traces, Threads, and Spans** - Added date filtering capabilities, allowing you to focus on specific time ranges when analyzing your data
* **New Optimization Runs Section** - Added a new optimization runs section to the home page, giving you quick access to your optimization results
* **Comet Debugger Mode** - Added Comet Debugger Mode with app version and connectivity status, helping you troubleshoot issues and understand your application's connection status. Read more about it [here](/docs/opik/faq#using-comet-debugger-mode-uibrowser)
***
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.97...1.9.17)
*Releases*: `1.8.98`, `1.8.99`, `1.8.100`, `1.8.101`, `1.8.102`, `1.9.0`, `1.9.1`, `1.9.2`, `1.9.3`, `1.9.4`, `1.9.5`, `1.9.6`, `1.9.7`, `1.9.8`, `1.9.9`, `1.9.10`, `1.9.11`, `1.9.12`, `1.9.13`, `1.9.14`, `1.9.15`, `1.9.16`, `1.9.17`
# November 4, 2025
Here are the most relevant improvements we've made since the last release:
## Native Slack and PagerDuty Alerts
We now offer **native Slack and PagerDuty alert integrations**, eliminating the need for any middleware configuration. Set up alerts directly in Opik to receive notifications when important events happen in your workspace.
With native integrations, you can:
* **Configure Slack channels** directly from Opik settings
* **Set up PagerDuty incidents** without additional webhook setup
* **Receive real-time notifications** for errors, feedback scores, and critical events
* **Streamline your monitoring workflow** with built-in integrations
Read the full docs here - [Alerts Guide](/docs/opik/production/alerts)
## Multimodal LLM-as-a-Judge Support for Visual Evaluation
LLM as a Judge metrics can now evaluate traces that contain images when using vision-capable models. This is useful for:
* **Evaluating image generation quality** - Assess the quality and relevance of generated images
* **Analyzing visual content** in multimodal applications - Evaluate how well your application handles visual inputs
* **Validating image-based responses** - Ensure your vision models produce accurate and relevant outputs
To reference image data from traces in your evaluation prompts:
* In the prompt editor, click the **"Images +"** button to add an image variable
* Map the image variable to the trace field containing image data using the Variable Mapping section
Read more: [Evaluating traces with images](/docs/opik/production/rules#evaluating-traces-with-images)
## Prompt Generator & Improver
We've launched the **Prompt Generator** and **Prompt Improver**, two AI-powered tools that help you create and refine prompts faster, directly inside the Playground.
Designed for non-technical users, these features automatically apply best practices from OpenAI, Anthropic, and Google, helping you craft clear, effective, and production-grade prompts without leaving the Playground.
### Why it matters
Prompt engineering is still one of the biggest bottlenecks in LLM development. With these tools, teams can:
* **Generate high-quality prompts** from simple task descriptions
* **Improve existing prompts** for clarity, specificity, and consistency
* **Iterate and test prompts seamlessly** in the Playground
### How it works
* **Prompt Generator** - Describe your task in plain language; Opik creates a complete system prompt following proven design principles
* **Prompt Improver** - Select an existing prompt; Opik enhances it following best practices
Read the full docs: [Prompt Generator & Improver](/docs/opik/prompt_engineering/improve)
## Advanced Prompt Integration in Spans & Traces
We've implemented **prompt integration into spans and traces**, creating a seamless connection between your Prompt Library, Traces, and the Playground.
You can now associate prompts directly with traces and spans using the `opik_context` module, so every execution is automatically tied to the exact prompt version used.
Understanding which prompt produced a given trace is key for users building both simple and advanced multi-prompt and multi-agent systems.
With this integration, you can:
* **Track which prompt version** was used in each function or span
* **Audit and debug prompts** directly from trace details
* **Reproduce or improve prompts** instantly in the Playground
* **Close the loop** between prompt design, observability, and iteration
Once added, your prompts appear in the trace details view, with links back to the Prompt Library and the Playground, so you can iterate in one click.
Read more: [Adding prompts to traces and spans](/docs/opik/prompt_engineering/prompt_management#adding-prompts-to-traces-and-spans)
## Better No-Code Experiment Capabilities in the Playground
We've introduced a series of improvements directly in the Playground to make experimentation easier and more powerful:
**Key enhancements:**
1. **Create or select datasets** directly from the Playground
2. **Create or select online score rules** - Choose which rules to apply on each run
3. **Ability to pass dataset items to online score rules** - This enables reference-based experiments, where outputs are automatically compared to expected answers or ground truth, making objective evaluation simple
4. **One-click navigation to experiment results** - From the Playground, users can now:
* Jump into the Single Experiment View to inspect metrics and examples in detail, or
* Go to the Compare Experiments View to benchmark multiple runs side-by-side
## On-Demand Online Evaluation on Existing Traces and Threads
We've added **on-demand online evaluation** in Opik, letting users run metrics on already logged traces and threads, perfect for evaluating historical data or backfilling new scores.
### How it works
Select traces/threads, choose any online score rule (e.g., Moderation, Equals, Contains), and run evaluations directly from the UI - no code needed.
Results appear inline as feedback scores and are fully logged for traceability.
This enables:
* **Fast, no-code evaluation** of existing data
* **Easy retroactive measurement** of model and agent performance
* **Historical data analysis** without re-running traces
Read more: [Manual Evaluation](/docs/opik/tracing/annotate_traces#manual-evaluation)
## Agent Evaluation Guides
We've added two new comprehensive guides on evaluating agents:
### 1. Evaluating Agent Trajectories
This guide helps you evaluate whether your agent is making the right tool calls before returning the final answer. It's fundamentally about evaluating and scoring what happens within a trace.
Read the full guide: [Evaluating Agent Trajectories](/docs/opik/evaluation/evaluate_agent_trajectory)
### 2. Evaluating Multi-Turn Agents
Evaluating chatbots is tough because you need to evaluate not just a single LLM response but an entire conversation. This guide walks you through how you can use the new `opik.simulation.SimulatedUser` method to create simulated threads for your agent.
Read the full guide: [Evaluating Multi-Turn Agents](/docs/opik/evaluation/evaluate_multi_turn_agents)
These new docs significantly strengthen our agent evaluation feature-set and include diagrams to visualize how each evaluation strategy works.
## Import/Export Commands
Added new command-line functions for importing and exporting Opik data: you can now export all traces, spans, datasets, prompts, and evaluation rules from a project to local JSON or CSV files. You can also import data from local JSON files into an existing project.
### Top use cases
* **Migrate** - Move data between projects or environments
* **Backup** - Create local backups of your project data
* **Version control** - Track changes to your prompts and evaluation rules
* **Data portability** - Easily transfer your Opik workspace data
Read the full docs: [Import/Export Commands](/docs/opik/tracing/import_export_commands)
***
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.83...1.8.97)
*Releases*: `1.8.83`, `1.8.84`, `1.8.85`, `1.8.86`, `1.8.87`, `1.8.88`, `1.8.89`, `1.8.90`, `1.8.91`, `1.8.92`, `1.8.93`, `1.8.94`, `1.8.95`, `1.8.96`, `1.8.97`
# October 21, 2025
Here are the most relevant improvements we've made since the last release:
## Alerts
We've launched **Alerts**, a powerful way to get automated webhook notifications from your Opik workspace whenever important events happen (errors, feedback scores, prompt changes, and more). Opik now sends an HTTP POST to your endpoint with rich, structured event data you can route anywhere.
Now, you can make Opik a seamless part of your end-to-end workflows! With the new Alerts you can:
* **Spot production errors** in near-real time
* **Track feedback scores** to monitor model quality and user satisfaction
* **Audit prompt changes** across your workspace
* **Funnel events** into your existing workflows and CI/CD pipelines
And this is just v1.0! We'll keep adding events, advanced filtering, thresholds, and more fine-grained control in future iterations, always based on community feedback.
Read the full docs here - [Alerts Guide](/docs/opik/production/alerts)
## Expanded Multimodal Image Support
We've added better image support across the platform!
### What's new?
**1. Image Support in LLM-as-a-Judge Online Evaluations** - LLM-as-a-Judge evaluations now support images alongside text, enabling you to evaluate vision models and multimodal applications. Upload images and get comprehensive feedback on both text and visual content.
**2. Enhanced Playground Experience** - The playground now supports image inputs, allowing you to test prompts with images before running full evaluations. Perfect for experimenting with vision models and multimodal prompts.
**3. Improved Data Display** - Base64 image previews in data tables, better image handling in trace views, and enhanced pretty formatting for multimodal content.
Links to official docs: [Evaluating traces with images](/docs/opik/production/rules#evaluating-traces-with-images) and [Using images in the Playground](/docs/opik/prompt_engineering/playground#using-images-in-the-playground)
## Opik Optimizer Updates
**1. Multi-Metric Optimization** - Support for optimizing multiple metrics simultaneously with comprehensive frontend and backend changes. Read [more](/docs/opik/agent_optimization/optimization/define_metrics#compose-metrics)
**2. HRPO (Hierarchical Reflective Prompt Optimizer)** - New optimizer with self-reflective capabilities. Read more about it [here](/docs/opik/agent_optimization/algorithms/hierarchical_adaptive_optimizer)
## Enhanced Feedback & Annotation experience
**1. Improved Annotation Queue Export** - Enhanced export functionality for annotation queues: export your annotated data seamlessly for further analysis.
**2. Annotation Queue UX Enhancements**
* **Hotkeys Navigation** - Improved keyboard navigation throughout the interface for a fast annotation experience
* **Return to Annotation Queue Button** - Easy navigation back to annotation queues
* **Resume Functionality** - Continue annotation work where you left off
* **Queue Creation from Traces** - Create annotation queues directly from trace tables
**3. Inline Feedback Editing** - Quickly edit user feedback directly in data tables with our new inline editing feature. Hover over feedback cells to reveal edit options, making annotation workflows faster and more intuitive.
Read more about our [Annotation Queues](/docs/opik/evaluation/annotation_queues)
## User Experience Enhancements
**1. Dark Mode Refinements** - Improved dark mode styling across UI components for better visual consistency and user experience.
**2. Enhanced Prompt Readability** - Better formatting and display of long prompts in the interface, making them easier to read and understand.
**3. Improved Online Evaluation Page** - Added search, filtering, and sorting capabilities to the online evaluation page for better data management.
**4. Better token and cost control**
* **Thread Cost Display** - Show cost information in thread sidebar headers
* **Sum Statistics** - Display sum statistics for cost and token columns in the traces table.
**5. Filter-Aware Metric Aggregation** - Better experiment item filtering in the experiments details tables for better data control.
**6. Pretty Mode Enhancements** - Improved the Pretty mode for Input/Output display with better formatting and readability across the product.
## TypeScript SDK Updates
* **Opik Configure Tool** - New `opik-ts` configure tool with a guided developer experience and local flag support
* **Prompt Management** - Comprehensive prompt management implementation
* **LangChain Integration** - Aligned LangChain integration with Python architecture
## Python SDK Improvements
* **Context Managers** - New context managers for span and trace creation
* **Bedrock Integration** - Enhanced Bedrock integration with invoke\_model support
* **Trace Updates** - New `update_trace()` method for easier trace modifications
* **Parallel Agent Support** - Support for logging parallel agents in ADK integration
* **Enhanced feedback score handling** with better category support
## Integration updates
**1. OpenTelemetry Improvements**
* **Thread ID Support** - Added support for thread\_id in OpenTelemetry endpoint
* **System Information in Telemetry** - Enhanced telemetry with system information
**2. Model Support Updates** - Added support for [Claude Haiku 4.5](https://www.anthropic.com/news/claude-haiku-4-5) and updated model pricing information across the platform.
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.62...1.8.83)
*Releases*: `1.8.63`, `1.8.64`, `1.8.65`, `1.8.66`, `1.8.67`, `1.8.68`, `1.8.69`, `1.8.70`, `1.8.71`, `1.8.72`, `1.8.73`, `1.8.74`, `1.8.75`, `1.8.76`, `1.8.77`, `1.8.78`, `1.8.79`, `1.8.80`, `1.8.81`, `1.8.82`, `1.8.83`
# October 3, 2025
Here are the most relevant improvements we've made since the last release:
## Multi-Value Feedback Scores & Annotation Queues
We're excited to announce major improvements to our evaluation and annotation capabilities!
### What's new?
**1. Multi-Value Feedback Scores**
Multiple users can now independently score the same trace or thread. No more overwriting each other's input: every reviewer's perspective is preserved and visible in the product. This enables richer, more reliable consensus-building during evaluation.
**2. Annotation Queues**
Create queues of traces or threads that need expert review. Share them with SMEs through simple links. Organize work systematically, track progress, and collect both structured and unstructured feedback at scale.
**3. Simplified Annotation Experience**
A clean, focused UI designed for non-technical reviewers. Support for clear instructions, predefined feedback metrics, and progress indicators. Lightweight and distraction-free, so SMEs can concentrate on providing high-quality feedback.
[Full Documentation: Annotation Queues](/docs/opik/evaluation/annotation_queues)
## Opik Optimizer - GEPA Algorithm & MCP Tool Optimization
### What's new?
**1. GEPA (Genetic-Pareto) Support**
[GEPA](https://github.com/gepa-ai/gepa/) is a new prompt optimization algorithm from Stanford. It bolsters our existing optimizers with the latest algorithm, giving users more options.
**2. MCP Tool Calling Optimization**
The ability to tune MCP servers (external tools used by LLMs). Our solution builds on our existing MetaPrompter algorithm, using LLMs to tune how LLMs interact with an MCP tool. The final output is a new tool signature which you can commit back to your code.
[Full Documentation: Tool Optimization](/docs/opik/agent_optimization/algorithms/tool_optimization) | [GEPA Optimizer](/docs/opik/agent_optimization/algorithms/gepa_optimizer)
## Dataset & Search Enhancements
* Added dataset search and dataset items download functionality
## Python SDK Improvements
* Implement granular support for choosing dataset items in experiments
* Better project name setting and onboarding
* Implement calculation of mean/min/max/std for each metric in experiments
* Update CrewAI to support CrewAI flows
## UX Enhancements
* Add clickable links in trace metadata
* Add description field to feedback definitions
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.42...1.8.62)
*Releases*: `1.8.43`, `1.8.44`, `1.8.45`, `1.8.46`, `1.8.47`, `1.8.48`, `1.8.49`, `1.8.50`, `1.8.51`, `1.8.52`, `1.8.53`, `1.8.54`, `1.8.55`, `1.8.56`, `1.8.57`, `1.8.58`, `1.8.59`, `1.8.60`, `1.8.61`, `1.8.62`
# September 5, 2025
Here are the most relevant improvements we've made since the last release:
## Opik Trace Analyzer Beta is Live!
We're excited to announce the launch of **Opik Trace Analyzer** on Opik Cloud!
What this means: faster debugging & analysis!
Our users can now easily understand, analyze, and debug their development and production traces.
Want to give it a try? All you need to do is go to one of your traces and click on "Inspect trace" to start getting valuable insights.
## Features and Improvements
* We've finally added **dark mode** support! This feature has been requested many times by our community members. You can now switch your theme in your account settings.
* Now you can filter the widgets in the metrics tab by trace and thread attributes
* Annotating tons of threads? We've added the ability to export feedback score comments for threads to CSV for easier analysis in external tools.
* We have also improved the discoverability of the experiment comparison feature.
* Added new filter operators to the Experiments table
* Adding assets as part of your experiment's metadata? We now display clickable links in the experiment config tab for easier navigation.
## Documentation
* We've released [Opik University](/opik/opik-university)! This is a new section of the docs full of video guides explaining the product.
## SDK & Integration Improvements
* Enhanced *LangChain* integration with comprehensive tests and build fixes
* Implemented new search\_prompts method in the Python SDK
* Added [documentation for models, providers, and frameworks supported for cost tracking](/docs/opik/tracing/cost_tracking#supported-models-providers-and-integrations)
* Enhanced Google ADK integration to log **error information to corresponding spans and traces**
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.33...1.8.42)
*Releases*: `1.8.34`, `1.8.35`, `1.8.36`, `1.8.37`, `1.8.38`, `1.8.39`, `1.8.40`, `1.8.41`, `1.8.42`
# August 22, 2025
Here are the most relevant improvements we've made in the last couple of weeks:
## Experiment Grouping
Instantly organize and compare experiments by model, provider, or custom metadata to surface top performers, identify slow configurations, and discover winning parameter combinations. The new Group by feature provides aggregated statistics for each group, making it easier to analyze patterns across hundreds of experiments.
## Expanded Model Support
Added support for 144+ new models, including:
* OpenAI's GPT-5 and GPT-4.1-mini
* Anthropic Claude Opus 4.1
* Grok 4
* DeepSeek v3
* Qwen 3
## Streamlined Onboarding
New quick start experience with AI-assisted installation, interactive setup guides, and instant access to team collaboration features and support.
## Integrations
Enhanced support for leading AI frameworks including:
* **LangChain**: Improved token usage tracking functionality
* **Bedrock**: Comprehensive cost tracking for Bedrock models
## Custom Trace Filters
Advanced filtering capabilities with support for list-like keys in trace and span filters, enabling precise data segmentation and analysis across your LLM operations.
## Performance Optimizations
* Python scoring performance improvements with pre-warming
* Optimized ClickHouse async insert parameters
* Improved deduplication for spans and traces in batches
## SDK Improvements
* Python SDK configuration error handling improvements
* Added dataset & dataset item ID to evaluate task inputs
* Updated OpenTelemetry integration
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.16...1.8.33)
*Releases*: `1.8.16`, `1.8.17`, `1.8.18`, `1.8.19`, `1.8.20`, `1.8.21`, `1.8.22`, `1.8.23`, `1.8.24`, `1.8.25`, `1.8.26`, `1.8.27`, `1.8.28`, `1.8.29`, `1.8.30`, `1.8.31`, `1.8.32`, `1.8.33`
# August 1, 2025
## Advanced Filtering & Search Capabilities
We've expanded filtering and search capabilities to help you find and analyze data more effectively:
* **Custom Trace Filters**: Support for custom filters on input/output fields for traces and spans, allowing more precise data filtering
* **Enhanced Search**: Improved search functionality with better result highlighting and local search within code blocks
* **Crash Filtering**: Fixed filtering issues for values containing special characters like `%` to prevent crashes
* **Dataset Filtering**: Added support for experiments filtering by datasetId and promptId
## Metrics & Analytics Improvements
We've enhanced the metrics and analytics capabilities:
* **Thread Feedback Scores**: Added comprehensive thread feedback scoring system for better conversation quality assessment
* **Thread Duration Monitoring**: New duration widgets in the Metrics dashboard for monitoring conversation length trends
* **Online Evaluation Rules**: Added ability to enable/disable online evaluation rules for more flexible monitoring
* **Cost Optimization**: Reduced cost prompt queries to improve performance and reduce unnecessary API calls
## UX Enhancements
We've made several UX improvements to make the platform more intuitive and efficient:
* **Full-Screen Popup Improvements**: Enhanced the full-screen popup experience with better navigation and usability
* **Tag Component Optimization**: Made tag components smaller and more compact for better space utilization
* **Column Sorting**: Enabled sorting and filtering on all Prompt columns for better data organization
* **Multi-Item Tagging**: Added ability to add tags to multiple items in the Traces and Spans tables simultaneously
## SDK, integrations and docs
* **LangChain Integration**: Enhanced LangChain integration with improved provider and model logging
* **Google ADK Integration**: Updated Google ADK integration with better graph building capabilities
* **Bedrock Integration**: Added comprehensive cost tracking support for ChatBedrock and ChatBedrockConverse
## Security & Stability Enhancements
We've implemented several security and stability improvements:
* **Dependency Updates**: Updated critical dependencies including MySQL connector, OpenTelemetry, and various security patches
* **Error Handling**: Improved error handling and logging across the platform
* **Performance Monitoring**: Enhanced NewRelic support for better performance monitoring
* **Sentry Integration**: Added more metadata about package versions to Sentry events for better debugging
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.7...1.8.16)
*Releases*: `1.8.7`, `1.8.8`, `1.8.9`, `1.8.10`, `1.8.11`, `1.8.12`, `1.8.13`, `1.8.14`, `1.8.15`, `1.8.16`
# July 18, 2025
## Thread-level LLM-as-a-Judge
We now support **thread-level LLMs-as-a-Judge metrics**!
We've implemented **Online evaluation for threads**, enabling the evaluation of **entire conversations between humans and agents**.
This allows for scalable measurement of metrics such as user frustration, goal achievement, conversational turn quality, clarification request rates, alignment with user intent, and much more.
We've also implemented **Python metrics support for threads**, giving you full code control over metric definitions.
To improve visibility into trends and to help detect spikes in these metrics when the agent is running in production, we've added Thread Feedback Scores and Thread Duration widgets to the Metrics dashboard.
These additions make it easier to monitor changes over time in live environments.
## Improved Trace Inspection Experience
Once you've identified problematic sessions or traces, we've made it easier to inspect and analyze them with the following improvements:
* Field Selector for Trace Tree: Quickly choose which fields to display in the trace view.
* Span Type Filter: Filter spans by type to focus on what matters.
* Improved Agent Graph: Now supports full-page view and zoom for easier navigation.
* Free Text Search: Search across traces and spans freely without constraints.
* Better Search Usability: search results are now highlighted and local search is available within code blocks.
## Spans Tab Improvements
The Spans tab provides a clearer, more comprehensive view of agent activity to help you analyze tool and sub-agent usage across threads, uncover trends, and spot latency outliers more easily.
What's New:
* LLM Calls is now Spans: we've renamed the LLM Calls tab to Spans to reflect broader coverage and richer insights.
* Unified View: see all spans in one place, including LLM calls, tools, guardrails, and more.
* Span Type Filter: quickly filter spans by type to focus on what matters most.
* Customizable Columns: highlight key span types by adding them as dedicated columns.
These improvements make it faster and easier to inspect agent behavior and performance at a glance.
## Experiments Improvements
Slow model response times can lead to frustrating user experiences and create hidden bottlenecks in production systems.
However, identifying latency issues early (during experimentation) is often difficult without clear visibility into model performance.
To help address this, we've added Duration as a key metric for monitoring model latency in the Experiments engine.
You can now include Duration as a selectable column in both the Experiments and Experiment Details views.
This makes it easier to identify slow-responding models or configurations early, so you can proactively address potential performance risks before they impact users.
## Enhanced Data Organization & Tagging
When usage grows and data volumes increase, effective data management becomes crucial.
We've added several capabilities to make team workflows easier:
* Tagging, filtering, and column sorting support for Prompts
* Tagging, filtering, and column sorting support for Datasets
* Ability to add tags to multiple items in the Traces and Spans tables
## New Models Support
We've added support for:
* OpenAI GPT-4.1 and GPT-4.1-mini models
* Anthropic Claude 4 Sonnet model
## Integration Updates
We've enhanced several integrations:
* Build graph for Google ADK agents
* Update Langchain integration to log provider, model and usage when using Google Generative AI models
* Implement Groq LLM usage tracking support in the Langchain integration
And much more! [See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.8.0...1.8.6)
*Releases*: `1.8.0`, `1.8.1`, `1.8.2`, `1.8.3`, `1.8.4`, `1.8.5`, `1.8.6`
# July 4, 2025
## Agent Optimizer 1.0 released!
The Opik Agent Optimizer now supports full agentic systems and not just single prompts.
With support for LangGraph, Google ADK, PydanticAI, and more, this release brings a simplified API, model customization for evaluation, and standardized interfaces to streamline optimization workflows. [Learn more in the docs.](/docs/opik/agent_optimization/overview)
## Thread-level improvements
Added **Thread-Level Feedback, Tags & Comments**: You can now add expert feedback scores directly at the thread level, enabling SMEs to review full agent conversations, flag risks, and collaborate with dev teams more effectively. Added support for thread-level tags and comments to streamline workflows and improve context sharing.
## UX improvements
* We've redesigned the **Opik Home Page** to deliver a cleaner, more intuitive first-use experience, with a focused value proposition, direct access to key metrics, and a polished look. The demo data has also been upgraded to showcase Opik's capabilities more effectively for new users. Additionally, we've added **inter-project comparison capabilities** for metrics and cost control, allowing you to benchmark and monitor performance and expenses across multiple projects.
* **Improved Error Visualization**: Enhanced how span-level errors are surfaced across the project. Errors now bubble up to the project view, with quick-access shortcuts to detailed error logs and variation stats for better debugging and error tracking.
* **Improved Sidebar Hotkeys**: Updated sidebar hotkeys for more efficient keyboard navigation between items and detail views.
## SDK, integrations and docs
* Added **Langchain** support in metric classes, allowing use of Langchain as a model proxy alongside LiteLLM for flexible LLM judge customization.
* Added support for the **Gemini 2.5** model family.
* Updated pretty mode to support **Dify** and **LangGraph + OpenAI** responses.
* Added the **OpenAI agents integration cookbook** ([link](/integrations/openai_agents)).
* Added a cookbook on how to import **Huggingface Datasets to Opik**
[See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.7.37...1.7.42)
*Releases*: `1.7.37`, `1.7.38`, `1.7.39`, `1.7.40`, `1.7.41`, `1.7.42`
# June 20, 2025
## Integrations and SDK
* Added **CloudFlare's WorkersAI** integration ([docs](/docs/opik/integrations/cloudflare-workers-ai))
* **Google ADK** integration: tracing is now automatically propagated to all sub-agents in agentic systems with the new `track_adk_agent_recursive` feature, eliminating the need to manually add tracing to each sub-agent.
* **Google ADK** integration: now we retrieve session-level information from the ADK framework to enrich the threads data.
* **New in the SDK!** Real-time tracking for long-running spans/traces is now supported. When enabled (set `os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True"` in your environment), you can see traces and spans update live in the UI, even for jobs that are still running. This makes debugging and monitoring long-running agents much more responsive and convenient (see the sketch after this list).
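A minimal sketch of enabling this, assuming the flag is read when tracking starts; the long-running function below is just a stand-in for your own agent loop:

```python
import os
import time

# Enable live updates for long-running traces/spans; set before any tracking starts
os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True"

from opik import track


@track
def long_running_agent() -> str:
    # Stand-in for a slow agent loop; the trace shows up in the UI while this runs
    time.sleep(30)
    return "done"


long_running_agent()
```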
## Threads improvements
* Added **Token Count and Cost Metrics** in Thread table
* Added **Sorting on all Thread table columns**
* Added **Navigation** from Thread Detail to all related traces
* Added support for **"pretty mode"** in OpenAI Agents threads
## Experiments improvements
* Added support for filtering by **configuration metadata** to experiments. It is now also possible to add a new column displaying the configuration in the experiments table.
## Agent Optimizer improvements
* New Public API for Agent Optimization
* Added optimization run display link
* Added `optimization_context`
## Security Fixes
* Fixed: h11 accepted some malformed Chunked-Encoding bodies
* Fixed: setuptools had a path traversal vulnerability in PackageIndex.download that could lead to Arbitrary File Write
* Fixed: LiteLLM had an Improper Authorization Vulnerability
[See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.7.31...1.7.36)
*Releases*: `1.7.32`, `1.7.33`, `1.7.34`, `1.7.35`, `1.7.36`
# June 6, 2025
## Product Enhancements
* Ability to upload **CSV datasets** directly through the user interface
* Add **experiment cost tracking** to the Experiments table
* Add hinters and helpers for **onboarding new users** across the platform
* Added "LLM calls count" to the traces table
* Pretty formatting for complex agentic threads
* Preview **support for MP3** files in the frontend
## SDKs and API Enhancements
* Good news for JS developers! We've released **experiments support for the JS SDK** (official docs coming very soon)
* New Experiments Bulk API: a new API has been introduced for logging Experiments in bulk.
* Rate Limiting improvements both in the API and the SDK
## Integrations
* Support for OpenAI o3-mini and Groq models added to the Playground
* OpenAI Agents: context awareness implemented, robustness improved, and thread handling improved
* Google ADK: added support for multi-agent integration
* LiteLLM: token and cost tracking added for SDK calls. Integration now compatible with opik.configure(...)
[See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.7.26...1.7.31)
*Releases*: `1.7.27`, `1.7.28`, `1.7.29`, `1.7.30`, `1.7.31`
# May 23, 2025
## New Features
* **Opik Agent Optimizer**: A comprehensive toolkit designed to enhance the performance and efficiency of your Large Language Model (LLM) applications. [Read more](/docs/opik/agent_optimization/overview)
* **Opik Guardrails**: Guardrails help you protect your application from risks inherent in LLMs. Use them to check the inputs and outputs of your LLM calls, and detect issues like off-topic answers or leaking sensitive information. [Read more](/docs/opik/production/guardrails)
## Product Enhancements
* **New Prompt Selector in Playground** - Choose existing prompts from your Prompt Library to streamline your testing workflows.
* **Improved "Pretty Format" for Agents** - Enhanced readability for complex threads in the UI.
## Integrations
* **Vertex AI (Gemini)** - Offline and online evaluation support integrated directly into Opik. Also available now in the Playground.
* **OpenAI Integration in the JS/TS SDK**
* **AWS Strands Agents**
* **Agno Framework**
* **Google ADK Multi-agent support**
## SDKs and API Enhancements
* **OpenAI LLM advanced configurations** - Support for custom headers and base URLs.
* **Span Timing Precision** - Time resolution improved to microseconds for accurate monitoring.
* **Better Error Messaging** - More descriptive errors for SDK validation and runtime failures.
* **Stream-based Tracing and Enhanced Streaming support**
[See full commit log on GitHub](https://github.com/comet-ml/opik/compare/1.7.18...1.7.26)
*Releases*: `1.7.19`, `1.7.20`, `1.7.21`, `1.7.22`, `1.7.23`, `1.7.24`, `1.7.25`, `1.7.26`
# May 5, 2025
**Python and JS / TS SDK**:
* Added support for streaming in ADK integration
* Add cost tracking for the ADK integration
* Add support for OpenAI `responses.parse`
* Reduce the memory and CPU overhead of the Python SDK through various
performance optimizations
**Deployments**:
* Updated port mapping when using `opik.sh`
* Fixed persistence when using Docker compose deployments
*Releases*: `1.7.15`, `1.7.16`, `1.7.17`, `1.7.18`
# April 28, 2025
**Opik Dashboard**:
* Updated the experiment page charts to better handle nulls, all metric values
are now displayed.
* Added lazy loading for traces and span sidebar to better handle very large
traces.
* Added support for trace and span attachments; you can now log PDF, video, and audio files to your traces.
* Improved performance of some Experiment endpoints
**Python and JS / TS SDK**:
* Updated DSPy integration following latest DSPy release
* New Autogen integration based on Opik's OpenTelemetry endpoints
* Added compression to request payload
*Releases*: `1.7.12`, `1.7.13`, `1.7.14`
# April 21, 2025
**Opik Dashboard**:
* Released Python code metrics for online evaluations for both Opik Cloud and
self-hosted deployments. This allows you to define python functions to evaluate
your traces in production.
**Python and JS / TS SDK**:
* Fixed LLM as a judge metrics so they return an error rather than a score of
0.5 if the LLM returns a score that wasn't in the range 0 to 1.
**Deployments**:
* Updated Dockerfiles to ensure all containers run as non root users.
*Release*: `1.7.11`
# April 14, 2025
**Opik Dashboard:**
* Updated the feedback scores UI in the experiment page to make it easier to
annotate experiment results.
* Fixed an issue with base64 encoded images in the experiment sidebar.
* Improved the loading speeds of the traces table and traces sidebar for traces
that have very large payloads (25MB+).
**Python and JS / TS SDK**:
* Improved the robustness of LLM as a Judge metrics with better parsing.
* Fixed usage tracking for Anthropic models hosted on VertexAI.
* When using LiteLLM, we fall back to using the LiteLLM cost if no model provider
or model is specified.
* Added support for `thread_id` in the LangGraph integration.
*Releases*: `1.7.4`, `1.7.5`, `1.7.6`, `1.7.7`, `1.7.8`
# April 7, 2025
**Opik Dashboard:**
* Added search to code blocks in the input and output fields.
* Added sorting on feedback scores in the traces and spans tables.
* Added sorting on feedback scores in the experiments table.
**Python and JS / TS SDK**:
* Released a new integration with [Google ADK framework](https://google.github.io/adk-docs/).
* Cleaned up usage information by removing it from the metadata field if it's already
part of the `Usage` field.
* Added support for the `Rouge` metric - Thanks @rohithmsr!
* Updated the LangChain callback `OpikTracer()` to log the data in a structured
way rather than as raw text. This is especially useful when using LangGraph (a usage sketch follows this list).
* Updated the LangChainJS integration with additional examples and small fixes.
* Updated the OpenAI integration to support the Responses API.
* Introduced a new AggregatedMetric metric that can be used to compute aggregations
of metrics in experiments.
* Added logging for LlamaIndex streaming methods.
* Added a new `text` property on the Opik.Prompt object.
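For reference, a minimal sketch of attaching the callback to a LangChain model; the `langchain_openai` import and the model name are assumptions here, so swap in whichever chain or graph you actually run:

```python
from langchain_openai import ChatOpenAI  # assumed provider package; any LangChain runnable works
from opik.integrations.langchain import OpikTracer

tracer = OpikTracer()
llm = ChatOpenAI(model="gpt-4o-mini")  # placeholder model name

# The callback logs the run to Opik with structured inputs/outputs
response = llm.invoke("Summarize what Opik does.", config={"callbacks": [tracer]})
print(response.content)
```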
*Releases*: `1.6.14`, `1.7.0`, `1.7.1`, `1.7.2`
# March 31, 2025
**Opik Dashboard:**
* Render markdown in experiment output sidebar
* The preference between pretty / JSON and YAML views is now saved
* We now hide image base64 strings in the traces sidebar to make it easier to read
**Python and JS / TS SDK**:
* Released a new [integration with Flowise AI](https://docs.flowiseai.com/using-flowise/analytics/opik)
* LangChain JS integration
* Added support for jinja2 prompts
# March 24, 2025
**General**
* Introduced a new `.opik.sh` installation script
**Opik Dashboard:**
* You can now view the number of spans for each trace in the traces table
* Add the option to search spans from the traces sidebar
* Improved performance of the traces table
**Python and JS / TS SDK**:
* Fixed issue related to log\_probs in Geval metric
* Unknown fields are no longer excluded when using the OpenTelemetry integration
# March 17, 2025
**Opik Dashboard:**
* We have revamped the traces table, the header row is now sticky at the top of
the page when scrolling
* As part of this revamp, we also made rows clickable to make it easier to open
the traces sidebar
* Added visualizations in the experiment comparison page to help you analyze
your experiments
* You can now filter traces by empty feedback scores in the traces table
* Added support for Gemini options in the playground
* Updated the experiment creation code
* Many performance improvements
**Python and JS / TS SDK**:
* Add support for Anthropic cost tracking when using the LangChain integration
* Add support for images in google.genai calls
* [LangFlow integration](https://github.com/langflow-ai/langflow/pull/6928) has now been merged
# March 10, 2025
**Opik Dashboard:**
* Add CSV export for the experiment comparison page
* Added a pretty mode for rendering trace and span input / output fields
* Improved pretty mode to support new line characters and tabs
* Added time support for the Opik datetime filter
* Improved tooltips for long text
* Add `reason` field for feedback scores to json downloads
**Python and JS / TS SDK**:
* Day 0 integration with [OpenAI Agents](/integrations/openai_agents)
* Fixed issue with `get_experiment_by_name` method
* Added cost tracking for Anthropic integration
* Sped up the import time of the Opik library from \~5 seconds to less than 1 second
# March 3, 2025
**Opik Dashboard**:
* Chat conversations can now be reviewed in the platform
* Added the ability to leave comments on experiments
* You can now leave reasons on feedback scores, see [Annotating Traces](/tracing/annotate_traces)
* Added support for Gemini in the playground
* A thumbs up / down feedback score definition is now added to all projects by default to make it easier
to annotate traces.
**JS / TS SDK**:
* The AnswerRelevanceMetric can now be run without providing a context field
* Made some updates to how metrics are uploaded to optimize data ingestion
# February 24, 2025
**Opik Dashboard**:
* You can now add comments to your traces allowing for better collaboration:
* Added support for [OpenRouter](https://openrouter.ai/) in the playground - you can now use over 300 different models in the playground!
**JS / TS SDK**:
* Added support for JSON data format in our OpenTelemetry endpoints
* Added a new `opik healthcheck` command in the Python SDK which simplifies the debugging of connectivity issues
# February 17, 2025
**Opik Dashboard**:
* Improved the UX when navigating between the project list page and the traces page
**Python SDK**:
* Make the logging of spans and traces optional when using Opik LLM metrics
* New integration with genai library
**JS / TS SDK**:
* Added logs and better error handling
# February 10, 2025
**Opik Dashboard**:
* Added support for local models in the Opik playground
**Python SDK**:
* Improved the `@track` decorator to better support nested generators.
* Added a new `Opik.copy_traces(project_name, destination_project_name)` method to copy traces
from one project to another.
* Added support for searching for traces that have feedback scores with spaces in their name.
* Improved the LangChain and LangGraph integrations
**JS / TS SDK**:
* Released the Vercel AI integration
* Added support for logging feedback scores
# February 3, 2025
**Opik Dashboard**:
* You can now view feedback scores for your projects in the Opik home page
* Added line highlights in the quickstart page
* Allow users to download experiments as CSV and JSON files for further analysis
**Python SDK**:
* Updated the `evaluate_*` methods so feedback scores are logged after they are computed rather than at the end of an experiment, as was previously the case
* Released a new [usefulness metric](/evaluation/metrics/usefulness)
* Do not display warning messages about missing API key when Opik logging is disabled
* Add method to list datasets in a workspace
* Add method to list experiments linked to a dataset
**JS / TS SDK**:
* Official release of the first version of the SDK - Learn more [here](/tracing/log_traces#logging-with-the-js--ts-sdk)
* Support logging traces using the low-level Opik client and an experimental decorator.
# January 27, 2025
**Opik Dashboard**:
* Performance improvements for workspaces with hundreds of millions of traces
* Added support for cost tracking when using Gemini models
* Allow users to diff prompt versions
**SDK**:
* Fixed the `evaluate` and `evaluate_*` functions to better support event loops, particularly useful when using Ragas metrics
* Added support for Bedrock `invoke_agent` API
# January 20, 2025
**Opik Dashboard**:
* Added logs for online evaluation rules so that you can more easily ensure your online evaluation metrics are working as expected
* Added auto-complete support in the variable mapping section of the online evaluation rules modal
* Added support for Anthropic models in the playground
* Experiments are now created when using datasets in the playground
* Improved the Opik home page
* Updated the code snippets in the quickstart to make them easier to understand
**SDK**:
* Improved support for litellm completion kwargs
* LiteLLM required version is now relaxed to avoid conflicts with other Python packages
# January 13, 2025
**Opik Dashboard**:
* Datasets are now supported in the playground allowing you to quickly evaluate prompts on multiple samples
* Updated the models supported in the playground
* Updated the quickstart guides to include all the supported integrations
* Fixed an issue that prevented traces with text inputs from being added to datasets
* Add the ability to edit dataset descriptions in the UI
* Released [online evaluation](/production/rules) rules - You can now define LLM as a Judge metrics that will automatically score all, or a subset, of your production traces.

**SDK**:
* New integration with [CrewAI](/integrations/crewai)
* Released a new `evaluate_prompt` method that simplifies the evaluation of simple prompts templates
* Added Sentry to the Python SDK so we can more easily track and fix SDK errors
# January 6, 2025
**Opik Dashboard**:
* Fixed an issue with the trace viewer in Safari
**SDK**:
* Added a new `py.typed` file to the SDK to make it compatible with mypy
# December 30, 2024
**Opik Dashboard**:
* Added duration chart to the project dashboard
* Prompt metadata can now be set and viewed in the UI, this can be used to store any additional information about the prompt
* Playground prompts and settings are now cached when you navigate away from the page
**SDK**:
* Introduced a new `OPIK_TRACK_DISABLE` environment variable to disable the tracking of traces and spans
* We now log usage information for traces logged using the LlamaIndex integration
# December 23, 2024
**SDK**:
* Improved error messages when getting a rate limit when using the `evaluate` method
* Added support for a new metadata field in the `Prompt` object, this field is used to store any additional information about the prompt.
* Updated the library used to create uuidv7 IDs
* New Guardrails integration
* New DSPy integration
# December 16, 2024
**Opik Dashboard**:
* The Opik playground is now in public preview
* You can now view the prompt diff when updating a prompt from the UI
* Errors in traces and spans are now displayed in the UI
* Display agent graphs in the traces sidebar
* Released a new plugin for the [Kong AI Gateway](/production/gateway)
**SDK**:
* Added support for serializing Pydantic models passed to decorated functions
* Implemented `get_experiment_by_id` and `get_experiment_by_name` methods
* Scoring metrics are now logged to the traces when using the `evaluate` method
* New integration with [aisuite](/integrations/aisuite)
* New integration with [Haystack](/integrations/haystack)
# December 9, 2024
**Opik Dashboard**:
* Updated the experiments pages to make it easier to analyze the results of each experiment. Columns are now organized based on where they came from (dataset, evaluation task, etc.) and output keys are now displayed in multiple columns to make it easier to review
* Improved the performance of the experiments so experiment items load faster
* Added descriptions for projects
**SDK**:
* Add cost tracking for OpenAI calls made using LangChain
* Fixed a timeout issue when calling `get_or_create_dataset`
# December 2, 2024
**Opik Dashboard**:
* Added a new `created_by` column for each table to indicate who created the record
* Mask the API key in the user menu
**SDK**:
* Implement background batch sending of traces to speed up processing of trace creation requests
* Updated OpenAI integration to track cost of LLM calls
* Updated `prompt.format` method to raise an error when it is called with the wrong arguments
* Updated the `Opik` method so it accepts the `api_key` parameter as a positional argument
* Improved the prompt template for the `hallucination` metric
* Introduced a new `opik_check_tls_certificate` configuration option to disable the TLS certificate check.
# November 25, 2024
**Opik Dashboard**:
* Feedback scores are now displayed as separate columns in the traces and spans table
* Introduce a new project dashboard to see trace count, feedback scores and token count over time.
* Project statistics are now displayed in the traces and spans table header, this is especially useful for tracking the average feedback scores
* Redesigned the experiment item sidebar to make it easier to review experiment results
* Annotating feedback scores in the UI now feels much faster
* Support exporting traces as JSON file in addition to CSV
* Sidebars now close when clicking outside of them
* Dataset groups in the experiment page are now sorted by last updated date
* Updated scrollbar styles for Windows users
**SDK**:
* Improved the robustness to connection issues by adding retry logic.
* Updated the OpenAI integration to track structured output calls using `beta.chat.completions.parse`.
* Fixed issue with `update_current_span` and `update_current_trace` that did not support updating the `output` field.
# November 18, 2024
**Opik Dashboard**:
* Updated the majority of tables to increase the information density, it is now easier to review many traces at once.
* Images logged to datasets and experiments are now displayed in the UI. Both image URLs and base64 encoded images are supported.
**SDK**:
* The `scoring_metrics` argument is now optional in the `evaluate` method. This is useful if you are looking at evaluating your LLM calls manually in the Opik UI.
* When uploading a dataset, the SDK now prints a link to the dataset in the UI.
* Usage is now correctly logged when using the LangChain OpenAI integration.
* Implement a batching mechanism for uploading spans and dataset items to avoid `413 Request Entity Too Large` errors.
* Removed pandas and numpy as mandatory dependencies.
# November 11, 2024
**Opik Dashboard**:
* Added the option to sort the projects table by `Last updated`, `Created at` and `Name` columns.
* Updated the logic for displaying images, instead of relying on the format of the response, we now use regex rules to detect if the trace or span input includes a base64 encoded image or url.
* Improved performance of the Traces table by truncating trace inputs and outputs if they contain base64 encoded images.
* Fixed some issues with rendering trace input and outputs in YAML format.
* Added grouping and charts to the experiments page:
**SDK**:
* **New integration**: Anthropic integration
```python wordWrap
from anthropic import Anthropic, AsyncAnthropic
from opik.integrations.anthropic import track_anthropic

client = Anthropic()
client = track_anthropic(client, project_name="anthropic-example")

message = client.messages.create(
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Tell a fact",
        }
    ],
    model="claude-3-opus-20240229",
)
print(message)
```
* Added a new `evaluate_experiment` method in the SDK that can be used to re-score an existing experiment, learn more in the [Update experiments](/evaluation/update_existing_experiment) guide.
# November 4, 2024
**Opik Dashboard**:
* Added a new `Prompt library` page to manage your prompts in the UI.
**SDK**:
* Introduced the `Prompt` object in the SDK to manage prompts stored in the library. See the [Prompt Management](/prompt_engineering/managing_prompts_in_code) guide for more details.
* Introduced a `Opik.search_spans` method to search for spans in a project. See the [Search spans](/tracing/export_data#exporting-spans) guide for more details.
* Released a new integration with [AWS Bedrock](/integrations/bedrock) for using Opik with Bedrock models.
# October 21, 2024
**Opik Dashboard**:
* Added the option to download traces and LLM calls as CSV files from the UI:
* Introduce a new quickstart guide to help you get started:
* Updated datasets to support more flexible data schema, you can now insert items with any key value pairs and not just `input` and `expected_output`. See more in the SDK section below.
* Multiple small UX improvements (more informative empty state for projects, updated icons, feedback tab in the experiment page, etc).
* Fix issue with `\t` characters breaking the YAML code block in the traces page.
**SDK**:
* Datasets now support more flexible data schema, we now support inserting items with any key value pairs:
```python wordWrap
import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name="Demo Dataset")

dataset.insert([
    {
        "user_question": "Hello, what can you do ?",
        "expected_output": {
            "assistant_answer": "I am a chatbot assistant that can answer questions and help you with your queries!"
        }
    },
    {
        "user_question": "What is the capital of France?",
        "expected_output": {
            "assistant_answer": "Paris"
        }
    },
])
```
* Released WatsonX, Gemini and Groq integration based on the LiteLLM integration.
* The `context` field is now optional in the [Hallucination](/integrations/overview) metric.
* LLM as a Judge metrics now support customizing the LLM provider by specifying the `model` parameter. See more in the [Customizing LLM as a Judge metrics](/evaluation/metrics/overview#customizing-llm-as-a-judge-metrics) section.
* Fixed an issue when updating feedback scores using the `update_current_span` and `update_current_trace` methods. See this GitHub issue for more details.
# October 18, 2024
**Opik Dashboard**:
* Added a new `Feedback modal` in the UI so you can easily provide feedback on any parts of the platform.
**SDK**:
* Released new evaluation metric: [GEval](/evaluation/metrics/g_eval) - This LLM as a Judge metric is task agnostic and can be used to evaluate any LLM call based on your own custom evaluation criteria.
* Allow users to specify the path to the Opik configuration file using the `OPIK_CONFIG_PATH` environment variable, read more about it in the [Python SDK Configuration guide](/tracing/sdk_configuration#using-a-configuration-file).
* You can now configure the `project_name` as part of the `evaluate` method so that traces are logged to a specific project instead of the default one.
* Added a new `Opik.search_traces` method to search for traces, this includes support for a search string to return only specific traces.
* Enforce structured outputs for LLM as a Judge metrics so that they are more reliable (they will no longer fail when decoding the LLM response).
# October 14, 2024
**Opik Dashboard**:
* Fix handling of large experiment names in breadcrumbs and popups
* Add filtering options for experiment items in the experiment page
**SDK:**
* Allow users to configure the project name in the LangChain integration
# October 7, 2024
**Opik Dashboard**:
* Added `Updated At` column in the project page
* Added support for filtering by token usage in the trace page
**SDK:**
* Added link to the trace project when traces are logged for the first time in a session
* Added link to the experiment page when calling the `evaluate` method
* Added `project_name` parameter in the `opik.Opik` client and `opik.track` decorator
* Added a new `nb_samples` parameter in the `evaluate` method to specify the number of samples to use for the evaluation
* Released the LiteLLM integration
# September 30, 2024
**Opik Dashboard**:
* Added option to delete experiments from the UI
* Updated empty state for projects with no traces
* Removed tooltip delay for the reason icon in the feedback score components
**SDK:**
* Introduced new `get_or_create_dataset` method to the `opik.Opik` client. This method will create a new dataset if it does not exist.
* When inserting items into a dataset, duplicate items are now silently ignored instead of being ingested.
# Tracing Core Concepts
> Learn about the core concepts of Opik's tracing system, including traces, spans, threads, and how they work together to provide comprehensive observability for your LLM applications.
Opik supports agent observability using our [Typescript SDK](/reference/typescript-sdk/overview),
[Python SDK](/reference/python-sdk/overview), [first class OpenTelemetry support](/integrations/opentelemetry)
and our [REST API](/reference/rest-api/overview).
As a next step, you can create an [offline evaluation](/evaluation/evaluate_prompt) to evaluate your
agent's performance on a fixed set of samples.
## Advanced usage
### Using function decorators
Function decorators are a great way to add Opik logging to your existing application. When you add
the `@track` decorator to a function, Opik will create a span for each call to that function and
log its input parameters and output. If a decorated function is called within another decorated
function, a nested span is created for the inner function.
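For example, here is a minimal Python sketch (function names and return values are illustrative):

```python
from opik import track

@track
def retrieve_context(question: str) -> str:
    # When called from another tracked function, this call is logged as a nested span
    return "Opik is an open source LLM evaluation and observability platform."

@track
def answer_question(question: str) -> str:
    # The decorator creates a span for this call and logs its input and output
    context = retrieve_context(question)
    return f"Based on the context: {context}"

answer_question("What is Opik?")
```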
While decorators are most popular in Python, we also support them in our Typescript SDK:
## Understanding Threads
Threads in Opik are collections of traces that are grouped together using a unique `thread_id`. This is particularly useful for:
* **Multi-turn conversations**: Track complete chat sessions between users and AI assistants
* **User sessions**: Group all interactions from a single user session
* **Conversational agents**: Follow the flow of agent interactions and tool usage
* **Workflow tracking**: Monitor complex workflows that span multiple function calls
The `thread_id` is a user-defined identifier that must be unique per project. All traces with the same `thread_id` will be grouped together and displayed as a single conversation thread in the Opik UI.
## Logging conversations
You can log chat conversations by specifying the `thread_id` parameter when using either the low level SDK, Python decorators, or integration libraries:
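As a minimal sketch with the Python decorator (the thread and message values are illustrative, and this assumes `opik_context.update_current_trace` accepts a `thread_id` argument):

```python
from opik import opik_context, track

@track
def chat_turn(user_message: str, thread_id: str) -> str:
    # Group this trace with all other traces that share the same thread_id
    opik_context.update_current_trace(thread_id=thread_id)
    # ... call your LLM here ...
    return "Hello! How can I help you today?"

# All traces logged with the same thread_id appear as one conversation thread
chat_turn("Hi there!", thread_id="conversation-123")
chat_turn("Can you summarize our chat so far?", thread_id="conversation-123")
```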
## Scoring conversations
It is possible to assign conversation-level feedback scores. To do that, you need to understand how threads work in Opik. Threads are collections of traces
that are created when tracking agents, or simply traces interconnected by a `thread_id`. In order to score a conversation, the thread must be
inactive, meaning that no new traces are being created.
Once a thread is inactive, you can assign a feedback score to it. This score will be associated with the thread and displayed in the thread view.
You can also see the feedback score associated with the thread in the conversation list.
## Advanced Thread Features
### Filtering and Searching Threads
You can filter threads using the `thread_id` field in various Opik features:
#### In Data Export
When exporting data, you can filter by `thread_id` using these operators:
* `=` (equals), `!=` (not equals)
* `contains`, `not_contains`
* `starts_with`, `ends_with`
* `>`, `<` (lexicographic comparison)
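For example, when exporting data with the Python SDK, a `thread_id` filter might look like the following (the project and thread names are illustrative, and this assumes `Opik.search_traces` accepts a `filter_string`):

```python
import opik

client = opik.Opik()

# Export every trace that belongs to a single conversation thread
traces = client.search_traces(
    project_name="my-chatbot-project",
    filter_string='thread_id = "conversation-123"',
)
print(f"Found {len(traces)} traces in this thread")
```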
#### In Thread Evaluation
You can evaluate entire conversation threads using the thread evaluation features. This is particularly useful for:
* Conversation quality assessment
* Multi-turn coherence evaluation
* User satisfaction scoring across complete interactions
### Thread Lifecycle Management
Threads have a lifecycle that affects how you can interact with them:
1. **Active**: New traces can be added to the thread
2. **Inactive**: No new traces can be added, thread can be scored
Once a feedback score has been provided, you can also add a reason to explain why this particular
score was given. This is useful for adding additional context to the score.
If multiple team members are annotating the same trace, you can see each team member's annotations
in the UI in the `Feedback scores` section. The average score will be displayed at both the trace
and span level.
## Logging Attachments
In the Python SDK, you can use the `Attachment` type to add files to your traces.
Attachments can be images, videos, audio files or any other file that you might
want to log to Opik.
Each attachment is made up of the following fields:
* `data`: The path to the file or the base64 encoded string of the file
* `content_type`: The content type of the file formatted as a MIME type
These attachments can then be logged to your traces and spans using the
`opik_context.update_current_span` and `opik_context.update_current_trace`
methods:
```python wordWrap
from opik import opik_context, track, Attachment

@track
def my_llm_agent(input):
    # LLM chain code
    # ...

    # Update the trace with an attachment (the path and MIME type below are placeholders)
    opik_context.update_current_trace(
        attachments=[
            Attachment(
                data="/path/to/attachment/file.png",  # path to the file or a base64 encoded string
                content_type="image/png",             # MIME type of the file
            )
        ]
    )
    return "Agent response"
```
## Embedded Attachments
When you embed base64-encoded media directly in your trace/span `input`, `output`, or `metadata` fields, Opik automatically optimizes storage and retrieval for performance.
### How It Works
For base64-encoded content larger than 250KB, Opik automatically extracts and stores it separately. This happens transparently - you don't need to change your code.
When you retrieve your traces or spans later, the attachments are automatically included by default. For faster queries when you don't need the attachment data, use the `strip_attachments=true` parameter.
### Size Limits
Opik Cloud supports embedded attachments up to **100MB per field**. This limit applies to individual string values in your `input`, `output`, or `metadata` fields.
## Data Privacy & Security
Your trace data security and privacy are our top priorities. Here's exactly how your data is handled:
### Privacy Guarantees
* **No Training**: Your trace data is **never used** to train AI models
* **Quality Assurance Only**: Comet retains logs of prompts and responses solely for internal quality assurance and debugging
* **No Third-Party Training**: These logs are not used for training third-party models
### What Data is Shared with OpenAI
To provide intelligent analysis, Opik Assist sends the following to OpenAI's AI models:
* **Trace and span metadata**: Names, timestamps, latency information
* **Input/output content**: The actual prompts, responses, and tool outputs from your trace
* **Performance metrics**: Token counts, costs, and timing data
* **Error information**: Any error messages or status codes in the trace
### What Data is NOT Shared
* **Opik system identifiers**: No API keys or authentication tokens from Opik
* **Account information**: No workspace or project identifying information
* **Historical data**: Only the current trace being analyzed
Reviewing the health check output can help pinpoint the source of the problem and suggest possible resolutions.
### TypeScript SDK Troubleshooting
#### Configuration Validation Errors
The TypeScript SDK validates configuration at startup. Common errors:
* **"OPIK\_URL\_OVERRIDE is not set"**: Set the `OPIK_URL_OVERRIDE` environment variable
* **"OPIK\_API\_KEY is not set"**: Required for Opik Cloud deployments
* **"OPIK\_WORKSPACE is not set"**: Required for Opik Cloud deployments
#### Debug Logging
Enable debug logging to troubleshoot issues:
```bash
export OPIK_LOG_LEVEL="DEBUG"
```
Or programmatically:
```typescript
import { setLoggerLevel } from "opik";
setLoggerLevel("DEBUG");
```
#### Batch Queue Issues
If data isn't appearing in Opik:
1. **Check if data is batched**: Call `await client.flush()` to force sending
2. **Verify configuration**: Ensure correct API URL and credentials
3. **Check network connectivity**: Verify firewall and proxy settings
### General Troubleshooting
#### Environment Variables Not Loading
1. **Python**: Ensure `load_dotenv()` is called before importing `opik` (see the sketch after this list)
2. **TypeScript**: The SDK automatically loads `.env` files
3. **Verify file location**: `.env` file should be in project root
4. **Check file format**: No spaces around `=` in `.env` files
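For the Python case, a minimal sketch (assuming you store Opik settings such as `OPIK_API_KEY` in a `.env` file and use the `python-dotenv` package):

```python
from dotenv import load_dotenv

# Load OPIK_* variables from .env before the SDK reads its configuration
load_dotenv()

import opik  # noqa: E402
```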
#### Configuration File Issues
1. **File location**: Default is `~/.opik.config`
2. **Custom location**: Use `OPIK_CONFIG_PATH` environment variable
3. **File format**: Python uses TOML, TypeScript uses INI format
4. **Permissions**: Ensure file is readable by your application
# Log Agent Graphs
> Learn to log agent execution graphs in Opik for enhanced debugging and flow visualization across frameworks like LangGraph and Google ADK.
Agent graphs are a great way to visualize the flow of an agent and simplify its debugging.
Opik supports logging agent graphs for the following frameworks:
1. LangGraph
2. Google Agent Development Kit (ADK)
3. Manual Tracking
## LangGraph
You can log the agent execution graph by specifying the `graph` parameter in the
[OpikTracer](https://www.comet.com/docs/opik/python-sdk-reference/integrations/langchain/OpikTracer.html) callback:
```python
from opik.integrations.langchain import OpikTracer
opik_tracer = OpikTracer(graph=app.get_graph(xray=True))
```
Opik will log the agent graph definition in the Opik dashboard which you can access by clicking on
`Show Agent Graph` in the trace sidebar.
## Google Agent Development Kit (ADK)
Opik automatically generates visual representations of your agent workflows for Google ADK without requiring any additional configuration. Simply integrate Opik's OpikTracer callback as shown in the [ADK integration configuration guide](https://www.comet.com/docs/opik/integrations/adk#configuring-google-adk), and your agent graphs will be automatically captured and visualized.
The graph automatically shows:
* Agent hierarchy and relationships
* Sequential execution flows
* Parallel processing branches
* Tool connections and dependencies
* Loop structures and iterations
For example, a basic weather and time agent will display its execution flow with all agent steps, LLM calls, and tool invocations:
For more complex multi-agent architectures, the automatic graph visualization becomes even more valuable, providing clear visibility into nested agent hierarchies and complex execution patterns.
## Manual Tracking
You can also log the agent graph definition manually by logging the agent graph definition as a
mermaid graph definition in the metadata of the trace:
```python
import opik
from opik import opik_context

@opik.track
def chat_agent(input: str):
    # Update the current trace with the agent graph definition
    opik_context.update_current_trace(
        metadata={
            "_opik_graph_definition": {
                "format": "mermaid",
                "data": "graph TD; U[User]-->A[Agent]; A-->L[LLM]; L-->A; A-->R[Answer];"
            }
        }
    )
    return "Hello, how can I help you today?"

chat_agent("Hi there!")
```
Opik will log the agent graph definition in the Opik dashboard which you can access by clicking on
`Show Agent Graph` in the trace sidebar.
## Next steps
Why not check out:
* [Opik's 50+ integrations](/integrations/overview)
* [Logging traces](/tracing/log_traces)
* [Evaluating agents](/evaluation/evaluate_agents)
# Log distributed traces
> Learn to track distributed traces in complex LLM applications using Opik's built-in support for multi-service tracing.
When working with complex LLM applications, it is common to need to track traces across multiple services. Opik supports distributed tracing out of the box when integrating using function decorators, using a mechanism similar to how OpenTelemetry implements distributed tracing.
For the purposes of this guide, we will assume that you have a simple LLM application that is made up of two services: a client and a server. We will assume that the client will create the trace and span, while the server will add a nested span. In order to do this, the `trace_id` and `span_id` will be passed in the headers of the request from the client to the server.

The Python SDK includes some helper functions to make it easier to fetch headers in the client and ingest them in the server:
```python title="client.py"
import requests

from opik import track, opik_context

@track()
def my_client_function(prompt: str) -> str:
    headers = {}
    # Update the headers to include Opik Trace ID and Span ID
    headers.update(opik_context.get_distributed_trace_headers())

    # Make call to backend service
    response = requests.post("http://.../generate_response", headers=headers, json={"prompt": prompt})
    return response.json()
```
On the server side, you can pass the headers to your decorated function:
```python title="server.py"
from opik import track
from fastapi import FastAPI, Request

@track()
def my_llm_application():
    pass

app = FastAPI()  # Or Flask, Django, or any other framework

@app.post("/generate_response")
def generate_llm_response(request: Request) -> str:
    return my_llm_application(opik_distributed_trace_headers=request.headers)
```
In the experiment pages, you will be able to:
1. Review the output provided by the LLM for each sample in the dataset
2. Deep dive into each sample by clicking on the `item ID`
3. Review the experiment configuration to know how the experiment was run
4. Compare multiple experiments side by side
### Analyzing Evaluation Results in Python
To analyze the evaluation results in Python, you can use the `EvaluationResult.aggregate_evaluation_scores()` method
to retrieve the aggregated score statistics:
```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("Evaluation test dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
result = evaluate_prompt(
    dataset=dataset,
    messages=[{"role": "user", "content": "Translate the following text to French: {{input}}"}],
    model="gpt-5",  # or model that you want to evaluate
    scoring_metrics=[Hallucination()],
    verbose=2,  # print detailed statistics report for every metric
)

# Retrieve and print the aggregated scores statistics (mean, min, max, std) per metric
scores = result.aggregate_evaluation_scores()
for metric_name, statistics in scores.aggregated_scores.items():
    print(f"{metric_name}: {statistics}")
```
You can use aggregated scores to compare the performance of different models or different versions of the same model.
### Experiment-Level Metrics
In addition to per-item metrics, you can compute experiment-level aggregate metrics that are calculated across all test results. These experiment scores are displayed in the Opik UI alongside feedback scores and can be used for sorting and filtering experiments.
In this section, we will walk through all the concepts associated with Opik's evaluation platform.
## Datasets
The first step in automating the evaluation of your LLM application is to create a dataset which is a collection of samples that your LLM application will be evaluated on. Each dataset is made up of Dataset Items which store the input, expected output and other metadata for a single sample.
Given the importance of datasets in the evaluation process, teams often spend a significant amount of time curating and preparing their datasets. There are three main ways to create a dataset:
1. **Manually curating examples**: As a first step, you can manually curate a set of examples based on your knowledge of the application you are building. You can also leverage subject matter experts to help in the creation of the dataset.
2. **Using synthetic data**: If you don't have enough data to create a diverse set of examples, you can turn to synthetic data generation tools to help you create a dataset. The [LangChain cookbook](/integrations/langchain) has a great example of how to use synthetic data generation tools to create a dataset.
3. **Leveraging production data**: If your application is in production, you can leverage the data that is being generated to augment your dataset. While this is often not the first step in creating a dataset, it can be a great way to enrich your dataset with real world data.
If you are using Opik for production monitoring, you can easily add traces to your dataset by selecting them in the UI and selecting `Add to dataset` in the `Actions` dropdown.
### Experiment Items
Experiment items store the input, expected output, actual output and feedback scores for each dataset sample that was processed during an experiment. In addition, a trace is associated with each item to allow you to easily understand why a given item scored the way it did.
### Experiment-Level Metrics
In addition to per-item metrics, you can compute experiment-level aggregate metrics that are calculated across all test results. These experiment scores are displayed in the Opik UI alongside feedback scores and can be used for sorting and filtering experiments.
Experiment scores are computed after all test results are collected using custom functions that take a list of test results and return aggregate metrics. Common use cases include computing maximum, minimum, or mean values across all test cases, or calculating custom statistics specific to your evaluation needs.
## Learn more
We have provided some guides to help you get started with Opik's evaluation platform:
1. [Overview of Opik's evaluation features](/evaluation/overview)
2. [Evaluate prompts](/evaluation/evaluate_prompt)
3. [Evaluate your LLM application](/evaluation/evaluate_your_llm)
# Manage datasets
> Evaluate your LLM using datasets. Learn to create and manage them via Python SDK, TypeScript SDK, or the Traces table.
Datasets can be used to track test cases you would like to evaluate your LLM on. Each dataset is made up of dataset items,
which are dictionaries containing any key-value pairs. When getting started, we recommend including an `input` field and an
optional `expected_output` field, for example. These datasets can be created from:
* Python SDK: You can use the Python SDK to create a dataset and add items to it.
* TypeScript SDK: You can use the TypeScript SDK to create a dataset and add items to it.
* Traces table: You can add existing logged traces (from a production application for example) to a dataset.
* The Opik UI: You can manually create a dataset and add items to it.
Once a dataset has been created, you can run Experiments on it. Each Experiment will evaluate an LLM application based
on the test cases in the dataset using an evaluation metric and report the results back to the dataset.
## Create a dataset via the UI
The simplest and fastest way to create a dataset is directly in the Opik UI.
This is ideal for quickly bootstrapping datasets from CSV files without needing to write any code.
Steps:
1. Navigate to **Evaluation > Datasets** in the Opik UI.
2. Click **Create new dataset**.
3. In the pop-up modal:
* Provide a name and an optional description
* Optionally, upload a CSV file with your data
4. Click **Create dataset**.
If you need to create a dataset with more than 1,000 rows, you [can use the SDK](/evaluation/manage_datasets#creating-a-dataset-using-the-sdk).
#### Inserting items from a JSONL file
You can also insert items from a JSONL file:
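A minimal sketch, assuming a local `items.jsonl` file and the `read_jsonl_from_file` helper on the dataset object:

```python
import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name="Demo Dataset")

# Each line in items.jsonl should be a JSON object containing the item's fields
dataset.read_jsonl_from_file("items.jsonl")
```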
### Configuration options
**Sample Count**: Start with a smaller number (10-20) to review the quality before generating larger batches.
**Preserve Fields**: Use this to maintain consistency in certain fields while allowing variation in others. For example, preserve the `category` field while varying the `input` and `expected_output`.
**Variation Instructions**: Provide specific guidance such as:
* "Create variations with different difficulty levels"
* "Generate edge cases and error scenarios"
* "Add examples with different input formats"
* "Include multilingual variations"
### Best practices
* **Start small**: Generate 10-20 samples first to evaluate quality before scaling up
* **Review generated content**: Always review AI-generated samples for accuracy and relevance
* **Use variation instructions**: Provide clear guidance on the type of variations you want
* **Preserve key fields**: Use field preservation to maintain important categorizations or metadata
* **Iterate and refine**: Use the custom prompt option to fine-tune generation for your specific needs
There are two ways to evaluate a prompt in Opik:
1. Using the prompt playground
2. Using the `evaluate_prompt` function in the Python SDK
## Using the prompt playground
The Opik playground allows you to quickly test different prompts and see how they perform.
You can compare multiple prompts to each other by clicking the `+ Add prompt` button in the top
right corner of the playground. This will allow you to enter multiple prompts and compare them side
by side.
In order to evaluate the prompts on samples, you can add variables to the prompt messages using the
`{{variable}}` syntax. You can then connect a dataset and run the prompts on each dataset item.

## Programmatically evaluating prompts
The Opik SDKs provide a simple way to evaluate prompts using the `evaluate_prompt` method. This
method allows you to specify a dataset, a prompt and a model. The prompt is then evaluated on each
dataset item and the output can then be reviewed and annotated in the Opik UI.
To run the experiment, you can use the following code:
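A minimal sketch (the dataset name, prompt template and model identifier are illustrative):

```python
import opik
from opik.evaluation import evaluate_prompt

client = opik.Opik()
dataset = client.get_or_create_dataset(name="Demo prompt evaluation")
dataset.insert([
    {"input": "Hello, world!"},
    {"input": "What is the capital of France?"},
])

# Evaluate a single prompt template against every item in the dataset
evaluate_prompt(
    dataset=dataset,
    messages=[{"role": "user", "content": "Translate the following text to French: {{input}}"}],
    model="gpt-5",  # illustrative model identifier
)
```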
### Automate the scoring process
Manually reviewing each LLM output can be time-consuming and error-prone. The `evaluate_prompt`
function lets you specify a list of scoring metrics used to score each LLM output. Opik has a set
of built-in metrics to detect hallucinations, measure answer relevance, and more, and if we don't
have the metric you need, you can easily create your own.
You can find a full list of all the Opik supported metrics in the
[Metrics Overview](/evaluation/metrics/overview) section, or you can define your own metric as
described in the [Custom Metrics](/evaluation/metrics/custom_metric) section.
By adding the `scoring_metrics` parameter to the `evaluate_prompt` function, you can specify a list
of metrics to use for scoring. We will update the example above to use the `Hallucination` metric
for scoring:
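A sketch of the updated call, reusing the same illustrative dataset and model as above:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

client = opik.Opik()
dataset = client.get_or_create_dataset(name="Demo prompt evaluation")

evaluate_prompt(
    dataset=dataset,
    messages=[{"role": "user", "content": "Translate the following text to French: {{input}}"}],
    model="gpt-5",  # illustrative model identifier
    scoring_metrics=[Hallucination()],  # score each output for hallucinations
)
```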
## Advanced usage
### Missing arguments for scoring methods
When you face the `opik.exceptions.ScoreMethodMissingArguments` exception, it means that the dataset
item and task output dictionaries do not contain all the arguments expected by the scoring method.
The way the evaluate function works is by merging the dataset item and task output dictionaries and
then passing the result to the scoring method. For example, if the dataset item contains the keys
`user_question` and `context` while the evaluation task returns a dictionary with the key `output`,
the scoring method will be called as `scoring_method.score(user_question='...', context='...', output='...')`.
This can be an issue if the scoring method expects a different set of arguments.
You can solve this by either updating the dataset item or evaluation task to return the missing
arguments or by using the `scoring_key_mapping` parameter of the `evaluate` function. In the example
above, if the scoring method expects `input` as an argument, you can map the `user_question` key to
the `input` key as follows:
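A sketch of that mapping (the dataset, task output and metric are illustrative):

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

client = opik.Opik()
dataset = client.get_or_create_dataset(name="Demo Dataset")

def evaluation_task(dataset_item):
    # Your LLM application goes here; the returned keys are merged with the
    # dataset item before being passed to the scoring methods
    return {"output": "This is a placeholder answer."}

evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    # Map the dataset's `user_question` key to the `input` argument expected
    # by the scoring method
    scoring_key_mapping={"input": "user_question"},
)
```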
### Logging traces to a specific project
You can use the `project_name` parameter of the `evaluate` function to log evaluation traces to a specific project:
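Continuing the sketch above (reusing the illustrative `dataset` and `evaluation_task` defined there), the same call with a `project_name`:

```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    project_name="my-evaluation-project",  # evaluation traces are logged to this project
)
```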
## Prerequisites
Before evaluating agent trajectories, you need:
1. **Opik SDK installed and configured** โ See [Quickstart](/quickstart) for setup
2. **Agent with observability enabled** โ Your agent must be instrumented with Opik tracing
3. **Test dataset** โ Examples with expected agent behavior
If your agent isn't traced yet, see [Log Traces](/tracing/log_traces) to add observability first.
### Installing the Opik SDK
To install the Opik Python SDK you can run the following command:
```bash
pip install opik
```
Then you can configure the SDK by running the following command:
```bash
opik configure
```
This will prompt you for your API key and workspace or your instance URL if you are self-hosting.
### Adding observability to your agent
In order to be able to evaluate the agent's trajectory, you need to add tracing to your agent. This
will allow us to capture the agent's trajectory and evaluate it.
## Creating the user simulator
In order to perform multi-turn evaluation, we need to create a user simulator that will generate
the user's response based on previous turns.
```python title="User simulator" maxLines=1000
from opik.simulation import SimulatedUser

user_simulator = SimulatedUser(
    persona="You are a frustrated user who wants a refund",
    model="openai/gpt-4.1",
)

# Generate a user message that will start the conversation
print(user_simulator.generate_response([
    {"role": "assistant", "content": "Hello, how can I help you today?"}
]))

# Generate a user message based on a couple of back and forth turns
print(user_simulator.generate_response([
    {"role": "assistant", "content": "Hello, how can I help you today?"},
    {"role": "user", "content": "My product just broke 2 days after I bought it, I want a refund."},
    {"role": "assistant", "content": "I'm sorry to hear that. What happened?"}
]))
```
Now that we have a way to simulate the user, we can create multiple simulations that we will in
turn evaluate.
## Running simulations
Now that we have a way to simulate the user, we can create multiple simulations:
## Next steps
* Learn more about [conversation metrics](/evaluation/metrics/conversation_threads_metrics)
* Learn more about [custom conversation metrics](/evaluation/metrics/custom_conversation_metric)
* Learn more about [evaluate\_threads](/evaluation/evaluate_threads)
* Learn more about [agent trajectory evaluation](/evaluation/evaluate_agent_trajectory)
# Manually logging experiments
> Evaluate your LLM application by logging pre-computed experiments and boosting confidence in performance with this detailed guide.
Evaluating your LLM application allows you to have confidence in the performance of your LLM
application. In this guide, we will walk through manually creating experiments using data you have
already computed.
5. Update the experiment name and/or configuration (JSON format)
6. Click **Update Experiment** to save your changes
The configuration is stored as JSON and is useful for tracking parameters like model names, temperatures, prompt templates, or any other metadata relevant to your experiment.
### From the Python SDK
Use the `update_experiment` method to update an experiment's name and configuration:
```python
import opik

client = opik.Opik()

# Update experiment name
client.update_experiment(
    id="experiment-id",
    name="Updated Experiment Name"
)

# Update experiment configuration
client.update_experiment(
    id="experiment-id",
    experiment_config={
        "model": "gpt-4",
        "temperature": 0.7,
        "prompt_template": "Answer the following question: {question}"
    }
)

# Update both name and configuration
client.update_experiment(
    id="experiment-id",
    name="Updated Experiment Name",
    experiment_config={
        "model": "gpt-4",
        "temperature": 0.7
    }
)
```
### From the TypeScript SDK
Use the `updateExperiment` method to update an experiment's name and configuration:
```typescript
import { Opik } from "opik";

const opik = new Opik();

// Update experiment name
await opik.updateExperiment("experiment-id", {
  name: "Updated Experiment Name"
});

// Update experiment configuration
await opik.updateExperiment("experiment-id", {
  experimentConfig: {
    model: "gpt-4",
    temperature: 0.7,
    promptTemplate: "Answer the following question: {question}"
  }
});

// Update both name and configuration
await opik.updateExperiment("experiment-id", {
  name: "Updated Experiment Name",
  experimentConfig: {
    model: "gpt-4",
    temperature: 0.7
  }
});
```
## Update Experiment Scores
Sometimes you may want to update an existing experiment with new scores, or update existing scores for an experiment. You can do this using the [`evaluate_experiment` function](https://www.comet.com/docs/opik/python-sdk-reference/evaluation/evaluate_existing.html).
This function will re-run the scoring metrics on the existing experiment items and update the scores:
```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination
hallucination_metric = Hallucination()
# Replace "my-experiment" with the name of your experiment which can be found in the Opik UI
evaluate_experiment(experiment_name="my-experiment", scoring_metrics=[hallucination_metric])
```
Configure your queue with:
* **Name**: Clear identification for your queue
* **Scope**: Choose between traces or threads
* **Instructions**: Provide context and guidance for reviewers
* **Feedback Definitions**: Select the metrics SMEs will use for scoring
### Adding Content to Your Queue
You can add items to your queue in several ways:
**From Traces/Threads Lists:**
* Select one or multiple items
* Click **Add to -> Add to annotation queue**
* Choose an existing queue or create a new one
**From Individual Trace/Thread Details:**
* Open the trace or thread detail view
* Click **Add to -> Add to annotation queue** in the actions panel
* Select your target queue
### Sharing with Subject Matter Experts
Once your queue is set up, you can share it with SMEs:
**Copy Queue Link:**
The annotation workflow begins with clear instructions and context, allowing SMEs to understand what they're evaluating and how to provide meaningful feedback.
The SME interface provides:
1. **Clean, focused design**: No technical jargon or complex navigation
2. **Clear instructions**: Queue-specific guidance displayed prominently
3. **Structured feedback**: Predefined metrics with clear descriptions
4. **Progress tracking**: Visual indicators of completion status
5. **Comment system**: Optional text feedback for additional context
### Annotation Workflow
1. **Access the queue**: SME clicks the shared link
2. **Review content**: Examine the trace or thread output
3. **Provide feedback**: Score using predefined metrics
4. **Add comments**: Optional text feedback
5. **Submit and continue**: Move to the next item
## Learn more
You can learn more about Opik's annotation and evaluation features in:
1. [Evaluation overview](/evaluation/overview)
2. [Feedback definitions](../configuration/configuration/feedback_definitions)
# Overview
> Describes all the built-in evaluation metrics provided by Opik
# Overview
Opik provides a set of built-in evaluation metrics that you can mix and match to evaluate LLM behaviour. These metrics are broken down into two main categories:
1. **Heuristic metrics** โ deterministic checks that rely on rules, statistics, or classical NLP algorithms.
2. **LLM as a Judge metrics** โ delegate scoring to an LLM so you can capture semantic, task-specific, or conversation-level quality signals.
Heuristic metrics are ideal when you need reproducible checks such as exact matching, regex validation, or similarity scores against a reference. LLM as a Judge metrics are useful when you want richer qualitative feedback (hallucination detection, helpfulness, summarisation quality, regulatory risk, etc.).
## Built-in metrics
### Heuristic metrics
| Metric | Description | Documentation |
| ------------------ | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| BERTScore | Contextual embedding similarity score | [BERTScore](/evaluation/metrics/heuristic_metrics#bertscore) |
| ChrF | Character n-gram F-score (chrF / chrF++) | [ChrF](/evaluation/metrics/heuristic_metrics#chrf) |
| Contains | Checks whether the output contains a specific substring | [Contains](/evaluation/metrics/heuristic_metrics#contains) |
| Corpus BLEU | Computes corpus-level BLEU across multiple outputs | [CorpusBLEU](/evaluation/metrics/heuristic_metrics#bleu) |
| Equals | Checks if the output exactly matches an expected string | [Equals](/evaluation/metrics/heuristic_metrics#equals) |
| GLEU | Estimates grammatical fluency for candidate sentences | [GLEU](/evaluation/metrics/heuristic_metrics#gleu) |
| IsJson | Validates that the output can be parsed as JSON | [IsJson](/evaluation/metrics/heuristic_metrics#isjson) |
| JSDivergence | JensenโShannon similarity between token distributions | [JSDivergence](/evaluation/metrics/heuristic_metrics#jsdivergence) |
| JSDistance | Raw JensenโShannon divergence | [JSDistance](/evaluation/metrics/heuristic_metrics#jsdistance) |
| KLDivergence | KullbackโLeibler divergence with smoothing | [KLDivergence](/evaluation/metrics/heuristic_metrics#kldivergence) |
| Language Adherence | Verifies output language code | [Language Adherence](/evaluation/metrics/heuristic_metrics#language-adherence) |
| Levenshtein | Calculates the normalized Levenshtein distance between output and reference | [Levenshtein](/evaluation/metrics/heuristic_metrics#levenshteinratio) |
| Readability | Reports Flesch Reading Ease and FK grade | [Readability](/evaluation/metrics/heuristic_metrics#readability) |
| RegexMatch | Checks if the output matches a specified regular expression pattern | [RegexMatch](/evaluation/metrics/heuristic_metrics#regexmatch) |
| ROUGE | Calculates ROUGE variants (rouge1/2/L/Lsum/W) | [ROUGE](/evaluation/metrics/heuristic_metrics#rouge) |
| Sentence BLEU | Computes a BLEU score for a single output against one or more references | [SentenceBLEU](/evaluation/metrics/heuristic_metrics#bleu) |
| Sentiment | Scores sentiment using VADER | [Sentiment](/evaluation/metrics/heuristic_metrics#sentiment) |
| Spearman Ranking | Spearman's rank correlation | [Spearman Ranking](/evaluation/metrics/heuristic_metrics#spearman-ranking) |
| Tone | Flags tone issues such as shouting or negativity | [Tone](/evaluation/metrics/heuristic_metrics#tone) |
### Conversation heuristic metrics
| Metric | Description | Documentation |
| ------------------- | ------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| DegenerationC | Detects repetition and degeneration patterns over a conversation | [DegenerationC](/evaluation/metrics/conversation_threads_metrics#conversation-degeneration-metric) |
| Knowledge Retention | Checks whether the last assistant reply preserves user facts from earlier turns | [Knowledge Retention](/evaluation/metrics/conversation_threads_metrics#knowledge-retention-metric) |
### LLM as a Judge metrics
| Metric | Description | Documentation |
| ------------------------------- | ----------------------------------------------------------------- | -------------------------------------------------------------------------- |
| Agent Task Completion Judge | Checks whether an agent fulfilled its assigned task | [Agent Task Completion](/evaluation/metrics/agent_task_completion) |
| Agent Tool Correctness Judge | Evaluates whether an agent used tools correctly | [Agent Tool Correctness](/evaluation/metrics/agent_tool_correctness) |
| Answer Relevance | Checks whether the answer stays on-topic with the question | [Answer Relevance](/evaluation/metrics/answer_relevance) |
| Compliance Risk Judge | Identifies non-compliant or high-risk statements | [Compliance Risk](/evaluation/metrics/compliance_risk) |
| Context Precision | Ensures the answer only uses relevant context | [Context Precision](/evaluation/metrics/context_precision) |
| Context Recall | Measures how well the answer recalls supporting context | [Context Recall](/evaluation/metrics/context_recall) |
| Dialogue Helpfulness Judge | Evaluates how helpful an assistant reply is in a dialogue | [Dialogue Helpfulness](/evaluation/metrics/dialogue_helpfulness) |
| G-Eval | Task-agnostic judge configurable with custom instructions | [G-Eval](/evaluation/metrics/g_eval) |
| Hallucination | Detects unsupported or hallucinated claims using an LLM judge | [Hallucination](/evaluation/metrics/hallucination) |
| LLM Juries Judge | Averages scores from multiple judge metrics for ensemble scoring | [LLM Juries](/evaluation/metrics/llm_juries) |
| Meaning Match | Evaluates semantic equivalence between output and ground truth | [Meaning Match](/evaluation/metrics/meaning_match) |
| Moderation | Flags safety or policy violations in assistant responses | [Moderation](/evaluation/metrics/moderation) |
| Prompt Uncertainty Judge | Detects ambiguity in prompts that may confuse LLMs | [Prompt Diagnostics](/evaluation/metrics/prompt_diagnostics) |
| QA Relevance Judge | Determines whether an answer directly addresses the user question | [QA Relevance](/evaluation/metrics/g_eval#qa-relevance-judge) |
| Structured Output Compliance | Checks JSON or schema adherence for structured responses | [Structured Output](/evaluation/metrics/structure_output_compliance) |
| Summarization Coherence Judge | Rates the structure and coherence of a summary | [Summarization Coherence](/evaluation/metrics/summarization_coherence) |
| Summarization Consistency Judge | Checks if a summary stays faithful to the source | [Summarization Consistency](/evaluation/metrics/summarization_consistency) |
| Trajectory Accuracy | Scores how closely agent trajectories follow expected steps | [Trajectory Accuracy](/evaluation/metrics/trajectory_accuracy) |
| Usefulness | Rates how useful the answer is to the user | [Usefulness](/evaluation/metrics/usefulness) |
### Conversation LLM as a Judge metrics
| Metric | Description | Documentation |
| ---------------------------- | ----------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| Conversational Coherence | Evaluates coherence across sliding windows of a dialogue | [Conversational Coherence](/evaluation/metrics/conversation_threads_metrics#conversationalcoherencemetric) |
| Session Completeness Quality | Checks whether user goals were satisfied during the session | [Session Completeness](/evaluation/metrics/conversation_threads_metrics#sessioncompletenessquality) |
| User Frustration | Estimates the likelihood a user was frustrated | [User Frustration](/evaluation/metrics/conversation_threads_metrics#userfrustrationmetric) |
## Customizing LLM as a Judge metrics
By default, Opik uses GPT-5-nano from OpenAI as the LLM to evaluate the output of other LLMs. However, you can easily switch to another LLM provider by specifying a different `model` parameter.
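For example, a minimal sketch switching the judge used by the `Hallucination` metric (the model string is illustrative and assumes a LiteLLM-style identifier):

```python
from opik.evaluation.metrics import Hallucination

# Pass a different judge model via the `model` parameter
hallucination_metric = Hallucination(model="anthropic/claude-3-5-sonnet-20241022")

result = hallucination_metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
)
print(result.value, result.reason)
```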
For each interaction with the end user, you can also know how the agent planned, chose tools, or crafted an answer based on the user input, the agent graph and much more.
During development phase, having access to all this information is fundamental for debugging and understanding what is working as expected and whatโs not.
**Error detection**
Having immediate access to all traces that returned an error can also be life-saving, and Opik makes it extremely easy to achieve:
For each of the errors and exceptions captured, you have access to all the details you need to fix the issue:
### 2. Evaluate Agent's End-to-end Behavior
Once you have full visibility on the agent interactions, memory and tool usage, and you made sure everything is working at the technical level, the next logical step is to start checking the quality of the responses and the actions your agent takes.
**Human Feedback**
The fastest and easiest way to do it is providing manual human feedback. Each trace and each span can be rated โCorrectโ or โIncorrectโ by a person (most probably you!) and that will give a baseline to understand the quality of the responses.
You can provide human feedback and a comment for each traceโs score in Opik and when youโre done you can store all results in a dataset that you will be using in next iterations of agent optimization.
**Online evaluation**
Marking an answer as simply โcorrectโ or โincorrectโ is a useful first step, but itโs rarely enough. As your agent grows more complex, youโll want to measure how well it performs across more nuanced dimensions.
Thatโs where online evaluation becomes essential.
With Opik, you can automatically score traces using a wide range of metrics, such as answer relevance, hallucination detection, agent moderation, user moderation, or even custom criteria tailored to your specific use case. These evaluations run continuously, giving you structured feedback on your agentโs quality without requiring manual review.
#### What Happens Next? Iterate, Improve, and Compare
Running the experiment once gives you a **baseline**: a first measurement of how good (or bad) your agent's tool selection behavior is.
But the real power comes from **using these results to improve your agent** โ and then **re-running the experiment** to measure progress.
Hereโs how you can use this workflow:
You can also edit a prompt by clicking on the prompt name in the library and clicking `Edit prompt`.
Further details on using prompts from the Prompt library are provided in the following sections.
#### Using prompts in supported integrations
Prompts can be used in all supported third-party integrations by attaching them to traces and spans through the [`opik_context` module](/docs/opik/prompt_engineering/prompt_management#adding-prompts-to-traces-and-spans).
For instance, you can use prompts with the `Google ADK` integration, as shown in the [example here](/docs/opik/integrations/adk#prompts-integration).
#### Downloading your text prompts
Once a prompt is created in the library, you can download it in code using the [`Opik.get_prompt`](https://www.comet.com/docs/opik/python-sdk-reference/Opik.html#opik.Opik.get_prompt) method:
```python
import opik
opik.configure()
client = opik.Opik()
# Get a dataset
dataset = client.get_or_create_dataset("test_dataset")
# Get the prompt
prompt = client.get_prompt(name="prompt-summary")
# Create the prompt message
prompt.format(text="Hello, world!")
```
If you are not using the SDK, you can download a prompt by using the [REST API](/reference/rest-api/overview).
#### Searching prompts
To discover prompts by name substring and/or filters, use [`search_prompts`](https://www.comet.com/docs/opik/python-sdk-reference/Opik.html#opik.Opik.search_prompts). Filters are written in Opik Query Language (OQL):
```python
import opik

client = opik.Opik()

# Search by name substring only
latest_versions = client.search_prompts(
    filter_string='name contains "summary"'
)

# Search by name substring and tags filter
filtered = client.search_prompts(
    filter_string='name contains "summary" AND tags contains "alpha" AND tags contains "beta"',
)

# Search for only text prompts
text_prompts = client.search_prompts(
    filter_string='template_structure = "text"'
)

for prompt in filtered:
    print(prompt.name, prompt.commit, prompt.prompt)
```
You can filter by `template_structure` to search for only text prompts (`"text"`) or only chat prompts (`"chat"`). Without the filter, `search_prompts` returns both types.
The `filter_string` parameter uses Opik Query Language (OQL), so you can combine conditions such as `name contains "..."`, `tags contains "..."`, and `template_structure = "..."` with `AND`, as shown above.
### Multimodal chat prompts
Chat prompts support multimodal content, allowing you to include images and videos alongside text. This is useful for vision-enabled models.
You can also edit a chat prompt by clicking on the prompt name in the library and clicking `Edit prompt`.
### Comparing prompt versions in experiments
*All of the conversations from the playground are logged to the `playground` project so that you can easily refer back to them later.*
## Configuring the prompt playground
The playground supports the following LLM providers:
* OpenAI
* Anthropic
* OpenRouter
* Gemini
* Vertex AI
* Azure OpenAI
* Amazon Bedrock
* LM Studio (coming soon)
* vLLM / Ollama / any other OpenAI API-compliant provider
### When to Use
Use the Prompt Generator when you:
* Are starting a new prompt from scratch
* Need inspiration for prompt structure
* Want to ensure your prompt follows industry best practices
* Are unsure how to phrase your task optimally
### System Prompt Used
Opik uses the following system prompt to generate new prompts:
```text
You are an expert prompt engineer. Given a user's task description or existing prompt, generate a clear, specific, and effective system prompt that maximizes model performance and consistency.
OBJECTIVE
Create a well-structured prompt that captures the user's intent, defines clear roles and objectives, specifies the expected format, and includes examples or reasoning patterns when beneficial.
CONSTRUCTION PRINCIPLES (in priority order)
1. Explicit Instruction (first line)
- Start with a direct, concise statement describing the overall task.
- The instruction must appear before any context or explanation.
2. Role Definition
- "You are a [role] specializing in [expertise]."
- Keep it to one sentence unless the domain demands elaboration.
3. Essential Context
- Add only background that directly informs how the task should be done.
- Skip generic or motivational context.
4. Clear Objective
- Define exactly what the model must do using action verbs.
- When applicable, outline the reasoning-before-conclusion order.
5. Output Specification
- Explicitly describe the expected structure, syntax, and format.
- Prefer deterministic formats when possible.
6. Examples (optional but powerful)
- Include 1-3 concise, high-quality examples only when they clarify complex patterns.
- Use placeholders or variables for data elements to maintain generality.
7. Key Constraints
- List critical limitations as bullet points.
- Avoid redundant or obvious constraints.
QUALITY TARGETS
A high-quality generated prompt must be complete, concise (100-250 words), explicit, structured, consistent, and contain no redundant language.
```
## Prompt Improver
### What is the Prompt Improver?
The Prompt Improver enhances your existing prompt by applying best industry practices. It refines your prompt for clarity, specificity, and effectiveness while maintaining your original intent.
### When to Use
Use the Prompt Improver when you:
* Have a working prompt that could be better
* Want to eliminate ambiguity or vagueness
* Need to make your prompt more concise
* Want to ensure your prompt follows industry best practices
* Have received inconsistent results
### System Prompt Used
Opik uses the following system prompt to improve existing prompts:
```text
You are an expert prompt engineer. Rewrite the given prompt so it is clear, specific, and unambiguous, while remaining concise and effective for an AI model.
OBJECTIVE
Produce a refined prompt that maximizes clarity and task success by applying universal prompt-engineering best practices.
CORE OPTIMIZATION PRIORITIES
1. Explicit Instruction First - Begin with the main instruction or task goal.
2. Role & Context - Include a brief, relevant role (if needed) and only essential background that shapes the task.
3. Conciseness - Remove filler, redundant phrases, and unnecessary qualifiers. Every word must serve purpose.
4. Specific Task Definition - Use precise, action-oriented verbs.
5. Output Schema or Format - Define the response format clearly.
6. Constraints - Include only key limitations. Avoid over-specification.
7. Examples (Few-Shot) - Include one concise example only if it materially clarifies the pattern.
8. Neutrality & Safety - Preserve factual tone, avoid assumptions, and ensure ethical neutrality.
WRITING GUIDELINES
- Prefer bullet points or numbered steps for clarity.
- Use positive instructions ("Do X") instead of negative ("Don't do X").
- Avoid vague words ("things," "somehow," "etc.").
- Combine related ideas into single, efficient statements.
- Keep structure readable with delimiters or sections when logical.
- When rephrasing variables, retain their exact identifiers.
- Never invent new variables unless explicitly required.
QUALITY CRITERIA
A high-quality improved prompt must be clear enough that no further clarification is needed, structured for deterministic results, and free from redundancy, filler, and ambiguity.
```
## Advanced Optimization
For production-critical prompts requiring systematic optimization with automated evaluation and dataset integration, explore the [Opik Agent Optimizer](/agent_optimization/overview).
# Opik's MCP server
> Configure Opik's MCP server with Cursor IDE to manage prompts and analyze traces efficiently.
Opik's [MCP server](https://github.com/comet-ml/opik-mcp) allows you to integrate
your IDE with Opik not just for prompt management, but also to access and
analyze traces.
## Setting up the MCP server
Once the MCP server is configured (see below), you can test it out by asking Cursor:
`What is the latest trace available in Opik?`
To configure the Opik server, create an `mcp_config.json` file and add the following (make sure to replace the API key placeholder with your own key):
```json wordWrap title="mcp_config.json"
{
"mcpServers": {
"opik": {
"command": "npx",
"args": [
"-y",
"opik-mcp",
"--apiKey",
"YOUR_API_KEY"
]
}
}
}
```
# Pytest integration
> Monitor your LLM applications' performance by using Opik's Pytest integration to track test results and ensure reliability before deployment.
Ensuring your LLM application is working as expected is a crucial step before deploying to production. Opik provides a Pytest integration so that you can easily track the overall pass/fail rate of your tests as well as the individual pass/fail rate of each test.
## Using the Pytest Integration
We recommend using the `llm_unit` decorator to wrap your tests. This will ensure that Opik can track the results of your tests and provide you with a detailed report. It also works well when used in conjunction with the `track` decorator used to trace your LLM application.
```python
import pytest
from opik import track, llm_unit


@track
def llm_application(user_question: str) -> str:
    # LLM application code here
    return "Paris"


@llm_unit()
def test_simple_passing_test():
    user_question = "What is the capital of France?"
    response = llm_application(user_question)
    assert response == "Paris"
```
When you run the tests, Opik will create a new experiment for each run and log each test result. By navigating to the `tests` dataset, you will see a new experiment for each test run.
In addition to viewing scores over time, you can also view the average feedback scores for all the traces in your project from the traces table.
## Logging feedback scores
To monitor the performance of your LLM application, you can log feedback scores using the [Python SDK and through the UI](/tracing/annotate_traces).
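For example, when your application already uses the `@track` decorator, a score can be attached to the current trace from inside the traced function. The snippet below is a minimal sketch; the score name `user_feedback` and the thumbs-up mapping are illustrative choices.
```python
from opik import opik_context, track


@track
def answer_question(question: str, thumbs_up: bool) -> str:
    response = "Paris"  # placeholder for your LLM call

    # Attach a feedback score to the trace created by @track
    opik_context.update_current_trace(
        feedback_scores=[
            {"name": "user_feedback", "value": 1.0 if thumbs_up else 0.0}
        ]
    )
    return response


answer_question("What is the capital of France?", thumbs_up=True)
```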
### Defining online evaluation metrics
You can define LLM as a Judge metrics in the Opik platform that will automatically score all, or a subset, of your production traces. You can find more information about how to define LLM as a Judge metrics in the [Online evaluation](/production/rules) section.
Once a rule is defined, Opik will score all the traces in the project and allow you to track these feedback scores over time.
When creating a new rule, you will be presented with the following options:
1. **Name:** The name of the rule
2. **Sampling rate:** The percentage of traces to score. When set to `100%`, all traces will be scored.
3. **Model:** The model to use to run the LLM as a Judge metric. For evaluating traces with images, make sure to select a model that supports vision capabilities.
4. **Prompt:** The LLM as a Judge prompt to use. Opik provides a set of base prompts (Hallucination, Moderation, Answer Relevance) that you can use or you can define your own. Variables in the prompt should be in `{{variable_name}}` format.
5. **Variable mapping:** This is the mapping of the variables in the prompt to the values from the trace.
6. **Score definition:** This is the format of the output of the LLM as a Judge metric. By adding more than one score, you can define LLM as a Judge metrics that score an LLM output along different dimensions.
### Opik's built-in LLM as a Judge metrics
Opik comes pre-configured with 3 different LLM as a Judge metrics:
1. Hallucination: This metric checks if the LLM output contains any hallucinated information.
2. Moderation: This metric checks if the LLM output contains any offensive content.
3. Answer Relevance: This metric checks if the LLM output is relevant to the given context.
When writing your own LLM as a Judge metric, you will need to specify the prompt variables using the mustache syntax, i.e. `{{ variable_name }}`. You can then map these variables to your trace data using the `variable_mapping` parameter. When the rule is executed, Opik will replace the variables with the values from the trace data.
You can control the format of the output using the `Scoring definition` parameter. This is where you can define the scores you want the LLM as a Judge metric to return. Under the hood, we will use this definition in conjunction with the [structured outputs](https://platform.openai.com/docs/guides/structured-outputs) functionality to ensure that the LLM as a Judge metric always returns trace scores.
### Evaluating traces with images
LLM as a Judge metrics can evaluate traces that contain images when using vision-capable models. This is useful for:
* Evaluating image generation quality
* Analyzing visual content in multimodal applications
* Validating image-based responses
To reference image data from traces in your evaluation prompts:
1. In the prompt editor, click the **"Images +"** button to add an image variable
2. Map the image variable to the trace field containing image data using the Variable Mapping section
We have built-in templates for the LLM as a Judge metrics that you can use to score the entire conversation:
1. **Conversation Coherence:** This metric checks if the conversation is coherent and follows a logical flow, returning a decimal score between 0 and 1.
2. **User Frustration:** This metric checks if the user is frustrated with the conversation, returning a decimal score between 0 and 1.
3. **Custom LLM as a Judge metrics:** You can use this template to score the entire conversation using your own LLM as a Judge metric. By default, this template uses binary scoring (true/false) following best practices.
For the LLM as a Judge metrics, keep in mind that the only variable available is `{{context}}`, which contains the entire conversation as a list of messages:
```json
[
{
"role": "user",
"content": "Hello, how are you?"
},
{
"role": "assistant",
"content": "I'm good, thank you!"
}
]
```
Similarly, for the Python metrics, you have the `Conversation` object available to you. This object is a `List[Dict]` where each dict represents a message in the conversation.
```python
[
{
"role": "user",
"content": "Hello, how are you?"
},
{
"role": "assistant",
"content": "I'm good, thank you!"
}
]
```
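If you are writing a code-based metric over this structure, the logic boils down to iterating over the message list. The sketch below is a generic example of such scoring logic rather than Opik's exact metric interface; it assumes a plain function that receives the list of message dicts and returns a score between 0 and 1.
```python
from typing import Dict, List


def politeness_score(conversation: List[Dict]) -> float:
    """Toy heuristic: fraction of assistant messages containing a courtesy phrase."""
    courtesy_phrases = ("thank", "please", "you're welcome")

    assistant_messages = [
        message["content"].lower()
        for message in conversation
        if message["role"] == "assistant"
    ]
    if not assistant_messages:
        return 0.0

    polite = sum(
        any(phrase in content for phrase in courtesy_phrases)
        for content in assistant_messages
    )
    return polite / len(assistant_messages)


conversation = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm good, thank you!"},
]
print(politeness_score(conversation))  # 1.0
```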
The sampled threads are scored only after the threads are marked as inactive. This is to ensure that the scoring is done on the full context of the conversation.
# Guardrails
> Monitor your LLM calls with guardrails in Opik to prevent risks and ensure safe application outputs.
# How it works
Conceptually, we need to determine the presence of a series of risks for each input and output, and take action on it.
The ideal method depends on the type of problem and aims to strike the best balance between accuracy, latency, and cost.
There are three commonly used methods:
1. **Heuristics or traditional NLP models**: ideal for checking for PII or competitor mentions
2. **Small language models**: ideal for staying on topic
3. **Large language models**: ideal for detecting complex issues like hallucination
# Types of guardrails
Providers like OpenAI or Anthropic have built-in guardrails for risks like harmful or malicious content; these are generally sufficient for the vast majority of users.
The Opik Guardrails aim to cover the residual risks, which are often very user-specific and need to be configured in more detail.
## PII guardrail
The PII guardrail checks for sensitive information, such as name, age, address, email, phone number, or credit card details.
The specific entities can be configured in the SDK call; see the reference documentation for more details.
*The method used here leverages traditional NLP models for tokenization and named entity recognition.*
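As a quick illustration, the snippet below shows how the checked entities might be narrowed down in the SDK call; the entity names follow the same convention used in the Getting started example further down, and your deployment may support additional ones.
```python
from opik.guardrails import Guardrail, PII

# Only flag credit card numbers and person names; other PII types are ignored
guardrail = Guardrail(
    guards=[
        PII(blocked_entities=["CREDIT_CARD", "PERSON"]),
    ]
)
```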
## Topic guardrail
The topic guardrail ensures that the inputs and outputs remain on topic.
You can configure the allowed or disallowed topics in the SDK call; see the reference documentation for more details.
*This guardrail relies on a small language model, specifically a zero-shot classifier.*
## Custom guardrail
The custom guardrail option allows you to define your own guardrails using a custom model, a custom library, or custom business logic, and log the response to Opik. Below is a basic example that filters out competitor brands:
```python
import opik
import opik.opik_context
import traceback

# Brand mention detection
competitor_brands = [
    "OpenAI",
    "Anthropic",
    "Google AI",
    "Microsoft Copilot",
    "Amazon Bedrock",
    "Hugging Face",
    "Mistral AI",
    "Meta AI",
]

opik_client = opik.Opik()


def custom_guardrails(generation: str, trace_id: str) -> str:
    # Start the guardrail span first so the duration is accurately captured
    guardrail_span = opik_client.span(
        name="Guardrail",
        input={"generation": generation},
        type="guardrail",
        trace_id=trace_id,
    )

    # Custom guardrail logic - detect competitor brand mentions
    found_brands = []
    for brand in competitor_brands:
        if brand.lower() in generation.lower():
            found_brands.append(brand)

    # The key `guardrail_result` is required by Opik guardrails and must be either "passed" or "failed"
    if found_brands:
        guardrail_result = "failed"
        output = {"guardrail_result": guardrail_result, "found_brands": found_brands}
    else:
        guardrail_result = "passed"
        output = {"guardrail_result": guardrail_result}

    # Log the spans
    guardrail_span.end(output=output)

    # Upload the guardrail data for project-level metrics
    guardrail_data = {
        "project_name": opik_client._project_name,
        "entity_id": trace_id,
        "secondary_id": guardrail_span.id,
        "name": "TOPIC",  # Supports either "TOPIC" or "PII"
        "result": guardrail_result,
        "config": {"blocked_brands": competitor_brands},
        "details": output,
    }
    try:
        opik_client.rest_client.guardrails.create_guardrails(guardrails=[guardrail_data])
    except Exception:
        traceback.print_exc()

    return generation


@opik.track
def main():
    good_generation = "You should use our AI platform for your machine learning projects!"
    custom_guardrails(good_generation, opik.opik_context.get_current_trace_data().id)

    bad_generation = "You might want to try OpenAI or Google AI for your project instead."
    custom_guardrails(bad_generation, opik.opik_context.get_current_trace_data().id)


if __name__ == "__main__":
    main()
```
After running the custom guardrail example above, you can view the results in the Opik dashboard. The guardrail spans will appear alongside your traces, showing which brand mentions were detected and whether the guardrail passed or failed.
# Getting started
## Running the guardrail backend
You can start the guardrails backend by running:
```bash
./opik.sh --guardrails
```
## Using the Python SDK
```python
from opik.guardrails import Guardrail, PII, Topic
from opik import exceptions

guardrail = Guardrail(
    guards=[
        Topic(restricted_topics=["finance", "health"], threshold=0.9),
        PII(blocked_entities=["CREDIT_CARD", "PERSON"]),
    ]
)

llm_response = "You should buy some NVIDIA stocks!"

try:
    guardrail.validate(llm_response)
except exceptions.GuardrailValidationFailed as e:
    print(e)
```
The immediate result of a guardrail failure is an exception, and your application code will need to handle it.
The call is blocking, since the main purpose of the guardrail is to prevent the application from proceeding with a potentially undesirable response.
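In practice that usually means wrapping the LLM call and substituting a safe fallback when validation fails. The sketch below shows one way to structure this, reusing the `guardrail` object from the example above; `call_llm`, the example response, and the fallback message are illustrative placeholders.
```python
from opik import exceptions


def call_llm(user_question: str) -> str:
    # Placeholder for your actual LLM call
    return "You should buy some NVIDIA stocks!"


def answer(user_question: str) -> str:
    llm_response = call_llm(user_question)
    try:
        # Blocking call: raises if any guard fails
        guardrail.validate(llm_response)
    except exceptions.GuardrailValidationFailed:
        # Don't return the flagged response to the user
        return "Sorry, I can't help with that request."
    return llm_response
```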
### Guarding streaming responses and long inputs
You can call `guardrail.validate` repeatedly to validate the response chunk by chunk, or on larger parts or combinations of chunks.
The results will be added as additional spans to the same trace.
```python
for chunk in response:
    try:
        guardrail.validate(chunk)
    except exceptions.GuardrailValidationFailed as e:
        print(e)
```
## Working with the results
### Examining specific traces
When a guardrail fails on an LLM call, Opik automatically adds the information to the trace.
You can filter the traces in your project to only view those that have failed the guardrails.
### Analyzing trends
You can also view how often each guardrail is failing in the Metrics section of the project.
## Performance and limit considerations
The guardrails backend will use a GPU automatically if there is one available.
For production use, running the guardrails backend on a GPU node is strongly recommended.
Current limits:
* Topic guardrail: the maximum input size is 1024 tokens
* Both the Topic and PII guardrails currently support the English language
# Anonymizers
> Protect sensitive information in your LLM applications with Opik's Anonymizers, ensuring compliance and preventing accidental data exposure.
# How it works
Anonymizers work by processing all data that flows through Opik's tracing system - including inputs, outputs, and metadata - before it's stored or displayed. They apply a set of rules to detect and replace sensitive information with anonymized placeholders.
The anonymization happens automatically and transparently:
1. **Data Ingestion**: When you log traces and spans to Opik
2. **Rule Application**: Registered anonymizers scan the data using their configured rules
3. **Replacement**: Sensitive information is replaced with anonymized placeholders
4. **Storage**: Only the anonymized data is stored in Opik
# Types of Anonymizers
## Rules-based Anonymizer
The most common type of anonymizer uses pattern-matching rules to identify and replace sensitive information. Rules can be defined in several formats:
### Regex Rules
Use regular expressions to match specific patterns:
```python
import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Dictionary format
email_rule = {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "replace": "[EMAIL]"}

# Tuple format
phone_rule = (r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]")

# Create anonymizer with multiple rules
anonymizer = create_anonymizer([email_rule, phone_rule])

# Register globally
opik.hooks.add_anonymizer(anonymizer)
```
### Function Rules
Use custom Python functions for more complex anonymization logic:
```python
import opik
import opik.hooks
from opik.anonymizer import create_anonymizer


def mask_api_keys(text: str) -> str:
    """Custom function to anonymize API keys"""
    import re

    # Match common API key patterns
    api_key_pattern = r'\b(sk-[a-zA-Z0-9]{32,}|pk_[a-zA-Z0-9]{24,})\b'
    return re.sub(api_key_pattern, '[API_KEY]', text)


def anonymize_with_hash(text: str) -> str:
    """Replace emails with consistent hashes for tracking without exposing PII"""
    import re
    import hashlib

    def hash_replace(match):
        email = match.group(0)
        hash_val = hashlib.md5(email.encode()).hexdigest()[:8]
        return f"[EMAIL_{hash_val}]"

    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
    return re.sub(email_pattern, hash_replace, text)


# Create anonymizer with function rules
anonymizer = create_anonymizer([mask_api_keys, anonymize_with_hash])
opik.hooks.add_anonymizer(anonymizer)
```
### Mixed Rules
Combine different rule types for comprehensive anonymization:
```python
import opik
import opik.hooks
from opik.anonymizer import create_anonymizer
# Mix of dictionary, tuple, and function rules
mixed_rules = [
{"regex": r"\b\d{3}-\d{2}-\d{4}\b", "replace": "[SSN]"}, # Social Security Numbers
(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[CARD]"), # Credit Cards
lambda text: text.replace("CONFIDENTIAL", "[REDACTED]"), # Custom replacements
]
anonymizer = create_anonymizer(mixed_rules)
opik.hooks.add_anonymizer(anonymizer)
```
## Custom Anonymizers
For advanced use cases, create custom anonymizers by extending the `Anonymizer` base class.
### Understanding Anonymizer Arguments
When implementing custom anonymizers, you need to implement the `anonymize()` method with the following signature:
```python
def anonymize(self, data, **kwargs):
    # Your anonymization logic here
    return anonymized_data
```
**The `kwargs` parameters:**
The `anonymize()` method also receives additional context through `**kwargs`:
* **`field_name`**: Indicates which field is being anonymized (`"input"`, `"output"`, `"metadata"`, or nested field names in dot notation, such as `"metadata.email"`)
* **`object_type`**: The type of the object being processed (`"span"`, `"trace"`)
**When are kwargs available?**
These kwargs are automatically passed by Opik's internal data processors when anonymizing trace and span data before sending it to the backend. This allows you to apply different anonymization strategies based on the field being processed.
**Example: Field-specific anonymization**
```python
from opik.anonymizer import Anonymizer
import opik.hooks


class FieldAwareAnonymizer(Anonymizer):
    def anonymize(self, data, **kwargs):
        field_name = kwargs.get("field_name", "")

        # Only anonymize the output field, leave input as-is for debugging
        if field_name == "output" and isinstance(data, str):
            import re

            # More aggressive anonymization for outputs
            data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', '[EMAIL]', data)
            data = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', '[PHONE]', data)
        elif field_name == "metadata" and isinstance(data, dict):
            # Remove specific metadata fields entirely
            sensitive_keys = ["user_id", "session_token", "api_key"]
            for key in sensitive_keys:
                if key in data:
                    data[key] = "[REDACTED]"

        return data


# Register the field-aware anonymizer
opik.hooks.add_anonymizer(FieldAwareAnonymizer())
```
Whichever approach you choose, implement your redaction logic inside `anonymize()` or a function rule, then return the redacted data back to Opik.
## Creating an alert
### Prerequisites
* Access to the Opik Configuration page
* A webhook endpoint that can receive HTTP POST requests
* (Optional) An HTTPS endpoint with valid SSL certificate for production use
### Step-by-step guide
1. **Navigate to Alerts**
   * Go to Configuration → Alerts tab
* Click "Create new alert" button
2. **Configure basic settings**
* **Name**: Give your alert a descriptive name (e.g., "Production Errors Slack")
* **Enable alert**: Toggle on to activate the alert immediately
3. **Configure webhook settings**
* **Destination**: Select the alert destination type:
* **General**: For custom webhooks, no-code automation platforms, or middleware services
* **Slack**: For native Slack webhook integration (automatically formats messages for Slack)
* **PagerDuty**: For native PagerDuty integration (automatically formats events for PagerDuty)
* **Endpoint URL**: Enter your webhook URL (must start with `http://` or `https://`)
* For Slack: Use your Slack Incoming Webhook URL (e.g., `https://hooks.slack.com/services/...`)
* For PagerDuty: Use your PagerDuty Events API v2 integration URL (e.g., `https://events.pagerduty.com/v2/enqueue`)
* For General: Use any HTTP endpoint that can receive POST requests
4. **Advanced webhook settings** (optional)
* **Secret token**: Add a secret token to verify webhook authenticity (recommended for General destination)
* **Custom headers**: Add HTTP headers for authentication or routing
* Example: `X-Custom-Auth: Bearer your-token-here`
5. **Add triggers**
* Click "Add trigger" to select event types
* Choose one or more event types from the list
* Configure project scope for observability events (optional)
* For threshold-based alerts (errors, cost, latency, feedback scores):
* **Threshold**: Set the threshold value that triggers the alert
* **Operator**: Choose comparison operator (`>`, `<`) for feedback score alerts
* **Window**: Configure the time window in seconds for metric aggregation
* **Feedback Score Name**: Select which feedback score to monitor (for feedback score alerts only)
6. **Test your configuration**
* Click "Test connection" to send a sample webhook
* Verify your endpoint receives the test payload
* Check the response status in the Opik UI
7. **Create the alert**
* Click "Create alert" to save your configuration
* The alert will start monitoring events immediately
## Integration examples
Opik supports three main approaches for integrating alerts with external systems:
1. **Native integrations** (Slack, PagerDuty): Use built-in formatting for popular services - no middleware required
2. **General webhooks**: Send alerts to custom endpoints, no-code platforms, or middleware services
3. **Middleware services** (Optional): Add custom logic, routing, or transformations before forwarding to destinations
### Slack integration (Native)
Opik provides native Slack integration that automatically formats alert messages for Slack's Block Kit format.
#### Prerequisites
* [Create a Slack app and enable Incoming Webhooks](https://docs.slack.dev/messaging/sending-messages-using-incoming-webhooks/)
* Generate a webhook URL (e.g., `https://hooks.slack.com/services/T00000000/B00000000/XXXX`)
#### Setup steps
1. **In Slack**:
* Create a Slack app in your workspace
* Enable Incoming Webhooks
* Add the webhook to your desired channel
* Copy the webhook URL
2. **In Opik**:
   * Go to Configuration → Alerts tab
* Click "Create new alert"
* Give your alert a descriptive name
* Select **Slack** as the destination type
* Paste your Slack webhook URL in the Endpoint URL field
* Add triggers for the events you want to monitor
* Click "Test connection" to verify
* Click "Create alert"
Opik will automatically format all alert payloads into Slack-compatible messages with rich formatting, including:
* Alert name and event type
* Event count and details
* Relevant metadata
* Links to view full details in Opik
### PagerDuty integration (Native)
Opik provides native PagerDuty integration that automatically formats alert events for PagerDuty's Events API v2.
#### Prerequisites
* A PagerDuty account with permission to create integrations
* Access to a service where you want to receive alerts
#### Setup steps
1. **In PagerDuty**:
   * Navigate to Services → select your service → Integrations tab
* Click "Add Integration"
* Select "Events API V2"
* Give the integration a name (e.g., "Opik Alerts")
* Save the integration and copy the Integration Key
2. **In Opik**:
   * Go to Configuration → Alerts tab
* Click "Create new alert"
* Give your alert a descriptive name
* Select **PagerDuty** as the destination type
* Enter the PagerDuty Events API v2 endpoint: `https://events.pagerduty.com/v2/enqueue`
* In the **Routing Key** field, enter your PagerDuty Integration Key (this field appears when PagerDuty is selected as the destination)
* Add triggers for the events you want to monitor
* Click "Test connection" to verify
* Click "Create alert"
Opik will automatically format all alert payloads into PagerDuty-compatible events with:
* Severity levels based on event type
* Detailed event information
* Custom fields for filtering and routing
* Deduplication keys to prevent duplicate incidents
### Custom integration with middleware service (Optional)
For more complex integrations or custom formatting requirements, you can use a middleware service to transform Opik's payload before sending it to your destination. This approach works with any destination type (General, Slack, or PagerDuty).
#### When to use middleware
* **Custom message formatting**: Transform payload structure or add custom fields
* **Multi-destination routing**: Send alerts to different endpoints based on event type
* **Additional processing**: Enrich alerts with data from other systems
* **Legacy systems**: Adapt Opik alerts to older webhook formats
#### Example middleware for Slack with custom formatting
```python
import requests
from flask import Flask, request

app = Flask(__name__)

# Replace with your Slack Incoming Webhook URL
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T00000000/B00000000/XXXX"


def transform_to_slack(opik_payload):
    event_type = opik_payload.get('eventType')
    alert_name = opik_payload['payload']['alertName']
    event_count = opik_payload['payload']['eventCount']

    # Custom formatting logic
    return {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"🚨 {alert_name}"
                }
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{event_count}* new `{event_type}` events"
                }
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "View in Opik: https://www.comet.com/opik"
                }
            },
            {
                "type": "section",
                "fields": [
                    {
                        "type": "mrkdwn",
                        "text": "*Environment:*\nProduction"
                    },
                    {
                        "type": "mrkdwn",
                        "text": "*Priority:*\nHigh"
                    }
                ]
            }
        ]
    }


@app.route('/opik-to-slack', methods=['POST'])
def opik_to_slack():
    opik_data = request.json
    slack_payload = transform_to_slack(opik_data)

    # Forward to Slack
    requests.post(
        SLACK_WEBHOOK_URL,
        json=slack_payload
    )
    return {'status': 'success'}, 200
```
#### Setup for middleware approach
1. Deploy your middleware service to a publicly accessible endpoint
2. In Opik, create an alert with destination type **General**
3. Use your middleware service URL as the Endpoint URL
4. Configure your middleware to forward to the final destination (Slack, PagerDuty, etc.)
### Using no-code automation platforms
No-code automation tools like [n8n](https://n8n.io), [Make.com](https://www.make.com), and [IFTTT](https://ifttt.com) provide an easy way to connect Opik alerts to other services without writing or deploying code. These platforms can receive webhooks from Opik, apply filters or conditions, and trigger actions such as sending Slack messages, logging data in Google Sheets, or creating incidents in PagerDuty.
**To use them:**
1. **Create a new workflow or scenario** and add a **Webhook trigger** node/module
2. **Copy the webhook URL** generated by the platform
3. **In Opik**, create an alert with destination type **General** and paste the webhook URL from your automation platform
4. **Secure the connection** by validating the Authorization header or including a secret token parameter
5. **Add filters or routing logic** to handle different `eventType` values from Opik (for example, `trace:errors` or `trace:feedback_score`)
6. **Chain the desired actions**, such as notifications, database updates, or analytics tracking
These tools also provide built-in monitoring, retries, and visual flow editors, making them suitable for both technical and non-technical users who want to automate Opik alert handling securely and efficiently. This approach works well when you need to route alerts to multiple destinations or apply complex business logic.
### Custom dashboard integration
Build a custom monitoring dashboard that receives alerts using the **General** destination type:
```python
from collections import Counter
from datetime import datetime

from fastapi import FastAPI, Request

app = FastAPI()

# In-memory storage (use a database in production)
alert_history = []


def group_by_type(alerts):
    # Count alerts per event type for the dashboard summary
    return Counter(alert['event_type'] for alert in alerts)


@app.post("/webhook")
async def receive_webhook(request: Request):
    data = await request.json()

    # Store alert
    alert_history.append({
        'timestamp': datetime.utcnow(),
        'event_type': data.get('eventType'),
        'alert_name': data['payload']['alertName'],
        'event_count': data['payload']['eventCount'],
        'data': data
    })

    # Keep only last 1000 alerts
    if len(alert_history) > 1000:
        alert_history.pop(0)

    return {"status": "success"}


@app.get("/dashboard")
async def get_dashboard():
    # Return aggregated statistics
    return {
        'total_alerts': len(alert_history),
        'by_type': group_by_type(alert_history),
        'recent_alerts': alert_history[-10:]
    }
```
## Supported event types
Opik supports ten types of alert events:
### Observability events
**Trace errors threshold exceeded**
* **Event type**: `trace:errors`
* **Triggered when**: Total trace error count exceeds the specified threshold within a time window
* **Project scope**: Can be configured to specific projects
* **Configuration**: Requires threshold value (error count) and time window (in seconds)
* **Payload**: Metrics alert payload with error count details
* **Use case**: Proactive error monitoring, detect error spikes, prevent system degradation
**Trace feedback score threshold exceeded**
* **Event type**: `trace:feedback_score`
* **Triggered when**: Average trace feedback score meets the specified threshold criteria within a time window
* **Project scope**: Can be configured to specific projects
* **Configuration**: Requires feedback score name, threshold value, operator (`>`, `<`), and time window
* **Payload**: Metrics alert payload with average feedback score details
* **Use case**: Track model performance, monitor user satisfaction, detect quality degradation
**Thread feedback score threshold exceeded**
* **Event type**: `trace_thread:feedback_score`
* **Triggered when**: Average thread feedback score meets the specified threshold criteria within a time window
* **Project scope**: Can be configured to specific projects
* **Configuration**: Requires feedback score name, threshold value, operator (`>`, `<`), and time window
* **Payload**: Metrics alert payload with average feedback score details
* **Use case**: Monitor conversation quality, track multi-turn interactions, detect thread satisfaction issues
**Guardrails triggered**
* **Event type**: `trace:guardrails_triggered`
* **Triggered when**: A guardrail check fails for a trace
* **Project scope**: Can be configured to specific projects
* **Payload**: Array of guardrail result objects
* **Use case**: Security monitoring, compliance tracking, PII detection
**Cost threshold exceeded**
* **Event type**: `trace:cost`
* **Triggered when**: Total trace cost exceeds the specified threshold within a time window
* **Project scope**: Can be configured to specific projects
* **Configuration**: Requires threshold value (in currency units) and time window (in seconds)
* **Payload**: Metrics alert payload with cost details
* **Use case**: Budget monitoring, cost control, prevent runaway spending
**Latency threshold exceeded**
* **Event type**: `trace:latency`
* **Triggered when**: Average trace latency exceeds the specified threshold within a time window
* **Project scope**: Can be configured to specific projects
* **Configuration**: Requires threshold value (in seconds) and time window (in seconds)
* **Payload**: Metrics alert payload with latency details
* **Use case**: Performance monitoring, SLA compliance, user experience tracking
### Prompt engineering events
**New prompt added**
* **Event type**: `prompt:created`
* **Triggered when**: A new prompt is created in the prompt library
* **Project scope**: Workspace-wide
* **Payload**: Prompt object with metadata
* **Use case**: Track prompt library changes, audit prompt creation
**New prompt version created**
* **Event type**: `prompt:committed`
* **Triggered when**: A new version (commit) is added to a prompt
* **Project scope**: Workspace-wide
* **Payload**: Prompt version object with template and metadata
* **Use case**: Monitor prompt iterations, track version history
**Prompt deleted**
* **Event type**: `prompt:deleted`
* **Triggered when**: A prompt is removed from the prompt library
* **Project scope**: Workspace-wide
* **Payload**: Array of deleted prompt objects
* **Use case**: Audit prompt deletions, maintain prompt governance
### Evaluation events
**Experiment finished**
* **Event type**: `experiment:finished`
* **Triggered when**: An experiment completes in the workspace
* **Project scope**: Workspace-wide
* **Payload**: Array of experiment objects with completion details
* **Use case**: Automate experiment notifications, track evaluation completions
### Want us to support more event types?
If you need additional event types for your use case, please [create an issue on GitHub](https://github.com/comet-ml/opik/issues/new?title=Alert%20Event%20Request%3A%20%3Cevent-name%3E\&labels=enhancement) and let us know what you'd like to monitor.
## Webhook payload structure
All webhook events follow a consistent payload structure:
```json
{
"id": "webhook-event-id",
"eventType": "trace:errors",
"alertId": "alert-uuid",
"alertName": "Production Errors Alert",
"workspaceId": "workspace-uuid",
"createdAt": "2025-01-15T10:30:00Z",
"payload": {
"alertId": "alert-uuid",
"alertName": "Production Errors Alert",
"eventType": "trace:errors",
"eventIds": ["event-id-1", "event-id-2"],
"userNames": ["user@example.com"],
"eventCount": 2,
"aggregationType": "consolidated",
"message": "Alert 'Production Errors Alert': 2 trace:errors events aggregated",
"metadata": [
{
"id": "trace-uuid",
"name": "handle_query",
"project_id": "project-uuid",
"project_name": "Demo Project",
"start_time": "2025-01-15T10:29:45Z",
"end_time": "2025-01-15T10:29:50Z",
"input": {
"query": "User question"
},
"output": {
"response": "LLM response"
},
"error_info": {
"exception_type": "ValidationException",
"message": "Validation failed",
"traceback": "Full traceback..."
},
"metadata": {
"customer_id": "customer_123"
},
"tags": ["production"]
}
]
}
}
```
### Payload fields
| Field | Type | Description |
| ------------------------- | ----------------- | ------------------------------------------ |
| `id` | string | Unique webhook event identifier |
| `eventType` | string | Type of event (e.g., `trace:errors`) |
| `alertId` | string (UUID) | Alert configuration identifier |
| `alertName` | string | Name of the alert |
| `workspaceId` | string | Workspace identifier |
| `createdAt` | string (ISO 8601) | Timestamp when webhook was created |
| `payload.eventIds` | array | List of aggregated event IDs |
| `payload.userNames` | array | Users associated with the events |
| `payload.eventCount` | number | Number of aggregated events |
| `payload.aggregationType` | string | Always "consolidated" |
| `payload.metadata` | array | Event-specific data (varies by event type) |
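The consolidated structure means a single webhook can carry several events. The sketch below shows one way a handler might unpack these fields; it assumes the payload has already been parsed into a Python dict (for example via `request.json` in the handlers shown later), and the helper name is just an illustration.
```python
def summarize_webhook(data: dict) -> str:
    # Top-level fields identify the event type and the alert that fired
    event_type = data["eventType"]
    alert_name = data["alertName"]

    # The nested payload carries the aggregated events
    payload = data["payload"]
    event_count = payload["eventCount"]

    # metadata is an array for most events; metric alerts use a flat object (see the
    # event-specific payloads below), so normalize both shapes to a list
    metadata = payload.get("metadata", [])
    if isinstance(metadata, dict):
        metadata = [metadata]
    names = [item.get("name", "<unnamed>") for item in metadata]

    return f"{alert_name}: {event_count} {event_type} event(s) - {', '.join(names)}"
```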
## Event-specific payloads
### Trace errors threshold exceeded payload
```json
{
"metadata": {
"event_type": "TRACE_ERRORS",
"metric_name": "trace:errors",
"metric_value": "15",
"threshold": "10",
"window_seconds": "900",
"project_ids": "0198ec68-6e06-7253-a20b-d35c9252b9ba,0198ec68-6e06-7253-a20b-d35c9252b9bb",
"project_names": "Demo Project,Default Project"
}
}
```
### Trace feedback score threshold exceeded payload
```json
{
"metadata": {
"event_type": "TRACE_FEEDBACK_SCORE",
"metric_name": "trace:feedback_score",
"metric_value": "0.7500",
"threshold": "0.8000",
"window_seconds": "3600",
"project_ids": "0198ec68-6e06-7253-a20b-d35c9252b9ba,0198ec68-6e06-7253-a20b-d35c9252b9bb",
"project_names": "Demo Project,Default Project"
}
}
```
### Thread feedback score threshold exceeded payload
```json
{
"metadata": {
"event_type": "TRACE_THREAD_FEEDBACK_SCORE",
"metric_name": "trace_thread:feedback_score",
"metric_value": "0.7500",
"threshold": "0.8000",
"window_seconds": "3600",
"project_ids": "0198ec68-6e06-7253-a20b-d35c9252b9ba,0198ec68-6e06-7253-a20b-d35c9252b9bb",
"project_names": "Demo Project,Default Project"
}
}
```
### Prompt created payload
```json
{
"metadata": {
"id": "prompt-uuid",
"name": "Prompt Name",
"description": "Prompt description",
"tags": ["system", "assistant"],
"created_at": "2025-01-15T10:00:00Z",
"created_by": "user@example.com",
"last_updated_at": "2025-01-15T10:00:00Z",
"last_updated_by": "user@example.com"
}
}
```
### Prompt version created payload
```json
{
"metadata": {
"id": "version-uuid",
"prompt_id": "prompt-uuid",
"commit": "abc12345",
"template": "You are a helpful assistant. {{question}}",
"type": "mustache",
"metadata": {
"version": "1.0",
"model": "gpt-4"
},
"created_at": "2025-01-15T10:00:00Z",
"created_by": "user@example.com"
}
}
```
### Prompt deleted payload
```json
{
"metadata": [
{
"id": "prompt-uuid",
"name": "Prompt Name",
"description": "Prompt description",
"tags": ["deprecated"],
"created_at": "2025-01-10T10:00:00Z",
"created_by": "user@example.com",
"last_updated_at": "2025-01-15T10:00:00Z",
"last_updated_by": "user@example.com",
"latest_version": {
"id": "version-uuid",
"commit": "abc12345",
"template": "Template content",
"type": "mustache",
"created_at": "2025-01-15T10:00:00Z",
"created_by": "user@example.com"
}
}
]
}
```
### Guardrails triggered payload
```json
{
"metadata": [
{
"id": "guardrail-check-uuid",
"entity_id": "trace-uuid",
"project_id": "project-uuid",
"project_name": "Project Name",
"name": "PII",
"result": "failed",
"details": {
"detected_entities": ["EMAIL", "PHONE_NUMBER"],
"message": "PII detected in response: email and phone number"
}
}
]
}
```
### Experiment finished payload
```json
{
"metadata": [
{
"id": "experiment-uuid",
"name": "Experiment Name",
"dataset_id": "dataset-uuid",
"created_at": "2025-01-15T10:00:00Z",
"created_by": "user@example.com",
"last_updated_at": "2025-01-15T10:05:00Z",
"last_updated_by": "user@example.com",
"feedback_scores": [
{
"name": "accuracy",
"value": 0.92
},
{
"name": "latency",
"value": 1.5
}
]
}
]
}
```
### Cost threshold exceeded payload
```json
{
"metadata": {
"event_type": "TRACE_COST",
"metric_name": "trace:cost",
"metric_value": "150.75",
"threshold": "100.00",
"window_seconds": "3600",
"project_ids": "0198ec68-6e06-7253-a20b-d35c9252b9ba,0198ec68-6e06-7253-a20b-d35c9252b9bb",
"project_names": "Demo Project,Default Project"
}
}
```
### Latency threshold exceeded payload
```json
{
"metadata": {
"event_type": "TRACE_LATENCY",
"metric_name": "trace:latency",
"metric_value": "5250.5000",
"threshold": "5",
"window_seconds": "1800",
"project_ids": "0198ec68-6e06-7253-a20b-d35c9252b9ba,0198ec68-6e06-7253-a20b-d35c9252b9bb",
"project_names": "Demo Project,Default Project"
}
}
```
## Securing your webhooks
### Using secret tokens
Add a secret token to your webhook configuration to verify that incoming requests are from Opik:
1. Generate a secure random token (e.g., using `openssl rand -hex 32`)
2. Add it to your alert's "Secret token" field
3. Opik will send it in the `Authorization` header: `Authorization: Bearer your-secret-token`
4. Validate the token in your webhook handler before processing the request
### Example validation (Python/Flask)
```python
import hmac

from flask import Flask, request, abort

app = Flask(__name__)

SECRET_TOKEN = "your-secret-token-here"


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    # Verify the secret token
    auth_header = request.headers.get('Authorization', '')
    if not auth_header.startswith('Bearer '):
        abort(401, 'Missing or invalid Authorization header')

    token = auth_header.split(' ', 1)[1]
    if not hmac.compare_digest(token, SECRET_TOKEN):
        abort(401, 'Invalid secret token')

    # Process the webhook
    data = request.json
    event_type = data.get('eventType')

    # Handle different event types (the handle_* functions are your own)
    if event_type == 'trace:errors':
        handle_trace_errors(data)
    elif event_type == 'trace:feedback_score':
        handle_feedback_score(data)
    elif event_type == 'experiment:finished':
        handle_experiment_finished(data)

    return {'status': 'success'}, 200
```
### Using custom headers
You can add custom headers for additional authentication or routing:
```python
# In your webhook handler
api_key = request.headers.get('X-API-Key')
environment = request.headers.get('X-Environment')

if api_key != EXPECTED_API_KEY:
    abort(401, 'Invalid API key')

# Route to different handlers based on environment
if environment == 'production':
    handle_production_webhook(data)
else:
    handle_staging_webhook(data)
```
## Troubleshooting
### Webhooks not being delivered
**Check endpoint accessibility:**
* Ensure your endpoint is publicly accessible (if using cloud)
* Verify firewall rules allow incoming connections
* Test your endpoint with curl: `curl -X POST -H "Content-Type: application/json" -d '{"test": "data"}' https://your-endpoint.com/webhook`
**Check webhook configuration:**
* Verify the URL starts with `http://` or `https://`
* Check that the endpoint returns 2xx status codes
* Review custom headers for syntax errors
**Check alert status:**
* Ensure the alert is enabled
* Verify at least one trigger is configured
* Check that project scope matches your events (for observability events)
### Webhook timeouts
Opik expects webhooks to respond within the configured timeout (typically 30 seconds). If your endpoint takes longer:
**Optimize your handler:**
* Return a 200 response immediately
* Process the webhook asynchronously in the background
* Use a queue system (e.g., Celery, RabbitMQ) for long-running tasks
**Example async processing:**
```python
from threading import Thread

from flask import Flask, request

app = Flask(__name__)


def process_webhook_async(data):
    # Long-running processing (send_to_slack, etc. are your own functions)
    send_to_slack(data)
    update_dashboard(data)
    log_to_database(data)


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    data = request.json

    # Start background processing
    thread = Thread(target=process_webhook_async, args=(data,))
    thread.start()

    # Return immediately
    return {'status': 'accepted'}, 200
```
### Duplicate webhooks
If you receive duplicate webhooks:
**Check retry configuration:**
* Opik retries failed webhooks with exponential backoff
* Ensure your endpoint returns 2xx status codes on success
* Implement idempotency using the webhook `id` field
**Example idempotent handler:**
```python
# Use a persistent store (e.g. Redis or a database) in production
processed_webhook_ids = set()


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    data = request.json
    webhook_id = data.get('id')

    # Skip if already processed
    if webhook_id in processed_webhook_ids:
        return {'status': 'already_processed'}, 200

    # Process webhook
    process_alert(data)

    # Mark as processed
    processed_webhook_ids.add(webhook_id)

    return {'status': 'success'}, 200
```
### Events not triggering alerts
**Check event type matching:**
* Verify the alert has a trigger for this event type
* For observability events, check project scope configuration
* Review project IDs in trigger configuration
**Check workspace context:**
* Ensure events are logged to the correct workspace
* Verify the alert is in the same workspace as your events
**Check alert evaluation:**
* View backend logs for alert evaluation messages
* Confirm events are being published to the event bus
* Check Redis for alert buckets (self-hosted deployments)
### SSL certificate errors
If you see SSL certificate errors in logs:
**For development/testing:**
* Use self-signed certificates with proper configuration
* Or use HTTP endpoints (not recommended for production)
**For production:**
* Use valid SSL certificates from trusted CAs
* Ensure certificate chain is complete
* Check certificate expiry dates
* Use services like Let's Encrypt for free SSL
## Architecture and internals
Understanding Opik's alert architecture can help with troubleshooting and optimization.
### How alerts work
The Opik Alerts system monitors your workspace for specific events and sends consolidated webhook notifications to your configured endpoints. Here's the flow:
1. **Event occurs**: An event happens in your workspace (e.g., a trace error, prompt creation, guardrail trigger, new feedback score)
2. **Alert evaluation**: The system checks if any enabled alerts match this event type and evaluates threshold conditions (for metrics-based alerts like errors, cost, latency, and feedback scores)
3. **Event aggregation**: Multiple events are aggregated over a short time window (debouncing)
4. **Webhook delivery**: A consolidated HTTP POST request is sent to your webhook URL
5. **Retry handling**: Failed requests are automatically retried with exponential backoff
#### Event debouncing
To prevent overwhelming your webhook endpoint, Opik aggregates multiple events of the same type within a short time window (typically 30-60 seconds) and sends them as a single consolidated webhook. This is particularly useful for high-frequency events like feedback scores.
### Event flow
```
1. Event occurs (e.g., trace error logged)
        ↓
2. Service publishes AlertEvent to EventBus
        ↓
3. AlertEventListener receives event
        ↓
4. AlertEventEvaluationService evaluates against configured alerts
        ↓
5. Matching events added to AlertBucketService (Redis)
        ↓
6. AlertJob (runs every 5 seconds) processes ready buckets
        ↓
7. WebhookPublisher publishes to Redis stream
        ↓
8. WebhookSubscriber consumes from stream
        ↓
9. WebhookHttpClient sends HTTP POST request
        ↓
10. Retries on failure with exponential backoff
```
### Debouncing mechanism
Opik uses Redis-based buckets to aggregate events:
* **Bucket key format**: `alert_bucket:{alertId}:{eventType}`
* **Window size**: Configurable (default 30-60 seconds)
* **Index**: Redis Sorted Set for efficient bucket retrieval
* **TTL**: Buckets expire automatically after processing
This prevents overwhelming your webhook endpoint with individual events and reduces costs for high-frequency events.
### Retry strategy
Failed webhooks are automatically retried:
* **Max retries**: Configurable (default 3)
* **Initial delay**: 1 second
* **Max delay**: 60 seconds
* **Backoff**: Exponential with jitter
* **Retryable errors**: 5xx status codes, network errors
* **Non-retryable errors**: 4xx status codes (except 429)
## Best practices
### Alert design
**Create focused alerts:**
* Use separate alerts for different purposes (e.g., one for errors, one for feedback)
* Configure project scope to avoid noise from test projects
* Use descriptive names that explain the alert's purpose
**Optimize for your workflow:**
* Send critical errors to PagerDuty or on-call systems
* Route feedback scores to analytics platforms
* Send prompt changes to audit logs or Slack channels
**Test thoroughly:**
* Use the "Test connection" feature before enabling alerts
* Monitor webhook delivery in your endpoint logs
* Start with a small project scope and expand gradually
### Webhook endpoint design
**Handle failures gracefully:**
* Return 2xx status codes immediately
* Process webhooks asynchronously
* Implement retry logic in your handler
* Use dead letter queues for permanent failures
**Implement security:**
* Always validate secret tokens
* Use HTTPS endpoints with valid certificates
* Implement rate limiting to prevent abuse
* Log all webhook attempts for auditing
**Monitor performance:**
* Track webhook processing time
* Alert on handler failures
* Monitor queue lengths for async processing
* Set up dead letter queue monitoring
### Scaling considerations
**For high-volume workspaces:**
* Use event debouncing (built-in)
* Implement batch processing in your handler
* Use message queues for async processing
* Consider using serverless functions (AWS Lambda, Cloud Functions)
**For multiple projects:**
* Create project-specific alerts with scope configuration
* Use custom headers to route to different handlers
* Implement filtering in your webhook handler
* Consider separate endpoints for different event types
## Next steps
* Configure your first alert for production error monitoring
* Set up Slack integration for team notifications
* Explore [Online Evaluation Rules](/production/rules) for automated model monitoring
* Learn about [Guardrails](/production/guardrails) for proactive risk detection
* Review [Production Monitoring](/production/production_monitoring) best practices
# Dashboards
> Create customizable dashboards to monitor quality, cost, and performance of your LLM projects and visualize experiment results.
Dashboards allow you to create customizable views for monitoring your LLM applications. You can track project metrics like token usage, cost, and feedback scores, as well as compare experiment results across different runs.
### Performance overview template
A comprehensive template for monitoring project performance including:
* **Traces and Threads volume**: Track the number of traces and threads over time
* **Quality metrics**: Monitor feedback scores for both traces and threads
* **Duration and Latency**: Analyze trace and thread duration trends
* **Summary cards**: Quick overview of traces, threads, errors, latency, and cost
### Project metrics template
Focused on operational metrics for your project:
* **Token usage**: Monitor token consumption over time
* **Estimated cost**: Track spending trends
* **Feedback scores**: View quality metrics for traces and threads
* **Trace and thread counts**: Monitor volume trends
* **Duration metrics**: Analyze performance over time
* **Failed guardrails**: Track guardrail violations
### Experiment insights template
Designed for comparing experiment results:
* **Feedback scores radar chart**: Compare multiple metrics across experiments at a glance
* **Feedback scores distribution**: Detailed bar chart comparison of scores
## Widget types
Dashboards support several widget types that you can add to your sections:
### Project metrics widget
Displays time-series charts for project metrics over time. This widget visualizes how your metrics change and trend over time. Supports both line and bar chart visualizations for flexible data presentation.
**Available metrics:**
* **Trace feedback scores** - View quality metrics for traces over time
* **Number of traces** - Monitor trace volume trends
* **Trace duration** - Analyze trace performance trends
* **Token usage** - Monitor token consumption over time
* **Estimated cost** - Track spending trends
* **Failed guardrails** - Track guardrail violations over time
* **Number of threads** - Monitor thread volume trends
* **Thread duration** - Analyze thread performance trends
* **Thread feedback scores** - View quality metrics for threads over time
**Configuration options:**
* **Project**: Select a specific project or use the dashboard's default project
* **Metric type**: Choose from any of the metrics listed above
* **Chart type**: Line chart (best for trends) or Bar chart (good for period comparisons)
* **Filters**: Apply trace or thread filters to focus on specific data based on tags, metadata, or other attributes
* **Feedback scores**: When using feedback score metrics, optionally select specific scores to display (leave empty to show all)
### Project statistics widget
Shows a single metric value with a compact card display. Ideal for summary dashboards and key performance indicators.
**Data sources:** Traces or Spans
**Trace-specific metrics:**
* Total trace count
* Total thread count
* Average LLM span count
* Average span count
* Average estimated cost per trace
* Total guardrails failed count
**Span-specific metrics:**
* Total span count
* Average estimated cost per span
**Shared metrics (available for both traces and spans):**
* P50 duration - Median duration
* P90 duration - 90th percentile duration
* P99 duration - 99th percentile duration
* Total input count
* Total output count
* Total metadata count
* Average number of tags
* Total estimated cost sum
* Output tokens (avg.)
* Input tokens (avg.)
* Total tokens (avg.)
* Total error count
* Average feedback scores - Any feedback score defined in your project
### Experiments metrics widget
Compares feedback scores across multiple experiments. Ideal for visualizing A/B test results and prompt iteration outcomes. Supports multiple chart types and flexible data selection methods.
**Chart types:**
* **Line chart** - Show trends across experiments (default)
* **Bar chart** - View detailed score distributions side by side - ideal for precise comparisons
* **Radar chart** - Compare multiple feedback scores across experiments in a radial view - great for seeing overall performance patterns
**Data selection methods:**
* **Filter and group**: Dynamically filter and aggregate experiment results. You can:
* Filter by dataset (e.g., only experiments from a specific dataset)
* Filter by configuration metadata (e.g., model="gpt-4", temperature="0.7")
* Group by dataset to compare results across different datasets
* Group by configuration keys (e.g., group by model or prompt version) to aggregate feedback scores - supports up to 5 grouping levels for hierarchical comparisons
* Best for: creating dashboards that automatically include new experiments matching your criteria
* **Select experiments**: Manually select specific experiments to compare (up to 10 at a time). When creating widgets from the Experiment page (Compare experiments view), you can use this mode to visualize the experiments you're currently comparing. Best for:
* Comparing specific experiment runs you want to highlight
* A/B testing with a fixed set of experiments
* Creating focused comparisons for presentations or reports
**Configuration options:**
* **Data source**: Choose between "Filter and group" or "Select experiments" mode
* **Filters** (Filter and group mode only): Filter experiments by:
* Dataset - show only experiments from a specific dataset
* Configuration - filter by metadata keys and values (e.g., model="gpt-4")
* **Groups** (Filter and group mode only): Group aggregated results by:
* Dataset - compare results across different datasets
* Configuration - group by metadata keys to aggregate feedback scores (e.g., group by model type)
* **Experiments** (Select experiments mode only): Manually select specific experiments from your list (limited to 10 experiments)
* **Chart type**: Choose line, bar, or radar chart visualization
* **Metrics**: Optionally display only specific feedback scores (leave empty to show all)
### Markdown text widget
Add custom notes, descriptions, or documentation to your dashboard using markdown formatting. Use this widget to:
* Add section headers and explanations
* Document dashboard purpose and context
* Include links to related resources
* Add team notes or guidelines
**Example markdown content:**
```markdown
## Production Metrics Dashboard
This dashboard monitors our production LLM application performance.
**Key Metrics:**
- Token usage should stay under 100K/day
- Average latency target: < 500ms
- Error rate threshold: < 1%
**Weekly Review Schedule:**
- Monday: Review cost trends
- Wednesday: Check quality metrics
- Friday: Performance analysis
For more details, see our [Monitoring Guidelines](https://your-docs-link.com)
```
## Creating a dashboard
### From the Dashboards page
1. Navigate to the Dashboards page from the sidebar
2. Click **Create new dashboard**
3. Enter a name and optional description
4. Select a template to start from, or begin with a blank dashboard
5. Click **Create**
### From the Dashboards tab
When viewing a project or comparing experiments:
1. Switch to the **Dashboards** tab
2. Use the dropdown to select an existing dashboard or create a new one
3. New dashboards inherit the current project or experiment context
## Customizing dashboards
### Adding sections
Dashboards are organized into sections, each containing one or more widgets:
1. Click **Add section** at the bottom of the dashboard
2. Give the section a title
3. Add widgets to the section
### Adding widgets
1. Click the **+** button within a section
2. Select the widget type from the available options
3. Configure the widget settings:
* Title and subtitle
* Metric type (for metrics widgets)
* Chart type (line, bar, or radar)
* Filters to narrow the data
* Feedback scores to display (where applicable)
4. Click **Save** to add the widget
### Editing widgets
1. Click the menu icon on any widget
2. Select **Edit** to modify the widget configuration
3. Make your changes and save
### Rearranging widgets
* **Drag and drop**: Use the drag handle on widgets to reorder them within a section
* **Resize**: Drag the edges of widgets to adjust their size
### Collapsing sections
Click on a section title to collapse or expand it. The collapsed state is preserved across sessions.
## Date range filtering
Project-based widgets respect the global date range filter:
1. Use the date picker in the dashboard toolbar
2. Select a preset range (Last 24 hours, Last 7 days, etc.) or choose custom dates
3. Applicable widgets automatically update to show data within the selected range
**Widgets that use date range filtering:**
* Project metrics widget - filters time-series data to the selected range
* Project statistics widget - calculates statistics within the selected range
**Widgets not affected by date range:**
* Experiments metrics widget - displays experiment results regardless of date
* Markdown text widget - static content
## Saving changes
Dashboards auto-detect changes. When you make modifications:
1. The **Save** and **Discard** buttons appear in the toolbar
2. Click **Save** to persist your changes
3. Click **Discard** to revert to the last saved state
The Feedback Definitions page displays a table of all configured feedback types with the following columns:
* **Feedback score**: The name of the feedback definition
* **Type**: The data type of feedback (Categorical or Numerical)
* **Values**: Possible values that can be assigned when providing feedback
### Creating a New Feedback Definition
To create a new feedback definition:
1. Click the **Create new feedback definition** button in the top-right corner of the Feedback Definitions page.
2. In the modal, you will be prompted to provide:
* **Name**: A descriptive name for your feedback definition
* **Type**: Select the data type (Categorical or Numerical)
* **Values / Range**: Depending on the type, either define possible labels (for Categorical) or set a minimum and maximum value (for Numerical)
### Examples of Common Feedback Types
As shown in the interface, common feedback definitions include:
* **Thumbs Up / Thumbs Down**: Simple binary feedback for quick human review (Values: 👍, 👎)
* **Usefulness**: Evaluates how helpful the response is (Values: Neutral, Not useful, Useful)
* **Knowledge Retrieval**: Assesses the accuracy of retrieved information (Values: Bad results, Good results, Unrelated results)
* **Hallucination**: Identifies when the model invents information (Values: No, Yes)
* **Correct**: Determines factual accuracy of responses (Values: Bad, Good)
* **Empty**: Identifies empty or non-responsive outputs (Values: No, n/a)
## Best Practices
* Create meaningful, clearly differentiated feedback categories
* Use consistent naming conventions for your feedback definitions
* Limit the number of possible values to make evaluation efficient
* Consider the specific evaluation needs of your use case when designing feedback types
## Using Feedback Definitions
Once created, feedback definitions can be used to:
1. Evaluate outputs in experiments
2. Build custom evaluations and reports
3. Train and fine-tune models based on collected feedback
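For example, feedback scores that correspond to these definitions can be attached to existing traces from the Python SDK. A minimal sketch (the trace ID and score names are illustrative and assume matching feedback definitions exist):
```python
import opik
client = opik.Opik()
# Attach feedback scores to an existing trace; names should match your feedback definitions
client.log_traces_feedback_scores(
    scores=[
        {"id": "your-trace-id", "name": "Usefulness", "value": 1.0},
        {"id": "your-trace-id", "name": "Hallucination", "value": 0.0, "reason": "No invented facts"},
    ]
)
```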
## Additional Resources
* For programmatic access to feedback definitions, see the [API Reference](/reference/api)
* To learn about creating automated evaluation rules with feedback definitions, see [Automation Rules](/automation/rules)
# AI Providers
> Configure connections to Large Language Model providers
The AI Providers tab allows you to configure connections to different Large Language Model (LLM) providers. This page explains how to set up and manage AI provider integrations within Opik.
## Overview
Connecting AI providers enables you to:
* Send prompts and receive responses from different LLMs
* Set up a provider in one place and use it across projects
* Automatically record model metadata in the Playground
* Track and analyze traces using online evaluation rules
## Managing AI Providers
### Viewing Existing Providers
The AI Providers page displays a table of all configured connections with the following columns:
* **Name**: The name or identifier of the API key
* **Created**: The date and time when the provider was configured
* **Provider**: The type of AI provider (e.g., OpenAI)
### Adding a New Provider Configuration
To add a new AI provider:
1. Click the **Add configuration** button in the top-right corner
2. In the Provider Configuration dialog that appears:
* Select a provider from the dropdown menu
* Enter your API key for that provider
* Click **Save** to store the configuration
### Supported Providers
Opik supports integration with various AI providers, including:
* OpenAI
* Anthropic
* OpenRouter
* Gemini
* VertexAI
* Azure OpenAI
* Amazon Bedrock
* LM Studio (coming soon)
* vLLM / Ollama / any other OpenAI API-compliant provider
##### Configuration Steps
1. **Provider Name**: Enter a unique name to identify this custom provider (e.g., "vLLM Production", "Ollama Local", "Azure OpenAI Dev")
2. **URL**: Enter your server URL, for example: `http://host.docker.internal:8000/v1`
3. **API Key** (optional): If your model access requires authentication, enter the API key. Otherwise, leave this field blank.
4. **Models**: List all models available on your server. You'll be able to select one of them for use later.
5. **Custom Headers** (optional): Add any additional HTTP headers required by your custom endpoint as key-value pairs.
## Getting started
To use the BeeAI integration with Opik, you will need to have BeeAI and the required OpenTelemetry packages installed.
### Installation
#### Option 1: Using npm
```bash
npm install beeai-framework@0.1.13 @ai-sdk/openai @arizeai/openinference-instrumentation-beeai @opentelemetry/sdk-node dotenv
```
#### Option 2: Using yarn
```bash
yarn add beeai-framework@0.1.13 @ai-sdk/openai @arizeai/openinference-instrumentation-beeai @opentelemetry/sdk-node dotenv
```
## Getting started
### Create a Mastra project
If you don't have a Mastra project yet, you can create one using the Mastra CLI:
```bash
npx create-mastra
cd your-mastra-project
```
### Install required packages
Install the necessary dependencies for Mastra and AI SDK:
```bash
npm install langfuse-vercel
```
### Add environment variables
Create or update your `.env` file with the following variables:
## Getting started
To use the AG2 integration with Opik, you will need to have the following
packages installed:
```bash
pip install -U "ag2[openai]" opik opentelemetry-sdk opentelemetry-instrumentation-openai opentelemetry-instrumentation-threading opentelemetry-exporter-otlp
```
In addition, you will need to set the following environment variables to
configure the OpenTelemetry integration:
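The exact values depend on your deployment; a minimal sketch for Opik Cloud, with placeholder API key, workspace, and project name (adjust the endpoint if you are self-hosting). The same variables apply to the other OpenTelemetry-based integrations below:
```python
import os
# Opik Cloud OTLP endpoint; self-hosted deployments use their own /api/v1/private/otel URL
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://www.comet.com/opik/api/v1/private/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = (
    "Authorization=<your-api-key>,Comet-Workspace=<your-workspace>,projectName=<your-project-name>"
)
```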
## Getting started
To use the Agno integration with Opik, you will need to have the following
packages installed:
```bash
pip install -U agno openai opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-agno yfinance
```
In addition, you will need to set the following environment variables to
configure the OpenTelemetry integration:
## Getting started
To use the BeeAI integration with Opik, you will need to have BeeAI and the required OpenTelemetry packages installed:
```bash
pip install beeai-framework openinference-instrumentation-beeai "beeai-framework[wikipedia]" opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
```
## Environment configuration
Configure your environment variables based on your Opik deployment:
## Getting started
To use the Autogen integration with Opik, you will need to have the following
packages installed:
```bash
pip install -U "autogen-agentchat" "autogen-ext[openai]" opik opentelemetry-sdk opentelemetry-instrumentation-openai opentelemetry-exporter-otlp
```
In addition, you will need to set the following environment variables to
configure the OpenTelemetry integration:
## Getting Started
### Installation
First, ensure you have both `opik` and `crewai` installed:
```bash
pip install opik crewai crewai-tools
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring CrewAI
In order to configure CrewAI, you will need to have your LLM provider API key. For this example, we'll use OpenAI. You can [find or create your OpenAI API key on this page](https://platform.openai.com/settings/organization/api-keys).
You can set it as an environment variable:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
```
Or set it programmatically:
```python
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```
## Logging CrewAI calls
To log a CrewAI pipeline run, you can use the [`track_crewai`](https://www.comet.com/docs/opik/python-sdk-reference/integrations/crewai/track_crewai.html) function. This will log each CrewAI call to Opik, including LLM calls made by your agents.
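For example, a minimal sketch (the agent, task, and project name are illustrative and assume `OPENAI_API_KEY` is set as shown above):
```python
from crewai import Agent, Crew, Task
from opik.integrations.crewai import track_crewai
# Enable Opik logging for the CrewAI run and the LLM calls it makes
track_crewai(project_name="crewai-integration-demo")
researcher = Agent(
    role="Researcher",
    goal="Summarize a topic in one short paragraph",
    backstory="You are a concise technical writer.",
)
task = Task(
    description="Explain what LLM observability is.",
    expected_output="A single-paragraph summary.",
    agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
print(result)
```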
If you set `log_graph` to `True` in the `OpikCallback`, then each module graph is also displayed in the "Agent graph" tab:
# Observability for Google Agent Development Kit (Python) with Opik
> Start here to integrate Opik into your Google Agent Development Kit-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Agent Development Kit (ADK)](https://google.github.io/adk-docs/) is a flexible and modular framework for developing and deploying AI agents. ADK can be used with popular LLMs and open-source generative AI tools and is designed with a focus on tight integration with the Google ecosystem and Gemini models. ADK makes it easy to get started with simple agents powered by Gemini models and Google AI tools while providing the control and structure needed for more complex agent architectures and orchestration.
In this guide, we will showcase how to integrate Opik with Google ADK so that all the ADK calls are logged as traces in Opik. We'll cover three key integration patterns:
1. **Automatic Agent Tracking** - Recommended approach using `track_adk_agent_recursive` for effortless instrumentation
2. **Manual Callback Configuration** - Alternative approach with explicit callback setup for fine-grained control
3. **Hybrid Tracing** - Combining Opik decorators with ADK callbacks for comprehensive observability
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=google-adk\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=google-adk\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=google-adk\&utm_campaign=opik) for more information.
Opik provides comprehensive integration with ADK, automatically logging traces for all agent executions, tool calls, and LLM interactions with detailed cost tracking and error monitoring.
## Key Features
* **One-line instrumentation** with `track_adk_agent_recursive` for automatic tracing of entire agent hierarchies
* **Automatic cost tracking** for all supported LLM providers including LiteLLM models (OpenAI, Anthropic, Google AI, AWS Bedrock, and more)
* **Full compatibility** with the `@opik.track` decorator for hybrid tracing approaches
* **Thread support** for conversational applications using ADK sessions
* **Automatic agent graph visualization** with Mermaid diagrams for complex multi-agent workflows
* **Comprehensive error tracking** with detailed error information and stack traces
## Getting Started
### Installation
First, ensure you have both `opik` and `google-adk` installed:
```bash
pip install opik google-adk
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring Google ADK
In order to configure Google ADK, you will need to have your LLM provider API key. For this example, we'll use OpenAI. You can [find or create your OpenAI API key on this page](https://platform.openai.com/settings/organization/api-keys).
You can set it as an environment variable:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
```
Or set it programmatically:
```python
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```
## Example 1: Automatic Agent Tracking (Recommended)
The recommended way to track ADK agents is using [`track_adk_agent_recursive`](https://www.comet.com/docs/opik/python-sdk-reference/integrations/adk/track_adk_agent_recursive.html) and [`OpikTracer`](https://www.comet.com/docs/opik/python-sdk-reference/integrations/adk/OpikTracer.html), which automatically instruments your entire agent hierarchy with a single function call. This approach is ideal for both single agents and complex multi-agent setups:
```python
import datetime
from zoneinfo import ZoneInfo
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm
from opik.integrations.adk import OpikTracer, track_adk_agent_recursive
def get_weather(city: str) -> dict:
"""Get weather information for a city."""
if city.lower() == "new york":
return {
"status": "success",
"report": "The weather in New York is sunny with a temperature of 25 ยฐC (77 ยฐF).",
}
elif city.lower() == "london":
return {
"status": "success",
"report": "The weather in London is cloudy with a temperature of 18 ยฐC (64 ยฐF).",
}
return {"status": "error", "error_message": f"Weather info for '{city}' is unavailable."}
def get_current_time(city: str) -> dict:
"""Get current time for a city."""
if city.lower() == "new york":
tz = ZoneInfo("America/New_York")
now = datetime.datetime.now(tz)
return {
"status": "success",
"report": now.strftime(f"The current time in {city} is %Y-%m-%d %H:%M:%S %Z%z."),
}
elif city.lower() == "london":
tz = ZoneInfo("Europe/London")
now = datetime.datetime.now(tz)
return {
"status": "success",
"report": now.strftime(f"The current time in {city} is %Y-%m-%d %H:%M:%S %Z%z."),
}
return {"status": "error", "error_message": f"No timezone info for '{city}'."}
# Initialize LiteLLM with OpenAI gpt-4o
llm = LiteLlm(model="openai/gpt-4o")
# Create the basic agent
basic_agent = LlmAgent(
name="weather_time_agent",
model=llm,
description="Agent for answering time & weather questions",
instruction="Answer questions about the time or weather in a city. Be helpful and provide clear information.",
tools=[get_weather, get_current_time],
)
# Configure Opik tracer
opik_tracer = OpikTracer(
name="basic-weather-agent",
tags=["basic", "weather", "time", "single-agent"],
metadata={
"environment": "development",
"model": "gpt-4o",
"framework": "google-adk",
"example": "basic"
},
project_name="adk-basic-demo"
)
# Instrument the agent with a single function call - this is the recommended approach
track_adk_agent_recursive(basic_agent, opik_tracer)
```
Each agent execution will now be automatically logged to the Opik platform with detailed trace information:
This approach automatically handles:
* **All agent callbacks** (before/after agent, model, and tool executions)
* **Sub-agents** and nested agent hierarchies
* **Agent tools** that contain other agents
* **Complex workflows** with minimal code
## Example 2: Manual Callback Configuration (Alternative Approach)
For fine-grained control over which callbacks to instrument, you can manually configure the [`OpikTracer`](https://www.comet.com/docs/opik/python-sdk-reference/integrations/adk/OpikTracer.html) callbacks. This approach gives you explicit control but requires more setup code:
```python
# Configure Opik tracer (same as before)
opik_tracer = OpikTracer(
name="basic-weather-agent",
tags=["basic", "weather", "time", "single-agent"],
metadata={
"environment": "development",
"model": "gpt-4o",
"framework": "google-adk",
"example": "basic"
},
project_name="adk-basic-demo"
)
# Create the agent with explicit callback configuration
basic_agent = LlmAgent(
name="weather_time_agent",
model=llm,
description="Agent for answering time & weather questions",
instruction="Answer questions about the time or weather in a city. Be helpful and provide clear information.",
tools=[get_weather, get_current_time],
before_agent_callback=opik_tracer.before_agent_callback,
after_agent_callback=opik_tracer.after_agent_callback,
before_model_callback=opik_tracer.before_model_callback,
after_model_callback=opik_tracer.after_model_callback,
before_tool_callback=opik_tracer.before_tool_callback,
after_tool_callback=opik_tracer.after_tool_callback,
)
```
The `track_adk_agent_recursive` approach is particularly powerful for:
* **Multi-agent systems** with coordinator and specialist agents
* **Sequential agents** with multiple processing steps
* **Parallel agents** executing tasks concurrently
* **Loop agents** with iterative workflows
* **Agent tools** that contain nested agents
* **Complex hierarchies** with deeply nested agent structures
By calling `track_adk_agent_recursive` once on the top-level agent, all child agents and their operations are automatically instrumented without any additional code.
## Cost Tracking
Opik automatically tracks token usage and cost for all LLM calls during agent execution, not only for Gemini models but also for models accessed via `LiteLLM`.
For more complex agent architectures, displaying the agent graph can be even more helpful:
## Example 4: Hybrid Tracing - Combining Opik Decorators with ADK Callbacks
This advanced example shows how to combine Opik's `@opik.track` decorator with ADK's callback system. This is powerful when you have complex multi-step tools that perform their own internal operations that you want to trace separately, while still maintaining the overall agent trace context.
You can use `track_adk_agent_recursive` together with `@opik.track` decorators on your tool functions for maximum visibility:
```python
from opik import track
@track(name="weather_data_processing", tags=["data-processing", "weather"])
def process_weather_data(raw_data: dict) -> dict:
"""Process raw weather data with additional computations."""
# Simulate some data processing steps that we want to trace separately
processed = {
"temperature_celsius": raw_data.get("temp_c", 0),
"temperature_fahrenheit": raw_data.get("temp_c", 0) * 9/5 + 32,
"conditions": raw_data.get("condition", "unknown"),
"comfort_index": "comfortable" if 18 <= raw_data.get("temp_c", 0) <= 25 else "less comfortable"
}
return processed
@track(name="location_validation", tags=["validation", "location"])
def validate_location(city: str) -> dict:
"""Validate and normalize city names."""
# Simulate location validation logic that we want to trace
normalized_cities = {
"nyc": "New York",
"ny": "New York",
"new york city": "New York",
"london uk": "London",
"london england": "London",
"tokyo japan": "Tokyo"
}
city_lower = city.lower().strip()
validated_city = normalized_cities.get(city_lower, city.title())
return {
"original": city,
"validated": validated_city,
"is_valid": city_lower in ["new york", "london", "tokyo"] or city_lower in normalized_cities
}
@track(name="advanced_weather_lookup", tags=["weather", "api-simulation"])
def get_advanced_weather(city: str) -> dict:
"""Get weather with internal processing steps tracked by Opik decorators."""
# Step 1: Validate location (traced by @opik.track)
location_result = validate_location(city)
if not location_result["is_valid"]:
return {
"status": "error",
"error_message": f"Invalid location: {city}"
}
validated_city = location_result["validated"]
# Step 2: Get raw weather data (simulated)
raw_weather_data = {
"New York": {"temp_c": 25, "condition": "sunny", "humidity": 65},
"London": {"temp_c": 18, "condition": "cloudy", "humidity": 78},
"Tokyo": {"temp_c": 22, "condition": "partly cloudy", "humidity": 70}
}
if validated_city not in raw_weather_data:
return {
"status": "error",
"error_message": f"Weather data unavailable for {validated_city}"
}
raw_data = raw_weather_data[validated_city]
# Step 3: Process the data (traced by @opik.track)
processed_data = process_weather_data(raw_data)
return {
"status": "success",
"city": validated_city,
"report": f"Weather in {validated_city}: {processed_data['conditions']}, {processed_data['temperature_celsius']}ยฐC ({processed_data['temperature_fahrenheit']:.1f}ยฐF). Comfort level: {processed_data['comfort_index']}.",
"raw_humidity": raw_data["humidity"]
}
# Configure Opik tracer for hybrid example
hybrid_tracer = OpikTracer(
name="hybrid-tracing-agent",
tags=["hybrid", "decorators", "callbacks", "advanced"],
metadata={
"environment": "development",
"model": "gpt-4o",
"framework": "google-adk",
"example": "hybrid-tracing",
"tracing_methods": ["decorators", "callbacks"]
},
project_name="adk-hybrid-demo"
)
# Create hybrid agent that combines both tracing approaches
hybrid_agent = LlmAgent(
name="advanced_weather_time_agent",
model=llm,
description="Advanced agent with hybrid Opik tracing using both decorators and callbacks",
instruction="""You are an advanced weather and time agent that provides detailed information with comprehensive internal processing.
Your tools perform multi-step operations that are individually traced, giving detailed visibility into the processing pipeline.
Use the advanced weather and time tools to provide thorough, well-processed information to users.""",
tools=[get_advanced_weather],
)
# Instrument the agent with track_adk_agent_recursive
# The @opik.track decorators in your tools will automatically create child spans
from opik.integrations.adk import track_adk_agent_recursive
track_adk_agent_recursive(hybrid_agent, hybrid_tracer)
```
The trace can now be viewed in the UI:
## Compatibility with @track Decorator
The `OpikTracer` is fully compatible with the `@track` decorator, allowing you to create hybrid tracing approaches that combine ADK agent tracking with custom function tracing.
You can both invoke your agent from inside another tracked function and call tracked functions inside your tool functions; all parent-child relationships between spans and traces will be preserved.
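For example, a minimal sketch of calling an instrumented agent from inside a tracked function (the runner setup is assumed to match the thread example below, and the response extraction follows the standard ADK event API):
```python
from google.genai import types
from opik import track
@track(name="handle_user_request")
def handle_user_request(runner, question: str) -> str:
    # The runner's agent is assumed to be instrumented with track_adk_agent_recursive,
    # so the ADK spans are attached as children of this function's trace.
    content = types.Content(role="user", parts=[types.Part(text=question)])
    answer = ""
    for event in runner.run(user_id="user_123", session_id="conversation_456", new_message=content):
        if event.is_final_response():
            answer = event.content.parts[0].text
    return answer
```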
## Thread Support
The Opik integration automatically handles ADK sessions and maps them to Opik threads for conversational applications:
```python
from opik.integrations.adk import OpikTracer
from google.adk import sessions as adk_sessions, runners as adk_runners
# ADK session management
session_service = adk_sessions.InMemorySessionService()
session = session_service.create_session_sync(
app_name="my_app",
user_id="user_123",
session_id="conversation_456"
)
opik_tracer = OpikTracer()
runner = adk_runners.Runner(
agent=your_agent,
app_name="my_app",
session_service=session_service
)
# All traces will be automatically grouped by session_id as thread_id
```
The integration automatically:
* Uses the ADK session ID as the Opik thread ID
* Groups related conversations and interactions
* Logs app\_name and user\_id as metadata
* Maintains conversation context across multiple interactions
You can view your session as a whole conversation and easily navigate to any specific trace you need.
## Error Tracking
The `OpikTracer` provides comprehensive error tracking and monitoring:
* **Automatic error capture** for agent execution failures
* **Detailed stack traces** with full context information
* **Tool execution errors** with input/output data
* **Model call failures** with provider-specific error details
Error information is automatically logged to spans and traces, making it easy to debug issues in production:
## Troubleshooting: Missing Trace
When using `Runner.run_async`, make sure to process all events completely, even after finding the final response (when `event.is_final_response()` is `True`). If you exit the loop too early, the `OpikTracer` won't log the final response and your trace will be incomplete. Avoid code like the following, which stops processing events prematurely:
```python
async for event in runner.run_async(user_id=user_id, session_id=session_id, new_message=content):
if event.is_final_response():
...
break # Stop processing events once the final response is found
```
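Instead, capture the final response but keep consuming events until the loop completes so the tracer can log the full trace. A minimal sketch based on the loop above:
```python
final_response = None
async for event in runner.run_async(user_id=user_id, session_id=session_id, new_message=content):
    if event.is_final_response():
        # Capture the response, but keep draining events so OpikTracer can finish the trace
        final_response = event
# Work with final_response only after the loop has fully completed
```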
There is an upstream discussion about how to best solve this source of confusion: [https://github.com/google/adk-python/issues/1695](https://github.com/google/adk-python/issues/1695).
## Cost Tracking
The `OpikConnector` automatically tracks token usage and cost for all supported LLM models used within Haystack pipelines.
Cost information is automatically captured and displayed in the Opik UI, including:
* Token usage details
* Cost per request based on model pricing
* Total trace cost
Opik integrates with Harbor to log traces for all trial executions, including:
* **Trial results** as Opik traces with timing, metadata, and feedback scores from verifier rewards
* **Trajectory steps** as nested spans showing the complete agent-environment interaction
* **Tool calls and observations** as detailed execution records
* **Token usage and costs** aggregated from ATIF metrics
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=harbor\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=harbor\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=harbor\&utm_campaign=opik) for more information.
## Getting Started
### Installation
First, ensure you have both `opik` and `harbor` installed:
```bash
pip install opik harbor
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring Harbor
Harbor requires configuration for the agent and benchmark you want to evaluate. Refer to the [Harbor documentation](https://github.com/laude-institute/harbor) for details on setting up your job configuration.
## Using the CLI
The easiest way to use Harbor with Opik is through the `opik harbor` CLI command. This automatically enables Opik tracking for all trial executions without modifying your code.
### Basic Usage
```bash
# Run a benchmark with Opik tracking
opik harbor run -d terminal-bench@head -a terminus_2 -m gpt-4.1
# Use a configuration file
opik harbor run -c config.yaml
```
### Specifying Project Name
```bash
# Set project name via environment variable
export OPIK_PROJECT_NAME=my-benchmark
opik harbor run -d swebench@lite
```
### Available CLI Commands
All Harbor CLI commands are available as subcommands:
```bash
# Run a job (alias for jobs start)
opik harbor run [HARBOR_OPTIONS]
# Job management
opik harbor jobs start [HARBOR_OPTIONS]
opik harbor jobs resume -p ./jobs/my-job
# Single trial
opik harbor trials start -p ./my-task -a terminus_2
```
### CLI Help
```bash
# View available options
opik harbor --help
opik harbor run --help
```
## Example: SWE-bench Evaluation
Here's a complete example running a SWE-bench evaluation with Opik tracking:
```bash
# Configure Opik
opik configure
# Set project name
export OPIK_PROJECT_NAME=swebench-claude-sonnet
# Run SWE-bench evaluation with tracking
opik harbor run \
-d swebench-lite@head \
-a claude-code \
-m claude-3-5-sonnet-20241022
```
## Custom Agents
Harbor supports integrating your own custom agents without modifying the Harbor source code. There are two types of agents you can create:
* **External agents** - Interface with the environment through the `BaseEnvironment` interface, typically by executing bash commands
* **Installed agents** - Installed directly into the container environment and executed in headless mode
For details on implementing custom agents, see the [Harbor Agents documentation](https://harborframework.com/docs/agents).
### Running Custom Agents with Opik
To run a custom agent with Opik tracking, use the `--agent-import-path` flag:
```bash
opik harbor run -d "terminal-bench@head" --agent-import-path path.to.agent:MyCustomAgent
```
### Tracking Custom Agent Functions
When building custom agents, you can use Opik's `@track` decorator on methods within your agent implementation. These decorated functions will automatically be captured as spans within the trial trace, giving you detailed visibility into your agent's internal logic:
```python
from harbor.agents.base import BaseAgent
from opik import track
class MyCustomAgent(BaseAgent):
    @staticmethod
    def name() -> str:
        return "my-custom-agent"
    @track
    async def plan_next_action(self, observation: str) -> str:
        # This function will appear as a span in Opik
        # Add your planning logic here; return the next command to run
        return "ls -la"
    @track
    async def execute_tool(self, tool_name: str, args: dict) -> str:
        # This will also be tracked as a nested span
        result = await self._run_tool(tool_name, args)
        return result
    async def run(self, instruction: str, environment, context) -> None:
        # Your main agent loop (simplified skeleton)
        done = False
        while not done:
            observation = await environment.exec("pwd")
            action = await self.plan_next_action(observation)
            result = await self.execute_tool("shell", {"command": action})
            done = True  # replace with your own termination condition
```
This allows you to trace not just the ATIF trajectory steps, but also the internal decision-making processes of your custom agent.
## What Gets Logged
Each trial completion creates an Opik trace with:
* Trial name and task information as the trace name and input
* Agent execution timing as start/end times
* Verifier rewards (e.g., pass/fail, tests passed) as feedback scores
* Agent and model metadata
* Exception information if the trial failed
### Trajectory Spans
The integration automatically creates spans for each step in the agent's trajectory, giving you detailed visibility into the agent-environment interaction. Each trajectory step becomes a span showing:
* The step source (user, agent, or system)
* The message content
* Tool calls and their arguments
* Observation results from the environment
* Token usage and cost per step
* Model name for agent steps
### Verifier Rewards as Feedback Scores
Harbor's verifier produces rewards like `{"pass": 1, "tests_passed": 5}`. These are automatically converted to Opik feedback scores, allowing you to:
* Filter traces by pass/fail status
* Aggregate metrics across experiments
* Compare agent performance across benchmarks
## Cost Tracking
The Harbor integration automatically extracts token usage and cost from ATIF trajectory metrics. If your agent records `prompt_tokens`, `completion_tokens`, and `cost_usd` in step metrics, these are captured in Opik spans.
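For example, a step whose metrics look roughly like the following (values are illustrative) will have its token usage and cost attached to the corresponding span:
```python
# Illustrative ATIF step metrics; only the field names mentioned above are assumed to be recognized.
step_metrics = {
    "prompt_tokens": 1250,
    "completion_tokens": 320,
    "cost_usd": 0.0041,
}
```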
## Environment Variables
| Variable | Description |
| ------------------- | ------------------------------- |
| `OPIK_PROJECT_NAME` | Default project name for traces |
| `OPIK_API_KEY` | API key for Opik Cloud |
| `OPIK_WORKSPACE` | Workspace name (for Opik Cloud) |
### Getting Help
* Check the [Harbor documentation](https://github.com/laude-institute/harbor) for agent and benchmark setup
* Review the [ATIF specification](https://github.com/laude-institute/harbor/blob/main/docs/rfcs/0001-trajectory-format.md) for trajectory format details
* Open an issue on [GitHub](https://github.com/comet-ml/opik/issues) for Opik integration questions
# Structured Output Tracking for Instructor with Opik
> Start here to integrate Opik into your Instructor-based genai application for structured output tracking, schema validation monitoring, and LLM call observability.
[Instructor](https://github.com/instructor-ai/instructor) is a Python library for working with structured outputs
for LLMs built on top of Pydantic. It provides a simple way to manage schema validations, retries and streaming responses.
In this guide, we will showcase how to integrate Opik with Instructor so that all the Instructor calls are logged as traces in Opik.
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=instructor\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=instructor\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=instructor\&utm_campaign=opik) for more information.
## Getting Started
### Installation
First, ensure you have both `opik` and `instructor` installed:
```bash
pip install opik instructor
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring Instructor
In order to use Instructor, you will need to configure your LLM provider API keys. For this example, we'll use OpenAI, Anthropic, and Gemini. You can find or create your API keys on each provider's platform (for OpenAI, see [this page](https://platform.openai.com/settings/organization/api-keys)).
You can set them as environment variables:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
export ANTHROPIC_API_KEY="YOUR_API_KEY"
export GOOGLE_API_KEY="YOUR_API_KEY"
```
Or set them programmatically:
```python
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
if "ANTHROPIC_API_KEY" not in os.environ:
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter your Anthropic API key: ")
if "GOOGLE_API_KEY" not in os.environ:
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")
```
## Using Opik with Instructor library
In order to log traces from Instructor into Opik, we are going to patch the `instructor` library. This will log each LLM call to the Opik platform.
For all the integrations, we will first add tracking to the LLM client and then pass it to the Instructor library:
```python
from opik.integrations.openai import track_openai
import instructor
from pydantic import BaseModel
from openai import OpenAI
# We will first create the OpenAI client and add the `track_openai`
# method to log data to Opik
openai_client = track_openai(OpenAI())
# Patch the OpenAI client for Instructor
client = instructor.from_openai(openai_client)
# Define your desired output structure
class UserInfo(BaseModel):
name: str
age: int
user_info = client.chat.completions.create(
model="gpt-4o-mini",
response_model=UserInfo,
messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user_info)
```
Thanks to the `track_openai` method, all the calls made to OpenAI will be logged to the Opik platform. This approach also works well if you are also using the `opik.track` decorator as it will automatically log the LLM call made with Instructor to the relevant trace.
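For example, a minimal sketch that reuses the `client` and `UserInfo` defined above inside a tracked function:
```python
import opik
@opik.track
def extract_user_info(text: str) -> UserInfo:
    # The Instructor call below is logged as a nested span under this function's trace
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=UserInfo,
        messages=[{"role": "user", "content": text}],
    )
print(extract_user_info("Jane Doe is 25 years old."))
```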
## Integrating with other LLM providers
The Instructor library supports many LLM providers beyond OpenAI, including Anthropic, AWS Bedrock, and Gemini. Opik supports most of these providers as well.
Here are the code snippets needed for the integration with different providers:
### Anthropic
```python
from opik.integrations.anthropic import track_anthropic
import instructor
from anthropic import Anthropic
# Add Opik tracking
anthropic_client = track_anthropic(Anthropic())
# Patch the Anthropic client for Instructor
client = instructor.from_anthropic(
anthropic_client, mode=instructor.Mode.ANTHROPIC_JSON
)
user_info = client.chat.completions.create(
model="claude-3-5-sonnet-20241022",
response_model=UserInfo,
messages=[{"role": "user", "content": "John Doe is 30 years old."}],
max_tokens=1000,
)
print(user_info)
```
### Gemini
```python
from opik.integrations.genai import track_genai
import instructor
from google import genai
# Add Opik tracking
gemini_client = track_genai(genai.Client())
# Patch the GenAI client for Instructor
client = instructor.from_genai(
gemini_client, mode=instructor.Mode.GENAI_STRUCTURED_OUTPUTS
)
user_info = client.chat.completions.create(
model="gemini-2.0-flash-001",
response_model=UserInfo,
messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user_info)
```
You can read more about how to use the Instructor library in [their documentation](https://python.useinstructor.com/).
# Observability for LangChain (Python) with Opik
> Start here to integrate Opik into your LangChain-based genai application for end-to-end LLM observability, unit testing, and optimization.
Opik provides seamless integration with LangChain, allowing you to easily log and trace your LangChain-based applications. By using the `OpikTracer` callback, you can automatically capture detailed information about your LangChain runs, including inputs, outputs, metadata, and cost tracking for each step in your chain.
## Key Features
* **Automatic cost tracking** for supported LLM providers (OpenAI, Anthropic, Google AI, AWS Bedrock, and more)
* **Full compatibility** with the `@opik.track` decorator for hybrid tracing approaches
* **Thread support** for conversational applications with `thread_id` parameter
* **Distributed tracing** support for multi-service applications
* **LangGraph compatibility** for complex graph-based workflows
* **Evaluation and testing** support for automated LLM application testing
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=langchain\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=langchain\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=langchain\&utm_campaign=opik) for more information.
## Getting Started
### Installation
To use the `OpikTracer` with LangChain, you'll need to have both the `opik` and `langchain` packages installed. You can install them using pip:
```bash
pip install opik langchain langchain_openai
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
## Using OpikTracer
Here's a basic example of how to use the `OpikTracer` callback with a LangChain chain:
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from opik.integrations.langchain import OpikTracer
# Initialize the tracer
opik_tracer = OpikTracer(project_name="langchain-examples")
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("human", "Translate the following text to French: {text}")
])
chain = prompt | llm
result = chain.invoke(
{"text": "Hello, how are you?"},
config={"callbacks": [opik_tracer]}
)
print(result.content)
```
The `OpikTracer` will automatically log the run and its details to Opik, including the input prompt, the output, and metadata for each step in the chain.
For detailed parameter information, see the [OpikTracer SDK reference](https://www.comet.com/docs/opik/python-sdk-reference/integrations/langchain/OpikTracer.html).
## Practical Example: Text-to-SQL with Evaluation
Let's walk through a real-world example of using LangChain with Opik for a text-to-SQL query generation task. This example demonstrates how to create synthetic datasets, build LangChain chains, and evaluate your application.
### Setting up the Environment
First, let's set up our environment with the necessary dependencies:
```python
import os
import getpass
import opik
from opik.integrations.openai import track_openai
from openai import OpenAI
# Configure Opik
opik.configure(use_local=False)
os.environ["OPIK_PROJECT_NAME"] = "langchain-integration-demo"
# Set up API keys
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```
### Creating a Synthetic Dataset
We'll create a synthetic dataset of questions for our text-to-SQL task:
```python
import json
from langchain_community.utilities import SQLDatabase
# Download and set up the Chinook database
import requests
url = "https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"
filename = "./data/chinook/Chinook_Sqlite.sqlite"
folder = os.path.dirname(filename)
if not os.path.exists(folder):
os.makedirs(folder)
if not os.path.exists(filename):
response = requests.get(url)
with open(filename, "wb") as file:
file.write(response.content)
print("Chinook database downloaded")
db = SQLDatabase.from_uri(f"sqlite:///{filename}")
# Create synthetic questions using OpenAI
client = OpenAI()
openai_client = track_openai(client)
prompt = """
Create 20 different example questions a user might ask based on the Chinook Database.
These questions should be complex and require the model to think. They should include complex joins and window functions to answer.
Return the response as a json object with a "result" key and an array of strings with the question.
"""
completion = openai_client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}]
)
synthetic_questions = json.loads(completion.choices[0].message.content)["result"]
# Create dataset in Opik
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset(name="synthetic_questions")
dataset.insert([{"question": question} for question in synthetic_questions])
```
### Building the LangChain Chain
Now let's create a LangChain chain for SQL query generation:
```python
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
from opik.integrations.langchain import OpikTracer
# Create the LangChain chain with OpikTracer
opik_tracer = OpikTracer(tags=["sql_generation"])
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = create_sql_query_chain(llm, db).with_config({"callbacks": [opik_tracer]})
# Test the chain
response = chain.invoke({"question": "How many employees are there?"})
print(response)
```
### Evaluating the Application
Let's create a custom evaluation metric and test our application:
```python
from opik import track
from opik.evaluation import evaluate
from opik.evaluation.metrics import base_metric, score_result
from typing import Any
class ValidSQLQuery(base_metric.BaseMetric):
def __init__(self, name: str, db: Any):
self.name = name
self.db = db
def score(self, output: str, **ignored_kwargs: Any):
try:
            self.db.run(output)
return score_result.ScoreResult(
name=self.name, value=1, reason="Query ran successfully"
)
except Exception as e:
return score_result.ScoreResult(name=self.name, value=0, reason=str(e))
# Set up evaluation
valid_sql_query = ValidSQLQuery(name="valid_sql_query", db=db)
dataset = opik_client.get_dataset("synthetic_questions")
@track()
def llm_chain(input: str) -> str:
response = chain.invoke({"question": input})
return response
def evaluation_task(item):
response = llm_chain(item["question"])
return {"output": response}
# Run evaluation
res = evaluate(
experiment_name="SQL question answering",
dataset=dataset,
task=evaluation_task,
scoring_metrics=[valid_sql_query],
nb_samples=20,
)
```
The evaluation results are now uploaded to the Opik platform and can be viewed in the UI.
## Cost Tracking
The `OpikTracer` automatically tracks token usage and cost for all supported LLM models used within LangChain applications.
Cost information is automatically captured and displayed in the Opik UI, including:
* Token usage details
* Cost per request based on model pricing
* Total trace cost
## Practical Example: Classification Workflow
Let's walk through a real-world example of using LangGraph with Opik for a classification workflow. This example demonstrates how to create a graph with conditional routing and track its execution.
### Setting up the Environment
First, let's set up our environment with the necessary dependencies:
```python
import opik
# Configure Opik
opik.configure(use_local=False)
```
### Creating the LangGraph Workflow
We'll create a LangGraph workflow with 3 nodes that demonstrates conditional routing:
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Optional
# Define the graph state
class GraphState(TypedDict):
    question: Optional[str]
    classification: Optional[str]
    response: Optional[str]
# Create the node functions
def classify(question: str) -> str:
return "greeting" if question.startswith("Hello") else "search"
def classify_input_node(state):
question = state.get("question", "").strip()
classification = classify(question)
return {"classification": classification}
def handle_greeting_node(state):
return {"response": "Hello! How can I help you today?"}
def handle_search_node(state):
question = state.get("question", "").strip()
search_result = f"Search result for '{question}'"
return {"response": search_result}
# Create the workflow
workflow = StateGraph(GraphState)
workflow.add_node("classify_input", classify_input_node)
workflow.add_node("handle_greeting", handle_greeting_node)
workflow.add_node("handle_search", handle_search_node)
# Add conditional routing
def decide_next_node(state):
return (
"handle_greeting"
if state.get("classification") == "greeting"
else "handle_search"
)
workflow.add_conditional_edges(
"classify_input",
decide_next_node,
{"handle_greeting": "handle_greeting", "handle_search": "handle_search"},
)
workflow.set_entry_point("classify_input")
workflow.add_edge("handle_greeting", END)
workflow.add_edge("handle_search", END)
app = workflow.compile()
```
### Executing with Opik Tracing
Now let's execute the workflow with Opik tracing enabled using `track_langgraph`:
```python
from opik.integrations.langchain import OpikTracer, track_langgraph
# Create OpikTracer and track the graph once
# The graph visualization is automatically extracted by track_langgraph
opik_tracer = OpikTracer(
project_name="classification-workflow"
)
app = track_langgraph(app, opik_tracer)
# Execute the workflow - no callbacks needed!
inputs = {"question": "Hello, how are you?"}
result = app.invoke(inputs)
print(result)
# Test with a different input - still tracked automatically
inputs = {"question": "What is machine learning?"}
result = app.invoke(inputs)
print(result)
```
The graph execution is now logged on the Opik platform and can be viewed in the UI. The trace will show the complete execution path through the graph, including the classification decision and the chosen response path.
## Compatibility with Opik tracing context
LangGraph tracing integrates seamlessly with Opik's tracing context, allowing you to call `@track`-decorated functions (and most other native Opik integrations) from within your graph nodes and have them automatically attached to the trace tree.
### Synchronous execution (invoke)
For synchronous graph execution using `invoke()`, everything works out of the box. You can access current spans/traces from LangGraph nodes and call tracked functions inside them:
```python
from opik import opik_context
from opik import track
from opik.integrations.langchain import OpikTracer, track_langgraph
from langgraph.graph import StateGraph, START, END
@track
def process_data(value: int) -> int:
"""Custom tracked function that will be attached to the trace tree."""
return value * 2
def my_node(state):
current_trace_data = opik_context.get_current_trace_data()
current_span_data = opik_context.get_current_span_data() # will return the span for `my_node`, created by OpikTracer
# This tracked function call will automatically be part of the trace tree
result = process_data(state["value"])
return {"value": result}
# Build and execute graph
graph = StateGraph(dict)
graph.add_node("processor", my_node)
graph.add_edge(START, "processor")
graph.add_edge("processor", END)
app = graph.compile()
opik_tracer = OpikTracer()
app = track_langgraph(app, opik_tracer)
# Synchronous execution - tracked functions work automatically
result = app.invoke({"value": 21})
```
### Asynchronous execution (ainvoke)
For asynchronous graph execution using `ainvoke()`, you need to explicitly propagate the trace context to `@track`-decorated functions using the `extract_current_langgraph_span_data` helper:
## What gets traced
With this setup, your LiveKit agent will automatically trace:
* **Session events**: Session start and end with metadata
* **Agent turns**: Complete conversation turns with timing
* **LLM operations**: Model calls, prompts, responses, and token usage
* **Function tools**: Tool executions with inputs and outputs
* **TTS operations**: Text-to-speech conversions with audio metadata
* **STT operations**: Speech-to-text transcriptions
* **End-of-turn detection**: Conversation flow events
## Further improvements
If you have any questions or suggestions for improving the LiveKit Agents integration, please [open an issue](https://github.com/comet-ml/opik/issues/new/choose) on our GitHub repository.
# Observability for LlamaIndex with Opik
> Start here to integrate Opik into your LlamaIndex-based genai application for end-to-end LLM observability, unit testing, and optimization.
[LlamaIndex](https://github.com/run-llama/llama_index) is a flexible data framework for building LLM applications:
LlamaIndex is a "data framework" to help you build LLM apps. It provides the following tools:
* Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.).
* Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
* Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
* Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=llamaindex\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=llamaindex\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=llamaindex\&utm_campaign=opik) for more information.
## Getting Started
### Installation
To use the Opik integration with LlamaIndex, you'll need to have both the `opik` and `llama_index` packages installed. You can install them using pip:
```bash
pip install opik llama-index llama-index-agent-openai llama-index-llms-openai llama-index-callbacks-opik
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring LlamaIndex
In order to use LlamaIndex, you will need to configure your LLM provider API key. For this example, we'll use OpenAI. You can [find or create your OpenAI API key on this page](https://platform.openai.com/settings/organization/api-keys).
You can set it as an environment variable:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
```
Or set it programmatically:
```python
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```
## Using the Opik integration
To use the Opik integration with LlamaIndex, you can use the `set_global_handler` function from the LlamaIndex package to set the global tracer:
```python
from llama_index.core import global_handler, set_global_handler
set_global_handler("opik")
opik_callback_handler = global_handler
```
Now that the integration is set up, all the LlamaIndex runs will be traced and logged to Opik.
Alternatively, you can configure the callback handler directly for more control:
```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from opik.integrations.llama_index import LlamaIndexCallbackHandler
# Basic setup
opik_callback = LlamaIndexCallbackHandler()
# Or with optional parameters
opik_callback = LlamaIndexCallbackHandler(
project_name="my-llamaindex-project", # Set custom project name
skip_index_construction_trace=True # Skip tracking index construction
)
Settings.callback_manager = CallbackManager([opik_callback])
```
The `skip_index_construction_trace` parameter is useful when you want to track only query operations and not the index construction phase (particularly for large document sets or pre-built indexes).
## Example
To showcase the integration, we will create a query engine that uses Paul Graham's essays as the data source.
**First step:**
Configure the Opik integration:
```python
import os
from llama_index.core import global_handler, set_global_handler
# Set project name for better organization
os.environ["OPIK_PROJECT_NAME"] = "llamaindex-integration-demo"
set_global_handler("opik")
opik_callback_handler = global_handler
```
**Second step:**
Download the example data:
```python
import os
import requests
# Create directory if it doesn't exist
os.makedirs('./data/paul_graham/', exist_ok=True)
# Download the file using requests
url = 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt'
response = requests.get(url)
with open('./data/paul_graham/paul_graham_essay.txt', 'wb') as f:
f.write(response.content)
```
**Third step:**
Configure the OpenAI API key:
```python
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```
**Fourth step:**
We can now load the data, create an index and query engine:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
```
Now that the Opik integration has been set up, all of these traces are logged to the Opik platform.
## Using with the @track Decorator
The LlamaIndex integration seamlessly works with Opik's `@track` decorator. When you call LlamaIndex operations inside a tracked function, the LlamaIndex traces will automatically be attached as child spans to your existing trace.
```python
import opik
from llama_index.core import global_handler, set_global_handler
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage
# Configure Opik integration
set_global_handler("opik")
opik_callback_handler = global_handler
@opik.track()
def my_llm_application(user_query: str):
"""Process user query with LlamaIndex"""
llm = OpenAI(model="gpt-3.5-turbo")
messages = [
ChatMessage(role="system", content="You are a helpful assistant."),
ChatMessage(role="user", content=user_query),
]
response = llm.chat(messages)
return response.message.content
# Call the tracked function
result = my_llm_application("What is the capital of France?")
print(result)
```
In this example, Opik will create a trace for the `my_llm_application` function, and all LlamaIndex operations (like the LLM chat call) will appear as nested spans within this trace, giving you a complete view of your application's execution.
## Using with Manual Trace Creation
You can also manually create traces using `opik.start_as_current_trace()` and have LlamaIndex operations nested within:
```python
import opik
from llama_index.core import global_handler, set_global_handler
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage
# Configure Opik integration
set_global_handler("opik")
opik_callback_handler = global_handler
# Create a manual trace
with opik.start_as_current_trace(name="user_query_processing"):
llm = OpenAI(model="gpt-3.5-turbo")
messages = [
ChatMessage(role="user", content="Explain quantum computing in simple terms"),
]
response = llm.chat(messages)
print(response.message.content)
```
This approach is useful when you want more control over trace naming and want to group multiple LlamaIndex operations under a single trace.
## Tracking LlamaIndex Workflows
LlamaIndex workflows are multi-step processing pipelines for LLM applications. To track workflow executions in Opik, you can manually decorate your workflow steps and use `opik.start_as_current_span()` to wrap the workflow execution.
### Basic Workflow Tracking
You can use `@opik.track()` to decorate your workflow steps and `opik.start_as_current_span()` to track the workflow execution:
```python
import opik
from llama_index.core.workflow import Workflow, StartEvent, StopEvent, step, Event
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.core import global_handler, set_global_handler
# Configure Opik integration for LLM calls within steps
set_global_handler("opik")
class QueryEvent(Event):
"""Event for passing query through workflow."""
query: str
class MyRAGWorkflow(Workflow):
"""Simple RAG workflow with tracked steps."""
@step
@opik.track()
async def retrieve_context(self, ev: StartEvent) -> QueryEvent:
"""Retrieve relevant context for the query."""
query = ev.get("query", "")
# Your retrieval logic here
context = f"Context for: {query}"
return QueryEvent(query=f"{context} | {query}")
@step
@opik.track()
async def generate_response(self, ev: QueryEvent) -> StopEvent:
"""Generate final response using the context."""
# Your generation logic here
result = f"Response based on: {ev.query}"
return StopEvent(result=result)
# Create workflow instance
workflow = MyRAGWorkflow()
# Use start_as_current_span to track workflow execution
with opik.start_as_current_span(
name="rag_workflow_execution",
input={"query": "What are the key features?"},
project_name="llama-index-workflows"
) as span:
result = await workflow.run(query="What are the key features?")
span.update(output={"result": result})
print(result)
opik.flush_tracker() # Ensure all traces are sent
```
In this example:
* Each workflow step is decorated with `@opik.track()` to create spans
* The `@step` decorator is placed before `@opik.track()` to ensure LlamaIndex can properly discover the workflow steps
* `opik.start_as_current_span()` tracks the overall workflow execution
* LLM calls within steps are automatically tracked via the global Opik handler
* All workflow steps appear as nested spans within the workflow trace
## Getting started
To use the Microsoft Agent Framework integration with Opik, you will need to have the Agent Framework and the required OpenTelemetry packages installed:
```bash
pip install --pre agent-framework opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
```
In addition, you will need to set the following environment variables to configure OpenTelemetry to send data to Opik:
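The exact values depend on your deployment. As a rough sketch (the endpoint path and header names below are assumptions based on Opik's OpenTelemetry setup, so verify them against the Opik OpenTelemetry documentation for your deployment), you can set them from Python before the OpenTelemetry SDK is initialized:
```python
import os

# Assumed values for Opik Cloud; verify the endpoint and header names
# against the Opik OpenTelemetry documentation before using them.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://www.comet.com/opik/api/v1/private/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = (
    "Authorization=YOUR_OPIK_API_KEY,"
    "Comet-Workspace=YOUR_WORKSPACE,"
    "projectName=YOUR_PROJECT_NAME"
)
```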
## Logging threads
When you are running multi-turn conversations with OpenAI Agents using the [OpenAI Agents trace API](https://openai.github.io/openai-agents-python/running_agents/#conversationschat-threads), the Opik integration automatically uses the trace `group_id` as the Thread ID so you can easily review the conversation inside Opik. Here is an example:
```python
import asyncio
import uuid

from agents import Agent, Runner, trace

async def main():
agent = Agent(name="Assistant", instructions="Reply very concisely.")
thread_id = str(uuid.uuid4())
with trace(workflow_name="Conversation", group_id=thread_id):
# First turn
result = await Runner.run(agent, "What city is the Golden Gate Bridge in?")
print(result.final_output)
# San Francisco
# Second turn
new_input = result.to_input_list() + [{"role": "user", "content": "What state is it in?"}]
result = await Runner.run(agent, new_input)
print(result.final_output)
# California
# Run the conversation (assumes the Opik integration for OpenAI Agents is already configured)
asyncio.run(main())
```
## Further improvements
OpenAI Agents is still a relatively new framework, and we are working on a few improvements:
1. Improved rendering of the inputs and outputs for the LLM calls as part of our `Pretty Mode` functionality
2. Improving the naming conventions for spans
3. Adding the agent execution input and output at a trace level
If there are any additional improvements you would like us to make, feel free to open an issue on our [GitHub repository](https://github.com/comet-ml/opik/issues).
# Observability for Pipecat with Opik
> Start here to integrate Opik into your Pipecat-based real-time voice agent application for end-to-end LLM observability, unit testing, and optimization.
[Pipecat](https://github.com/pipecat-ai/pipecat) is an open-source Python framework for building real-time voice and multimodal conversational AI agents. Developed by Daily, it enables fully programmable AI voice agents and supports multimodal interactions, positioning itself as a flexible solution for developers looking to build conversational AI systems.
This guide explains how to integrate Opik with Pipecat for observability and tracing of real-time voice agents, enabling you to monitor, debug, and optimize your Pipecat agents in the Opik dashboard.
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=pipecat\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=pipecat\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=pipecat\&utm_campaign=opik) for more information.
## Getting started
To use the Pipecat integration with Opik, you will need to have Pipecat and the required OpenTelemetry packages installed:
```bash
pip install "pipecat-ai[daily,webrtc,silero,cartesia,deepgram,openai,tracing]" opentelemetry-exporter-otlp-proto-http websockets
```
## Advanced usage
You can reduce the amount of data logged to Opik by setting `capture_all` to `False`:
```python
import logfire
logfire.configure(
send_to_logfire=False,
)
logfire.instrument_httpx(capture_all=False)
```
When this parameter is set to `False`, we will not log the exact request made
to the LLM provider.
## Further improvements
If you would like to see us improve this integration, simply open a new feature
request on [GitHub](https://github.com/comet-ml/opik/issues).
# Observability for Semantic Kernel (Python) with Opik
> Start here to integrate Opik into your Semantic Kernel-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Semantic Kernel](https://github.com/microsoft/semantic-kernel) is a powerful open-source SDK from Microsoft. It facilitates the combination of LLMs with popular programming languages like C#, Python, and Java. Semantic Kernel empowers developers to build sophisticated AI applications by seamlessly integrating AI services, data sources, and custom logic, accelerating the delivery of enterprise-grade AI solutions.
Learn more about Semantic Kernel in the [official documentation](https://learn.microsoft.com/en-us/semantic-kernel/overview/).

## Getting started
To use the Semantic Kernel integration with Opik, you will need to have Semantic Kernel and the required OpenTelemetry packages installed:
```bash
pip install semantic-kernel opentelemetry-exporter-otlp-proto-http
```
## Environment configuration
Configure your environment variables based on your Opik deployment:
## Further improvements
If you would like to see us improve this integration, simply open a new feature
request on [GitHub](https://github.com/comet-ml/opik/issues).
# Observability for Strands Agents with Opik
> Start here to integrate Opik into your Strands Agents-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Strands Agents](https://github.com/strands-agents/sdk-python) is a simple yet powerful SDK that takes a model-driven approach to building and running AI agents.
The framework's primary advantage is its ability to scale from simple conversational assistants to complex autonomous workflows, supporting both local development and production deployment with built-in observability.
After running your Strands Agents workflow with the OpenTelemetry configuration, you'll see detailed traces in the Opik UI showing agent interactions, model calls, and conversation flows as demonstrated in the screenshot above.
## Getting started
To use the Strands Agents integration with Opik, you will need to have Strands Agents and the required OpenTelemetry packages installed:
```bash
pip install --upgrade "strands-agents" "strands-agents-tools" opentelemetry-sdk opentelemetry-exporter-otlp
```
In addition, you will need to set the following environment variables to
configure the OpenTelemetry integration:
## Advanced Usage
### Using with the `@track` decorator
If you have multiple steps in your LLM pipeline, you can use the `@track` decorator to log the traces for each step. If Anthropic is called within one of these steps, the LLM call will be associated with that corresponding step:
```python
import os

import anthropic
from opik import track
from opik.integrations.anthropic import track_anthropic
os.environ["OPIK_PROJECT_NAME"] = "anthropic-integration-demo"
anthropic_client = anthropic.Anthropic()
anthropic_client = track_anthropic(anthropic_client)
@track
def generate_story(prompt):
res = anthropic_client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
)
return res.content[0].text
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
res = anthropic_client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
)
return res.content[0].text
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
# Execute the multi-step pipeline
generate_opik_story()
```
The trace can now be viewed in the UI with hierarchical spans showing the relationship between different steps:
## Cost Tracking
The `track_anthropic` wrapper automatically tracks token usage and cost for all supported Anthropic models.
Cost information is automatically captured and displayed in the Opik UI, including:
* Token usage details
* Cost per request based on Anthropic pricing
* Total trace cost
### Invoke Model API (Model-Specific Formats)
The Invoke Model API uses model-specific request and response formats. Here are examples for different providers:
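For instance, with the Anthropic Claude request format on Bedrock (a sketch that assumes a Bedrock runtime client wrapped with `track_bedrock`; the model ID and region are placeholders), an Invoke Model call looks like this:
```python
import json

import boto3
from opik.integrations.bedrock import track_bedrock

# Wrap the Bedrock runtime client so calls are logged to Opik.
bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")
bedrock_client = track_bedrock(bedrock_client)

# Anthropic Claude models on Bedrock expect the Messages API body format.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Why is LLM observability important?"}],
}

response = bedrock_client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read()))
```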
### Invoke Model Stream API
The `invoke_model_with_response_stream` method supports streaming with model-specific formats:
## Cost Tracking
The `track_bedrock` wrapper automatically tracks token usage and cost for all supported AWS Bedrock models, regardless of whether you use the Converse API or the Invoke Model API.
## Using Cohere within a tracked function
If you are using Cohere within a function tracked with the [`@track`](/tracing/log_traces#using-function-decorators) decorator, you can use the tracked client as normal:
```python
from opik import track
from opik.integrations.openai import track_openai
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ.get("COHERE_API_KEY"),
base_url="https://api.cohere.ai/compatibility/v1"
)
tracked_client = track_openai(client)
@track
def generate_story(prompt):
response = tracked_client.chat.completions.create(
model="command-r7b-12-2024",
messages=[
{"role": "user", "content": prompt}
]
)
return response.choices[0].message.content
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
response = tracked_client.chat.completions.create(
model="command-r7b-12-2024",
messages=[
{"role": "user", "content": prompt}
]
)
return response.choices[0].message.content
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
generate_opik_story()
```
## Supported Cohere models
The `track_openai` wrapper with Cohere's compatibility API supports the following Cohere models:
* `command-r7b-12-2024` - Command R 7B model
* `command-r-plus` - Command R Plus model
* `command-r` - Command R model
* `command-light` - Command Light model
* `command` - Command model
## Supported OpenAI methods
The `track_openai` wrapper supports the following OpenAI methods when used with Cohere:
* `client.chat.completions.create()`, including support for stream=True mode
* `client.beta.chat.completions.parse()`
* `client.beta.chat.completions.stream()`
* `client.responses.create()`
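For instance, streaming works with the wrapped client exactly as it does with the plain OpenAI client. Here is a minimal sketch that reuses the `tracked_client` created above:
```python
# Stream a completion through the tracked Cohere-compatible client.
stream = tracked_client.chat.completions.create(
    model="command-r7b-12-2024",
    messages=[{"role": "user", "content": "Write one sentence about Opik."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content delta, so guard before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```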
If you would like to track another OpenAI method, please let us know by opening an issue on [GitHub](https://github.com/comet-ml/opik/issues).
# Observability for DeepSeek with Opik
> Start here to integrate Opik into your DeepSeek-based genai application for end-to-end LLM observability, unit testing, and optimization.
DeepSeek is an open-source LLM that rivals OpenAI's o1. You can learn more about DeepSeek on [GitHub](https://github.com/deepseek-ai/DeepSeek-R1) or
on [deepseek.com](https://www.deepseek.com/).
In this guide, we will showcase how to track DeepSeek calls using Opik. As DeepSeek is open-source, there are many ways to run and call the model. We will focus on how to integrate Opik with the following hosting options:
1. DeepSeek API
2. Fireworks AI API
3. Together AI API
## Getting started
### Configuring your hosting provider
Before you can start tracking DeepSeek calls, you need to get the API key from your hosting provider.
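As an example of the first option, you can reuse Opik's OpenAI integration, since the DeepSeek API is OpenAI-compatible. The sketch below assumes your key is available as a `DEEPSEEK_API_KEY` environment variable:
```python
import os

from openai import OpenAI
from opik.integrations.openai import track_openai

# Point the OpenAI client at DeepSeek's OpenAI-compatible endpoint and
# wrap it with track_openai so every call is logged to Opik.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)
deepseek_client = track_openai(client)

response = deepseek_client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Why is tracking LLM calls important?"}],
)
print(response.choices[0].message.content)
```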
## Using with VertexAI
To use Opik with VertexAI, configure the `google-genai` client for VertexAI and wrap it with `track_genai`:
```python
import os

from google import genai
from opik.integrations.genai import track_genai
# Configure for VertexAI
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
vertexai_client = track_genai(client)
# Set project name for organization
os.environ["OPIK_PROJECT_NAME"] = "vertexai-integration-demo"
# Use the wrapped client
response = vertexai_client.models.generate_content(
model="gemini-2.0-flash-001",
contents="Write a short story about AI observability."
)
print(response.text)
```
## Advanced Usage
### Using with the `@track` decorator
If you have multiple steps in your LLM pipeline, you can use the `@track` decorator to log the traces for each step. If Gemini is called within one of these steps, the LLM call will be associated with that corresponding step:
```python
from opik import track
@track
def generate_story(prompt):
response = gemini_client.models.generate_content(
model="gemini-2.0-flash-001", contents=prompt
)
return response.text
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
response = gemini_client.models.generate_content(
model="gemini-2.0-flash-001", contents=prompt
)
return response.text
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
# Execute the multi-step pipeline
generate_opik_story()
```
The trace can now be viewed in the UI with hierarchical spans showing the relationship between different steps:
## Cost Tracking
The `track_genai` wrapper automatically tracks token usage and cost for all supported Google AI models.
Cost information is automatically captured and displayed in the Opik UI, including:
* Token usage details
* Cost per request based on Google AI pricing
* Total trace cost
## Advanced Usage
### Using with the `@track` decorator
If you are using LiteLLM within a function tracked with the [`@track`](/tracing/log_traces#using-function-decorators) decorator, you will need to pass the `current_span_data` as metadata to the `litellm.completion` call:
```python
from opik import track
from opik.opik_context import get_current_span_data
import litellm
@track
def generate_story(prompt):
response = litellm.completion(
model="groq/llama3-8b-8192",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
response = litellm.completion(
model="groq/llama3-8b-8192",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
# Execute the multi-step pipeline
generate_opik_story()
```
# Observability for Mistral AI with Opik
> Start here to integrate Opik into your Mistral AI-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Mistral AI](https://mistral.ai/) provides cutting-edge large language models with excellent performance for text generation, reasoning, and specialized tasks like code generation.
This guide explains how to integrate Opik with Mistral AI via LiteLLM. By using the LiteLLM integration provided by Opik, you can easily track and evaluate your Mistral API calls within your Opik projects as Opik will automatically log the input prompt, model used, token usage, and response generated.
## Getting Started
### Configuring Opik
To start tracking your Mistral AI LLM calls, you'll need to have both `opik` and `litellm` installed. You can install them using pip:
```bash
pip install opik litellm
```
In addition, you can configure Opik using the `opik configure` command, which will prompt you for your local server address or, if you are using the Cloud platform, your API key:
```bash
opik configure
```
### Configuring Mistral AI
You'll need to set your Mistral AI API key as an environment variable:
```bash
export MISTRAL_API_KEY="YOUR_API_KEY"
```
## Logging LLM calls
To log LLM calls to Opik, create the OpikLogger callback and add it to LiteLLM. You can then make calls to LiteLLM as you normally would:
```python
from litellm.integrations.opik.opik import OpikLogger
import litellm
opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]
response = litellm.completion(
model="mistral/mistral-large-2407",
messages=[
{"role": "user", "content": "Why is tracking and evaluation of LLMs important?"}
]
)
```
## Logging LLM calls within a tracked function
If you are using LiteLLM within a function tracked with the [`@track`](/tracing/log_traces#using-function-decorators) decorator, you will need to pass the `current_span_data` as metadata to the `litellm.completion` call:
```python
from opik import track, opik_context
import litellm
@track
def generate_story(prompt):
response = litellm.completion(
model="mistral/mistral-large-2407",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": opik_context.get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
response = litellm.completion(
model="mistral/mistral-medium-2312",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": opik_context.get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
generate_opik_story()
```
# Observability for Novita AI with Opik
> Start here to integrate Opik into your Novita AI-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Novita AI](https://novita.ai/) is an AI cloud platform that helps developers easily deploy AI models through a simple API, backed by affordable and reliable GPU cloud infrastructure. It provides access to a wide range of models including DeepSeek, Qwen, Llama, and other popular LLMs.
This guide explains how to integrate Opik with Novita AI via LiteLLM. By using the LiteLLM integration provided by Opik, you can easily track and evaluate your Novita AI API calls within your Opik projects as Opik will automatically log the input prompt, model used, token usage, and response generated.
## Getting Started
### Configuring Opik
To get started, you need to configure Opik to send traces to your Comet project. You can do this by setting the `OPIK_PROJECT_NAME` and `OPIK_WORKSPACE` environment variables:
```bash
export OPIK_PROJECT_NAME="your-project-name"
export OPIK_WORKSPACE="your-workspace-name"
```
You can also call the `opik.configure` method:
```python
import opik
opik.configure(
project_name="your-project-name",
workspace="your-workspace-name",
)
```
### Configuring LiteLLM
Install the required packages:
```bash
pip install opik litellm
```
Create a LiteLLM configuration file (e.g., `litellm_config.yaml`):
```yaml
model_list:
- model_name: deepseek-r1-turbo
litellm_params:
model: novita/deepseek/deepseek-r1-turbo
api_key: os.environ/NOVITA_API_KEY
- model_name: qwen-32b-fp8
litellm_params:
model: novita/qwen/qwen3-32b-fp8
api_key: os.environ/NOVITA_API_KEY
- model_name: llama-70b-instruct
litellm_params:
model: novita/meta-llama/llama-3.1-70b-instruct
api_key: os.environ/NOVITA_API_KEY
litellm_settings:
callbacks: ["opik"]
```
### Authentication
Set your Novita AI API key as an environment variable:
```bash
export NOVITA_API_KEY="your-novita-api-key"
```
You can obtain a Novita AI API key from the [Novita AI dashboard](https://novita.ai/settings).
## Usage
### Using LiteLLM Proxy Server
Start the LiteLLM proxy server:
```bash
litellm --config litellm_config.yaml
```
Use the proxy server to make requests:
```python
import openai
client = openai.OpenAI(
api_key="anything", # can be anything
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="deepseek-r1-turbo",
messages=[
{"role": "user", "content": "What are the advantages of using cloud-based AI platforms?"}
]
)
print(response.choices[0].message.content)
```
### Direct Integration
You can also use LiteLLM directly in your Python code:
```python
import os
from litellm import completion
# Configure Opik
import opik
opik.configure()
# Configure LiteLLM for Opik
from litellm.integrations.opik.opik import OpikLogger
import litellm
litellm.callbacks = ["opik"]
os.environ["NOVITA_API_KEY"] = "your-novita-api-key"
response = completion(
model="novita/deepseek/deepseek-r1-turbo",
messages=[
{"role": "user", "content": "How can cloud AI platforms improve development efficiency?"}
]
)
print(response.choices[0].message.content)
```
## Supported Models
Novita AI provides access to a comprehensive catalog of models from leading providers. Some of the popular models available include:
* **DeepSeek Models**: `deepseek-r1-turbo`, `deepseek-v3-turbo`, `deepseek-v3-0324`
* **Qwen Models**: `qwen3-235b-a22b-fp8`, `qwen3-30b-a3b-fp8`, `qwen3-32b-fp8`
* **Llama Models**: `llama-4-maverick-17b-128e-instruct-fp8`, `llama-3.3-70b-instruct`, `llama-3.1-70b-instruct`
* **Mistral Models**: `mistral-nemo`
* **Google Models**: `gemma-3-27b-it`
For the complete list of available models, visit the [Novita AI model catalog](https://novita.ai/models/llm).
## Advanced Features
### Tool Calling
Novita AI supports function calling with compatible models:
```python
from litellm import completion
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}
]
response = completion(
model="novita/deepseek/deepseek-r1-turbo",
messages=[{"role": "user", "content": "What's the weather like in Boston today?"}],
tools=tools,
)
```
### JSON Mode
For structured outputs, you can enable JSON mode:
```python
response = completion(
model="novita/deepseek/deepseek-r1-turbo",
messages=[
{"role": "user", "content": "List 5 popular cookie recipes."}
],
response_format={"type": "json_object"}
)
```
## Feedback Scores and Evaluation
Once your Novita AI calls are logged with Opik, you can evaluate your LLM application using Opik's evaluation framework:
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination
# Define your evaluation task
def evaluation_task(x):
return {
"message": x["message"],
"output": x["output"],
"reference": x["reference"]
}
# Create the Hallucination metric
hallucination_metric = Hallucination()
# Run the evaluation
evaluation_results = evaluate(
experiment_name="novita-ai-evaluation",
dataset=your_dataset,
task=evaluation_task,
scoring_metrics=[hallucination_metric],
)
```
## Environment Variables
Make sure to set the following environment variables:
```bash
# Novita AI Configuration
export NOVITA_API_KEY="your-novita-api-key"
# Opik Configuration
export OPIK_PROJECT_NAME="your-project-name"
export OPIK_WORKSPACE="your-workspace-name"
```
# Observability for Ollama with Opik
> Start here to integrate Opik into your Ollama-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Ollama](https://ollama.com/) allows users to run, interact with, and deploy AI models locally on their machines without the need for complex infrastructure or cloud dependencies.
There are multiple ways to interact with Ollama from Python, including the [ollama python package](https://pypi.org/project/ollama/), [LangChain](https://python.langchain.com/docs/integrations/providers/ollama/), and the [OpenAI library](https://docs.ollama.com/api/openai-compatibility#openai-python-library). We will cover how to trace your LLM calls for each of these methods.
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=ollama\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=ollama\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=ollama\&utm_campaign=opik) for more information.
## Getting started
### Configure Ollama
Before starting, you will need to have an Ollama instance running. You can install Ollama by following the [quickstart guide](https://github.com/ollama/ollama/blob/main/README.md#quickstart) which will automatically start the Ollama API server. If the Ollama server is not running, you can start it using `ollama serve`.
Once Ollama is running, you can download the llama3.1 model by running `ollama pull llama3.1`. For a full list of models available on Ollama, please refer to the [Ollama library](https://ollama.com/library).
### Installation
You will also need to have Opik installed. You can install it by running:
```bash
pip install opik
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
## Tracking Ollama calls made with Ollama Python Package
To get started you will need to install the Ollama Python package:
```bash
pip install --quiet --upgrade ollama
```
We will then utilize the `track` decorator to log all the traces to Opik:
```python
import ollama
from opik import track, opik_context
@track(tags=['ollama', 'python-library'])
def ollama_llm_call(user_message: str):
# Create the Ollama model
response = ollama.chat(model='llama3.1', messages=[
{
'role': 'user',
'content': user_message,
},
])
opik_context.update_current_span(
metadata={
'model': response['model'],
'eval_duration': response['eval_duration'],
'load_duration': response['load_duration'],
'prompt_eval_duration': response['prompt_eval_duration'],
'prompt_eval_count': response['prompt_eval_count'],
'done': response['done'],
'done_reason': response['done_reason'],
},
usage={
'completion_tokens': response['eval_count'],
'prompt_tokens': response['prompt_eval_count'],
'total_tokens': response['eval_count'] + response['prompt_eval_count']
}
)
return response['message']
ollama_llm_call("Say this is a test")
```
The trace will now be displayed in the Opik platform.
## Tracking Ollama calls made with OpenAI
Ollama is compatible with the OpenAI format and can be used with the OpenAI Python library. You can therefore leverage the Opik integration for OpenAI to trace your Ollama calls:
```python
from openai import OpenAI
from opik.integrations.openai import track_openai
import os
os.environ["OPIK_PROJECT_NAME"] = "ollama-integration"
# Create an OpenAI client
client = OpenAI(
base_url='http://localhost:11434/v1/',
# required but ignored
api_key='ollama',
)
# Log all traces made with the OpenAI client to Opik
client = track_openai(client)
# call the local ollama model using the OpenAI client
chat_completion = client.chat.completions.create(
messages=[
{
'role': 'user',
'content': 'Say this is a test',
}
],
model='llama3.1',
)
print(chat_completion.choices[0].message.content)
```
The local LLM call is now traced and logged to Opik.
## Tracking Ollama calls made with LangChain
In order to trace Ollama calls made with LangChain, you will need to first install the `langchain-ollama` package:
```bash
pip install --quiet --upgrade langchain-ollama langchain
```
You will now be able to use the `OpikTracer` class to log all your Ollama calls made with LangChain to Opik:
```python
from langchain_ollama import ChatOllama
from opik.integrations.langchain import OpikTracer
# Create the Opik tracer
opik_tracer = OpikTracer(tags=["langchain", "ollama"])
# Create the Ollama model and configure it to use the Opik tracer
llm = ChatOllama(
model="llama3.1",
temperature=0,
).with_config({"callbacks": [opik_tracer]})
# Call the Ollama model
messages = [
(
"system",
"You are a helpful assistant that translates English to French. Translate the user sentence.",
),
(
"human",
"I love programming.",
),
]
ai_msg = llm.invoke(messages)
ai_msg
```
You can now go to the Opik app to see the trace:
# Observability for OpenAI (Python) with Opik
> Start here to integrate Opik into your OpenAI-based genai application for end-to-end LLM observability, unit testing, and optimization.
## Advanced Usage
### Using with the `@track` decorator
If you have multiple steps in your LLM pipeline, you can use the `@track` decorator to log the traces for each step. If OpenAI is called within one of these steps, the LLM call will be associated with that corresponding step:
```python
import os

from opik import track
from opik.integrations.openai import track_openai
from openai import OpenAI
os.environ["OPIK_PROJECT_NAME"] = "openai-integration-demo"
client = OpenAI()
openai_client = track_openai(client)
@track
def generate_story(prompt):
res = openai_client.chat.completions.create(
model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
)
return res.choices[0].message.content
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
res = openai_client.chat.completions.create(
model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
)
return res.choices[0].message.content
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
# Execute the multi-step pipeline
generate_opik_story()
```
The trace can now be viewed in the UI with hierarchical spans showing the relationship between different steps:
## Using Azure OpenAI
The OpenAI integration also supports Azure OpenAI Services. To use Azure OpenAI, initialize your client with Azure configuration and use it with `track_openai` just like the standard OpenAI client:
```python
from opik.integrations.openai import track_openai
from openai import AzureOpenAI
# gets the API Key from environment variable AZURE_OPENAI_API_KEY
azure_client = AzureOpenAI(
# https://learn.microsoft.com/azure/ai-services/openai/reference#rest-api-versioning
api_version="2023-07-01-preview",
# https://learn.microsoft.com/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
azure_endpoint="https://example-endpoint.openai.azure.com",
)
azure_client = track_openai(azure_client)
completion = azure_client.chat.completions.create(
model="deployment-name", # e.g. gpt-35-instant
messages=[
{
"role": "user",
"content": "How do I output all files in a directory using Python?",
},
],
)
```
## Cost Tracking
The `track_openai` wrapper automatically tracks token usage and cost for all supported OpenAI models.
Cost information is automatically captured and displayed in the Opik UI, including:
* Token usage details
* Cost per request based on OpenAI pricing
* Total trace cost
## Advanced Usage
### SequentialChain Example
Now, let's create a more complex chain and run it with Opik tracing:
```python
from langchain.chains import LLMChain, SimpleSequentialChain
from langchain_core.prompts import PromptTemplate
# Synopsis chain
template = """You are a playwright. Given the title of play, it is your job to write a synopsis for that title.
Title: {title}
Playwright: This is a synopsis for the above play:"""
prompt_template = PromptTemplate(input_variables=["title"], template=template)
synopsis_chain = LLMChain(llm=model, prompt=prompt_template)
# Review chain
template = """You are a play critic from the New York Times. Given the synopsis of play, it is your job to write a review for that play.
Play Synopsis:
{synopsis}
Review from a New York Times play critic of the above play:"""
prompt_template = PromptTemplate(input_variables=["synopsis"], template=template)
review_chain = LLMChain(llm=model, prompt=prompt_template)
# Overall chain
overall_chain = SimpleSequentialChain(
chains=[synopsis_chain, review_chain], verbose=True
)
# Run the chain with Opik tracing
review = overall_chain.run("Tragedy at sunset on the beach", callbacks=[opik_tracer])
print(review)
```
### Accessing Logged Traces
We can access the trace IDs collected by the Opik tracer:
```python
traces = opik_tracer.created_traces()
print("Collected trace IDs:", [trace.id for trace in traces])
# Flush traces to ensure all data is logged
opik_tracer.flush()
```
### Fine-tuned LLM Example
Finally, let's use a fine-tuned model with Opik tracing:
**Note:** In order to use a fine-tuned model, you will need to have access to the model and the correct model ID. The code below will return a `NotFoundError` unless the `model` and `adapter_id` are updated.
```python
import os

# Predibase LLM wrapper from langchain_community (adjust the import to your LangChain setup)
from langchain_community.llms import Predibase

fine_tuned_model = Predibase(
model="my-base-LLM",
predibase_api_key=os.environ.get("PREDIBASE_API_TOKEN"),
predibase_sdk_version=None,
adapter_id="my-finetuned-adapter-id",
adapter_version=1,
**{
"api_token": os.environ.get("HUGGING_FACE_HUB_TOKEN"),
"max_new_tokens": 5,
},
)
# Configure the Opik tracer
fine_tuned_model = fine_tuned_model.with_config({"callbacks": [opik_tracer]})
# Invoke the fine-tuned model
response = fine_tuned_model.invoke(
"Can you help categorize the following emails into positive, negative, and neutral?",
**{"temperature": 0.5, "max_new_tokens": 1024},
)
print(response)
# Final flush to ensure all traces are logged
opik_tracer.flush()
```
## Tracking your fine-tuning training runs
If you are using Predibase to fine-tune an LLM, we recommend using Predibase's integration with Comet's Experiment Management functionality. You can learn more about how to set this up in the [Comet integration guide](https://docs.predibase.com/user-guide/integrations/comet) in the Predibase documentation. If you are already using another experiment tracking platform, it is worth checking whether it has an integration with Predibase.
# Observability for Together AI with Opik
> Start here to integrate Opik into your Together AI-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Together AI](https://www.together.ai/) provides fast inference for leading open-source models including Llama, Mistral, Qwen, and many others.
This guide explains how to integrate Opik with Together AI via LiteLLM. By using the LiteLLM integration provided by Opik, you can easily track and evaluate your Together AI calls within your Opik projects as Opik will automatically log the input prompt, model used, token usage, and response generated.
## Getting Started
### Configuring Opik
To start tracking your Together AI calls, you'll need to have both `opik` and `litellm` installed. You can install them using pip:
```bash
pip install opik litellm
```
In addition, you can configure Opik using the `opik configure` command, which will prompt you for your local server address or, if you are using the Cloud platform, your API key:
```bash
opik configure
```
### Configuring Together AI
You'll need to set your Together AI API key as an environment variable:
```bash
export TOGETHER_API_KEY="YOUR_API_KEY"
```
## Logging LLM calls
To log LLM calls to Opik, create the OpikLogger callback and add it to LiteLLM. You can then make calls to LiteLLM as you normally would:
```python
from litellm.integrations.opik.opik import OpikLogger
import litellm
opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]
response = litellm.completion(
model="together_ai/meta-llama/Llama-3.2-3B-Instruct-Turbo",
messages=[
{"role": "user", "content": "Why is tracking and evaluation of LLMs important?"}
]
)
```
## Logging LLM calls within a tracked function
If you are using LiteLLM within a function tracked with the [`@track`](/tracing/log_traces#using-function-decorators) decorator, you will need to pass the `current_span_data` as metadata to the `litellm.completion` call:
```python
from opik import track, opik_context
import litellm
@track
def generate_story(prompt):
response = litellm.completion(
model="together_ai/meta-llama/Llama-3.2-3B-Instruct-Turbo",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": opik_context.get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
response = litellm.completion(
model="together_ai/meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": opik_context.get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
generate_opik_story()
```
# Observability for IBM watsonx with Opik
> Start here to integrate Opik into your IBM watsonx-based genai application for end-to-end LLM observability, unit testing, and optimization.
[watsonx](https://www.ibm.com/products/watsonx-ai) is a next-generation enterprise studio for AI builders to train, validate, tune, and deploy AI models.
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=watsonx\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=watsonx\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=watsonx\&utm_campaign=opik) for more information.
## Getting Started
### Installation
To start tracking your watsonx LLM calls, you can use our [LiteLLM integration](/integrations/litellm). You'll need to have both the `opik` and `litellm` packages installed. You can install them using pip:
```bash
pip install opik litellm
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
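Once Opik is configured, you can log watsonx calls through LiteLLM's Opik callback. The snippet below is a minimal sketch that mirrors the pattern used by the other LiteLLM-based integrations in this guide; it assumes your watsonx credentials are already configured for LiteLLM:
```python
from litellm.integrations.opik.opik import OpikLogger
import litellm

# Register the Opik callback so LiteLLM calls are logged to Opik.
opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]

response = litellm.completion(
    model="watsonx/ibm/granite-13b-chat-v2",
    messages=[
        {"role": "user", "content": "Why is tracking and evaluation of LLMs important?"}
    ],
)
```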
## Advanced Usage
### Using with the `@track` decorator
If you have multiple steps in your LLM pipeline, you can use the `@track` decorator to log the traces for each step. If WatsonX is called within one of these steps, the LLM call will be associated with that corresponding step:
```python
from opik import track
from opik.opik_context import get_current_span_data
import litellm
@track
def generate_story(prompt):
response = litellm.completion(
model="watsonx/ibm/granite-13b-chat-v2",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
response = litellm.completion(
model="watsonx/ibm/granite-13b-chat-v2",
messages=[{"role": "user", "content": prompt}],
metadata={
"opik": {
"current_span_data": get_current_span_data(),
},
},
)
return response.choices[0].message.content
@track
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
# Execute the multi-step pipeline
generate_opik_story()
```
# Observability for xAI Grok with Opik
> Start here to integrate Opik into your xAI Grok-based genai application for end-to-end LLM observability, unit testing, and optimization.
[xAI](https://x.ai/) is an AI company founded by Elon Musk that develops the Grok series of large language models. Grok models are designed to have access to real-time information and are built with a focus on truthfulness, competence, and maximum benefit to humanity.
This guide explains how to integrate Opik with xAI Grok via LiteLLM. By using the LiteLLM integration provided by Opik, you can easily track and evaluate your xAI API calls within your Opik projects as Opik will automatically log the input prompt, model used, token usage, and response generated.
## Getting Started
### Configuring Opik
To get started, you need to configure Opik to send traces to your Comet project. You can do this by setting the `OPIK_PROJECT_NAME` and `OPIK_WORKSPACE` environment variables:
```bash
export OPIK_PROJECT_NAME="your-project-name"
export OPIK_WORKSPACE="your-workspace-name"
```
You can also call the `opik.configure` method:
```python
import opik
opik.configure(
project_name="your-project-name",
workspace="your-workspace-name",
)
```
### Configuring LiteLLM
Install the required packages:
```bash
pip install opik litellm
```
Create a LiteLLM configuration file (e.g., `litellm_config.yaml`):
```yaml
model_list:
- model_name: grok-beta
litellm_params:
model: xai/grok-beta
api_key: os.environ/XAI_API_KEY
- model_name: grok-vision-beta
litellm_params:
model: xai/grok-vision-beta
api_key: os.environ/XAI_API_KEY
litellm_settings:
callbacks: ["opik"]
```
### Authentication
Set your xAI API key as an environment variable:
```bash
export XAI_API_KEY="your-xai-api-key"
```
You can obtain an xAI API key from the [xAI Console](https://console.x.ai/).
## Usage
### Using LiteLLM Proxy Server
Start the LiteLLM proxy server:
```bash
litellm --config litellm_config.yaml
```
Use the proxy server to make requests:
```python
import openai
client = openai.OpenAI(
api_key="anything", # can be anything
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="grok-beta",
messages=[
{"role": "user", "content": "What are the latest developments in AI technology?"}
]
)
print(response.choices[0].message.content)
```
### Direct Integration
You can also use LiteLLM directly in your Python code:
```python
import os
from litellm import completion
# Configure Opik
import opik
opik.configure()
# Configure LiteLLM for Opik
from litellm.integrations.opik.opik import OpikLogger
import litellm
litellm.callbacks = ["opik"]
os.environ["XAI_API_KEY"] = "your-xai-api-key"
response = completion(
model="xai/grok-beta",
messages=[
{"role": "user", "content": "What is the current state of renewable energy adoption worldwide?"}
]
)
print(response.choices[0].message.content)
```
## Supported Models
xAI provides access to several Grok model variants:
* **Grok Beta**: `grok-beta` - The main conversational AI model with real-time information access
* **Grok Vision Beta**: `grok-vision-beta` - Multimodal model capable of processing text and images
* **Grok Mini**: `grok-mini` - A smaller, faster variant optimized for simpler tasks
For the most up-to-date list of available models, visit the [xAI API documentation](https://docs.x.ai/).
## Real-time Information Access
One of Grok's key features is its ability to access real-time information. This makes it particularly useful for questions about current events:
```python
response = completion(
model="xai/grok-beta",
messages=[
{"role": "user", "content": "What are the latest news headlines today?"}
]
)
print(response.choices[0].message.content)
```
## Vision Capabilities
Grok Vision Beta can process both text and images:
```python
from litellm import completion
response = completion(
model="xai/grok-vision-beta",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What do you see in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
]
)
print(response.choices[0].message.content)
```
## Function Calling
Grok models support function calling for enhanced capabilities:
```python
tools = [
{
"type": "function",
"function": {
"name": "get_current_time",
"description": "Get the current time in a specific timezone",
"parameters": {
"type": "object",
"properties": {
"timezone": {
"type": "string",
"description": "The timezone to get the time for",
}
},
"required": ["timezone"],
},
},
}
]
response = completion(
model="xai/grok-beta",
messages=[{"role": "user", "content": "What time is it in Tokyo right now?"}],
tools=tools,
)
```
## Advanced Features
### Temperature and Creativity Control
Control the creativity of Grok's responses:
```python
# More creative responses
response = completion(
model="xai/grok-beta",
messages=[{"role": "user", "content": "Write a creative story about space exploration"}],
temperature=0.9,
max_tokens=1000
)
# More factual responses
response = completion(
model="xai/grok-beta",
messages=[{"role": "user", "content": "Explain quantum computing"}],
temperature=0.1,
max_tokens=500
)
```
### System Messages for Behavior Control
Use system messages to guide Grok's behavior:
```python
response = completion(
model="xai/grok-beta",
messages=[
{"role": "system", "content": "You are a helpful scientific advisor. Provide accurate, evidence-based information."},
{"role": "user", "content": "What are the current challenges in fusion energy research?"}
]
)
```
## Feedback Scores and Evaluation
Once your xAI calls are logged with Opik, you can evaluate your LLM application using Opik's evaluation framework:
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination
# Define your evaluation task
def evaluation_task(x):
return {
"message": x["message"],
"output": x["output"],
"reference": x["reference"]
}
# Create the Hallucination metric
hallucination_metric = Hallucination()
# Run the evaluation
evaluation_results = evaluate(
experiment_name="xai-grok-evaluation",
dataset=your_dataset,
task=evaluation_task,
scoring_metrics=[hallucination_metric],
)
```
## Environment Variables
Make sure to set the following environment variables:
```bash
# xAI Configuration
export XAI_API_KEY="your-xai-api-key"
# Opik Configuration
export OPIK_PROJECT_NAME="your-project-name"
export OPIK_WORKSPACE="your-workspace-name"
```
# Observability for Gretel with Opik
> Start here to integrate Opik into your Gretel-based genai application for end-to-end LLM observability, unit testing, and optimization.
[Gretel (NVIDIA)](https://gretel.ai/) is a synthetic data platform that enables you to generate high-quality, privacy-safe datasets for AI model training and evaluation.
This guide explains how to integrate Opik with Gretel to create synthetic Q\&A datasets and import them into Opik for model evaluation and optimization.
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=gretel\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=gretel\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=gretel\&utm_campaign=opik) for more information.
## Getting Started
### Installation
To use Gretel with Opik, you'll need to have both the `gretel-client` and `opik` packages installed:
```bash
pip install gretel-client opik pandas
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring Gretel
In order to configure Gretel, you will need to have your Gretel API Key. You can create and manage your Gretel API Keys on [this page](https://console.gretel.ai/account/api-keys).
You can set it as an environment variable:
```bash
export GRETEL_API_KEY="YOUR_API_KEY"
```
Or set it programmatically:
```python
import os
import getpass
if "GRETEL_API_KEY" not in os.environ:
os.environ["GRETEL_API_KEY"] = getpass.getpass("Enter your Gretel API key: ")
# Set project name for organization
os.environ["OPIK_PROJECT_NAME"] = "gretel-integration-demo"
```
## Two Approaches Available
This integration demonstrates **two methods** for generating synthetic data with Gretel:
1. **Data Designer** (recommended for custom datasets): Create datasets from scratch with precise control
2. **Safe Synthetics** (recommended for existing data): Generate synthetic versions of existing datasets
## Method 1: Using Gretel Data Designer
### Generate Q\&A Dataset
Use Gretel Data Designer to generate synthetic Q\&A data with precise control over the structure:
```python
from gretel_client.navigator_client import Gretel
from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P
import opik
# Initialize Data Designer
gretel_navigator = Gretel()
dd = gretel_navigator.data_designer.new(model_suite="apache-2.0")
# Add topic column (categorical sampler)
dd.add_column(
C.SamplerColumn(
name="topic",
type=P.SamplerType.CATEGORY,
params=P.CategorySamplerParams(
values=[
"neural networks", "deep learning", "machine learning", "NLP",
"computer vision", "reinforcement learning", "AI ethics", "data science"
]
)
)
)
# Add difficulty column
dd.add_column(
C.SamplerColumn(
name="difficulty",
type=P.SamplerType.CATEGORY,
params=P.CategorySamplerParams(
values=["beginner", "intermediate", "advanced"]
)
)
)
# Add question column (LLM-generated)
dd.add_column(
C.LLMTextColumn(
name="question",
prompt=(
"Generate a challenging, specific question about {{ topic }} "
"at {{ difficulty }} level. The question should be clear, focused, "
"and something a student or practitioner might actually ask."
)
)
)
# Add answer column (LLM-generated)
dd.add_column(
C.LLMTextColumn(
name="answer",
prompt=(
"Provide a clear, accurate, and comprehensive answer to this {{ difficulty }}-level "
"question about {{ topic }}: '{{ question }}'. The answer should be educational "
"and directly address all aspects of the question."
)
)
)
# Generate the dataset
workflow_run = dd.create(num_records=20, wait_until_done=True)
synthetic_df = workflow_run.dataset.df
print(f"Generated {len(synthetic_df)} Q&A pairs!")
```
### Convert to Opik Format
Convert the Gretel-generated data to Opik's expected format:
```python
def convert_to_opik_format(df):
"""Convert Gretel Q&A data to Opik dataset format"""
opik_items = []
for _, row in df.iterrows():
# Create Opik dataset item
item = {
"input": {
"question": row["question"]
},
"expected_output": row["answer"],
"metadata": {
"topic": row.get("topic", "AI/ML"),
"difficulty": row.get("difficulty", "unknown"),
"source": "gretel_data_designer"
}
}
opik_items.append(item)
return opik_items
# Convert to Opik format
opik_data = convert_to_opik_format(synthetic_df)
print(f"Converted {len(opik_data)} items to Opik format!")
```
### Upload to Opik
Upload your dataset to Opik for model evaluation:
```python
# Initialize Opik client
opik_client = opik.Opik()
# Create the dataset
dataset_name = "gretel-ai-qa-dataset"
dataset = opik_client.get_or_create_dataset(
name=dataset_name,
description="Synthetic Q&A dataset generated using Gretel Data Designer for AI/ML evaluation"
)
# Insert the data
dataset.insert(opik_data)
print(f"Successfully created dataset: {dataset.name}")
print(f"Dataset ID: {dataset.id}")
print(f"Total items: {len(opik_data)}")
```
## Method 2: Using Gretel Safe Synthetics
### Prepare Sample Data
If you have an existing Q\&A dataset, you can use Safe Synthetics to create a synthetic version:
```python
import pandas as pd
# Create sample Q&A data (needs 200+ records for holdout)
sample_questions = [
'What is machine learning?',
'How do neural networks work?',
'What is the difference between AI and ML?',
'Explain deep learning concepts',
'What are the applications of NLP?'
] * 50 # Repeat to get 250 records
sample_answers = [
'Machine learning is a subset of AI that enables systems to learn from data.',
'Neural networks are computational models inspired by biological neural networks.',
'AI is the broader concept while ML is a specific approach to achieve AI.',
'Deep learning uses multi-layer neural networks to model complex patterns.',
'NLP applications include chatbots, translation, sentiment analysis, and text generation.'
] * 50 # Repeat to get 250 records
sample_data = {
'question': sample_questions,
'answer': sample_answers,
'topic': (['ML', 'Neural Networks', 'AI/ML', 'Deep Learning', 'NLP'] * 50),
'difficulty': (['beginner', 'intermediate', 'beginner', 'advanced', 'intermediate'] * 50)
}
original_df = pd.DataFrame(sample_data)
print(f"Original dataset: {len(original_df)} records")
```
### Generate Synthetic Version
Use Safe Synthetics to create a privacy-safe version of your dataset:
```python
# Initialize Gretel client
gretel = Gretel()
# Generate synthetic version
synthetic_dataset = gretel.safe_synthetic_dataset \
.from_data_source(original_df, holdout=0.1) \
.synthesize(num_records=100) \
.create()
# Wait for completion and get results
synthetic_dataset.wait_until_done()
synthetic_df_safe = synthetic_dataset.dataset.df
print(f"Generated {len(synthetic_df_safe)} synthetic Q&A pairs using Safe Synthetics!")
```
### Convert and Upload to Opik
Convert the Safe Synthetics data to Opik format and upload:
```python
# Convert to Opik format
opik_data_safe = convert_to_opik_format(synthetic_df_safe)
# Create dataset in Opik
dataset_safe = opik_client.get_or_create_dataset(
name="gretel-safe-synthetics-qa-dataset",
description="Synthetic Q&A dataset generated using Gretel Safe Synthetics"
)
dataset_safe.insert(opik_data_safe)
print(f"Safe Synthetics dataset created: {dataset_safe.name}")
```
## Using with @track decorator
Use the `@track` decorator to create comprehensive traces when working with your Gretel-generated datasets:
```python
from opik import track
@track
def evaluate_qa_model(dataset_item):
"""Evaluate a Q&A model using Gretel-generated data."""
question = dataset_item["input"]["question"]
# Your model logic here (replace with actual model)
if 'neural network' in question.lower():
response = "A neural network is a computational model inspired by biological neural networks."
elif 'machine learning' in question.lower():
response = "Machine learning is a subset of AI that enables systems to learn from data."
else:
response = "This is a complex AI/ML topic that requires detailed explanation."
return {
"question": question,
"response": response,
"expected": dataset_item["expected_output"],
"topic": dataset_item["metadata"]["topic"],
"difficulty": dataset_item["metadata"]["difficulty"]
}
# Evaluate on your dataset
for item in opik_data[:5]: # Evaluate first 5 items
result = evaluate_qa_model(item)
print(f"Topic: {result['topic']}, Difficulty: {result['difficulty']}")
```
## Results viewing
Once your Gretel-generated datasets are uploaded to Opik, you can view them in the Opik UI. Each dataset will contain:
* Input questions and expected answers
* Metadata including topic and difficulty levels
* Source information (Data Designer or Safe Synthetics)
* Quality metrics and evaluation results
## Feedback Scores and Evaluation
Once your Gretel-generated datasets are in Opik, you can evaluate your LLM applications using Opik's evaluation framework:
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination
# Define your evaluation task
def evaluation_task(x):
    # Generate a response for each dataset item; here we reuse the tracked helper above,
    # replace this with a call to your actual model
    result = evaluate_qa_model(x)
    return {
        "input": x["input"]["question"],
        "output": result["response"],
        "reference": x["expected_output"],
    }
# Create the Hallucination metric
hallucination_metric = Hallucination()
# Run the evaluation
evaluation_results = evaluate(
    experiment_name="gretel-qa-evaluation",
    dataset=dataset,
task=evaluation_task,
scoring_metrics=[hallucination_metric],
)
```
## Dataset Size Requirements
| Dataset Size | Holdout Setting | Example |
| ------------------ | ----------------------- | ------------------------------------------------------------- |
| **\< 200 records** | `holdout=None` | `from_data_source(df, holdout=None)` |
| **200+ records** | Default (5%) or custom | `from_data_source(df)` or `from_data_source(df, holdout=0.1)` |
| **Large datasets** | Custom percentage/count | `from_data_source(df, holdout=250)` |
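For example, for a dataset with fewer than 200 records you would disable the holdout in the same builder chain used above (`small_df` here is a hypothetical placeholder derived from `original_df`):
```python
# Hypothetical small dataset (fewer than 200 rows)
small_df = original_df.head(150)
# Disable the holdout so Safe Synthetics accepts the small dataset
synthetic_small = gretel.safe_synthetic_dataset \
    .from_data_source(small_df, holdout=None) \
    .synthesize(num_records=50) \
    .create()
```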
## When to Use Which Approach?
| Use Case | Recommended Approach | Why |
| -------------------------------------- | -------------------- | ---------------------------------------------------- |
| **Creating new datasets from scratch** | **Data Designer** | More control, custom column types, guided generation |
| **Synthesizing existing datasets** | **Safe Synthetics** | Preserves statistical relationships, privacy-safe |
| **Custom data structures** | **Data Designer** | Flexible column definitions, template system |
| **Production data replication** | **Safe Synthetics** | Maintains data utility while ensuring privacy |
## Environment Variables
Make sure to set the following environment variables:
```bash
# Gretel Configuration
export GRETEL_API_KEY="your-gretel-api-key"
# Opik Configuration
export OPIK_PROJECT_NAME="your-project-name"
export OPIK_WORKSPACE="your-workspace-name"
```
## Troubleshooting
### Common Issues
1. **Authentication Errors**: Ensure your Gretel API key is correct and has the necessary permissions
2. **Dataset Size**: Safe Synthetics requires at least 200 records for holdout validation
3. **Model Suite**: Ensure you're using a compatible model suite (e.g., "apache-2.0")
4. **Rate Limiting**: Gretel may have rate limits; implement appropriate retry logic (see the sketch below)
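For rate limits in particular, a minimal retry helper with exponential backoff (plain Python, not a Gretel-specific API) might look like this:
```python
import random
import time
def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Back off exponentially, adding jitter to avoid synchronized retries
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
# Example (hypothetical): wrap the Data Designer generation call from Method 1
# workflow_run = with_retries(lambda: dd.create(num_records=20, wait_until_done=True))
```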
### Getting Help
* Contact Gretel support for API-specific problems
* Check Opik documentation for tracing and evaluation features
## Next Steps
Once you have Gretel integrated with Opik, you can:
* [Evaluate your LLM applications](/evaluation/overview) using Opik's evaluation framework
* [Create datasets](/datasets/overview) to test and improve your models
* [Set up feedback collection](/feedback/overview) to gather human evaluations
* [Monitor performance](/tracing/overview) across different models and configurations
# Observability for Hugging Face Datasets with Opik
> Start here to integrate Opik with Hugging Face Datasets for end-to-end LLM observability, unit testing, and optimization.
[Hugging Face Datasets](https://huggingface.co/docs/datasets/) is a library that provides easy access to thousands of datasets for machine learning and natural language processing tasks.
This guide explains how to integrate Opik with Hugging Face Datasets to convert and import datasets into Opik for model evaluation and optimization.
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=huggingface-datasets\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=huggingface-datasets\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=huggingface-datasets\&utm_campaign=opik) for more information.
## Getting Started
### Installation
To use Hugging Face Datasets with Opik, you'll need to have both the `datasets` and `opik` packages installed:
```bash
pip install opik datasets transformers pandas tqdm huggingface_hub
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring Hugging Face
In order to access private datasets on Hugging Face, you will need to have your Hugging Face token. You can create and manage your Hugging Face tokens on [this page](https://huggingface.co/settings/tokens).
You can set it as an environment variable:
```bash
export HUGGINGFACE_HUB_TOKEN="YOUR_TOKEN"
```
Or set it programmatically:
```python
import os
import getpass
if "HUGGINGFACE_HUB_TOKEN" not in os.environ:
os.environ["HUGGINGFACE_HUB_TOKEN"] = getpass.getpass("Enter your Hugging Face token: ")
# Set project name for organization
os.environ["OPIK_PROJECT_NAME"] = "huggingface-datasets-integration-demo"
```
## HuggingFaceToOpikConverter
The integration provides a utility class to convert Hugging Face datasets to Opik format:
```python
from datasets import load_dataset, Dataset as HFDataset
from opik import Opik
from typing import Optional, Dict, Any, List
import json
from tqdm import tqdm
import warnings
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')
class HuggingFaceToOpikConverter:
"""Utility class to convert Hugging Face datasets to Opik format."""
def __init__(self, opik_client: Opik):
self.opik_client = opik_client
def load_hf_dataset(
self,
dataset_name: str,
split: Optional[str] = None,
config: Optional[str] = None,
subset_size: Optional[int] = None,
**kwargs
) -> HFDataset:
"""
Load a dataset from Hugging Face Hub.
Args:
dataset_name: Name of the dataset on HF Hub
split: Specific split to load (train, validation, test)
config: Configuration/subset of the dataset
subset_size: Limit the number of samples
**kwargs: Additional arguments for load_dataset
Returns:
Loaded Hugging Face dataset
"""
print(f"๐ฅ Loading dataset: {dataset_name}")
if config:
print(f" Config: {config}")
if split:
print(f" Split: {split}")
# Load the dataset
dataset = load_dataset(
dataset_name,
name=config,
split=split,
**kwargs
)
# Limit dataset size if specified
if subset_size and len(dataset) > subset_size:
dataset = dataset.select(range(subset_size))
print(f" Limited to {subset_size} samples")
print(f" โ
Loaded {len(dataset)} samples")
print(f" Features: {list(dataset.features.keys())}")
return dataset
```
## Basic Usage
### Convert and Upload a Dataset
Here's how to convert a Hugging Face dataset to Opik format and upload it:
```python
# Initialize the converter
opik_client = Opik()
converter = HuggingFaceToOpikConverter(opik_client)
# Load a dataset from Hugging Face
dataset = converter.load_hf_dataset(
dataset_name="squad",
split="validation",
subset_size=100 # Limit for demo
)
# Convert to Opik format
opik_data = converter.convert_to_opik_format(
dataset=dataset,
input_columns=["question"],
output_columns=["answers"],
metadata_columns=["id", "title"],
dataset_name="squad-qa-dataset",
description="SQuAD question answering dataset converted from Hugging Face"
)
print(f"โ
Converted {len(opik_data)} items to Opik format!")
```
### Convert to Opik Format
The converter provides a method to transform Hugging Face datasets into Opik's expected format:
```python
def convert_to_opik_format(
self,
dataset: HFDataset,
input_columns: List[str],
output_columns: List[str],
metadata_columns: Optional[List[str]] = None,
dataset_name: str = "huggingface-dataset",
description: str = "Dataset converted from Hugging Face"
) -> List[Dict[str, Any]]:
"""
Convert a Hugging Face dataset to Opik format.
Args:
dataset: Hugging Face dataset
input_columns: List of column names to use as input
output_columns: List of column names to use as expected output
metadata_columns: Optional list of columns to include as metadata
dataset_name: Name for the Opik dataset
description: Description for the Opik dataset
Returns:
List of Opik dataset items
"""
opik_items = []
for row in tqdm(dataset, desc="Converting to Opik format"):
# Extract input data
input_data = {}
for col in input_columns:
if col in dataset.features:
input_data[col] = self._extract_field_value(row, col)
# Extract expected output
expected_output = {}
for col in output_columns:
if col in dataset.features:
expected_output[col] = self._extract_field_value(row, col)
# Extract metadata
metadata = {}
if metadata_columns:
for col in metadata_columns:
if col in dataset.features:
metadata[col] = self._extract_field_value(row, col)
# Create Opik dataset item
item = {
"input": input_data,
"expected_output": expected_output,
"metadata": metadata
}
opik_items.append(item)
return opik_items
```
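`convert_to_opik_format` calls an `_extract_field_value` helper that is not shown above. A minimal sketch of such a helper, assuming it only needs to make values JSON-serializable (`np` is the `numpy` import from the converter snippet above), could be added to the class like this:
```python
def _extract_field_value(self, row, col):
    """Best-effort conversion of a Hugging Face field into a JSON-serializable value."""
    value = row[col]
    # Unwrap numpy scalars and arrays so the item serializes cleanly
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    # Strings, numbers, lists, and dicts are passed through unchanged
    return value
```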
## Using with @track decorator
Use the `@track` decorator to create comprehensive traces when working with your converted datasets:
```python
from opik import track
@track
def evaluate_qa_model(dataset_item):
"""Evaluate a Q&A model using Hugging Face dataset."""
question = dataset_item["input"]["question"]
# Your model logic here (replace with actual model)
if 'what' in question.lower():
response = "This is a question asking for information."
elif 'how' in question.lower():
response = "This is a question asking for a process or method."
else:
response = "This is a general question that requires analysis."
return {
"question": question,
"response": response,
"expected": dataset_item["expected_output"],
"metadata": dataset_item["metadata"]
}
# Evaluate on your dataset
for item in opik_data[:5]: # Evaluate first 5 items
result = evaluate_qa_model(item)
print(f"Question: {result['question'][:50]}...")
```
## Popular Dataset Examples
### SQuAD (Question Answering)
```python
# Load SQuAD dataset
squad_dataset = converter.load_hf_dataset(
dataset_name="squad",
split="validation",
subset_size=50
)
# Convert to Opik format
squad_opik = converter.convert_to_opik_format(
dataset=squad_dataset,
input_columns=["question"],
output_columns=["answers"],
metadata_columns=["id", "title"],
dataset_name="squad-qa-dataset",
description="SQuAD question answering dataset"
)
```
### GLUE (General Language Understanding)
```python
# Load GLUE SST-2 dataset
sst2_dataset = converter.load_hf_dataset(
dataset_name="glue",
    config="sst2",
split="validation",
subset_size=100
)
# Convert to Opik format
sst2_opik = converter.convert_to_opik_format(
dataset=sst2_dataset,
input_columns=["sentence"],
output_columns=["label"],
metadata_columns=["idx"],
dataset_name="sst2-sentiment-dataset",
description="SST-2 sentiment analysis dataset from GLUE"
)
```
### Common Crawl (Text Classification)
```python
# Load Common Crawl dataset
cc_dataset = converter.load_hf_dataset(
dataset_name="common_crawl",
subset_size=200
)
# Convert to Opik format
cc_opik = converter.convert_to_opik_format(
dataset=cc_dataset,
input_columns=["text"],
output_columns=["language"],
metadata_columns=["url", "timestamp"],
dataset_name="common-crawl-dataset",
description="Common Crawl text classification dataset"
)
```
## Results viewing
Once your Hugging Face datasets are converted and uploaded to Opik, you can view them in the Opik UI. Each dataset will contain:
* Input data from specified columns
* Expected output from specified columns
* Metadata from additional columns
* Source information (Hugging Face dataset name and split)
## Feedback Scores and Evaluation
Once your Hugging Face datasets are in Opik, you can evaluate your LLM applications using Opik's evaluation framework:
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination
# Upload the converted items to an Opik dataset so they can be evaluated
squad_eval_dataset = opik_client.get_or_create_dataset("squad-qa-dataset")
squad_eval_dataset.insert(squad_opik)
# Define your evaluation task
def evaluation_task(x):
    # Generate a response for each dataset item; here we reuse the tracked helper above,
    # replace this with a call to your actual model
    result = evaluate_qa_model(x)
    return {
        "input": x["input"]["question"],
        "output": result["response"],
        "reference": str(x["expected_output"]["answers"]),
    }
# Create the Hallucination metric
hallucination_metric = Hallucination()
# Run the evaluation
evaluation_results = evaluate(
    experiment_name="huggingface-datasets-evaluation",
    dataset=squad_eval_dataset,
task=evaluation_task,
scoring_metrics=[hallucination_metric],
)
```
## Environment Variables
Make sure to set the following environment variables:
```bash
# Hugging Face Configuration (optional, for private datasets)
export HUGGINGFACE_HUB_TOKEN="your-huggingface-token"
# Opik Configuration
export OPIK_PROJECT_NAME="your-project-name"
export OPIK_WORKSPACE="your-workspace-name"
```
## Troubleshooting
### Common Issues
1. **Authentication Errors**: Ensure your Hugging Face token is correct for private datasets
2. **Dataset Not Found**: Verify the dataset name and configuration are correct
3. **Memory Issues**: Use `subset_size` parameter to limit large datasets
4. **Data Type Conversion**: The converter handles most data types, but complex nested structures may need custom handling
### Getting Help
* Check the [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets/) for dataset loading
* Review the [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/) for authentication
* Contact Hugging Face support for dataset-specific problems
* Check Opik documentation for tracing and evaluation features
## Next Steps
Once you have Hugging Face Datasets integrated with Opik, you can:
* [Evaluate your LLM applications](/evaluation/overview) using Opik's evaluation framework
* [Create datasets](/datasets/overview) to test and improve your models
* [Set up feedback collection](/feedback/overview) to gather human evaluations
* [Monitor performance](/tracing/overview) across different models and configurations
# Evaluate LLM Applications with Ragas Metrics in Opik
> Use Ragas evaluation metrics to assess your LLM application quality and automatically track results in Opik for comprehensive performance monitoring.
The Opik SDK provides a simple way to integrate with Ragas, a framework for evaluating RAG systems.
There are two main ways to use Ragas with Opik:
1. Using Ragas to score traces or spans.
2. Using Ragas to evaluate a RAG pipeline.
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=ragas\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=ragas\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=ragas\&utm_campaign=opik) for more information.
## Getting Started
### Installation
You will first need to install the `opik` and `ragas` packages:
```bash
pip install opik ragas
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring Ragas
In order to use Ragas, you will need to configure your LLM provider API keys. For this example, we'll use OpenAI. You can [find or create your API keys on this page](https://platform.openai.com/settings/organization/api-keys).
You can set them as environment variables:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
```
Or set them programmatically:
```python
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```
## Using Ragas to score traces or spans
Ragas provides a set of metrics that can be used to evaluate the quality of a RAG pipeline; a full list of the supported metrics can be found in the [Ragas documentation](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/).
You can use the `RagasMetricWrapper` to easily integrate Ragas metrics with Opik tracking:
```python
# Import the required dependencies
from ragas.metrics import AnswerRelevancy
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from opik.evaluation.metrics import RagasMetricWrapper
# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
ragas_answer_relevancy = AnswerRelevancy(llm=llm, embeddings=emb)
# Wrap the Ragas metric with RagasMetricWrapper for Opik integration
answer_relevancy_metric = RagasMetricWrapper(
ragas_answer_relevancy,
track=True # This enables automatic tracing in Opik
)
```
Once the metric wrapper is set up, you can use it to score traces or spans:
```python
from opik import track
from opik.opik_context import update_current_trace
@track
def retrieve_contexts(question):
# Define the retrieval function, in this case we will hard code the contexts
return ["Paris is the capital of France.", "Paris is in France."]
@track
def answer_question(question, contexts):
# Define the answer function, in this case we will hard code the answer
return "Paris"
@track
def rag_pipeline(question):
# Define the pipeline
contexts = retrieve_contexts(question)
answer = answer_question(question, contexts)
# Score the pipeline using the RagasMetricWrapper
score_result = answer_relevancy_metric.score(
user_input=question,
response=answer,
retrieved_contexts=contexts
)
# Add the score to the current trace
update_current_trace(
feedback_scores=[{"name": score_result.name, "value": score_result.value}]
)
return answer
print(rag_pipeline("What is the capital of France?"))
```
In the Opik UI, you will be able to see the full trace including the score calculation:
## Comprehensive Example: Dataset Evaluation
For more advanced use cases, you can evaluate entire datasets using Ragas metrics with the Opik evaluation platform:
### 1. Create a Dataset
```python
from datasets import load_dataset
import opik
opik_client = opik.Opik()
# Create a small dataset
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
# Reformat the dataset to match the schema expected by the Ragas evaluate function
hf_dataset = fiqa_eval["baseline"].select(range(3))
dataset_items = hf_dataset.map(
lambda x: {
"user_input": x["question"],
"reference": x["ground_truths"][0],
"retrieved_contexts": x["contexts"],
}
)
dataset = opik_client.get_or_create_dataset("ragas-demo-dataset")
dataset.insert(dataset_items)
```
### 2. Define Evaluation Task
```python
# Create an evaluation task
def evaluation_task(x):
return {
"user_input": x["question"],
"response": x["answer"],
"retrieved_contexts": x["contexts"],
}
```
### 3. Run Evaluation
```python
# Use the RagasMetricWrapper directly with Opik's evaluate function
opik.evaluation.evaluate(
dataset,
evaluation_task,
scoring_metrics=[answer_relevancy_metric],
task_threads=1,
)
```
### 4. Alternative: Using Ragas Native Evaluation
You can also use Ragas' native evaluation function with Opik tracing:
```python
from datasets import load_dataset
from opik.integrations.langchain import OpikTracer
from ragas.metrics import context_precision, answer_relevancy, faithfulness
from ragas import evaluate
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
# Reformat the dataset to match the schema expected by the Ragas evaluate function
dataset = fiqa_eval["baseline"].select(range(3))
dataset = dataset.map(
lambda x: {
"user_input": x["question"],
"reference": x["ground_truths"][0],
"retrieved_contexts": x["contexts"],
}
)
opik_tracer_eval = OpikTracer(tags=["ragas_eval"], metadata={"evaluation_run": True})
result = evaluate(
dataset,
metrics=[context_precision, faithfulness, answer_relevancy],
callbacks=[opik_tracer_eval],
)
print(result)
```
## Using Ragas metrics to evaluate a RAG pipeline
The `RagasMetricWrapper` can also be used directly within the Opik evaluation platform. This approach is much simpler than creating custom wrappers:
### 1. Define the Ragas metric
We will start by defining the Ragas metric, in this example we will use `AnswerRelevancy`:
```python
from ragas.metrics import AnswerRelevancy
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from opik.evaluation.metrics import RagasMetricWrapper
# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
ragas_answer_relevancy = AnswerRelevancy(llm=llm, embeddings=emb)
```
### 2. Create the metric wrapper
Simply wrap the Ragas metric with `RagasMetricWrapper`:
```python
# Create the answer relevancy scoring metric
answer_relevancy = RagasMetricWrapper(
ragas_answer_relevancy,
track=True # Enable tracing for the metric computation
)
```
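The wrapped metric can then be passed straight to Opik's `evaluate` function, reusing the `dataset` and `evaluation_task` defined in the dataset evaluation example above:
```python
# Run the evaluation with the wrapped Ragas metric
opik.evaluation.evaluate(
    dataset,
    evaluation_task,
    scoring_metrics=[answer_relevancy],
    task_threads=1,
)
```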
## Advanced Usage
### Using with the `@track` decorator
If you have multiple steps in your LLM pipeline, you can use the `@track` decorator to log the traces for each step. If AISuite is called within one of these steps, the LLM call will be associated with that corresponding step:
```python
from opik import track
from opik.integrations.aisuite import track_aisuite
import aisuite as ai
client = track_aisuite(ai.Client(), project_name="aisuite-integration-demo")
@track
def generate_story(prompt):
res = client.chat.completions.create(
model="openai:gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
)
return res.choices[0].message.content
@track
def generate_topic():
prompt = "Generate a topic for a story about Opik."
res = client.chat.completions.create(
model="openai:gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
)
return res.choices[0].message.content
@track(project_name="aisuite-integration-demo")
def generate_opik_story():
topic = generate_topic()
story = generate_story(topic)
return story
# Execute the multi-step pipeline
generate_opik_story()
```
The trace can now be viewed in the UI with hierarchical spans showing the relationship between different steps:
## Supported aisuite methods
The `track_aisuite` wrapper supports the following aisuite methods:
* `aisuite.Client.chat.completions.create()`
If you would like to track another aisuite method, please let us know by opening an issue on [GitHub](https://github.com/comet-ml/opik/issues).
# Observability for LiteLLM with Opik
> Start here to integrate Opik into your LiteLLM-based genai application for end-to-end LLM observability, unit testing, and optimization.
[LiteLLM](https://github.com/BerriAI/litellm) allows you to call all LLM APIs using the OpenAI format (Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq, etc.). There are two main ways to use LiteLLM:
1. Using the [LiteLLM Python SDK](https://docs.litellm.ai/docs/#litellm-python-sdk)
2. Using the [LiteLLM Proxy Server (LLM Gateway)](https://docs.litellm.ai/docs/#litellm-proxy-server-llm-gateway)
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=litellm\&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=litellm\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=litellm\&utm_campaign=opik) for more information.
## Getting Started
### Installation
First, ensure you have both `opik` and `litellm` packages installed:
```bash
pip install opik litellm
```
### Configuring Opik
Configure the Opik Python SDK for your deployment type. See the [Python SDK Configuration guide](/tracing/sdk_configuration) for detailed instructions on:
* **CLI configuration**: `opik configure`
* **Code configuration**: `opik.configure()`
* **Self-hosted vs Cloud vs Enterprise** setup
* **Configuration files** and environment variables
### Configuring LiteLLM
In order to use LiteLLM, you will need to configure your LLM provider API keys. For this example, we'll use OpenAI. You can [find or create your API keys on this page](https://platform.openai.com/settings/organization/api-keys).
You can set them as environment variables:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
```
Or set them programmatically:
```python
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```
## Using Opik with the LiteLLM Python SDK
### Logging LLM calls
In order to log the LLM calls to Opik, you will need to create the OpikLogger callback. Once the OpikLogger callback is created and added to LiteLLM, you can make calls to LiteLLM as you normally would:
```python
from litellm.integrations.opik.opik import OpikLogger
import litellm
import os
# Set project name for better organization
os.environ["OPIK_PROJECT_NAME"] = "litellm-integration-demo"
opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]
response = litellm.completion(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": "Why is tracking and evaluation of LLMs important?"}
]
)
print(response.choices[0].message.content)
```
### Logging LLM calls within a tracked function
If you are using LiteLLM within a function tracked with the [`@track`](/tracing/log_traces#using-function-decorators) decorator, you will need to pass the `current_span_data` as metadata to the `litellm.completion` call:
```python
from opik import track
from opik.opik_context import get_current_span_data
from litellm.integrations.opik.opik import OpikLogger
import litellm
opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]
@track
def streaming_function(input):
    messages = [{"role": "user", "content": input}]
    response = litellm.completion(
        model="gpt-3.5-turbo",
        messages=messages,
        stream=True,
        metadata={
            "opik": {
                "current_span_data": get_current_span_data(),
                "tags": ["streaming-test"],
            },
        },
    )
    return response
response = streaming_function("Why is tracking and evaluation of LLMs important?")
chunks = list(response)
```
## Using Opik with the LiteLLM Proxy Server
# Overview
> Build powerful applications with Opik's Python SDK, enabling seamless integration and functionality for your projects.
# OpenTelemetry Python SDK
> Learn to instrument your Python applications with OpenTelemetry SDK to effectively send trace data to Opik for better observability.
# Using the OpenTelemetry Python SDK
This guide shows you how to directly instrument your Python applications with the OpenTelemetry SDK to send trace data to Opik.
## Installation
First, install the required OpenTelemetry packages:
```bash
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
```
## Full Example
Here's a complete example that demonstrates how to instrument a chatbot application with OpenTelemetry and send the traces to Opik:
```python
# Dependencies: opentelemetry-exporter-otlp
import os
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.resource import ResourceAttributes
# Configure OpenTelemetry
# For comet.com
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://www.comet.com/opik/api/v1/private/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=
## Getting started
To use the Microsoft Agent Framework .NET integration with Opik, you will need to have the Agent Framework and the required OpenTelemetry packages installed:
```bash
# Agent Framework (2 packages)
dotnet add package Microsoft.Agents.AI --prerelease
dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
# Hosting (1 package)
dotnet add package Microsoft.Extensions.Hosting
# OpenTelemetry (3 packages)
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.Http
```
You will also need to configure your OpenAI API key:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
```
If you are looking at integrating Opik with your agent using Cursor, check out our pre-built prompt:
## Installation
You can also install the Cursor extension manually by navigating to the `Extensions` tab at the top of the file sidebar and searching for `Opik`.
From there, simply click on the `Install` button.
### Configuring the extension
In order to use the extension, you will need to configure your Opik API key. There are a few ways to do this:
In this modal, click on `Open Settings` and paste your Opik API key in the `Opik: Opik API Key` field.
You can find your API key in the [Opik dashboard](https://www.comet.com/api/my/settings).
# Observability for Flowise with Opik
> Start here to integrate Opik into your Flowise-based genai application for end-to-end LLM observability, unit testing, and optimization.
Flowise AI is a visual LLM builder that allows you to create AI agents and workflows through a drag-and-drop interface. With Opik integration, you can analyze and troubleshoot your chatflows and agentflows to improve performance and user experience.
## Getting started
### Prerequisites
Before integrating Opik with Langflow, ensure you have:
* A running Langflow server
* An Opik Cloud account or self-hosted Opik instance
* Access to the terminal/environment where Langflow runs
### Installation
Install both Langflow and Opik in the same environment:
```bash
pip install langflow opik
```
For more Langflow installation options and details, see the [Langflow documentation](https://docs.langflow.org/).
## Configuring Opik
Configure Opik to connect to your Opik instance. Run the following command in the same terminal/environment where you'll run Langflow:
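This is the same `opik configure` CLI command used for the other integrations in this guide:
```bash
opik configure
```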
## Features
* **Automatic tracing** of workflow executions and individual node operations
* **Standard OpenTelemetry** instrumentation using the official Node.js SDK
* **Zero-code setup** via n8n's hook system
* **OTLP compatible** - works with Opik's OpenTelemetry endpoint
* **Configurable** I/O capture, node filtering, and more
## Account Setup
[Comet](https://www.comet.com/site?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=n8n\&utm_campaign=opik) provides a hosted version of the Opik platform. [Simply create an account](https://www.comet.com/signup?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=n8n\&utm_campaign=opik) and grab your API Key.
> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm\&utm_source=opik\&utm_medium=colab\&utm_content=n8n\&utm_campaign=opik) for more information.
## Quick Start with Docker
The fastest way to get started is with Docker Compose:
```bash
# Clone and navigate to the example
git clone https://github.com/comet-ml/n8n-observability.git
cd n8n-observability/examples/docker-compose
# Set your Opik API key (get one free at https://www.comet.com/signup)
export OPIK_API_KEY=your_api_key_here
# Build and run
docker-compose up --build
```
Open [http://localhost:5678](http://localhost:5678), create a workflow, and see traces in your [Opik dashboard](https://www.comet.com)!
## Setup Options
### Docker (Recommended)
Create a custom Dockerfile that installs the `n8n-observability` package globally:
```dockerfile
FROM n8nio/n8n:latest
USER root
RUN npm install -g n8n-observability
ENV EXTERNAL_HOOK_FILES=/usr/local/lib/node_modules/n8n-observability/dist/hooks.cjs
USER node
```
Then configure your docker-compose.yml with OTLP settings:
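The exact variables depend on the `n8n-observability` package and your Opik deployment, but a minimal sketch using the standard OpenTelemetry environment variables (the service name, image, and header format are assumptions; additional workspace or project headers may be required) could look like:
```yaml
# Sketch only: adjust names, image, and headers to your deployment
services:
  n8n:
    build: .   # uses the Dockerfile above
    ports:
      - "5678:5678"
    environment:
      - EXTERNAL_HOOK_FILES=/usr/local/lib/node_modules/n8n-observability/dist/hooks.cjs
      - OTEL_SERVICE_NAME=n8n
      - OTEL_EXPORTER_OTLP_ENDPOINT=https://www.comet.com/opik/api/v1/private/otel
      - OTEL_EXPORTER_OTLP_HEADERS=Authorization=${OPIK_API_KEY}
```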
## Why teams choose Opik Agent Optimizer
* **Automatic prompt optimization** - end-to-end workflow that installs in minutes and runs locally or in your stack.
* **Open-source and framework agnostic** - no lock-in; use Opik's first-party optimizers or community favorites like GEPA in the same SDK.
* **Agent-aware** - optimize beyond system prompts, including MCP tool signatures, function-calling schemas, and full multi-agent systems.
* **Deep observability** - every trial logs prompts, tool calls, traces, and metric reasons to Opik so you can explain and ship changes confidently.
## Key capabilities
## Quickstart
You can use the `FewShotBayesianOptimizer` to optimize a prompt by following these steps:
```python maxLines=1000
from opik_optimizer import FewShotBayesianOptimizer
from opik.evaluation.metrics import LevenshteinRatio
from opik_optimizer import datasets, ChatPrompt
# Initialize optimizer
optimizer = FewShotBayesianOptimizer(
model="openai/gpt-4",
model_parameters={
"temperature": 0.1,
"max_tokens": 5000
},
)
# Prepare dataset
dataset = datasets.hotpot(count=300)
# Define metric and prompt (see docs for more options)
def levenshtein_ratio(dataset_item, llm_output):
return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)
prompt = ChatPrompt(
messages=[
{"role": "system", "content": "Provide an answer to the question."},
{"role": "user", "content": "{question}"}
]
)
# Run optimization
results = optimizer.optimize_prompt(
prompt=prompt,
dataset=dataset,
metric=levenshtein_ratio,
n_samples=100
)
# Access results
results.display()
```
## Configuration Options
### Optimizer parameters
The optimizer has the following parameters:
## Quickstart
```python
"""
Optimize a simple system prompt on the tiny_test dataset.
Requires: pip install gepa, and a valid OPENAI_API_KEY for LiteLLM-backed models.
"""
from typing import Any, Dict
from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult
from opik_optimizer import ChatPrompt, datasets
from opik_optimizer.gepa_optimizer import GepaOptimizer
def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
return LevenshteinRatio().score(reference=dataset_item["label"], output=llm_output)
dataset = datasets.tiny_test()
prompt = ChatPrompt(
system="You are a helpful assistant. Answer concisely with the exact answer.",
user="{text}",
)
optimizer = GepaOptimizer(
model="openai/gpt-4o-mini",
n_threads=6,
temperature=0.2,
max_tokens=200,
)
result = optimizer.optimize_prompt(
prompt=prompt,
dataset=dataset,
metric=levenshtein_ratio,
max_trials=12,
reflection_minibatch_size=2,
n_samples=5,
)
result.display()
```
### Determinism and tool usage
* GEPA's seed is forwarded directly to the underlying `gepa.optimize` call, but any non-determinism in your prompt (tool calls, non-zero temperature, external APIs) will still introduce variance. To test seeding in isolation, disable tools or substitute cached responses.
* GEPA emits its own baseline evaluation inside the optimization loop. You'll see one baseline score from Opik's wrapper and another from GEPA before the first trial; this is expected and does not double-charge the metric budget.
* Reflection only triggers after GEPA accepts at least `reflection_minibatch_size` unique prompts. If the minibatch is larger than the trial budget, the optimizer logs a warning and skips reflection.
### GEPA scores vs. Opik scores
* The **GEPA Score** column reflects the aggregate score GEPA computes on its train/validation split when deciding which candidates stay on the Pareto front. It is useful for understanding how GEPA's evolutionary search ranks prompts.
* The **Opik Score** column is a fresh evaluation performed through Opik's metric pipeline on the same dataset (respecting `n_samples`). This is the score you should use when comparing against your baseline or other optimizers.
* Because the GEPA score is based on GEPA's internal aggregation, it can diverge from the Opik score for the same prompt. This is expected; treat the GEPA score as a hint about why GEPA kept or discarded a candidate, and rely on the Opik score for final comparisons.
### `skip_perfect_score`
* When `skip_perfect_score=True`, GEPA immediately ignores any candidate whose GEPA score meets or exceeds the `perfect_score` threshold (default `1.0`). This keeps the search moving toward imperfect prompts instead of spending budget refining already perfect ones.
* Set `skip_perfect_score=False` if your metric tops out below `1.0`, or if you still want to see how GEPA mutates a perfect-scoring prompt, for example when you care about ties being broken by Opik's rescoring step rather than GEPA's aggregate.
## Configuration Options
### Optimizer parameters
The optimizer has the following parameters:
## Backend Service
Opik's backend uses Java 21 LTS and Dropwizard 4, structured as a RESTful web service offering public API
endpoints for core functionality. Full API documentation is available [here](/reference/rest-api/overview).
For observability, Opik uses OpenTelemetry due to its vendor-neutral approach and wide support across languages
and frameworks. It provides a single, consistent way to collect telemetry data from all services and applications.
*You can find the full backend codebase in Github under the [`apps/opik-backend`](https://github.com/comet-ml/opik/tree/main/apps/opik-backend) folder.*
## Frontend Application
Opik's frontend is a TypeScript + React application served by Nginx. It provides a user-friendly interface for
interacting with the backend services. The frontend is built using a modular approach, with each module
encapsulating a specific feature or functionality.
*You can find the full frontend codebase in Github under the [`apps/opik-frontend`](https://github.com/comet-ml/opik/tree/main/apps/opik-frontend) folder.*
## SDKs
Opik provides SDKs for Python and TypeScript. These SDKs allow developers to interact with Opik's backend
services programmatically. The SDKs are designed to be easy to use and provide a high-level abstraction over
the REST API and many additional features.
*You can find the full SDK codebase in Github under the [`sdks/python`](https://github.com/comet-ml/opik/tree/main/sdks/python) for the Python SDK
and [`sdks/typescript`](https://github.com/comet-ml/opik/tree/main/sdks/typescript) for the TypeScript SDK.*
## ClickHouse
ClickHouse is a column-oriented database management system developed by Yandex. It is optimized for fast analytics on large datasets and is capable of processing hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.
Opik uses ClickHouse for datasets that require near real-time ingestion and analytical queries, such as:
* LLM calls and traces
* Feedback scores
* Datasets and experiments
The image below details the schema used by Opik in ClickHouse:
Liquibase automates schema management
## MySQL
Opik uses MySQL for transactional data. It provides ACID-compliant storage for Opik's lower-volume but critical data, such as:
* Feedback definitions
* Metadata containers (e.g., projects that group related traces)
* Configuration data
The image below details the schema used by Opik in MySQL:
Liquibase automates schema management
## Redis
Redis is an in-memory data structure store, used as a database, cache, and message broker. It supports a vast range of data structures. Opik uses Redis for:
* A distributed cache: for high-speed lookups.
* A distributed lock: for coordinating safe access to certain shared resources.
* A rate limiter: to enforce throughput limits and protect scalability.
* A streaming mechanism: Redis streams power Opik's Online evaluation functionality; future iterations may integrate Kafka or similar platforms for even higher scalability.
## Observability
Opik is built and runs on top of open-source infrastructure (MySQL, Redis, Kubernetes, and more), making it straightforward to integrate with popular observability stacks such as Grafana and Prometheus. Specifically:
* The backend uses OpenTelemetry for vendor-neutral instrumentation.
* ClickHouse deployments include an operator for real-time performance monitoring and metric exports to Grafana/Prometheus.
* Other components (MySQL, Redis, Kubernetes) also have well-documented strategies for monitoring.
# Scaling Opik
> Learn best practices and configurations for running Opik in production, ensuring resilience and scalability for mission-critical workloads.
# Scaling Opik
Opik is built to power mission-critical workloads at scale. Whether you're running a small proof of concept or a high-volume enterprise deployment, Opik adapts seamlessly to your needs. Its stateless architecture and powerful ClickHouse-backed storage make it highly resilient, horizontally scalable, and future-proof for your data growth.
This guide outlines recommended configurations and best practices for running Opik in production.
## Proven at Scale
Opik is engineered to handle demanding, production-grade workloads. The following example demonstrates the robustness of a typical deployment:
| Metric | Value |
| ------------------------- | ---------- |
| Select queries per second | \~80 |
| Insert queries per second | \~20 |
| Rows inserted per minute | Up to 75K |
| Traces (count) | 40 million |
| Traces (size) | 400 GB |
| Spans (count) | 250M |
| Spans (size) | 3.1 TB |
| Total data on disk | 5 TB |
| Weekly data ingestion | 100 GB |
A deployment of this scale is fully supported using:
**Opik Services** - the following services run on r7i.2xlarge instances with 2 replicas:
* Opik Backend
* Opik Frontend
The Opik Python Backend service runs on c7i.2xlarge instances with 3 replicas.
**ClickHouse** - running on m7i.8xlarge instances with 2 replicas.
This configuration provides both performance and reliability while leaving room for seamless expansion.
## Built for Growth
Opik is designed with flexibility at its core. As your data grows and query volumes increase, Opik grows with you.
* **Horizontal scaling** - add more replicas of services to instantly handle more traffic
* **Vertical scaling** - increase CPU, memory, or storage to handle denser workloads
* **Seamless elasticity** - scale out during peak usage and scale back during quieter periods
For larger workloads, ClickHouse can be scaled to support enterprise-level deployments. A common configuration includes:
* 62 CPU cores
* 256 GB RAM
* 25 TB disk space
ClickHouse's read path can also scale horizontally by increasing replicas, ensuring Opik continues to deliver high performance as usage grows.
## Resilient Services Cluster
Opik services are stateless and fault-tolerant, ensuring high availability across environments. Recommended resources:
| Environment | CPU (vCPU) | RAM (GB) |
| ----------- | ---------- | -------- |
| Development | 4 | 8 |
| Production | 13 | 32 |
### Instance Guidance
| Deployment | Instance | vCPUs | Memory (GiB) |
| ------------ | ----------- | ----- | ------------ |
| Dev (small) | c7i.large | 2 | 4 |
| Dev | c7i.xlarge | 4 | 8 |
| Prod (small) | c7i.2xlarge | 8 | 16 |
| Prod | c7i.4xlarge | 16 | 32 |
### Backend Service (Scales to Demand)
| Metric | Dev | Prod Small | Prod Large |
| ------------ | --- | ---------- | ---------- |
| Replicas | 2 | 5 | 7 |
| CPU cores | 1 | 2 | 2 |
| Memory (GiB) | 2 | 9 | 12 |
### Frontend Service (Always Responsive)
| Metric | Dev | Prod Small | Prod Large |
| ---------------- | --- | ---------- | ---------- |
| Replicas | 2 | 3 | 5 |
| CPU (millicores) | 5 | 50 | 50 |
| Memory (MiB) | 16 | 32 | 64 |
## ClickHouse: High-Performance Storage
At the heart of Opik's scalability is ClickHouse, a proven, high-performance analytical database designed for large-scale workloads. Opik leverages ClickHouse for storing traces and spans, ensuring fast queries, robust ingestion, and uncompromising reliability.
### Instance Types
Memory-optimized instances are recommended, with a minimum 4:1 memory-to-CPU ratio:
| Deployment | Instance |
| ---------- | ----------- |
| Small | m7i.2xlarge |
| Medium | m7i.4xlarge |
| Large | m7i.8xlarge |
### Replication Strategy
* **Development**: 1 replica
* **Production**: 2 replicas
Always scale vertically before adding more replicas for efficiency.
### CPU & Memory Guidance
Target 10-20% CPU utilization, with safe spikes up to 40-50%.
Maintain at least a 4:1 memory-to-CPU ratio (extend to 8:1 for very large environments).
| Deployment | CPU cores | Memory (GiB) |
| ------------------ | --------- | ------------ |
| Minimum | 2 | 8 |
| Development | 4 | 16 |
| Production (small) | 6 | 24 |
| Production | 32 | 128 |
### Disk Recommendations
To ensure reliable performance under heavy load:
| Volume | Value |
| ---------- | ----------------------------- |
| Family | SSD |
| Type | gp3 |
| Size       | 8-16 TiB (workload dependent) |
| IOPS | 3000 |
| Throughput | 250 MiB/s |
Opik's ClickHouse layer is resilient even under sustained, large-scale ingestion, ensuring queries stay fast.
## Managing System Tables
System tables (e.g., `system.opentelemetry_span_log`) can grow quickly. To keep storage lean:
* Configure TTL settings in ClickHouse, or
* Perform periodic manual pruning
## Why Opik Scales with Confidence
* **Enterprise-ready** - built to support multi-terabyte data volumes
* **Elastic & flexible** - easily adjust resources to match workload demands
* **Robust & reliable** - designed for high availability and long-term stability
* **Future-proof** - proven to support growing usage without redesign
With Opik, you can start small and scale confidently, knowing your observability platform won't hold you back.
## References
* [ClickHouse sizing & hardware recommendations](https://clickhouse.com/docs/guides/sizing-and-hardware-recommendations)
# Advanced ClickHouse backup
> Learn to implement SQL-based and dedicated backup tools for ClickHouse in Opik's Kubernetes setup to ensure data protection and recovery.
# ClickHouse Backup Guide
This guide covers the two backup options available for ClickHouse in Opik's Kubernetes deployment:
1. **SQL-based Backup** - Uses ClickHouse's native `BACKUP` command with S3
2. **ClickHouse Backup Tool** - Uses the dedicated `clickhouse-backup` tool
## Overview
ClickHouse backup is essential for data protection and disaster recovery. Opik provides two different approaches to handle backups, each with its own advantages:
* **SQL-based Backup**: Simple, uses ClickHouse's built-in backup functionality
* **ClickHouse Backup Tool**: More advanced, provides additional features like compression and incremental backups
## Option 1: SQL-based Backup (Default)
This is the default backup method that uses ClickHouse's native `BACKUP` command to create backups directly to S3-compatible storage.
### Features
* Uses ClickHouse's built-in `BACKUP ALL EXCEPT DATABASE system` command
* Direct S3 upload with timestamped backup names
* Configurable schedule via CronJob
* Supports both AWS S3 and S3-compatible storage (like MinIO)
### Configuration
#### Basic Setup
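A minimal configuration (mirroring the values used in the examples below; adjust the bucket URL and schedule to your environment) might look like:
```yaml
clickhouse:
  backup:
    enabled: true
    bucketURL: "https://your-bucket.s3.region.amazonaws.com"
    schedule: "0 0 * * *" # Daily at midnight
```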
#### With AWS S3 Credentials
Create a Kubernetes secret with your S3 credentials:
```bash
kubectl create secret generic clickhouse-backup-secret \
--from-literal=access_key_id=YOUR_ACCESS_KEY \
--from-literal=access_key_secret=YOUR_SECRET_KEY
```
Then configure the backup:
```yaml
clickhouse:
backup:
enabled: true
bucketURL: "https://your-bucket.s3.region.amazonaws.com"
secretName: "clickhouse-backup-secret"
schedule: "0 0 * * *"
```
#### With IAM Role (AWS EKS)
For AWS EKS clusters, you can use IAM roles instead of access keys:
```yaml
clickhouse:
serviceAccount:
create: true
name: "opik-clickhouse"
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT:role/clickhouse-backup-role"
backup:
enabled: true
bucketURL: "https://your-bucket.s3.region.amazonaws.com"
schedule: "0 0 * * *"
```
**Required IAM Policy:**
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:*",
"Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
}
]
}
```
**Trust Relationship Policy:**
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDCPROVIDERID"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.REGION.amazonaws.com/id/OIDCPROVIDERID:sub": "system:serviceaccount:YOUR_NAMESPACE:opik-clickhouse",
"oidc.eks.REGION.amazonaws.com/id/OIDCPROVIDERID:aud": "sts.amazonaws.com"
}
}
}
]
}
```
#### Custom Backup Command
You can customize the backup command if needed:
```yaml
clickhouse:
backup:
enabled: true
bucketURL: "https://your-bucket.s3.region.amazonaws.com"
command:
- /bin/bash
- "-cx"
- |-
export backupname=backup$(date +'%Y%m%d%H%M')
echo "BACKUP ALL EXCEPT DATABASE system TO S3('${CLICKHOUSE_BACKUP_BUCKET}/${backupname}/', '$ACCESS_KEY', '$SECRET_KEY');" > /tmp/backQuery.sql
clickhouse-client -h clickhouse-opik-clickhouse --send_timeout 600000 --receive_timeout 600000 --port 9000 --queries-file=/tmp/backQuery.sql
```
### Backup Process
The SQL-based backup:
1. Creates a timestamped backup name (format: `backupYYYYMMDDHHMM`)
2. Executes `BACKUP ALL EXCEPT DATABASE system TO S3(...)` command
3. Uploads all databases except the `system` database to S3
4. Uses ClickHouse's native backup format
### Restore Process
To restore from a SQL-based backup:
```bash
# Connect to ClickHouse
kubectl exec -it deployment/clickhouse-opik-clickhouse -- clickhouse-client
# Restore from S3 backup
RESTORE ALL FROM S3('https://your-bucket.s3.region.amazonaws.com/backup202401011200/', 'ACCESS_KEY', 'SECRET_KEY');
```
## Option 2: ClickHouse Backup Tool
The ClickHouse Backup Tool provides more advanced backup features including compression, incremental backups, and better restore capabilities.
### Features
* Advanced backup management with compression
* Incremental backup support
* REST API for backup operations
* Better restore capabilities
* Backup metadata and validation
### Configuration
#### Enable Backup Server
```yaml
clickhouse:
backupServer:
enabled: true
image: "altinity/clickhouse-backup:2.6.23"
port: 7171
env:
LOG_LEVEL: "info"
ALLOW_EMPTY_BACKUPS: true
API_LISTEN: "0.0.0.0:7171"
API_CREATE_INTEGRATION_TABLES: true
```
#### Configure S3 Storage
Set up S3 configuration for the backup tool:
```yaml
clickhouse:
backupServer:
enabled: true
env:
S3_BUCKET: "your-backup-bucket"
      S3_ACCESS_KEY: "your-access-key" # can be omitted when using an IAM role
S3_SECRET_KEY: "your-secret-key"
S3_REGION: "us-west-2"
S3_ENDPOINT: "https://s3.us-west-2.amazonaws.com" # Optional: for S3-compatible storage
```
#### With Kubernetes Secrets
Use Kubernetes secrets for sensitive data (this can be skipped when using IAM roles):
```bash
kubectl create secret generic clickhouse-backup-tool-secret \
--from-literal=S3_ACCESS_KEY=YOUR_ACCESS_KEY \
--from-literal=S3_SECRET_KEY=YOUR_SECRET_KEY
```
```yaml
clickhouse:
backupServer:
enabled: true
env:
S3_BUCKET: "your-backup-bucket"
S3_REGION: "us-west-2"
envFrom:
- secretRef:
name: "clickhouse-backup-tool-secret"
```
### Using the Backup Tool
#### Create Backup
```bash
# Port-forward to access the backup server
kubectl port-forward svc/clickhouse-opik-clickhouse 7171:7171
# Create a backup
curl -X POST "http://localhost:7171/backup/create?name=backup-$(date +%Y%m%d-%H%M%S)"
# List available backups
curl "http://localhost:7171/backup/list"
```
#### Upload Backup to S3
```bash
# Upload backup to S3
curl -X POST "http://localhost:7171/backup/upload/backup-20240101-120000"
```
#### Download and Restore
```bash
# Download backup from S3
curl -X POST "http://localhost:7171/backup/download/backup-20240101-120000"
# Restore backup
curl -X POST "http://localhost:7171/backup/restore/backup-20240101-120000"
```
### Automated Backup with CronJob
You can create a custom CronJob to automate the backup tool:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: clickhouse-backup-tool-job
spec:
schedule: "0 2 * * *" # Daily at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: backup-tool
image: altinity/clickhouse-backup:2.6.23
command:
- /bin/bash
- -c
- |
BACKUP_NAME="backup-$(date +%Y%m%d-%H%M%S)"
curl -X POST "http://clickhouse-opik-clickhouse:7171/backup/create?name=$BACKUP_NAME"
sleep 30
curl -X POST "http://clickhouse-opik-clickhouse:7171/backup/upload/$BACKUP_NAME"
restartPolicy: OnFailure
```
## Comparison
| Feature | SQL-based Backup | ClickHouse Backup Tool |
| ----------------------- | ---------------- | ---------------------- |
| **Setup Complexity** | Simple | Moderate |
| **Compression** | No | Yes |
| **Incremental Backups** | No | Yes |
| **Backup Validation** | Basic | Advanced |
| **REST API** | No | Yes |
| **Restore Flexibility** | Basic | Advanced |
| **Resource Usage** | Low | Moderate |
| **S3 Compatibility** | Native | Native |
## Best Practices
### General Recommendations
1. **Test Restores**: Regularly test backup restoration procedures
2. **Monitor Backup Jobs**: Set up monitoring for backup job failures
3. **Retention Policy**: Define how long backups are kept and prune older backups automatically (see the sketch after this list)
4. **Cross-Region**: Consider cross-region backup replication for disaster recovery
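One way to implement the retention policy mentioned above is an S3 lifecycle rule that expires old backup objects. This is a sketch only; the bucket name, prefix, and retention window are placeholders to adapt to your setup:
```bash
# Expire backup objects older than 30 days (bucket, prefix, and window are examples)
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-backup-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-clickhouse-backups",
      "Filter": {"Prefix": "backup"},
      "Status": "Enabled",
      "Expiration": {"Days": 30}
    }]
  }'
```
If you use the ClickHouse Backup Tool, it also has its own settings for keeping a fixed number of local and remote backups; check the clickhouse-backup documentation for the exact configuration keys.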
### Security
1. **Access Control**: Use IAM roles when possible instead of access keys
2. **Encryption**: Enable S3 server-side encryption for backup storage (see the example after this list)
3. **Network Security**: Use VPC endpoints for S3 access when available
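For the encryption recommendation above, you can enable default server-side encryption on the backup bucket. A minimal sketch using the AWS CLI, with a placeholder bucket name:
```bash
# Enable default SSE-S3 encryption on the backup bucket (bucket name is a placeholder)
aws s3api put-bucket-encryption \
  --bucket your-backup-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}
    }]
  }'
```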
### Performance
1. **Schedule**: Run backups during low-traffic periods
2. **Resource Limits**: Set appropriate resource limits for backup jobs
3. **Storage Class**: Use appropriate S3 storage classes for cost optimization
## Troubleshooting
### Common Issues
#### Backup Job Fails
```bash
# Check backup job logs
kubectl logs -l app=clickhouse-backup
# Check CronJob status
kubectl get cronjobs
kubectl describe cronjob clickhouse-backup
```
#### S3 Access Issues
```bash
# Test S3 connectivity
kubectl exec -it deployment/clickhouse-opik-clickhouse -- \
clickhouse-client --query "SELECT * FROM system.disks WHERE name='s3'"
```
#### Backup Tool API Issues
```bash
# Check backup server logs
kubectl logs -l app=clickhouse-backup-server
# Test API connectivity
kubectl port-forward svc/clickhouse-opik-clickhouse 7171:7171
curl "http://localhost:7171/backup/list"
```
### Monitoring
Set up monitoring for backup operations:
```yaml
# Example Prometheus alerts
- alert: ClickHouseBackupFailed
expr: increase(kube_job_status_failed{job_name=~".*clickhouse-backup.*"}[5m]) > 0
for: 0m
labels:
severity: warning
annotations:
summary: "ClickHouse backup job failed"
description: "ClickHouse backup job {{ $labels.job_name }} has failed"
```
## Migration Between Backup Methods
### From SQL-based to ClickHouse Backup Tool
1. Enable the backup server:
```yaml
clickhouse:
backupServer:
enabled: true
```
2. Create initial backup with the tool
3. Disable SQL-based backup:
```yaml
clickhouse:
backup:
enabled: false
```
### From ClickHouse Backup Tool to SQL-based
1. Disable backup server:
```yaml
clickhouse:
backupServer:
enabled: false
```
2. Enable SQL-based backup:
```yaml
clickhouse:
backup:
enabled: true
```
## Support
For additional help with ClickHouse backups:
* [ClickHouse Backup Documentation](https://github.com/Altinity/clickhouse-backup)
* [ClickHouse Backup and Restore Documentation](https://clickhouse.com/docs/operations/backup)
* [Opik Community Support](https://github.com/comet-ml/opik/issues)
# Troubleshooting
> Troubleshooting guide for common issues when running self-hosted Opik deployments.
This guide covers common troubleshooting scenarios for self-hosted Opik deployments.
## Common Issues
### ClickHouse Zookeeper Metadata Loss
#### Problem Description
If Zookeeper loses the metadata paths for ClickHouse tables, you will see coordination exceptions in the ClickHouse logs and potentially in the opik-backend service logs. These errors indicate that Zookeeper cannot find table metadata paths.
**Symptoms:**
Error messages appear in the ClickHouse logs and propagate to the opik-backend service:
```
Code: 999. Coordination::Exception: Coordination error: No node, path /clickhouse/tables/0/default/DATABASECHANGELOG/log. (KEEPER_EXCEPTION)
```
This indicates that Zookeeper has lost the metadata paths for one or more ClickHouse tables.
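Before changing anything, you can check which replicated tables are affected. The query below is a quick diagnostic sketch; adjust the deployment or pod name to match your installation:
```bash
# List replicated tables that are read-only because of missing Zookeeper metadata
kubectl exec -it deployment/clickhouse-opik-clickhouse -- \
  clickhouse-client --query "SELECT database, table, is_readonly, zookeeper_exception FROM system.replicas WHERE is_readonly"
```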
#### Resolution Steps
Follow these steps to restore ClickHouse table metadata in Zookeeper:
##### 1. Clean Zookeeper Paths (If Needed)
If only some table paths are missing in Zookeeper, you will need to delete the existing ClickHouse paths manually before restoring the metadata. Connect to the Zookeeper pod and use the Zookeeper CLI:
```bash
# Connect to Zookeeper pod
kubectl exec -it cometml-production-opik-zookeeper-0 -- zkCli.sh -server localhost:2181
# Delete all ClickHouse table paths
deleteall /clickhouse/tables
```