The Prompt Library is now part of the Opik 2.0 UI, accessible from the project sidebar under Prompt library. Alongside that, prompt versions have gained first-class environment support — you can tag a version as production or staging and retrieve it by name from the SDK, without tracking version numbers in application code.
What’s new:
- client.get_prompt(name, environment="production") returns the version currently tagged as production; version and environment are mutually exclusive and passing both raises a clear error
- client.set_prompt_environments(name, ["production", "staging"]) replaces the full environment set on a version; the same environment is automatically moved away from whatever version previously held it
- client.create_prompt(name, content="...", environments=["staging"]) and client.create_chat_prompt(...) accept environments directly
- setPromptEnvironments, getPrompt({ environment }), and createPrompt({ environments }) mirror the Python API
- v1, v2, v3 in the UI and API instead of raw commit hashes

The Traces, Spans, and Threads tabs now have a redesigned filter bar that makes it faster to narrow down what you’re looking at. Filters appear as chips directly in the toolbar: pick a field, set a value, and the table updates instantly. Frequently used filters can be pinned to the bar so they’re always one click away, and filter state is preserved in the URL so you can share an exact filtered view with a teammate.
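The prompt environment rules listed above (an environment points at one version at a time, setting environments replaces the version's full set, and version and environment are mutually exclusive selectors) can be illustrated with a small in-memory model. This is a sketch of the semantics only, not the Opik SDK: the PromptEnvironments class and its method names are hypothetical and nothing here talks to a backend.

```python
class PromptEnvironments:
    """In-memory model of environment tags on prompt versions (illustrative only)."""

    def __init__(self):
        # environment name -> version label currently holding it
        self._holder = {}

    def set_environments(self, version, environments):
        # Replace the full environment set on `version`:
        # first drop every tag the version currently holds...
        for env in [e for e, v in self._holder.items() if v == version]:
            del self._holder[env]
        # ...then claim the new ones, moving each away from any previous holder.
        for env in environments:
            self._holder[env] = version

    def resolve(self, version=None, environment=None):
        # `version` and `environment` are mutually exclusive selectors.
        if version is not None and environment is not None:
            raise ValueError("pass either version or environment, not both")
        if environment is not None:
            return self._holder[environment]
        return version


tags = PromptEnvironments()
tags.set_environments("v1", ["production", "staging"])
tags.set_environments("v2", ["production"])  # production moves from v1 to v2
```

After the second call, looking up "production" resolves to v2 while "staging" still resolves to v1, which is the "moved away from whatever version previously held it" behavior.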
- get_trace_spans and read tool calls to inspect intermediate spans during evaluation, enabling correctness checks about tool usage, model selection, and per-span errors inside complex agents
- ClassCastException under certain configurations
- data:<type>;base64, prefix are now stripped correctly in both the SDK and the frontend
- opik migrate: skipped items reported clearly. The migration command now reports each skipped item with its reason, count, and sample source IDs, and exits with code 1 so CI pipelines detect incomplete migrations

And much more! 👉 See full commit log on GitHub
Releases: 2.0.48, 2.0.49, 2.0.50, 2.0.51, 2.0.52
Alert rules now support structured condition grouping: conditions within a group are evaluated with AND, while groups themselves are combined with OR. This makes it possible to express logic such as “flag a trace if (hallucination score > 0.8 AND relevance score < 0.3) OR (toxicity score > 0.5)”.
Existing single-condition alerts continue to work exactly as before — each legacy condition is automatically treated as its own group, so no migration is needed.
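The grouping logic can be sketched in a few lines of Python. This is an illustration of the AND-within-group, OR-across-groups evaluation described above, not Opik's alert engine; the tuple format for conditions, the metric names, and the helper functions are all assumptions made for the example.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def condition_holds(condition, scores):
    """Evaluate one (metric, op, threshold) condition against a trace's scores."""
    metric, op, threshold = condition
    return metric in scores and OPS[op](scores[metric], threshold)


def rule_matches(groups, scores):
    """Groups are OR-ed together; conditions inside a group are AND-ed."""
    return any(all(condition_holds(c, scores) for c in group) for group in groups)


def from_legacy(conditions):
    """Wrap each legacy condition in its own single-condition group."""
    return [[c] for c in conditions]


# (hallucination > 0.8 AND relevance < 0.3) OR (toxicity > 0.5)
rule = [
    [("hallucination", ">", 0.8), ("relevance", "<", 0.3)],
    [("toxicity", ">", 0.5)],
]
```

A trace with high hallucination but acceptable relevance does not match the first group on its own; it is flagged only if the toxicity group also fires.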
- prompt_mask_context(masks) / promptMaskContext(masks) lets you run agent code with specific prompt IDs silently redirected to a different version ID, non-destructively. The agent calls get_prompt() as usual and receives the overridden template without any permanent change to the prompt library. Designed for A/B testing and optimizer sweep scenarios.
- DatasetItem field (e.g. id, as in HotpotQA) previously raised TypeError: multiple values for keyword argument. The SDK now strips conflicting keys and emits a one-time warning so iteration completes.
- <0.8 and >=0.8: track_harbor() now patches whichever method name the installed version of harbor exposes (_setup_environment or _setup_agent_environment), so tracing works regardless of which version is installed.
- qwen/qwen3.7-max are now available in the model picker.

And much more! 👉 See full commit log on GitHub
Releases: 2.0.42, 2.0.43, 2.0.44, 2.0.45, 2.0.46, 2.0.47
client.get_prompt() and client.get_chat_prompt() now cache results in-process, so repeated calls inside a hot path skip the network round-trip entirely. Pinned commits are cached indefinitely; latest-version lookups use a 5-minute TTL that refreshes in the background so your code always gets a reasonably fresh value without blocking.
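The caching behavior described above follows a stale-while-revalidate pattern, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's implementation: the PromptCache class, the fetch callable, and the injectable clock are hypothetical names introduced for the example.

```python
import threading
import time


class PromptCache:
    """Stale-while-revalidate cache keyed by (name, commit)."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable: (name, commit) -> prompt
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}         # key -> (value, fetched_at)
        self._refreshing = set()   # keys with a refresh already in flight

    def get(self, name, commit=None):
        key = (name, commit)
        with self._lock:
            entry = self._entries.get(key)
            # Pinned commits never expire; only "latest" (commit=None) goes stale.
            stale = (
                entry is not None
                and commit is None
                and self._clock() - entry[1] >= self._ttl
                and key not in self._refreshing
            )
            if stale:
                self._refreshing.add(key)
        if entry is None:
            # First lookup: nothing to serve yet, so fetch synchronously.
            return self._store(key)
        if stale:
            threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return entry[0]  # possibly stale, but returned without blocking

    def _store(self, key):
        value = self._fetch(*key)
        with self._lock:
            self._entries[key] = (value, self._clock())
        return value

    def _refresh(self, key):
        try:
            self._store(key)
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

The design choice worth noting is that an expired entry is still returned immediately while a single background thread refreshes it, so a hot path only pays for the network round-trip on the very first lookup.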
What’s new:
- OPIK_PROMPT_CACHE_TTL_SECONDS to adjust the freshness window (default: 300 s)
- no_cache=True / noCache: true to force a live fetch from the backend
- @track context, the prompt ID and commit are automatically recorded in the trace metadata so you know which version was used at inference time

opik connect CLI Improvements
The opik connect and opik endpoint CLI commands have been reorganized with a much better error experience:
- ~/.opik.config file exists, opik connect now offers to run opik configure automatically (skipped in non-interactive / headless environments)
- OpikADKOtelTracer was killing all active OpenTelemetry spans and re-patching the ADK exporter on every request; the patcher is now idempotent and preserves user-configured OTel pipelines
- span.end(), span.update(), trace.end(), or trace.update() no longer clears the environment field set at creation time
- /datasets/items/stream call; under high request volume this was pushing database CPU to 80–99%. It now uses a direct primary-key lookup instead

And much more! 👉 See full commit log on GitHub
Releases: 2.0.32, 2.0.33, 2.0.34, 2.0.35, 2.0.36, 2.0.37
Here are the most relevant improvements we’ve made since the last release:
You can now tag traces, spans, and threads with an environment field — production, staging, dev, or any label you define. This makes it easy to separate signal from noise: filter your project’s trace view to only production issues, or compare behavior between environments without spinning up separate projects.
What’s new:
- production vs staging in a single project
- environment to @track, opik.trace(), or opik.span(), and it’s preserved through .end() and .update() calls
- environment on trace and span creation

Test suite assertions can now look inside a trace (not just the top-level input/output) to reason about tool calls, intermediate LLM steps, and sub-agent behavior. The evaluator LLM gets access to two on-demand tools: get_trace_spans (lists all sub-spans for the trace) and read (fetches a specific span by ID), so it can drill into exactly what happened at each step.
Why it matters: Previously, an assertion could only see what went in and came out of the agent. Now it can check whether the right tool was called, which model was used in an intermediate step, or whether a specific span had an error — enabling far more meaningful correctness checks for complex agents.
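To make the two tools concrete, here is a rough sketch of how checks like "was the right tool called" or "did any span error" could be built on top of them. The tool names get_trace_spans and read come from the notes above; everything else (the trace dictionary shape, the span fields, the model names, and the two helper functions) is hypothetical, and in Opik these tools are invoked by the evaluator LLM rather than by your code.

```python
TRACE = {
    "id": "trace-1",
    "spans": [
        {"id": "s1", "name": "plan", "type": "llm", "model": "model-a", "error": None},
        {"id": "s2", "name": "search_docs", "type": "tool", "model": None, "error": None},
        {"id": "s3", "name": "answer", "type": "llm", "model": "model-b", "error": "timeout"},
    ],
}


def get_trace_spans(trace):
    """List a lightweight summary of every sub-span in the trace."""
    return [{"id": s["id"], "name": s["name"], "type": s["type"]} for s in trace["spans"]]


def read(trace, span_id):
    """Fetch the full record for one span by ID."""
    for span in trace["spans"]:
        if span["id"] == span_id:
            return span
    raise KeyError(span_id)


def tool_was_called(trace, tool_name):
    # Answerable from the summaries alone.
    return any(s["type"] == "tool" and s["name"] == tool_name for s in get_trace_spans(trace))


def spans_with_errors(trace):
    # Needs the full span record, so each candidate is read individually.
    return [s["id"] for s in get_trace_spans(trace) if read(trace, s["id"])["error"]]
```

Splitting the work into a cheap listing call and a targeted read keeps the evaluator from loading every span's full payload when only one step matters.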
Traces and spans tables no longer download attachment bytes (images, PDFs) when loading a list; attachments are lazy-loaded only when you open an individual trace. In our benchmarks with image and PDF attachments, this reduced the per-page payload from 85 MB to 0.13 MB and load time from 3.4 s to 0.1 s.
Why it matters: If any of your traces include file attachments, the table was silently fetching all that binary data on every page load. The experience is now fast regardless of attachment size or count.
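The idea behind the change can be shown with a toy store that keeps list views and attachment payloads separate. This is only a sketch of the lazy-loading pattern; the TraceStore class, its methods, and the data layout are hypothetical and do not reflect Opik's backend.

```python
class TraceStore:
    """Toy store separating list views from attachment payloads."""

    def __init__(self, traces):
        # trace id -> {"name": str, "attachments": {filename: bytes}}
        self._traces = traces
        self.bytes_served = 0

    def list_rows(self):
        # List view: attachment names and sizes only, never the bytes themselves.
        return [
            {
                "id": trace_id,
                "name": t["name"],
                "attachments": [
                    {"filename": f, "size": len(b)} for f, b in t["attachments"].items()
                ],
            }
            for trace_id, t in self._traces.items()
        ]

    def open_attachment(self, trace_id, filename):
        # Detail view: bytes are fetched only when a trace is opened.
        data = self._traces[trace_id]["attachments"][filename]
        self.bytes_served += len(data)
        return data
```

With this split, the cost of rendering a table page depends on the number of rows, not on how large the attached files are.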
The Playground’s reasoning_effort control now tracks OpenAI’s actual per-model capability matrix. Models like gpt-5.1 that support a "none" option show it; models that don’t support reasoning effort have the control hidden automatically. Previously, the UI could get out of sync with what the backend supported.
Several reliability fixes and small improvements to the Python (and TypeScript) SDKs:
- search_traces() and search_spans() now automatically wait and retry on 429 responses instead of raising an error, so large bulk searches complete reliably under API rate limits

And much more! 👉 See full commit log on GitHub
Releases: 2.0.25, 2.0.26, 2.0.27, 2.0.28, 2.0.29, 2.0.30, 2.0.31
This is our biggest release yet: a fundamental rethink of how you build, debug, and improve AI agents with Opik. Three major new feature groups (Ollie, Test Suites, and the Agent Playground) work together to close the loop from observing a problem to shipping a fix, all without leaving the platform. Alongside them, we’ve reorganized everything around projects, redesigned the core trace experience, and rebuilt the navigation to match. Here’s what’s new:
Ollie is a powerful coding agent built into the Opik UI. It has full access to your project’s traces and logs, and can analyze patterns across hundreds of interactions, diagnose issues, and take action to fix them, all without leaving the platform.

Highlights:
- opik connect, with support for --workspace and --api-key flags

Test Suites bring structured regression testing to agent development. Each suite has global rules that every test case must pass, plus item-level assertions for specific scenarios. Define rules in plain English for what your agent should and shouldn’t do, and get clear pass/fail results when you run them.

Highlights:
The Agent Playground connects to your agent so you can run it directly from the Opik UI. Experiment with different prompts, models, and parameters to see how your whole agent responds, without touching your code. Agent Configurations track and version the full set of prompts, models, and variables as a single unit, so you always know what combination worked.

Highlights:
- AgentConfigManager and TypeScript AgentConfig with Zod schema validation and blueprint caching

Projects now map directly to your agents. Test suites, experiments, optimizations, prompts, datasets, alerts, and dashboards are all scoped to the project, giving you a focused view of everything related to a single agent, paired with a redesigned navigation and trace experience.

What’s new:
- project_name scoping for datasets, experiments, optimizations, prompts, alerts, and dashboards

And much more! 👉 See full commit log on GitHub
Releases: 1.10.24 through 2.0.21
Here are the most relevant improvements we’ve made since the last release:
We’ve released opik-openclaw, a native OpenClaw plugin that gives you full-stack observability for your agents, powered by Opik. This brings enterprise-grade tracing, evaluation, and monitoring to the fastest-growing open-source agent framework.
What you get:
Get started in two minutes: install the plugin with openclaw plugins install @opik/opik-openclaw, configure your API key, and traces start flowing immediately. Works with both Opik Cloud and self-hosted instances.
👉 Visit the GitHub repository here
We’ve broadened the range of models and providers you can use across the platform, giving you more flexibility in how you build and evaluate your LLM applications.
What’s new:
- openrouter/free is directly selectable, and openrouter/* route models including /auto are supported and prioritized in model selection

We’ve continued to expand the capabilities of both the TypeScript and Python SDKs, making it easier to integrate Opik into your workflows programmatically.
What’s new:
- searchThreads functionality in the TypeScript SDK

We’ve made the Optimization Studio more powerful and flexible, with new metrics, persistence, and a major Optimizer SDK update.
What’s new:
We’ve made several improvements to make your day-to-day workflow smoother and more intuitive.
What’s improved:
We’ve introduced prompt version tags, giving you a lightweight way to label and organize your prompt versions across the platform.
What’s new:
👉 Prompt Version Tags Documentation
And much more! 👉 See full commit log on GitHub
Releases: 1.10.11, 1.10.12, 1.10.13, 1.10.14, 1.10.15, 1.10.16, 1.10.17, 1.10.18, 1.10.19, 1.10.20, 1.10.21, 1.10.22, 1.10.23
Here are the most relevant improvements we’ve made since the last release:
We’ve significantly expanded the capabilities of both our Python and TypeScript SDKs, making it easier to integrate Opik into your workflows programmatically.
What’s new:
- searchThreads functionality

👉 Annotation Queues | Opik Query Language | Dataset Versioning
We’ve expanded our LLM provider support and improved integrations to give you more flexibility in your AI workflows.
What’s new:
We’ve made several improvements to make your day-to-day workflow smoother and more intuitive.
What’s improved:
And much more! 👉 See full commit log on GitHub
Releases: 1.9.102, 1.9.103, 1.9.104, 1.10.0, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6, 1.10.7, 1.10.8, 1.10.9, 1.10.10
Here are the most relevant improvements we’ve made since the last release:
We’re excited to introduce Optimization Studio — a powerful new way to improve your prompts without writing code. Bring a prompt, define what “good” looks like, and Opik tests variations to find a better version you can ship with confidence.
What’s new:
For teams that prefer a programmatic workflow, we’ve also released Opik Optimizer SDK v3 with improved algorithms, better performance, and more intuitive APIs.
👉 Optimization Studio Documentation
We’ve enhanced the dashboard with new widgets and visualization capabilities to help you track and compare experiments more effectively.

What’s new:

We’ve added support for the latest video generation models, enabling you to track and log video outputs from your AI applications.
What’s new:
We’ve made it easier to organize and navigate your experiments with new filtering and tagging capabilities.
What’s improved:
We’ve made several improvements to make your day-to-day workflow smoother.
What’s improved:
We’ve improved our SDK integrations with better tracing and performance metrics.
What’s improved:
And much more! 👉 See full commit log on GitHub
Releases: 1.9.79, 1.9.80, 1.9.81, 1.9.82, 1.9.83, 1.9.84, 1.9.85, 1.9.86, 1.9.87, 1.9.88, 1.9.89, 1.9.90, 1.9.91, 1.9.92, 1.9.95, 1.9.96, 1.9.97, 1.9.98, 1.9.99, 1.9.100, 1.9.101
Here are the most relevant improvements we’ve made since the last release:
We’ve expanded the Playground with new provider support and enhanced functionality to make prompt experimentation more powerful.
What’s new:



We’ve made online evaluation more flexible and easier to manage across your projects.
What’s improved:

We’ve refined the user experience across the platform with improved responsiveness and dashboard polish.
What’s improved:
We’ve updated our SDKs with new capabilities and modernized dependencies.
What’s new:
- evaluate() method

And much more! 👉 See full commit log on GitHub
Releases: 1.9.57, 1.9.58, 1.9.59, 1.9.60, 1.9.61, 1.9.62, 1.9.63, 1.9.64, 1.9.65, 1.9.66, 1.9.67, 1.9.68, 1.9.69, 1.9.70, 1.9.71, 1.9.72, 1.9.73, 1.9.74, 1.9.75, 1.9.76, 1.9.77, 1.9.78
Here are the most relevant improvements we’ve made since the last release:
Custom Dashboards are now live! 🎉

Our new dashboards engine lets you build fully customizable views to track everything from token usage and cost to latency and quality, across projects and experiments.
📍 Where to find them?
Dashboards are available in three places inside Opik:
🧩 Built-in templates to get started fast
We ship dashboards with zero-setup pre-built templates, including Performance Overview, Experiment Insights and Project Operational Metrics.
Templates are fully editable and can be saved as new dashboards once customized.
🧱 Flexible widgets
Dashboards support multiple widget types:
Widgets support filtering, grouping, resizing, drag-and-drop layouts, and global date range controls.
Span-Level Metrics
Span-level metrics are officially live in Opik, supporting both LLM-as-Judge and code-based metrics!
Teams can now easily evaluate the quality of specific steps inside their agent flows. Instead of assessing only the final output or top-level trace, you can attach metrics directly to individual call spans or segments of an agent’s trajectory.
This unlocks dramatically finer-grained visibility and control. For example:
New support for accessing the full tree, a subtree, or leaf nodes in Online Scores
This update enhances the online scoring engine to support referencing entire root objects (input, output, metadata) in LLM-as-Judge and code-based evaluators, not just nested fields within them.
Online Scoring previously only exposed leaf-level values from an LLM’s structured output. With this update, Opik now supports rendering any subtree: from individual nodes to entire nested structures.
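The difference between root, subtree, and leaf access can be illustrated with a small path resolver. This is a sketch only: the dotted-path syntax, the resolve function, and the sample output structure are assumptions for the example and are not Opik's variable-mapping syntax.

```python
def resolve(root, path):
    """Return the node at a dotted path; an empty path returns the whole root."""
    node = root
    for key in filter(None, path.split(".")):
        # Lists are indexed by integer position, dicts by key.
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


output = {
    "answer": {"text": "Paris", "sources": [{"title": "Atlas", "page": 12}]},
    "usage": {"tokens": 42},
}

whole_root = resolve(output, "")                   # entire output object
subtree = resolve(output, "answer")                # nested structure
leaf = resolve(output, "answer.sources.0.title")   # single value
```

An evaluator prompt that previously could only receive the leaf value can, under this model, be handed the whole answer subtree or the full output object instead.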
You can now tag individual prompt versions (not just the prompt!).
This provides a clean, intuitive way to mark best-performing versions, manage lifecycles, and integrate version selection into agent deployments.
You can now pass audio as part of your prompts, both in the Playground and in online evaluations, for advanced multimodal scenarios.

Thread-level insights
Added thread-level metrics and statistics to the threads table, giving users aggregated insights into their full multi-turn agentic interactions:
Experiment insights
Added additional aggregation methods in headers for experiment items.
This new release adds percentile aggregation methods (p50, p90, p99) for all numerical metrics in experiment items table headers, extending the existing pattern used for duration to cost, feedback scores, and total tokens.
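For readers unfamiliar with percentile aggregates, the following sketch computes p50, p90, and p99 for a column of values. It uses the nearest-rank definition, which is an assumption: the notes above do not say which percentile method Opik applies, and an interpolating method would give slightly different numbers. The function names and the sample costs are made up for the example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("no values to aggregate")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def header_aggregates(values):
    """Compute the three header aggregates for one numeric column."""
    return {f"p{p}": percentile(values, p) for p in (50, 90, 99)}


costs = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.40, 1.20]
summary = header_aggregates(costs)
# summary == {"p50": 0.05, "p90": 0.4, "p99": 1.2}
```

The gap between p50 and p99 in this sample shows why percentiles are useful alongside an average: a single expensive item dominates the tail without moving the median.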
Support for GPT-5.2 in Playground and Online Scoring
Added full support for GPT-5.2 models in both the Playground and online scoring features for OpenAI and OpenRouter providers.
Harbor Integration
Added a comprehensive Opik integration for Harbor, a benchmark evaluation framework for autonomous LLM agents. The integration enables observability for agent benchmark evaluations (SWE-bench, LiveCodeBench, Terminal-Bench, etc.).
👉 Harbor Integration Documentation
And much more! 👉 See full commit log on GitHub
Releases: 1.9.41, 1.9.42, 1.9.43, 1.9.44, 1.9.45, 1.9.46, 1.9.47, 1.9.48, 1.9.49, 1.9.50, 1.9.51, 1.9.52, 1.9.53, 1.9.54, 1.9.55, 1.9.56