January 28, 2025
LLM-as-a-judge evaluators have gained widespread adoption due to their flexibility, scalability, and close alignment with human judgment.
Opik is an open-source platform, created by Comet, for evaluating, testing, and monitoring LLM applications. When teams integrate language models into their applications, they need ways to debug complex systems, analyze performance, and understand how their development work affects the accuracy, relevance, context awareness, and other qualities of the responses an LLM returns. The platform they use to log, evaluate, and iterate on this work needs to be user-friendly, highly scalable, and able to accommodate many distinct tasks and use cases.
To build Opik, the Comet team tapped into decades of combined experience training, deploying, and monitoring machine learning models, with the goal of making data science workflows more accessible to teams building with LLMs.
In this post, we’ll share the Comet engineering team’s perspective on building Opik, exploring the architectural decisions and technical details that enable Opik to provide robust tracing, evaluation, and production monitoring capabilities.
While our focus is on how Opik’s architecture supports state-of-the-art LLM evaluation, the underlying design patterns and technologies can serve as a reference for many other production systems.
Opik’s development was shaped by several key requirements:
One prominent challenge in building Opik was the unpredictable nature of LLM calls and traces, which produce many types of concurrent events (traces, spans, feedback scores, dataset items, experiment items, etc.). These events often arrive out of order, which makes keeping the data consistent harder.
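To make that ordering problem concrete, here is a minimal, purely illustrative Python sketch (not Opik's actual ingestion code) that merges partial events keyed by trace ID, so a feedback score arriving before its trace is still attached once the trace shows up:

```python
from collections import defaultdict

# Purely illustrative: accumulate partial, out-of-order events per trace ID so the
# stored record converges no matter which event arrives first.
traces: dict[str, dict] = defaultdict(dict)

def apply_event(event: dict) -> None:
    """Merge an incoming event into the record for its trace."""
    record = traces[event["trace_id"]]
    if event["type"] == "trace":
        record["name"] = event["name"]
    elif event["type"] == "feedback_score":
        record.setdefault("feedback_scores", []).append(
            {"name": event["name"], "value": event["value"]}
        )

# The feedback score arrives before the trace it refers to, and is still attached.
apply_event({"type": "feedback_score", "trace_id": "t1", "name": "relevance", "value": 0.9})
apply_event({"type": "trace", "trace_id": "t1", "name": "chat-request"})
print(traces["t1"])
```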
Addressing these challenges required balancing tradeoffs. The nature of the data generated by LLM applications pointed toward building Opik as an eventually consistent system. Performance and scalability take precedence over strict consistency in Opik's architecture, which led us to carefully select technologies that could meet those requirements.
This is why Opik leverages:
Opik’s architecture consists of multiple services that each handle a specific role, including:
We ran performance tests to measure Opik's ingestion and display latencies with 10,000 and 100,000 traces, each containing three spans, using our free and pro plans as a reference.
From these tests, we can conclude:
These results are remarkable given the constraints of a simple local installation:
Setup:
Metrics:
Results:
1) 10,000 traces (30,000 spans, ~40,000 total events)
2) 100,000 traces (300,000 spans, ~400,000 total events)
During Opik’s development, we needed an identifier strategy that favors scalability for data entities such as traces and experiments. We identified these requirements:
We evaluated:
After benchmarking the candidates in both ClickHouse and MySQL:
We selected UUID v7 for most identifiers due to its performance, sortability, and standardization. For cases where exposing the creation timestamp might be problematic, we use UUID v4.
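For illustration, here is a minimal Python sketch of the UUID v7 layout defined in RFC 9562 (a 48-bit Unix millisecond timestamp followed by version, variant, and random bits); it is a toy generator, not the library Opik uses:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Toy UUID v7 generator following the RFC 9562 layout."""
    ts_ms = time.time_ns() // 1_000_000                              # Unix timestamp in ms
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80   # timestamp occupies the top 48 bits
    value |= 0x7 << 76                        # version nibble = 7
    value |= rand_a << 64
    value |= 0b10 << 62                       # RFC 4122/9562 variant bits
    value |= rand_b
    return uuid.UUID(int=value)

# Because the timestamp leads, IDs generated over time sort chronologically,
# which keeps index inserts append-friendly (ties within one ms fall back to random order).
ids = []
for _ in range(3):
    ids.append(uuid7())
    time.sleep(0.002)          # separate timestamps so the ordering is visible
print(sorted(ids) == ids)      # True: later IDs sort after earlier ones
```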
Opik’s backend uses Java 21 LTS and Dropwizard 4, structured as a RESTful web service offering public API endpoints for core functionality. Full API documentation is available here:
We rely on well-known open-source libraries to avoid reinventing the wheel:
Opik’s user interface is built with:
The frontend is served by Nginx, which also functions as a reverse proxy. In the fully open-source version, Nginx does not enforce rate limits by default (though it can be configured to do so).
Currently, Opik offers a Python SDK, and a TypeScript SDK will be released soon. Much of the boilerplate code for the SDKs is automatically generated from the OpenAPI specification using Fern. This approach helps us:
The Python SDK uses a message queue and multiple workers to send data to the Opik API asynchronously. This design ensures that network latency or transient errors do not disrupt your LLM application.
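As a hedged sketch of this pattern (not the SDK's actual internals), the snippet below shows how an in-process queue with background worker threads decouples the application from the network calls; `send_to_opik_api` is a hypothetical placeholder:

```python
import queue
import threading

events: queue.Queue = queue.Queue()

def send_to_opik_api(event: dict) -> None:
    """Hypothetical placeholder for the HTTP call that ships an event batch."""
    ...

def worker() -> None:
    # Workers drain the queue in the background; failures are swallowed (or retried)
    # so they never propagate into the instrumented application.
    while True:
        event = events.get()
        try:
            send_to_opik_api(event)
        except Exception:
            pass  # a real sender would log and retry with backoff
        finally:
            events.task_done()

for _ in range(4):  # several workers send concurrently
    threading.Thread(target=worker, daemon=True).start()

# The application thread only enqueues, so this returns immediately.
events.put({"type": "trace", "name": "chat-request", "input": "hello"})
events.join()  # optionally flush pending events before shutdown
```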
To meet Opik’s scalability requirements for high-volume data ingestion and fast queries, we compared:
against 22 different criteria (e.g., performance, scalability, operability, popularity, and licensing). After weighing these factors against Opik’s functional and non-functional requirements, we chose ClickHouse.
Opik uses ClickHouse for datasets that require near real-time ingestion and analytical queries, such as:
ClickHouse’s MergeTree engine family is vital for high ingest speeds and large data volumes. We use the ReplacingMergeTree engine variant to minimize costly data mutations (updates and deletes). Some highlights:
The image below details the schema used by Opik in ClickHouse:
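To make the ReplacingMergeTree behavior concrete, here is a hedged sketch using the clickhouse-connect Python client; the table and columns are hypothetical and simplified, not Opik's actual schema:

```python
import clickhouse_connect  # pip install clickhouse-connect; assumes a local ClickHouse

client = clickhouse_connect.get_client(host="localhost")

# Hypothetical, simplified table: ReplacingMergeTree keeps the row with the highest
# last_updated_at per ORDER BY key, so "updates" become plain inserts and costly
# ALTER TABLE ... UPDATE mutations are avoided.
client.command("""
    CREATE TABLE IF NOT EXISTS traces_demo (
        id              UUID,
        name            String,
        last_updated_at DateTime64(9)
    )
    ENGINE = ReplacingMergeTree(last_updated_at)
    ORDER BY id
""")

trace_id = "00000000-0000-7000-8000-000000000001"
client.command(f"INSERT INTO traces_demo VALUES ('{trace_id}', 'chat', now64(9))")
client.command(f"INSERT INTO traces_demo VALUES ('{trace_id}', 'chat-renamed', now64(9))")

# FINAL deduplicates at read time; background merges collapse the rows eventually.
print(client.query("SELECT name FROM traces_demo FINAL").result_rows)
```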
MySQL provides ACID-compliant transactional storage for Opik’s lower-volume but critical data, such as:
Again, Liquibase automates schema management and keeps MySQL definitions in sync with the rest of the platform.
The image below details the schema used by Opik in MySQL:
Redis is employed as:
The easiest way to use Opik is via a free comet.com account. Alternatively, the fully open-source version of Opik can be self-hosted with all core features (tracing, evaluation, production monitoring), though without the integrated user management provided by comet.com.
There are two main ways to self-host:
Comet also provides scalable, managed deployment solutions if you prefer a hands-off approach.
Opik is built on and runs on top of open-source infrastructure (MySQL, Redis, Kubernetes, and more), making it straightforward to integrate with popular observability stacks such as Grafana and Prometheus. Specifically:
Opik’s roadmap and extensibility thrive on active community collaboration. We’re excited to see how users contribute by writing code, improving documentation, and sharing feature ideas. If you’d like to get involved, here are a few ways to get started:
Before contributing, please make sure to review our Contributor License Agreement and License. This ensures a smooth process and clarifies how your contributions are used and recognized. Together, we can make Opik an even more powerful platform for LLM observability.
Opik’s architecture is designed with extensibility in mind. Recent updates include a new Online evaluation feature that allows traces to be scored in real time, using an LLM as a judge. We plan to add user-defined Python code metrics soon, and a TypeScript/JavaScript SDK is also underway.
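As a purely illustrative sketch of the LLM-as-a-judge pattern behind online evaluation (not Opik's built-in scoring rules), the snippet below asks one model to grade another model's answer; it assumes the openai package, an OPENAI_API_KEY, and a hypothetical judge prompt:

```python
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY

client = OpenAI()

JUDGE_PROMPT = (
    "You are a strict evaluator. Given a question and an answer, reply with a single "
    "number between 0.0 and 1.0 indicating how relevant the answer is to the question."
)

def judge_relevance(question: str, answer: str) -> float:
    """Score one model's answer by asking another model to act as the judge."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any judge-capable model works here
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return float(response.choices[0].message.content.strip())

print(judge_relevance("What is Opik?", "Opik is an open-source LLM evaluation platform."))
```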
Some upcoming features will introduce notable architectural changes. For example, we plan to support file attachments like images or PDFs in new traces, which will require integrating an object storage system (e.g., Amazon S3 for AWS-based deployments or MinIO for self-hosting). You can explore more details on our public roadmap:
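As a hedged illustration of that direction, the sketch below uses boto3 against an S3-compatible endpoint; the bucket name, object keys, and MinIO credentials are hypothetical, and the same code would target AWS S3 by dropping the custom endpoint:

```python
import boto3  # pip install boto3

# The same S3-compatible client code can target AWS S3 or a self-hosted MinIO
# instance; only the endpoint and credentials differ. All names below are
# illustrative (MinIO's default local endpoint and credentials).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="opik-attachments-demo")
s3.put_object(
    Bucket="opik-attachments-demo",
    Key="traces/t1/report.pdf",  # hypothetical key layout for a trace attachment
    Body=b"%PDF-1.7 ...",
)
obj = s3.get_object(Bucket="opik-attachments-demo", Key="traces/t1/report.pdf")
print(len(obj["Body"].read()), "bytes")
```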
Opik is a significant step forward in LLM evaluation and observability, combining cutting-edge technologies with a carefully planned, modular architecture. Its open-source nature and free availability empower a growing community of users, while Comet’s infrastructure offers scaling options and commercial support if needed. Whether you adopt the managed service or self-host Opik, you gain a powerful, flexible framework for building next-generation LLM applications.
For more information, visit Opik’s GitHub repository and documentation.