Run open source LLM evaluations with Opik!


OPEN SOURCE LLM EVALUATION

Track. Evaluate. Test. Ship. Repeat.

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with tracing, eval metrics, and production-ready dashboards.

Now with automated agent optimization and built-in guardrails.

Get Started Free

Optimize and Benchmark Your LLM Applications With Ease

Log traces and spans, define and compute evaluation metrics, score LLM outputs, compare performance across app versions, and more.

[Screenshot: automated prompt optimization run with LLM eval metric scores improving over iterations]

Automatically Optimize Prompts & Agents

  • Automate prompt engineering for agents and tools based on your LLM eval metrics.
  • Iterate to achieve elite system prompts and freeze them into reusable, production-ready assets.
  • Run 4 powerful optimizers: Few-shot Bayesian, MIPRO, evolutionary, & LLM-powered MetaPrompt.
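At its core, prompt optimization is a search over candidate prompts guided by an eval metric. The sketch below is a minimal, illustrative version of that loop — it is not the Opik optimizer API, and `mock_llm`, `exact_match_metric`, and `optimize_prompt` are hypothetical names standing in for a real model call and metric:

```python
def exact_match_metric(output: str, expected: str) -> float:
    """Score 1.0 when the model output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def mock_llm(prompt: str, question: str) -> str:
    """Stand-in for a real model call: the richer prompt answers correctly."""
    return "paris" if "step by step" in prompt else "france"

def optimize_prompt(candidates, dataset, llm=mock_llm):
    """Pick the candidate system prompt with the highest mean metric score."""
    best_prompt, best_score = None, -1.0
    for prompt in candidates:
        scores = [exact_match_metric(llm(prompt, q), a) for q, a in dataset]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_prompt, best_score = prompt, mean
    return best_prompt, best_score

dataset = [("Capital of France?", "Paris")]
candidates = ["Answer briefly.", "Think step by step, then answer."]
prompt, score = optimize_prompt(candidates, dataset)
```

Opik's optimizers replace this brute-force scan with smarter search strategies (Bayesian, evolutionary, LLM-guided), but the score-and-select loop is the same idea.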
[Screenshot: AI guardrails performing topic moderation for an LLM chatbot]

Maximize Trust & Safety With Guardrails

  • Screen user inputs and LLM outputs to stop unwanted content in its tracks.
  • Detect and redact PII, competitor mentions, off-topic discussions, and more.
  • Choose Opik’s powerful built-in models or your favorite third-party guardrails libraries.
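Conceptually, a guardrail is a screening function that runs on every input or output before it passes through. The sketch below shows the pattern with a toy regex-based PII redactor and keyword topic filter — real guardrails (including Opik's built-in models) use trained detectors, and `screen`, `EMAIL`, and `BLOCKED_TOPICS` are hypothetical names for illustration:

```python
import re

# Toy patterns; production guardrails use trained PII/NER and topic models.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
BLOCKED_TOPICS = ("competitor", "politics")

def screen(text: str) -> dict:
    """Redact PII and flag off-topic content before it reaches the user."""
    redacted = EMAIL.sub("[REDACTED_EMAIL]", text)
    flagged = any(topic in redacted.lower() for topic in BLOCKED_TOPICS)
    return {"text": redacted, "blocked": flagged}

result = screen("Contact me at jane@example.com about our competitor.")
```

The same screen can be applied symmetrically: once to user input before it reaches the model, and again to the model's output before it reaches the user.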
[Screenshot: LLM evaluation dashboard showing an application trace logging the application's interactions with the large language model]

Log Traces & Spans

  • Record, sort, search, and understand each step your LLM app takes to generate a response. 
  • Manually annotate, view, and compare LLM responses in a user-friendly table.
  • Log traces during development and in production.
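A trace is simply the ordered record of the spans (steps) your app executed while producing a response. The sketch below shows the idea with a hypothetical `track` decorator writing spans to an in-memory list — Opik's real SDK provides its own tracing decorator and persists spans server-side, so treat this as a conceptual model only:

```python
import functools
import time

TRACE = []  # in-memory span store; a real tracer persists these centrally

def track(fn):
    """Record each call as a span: name, inputs, output, and latency."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        TRACE.append({
            "span": fn.__name__,
            "input": args,
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper

@track
def retrieve(query):
    return ["doc about " + query]

@track
def generate(query):
    docs = retrieve(query)
    return f"Answer using {len(docs)} docs."

answer = generate("opik")
```

After one call to `generate`, `TRACE` holds two spans (`retrieve`, then `generate`), which is exactly the step-by-step record you sort and search in the dashboard.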
[Screenshot: LLM evaluation UI showing LLM-as-a-judge eval metrics scoring conversational AI responses for correctness and hallucination]

Evaluate Your LLM Application’s Performance

  • Run experiments with different prompts and evaluate against a test set.
  • Choose and run pre-configured evaluation metrics or define your own with our convenient SDK library.
  • Consult built-in LLM judges for complex issues like hallucination detection, factuality, and moderation.
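Defining your own metric usually comes down to a function that maps an output (plus any reference data) to a score, applied across a test set. The sketch below illustrates the shape with a hypothetical keyword-coverage metric — it is not Opik's SDK interface, just the pattern a custom metric follows:

```python
def contains_keywords(output: str, keywords) -> float:
    """Custom metric: fraction of expected keywords present in the output."""
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    return hits / len(keywords)

test_set = [
    {"output": "Paris is the capital of France.", "keywords": ["Paris", "France"]},
    {"output": "I am not sure.", "keywords": ["Paris", "France"]},
]

scores = [contains_keywords(row["output"], row["keywords"]) for row in test_set]
mean_score = sum(scores) / len(scores)
```

Aggregate scores like `mean_score` let you compare prompts or app versions side by side; the per-row scores tell you which responses to inspect.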
[Screenshot: LLM evaluation dashboard showing application-level unit tests on GenAI applications and their results]

Confidently Test Within Your CI/CD Pipeline

  • Establish reliable performance baselines with Opik’s LLM unit tests, built on PyTest.
  • Build comprehensive test suites to evaluate your entire LLM pipeline on every deploy.
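An LLM unit test asserts that a pipeline's metric score stays above a baseline, so a regression fails the build. The sketch below shows the PyTest-style shape of such a test — `answer_question` and `relevance` are hypothetical stand-ins for your deployed pipeline and metric, not Opik's test helpers:

```python
def answer_question(question: str) -> str:
    """Stand-in for the deployed LLM pipeline under test."""
    return "Paris is the capital of France."

def relevance(output: str, expected: str) -> float:
    """Toy metric: full marks if the expected answer appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0

def test_capital_question_meets_baseline():
    score = relevance(answer_question("Capital of France?"), "Paris")
    assert score >= 0.9, f"regression: relevance {score} below baseline"

test_capital_question_meets_baseline()
```

Dropped into a PyTest suite and wired into CI, a set of these tests runs your whole pipeline against its baselines on every deploy.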
[Screenshot: LLM evaluation dashboard showing production monitoring and logging used to create new test datasets from production data]

Monitor & Analyze Production Data

  • Log all your production traces so issues surface quickly, as they happen.
  • Understand your models’ performance on unseen data in production and generate datasets for new dev iterations.
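The feedback loop from production back to development is mechanical: filter logged traces by a quality signal, then convert the failures into a test set for the next iteration. The sketch below illustrates this with hypothetical trace records carrying a `user_score` field — the field names and `build_dataset` helper are assumptions for illustration, not Opik's schema:

```python
production_traces = [
    {"input": "Reset my password", "output": "Click 'Forgot password'.", "user_score": 1},
    {"input": "Cancel my order", "output": "I cannot help with that.", "user_score": 0},
]

def build_dataset(traces, max_score=0):
    """Turn low-scoring production traces into a dev-iteration test set."""
    return [
        {"input": t["input"], "reference": t["output"]}
        for t in traces
        if t["user_score"] <= max_score
    ]

dataset = build_dataset(production_traces)
```

Here the one failing trace becomes a dataset row, ready to evaluate new prompts or app versions against.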

Open Source & Ready to Run

Opik is a true open-source project: its full LLM evaluation feature set ships free in the source code. Download the code from GitHub and run it locally, or choose the highly scalable, industry-compliant version built for enterprise teams.


Iterate Across Your LLM App Development Lifecycle

Opik helps analyze the quality of LLM responses at every step of the app development lifecycle so you can debug and optimize with confidence.

Understand Cause & Effect in Complex LLM Systems

With multiple components influencing model behavior and countless outputs generated during development, manual review and vibe checks don’t cut it.

With Opik, you can log traces and compute scores in the aggregate, and drill down to individual prompts and responses that need attention.   

Built for developers first. Trusted by the world’s largest enterprise teams.

AssemblyAI logo
Natwest logo
Uber Logo
Netflix Logo
Etsy logo
Mobileye logo
Try Comet Free
Book a Demo

Integrate With Your Existing LLM Workflow

Opik is compatible with any LLM you choose, and it comes out of the box with the following direct integrations to get you up and running fast.

OpenAI

Predibase

Ragas

OpenTelemetry

LangChain

LlamaIndex

LiteLLM

DSPy

Try Opik in Your LLM System

Opik is free to try and fast to configure. Choose the implementation that’s right for your team and follow the steps below to start logging your first trace.

Get started today, free.

No credit card required, try Comet with no risk and no commitment.

Create Free Account
Contact Sales
©2025 Comet ML, Inc. – All Rights Reserved
