github confident-ai/deepeval v4.0.2
🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!

6 hours ago

DeepEval 4.0 introduces an agent-native evaluation workflow designed for coding agents, rapid debugging, and production AI systems.

If you're vibe coding agents with something like Claude Code, this release is for you.

Eval Harness for Coding Agents

Coding agents can now run eval-driven iterations directly in context.

  • Agents see metric failures, scores, and reasoning inline
  • Supports iterative patch → eval → retry workflows
  • Built for Cursor, Claude Code, Codex, and agentic development loops
  • Enables autonomous regression fixing with evaluation feedback

Quickstart: https://deepeval.com/docs/vibe-coder-quickstart
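The patch → eval → retry loop the harness enables can be sketched in plain Python. Note that `run_evals` and `propose_patch` below are hypothetical stand-ins for illustration, not DeepEval APIs — see the quickstart above for the real workflow:

```python
# Hedged sketch of an eval-driven iteration loop for a coding agent.
# `run_evals` and `propose_patch` are illustrative stand-ins, NOT DeepEval APIs.

def run_evals(code: str) -> dict:
    """Stand-in evaluator: returns a score plus the reasoning an agent would see inline."""
    passed = "fixed" in code
    return {
        "score": 1.0 if passed else 0.0,
        "reason": "fix marker present" if passed else "fix marker missing",
    }

def propose_patch(code: str, feedback: str) -> str:
    """Stand-in for the coding agent: produces a new patch given eval feedback."""
    return code + "  # fixed"

def eval_driven_loop(code: str, threshold: float = 1.0, max_iters: int = 5) -> str:
    """Iterate patch -> eval -> retry until metrics pass or the budget runs out."""
    for _ in range(max_iters):
        result = run_evals(code)
        if result["score"] >= threshold:
            break  # all metrics pass; stop iterating
        # Metric failures, scores, and reasoning are fed back into the agent's context
        code = propose_patch(code, result["reason"])
    return code
```

The key idea is that evaluation feedback (scores and reasoning) stays in the agent's context between iterations, which is what lets coding agents fix regressions autonomously.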


Terminal Trace Inspection (TUI)

Inspect traces locally without leaving the terminal.

  • Navigate spans, inputs, outputs, and metric failures
  • Debug agent execution step-by-step
  • Understand why an eval failed, not just that it failed
  • Lightweight local-first developer workflow

The idea is to develop rapidly by staying local on your machine: no more switching to external UIs unless collaboration is required.


10+ Native Integrations

One-line integrations for modern AI frameworks and providers.

Supports:

  • LangChain
  • OpenAI
  • Anthropic
  • LiteLLM
  • PydanticAI
  • CrewAI
  • LlamaIndex
  • DSPy
  • Google ADK
  • Custom agents

Here's a LangChain example:

import pytest
from langchain.agents import create_agent
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def multiply(a: int, b: int) -> int:
    return a * b

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_agent(golden: Golden):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

That's it! Native CI/CD integration for LangChain agents via Pytest, enabled by DeepEval.

Quickstart for integrations: https://deepeval.com/integrations
