github confident-ai/deepeval v4.0.2
🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!

6 hours ago

DeepEval 4.0 introduces an agent-native evaluation workflow designed for coding agents, rapid debugging, and production AI systems.

If you're vibe coding agents with something like Claude Code, this release is for you.

Eval Harness for Coding Agents

Coding agents can now run eval-driven iterations directly in context.

  • Agents see metric failures, scores, and reasoning inline
  • Supports iterative patch → eval → retry workflows
  • Built for Cursor, Claude Code, Codex, and agentic development loops
  • Enables autonomous regression fixing with evaluation feedback

Quickstart: https://deepeval.com/docs/vibe-coder-quickstart
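The patch → eval → retry loop the harness enables can be sketched in plain Python. Note that `run_evals` and `propose_patch` below are hypothetical stand-ins for illustration, not DeepEval APIs — see the quickstart above for the real workflow:

```python
# Hedged sketch of an eval-driven iteration loop for a coding agent.
# `run_evals` and `propose_patch` are illustrative stand-ins, NOT DeepEval APIs.

def run_evals(code: str) -> dict:
    """Stand-in evaluator: returns a score plus the reasoning an agent would see inline."""
    passed = "fixed" in code
    return {
        "score": 1.0 if passed else 0.0,
        "reason": "fix marker present" if passed else "fix marker missing",
    }

def propose_patch(code: str, feedback: str) -> str:
    """Stand-in for the coding agent: produces a new patch given eval feedback."""
    return code + "  # fixed"

def eval_driven_loop(code: str, threshold: float = 1.0, max_iters: int = 5) -> str:
    """Iterate patch -> eval -> retry until metrics pass or the budget runs out."""
    for _ in range(max_iters):
        result = run_evals(code)
        if result["score"] >= threshold:
            break  # all metrics pass; stop iterating
        # Metric failures, scores, and reasoning are fed back into the agent's context
        code = propose_patch(code, result["reason"])
    return code
```

The key idea is that evaluation feedback (scores and reasoning) stays in the agent's context between iterations, which is what lets coding agents fix regressions autonomously.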


Terminal Trace Inspection (TUI)

Inspect traces locally without leaving the terminal.

  • Navigate spans, inputs, outputs, and metric failures
  • Debug agent execution step-by-step
  • Understand why an eval failed, not just that it failed
  • Lightweight local-first developer workflow

The idea is to develop rapidly by staying local on your machine: no more switching to external UIs unless collaboration is required.


10+ Native Integrations

One-line integrations for modern AI frameworks and providers.

Supports:

  • LangChain
  • OpenAI
  • Anthropic
  • LiteLLM
  • PydanticAI
  • CrewAI
  • LlamaIndex
  • DSPy
  • Google ADK
  • Custom agents

Here's a LangChain example:

import pytest
from langchain.agents import create_agent
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def multiply(a: int, b: int) -> int:
    return a * b

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_agent(golden: Golden):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

That's it! Native CI/CD integration for LangChain agents via Pytest, enabled by DeepEval.

Quickstart for integrations: https://deepeval.com/integrations
