DeepEval 4.0 introduces an agent-native evaluation workflow designed for coding agents, rapid debugging, and production AI systems.
If you're vibe-coding agents in something like Claude Code, this release is for you.
Eval Harness for Coding Agents
Coding agents can now run eval-driven iterations directly in context.
- Agents see metric failures, scores, and reasoning inline
- Supports iterative patch → eval → retry workflows
- Built for Cursor, Claude Code, Codex, and agentic development loops
- Enables autonomous regression fixing with evaluation feedback
Quickstart: https://deepeval.com/docs/vibe-coder-quickstart
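The patch → eval → retry workflow above can be sketched as a plain-Python loop. This is illustrative only: `run_evals` and `apply_patch` are hypothetical stand-ins for the agent's eval and edit steps, not DeepEval APIs.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    metric: str
    score: float
    passed: bool
    reason: str


def run_evals(code: str) -> list[EvalResult]:
    # Hypothetical stand-in: in practice this step would run eval metrics
    # against the agent's output and return scores plus reasoning.
    passed = "fixed" in code
    return [EvalResult(
        metric="task_completion",
        score=1.0 if passed else 0.3,
        passed=passed,
        reason="ok" if passed else "expected tool call missing",
    )]


def apply_patch(code: str, failures: list[EvalResult]) -> str:
    # Hypothetical stand-in: a coding agent would read the failure
    # reasons inline and edit the code accordingly.
    return code + " fixed"


def patch_eval_retry(code: str, max_attempts: int = 3) -> tuple[str, bool]:
    """Iterate patch -> eval -> retry until all metrics pass or we give up."""
    for _ in range(max_attempts):
        results = run_evals(code)
        failures = [r for r in results if not r.passed]
        if not failures:
            return code, True  # all metrics passed
        code = apply_patch(code, failures)
    return code, False  # budget exhausted; surface failures to the user
```

Because the failures, scores, and reasoning come back in context, the agent can decide its next patch from the eval feedback rather than guessing.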
Terminal Trace Inspection (TUI)
Inspect traces locally without leaving the terminal.
- Navigate spans, inputs, outputs, and metric failures
- Debug agent execution step-by-step
- Understand why an eval failed, not just that it failed
- Lightweight local-first developer workflow
The idea is to iterate rapidly while staying local on your machine: no more switching to an external UI unless collaboration requires it.
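As a mental model for what the TUI surfaces, a trace is a tree of spans, each carrying its input, output, and any metric failures. The data structure below is an illustrative sketch, not DeepEval's internal schema:

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    input: str
    output: str
    metric_failures: list[str] = field(default_factory=list)
    children: list["Span"] = field(default_factory=list)


def failing_spans(span: Span) -> list[Span]:
    """Walk the span tree and collect every span with a metric failure,
    which is how you see *why* an eval failed, not just that it failed."""
    found = [span] if span.metric_failures else []
    for child in span.children:
        found.extend(failing_spans(child))
    return found


# Toy trace: the agent called a tool correctly, then the final LLM step
# ignored the tool output, so only that span carries a metric failure.
trace = Span(
    "agent", "What is 8 multiplied by 6?", "42",
    children=[
        Span("tool:multiply", "(8, 6)", "48"),
        Span("llm", "final answer", "42",
             metric_failures=["TaskCompletion: answer contradicts tool output"]),
    ],
)
```

Stepping through spans like these in the terminal is what lets you localize a failure to a single step of the agent's execution.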
10+ Native Integrations
One-line integrations for modern AI frameworks and providers.
Supports:
- LangChain
- OpenAI
- Anthropic
- LiteLLM
- PydanticAI
- CrewAI
- LlamaIndex
- DSPy
- Google ADK
- Custom agents
Here's a LangChain example:
```python
import pytest
from langchain.agents import create_agent

from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric


def multiply(a: int, b: int) -> int:
    return a * b


agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[multiply],
    system_prompt="Be concise.",
)

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])


@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_agent(golden: Golden):
    # The CallbackHandler traces the agent run so DeepEval can score it
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])
```

That's it! Native CI/CD integration for LangChain agents, via Pytest, enabled by DeepEval.
Quickstart for integrations: https://deepeval.com/integrations