Full support for agentic evals :)

If you're building agents, DeepEval can now analyze the trace of your LLM app and score it with the six metrics below.

🎯 1. Task Completion

Evaluates whether an agent actually completes the intended task, not just whether its final output “looks correct.”

Captures:

Goal completion
Intermediate step correctness
Error recovery
Procedural accuracy

Docs: https://deepeval.com/docs/metrics-task-completion

🔧 2. Tool Correctness

Evaluates whether tools were invoked correctly, meaningfully, and in the right order.

Captures:

Correct tool usage
Correct argument formatting
Avoiding hallucinated tools
Using tools only when needed

Docs: https://deepeval.com/docs/metrics-tool-correctness
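
To make the checks listed above concrete, here is a minimal plain-Python sketch. It is not DeepEval's scoring logic: the function `toy_tool_correctness`, the tool names, and the report fields are all hypothetical and exist only for illustration. See the docs link above for the real metric.

```python
def toy_tool_correctness(called, expected, available):
    """Toy illustration only; NOT how DeepEval computes its metric."""
    # Tools the agent invoked that do not exist in its toolset.
    hallucinated = [t for t in called if t not in available]
    # Share of the expected tools that were actually invoked.
    hits = [t for t in expected if t in called]
    coverage = len(hits) / len(expected) if expected else 1.0
    # True when the expected tools appear in `called` in the same relative order.
    remaining = iter(called)
    in_order = all(t in remaining for t in expected)
    return {"coverage": coverage, "in_order": in_order, "hallucinated": hallucinated}


# Hypothetical trace: the agent searched, fetched a page, then summarized.
report = toy_tool_correctness(
    called=["search", "fetch_page", "summarize"],
    expected=["search", "summarize"],
    available={"search", "fetch_page", "summarize"},
)
print(report)
```

An extra but valid call (`fetch_page`) does not count against this toy score; only missing, misordered, or nonexistent tools do.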

🧩 3. Argument Correctness

Evaluates whether the agent’s arguments to tools are valid, structured, and aligned to the task.

Captures:

Correct parameter selection
Type/format adherence
Logical argument formation
Avoiding semantically incorrect inputs

Docs: https://deepeval.com/docs/metrics-argument-correctness
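
As a rough illustration of the parameter-selection and type checks above, the sketch below validates a tool call's arguments against a simple schema. It assumes a hypothetical schema format (parameter name mapped to a Python type) and a hypothetical helper `check_arguments`; neither is part of DeepEval. Semantic correctness, such as whether the value makes sense for the task, is beyond what a static check like this can judge.

```python
def check_arguments(args, schema):
    """Return a list of problems found in `args`; an empty list means none.

    Illustration only: `schema` maps each required parameter name to the
    Python type it should have. This is not DeepEval's implementation.
    """
    problems = []
    for name, expected_type in schema.items():
        if name not in args:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    # Parameters the tool does not declare at all.
    for name in args:
        if name not in schema:
            problems.append(f"unexpected parameter: {name}")
    return problems


# Hypothetical weather tool that expects a city name and a number of days.
weather_schema = {"city": str, "days": int}
issues = check_arguments({"city": "Paris", "days": "3"}, weather_schema)
print(issues)
```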

⚡ 4. Step Efficiency

Measures how efficiently an agent completes a task, rewarding fewer unnecessary steps and penalizing detours.

Captures:

Optimality of step count
Redundant tool calls
Unnecessary loops
Waffling behavior

Docs: https://deepeval.com/docs/metrics-step-efficiency
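
One simple way to picture step-count optimality and redundant calls is the toy calculation below. It assumes you already know the minimal number of steps for the task, which in practice is rarely given. The function `toy_step_efficiency` and the trace format are hypothetical and do not reflect how DeepEval scores a trace.

```python
from collections import Counter


def toy_step_efficiency(steps, minimal_steps):
    """Toy illustration only; NOT DeepEval's scoring.

    `steps` is a list of (tool_name, arguments) tuples taken from a trace.
    """
    # A call repeated with identical arguments counts as redundant.
    counts = Counter(steps)
    redundant_calls = sum(n - 1 for n in counts.values() if n > 1)
    # Ratio of the assumed minimal step count to the actual one, capped at 1.
    efficiency = min(1.0, minimal_steps / len(steps)) if steps else 1.0
    return {"efficiency": efficiency, "redundant_calls": redundant_calls}


# Hypothetical trace in which the same search was issued twice.
trace = [
    ("search", "q=weather paris"),
    ("search", "q=weather paris"),
    ("fetch_page", "url=a"),
    ("summarize", "doc=a"),
]
result = toy_step_efficiency(trace, minimal_steps=3)
print(result)
```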

🪜 5. Plan Adherence

Evaluates how well the agent follows a predefined or self-generated plan.

Captures:

Alignment to planned steps
Deviations and detours
Fidelity to strategy
Execution according to intent

Docs: https://deepeval.com/docs/metrics-plan-adherence

🧭 6. Plan Quality

Evaluates the quality of the plan itself when the agent generates one.

Captures:

Clarity
Completeness
Achievability
Logical ordering of steps

Docs: https://deepeval.com/docs/metrics-plan-quality

🧪 New: Multi-Turn Synthetic Goldens Generation

Synthetic data generation now supports multi-turn goldens in addition to single-turn ones.

You can now generate:

🎭 Multi-turn conversational scenarios
📝 Scenario + Expected Outcome pairs
🔁 Turn-by-turn dialogue structure
💬 Goldens instantly compatible with the Conversation Simulator
🚀 Direct pipeline: Generate → Simulate → Evaluate

Perfect for building large-scale synthetic datasets for support agents, sales agents, research assistants, workflow agents, and any multi-step conversational system.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# Generate multi-turn (conversational) goldens from the listed documents.
conversational_goldens = synthesizer.generate_conversational_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)

Docs here (click on the "multi-turn" tab): https://deepeval.com/docs/synthesizer-generate-from-docs

confident-ai/deepeval v3.7.3 🎉 Metrics for AI agents, multi-turn synthetic data generation, and more! on GitHub
