Full support for agentic evals :)
If you're building agents, DeepEval can now analyze the trace of your LLM app and score it with the six metrics below.
🎯 1. Task Completion
Evaluates whether an agent actually completes the intended task, not just whether its final output "looks correct."
Captures:
- Goal completion
- Intermediate step correctness
- Error recovery
- Procedural accuracy
Docs: https://deepeval.com/docs/metrics-task-completion
🔧 2. Tool Correctness
Evaluates whether tools were invoked correctly, meaningfully, and in the right order.
Captures:
- Correct tool usage
- Correct argument formatting
- Avoiding hallucinated tools
- Using tools only when needed
Docs: https://deepeval.com/docs/metrics-tool-correctness
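To make the checks above concrete, here is a minimal, library-free sketch of what "correct tools, no hallucinated tools, right order" can mean for a list of tool names. This is an illustration only, not DeepEval's scoring logic; `ToolCheck`, `check_tools`, and the tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCheck:
    hallucinated: list[str]  # called tools that are not registered at all
    missing: list[str]       # expected tools that were never called
    order_ok: bool           # expected tools appear in the expected relative order


def check_tools(called: list[str], expected: list[str], registered: set[str]) -> ToolCheck:
    hallucinated = [name for name in called if name not in registered]
    missing = [name for name in expected if name not in called]
    # Order check: `expected` must be a subsequence of `called`.
    # `name in remaining` consumes the iterator up to the first match.
    remaining = iter(called)
    order_ok = all(name in remaining for name in expected)
    return ToolCheck(hallucinated, missing, order_ok)


# Hypothetical trace: one extra (but registered) tool call between the expected ones.
result = check_tools(
    called=["search_flights", "get_weather", "book_flight"],
    expected=["search_flights", "book_flight"],
    registered={"search_flights", "book_flight", "get_weather"},
)
print(result)
```

Whether the extra `get_weather` call should count against the agent ("using tools only when needed") is a policy choice this sketch leaves open.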
🧩 3. Argument Correctness
Evaluates whether the agent's arguments to tools are valid, structured, and aligned to the task.
Captures:
- Correct parameter selection
- Type/format adherence
- Logical argument formation
- Avoiding semantically incorrect inputs
Docs: https://deepeval.com/docs/metrics-argument-correctness
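As a rough illustration of the "type/format adherence" part, the sketch below validates a tool call's arguments against a simple name-to-type schema. It is not how DeepEval evaluates arguments, and it cannot catch semantically wrong inputs (a valid string that names the wrong city, say); `validate_arguments` and the `book_flight` schema are made up for this example.

```python
def validate_arguments(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    problems = []
    for name, expected_type in schema.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    # Arguments the tool does not declare are flagged too.
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument: {name}")
    return problems


# Hypothetical schema for a hypothetical `book_flight` tool.
BOOK_FLIGHT_SCHEMA = {"origin": str, "destination": str, "passengers": int}

problems = validate_arguments(
    {"origin": "SFO", "destination": "JFK", "passengers": "two"},
    BOOK_FLIGHT_SCHEMA,
)
print(problems)
```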
⚡ 4. Step Efficiency
Measures how efficiently an agent completes a task, rewarding fewer unnecessary steps and penalizing detours.
Captures:
- Optimality of step count
- Redundant tool calls
- Unnecessary loops
- Waffling behavior
Docs: https://deepeval.com/docs/metrics-step-efficiency
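One simple way to think about step efficiency is the ratio of the fewest steps the task needs to the steps the agent actually took. The sketch below assumes you already know that minimum, which in practice is the hard part; it is a toy formula, not DeepEval's, and `step_efficiency`, `redundant_calls`, and the trace are hypothetical.

```python
from collections import Counter


def step_efficiency(steps: list[str], minimal_step_count: int) -> float:
    """Ratio of the minimal step count to the actual step count, capped at 1.0."""
    if not steps:
        return 0.0
    return min(1.0, minimal_step_count / len(steps))


def redundant_calls(steps: list[str]) -> dict[str, int]:
    """Map each repeated step to the number of extra times it was executed."""
    counts = Counter(steps)
    return {step: n - 1 for step, n in counts.items() if n > 1}


# Hypothetical trace: the agent searched three times where once would have done.
trace = ["search_docs", "search_docs", "read_doc", "search_docs", "answer"]

score = step_efficiency(trace, minimal_step_count=3)
extra = redundant_calls(trace)
print(score, extra)
```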
5. Plan Adherence
Evaluates how well the agent follows a predefined or self-generated plan.
Captures:
- Alignment to planned steps
- Deviations and detours
- Fidelity to strategy
- Execution according to intent
Docs: https://deepeval.com/docs/metrics-plan-adherence
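For intuition, adherence can be approximated as the share of planned steps that show up, in order, in what the agent actually executed. The sketch below uses a longest-common-subsequence count for that; this is a swapped-in heuristic for illustration, not DeepEval's method, and all names and steps are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (classic dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def plan_adherence(plan: list[str], executed: list[str]) -> float:
    """Fraction of planned steps that were executed in the planned order."""
    if not plan:
        return 1.0
    return lcs_length(plan, executed) / len(plan)


# Hypothetical plan and trace: the agent swapped one planned step for a detour.
plan = ["look_up_order", "check_refund_policy", "issue_refund"]
executed = ["look_up_order", "search_faq", "issue_refund"]

adherence = plan_adherence(plan, executed)
deviations = [step for step in executed if step not in plan]
print(adherence, deviations)
```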
🧠 6. Plan Quality
Evaluates the quality of the plan itself when the agent generates one.
Captures:
- Clarity
- Completeness
- Achievability
- Logical ordering of steps
Docs: https://deepeval.com/docs/metrics-plan-quality
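Of the properties above, completeness and logical ordering are the easiest to picture in code. The sketch below flags plan steps that run before, or without, a step they depend on. It assumes you can write the dependencies down by hand, which a real evaluation usually cannot; it is not DeepEval's approach, and `ordering_violations` and the example plan are hypothetical.

```python
def ordering_violations(
    plan: list[str], depends_on: dict[str, list[str]]
) -> list[tuple[str, str]]:
    """Return (step, prerequisite) pairs where the prerequisite is missing or comes later."""
    position = {step: i for i, step in enumerate(plan)}
    violations = []
    for step, prerequisites in depends_on.items():
        if step not in position:
            continue  # dependency declared for a step the plan does not contain
        for prerequisite in prerequisites:
            if prerequisite not in position or position[prerequisite] > position[step]:
                violations.append((step, prerequisite))
    return violations


# Hypothetical dependencies: booking needs dates and a budget first.
DEPENDS_ON = {"book_hotel": ["pick_dates", "confirm_budget"]}

bad_plan = ["book_hotel", "pick_dates", "confirm_budget"]
good_plan = ["pick_dates", "confirm_budget", "book_hotel"]

print(ordering_violations(bad_plan, DEPENDS_ON))
print(ordering_violations(good_plan, DEPENDS_ON))
```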
🧪 New: Multi-Turn Synthetic Goldens Generation
Synthetic data generation now supports multi-turn goldens instead of just single-turn.
You can now generate:
- Multi-turn conversational scenarios
- Scenario + Expected Outcome pairs
- Turn-by-turn dialogue structure
- Goldens instantly compatible with the Conversation Simulator
- Direct pipeline: Generate → Simulate → Evaluate
Perfect for building large-scale synthetic datasets for support agents, sales agents, research assistants, workflow agents, and any multi-step conversational system.
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
conversational_goldens = synthesizer.generate_conversational_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)

Docs here (click on the "multi-turn" tab): https://deepeval.com/docs/synthesizer-generate-from-docs
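If it helps to picture the pieces listed above (a scenario, an expected outcome, and turn-by-turn dialogue), here is a small stand-in data shape in plain Python. It is only a mental model: `Turn`, `MultiTurnGolden`, and `to_transcript` are hypothetical names, and the objects DeepEval actually returns may have different fields.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str     # "user" or "assistant"
    content: str


@dataclass
class MultiTurnGolden:
    # Stand-in for illustration; not DeepEval's own golden class.
    scenario: str
    expected_outcome: str
    turns: list[Turn] = field(default_factory=list)


def to_transcript(golden: MultiTurnGolden) -> str:
    """Render the turns as a readable 'role: content' transcript."""
    return "\n".join(f"{turn.role}: {turn.content}" for turn in golden.turns)


golden = MultiTurnGolden(
    scenario="A customer asks for a refund on a late delivery.",
    expected_outcome="The agent confirms the order and issues the refund.",
    turns=[
        Turn("user", "My package arrived a week late. Can I get a refund?"),
        Turn("assistant", "Sorry about that. Can you share your order number?"),
    ],
)
print(to_transcript(golden))
```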