github confident-ai/deepeval v3.7.3
πŸŽ‰ Metrics for AI agents, multi-turn synthetic data generation, and more!


Full support for agentic evals :)

If you're building agents, DeepEval can now analyze the execution trace of your LLM app and score it against agentic metrics — not just the final output.

🎯 1. Task Completion

Evaluate whether an agent actually completes the intended task, not just whether its final output β€œlooks correct.”

Captures:

  • Goal completion
  • Intermediate step correctness
  • Error recovery
  • Procedural accuracy

Docs: https://deepeval.com/docs/metrics-task-completion
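Conceptually, trace-based task completion asks: which of the goals the task required did some step in the trace actually achieve? A toy sketch (not deepeval's implementation — the trace fields and outcome labels here are illustrative):

```python
# Toy sketch (not deepeval's implementation): score task completion
# from an execution trace by checking which required outcomes were
# achieved by some step in the trace.

def task_completion_score(trace: list[dict], required_outcomes: set[str]) -> float:
    """Fraction of required outcomes achieved somewhere in the trace."""
    achieved = {
        outcome
        for step in trace
        for outcome in step.get("outcomes", [])
    }
    if not required_outcomes:
        return 1.0
    return len(required_outcomes & achieved) / len(required_outcomes)

trace = [
    {"step": "search_flights", "outcomes": ["flights_found"]},
    {"step": "book_flight", "outcomes": ["booking_confirmed"]},
]
# Two of three required outcomes achieved -> 2/3
score = task_completion_score(
    trace, {"flights_found", "booking_confirmed", "receipt_sent"}
)
```

The real metric uses an LLM judge over the full trace, so it also catches intermediate-step errors and recovery, which this sketch does not.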


πŸ”§ 2. Tool Correctness

Evaluates whether tools were invoked correctly, meaningfully, and in the right order.

Captures:

  • Correct tool usage
  • Correct argument formatting
  • Avoiding hallucinated tools
  • Using tools only when needed

Docs: https://deepeval.com/docs/metrics-tool-correctness
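At its core, tool correctness compares the tools the agent actually invoked against the tools it should have invoked, position by position. A toy sketch (not deepeval's implementation):

```python
# Toy sketch (not deepeval's implementation): compare the agent's
# actual tool calls against the expected calls, in order. A hallucinated
# or out-of-order tool lowers the score.

def tool_correctness_score(expected: list[str], actual: list[str]) -> float:
    """Fraction of positions where the actual tool matches the expected one."""
    if not expected and not actual:
        return 1.0
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / max(len(expected), len(actual))

expected = ["search_docs", "summarize"]
perfect = tool_correctness_score(expected, ["search_docs", "summarize"])      # 1.0
hallucinated = tool_correctness_score(expected, ["search_docs", "delete_db"])  # 0.5
```

deepeval's metric also checks argument formatting and whether a tool call was needed at all, which a name-level comparison like this cannot see.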


🧩 3. Argument Correctness

Evaluates whether the agent’s arguments to tools are valid, structured, and aligned to the task.

Captures:

  • Correct parameter selection
  • Type/format adherence
  • Logical argument formation
  • Avoiding semantically incorrect inputs

Docs: https://deepeval.com/docs/metrics-argument-correctness
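The structural half of this check — right parameters, right types — can be sketched as simple schema validation (a toy sketch, not deepeval's implementation; the schema shape is an assumption for illustration):

```python
# Toy sketch (not deepeval's implementation): validate tool arguments
# against a schema of required parameter names and types.

def argument_correctness(args: dict, schema: dict[str, type]) -> bool:
    """True if every required parameter is present with the right type
    and no unexpected parameters were passed."""
    if set(args) != set(schema):
        return False
    return all(isinstance(args[name], t) for name, t in schema.items())

schema = {"city": str, "nights": int}
ok = argument_correctness({"city": "Paris", "nights": 3}, schema)          # True
bad_type = argument_correctness({"city": "Paris", "nights": "3"}, schema)  # False
```

The semantic half — whether `"Paris"` is actually the city the user asked about — is what the LLM-judged metric adds on top of checks like this.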


⚑ 4. Step Efficiency

Measures how efficiently an agent completes a task β€” rewarding fewer unnecessary steps and penalizing detours.

Captures:

  • Optimality of step count
  • Redundant tool calls
  • Unnecessary loops
  • Waffling behavior

Docs: https://deepeval.com/docs/metrics-step-efficiency
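One way to picture the scoring: penalize traces that are longer than a reference step count and that repeat the same step. A toy sketch (not deepeval's implementation — the formula is illustrative):

```python
# Toy sketch (not deepeval's implementation): reward traces close to an
# optimal step count and penalize redundant (repeated) steps.

def step_efficiency_score(steps: list[str], optimal_count: int) -> float:
    if not steps:
        return 0.0
    redundancy_penalty = len(set(steps)) / len(steps)   # 1.0 when no repeats
    length_score = min(1.0, optimal_count / len(steps))  # 1.0 at/under optimal
    return length_score * redundancy_penalty

tight = step_efficiency_score(["plan", "search", "answer"], optimal_count=3)  # 1.0
# Two redundant searches: 3 unique / 5 total * 3/5 length = 0.36
loopy = step_efficiency_score(
    ["plan", "search", "search", "search", "answer"], optimal_count=3
)
```

The real metric judges whether a detour was *necessary*, not just whether it happened — a repeated tool call after an error can be legitimate recovery.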


πŸͺœ 5. Plan Adherence

Evaluates how well the agent follows a predefined or self-generated plan.

Captures:

  • Alignment to planned steps
  • Deviations and detours
  • Fidelity to strategy
  • Execution according to intent

Docs: https://deepeval.com/docs/metrics-plan-adherence
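A minimal way to frame adherence: did the planned steps appear in the executed trace, in order, even if extra steps were interleaved? A toy sketch (not deepeval's implementation):

```python
# Toy sketch (not deepeval's implementation): fraction of planned steps
# found in the executed trace in order (plan as a subsequence of the
# trace, so detours between planned steps are tolerated).

def plan_adherence_score(plan: list[str], executed: list[str]) -> float:
    if not plan:
        return 1.0
    followed, it = 0, iter(executed)
    for step in plan:
        if step in it:  # `in` on an iterator advances it, preserving order
            followed += 1
    return followed / len(plan)

plan = ["search", "compare", "book"]
# A detour ("lookup_weather") between planned steps still counts as adherence
full = plan_adherence_score(plan, ["search", "lookup_weather", "compare", "book"])  # 1.0
partial = plan_adherence_score(plan, ["search", "book"])  # "compare" skipped -> 1/3
```

Note `partial` is 1/3, not 2/3: the greedy scan consumes `"book"` while looking for `"compare"`, treating the skip as a deviation from the plan's order.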


🧭 6. Plan Quality

Evaluates the quality of the plan itself when the agent generates one.

Captures:

  • Clarity
  • Completeness
  • Achievability
  • Logical ordering of steps

Docs: https://deepeval.com/docs/metrics-plan-quality
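Judging clarity and achievability takes an LLM judge, but the rubric-style aggregation can be sketched with cheap structural stand-ins (a toy sketch, not deepeval's implementation — the checks are placeholders for LLM-judged criteria):

```python
# Toy sketch (not deepeval's implementation): aggregate a rubric of
# binary checks into a 0-1 plan-quality score. The real criteria
# (clarity, completeness, achievability) would be LLM-judged.

def plan_quality_score(plan: list[str]) -> float:
    checks = [
        len(plan) > 0,                       # completeness: plan exists
        all(step.strip() for step in plan),  # clarity: no empty steps
        len(set(plan)) == len(plan),         # ordering: no duplicate steps
        len(plan) <= 10,                     # achievability: bounded length
    ]
    return sum(checks) / len(checks)

good = plan_quality_score(["gather requirements", "draft outline", "write report"])  # 1.0
padded = plan_quality_score(["draft", "draft", ""])  # fails two checks -> 0.5
```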


πŸ§ͺ New: Multi-Turn Synthetic Goldens Generation

Synthetic data generation now supports multi-turn goldens in addition to single-turn ones.

You can now generate:

  • 🎭 Multi-turn conversational scenarios
  • πŸ“ Scenario + Expected Outcome pairs
  • πŸ” Turn-by-turn dialogue structure
  • πŸ’¬ Goldens instantly compatible with the Conversation Simulator
  • πŸš€ Direct pipeline: Generate β†’ Simulate β†’ Evaluate

Perfect for building large-scale synthetic datasets for support agents, sales agents, research assistants, workflow agents, and any multi-step conversational system.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
# Generates multi-turn conversational goldens from your documents
conversational_goldens = synthesizer.generate_conversational_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)

Docs here (click on the "multi-turn" tab): https://deepeval.com/docs/synthesizer-generate-from-docs
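The scenario / expected-outcome / turns structure listed above can be pictured as a plain dataclass. This is an illustrative shape only — the field names are assumptions, not deepeval's exact golden schema:

```python
from dataclasses import dataclass, field

# Illustrative shape of a multi-turn golden (field names are assumptions,
# not deepeval's exact schema).

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class MultiTurnGolden:
    scenario: str          # situation the simulated user is in
    expected_outcome: str  # what a successful conversation achieves
    turns: list[Turn] = field(default_factory=list)

golden = MultiTurnGolden(
    scenario="Customer wants a refund for a delayed order",
    expected_outcome="Agent issues the refund and confirms by email",
    turns=[Turn("user", "My order is two weeks late. I want a refund.")],
)
```

A golden in this shape is what the Generate step hands to the Conversation Simulator, which fills in the remaining turns before evaluation.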
