DeepEval v3.0: Evaluate Any LLM Workflow, Anywhere
We're excited to introduce DeepEval v3.0, a major milestone that transforms how you evaluate LLM applications, from complex multi-step agents to simple prompt chains. This release brings component-level granularity, production-ready observability, and simulation tools to empower devs building modern AI systems.
Component-Level Evaluation for Agentic Workflows
You can now apply DeepEval metrics to any step of your LLM workflow (tools, memories, retrievers, generators) and monitor them in both development and production.
- Evaluate individual function calls, not just final outputs
- Works with any framework or custom agent logic
- Real-time evaluation in production using `observe()` (see the sketch after this list)
- Track sub-component performance over time
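Here's a rough sketch of what a component-level setup could look like. The `observe()` decorator is the one referenced above; the `metrics` argument, the `update_current_span` helper, and the import paths are assumptions drawn from the docs, so double-check them against your installed version.

```python
from deepeval.tracing import observe, update_current_span  # import paths assumed
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def my_llm(query: str) -> str:
    # Stand-in for your real model or agent call.
    return f"Answer to: {query}"

# Decorate any sub-component (tool, retriever, generator) to trace and score it.
# Attaching metrics via observe(metrics=...) is an assumption; check the docs.
@observe(metrics=[AnswerRelevancyMetric()])
def generate(query: str) -> str:
    response = my_llm(query)
    # Report what this component received and produced so the metric can score it.
    update_current_span(test_case=LLMTestCase(input=query, actual_output=response))
    return response

generate("How do I evaluate a retriever on its own?")
```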
Learn more →
Conversation Simulation
Automatically simulate realistic multi-turn conversations to test your chatbots and agents (see the sketch below the list).
- Define model goals and user behavior
- Generate labeled conversations at scale
- Use DeepEval metrics to assess response quality
- Customize turn count, persona types, and more
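The snippet below is a shape-of-the-API sketch only: the `ConversationSimulator` class and the parameter names (user intentions, profile items, turn count, the synchronous callback) are assumptions based on the feature list above, so treat it as illustrative rather than exact.

```python
from deepeval.conversation_simulator import ConversationSimulator  # import path assumed

def chatbot_callback(user_message: str) -> str:
    # Plays the role of your application: given the simulated user's message,
    # return your chatbot's reply. A stand-in implementation is shown here.
    return f"(your app's reply to: {user_message})"

# Constructor and simulate() arguments are assumptions; consult the docs for
# the exact interface and for async callback support.
simulator = ConversationSimulator(
    user_intentions={"requesting a refund": 2},   # goals the simulated user pursues
    user_profile_items=["name", "order number"],  # persona details to weave into turns
)
conversational_test_cases = simulator.simulate(
    model_callback=chatbot_callback,
    max_turns=6,  # customize turn count
)
# The resulting conversations can then be scored with DeepEval's conversational metrics.
```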
Generate Goldens from Goldens
Bootstrapping eval datasets just got easier. You can now multiply your test cases using LLM-generated variants of existing goldens (see the sketch below the list).
- Transform goldens into many meaningful test cases
- Preserve structure while diversifying content
- Control tone, complexity, length, and more
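A minimal sketch of the expansion step, assuming the `Synthesizer` exposes a `generate_goldens_from_goldens` method named after this feature; argument names and any tone, complexity, or length knobs may differ, so check the guide.

```python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden  # import path assumed

# A couple of hand-written goldens to use as seeds.
seed_goldens = [
    Golden(input="How do I reset my password?"),
    Golden(input="Which payment methods do you support?"),
]

# Method name mirrors the feature above but is an assumption; the Synthesizer
# may also expose options for controlling tone, complexity, and length.
synthesizer = Synthesizer()
expanded_goldens = synthesizer.generate_goldens_from_goldens(goldens=seed_goldens)
```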
Read the guide →
Red Teaming Moved to DeepTeam
All red teaming functionality now lives in a dedicated project of its own: DeepTeam, which is built for LLM security, covering adversarial testing, attack generation, and vulnerability discovery.
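If you relied on DeepEval's red teaming features, they now install separately; assuming the package is published under the project's name:

pip install deepteam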
Install or Upgrade
pip install deepeval --upgrade
Why v3.0 Matters
DeepEval v3.0 is more than an evaluation framework: it's a foundation for LLM observability. Whether you're debugging agents, simulating conversations, or continuously monitoring production performance, DeepEval now meets you wherever your LLM logic runs.
Ready to explore?
Full docs at deepeval.com →