16.0.0 (2026-05-21)
⚠ BREAKING CHANGES
- Sandboxing and Code Evaluators (#13290)
Features
Phoenix now lets you compose evaluation strategies in code.
Most eval tooling hands you a fixed menu of judge templates. Real evaluation is rarely that tidy.
Code Evaluators let you define evaluation criteria in code. You write a Python or TypeScript evaluate() function in the Phoenix UI (no SDK, no local runtime, no deploy step), and Phoenix runs it server-side, recording labels and scores as annotations on every experiment run.
Because it's just code, you control the whole strategy:
• Composite scoring: blend sub-scores (LLM judgment + deterministic rules) into one weighted metric
• Embedding-based evaluation: cosine similarity over embeddings instead of brittle string matching
• LLM juries: poll multiple models and combine verdicts into a weighted consensus
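As a rough illustration of how these three strategies can combine, here is a minimal Python sketch. It is not the documented Phoenix evaluator interface: the `evaluate()` parameters, the returned `score`/`label` dictionary, the weight tables, and the helper names are all assumptions made for this example. Embeddings and jury verdicts are passed in as plain values, standing in for real model calls.

```python
import math

# Hypothetical weights, chosen only for illustration.
SUB_SCORE_WEIGHTS = {"rule": 1.0, "similarity": 2.0, "jury": 2.0}
JURY_WEIGHTS = {"model_a": 2.0, "model_b": 1.0}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_average(scores, weights):
    """Weighted mean of `scores`, using the weight registered under each score's key."""
    total = sum(weights[name] for name in scores)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[name] * weights[name] for name in scores) / total


def evaluate(output, output_embedding, reference_embedding, jury_verdicts):
    """Assumed signature: blend a rule, an embedding score, and a jury verdict."""
    # Deterministic rule: the output must not be empty.
    rule = 1.0 if output.strip() else 0.0
    # Embedding-based score, clamped to [0, 1] so it blends with the others.
    similarity = max(0.0, cosine_similarity(output_embedding, reference_embedding))
    # LLM jury: each verdict is a score in [0, 1], combined by per-model weight.
    jury = weighted_average(jury_verdicts, JURY_WEIGHTS)
    score = weighted_average(
        {"rule": rule, "similarity": similarity, "jury": jury},
        SUB_SCORE_WEIGHTS,
    )
    return {"score": score, "label": "pass" if score >= 0.7 else "fail"}
```

The same `weighted_average` helper serves both the jury consensus and the final composite, so changing the strategy is mostly a matter of editing the weight tables. The 0.7 pass threshold is likewise an arbitrary choice for the sketch.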
Sandboxed Code Evaluators also open the door to agent-as-a-judge evaluation. We're excited about where this is heading.
- agents: Enable provider native web search / fetch when available (#13333) (41eb4fc)
- Sandboxing and Code Evaluators (#13290) (e294d93)