Arize-ai/phoenix arize-phoenix-v16.0.0 on GitHub

16.0.0 (2026-05-21)

⚠ BREAKING CHANGES

Sandboxing and Code Evaluators (#13290)

Features

Phoenix now lets you compose evaluation strategies in code.

Most eval tooling hands you a fixed menu of judge templates. Real evaluation is rarely that tidy.

Code Evaluators let you build evaluation criteria the way you want. You write a Python or TypeScript evaluate() function in the Phoenix UI (no SDK, no local runtime, no deploy step), and Phoenix runs it server-side, recording labels and scores as annotations on every experiment run.
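As a rough illustration, an evaluate() function might look like the Python sketch below. These notes do not specify which arguments Phoenix passes in or which return shape it expects. The parameter names `output` and `expected` and the returned `label`/`score` dict are therefore assumptions, not Phoenix's documented contract.

```python
# Hypothetical evaluate() sketch. The parameter names (output, expected) and the
# returned {"label", "score"} dict are assumptions, not Phoenix's documented contract.


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't fail."""
    return " ".join(text.lower().split())


def evaluate(output: str, expected: str) -> dict:
    """Score a run by normalized exact match against the expected answer."""
    matched = _normalize(output) == _normalize(expected)
    return {
        "label": "correct" if matched else "incorrect",
        "score": 1.0 if matched else 0.0,
    }


if __name__ == "__main__":
    print(evaluate("Paris ", "paris"))  # {'label': 'correct', 'score': 1.0}
```

Returning both a categorical label and a numeric score mirrors the release note's description of recording labels and scores as annotations.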

Because it's just code, you control the whole strategy:

• Composite scoring: blend sub-scores (LLM judgment + deterministic rules) into one weighted metric
• Embedding-based evaluation: cosine similarity over embeddings instead of brittle string matching
• LLM juries: poll multiple models and combine verdicts into a weighted consensus
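Each of the three strategies above reduces to a small, ordinary function. The Python sketch below uses hypothetical helpers (`cosine_similarity`, `composite_score`, and `jury_verdict` are illustrative names, not Phoenix APIs) that operate on precomputed inputs. In practice the embedding vectors, sub-scores, and per-model verdicts would come from whatever embedding model or LLM judges you call. Those calls are omitted here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def composite_score(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of named sub-scores (e.g. an LLM judgment plus a rule check)."""
    total_weight = sum(weights[name] for name in sub_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(sub_scores[name] * weights[name] for name in sub_scores) / total_weight


def jury_verdict(votes: dict[str, str], model_weights: dict[str, float]) -> tuple[str, float]:
    """Combine per-model labels into a weighted consensus.

    Returns the winning label and the share of total weight that voted for it.
    Models missing from model_weights default to weight 1.0.
    """
    tally: dict[str, float] = {}
    for model, label in votes.items():
        tally[label] = tally.get(label, 0.0) + model_weights.get(model, 1.0)
    winner = max(tally, key=tally.get)
    return winner, tally[winner] / sum(tally.values())


if __name__ == "__main__":
    sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
    print(round(sim, 4))  # 0.7071

    score = composite_score(
        {"llm_judge": 0.8, "has_citation": 1.0},
        {"llm_judge": 0.7, "has_citation": 0.3},
    )
    print(round(score, 2))  # 0.86

    label, agreement = jury_verdict(
        {"model_a": "correct", "model_b": "correct", "model_c": "incorrect"},
        {"model_a": 1.0, "model_b": 0.5, "model_c": 2.0},
    )
    print(label, round(agreement, 3))
```

Weighting the jury lets one trusted model outvote two weaker ones: in the example, `model_c` (weight 2.0) beats the combined 1.5 of the other two. Returning the agreement share alongside the label gives a numeric score you can record next to the verdict.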

Sandboxed code evaluators also open the door to agents as judges. We're excited about where this is heading.

agents: Enable provider native web search / fetch when available (#13333) (41eb4fc)
Sandboxing and Code Evaluators (#13290) (e294d93)

Bug Fixes

agents: Prevent broken tool groups (#13387) (78d1e96)
