What's new in 2.10.0 (2026-06-05)

These are the changes in inference v2.10.0.

New features

feat(logging): enhance logging system with JSON format and stdout redirect by @m199369309 in #4947
feat(auth): add OIDC/Keycloak SSO authentication support by @m199369309 in #4948
feat(audit): add comprehensive audit logging system by @m199369309 in #4951
feat(security): add IP/Key ban and rate limiting for brute-force protection by @m199369309 in #4949
feat(apikey): add description field and ban status display by @m199369309 in #4952
feat(monitor): add security audit panels and filebeat configurations by @m199369309 in #4953
feat(ui): menu reorganization, fetchWrapper auth, and i18n updates by @m199369309 in #4954
feat(monitor): add per-model GPU memory usage metrics by @m199369309 in #4965
feat(monitor): update Grafana dashboards with GPU memory panels by @m199369309 in #4969
feat: persist launch model configuration history server-side by @m199369309 in #4972
FEAT: [UI] update sidebar, login logo and favicon by @yiboyasss in #4978
feat: new ui (register json view, formInstance, launch model list, cache/env …) by @maoyuehui in #4966
feat(logging): add three-level download progress logging by @m199369309 in #4989
feat(ui): allow editing API key name and description in edit dialog by @m199369309 in #4991

Bug fixes

fix(vllm): set quantization="fp8" when model_format is fp8 by @m199369309 in #4959
fix(auth): return specific error messages for expired/disabled API keys by @m199369309 in #4963
fix(monitor): periodic refresh for security gauges and ban remaining API by @m199369309 in #4964
BUG: fix jina-embeddings-v2-base-zh deployment dependencies by @m199369309 in #4970
fix(monitor): capture vLLM/SGLang GPU workers via deferred PID tattoo by @m199369309 in #4977
fix(vllm): remove best_of for v0.21.0 by @llyycchhee in #4979
bug: Adapt vLLM LoRA request path parameter by @amumu96 in #4980
fix(logging): strip all CSI escapes and route flush() through sampling by @m199369309 in #4983
fix(vllm): read json_schema from schema_ so guided decoding applies by @m199369309 in #4985
bug: fix llama.cpp streaming tool call edge cases by @qinxuye in #4988
fix: Fix GPU info probe on GB10 / DGX Spark (NVML v2 memory-info fallback) by @tbraun96 in #4990
fix(worker): limit concurrent model launches with semaphore to prevent heartbeat timeouts by @m199369309 in #4992
fix(venv): allow user override by @llyycchhee in #4993
fix(venv): evaluate CUDA version markers dynamically by @llyycchhee in #4958
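
The worker fix in #4992 above limits concurrent model launches with a semaphore. The general technique can be sketched as follows. This is a minimal illustration, not the project's actual implementation: `launch_model`, `launch_all`, and `MAX_CONCURRENT_LAUNCHES` are hypothetical names, and the sleep stands in for a slow model load.

```python
import asyncio

# Hypothetical limit; the value used by the actual worker is not stated in these notes.
MAX_CONCURRENT_LAUNCHES = 2


async def launch_model(name, sem, state):
    """Pretend to launch a model while holding one semaphore slot."""
    async with sem:  # waits here when all slots are taken
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for the slow load step
        state["running"] -= 1
    return name


async def launch_all(names, limit=MAX_CONCURRENT_LAUNCHES):
    """Launch all models, never running more than `limit` at once."""
    sem = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    launched = await asyncio.gather(
        *(launch_model(n, sem, state) for n in names)
    )
    return list(launched), state["peak"]


launched, peak = asyncio.run(
    launch_all(["model-a", "model-b", "model-c", "model-d", "model-e"])
)
```

Capping the number of simultaneous launches bounds the load a burst of requests can place on the worker, which is presumably what keeps heartbeats from timing out.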

chattts: set weights_only=True for torch.load speaker embedding by @tonghuaroot in #4956

Full Changelog: v2.9.0...v2.10.0