github xorbitsai/inference v2.10.0


What's new in 2.10.0 (2026-06-05)

These are the changes in inference v2.10.0.

New features

  • feat(logging): enhance logging system with JSON format and stdout redirect by @m199369309 in #4947
  • feat(auth): add OIDC/Keycloak SSO authentication support by @m199369309 in #4948
  • feat(audit): add comprehensive audit logging system by @m199369309 in #4951
  • feat(security): add IP/Key ban and rate limiting for brute-force protection by @m199369309 in #4949
  • feat(apikey): add description field and ban status display by @m199369309 in #4952
  • feat(monitor): add security audit panels and filebeat configurations by @m199369309 in #4953
  • feat(ui): menu reorganization, fetchWrapper auth, and i18n updates by @m199369309 in #4954
  • feat(monitor): add per-model GPU memory usage metrics by @m199369309 in #4965
  • feat(monitor): update Grafana dashboards with GPU memory panels by @m199369309 in #4969
  • feat: persist launch model configuration history server-side by @m199369309 in #4972
  • FEAT: [UI] update sidebar, login logo and favicon by @yiboyasss in #4978
  • feat: new ui (register json view, formInstance, launch model list, cache/env …) by @maoyuehui in #4966
  • feat(logging): add three-level download progress logging by @m199369309 in #4989
  • feat(ui): allow editing API key name and description in edit dialog by @m199369309 in #4991
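
As a rough illustration of what JSON-formatted logging to a stream (see #4947) can look like, here is a minimal sketch that uses only the Python standard library. It is not the project's implementation: `JsonFormatter`, `build_json_logger`, and the field names in the payload are hypothetical and chosen only for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach the traceback text only when the record carries an exception.
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def build_json_logger(name: str, stream) -> logging.Logger:
    """Hypothetical helper: returns a logger writing JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger
    return logger
```

Passing `sys.stdout` as `stream` would send the JSON lines to standard output, where a log shipper can collect them.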

Enhancements

Bug fixes

  • fix(vllm): set quantization="fp8" when model_format is fp8 by @m199369309 in #4959
  • fix(auth): return specific error messages for expired/disabled API keys by @m199369309 in #4963
  • fix(monitor): periodic refresh for security gauges and ban remaining API by @m199369309 in #4964
  • BUG: fix jina-embeddings-v2-base-zh deployment dependencies by @m199369309 in #4970
  • fix(monitor): capture vLLM/SGLang GPU workers via deferred PID tattoo by @m199369309 in #4977
  • fix(vllm): remove best_of for v0.21.0 by @llyycchhee in #4979
  • bug: Adapt vLLM LoRA request path parameter by @amumu96 in #4980
  • fix(logging): strip all CSI escapes and route flush() through sampling by @m199369309 in #4983
  • fix(vllm): read json_schema from schema_ so guided decoding applies by @m199369309 in #4985
  • bug: fix llama.cpp streaming tool call edge cases by @qinxuye in #4988
  • fix: Fix GPU info probe on GB10 / DGX Spark (NVML v2 memory-info fallback) by @tbraun96 in #4990
  • fix(worker): limit concurrent model launches with semaphore to prevent heartbeat timeouts by @m199369309 in #4992
  • fix(venv): allow user override by @llyycchhee in #4993
  • fix(venv): evaluate CUDA version markers dynamically by @llyycchhee in #4958
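
The general idea behind capping concurrent launches with a semaphore (as in #4992) can be sketched as follows. This is an assumption-laden illustration, not the worker's actual code: `launch_model`, `launch_all`, the `max_concurrent` parameter, and the simulated load step are all hypothetical.

```python
import asyncio


async def launch_model(name: str, sem: asyncio.Semaphore, state: dict) -> str:
    """Hypothetical launch routine; the sleep stands in for a heavy model load."""
    async with sem:  # at most `max_concurrent` launches hold the semaphore
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        try:
            await asyncio.sleep(0.01)  # placeholder for the real load work
        finally:
            state["active"] -= 1
    return name


async def launch_all(names, max_concurrent: int = 2):
    """Launch all models, never running more than `max_concurrent` at once."""
    sem = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(launch_model(n, sem, state) for n in names)
    )
    return results, state["peak"]
```

Because waiting launches yield to the event loop instead of all starting at once, other periodic tasks (such as a heartbeat) keep getting scheduled while models load.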

Documentation

Others

  • chattts: set weights_only=True for torch.load speaker embedding by @tonghuaroot in #4956

New Contributors

Full Changelog: v2.9.0...v2.10.0
