What's New
Two-level model directory scanning (#1)
- Support for organization folder layouts (e.g., mlx-community/llama-3b/)
- Flat and two-level directories can coexist in the same model directory
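
To illustrate how flat and two-level layouts can coexist, here is a minimal sketch of such a scan. It is not the project's actual implementation: the function name `discover_models` and the use of `config.json` as the marker for a model folder are assumptions made for this example.

```python
from pathlib import Path


def discover_models(root: Path, marker: str = "config.json") -> list[str]:
    """Return model IDs found one or two levels below `root`.

    Assumption: a folder counts as a model if it contains `marker`.
    """
    found: list[str] = []
    for entry in sorted(p for p in root.iterdir() if p.is_dir()):
        if (entry / marker).is_file():
            # Flat layout: <root>/<model>/
            found.append(entry.name)
            continue
        # Otherwise treat the folder as an organization: <root>/<org>/<model>/
        for child in sorted(p for p in entry.iterdir() if p.is_dir()):
            if (child / marker).is_file():
                found.append(f"{entry.name}/{child.name}")
    return found
```

With this sketch, a directory holding both flat-model/ and mlx-community/llama-3b/ would yield the IDs flat-model and mlx-community/llama-3b.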
Streaming tool call parsing (#2)
- Stream tool calls in OpenAI-compatible format
- XML fallback parser for GLM/Qwen/Llama models without native tool call support
- Content buffering prevents duplicate tool call output
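
As a rough picture of what an XML fallback can look like, the sketch below pulls tool calls out of model text and reshapes them into OpenAI-style `tool_calls` entries. The `<tool_call>{...}</tool_call>` wrapper with a JSON body, the `call_<n>` IDs, and the function name `extract_tool_calls` are assumptions for illustration; the parser in this release may accept other formats and works incrementally on a stream rather than on a complete string.

```python
import json
import re

# Assumed wrapper format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Split `text` into plain content and OpenAI-style tool call entries."""
    calls: list[dict] = []

    def _replace(match: re.Match) -> str:
        try:
            payload = json.loads(match.group(1))
            name = payload["name"]
        except (json.JSONDecodeError, KeyError):
            return match.group(0)  # leave malformed blocks in the content
        calls.append({
            "index": len(calls),
            "id": f"call_{len(calls)}",  # hypothetical ID scheme
            "type": "function",
            "function": {
                "name": name,
                # OpenAI-compatible clients expect arguments as a JSON string
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
        return ""  # drop the parsed block so it is not emitted twice

    content = TOOL_CALL_RE.sub(_replace, text).strip()
    return content, calls
```

Removing each parsed block from the content is what keeps the same tool call from appearing both as text and as a structured entry.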
Client disconnect detection (#3)
- Streaming responses now detect client disconnects via ASGI
- Proper cleanup of async generators and pending tasks on disconnect
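
The general ASGI technique can be sketched as follows: race the next chunk against a watcher that waits for an `http.disconnect` message, then cancel and close whatever is still running. This is a minimal sketch, not the code from the PR. The function name `stream_until_disconnect` is hypothetical, and it assumes the caller has already sent `http.response.start`.

```python
import asyncio

_DONE = object()  # sentinel marking generator exhaustion


async def _next_or_done(chunks):
    try:
        return await chunks.__anext__()
    except StopAsyncIteration:
        return _DONE


async def stream_until_disconnect(receive, send, chunks) -> bool:
    """Stream `chunks` (an async generator of bytes) over ASGI.

    Returns True if the stream finished, False if the client disconnected.
    """
    async def watch_disconnect():
        while True:
            message = await receive()
            if message["type"] == "http.disconnect":
                return

    watcher = asyncio.ensure_future(watch_disconnect())
    completed = False
    try:
        while True:
            pending = asyncio.ensure_future(_next_or_done(chunks))
            done, _ = await asyncio.wait(
                {watcher, pending}, return_when=asyncio.FIRST_COMPLETED
            )
            if pending not in done:
                # Client went away: cancel the in-flight chunk and wait for it
                pending.cancel()
                await asyncio.gather(pending, return_exceptions=True)
                break
            chunk = pending.result()
            if chunk is _DONE:
                completed = True
                await send({"type": "http.response.body", "body": b"", "more_body": False})
                break
            await send({"type": "http.response.body", "body": chunk, "more_body": True})
    finally:
        watcher.cancel()
        await asyncio.gather(watcher, return_exceptions=True)
        await chunks.aclose()  # runs the generator's cleanup if still open
    return completed
```

Awaiting the cancelled task before calling `aclose()` matters: closing an async generator that is still executing raises a RuntimeError.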
KV cache headroom & manual model unload (#4)
- 25% KV cache headroom during model loading for better multi-model memory management
- Manual model unload via POST /v1/models/{model_id}/unload and the admin panel
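
The sketch below shows one way to read the 25% figure and how a client might build the unload request. Both helpers are illustrative only. Treating the headroom as 25% on top of the model's weight size is an assumption, since the note does not say what the percentage is measured against. The base URL is also an assumption, as is passing the model ID unencoded in the path.

```python
import urllib.request

HEADROOM = 0.25  # from the release note; what it is a fraction of is assumed here


def required_bytes(model_bytes: int, headroom: float = HEADROOM) -> int:
    """Assumed interpretation: weights plus 25% extra reserved for the KV cache."""
    return int(model_bytes * (1 + headroom))


def build_unload_request(base_url: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the unload endpoint named above."""
    # model_id is inserted as-is; whether IDs containing "/" need encoding is not
    # stated in the release note.
    url = f"{base_url.rstrip('/')}/v1/models/{model_id}/unload"
    return urllib.request.Request(url, method="POST")
```

Under this reading, an 8 GB model would reserve 10 GB at load time. The request can be sent with `urllib.request.urlopen(build_unload_request("http://localhost:8000", "llama-3b"))`, where the host and port are placeholders for wherever the server is listening.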
New Contributors
Thanks to @thornad for all four PRs in this release!