What's Changed
- fix(server): remove duplicate “server” response header by @goniz in #1140
- Fix top_p_sampling crash on 3D logits in speculative decoding by @ivanfioravanti in #1141
- Use local rotating-cache offsets for Gemma assistant MTP masks by @Thump604 in #1139
- Fix stale Qwen MRoPE rope_deltas in decode by @neilmehta24 in #1143
- Fix server chat completions text content array handling by @spicyneuron in #1137
- Add server metrics and improve streamed response handling by @Blaizzy in #1145
- Fix streamed byte-fallback token handling in server batching by @Blaizzy in #1147
- Fix APC disk cache directory recovery by @Blaizzy in #1149
- minicpmo / fastvlm: fix pixel cast on quantized language models by @contrapuntal in #1098
- Fix Qwen3 VL vision config deserialization (patch_size ignored) by @lyonsno in #1157
- Fix max kv size enforcement by @lucasnewman in #1160
- [codex] Enforce server context budget by @eloe in #1153
- Fix LFM2-VL model by @lucasnewman in #1162
- Add ZAYA1-VL model support by @Blaizzy in #1159
- Fix an error when loading the MolmoPoint-8B model by @lucasnewman in #1164
- Fix image feature fp16 overflow in Molmo2 by @lucasnewman in #1165
- Improve Gemma4 MTP server batching by @Blaizzy in #1166
- Add Qwen MTP speculative decoding by @Blaizzy in #1167
- Add MiniCPM-V 4.6 support by @pzc163 in #1058
- Refactor speculative decoding utilities by @Blaizzy in #1169
- Fix crash with FastAPI workers when using Qwen3.5 family models by @lucasnewman in #1168
- Add Gemma 4 DFlash + adaptive block sizing by @Blaizzy in #1176
- fix: replace deprecated use_fast=True with backend=torchvision in load_processor by @Jonathangadeaharder in #1170
- Fix weight remapping for the Qwen3.5/3.6 vision tower by @lucasnewman in #1179
- Fix Nemotron processor metadata loading by @Blaizzy in #1177
- Add Gemma 4 EAGLE3 speculative decoding support by @Blaizzy in #1180
- Split speculative decoding utilities by @Blaizzy in #1182
- Add Laguna support by @Blaizzy in #1183
- Compatibility bridge for non-VL models by @lucasnewman in #1181
- Unify LoRA loading for VLM/LM models by @lucasnewman in #1186
- [mlx_vlm] Expose 'strict' parameter in load() function by @zyguy in #1198
- Add server support for Anthropic-style /v1/messages API by @lucasnewman in #1196
- Support /v1/responses stateful responses API in the server by @lucasnewman in #1199
- Fix /v1/messages tool call image handling by @lucasnewman in #1200
- Fix Qwen MTP verification drift and sampling parity by @Blaizzy in #1188
- Modularize server by @lucasnewman in #1203
- Fix mask creation for batched quantized kv caches by @lucasnewman in #1208
- Track APC cached tokens for generation metrics by @spicyneuron in #1209
- fastvlm: support untied lm_head for the 7B variant by @contrapuntal in #1193
- Add RT-DETRv2 detection model by @leonnoirclerc in #1195
- Handle base64-encoded input audio in chat completions by @lucasnewman in #1211
- Fix inverted --verbose flag on chat / generate / video_generate CLIs by @SuperMarioYL in #1152
- Fix for Gemma4 audio task by @lifeiteng in #980
- Add MiniCPM-V 4.6 language wrapper by @Blaizzy in #1212
- Support Qwen image pixel overrides by @Blaizzy in #1213
- Improve GLM-OCR generation and repetition controls by @Blaizzy in #1214
- Fix LFM2-VL image preprocessing for variants shipping input_data_format=channels_last by @contrapuntal in #1190
- Fix rotation-basis mismatch in TurboQuant L=1 fast quantize by @nnorris7 in #973
- fix: fallback to defaults by @GeneCodeSavvy in #1221
- Refactor MRoPE handling by @Blaizzy in #1135
- Fix server `finish_reason` inconsistencies by @spicyneuron in #1215
- Fix inference cancellation on stream abort by @spicyneuron in #1217
- Add DeepSeek V4 language model by @Blaizzy in #1223
- Add spec-compliant `usage` and `timings` for server endpoints by @spicyneuron in #1216
- Add DeepSeek V4 MTP support by @Blaizzy in #1225
- Freeze MaskedEmbedder.token_ordering so optimizer can't corrupt it by @DirectriX01 in #1194
- Validate MTP drafter compatibility by @Blaizzy in #1227
- Restructure generation into AR and diffusion engines by @Blaizzy in #1229
- Add support for image generation with PrismML Bonsai by @lucasnewman in #1226
- Enforce thinking budget in server batching by @Blaizzy in #1228
- Add LLaDA2.X support by @Blaizzy in #1230
- Add support for FLUX.2 base & klein models by @lucasnewman in #1232
- Improve image generation quality for Bonsai by @lucasnewman in #1234
- Add support for image editing with FLUX.2 models by @lucasnewman in #1236
- Improve masked diffusion text generation controls by @Blaizzy in #1235
- Add HRM-Text model support by @Blaizzy in #1238
- Add LFM2 MoE language model by @Blaizzy in #1237
- Add Step-3.7 Flash support by @ivanfioravanti in #1245
- fix(kernels): match PyTorch bicubic coefficient (-0.75 non-AA / -0.5 AA) — closes #1241 by @beshkenadze in #1243
- Fix(qwen3_omni_moe): enable audio input (mask key + audio placeholder) by @yuhuanowo in #1240
- docs: align Python version metadata by @aqilaziz in #1150
- Fix ByteLevel streaming detokenizer detection for Step 3.7 Flash by @ivanfioravanti in #1246
- turboquant: guard L=1 value kernels behind `not use_rht` (fix masked decode under RHT) by @popfido in #1244
- Fix cache merging for batched quantized caches by @lucasnewman in #1248
- security: require patched Starlette by @Thump604 in #1249
- Don't show tqdm progress bars for prefill by default by @lucasnewman in #1250
- Fix adaptors with the /v1/responses API by @lucasnewman in #1251
- [codex] Fix models endpoint for loaded local models by @Blaizzy in #1253
- perf(turboquant): RHT-correct L=1 value kernels (keep the fast path under RHT) by @popfido in #1252
- Fix trainer adapter save fallback by @Blaizzy in #1255
- Fix Qwen MTP batched target-verify drift by @Blaizzy in #1210
- Add Nemotron Labs Diffusion model by @Blaizzy in #1239
- Update version to v0.6.0 by @Blaizzy in #1257
New Contributors
- @contrapuntal made their first contribution in #1098
- @lyonsno made their first contribution in #1157
- @pzc163 made their first contribution in #1058
- @Jonathangadeaharder made their first contribution in #1170
- @zyguy made their first contribution in #1198
- @leonnoirclerc made their first contribution in #1195
- @SuperMarioYL made their first contribution in #1152
- @lifeiteng made their first contribution in #980
- @nnorris7 made their first contribution in #973
- @GeneCodeSavvy made their first contribution in #1221
- @DirectriX01 made their first contribution in #1194
- @beshkenadze made their first contribution in #1243
- @yuhuanowo made their first contribution in #1240
- @aqilaziz made their first contribution in #1150
- @popfido made their first contribution in #1244
Full Changelog: v0.5.0...v0.6.0