Blaizzy/mlx-vlm v0.6.0 on GitHub

What's Changed

fix(server): remove duplicate “server” response header by @goniz in #1140
Fix top_p_sampling crash on 3D logits in speculative decoding by @ivanfioravanti in #1141
Use local rotating-cache offsets for Gemma assistant MTP masks by @Thump604 in #1139
Fix stale Qwen MRoPE rope_deltas in decode by @neilmehta24 in #1143
Fix server chat completions text content array handling by @spicyneuron in #1137
Add server metrics and improve streamed response handling by @Blaizzy in #1145
Fix streamed byte-fallback token handling in server batching by @Blaizzy in #1147
Fix APC disk cache directory recovery by @Blaizzy in #1149
minicpmo / fastvlm: fix pixel cast on quantized language models by @contrapuntal in #1098
Fix Qwen3 VL vision config deserialization (patch_size ignored) by @lyonsno in #1157
Fix max kv size enforcement by @lucasnewman in #1160
[codex] Enforce server context budget by @eloe in #1153
Fix LFM2-VL model by @lucasnewman in #1162
Add ZAYA1-VL model support by @Blaizzy in #1159
Fix an error when loading the MolmoPoint-8B model by @lucasnewman in #1164
Fix image feature fp16 overflow in Molmo2 by @lucasnewman in #1165
Improve Gemma4 MTP server batching by @Blaizzy in #1166
Add Qwen MTP speculative decoding by @Blaizzy in #1167
Add MiniCPM-V 4.6 support by @pzc163 in #1058
Refactor speculative decoding utilities by @Blaizzy in #1169
Fix crash with FastAPI workers when using Qwen3.5 family models by @lucasnewman in #1168
Add Gemma 4 DFlash + adaptive block sizing by @Blaizzy in #1176
fix: replace deprecated use_fast=True with backend=torchvision in load_processor by @Jonathangadeaharder in #1170
Fix weight remapping for the Qwen3.5/3.6 vision tower by @lucasnewman in #1179
Fix Nemotron processor metadata loading by @Blaizzy in #1177
Add Gemma 4 EAGLE3 speculative decoding support by @Blaizzy in #1180
Split speculative decoding utilities by @Blaizzy in #1182
Add Laguna support by @Blaizzy in #1183
Compatibility bridge for non-VL models by @lucasnewman in #1181
Unify LoRA loading for VLM/LM models by @lucasnewman in #1186
[mlx_vlm] Expose 'strict' parameter in load() function by @zyguy in #1198
Add server support for Anthropic-style /v1/messages API by @lucasnewman in #1196
Support /v1/responses stateful responses API in the server by @lucasnewman in #1199
Fix /v1/messages tool call image handling by @lucasnewman in #1200
Fix Qwen MTP verification drift and sampling parity by @Blaizzy in #1188
Modularize server by @lucasnewman in #1203
Fix mask creation for batched quantized kv caches by @lucasnewman in #1208
Track APC cached tokens for generation metrics by @spicyneuron in #1209
fastvlm: support untied lm_head for the 7B variant by @contrapuntal in #1193
Add RT-DETRv2 detection model by @leonnoirclerc in #1195
Handle base64-encoded input audio in chat completions by @lucasnewman in #1211
Fix inverted --verbose flag on chat / generate / video_generate CLIs by @SuperMarioYL in #1152
Fix for Gemma4 audio task by @lifeiteng in #980
Add MiniCPM-V 4.6 language wrapper by @Blaizzy in #1212
Support Qwen image pixel overrides by @Blaizzy in #1213
Improve GLM-OCR generation and repetition controls by @Blaizzy in #1214
Fix LFM2-VL image preprocessing for variants shipping input_data_format=channels_last by @contrapuntal in #1190
Fix rotation-basis mismatch in TurboQuant L=1 fast quantize by @nnorris7 in #973
fix: fallback to defaults by @GeneCodeSavvy in #1221
Refactor MRoPE handling by @Blaizzy in #1135
Fix server finish_reason inconsistencies by @spicyneuron in #1215
Fix inference cancellation on stream abort by @spicyneuron in #1217
Add DeepSeek V4 language model by @Blaizzy in #1223
Add spec-compliant usage and timings for server endpoints by @spicyneuron in #1216
Add DeepSeek V4 MTP support by @Blaizzy in #1225
Freeze MaskedEmbedder.token_ordering so optimizer can't corrupt it by @DirectriX01 in #1194
Validate MTP drafter compatibility by @Blaizzy in #1227
Restructure generation into AR and diffusion engines by @Blaizzy in #1229
Add support for image generation with PrismML Bonsai by @lucasnewman in #1226
Enforce thinking budget in server batching by @Blaizzy in #1228
Add LLaDA2.X support by @Blaizzy in #1230
Add support for FLUX.2 base & klein models by @lucasnewman in #1232
Improve image generation quality for Bonsai by @lucasnewman in #1234
Add support for image editing with FLUX.2 models by @lucasnewman in #1236
Improve masked diffusion text generation controls by @Blaizzy in #1235
Add HRM-Text model support by @Blaizzy in #1238
Add LFM2 MoE language model by @Blaizzy in #1237
Add Step-3.7 Flash support by @ivanfioravanti in #1245
fix(kernels): match PyTorch bicubic coefficient (-0.75 non-AA / -0.5 AA) — closes #1241 by @beshkenadze in #1243
fix(qwen3_omni_moe): enable audio input (mask key + audio placeholder) by @yuhuanowo in #1240
docs: align Python version metadata by @aqilaziz in #1150
Fix ByteLevel streaming detokenizer detection for Step 3.7 Flash by @ivanfioravanti in #1246
turboquant: guard L=1 value kernels behind not use_rht (fix masked decode under RHT) by @popfido in #1244
Fix cache merging for batched quantized caches by @lucasnewman in #1248
security: require patched Starlette by @Thump604 in #1249
Don't show tqdm progress bars for prefill by default by @lucasnewman in #1250
Fix adapters with the /v1/responses API by @lucasnewman in #1251
[codex] Fix models endpoint for loaded local models by @Blaizzy in #1253
perf(turboquant): RHT-correct L=1 value kernels (keep the fast path under RHT) by @popfido in #1252
Fix trainer adapter save fallback by @Blaizzy in #1255
Fix Qwen MTP batched target-verify drift by @Blaizzy in #1210
Add Nemotron Labs Diffusion model by @Blaizzy in #1239
Update version to v0.6.0 by @Blaizzy in #1257
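
Note on the bicubic coefficient fix (#1243): the two values in that entry, -0.75 without antialiasing and -0.5 with antialiasing, are the free parameter `a` of the standard cubic convolution (Keys) kernel. The sketch below is plain Python written for illustration, not the mlx-vlm kernel code; the function names `cubic_weight` and `tap_weights` are hypothetical. It shows how `a` changes the four interpolation tap weights while their sum stays at 1.

```python
def cubic_weight(x: float, a: float) -> float:
    """Cubic convolution (Keys) kernel with free parameter a."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def tap_weights(t: float, a: float) -> list[float]:
    """Weights for the four neighbouring samples at offsets -1, 0, 1, 2,
    given a fractional position t in [0, 1) past the sample at offset 0."""
    return [cubic_weight(t - offset, a) for offset in (-1, 0, 1, 2)]


if __name__ == "__main__":
    # a = -0.75: non-antialiased coefficient named in the entry above.
    # a = -0.5: antialiased coefficient named in the entry above.
    for a in (-0.75, -0.5):
        w = tap_weights(0.5, a)
        print(f"a={a}: weights={w}, sum={sum(w)}")
```

At the midpoint `t = 0.5`, the outer taps are -0.09375 for `a = -0.75` and -0.0625 for `a = -0.5`, so a kernel using the wrong coefficient produces slightly different pixel values than the PyTorch reference it is meant to match.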

New Contributors

@contrapuntal made their first contribution in #1098
@lyonsno made their first contribution in #1157
@pzc163 made their first contribution in #1058
@Jonathangadeaharder made their first contribution in #1170
@zyguy made their first contribution in #1198
@leonnoirclerc made their first contribution in #1195
@SuperMarioYL made their first contribution in #1152
@lifeiteng made their first contribution in #980
@nnorris7 made their first contribution in #973
@GeneCodeSavvy made their first contribution in #1221
@DirectriX01 made their first contribution in #1194
@beshkenadze made their first contribution in #1243
@yuhuanowo made their first contribution in #1240
@aqilaziz made their first contribution in #1150
@popfido made their first contribution in #1244

Full Changelog: v0.5.0...v0.6.0