Blaizzy/mlx-vlm v0.5.0

What's Changed

  • Fix gemma4 multi-image processing for different-sized images by @Blaizzy in #938
  • Fix race condition in TurboQuant fused fast-quantize kernels by @Chedrian07 in #967
  • Fix Gemma 4 tool parser to accept hyphenated function names by @michaelstingl in #963
  • fix(gemma4): snapshot cache.offset to prevent alias mutation under batched caches by @Thump604 in #966
  • Fix Gemma 4 quantized per-layer projection loading by @spicyneuron in #935
  • fix: Gemma 4 audio — mel preprocessing, weight loading, feature extractor by @stephencox-ict in #931
  • Add continuous batching to server by @Blaizzy in #1027 (client sketch after this list)
  • fix: use alpha/rank scaling in LoRaLayer (standard LoRA convention) by @kikoncuo in #846 (scaling sketch after this list)
  • Fix duplicate docstring entries and minor issues in utils.py by @sjhddh in #925
  • Add KV cache quantization for continuous batching by @Blaizzy in #1030
  • Strip tool-call markup from streamed delta.content by @Blaizzy in #1037
  • Add DFlash speculative decoding (single + batch + server) by @Blaizzy in #1029
  • Fix stale position IDs and gdn_sink compatibility for Qwen3.5/3.6 MoE… by @lele872 in #1040
  • Distributed inference for Qwen3, Kimi K2.5 and K2.6 by @pcuenca in #689
  • Add Distributed Inference section by @pcuenca in #1041
  • Resolve no images crash for qwen3_vl and qwen3_vl_moe generate call by @urimem in #1013
  • Add Youtu-VL by @MollySophia in #1018
  • Fix preprocessing for image input for trainer by @Goekdeniz-Guelmez in #826
  • add grounded_reasoning: Falcon Perception + Gemma4 agentic demo by @YasserdahouML in #926
  • Add Gemma 4 video support (multi-video capable) by @Blaizzy in #1042
  • Qwen2 / 2.5 / 3 / 3.5 VL: torch-free video processors + chunked-prefill rope fix by @Blaizzy in #1048
  • Close the batch_generate / server decode gap + VLM fixes by @Blaizzy in #1055
  • Thread-local generation stream (port mlx-lm#1090) by @Blaizzy in #1050
  • Fix DFlash speculative decoding: GPU hang, performance, and upstream alignment by @Blaizzy in #1053
  • hunyuan_vl / gemma3n: drop dead assignments in cache-offset extraction by @Blaizzy in #1056
  • Fix Gemma 4 LoRA training: vision backward NaN + audio_tower freeze leak by @john-rocky in #1052
  • server: added 'server' header in responses by @goniz in #1082
  • Add Nemotron 3 Nano Omni model by @lucasnewman in #1087
  • Nemotron H Nano Omni: rename SoundConfig and add processor by @Blaizzy in #1088
  • mistral3: skip lm_head quantization and add multi_modal_projector to skip list by @Blaizzy in #1089
  • Add server json_schema response_format support by @avbiswas in #1047 (payload example after this list)
  • Fix Kimi VL concurrent Metal crash and mixed-batch text degradation by @Blaizzy in #1039
  • Fix per-sequence MRoPE alignment in mixed VL+text batches by @neilmehta24 in #1095
  • granite4_vision: support standard granite MLP backbone for 4.1-4b by @EliSchwartz in #1104
  • Add Gemma 4 MTP speculative-decoding drafter by @Blaizzy in #1112
  • Add APC prompt caching with disk persistence by @Blaizzy in #1103
  • Add APC prompt caching with warm-disk persistence for hybrid models by @Blaizzy in #1114
  • Fix token queue timeout during long prefills by @eloe in #1111
  • Server: add Gemma 4 MTP drafter support by @Blaizzy in #1115
  • Fix TurboQuant batch cache offset merging by @eloe in #1110
  • Fix Qwen3.5 quantization config keys by @Blaizzy in #1119
  • server: added loaded model's context size and tool call parser to /health endpoint by @goniz in #1092 (/health sketch after this list)
  • fix: Batch generation breaks top-p sampling by @spicyneuron in #1094
  • feat: Add SAM 3D Body — monocular 3D body mesh on Apple Silicon by @shihwesley in #922
  • Add --max-tokens to server by @spicyneuron in #1120
  • refactor: improve model loading and resource handling in utils.py by @SyedaAnshrahGillani in #1019
  • Fix Gemma 4 E2B/E4B per_layer_inputs crash in batched prefill by @Blaizzy in #1123
  • Fix mixed-length Gemma 4 batching by @Blaizzy in #1127
  • Fix streamed detokenization for byte-fallback tokens by @Blaizzy in #1129
  • Add server thinking mode flag by @Blaizzy in #1130
  • Speculative decoding fixes: auto-detect drafter kind and preserve multimodal prefill by @Blaizzy in #1125
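
A few of the changes above are easiest to see in use. For the continuous-batching server (#1027), here is a minimal client sketch; it assumes the server is running (e.g. via `python -m mlx_vlm.server`) and exposes an OpenAI-style `/v1/chat/completions` route, which the `delta.content` and `response_format` entries above suggest. The model id and port are placeholders.

```python
import json
import urllib.request

# One request; with continuous batching the server can interleave many of
# these concurrently instead of serving them strictly one at a time.
payload = {
    "model": "mlx-community/gemma-4-4b-it-4bit",  # placeholder model id
    "messages": [{"role": "user", "content": "Describe MLX in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["choices"][0]["message"]["content"])
```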
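
The new `response_format` support (#1047) follows the OpenAI `json_schema` convention by name; assuming it matches that shape, the request payload gains a schema block like the one below. The schema name and fields are illustrative.

```python
# Same request as above, with a response_format constraining the output
# to JSON matching the given schema.
payload["response_format"] = {
    "type": "json_schema",
    "json_schema": {
        "name": "place",  # illustrative schema name
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "country": {"type": "string"},
            },
            "required": ["city", "country"],
        },
    },
}
```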
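
The `/health` endpoint now also reports the loaded model's context size and tool-call parser (#1092). A sketch, assuming a JSON body on `GET /health`; the exact field names are not given in these notes.

```python
import json
import urllib.request

# Poll the health endpoint; besides liveness it now carries the loaded
# model's context size and tool-call parser (field names may differ).
with urllib.request.urlopen("http://localhost:8080/health") as resp:
    print(json.loads(resp.read()))
```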
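
Finally, the LoRA fix in #846 restores the standard scaling convention: the low-rank update is multiplied by `alpha / rank`, so changing the rank does not silently change the update's magnitude. A self-contained sketch of that convention (names are illustrative, not the `LoRaLayer` API itself):

```python
import mlx.core as mx

def lora_forward(x, W, A, B, alpha=16.0):
    # Standard LoRA: y = x W^T + (alpha / r) * x A^T B^T,
    # where A is (r, in) and B is (out, r).
    rank = A.shape[0]
    return x @ W.T + (alpha / rank) * ((x @ A.T) @ B.T)

x = mx.random.normal((1, 32))
W = mx.random.normal((32, 32))
A = mx.random.normal((8, 32))
B = mx.zeros((32, 8))  # B starts at zero, so the adapter is initially a no-op
print(lora_forward(x, W, A, B).shape)  # (1, 32)
```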

New Contributors

Full Changelog: v0.4.4...v0.5.0
