What's Changed
- Fix gemma4 multi-image processing for different-sized images by @Blaizzy in #938
- Fix race condition in TurboQuant fused fast-quantize kernels by @Chedrian07 in #967
- Fix Gemma 4 tool parser to accept hyphenated function names by @michaelstingl in #963
- fix(gemma4): snapshot cache.offset to prevent alias mutation under batched caches by @Thump604 in #966
- Fix Gemma 4 quantized per-layer projection loading by @spicyneuron in #935
- fix: Gemma 4 audio — mel preprocessing, weight loading, feature extractor by @stephencox-ict in #931
- Add continuous batching to server by @Blaizzy in #1027
- fix: use alpha/rank scaling in LoRaLayer (standard LoRA convention) by @kikoncuo in #846 (see the scaling sketch after this list)
- Fix duplicate docstring entries and minor issues in utils.py by @sjhddh in #925
- Add KV cache quantization for continuous batching by @Blaizzy in #1030
- Strip tool-call markup from streamed delta.content by @Blaizzy in #1037
- Add DFlash speculative decoding (single + batch + server) by @Blaizzy in #1029 (see the schematic after this list)
- Fix stale position IDs and gdn_sink compatibility for Qwen3.5/3.6 MoE… by @lele872 in #1040
- Distributed inference for Qwen3, Kimi K2.5 and K2.6 by @pcuenca in #689
- Add Distributed Inference section by @pcuenca in #1041
- Resolve no images crash for qwen3_vl and qwen3_vl_moe generate call by @urimem in #1013
- Add Youtu-VL by @MollySophia in #1018
- Fix preprocessing for image input for trainer by @Goekdeniz-Guelmez in #826
- add grounded_reasoning: Falcon Perception + Gemma4 agentic demo by @YasserdahouML in #926
- Add Gemma 4 video support (multi-video capable) by @Blaizzy in #1042
- Qwen2 / 2.5 / 3 / 3.5 VL: torch-free video processors + chunked-prefill rope fix by @Blaizzy in #1048
- Close the batch_generate / server decode gap + VLM fixes by @Blaizzy in #1055
- Thread-local generation stream (port mlx-lm#1090) by @Blaizzy in #1050
- Fix DFlash speculative decoding: GPU hang, performance, and upstream alignment by @Blaizzy in #1053
- hunyuan_vl / gemma3n: drop dead assignments in cache-offset extraction by @Blaizzy in #1056
- Fix Gemma 4 LoRA training: vision backward NaN + audio_tower freeze leak by @john-rocky in #1052
- server: added 'server' header in responses by @goniz in #1082
- Add Nemotron 3 Nano Omni model by @lucasnewman in #1087
- Nemotron H Nano Omni: rename SoundConfig and add processor by @Blaizzy in #1088
- mistral3: skip lm_head quantization and add multi_modal_projector to skip list by @Blaizzy in #1089
- Add server `json_schema` response_format support by @avbiswas in #1047 (see the request sketch after this list)
- Fix Kimi VL concurrent Metal crash and mixed-batch text degradation by @Blaizzy in #1039
- Fix per-sequence MRoPE alignment in mixed VL+text batches by @neilmehta24 in #1095
- granite4_vision: support standard granite MLP backbone for 4.1-4b by @EliSchwartz in #1104
- Add Gemma 4 MTP speculative-decoding drafter by @Blaizzy in #1112
- Add APC prompt caching with disk persistence by @Blaizzy in #1103
- Add APC prompt caching with warm-disk persistence for hybrid models by @Blaizzy in #1114
- Fix token queue timeout during long prefills by @eloe in #1111
- Server: add Gemma 4 MTP drafter support by @Blaizzy in #1115
- Fix TurboQuant batch cache offset merging by @eloe in #1110
- Fix Qwen3.5 quantization config keys by @Blaizzy in #1119
- server: added loaded model's context size and tool call parser to /health endpoint by @goniz in #1092 (see the probe sketch after this list)
- fix: Batch generation breaks top-p sampling by @spicyneuron in #1094 (see the per-row top-p sketch after this list)
- feat: Add SAM 3D Body — monocular 3D body mesh on Apple Silicon by @shihwesley in #922
- Add `--max-tokens` to server by @spicyneuron in #1120
- refactor: improve model loading and resource handling in utils.py by @SyedaAnshrahGillani in #1019
- Fix Gemma 4 E2B/E4B per_layer_inputs crash in batched prefill by @Blaizzy in #1123
- Fix mixed-length Gemma 4 batching by @Blaizzy in #1127
- Fix streamed detokenization for byte-fallback tokens by @Blaizzy in #1129
- Add server thinking mode flag by @Blaizzy in #1130
- Speculative decoding fixes: auto-detect drafter kind and preserve multimodal prefill by @Blaizzy in #1125
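
For #846: the standard LoRA convention scales the low-rank update by `alpha / rank` rather than by `alpha` alone. A minimal NumPy sketch of that convention; the function and variable names here are illustrative and not the repo's `LoRaLayer`.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, rank):
    """Frozen weight W plus a LoRA update scaled by alpha / rank.

    x: (batch, in_features), W: (in_features, out_features)
    A: (in_features, rank),  B: (rank, out_features)
    """
    scaling = alpha / rank          # standard LoRA convention (not alpha alone)
    base = x @ W                    # frozen base projection
    update = (x @ A) @ B            # low-rank adaptation path
    return base + scaling * update

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))
W = rng.normal(size=(16, 32))
A = rng.normal(size=(16, 4)) * 0.01
B = np.zeros((4, 32))               # B starts at zero, so the update is initially a no-op
y = lora_forward(x, W, A, B, alpha=16, rank=4)
```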
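For the speculative-decoding work in #1029, #1053, and #1125: the general draft-and-verify idea is that a cheap drafter proposes a short run of tokens and the target model accepts the longest prefix it agrees with. The schematic below is a generic greedy variant, not the repo's DFlash implementation; it verifies tokens one at a time for clarity, where a real implementation would score all drafted positions in a single batched forward pass.

```python
def speculative_generate(target_step, draft_step, prompt, k=4, max_tokens=16):
    """Greedy speculative decoding: the drafter proposes k tokens, the target verifies them.

    target_step(tokens) -> next greedy token id under the large model
    draft_step(tokens)  -> next greedy token id under the small drafter
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # 1. Draft k candidate tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_step(tokens + draft))
        # 2. Verify: accept the longest prefix the target model agrees with.
        accepted = 0
        for i in range(k):
            if target_step(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3. Always emit one token from the target so decoding makes progress.
        tokens.append(target_step(tokens))
    return tokens[: len(prompt) + max_tokens]

# Toy usage: both "models" just count upward, so the drafter is always accepted.
step = lambda toks: (toks[-1] + 1) % 100
print(speculative_generate(step, step, prompt=[0]))
```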
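For #1047: a hypothetical structured-output request. It assumes the server exposes an OpenAI-style chat completions endpoint on localhost port 8080; the URL, port, model id, and exact payload field names are assumptions, not values taken from the repo.

```python
import requests  # pip install requests

payload = {
    "model": "your-model-id",  # placeholder; use whatever model the server has loaded
    "messages": [{"role": "user", "content": "Summarize this release in one sentence."}],
    "max_tokens": 128,
    # OpenAI-style structured output: constrain the reply to a JSON schema.
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "release_summary",
            "schema": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
            },
        },
    },
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
print(resp.json())
```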
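For #1092: a minimal probe of the /health endpoint, which now reports the loaded model's context size and tool call parser. The port and the shape of the JSON response are assumptions.

```python
import requests

# Port is an assumption; point this at wherever the server is running.
health = requests.get("http://localhost:8080/health", timeout=5).json()
print(health)  # expected to include the loaded model's context size and tool call parser
```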
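For #1094: top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability reaches p, and in batch generation that cutoff has to be computed per sequence, i.e. per row of the logits. A generic NumPy sketch of the per-row filter, not the repo's sampler:

```python
import numpy as np

def top_p_filter(logits, p=0.9):
    """Zero out the tail of each row's distribution, keeping a nucleus of mass >= p."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Sort each row descending and find the cumulative cutoff per row.
    order = np.argsort(-probs, axis=-1)
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    cumulative = np.cumsum(sorted_probs, axis=-1)
    keep_sorted = cumulative - sorted_probs < p   # exclusive cumsum: the top token is always kept

    # Scatter each row's mask back to the original token order.
    keep = np.zeros_like(keep_sorted)
    np.put_along_axis(keep, order, keep_sorted, axis=-1)

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum(axis=-1, keepdims=True)

batch_logits = np.random.default_rng(0).normal(size=(3, 10))  # 3 sequences, 10-token vocab
print(top_p_filter(batch_logits, p=0.9))
```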
New Contributors
- @Chedrian07 made their first contribution in #967
- @michaelstingl made their first contribution in #963
- @Thump604 made their first contribution in #966
- @stephencox-ict made their first contribution in #931
- @kikoncuo made their first contribution in #846
- @sjhddh made their first contribution in #925
- @lele872 made their first contribution in #1040
- @urimem made their first contribution in #1013
- @MollySophia made their first contribution in #1018
- @YasserdahouML made their first contribution in #926
- @john-rocky made their first contribution in #1052
- @goniz made their first contribution in #1082
- @lucasnewman made their first contribution in #1087
- @avbiswas made their first contribution in #1047
- @EliSchwartz made their first contribution in #1104
- @eloe made their first contribution in #1111
- @shihwesley made their first contribution in #922
- @SyedaAnshrahGillani made their first contribution in #1019
Full Changelog: v0.4.4...v0.5.0