Blaizzy/mlx-vlm v0.5.0

What's Changed

  • Fix gemma4 multi-image processing for different-sized images by @Blaizzy in #938
  • Fix race condition in TurboQuant fused fast-quantize kernels by @Chedrian07 in #967
  • Fix Gemma 4 tool parser to accept hyphenated function names by @michaelstingl in #963
  • fix(gemma4): snapshot cache.offset to prevent alias mutation under batched caches by @Thump604 in #966
  • Fix Gemma 4 quantized per-layer projection loading by @spicyneuron in #935
  • fix: Gemma 4 audio — mel preprocessing, weight loading, feature extractor by @stephencox-ict in #931
  • Add continuous batching to server by @Blaizzy in #1027 (client sketch after this list)
  • fix: use alpha/rank scaling in LoRaLayer (standard LoRA convention) by @kikoncuo in #846 (scaling sketch after this list)
  • Fix duplicate docstring entries and minor issues in utils.py by @sjhddh in #925
  • Add KV cache quantization for continuous batching by @Blaizzy in #1030
  • Strip tool-call markup from streamed delta.content by @Blaizzy in #1037
  • Add DFlash speculative decoding (single + batch + server) by @Blaizzy in #1029
  • Fix stale position IDs and gdn_sink compatibility for Qwen3.5/3.6 MoE… by @lele872 in #1040
  • Distributed inference for Qwen3, Kimi K2.5 and K2.6 by @pcuenca in #689
  • Add Distributed Inference section by @pcuenca in #1041
  • Resolve no images crash for qwen3_vl and qwen3_vl_moe generate call by @urimem in #1013
  • Add Youtu-VL by @MollySophia in #1018
  • Fix preprocessing for image input for trainer by @Goekdeniz-Guelmez in #826
  • add grounded_reasoning: Falcon Perception + Gemma4 agentic demo by @YasserdahouML in #926
  • Add Gemma 4 video support (multi-video capable) by @Blaizzy in #1042
  • Qwen2 / 2.5 / 3 / 3.5 VL: torch-free video processors + chunked-prefill rope fix by @Blaizzy in #1048
  • Close the batch_generate / server decode gap + VLM fixes by @Blaizzy in #1055
  • Thread-local generation stream (port mlx-lm#1090) by @Blaizzy in #1050
  • Fix DFlash speculative decoding: GPU hang, performance, and upstream alignment by @Blaizzy in #1053
  • hunyuan_vl / gemma3n: drop dead assignments in cache-offset extraction by @Blaizzy in #1056
  • Fix Gemma 4 LoRA training: vision backward NaN + audio_tower freeze leak by @john-rocky in #1052
  • server: added 'server' header in responses by @goniz in #1082
  • Add Nemotron 3 Nano Omni model by @lucasnewman in #1087
  • Nemotron H Nano Omni: rename SoundConfig and add processor by @Blaizzy in #1088
  • mistral3: skip lm_head quantization and add multi_modal_projector to skip list by @Blaizzy in #1089
  • Add server json_schema response_format support by @avbiswas in #1047 (payload example after this list)
  • Fix Kimi VL concurrent Metal crash and mixed-batch text degradation by @Blaizzy in #1039
  • Fix per-sequence MRoPE alignment in mixed VL+text batches by @neilmehta24 in #1095
  • granite4_vision: support standard granite MLP backbone for 4.1-4b by @EliSchwartz in #1104
  • Add Gemma 4 MTP speculative-decoding drafter by @Blaizzy in #1112
  • Add APC prompt caching with disk persistence by @Blaizzy in #1103
  • Add APC prompt caching with warm-disk persistence for hybrid models by @Blaizzy in #1114
  • Fix token queue timeout during long prefills by @eloe in #1111
  • Server: add Gemma 4 MTP drafter support by @Blaizzy in #1115
  • Fix TurboQuant batch cache offset merging by @eloe in #1110
  • Fix Qwen3.5 quantization config keys by @Blaizzy in #1119
  • server: added loaded model's context size and tool call parser to /health endpoint by @goniz in #1092 (/health sketch after this list)
  • fix: Batch generation breaks top-p sampling by @spicyneuron in #1094
  • feat: Add SAM 3D Body — monocular 3D body mesh on Apple Silicon by @shihwesley in #922
  • Add --max-tokens to server by @spicyneuron in #1120
  • refactor: improve model loading and resource handling in utils.py by @SyedaAnshrahGillani in #1019
  • Fix Gemma 4 E2B/E4B per_layer_inputs crash in batched prefill by @Blaizzy in #1123
  • Fix mixed-length Gemma 4 batching by @Blaizzy in #1127
  • Fix streamed detokenization for byte-fallback tokens by @Blaizzy in #1129
  • Add server thinking mode flag by @Blaizzy in #1130
  • Speculative decoding fixes: auto-detect drafter kind and preserve multimodal prefill by @Blaizzy in #1125
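
A few of the changes above are easiest to see in use. For the continuous-batching server (#1027), here is a minimal client sketch; it assumes the server is running (e.g. via `python -m mlx_vlm.server`) and exposes an OpenAI-style `/v1/chat/completions` route, which the `delta.content` and `response_format` entries above suggest. The model id and port are placeholders.

```python
import json
import urllib.request

# One request; with continuous batching the server can interleave many of
# these concurrently instead of serving them strictly one at a time.
payload = {
    "model": "mlx-community/gemma-4-4b-it-4bit",  # placeholder model id
    "messages": [{"role": "user", "content": "Describe MLX in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["choices"][0]["message"]["content"])
```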
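
The new `response_format` support (#1047) follows the OpenAI `json_schema` convention by name; assuming it matches that shape, the request payload gains a schema block like the one below. The schema name and fields are illustrative.

```python
# Same request as above, with a response_format constraining the output
# to JSON matching the given schema.
payload["response_format"] = {
    "type": "json_schema",
    "json_schema": {
        "name": "place",  # illustrative schema name
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "country": {"type": "string"},
            },
            "required": ["city", "country"],
        },
    },
}
```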
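
The `/health` endpoint now also reports the loaded model's context size and tool-call parser (#1092). A sketch, assuming a JSON body on `GET /health`; the exact field names are not given in these notes.

```python
import json
import urllib.request

# Poll the health endpoint; besides liveness it now carries the loaded
# model's context size and tool-call parser (field names may differ).
with urllib.request.urlopen("http://localhost:8080/health") as resp:
    print(json.loads(resp.read()))
```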
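
Finally, the LoRA fix in #846 restores the standard scaling convention: the low-rank update is multiplied by `alpha / rank`, so changing the rank does not silently change the update's magnitude. A self-contained sketch of that convention (names are illustrative, not the `LoRaLayer` API itself):

```python
import mlx.core as mx

def lora_forward(x, W, A, B, alpha=16.0):
    # Standard LoRA: y = x W^T + (alpha / r) * x A^T B^T,
    # where A is (r, in) and B is (out, r).
    rank = A.shape[0]
    return x @ W.T + (alpha / rank) * ((x @ A.T) @ B.T)

x = mx.random.normal((1, 32))
W = mx.random.normal((32, 32))
A = mx.random.normal((8, 32))
B = mx.zeros((32, 8))  # B starts at zero, so the adapter is initially a no-op
print(lora_forward(x, W, A, B).shape)  # (1, 32)
```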

New Contributors

Full Changelog: v0.4.4...v0.5.0
