⚠️ This is a dev release intended for testing. There may be bugs — please report issues you run into.
## Highlights
Upgraded mlx-vlm to `1bf7742`, which brings continuous batching support and torch-free Qwen VL processors. This let me simplify the VLM engine significantly: the mlx-lm decode-model weight-sharing workaround is gone.
## VLM Engine Improvements
- mlx-vlm bump to `1bf7742`: continuous batching, torch-free Qwen2 / 2.5 / 3 / 3.5 VL processors, and a Qwen3-VL chunked-prefill rope fix
- Simplified `VLMModelAdapter`: removed `_IntOffsetCacheProxy`, `_CachedOffsetProxy`, `_wrap_caches`, and the separate mlx-lm decode model; mlx-vlm models now handle per-sequence `mx.array` offsets and batched decode natively
- Thread-local generation stream: `_init_mlx_thread()` now uses `mx.new_thread_local_stream` and patches `mlx_vlm.generate.generation_stream` (workaround for mlx-vlm PR #1050 until it is merged)
- Relaxed transformers pin: no more `<5.4.0` upper bound, since mlx-vlm's custom processors bypass `AutoProcessor` for Qwen models
- Qwen3-VL null-pixels fix: a `preprocessor_config.json` with `max_pixels: null` / `min_pixels: null` now falls back to sensible defaults
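To illustrate the thread-local stream idea, here is a minimal pure-Python sketch of the pattern. It deliberately does not import MLX; the `_Stream` class and `init_thread_stream` helper are illustrative stand-ins, while the real code calls `mx.new_thread_local_stream` and patches `mlx_vlm.generate.generation_stream`.

```python
# Sketch of the per-thread stream pattern (illustrative names, not the real API).
import threading

class _Stream:
    """Stand-in for an MLX compute stream."""
    def __init__(self, name: str):
        self.name = name

_local = threading.local()

def init_thread_stream() -> _Stream:
    # Each worker thread lazily creates its own stream, so concurrent
    # generate() calls never share one (the race the PR #1050 workaround avoids).
    if not hasattr(_local, "stream"):
        _local.stream = _Stream(f"stream-{threading.get_ident()}")
    return _local.stream
```

Calling `init_thread_stream()` twice on the same thread returns the same object; a different thread gets a fresh one.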
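The null-pixels fallback amounts to treating a JSON `null` the same as a missing key. A small sketch, where the default constants and the `resolve_pixel_bounds` helper are assumptions for illustration (the actual fallback values may differ):

```python
# Hypothetical defaults; the real fallback values may differ.
DEFAULT_MIN_PIXELS = 56 * 56
DEFAULT_MAX_PIXELS = 28 * 28 * 1280

def resolve_pixel_bounds(cfg: dict) -> tuple:
    # JSON `null` loads as Python None; fall back to defaults in that case.
    min_px = cfg.get("min_pixels")
    max_px = cfg.get("max_pixels")
    if min_px is None:
        min_px = DEFAULT_MIN_PIXELS
    if max_px is None:
        max_px = DEFAULT_MAX_PIXELS
    return min_px, max_px
```

Explicit values in `preprocessor_config.json` pass through untouched; only `null` (or an absent key) triggers the defaults.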