jundot/omlx v0.3.8.dev1


⚠️ This is a dev release intended for testing. There may be bugs — please report issues you run into.

Highlights

Upgraded mlx-vlm to 1bf7742, which brings continuous batching support and torch-free Qwen VL processors. This allowed a significant simplification of the VLM engine: the mlx-lm decode-model weight-sharing workaround has been dropped.

VLM Engine Improvements

  • mlx-vlm bump to 1bf7742: continuous batching, torch-free Qwen2 / 2.5 / 3 / 3.5 VL processors, Qwen3-VL chunked-prefill rope fix
  • Simplified VLMModelAdapter: removed _IntOffsetCacheProxy, _CachedOffsetProxy, _wrap_caches, and the separate mlx-lm decode model. mlx-vlm models now handle per-sequence mx.array offsets and batched decode natively
  • Thread-local generation stream: _init_mlx_thread() now uses mx.new_thread_local_stream and patches mlx_vlm.generate.generation_stream (a workaround until mlx-vlm PR #1050 is merged)
  • Relaxed transformers pin: no more <5.4.0 upper bound, since mlx-vlm's custom processors bypass AutoProcessor for Qwen models
  • Qwen3-VL null-pixels fix: preprocessor_config.json with max_pixels: null / min_pixels: null now falls back to sensible defaults
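The null-pixels fallback can be sketched roughly as below. This is an illustrative sketch, not omlx's actual code: the function name `resolve_pixel_limits` is hypothetical, and the default values are assumed from the usual Qwen-VL processor conventions (`min_pixels = 56 * 56`, `max_pixels = 28 * 28 * 1280`).

```python
import json

# Assumed Qwen-VL-style defaults; the real values live in omlx/mlx-vlm.
DEFAULT_MIN_PIXELS = 56 * 56
DEFAULT_MAX_PIXELS = 28 * 28 * 1280

def resolve_pixel_limits(config: dict) -> tuple[int, int]:
    """Fall back to defaults when preprocessor_config.json carries explicit nulls.

    JSON null loads as Python None, so config.get(...) alone is not enough:
    the key exists but its value is None, which is what broke Qwen3-VL.
    """
    min_pixels = config.get("min_pixels")
    max_pixels = config.get("max_pixels")
    if min_pixels is None:
        min_pixels = DEFAULT_MIN_PIXELS
    if max_pixels is None:
        max_pixels = DEFAULT_MAX_PIXELS
    return min_pixels, max_pixels

# A preprocessor_config.json with explicit nulls, as shipped by some checkpoints:
config = json.loads('{"min_pixels": null, "max_pixels": null}')
print(resolve_pixel_limits(config))
```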

Bug Fixes

  • Cached tokens in API usage (#922): OpenAI /v1/chat/completions now reports usage.prompt_tokens_details.cached_tokens, and Anthropic /v1/messages reports cache_read_input_tokens
  • VLM spec-prefill (#918, thanks @brettp): don't re-tokenize the system prompt for VLM models
