jundot/omlx v0.3.8rc1

This is a release candidate for testing before the official v0.3.8 release. It may contain bugs. If you run into any issues, please open an issue.

Highlights

oMLX 0.3.8 performance vs mlx-lm and mlx-vlm

Measured on Apple Silicon with 512 GB unified memory. pp16384 is prefill throughput (tok/s); tg512 is decode throughput (tok/s). Cache misses were forced on every request.

| Model                   | Metric  | oMLX (Cache off) | oMLX (Cache on) | mlx-lm | mlx-vlm |
|-------------------------|---------|-----------------:|----------------:|-------:|--------:|
| Qwen3.6-35B-A3B-6bit    | pp16384 |           2469.9 |          2383.7 | 2512.2 |  2491.6 |
|                         | tg512   |             75.0 |            77.3 |   85.3 |    75.9 |
| Qwen3.6-27B-oQ6         | pp16384 |            401.1 |           395.1 |  402.2 |   401.6 |
|                         | tg512   |             26.4 |            26.4 |   26.9 |    26.3 |
| gemma-4-26b-a4b-it-8bit | pp16384 |           2148.6 |          1868.3 | 2163.3 |  2156.6 |
|                         | tg512   |             78.7 |            79.1 |   81.9 |    79.0 |
| gemma-4-31b-it-4bit     | pp16384 |            318.8 |           306.1 |  319.3 |   318.8 |
|                         | tg512   |             31.1 |            30.9 |   31.6 |    31.0 |

A note on the numbers: oMLX is tuned for agent workloads with shared context, not single-shot benchmark wins. Cache snapshotting and BatchGenerator add a few percent overhead, and the slightly higher latency vs other engines is expected.

gemma-4's cache-on dip on pp16384 is a known overhead on hybrid-attention models with RotatingKVCache SSD storage and is being tracked separately.

New Features

  • VLM continuous batching: mlx-vlm bumped to e41cd25, picking up native continuous batching and torch-free Qwen VL processors. VLMModelAdapter dropped its offset/cache glue code since mlx-vlm now handles batched decode natively
  • Native TTS streaming: /v1/audio/speech now streams audio as it's generated instead of buffering the full clip, dropping time-to-first-byte significantly on long inputs (see the streaming sketch after this list). Thanks @apetersson (#951)
  • Per-model "Trust Remote Code" toggle in admin Advanced Settings, off by default. trust_remote_code now defaults to False everywhere, with explicit per-model opt-in; the setting is excluded from profiles/templates so the flag never propagates implicitly (#926)
  • Russian admin localization. Thanks @DrMaks22 (#977)
  • VLM SpecPrefill skips system prompt re-tokenization. Thanks @brettp (#916, #918)
  • Anthropic /v1/messages now reports prefix cache usage via cache_creation_input_tokens and cache_read_input_tokens, splitting prompt_tokens into Anthropic's disjoint triple (worked example after this list). Closes #912
  • OpenAI-compatible usage details now include prompt_tokens_details.cached_tokens. Thanks @chulanpro5 (#922)
  • MCP and audio routers gated behind verify_api_key
  • Clean errors when chat completions hit a non-LLM engine (returns 400 with the right endpoint suggestion) or when STT models are missing preprocessor_config.json (mlx-community Whisper variants, Qwen3-ASR-MLX). Thanks @ethannortharc. Closes #507, #800
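
The streaming sketch referenced in the TTS item above: a minimal illustration of chunked audio streaming, assuming FastAPI. synthesize_sentence() and pcm_chunks() are hypothetical stand-ins, not oMLX's actual implementation.

```python
# Minimal chunked-TTS streaming sketch (hypothetical helpers, not oMLX code).
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class SpeechRequest(BaseModel):
    input: str
    voice: str = "default"

def synthesize_sentence(sentence: str, voice: str) -> bytes:
    # Stand-in for the actual TTS model call; returns raw audio bytes.
    return sentence.encode("utf-8")

def pcm_chunks(text: str, voice: str):
    # Yield audio as soon as each piece is synthesized instead of
    # buffering the full clip; this is what keeps time-to-first-byte low.
    for sentence in text.split(". "):
        yield synthesize_sentence(sentence, voice)

@app.post("/v1/audio/speech")
def speech(req: SpeechRequest):
    return StreamingResponse(pcm_chunks(req.input, req.voice),
                             media_type="audio/pcm")
```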
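
The worked example referenced in the Anthropic usage item above. The three field names come from Anthropic's Messages API; split_usage() and its inputs are illustrative, not oMLX's internal API.

```python
# Splitting a prompt into Anthropic's disjoint usage triple.
def split_usage(prompt_tokens: int, cache_read: int, cache_created: int) -> dict:
    # The three fields are disjoint: tokens read back from the prefix
    # cache, tokens written to it on this request, and plain uncached input.
    return {
        "input_tokens": prompt_tokens - cache_read - cache_created,
        "cache_read_input_tokens": cache_read,
        "cache_creation_input_tokens": cache_created,
    }

# Example: a 16384-token prompt with 12000 tokens read from the cache and
# 4000 newly snapshotted leaves 384 uncached input tokens.
print(split_usage(16384, 12000, 4000))
```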

Bug Fixes

  • Follow-up to the RuntimeError: There is no Stream(gpu, 0) in current thread fix: drop the per-layer mx.all(...).item() GPU syncs in Qwen3.5 text-only decode. Each .item() forced a GPU sync, costing 5-10 ms per generated token on 40-64-layer Qwen models (see the sync sketch after this list)
  • Fix mlx-lm sample_utils mx.compile RNG state bug that pinned identical prompts to character-identical output even at temperature > 1. omlx now ships its own sampler at omlx/utils/sampling.py without the @partial(mx.compile, inputs=mx.random.state, outputs=mx.random.state) decorator (minimal sampler sketch after this list). This surfaced as the 0.3.7 "infinite loop" reports
  • Fix request hang from mlx-lm GenerationBatch._step raising TypeError: 'NoneType' object is not iterable when a per-row logits_processors slot was None. Always pass a per-row list, possibly empty, never None (normalization sketch after this list). The pattern was also added to cache-corruption recovery as a safety net (#934)
  • Fix attention dilution on RotatingKVCache SSD restore for sliding-window models (Gemma3, gpt-oss-120b, Gemma4 sliding layers). The old restore zero-padded the buffer to max_size, leaking zero positions into attention via softmax. A new PrefillReadyRotatingKVCache subclass clamps size() by the actual buffer length (conceptual sketch after this list), and the SSD cache format is bumped to v2
  • Fix PrefillReadyRotatingKVCache falling through to DefaultCacheHandler on SSD restore because its class name was missing from the cache type registry, reintroducing the dilution bug for restored caches
  • Fix Qwen3.5/3.6 mlx-vlm forward diverging from mlx-lm cache semantics (missing cache.advance(S) and mx.contiguous on cache[0]). Greedy outputs no longer diverge between cold prefill and cache-hit prefill on Qwen3.6-35B-A3B
  • Fix mlx-vlm GatedDeltaNet missing cache.advance(S) and mx.contiguous from mlx-lm 0.31.3. The patch now wraps both libraries with idempotent, source-inspecting forward-compatibility shims
  • Fix Anthropic Content block is not a text block SDK errors from broken text/thinking block transitions when models emitted multiple <think> sections; non-streaming tool-only responses no longer fall back to regular_content with raw <|tool_call|> markup
  • Fix Anthropic SDKs rejecting thinking blocks that had an empty or missing signature. omlx now emits a stable placeholder (omlx-reasoning) so signature-presence checks pass. Clients that strictly verify signatures still reject it, as expected
  • Fix Gemma 4 tool calling on /v1/messages returning empty tool_use.input because the strict dict-only validator coerced JSON-object strings (emitted per the OpenAI spec by native parsers) to "{}". It now accepts any string that parses back to a dict (validator sketch after this list)
  • Fix Gemma 4 leaking a bare <|channel|> close token into visible text on long multi-turn contexts; both channel markers are now tracked unconditionally and stray closes dropped
  • Fix updater picking dev/rc tags (e.g. v0.3.8.dev3) as "latest" because /releases/latest only honors the prerelease flag and our dev tags shipped with that flag unset. Both the menubar and admin update checks now scan /releases and select the highest stable PEP 440 tag (selection sketch after this list) (#981)
  • Fix the oQ uploader excluding folders like Qwen3.6-27B-oQ3.5e because the suffix filter only checked name[-5:] for "oQ". It now matches "oQ" anywhere in the folder name (filter sketch after this list) (#968)
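
The sync sketch referenced in the Stream(gpu, 0) follow-up: a hypothetical per-layer finiteness check, not the actual Qwen3.5 decode code, showing why per-layer .item() calls are expensive and one way to defer the sync.

```python
# MLX evaluates lazily; .item() forces the pending GPU work to finish,
# so one call per layer means one sync per layer, per generated token.
import mlx.core as mx

def per_layer_syncs(layer_outputs):
    # Old pattern: a forced GPU sync on every layer.
    return [mx.all(mx.isfinite(x)).item() for x in layer_outputs]

def single_lazy_check(layer_outputs):
    # Cheaper pattern: fold everything into one lazy boolean and let the
    # caller decide if (and when) to materialize it with a single .item().
    ok = mx.array(True)
    for x in layer_outputs:
        ok = mx.logical_and(ok, mx.all(mx.isfinite(x)))
    return ok
```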
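
The sampler sketch referenced in the RNG fix: a minimal uncompiled sampler in the spirit of omlx/utils/sampling.py, not the shipped code.

```python
# Without the @partial(mx.compile, inputs=mx.random.state,
# outputs=mx.random.state) decorator, mx.random.state advances normally
# between calls, so identical prompts are not pinned to identical samples.
import mlx.core as mx

def sample(logits: mx.array, temperature: float = 1.0) -> mx.array:
    if temperature <= 0.0:
        # Greedy decode when sampling is effectively disabled.
        return mx.argmax(logits, axis=-1)
    # categorical() draws from softmax(logits); dividing by temperature
    # sharpens (<1) or flattens (>1) the distribution.
    return mx.random.categorical(logits / temperature, axis=-1)
```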
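
The normalization sketch referenced in the logits_processors fix; mlx-lm's GenerationBatch internals are paraphrased, not quoted.

```python
# Every per-row slot becomes a list, possibly empty, never None, so a
# downstream `for proc in slot:` cannot raise
# TypeError: 'NoneType' object is not iterable.
def normalize_logits_processors(per_row):
    return [slot if slot is not None else [] for slot in per_row]

identity = lambda tokens, logits: logits
print(normalize_logits_processors([[identity], None, []]))
# -> [[<function <lambda> at 0x...>], [], []]
```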
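
The conceptual sketch referenced in the RotatingKVCache fix; the class shape and attribute names here are illustrative, not mlx-lm's exact internals.

```python
# On SSD restore the old code zero-padded the rotating buffer to max_size,
# and softmax over those zero positions diluted attention. Clamping the
# reported size to what was actually restored keeps padded slots out of
# the attention window.
class PrefillReadyRotatingKVCache:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self._restored_len = 0  # tokens actually present in the buffer

    def restore(self, keys, values):
        # keys/values may be shorter than max_size; remember the real
        # length instead of padding up to max_size with zeros.
        self._restored_len = keys.shape[2]  # (batch, heads, seq, head_dim)
        self.keys, self.values = keys, values

    def size(self) -> int:
        # Never report more positions than the buffer actually holds.
        return min(self._restored_len, self.max_size)
```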
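
The validator sketch referenced in the Gemma 4 tool-calling fix; coerce_tool_input() is an illustrative name.

```python
import json

def coerce_tool_input(raw) -> dict:
    # Native parsers hand over JSON-object *strings* (per the OpenAI
    # spec), so accept any string that parses back to a dict instead of
    # coercing it to {}.
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {}
        if isinstance(parsed, dict):
            return parsed
    return {}

print(coerce_tool_input('{"city": "Berlin"}'))  # {'city': 'Berlin'}
```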
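
The selection sketch referenced in the updater fix, assuming the packaging library; fetching the tag list from /releases is left out, and highest_stable() is an illustrative name.

```python
from packaging.version import InvalidVersion, Version

def highest_stable(tags: list[str]) -> str | None:
    stable = []
    for tag in tags:
        try:
            v = Version(tag.lstrip("v"))
        except InvalidVersion:
            continue
        # v0.3.8.dev3 and v0.3.8rc1 are pre-releases under PEP 440 even
        # when GitHub's prerelease flag was left unset on the tag.
        if not v.is_prerelease:
            stable.append((v, tag))
    return max(stable)[1] if stable else None

print(highest_stable(["v0.3.8.dev3", "v0.3.8rc1", "v0.3.7", "v0.3.6"]))
# -> v0.3.7
```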
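
The filter sketch referenced in the oQ uploader fix; the folder-scanning context is paraphrased.

```python
def is_oq_folder(name: str) -> bool:
    # Old check: "oQ" in name[-5:], which misses "Qwen3.6-27B-oQ3.5e"
    # because its last five characters are "Q3.5e".
    return "oQ" in name

print(is_oq_folder("Qwen3.6-27B-oQ3.5e"))  # True
```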

Dependency Updates

  • mlx-vlm 1bf7742 → e41cd25 (continuous batching, torch-free Qwen VL processors, thread-local generation stream now upstream)
  • transformers upper pin <5.4.0 removed
  • mistral_common ReasoningEffort stub for Gemma4 processor loading
  • ProcessorMixin video_processor kwarg shim for HF fallback
