jundot/omlx v0.3.8rc1

This is a release candidate for testing before the official v0.3.8 release. It may contain bugs. If you run into any issues, please open an issue.

Highlights

oMLX 0.3.8 performance vs mlx-lm and mlx-vlm

Measured on Apple Silicon with 512 GB unified memory. pp16384 is prefill throughput (tok/s); tg512 is decode throughput (tok/s). Cache misses were forced on every request.

| Model                   | Metric  | oMLX (Cache off) | oMLX (Cache on) | mlx-lm | mlx-vlm |
|-------------------------|---------|-----------------:|----------------:|-------:|--------:|
| Qwen3.6-35B-A3B-6bit    | pp16384 |           2469.9 |          2383.7 | 2512.2 |  2491.6 |
|                         | tg512   |             75.0 |            77.3 |   85.3 |    75.9 |
| Qwen3.6-27B-oQ6         | pp16384 |            401.1 |           395.1 |  402.2 |   401.6 |
|                         | tg512   |             26.4 |            26.4 |   26.9 |    26.3 |
| gemma-4-26b-a4b-it-8bit | pp16384 |           2148.6 |          1868.3 | 2163.3 |  2156.6 |
|                         | tg512   |             78.7 |            79.1 |   81.9 |    79.0 |
| gemma-4-31b-it-4bit     | pp16384 |            318.8 |           306.1 |  319.3 |   318.8 |
|                         | tg512   |             31.1 |            30.9 |   31.6 |    31.0 |

A note on the numbers: oMLX is tuned for agent workloads with shared context, not single-shot benchmark wins. Cache snapshotting and BatchGenerator add a few percent overhead, and the slightly higher latency vs other engines is expected.

gemma-4's cache-on dip on pp16384 is a known overhead on hybrid-attention models with RotatingKVCache SSD storage and is being tracked separately.

New Features

  • VLM continuous batching: mlx-vlm bumped to e41cd25, picking up native continuous batching and torch-free Qwen VL processors. VLMModelAdapter dropped its offset/cache glue code since mlx-vlm now handles batched decode natively
  • Native TTS streaming: /v1/audio/speech now streams audio as it's generated instead of buffering the full clip, dropping time-to-first-byte significantly on long inputs (see the streaming sketch after this list). Thanks @apetersson (#951)
  • Per-model "Trust Remote Code" toggle in admin Advanced Settings, off by default. trust_remote_code now defaults to False everywhere, with explicit per-model opt-in; the setting is excluded from profiles/templates so the flag never propagates implicitly (#926)
  • Russian admin localization. Thanks @DrMaks22 (#977)
  • VLM SpecPrefill skips system prompt re-tokenization. Thanks @brettp (#916, #918)
  • Anthropic /v1/messages now reports prefix cache usage via cache_creation_input_tokens and cache_read_input_tokens, splitting prompt_tokens into Anthropic's disjoint triple (worked example after this list). Closes #912
  • OpenAI-compatible usage details now include prompt_tokens_details.cached_tokens. Thanks @chulanpro5 (#922)
  • MCP and audio routers gated behind verify_api_key
  • Clean errors when chat completions hit a non-LLM engine (returns 400 with the right endpoint suggestion) or when STT models are missing preprocessor_config.json (mlx-community Whisper variants, Qwen3-ASR-MLX). Thanks @ethannortharc. Closes #507, #800
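
The streaming sketch referenced in the TTS item above: a minimal illustration of chunked audio streaming, assuming FastAPI. synthesize_sentence() and pcm_chunks() are hypothetical stand-ins, not oMLX's actual implementation.

```python
# Minimal chunked-TTS streaming sketch (hypothetical helpers, not oMLX code).
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class SpeechRequest(BaseModel):
    input: str
    voice: str = "default"

def synthesize_sentence(sentence: str, voice: str) -> bytes:
    # Stand-in for the actual TTS model call; returns raw audio bytes.
    return sentence.encode("utf-8")

def pcm_chunks(text: str, voice: str):
    # Yield audio as soon as each piece is synthesized instead of
    # buffering the full clip; this is what keeps time-to-first-byte low.
    for sentence in text.split(". "):
        yield synthesize_sentence(sentence, voice)

@app.post("/v1/audio/speech")
def speech(req: SpeechRequest):
    return StreamingResponse(pcm_chunks(req.input, req.voice),
                             media_type="audio/pcm")
```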
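
The worked example referenced in the Anthropic usage item above. The three field names come from Anthropic's Messages API; split_usage() and its inputs are illustrative, not oMLX's internal API.

```python
# Splitting a prompt into Anthropic's disjoint usage triple.
def split_usage(prompt_tokens: int, cache_read: int, cache_created: int) -> dict:
    # The three fields are disjoint: tokens read back from the prefix
    # cache, tokens written to it on this request, and plain uncached input.
    return {
        "input_tokens": prompt_tokens - cache_read - cache_created,
        "cache_read_input_tokens": cache_read,
        "cache_creation_input_tokens": cache_created,
    }

# Example: a 16384-token prompt with 12000 tokens read from the cache and
# 4000 newly snapshotted leaves 384 uncached input tokens.
print(split_usage(16384, 12000, 4000))
```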

Bug Fixes

  • Follow-up to the RuntimeError: There is no Stream(gpu, 0) in current thread fix: drop the per-layer mx.all(...).item() GPU syncs in Qwen3.5 text-only decode. Each .item() forced a GPU sync, costing 5-10 ms per generated token on 40-64-layer Qwen models (see the sync sketch after this list)
  • Fix mlx-lm sample_utils mx.compile RNG state bug that pinned identical prompts to character-identical output even at temperature > 1. omlx now ships its own sampler at omlx/utils/sampling.py without the @partial(mx.compile, inputs=mx.random.state, outputs=mx.random.state) decorator (minimal sampler sketch after this list). This surfaced as the 0.3.7 "infinite loop" reports
  • Fix request hang from mlx-lm GenerationBatch._step raising TypeError: 'NoneType' object is not iterable when a per-row logits_processors slot was None. Always pass a per-row list, possibly empty, never None (normalization sketch after this list). The pattern was also added to cache-corruption recovery as a safety net (#934)
  • Fix attention dilution on RotatingKVCache SSD restore for sliding-window models (Gemma3, gpt-oss-120b, Gemma4 sliding layers). The old restore zero-padded the buffer to max_size, leaking zero positions into attention via softmax. A new PrefillReadyRotatingKVCache subclass clamps size() by the actual buffer length (conceptual sketch after this list), and the SSD cache format is bumped to v2
  • Fix PrefillReadyRotatingKVCache falling through to DefaultCacheHandler on SSD restore because its class name was missing from the cache type registry, reintroducing the dilution bug for restored caches
  • Fix Qwen3.5/3.6 mlx-vlm forward diverging from mlx-lm cache semantics (missing cache.advance(S) and mx.contiguous on cache[0]). Greedy outputs no longer diverge between cold prefill and cache-hit prefill on Qwen3.6-35B-A3B
  • Fix mlx-vlm GatedDeltaNet missing cache.advance(S) and mx.contiguous from mlx-lm 0.31.3. The patch now wraps both libraries with idempotent, source-inspecting forward-compatibility shims
  • Fix Anthropic Content block is not a text block SDK errors from broken text/thinking block transitions when models emitted multiple <think> sections; non-streaming tool-only responses no longer fall back to regular_content with raw <|tool_call|> markup
  • Fix Anthropic SDKs rejecting thinking blocks that had an empty or missing signature. omlx now emits a stable placeholder (omlx-reasoning) so signature-presence checks pass. Clients that strictly verify signatures still reject it, as expected
  • Fix Gemma 4 tool calling on /v1/messages returning empty tool_use.input because the strict dict-only validator coerced JSON-object strings (emitted per the OpenAI spec by native parsers) to "{}". It now accepts any string that parses back to a dict (validator sketch after this list)
  • Fix Gemma 4 leaking a bare <|channel|> close token into visible text on long multi-turn contexts; both channel markers are now tracked unconditionally and stray closes dropped
  • Fix updater picking dev/rc tags (e.g. v0.3.8.dev3) as "latest" because /releases/latest only honors the prerelease flag and our dev tags shipped with that flag unset. Both the menubar and admin update checks now scan /releases and select the highest stable PEP 440 tag (selection sketch after this list) (#981)
  • Fix the oQ uploader excluding folders like Qwen3.6-27B-oQ3.5e because the suffix filter only checked name[-5:] for "oQ". It now matches "oQ" anywhere in the folder name (filter sketch after this list) (#968)
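
The sync sketch referenced in the Stream(gpu, 0) follow-up: a hypothetical per-layer finiteness check, not the actual Qwen3.5 decode code, showing why per-layer .item() calls are expensive and one way to defer the sync.

```python
# MLX evaluates lazily; .item() forces the pending GPU work to finish,
# so one call per layer means one sync per layer, per generated token.
import mlx.core as mx

def per_layer_syncs(layer_outputs):
    # Old pattern: a forced GPU sync on every layer.
    return [mx.all(mx.isfinite(x)).item() for x in layer_outputs]

def single_lazy_check(layer_outputs):
    # Cheaper pattern: fold everything into one lazy boolean and let the
    # caller decide if (and when) to materialize it with a single .item().
    ok = mx.array(True)
    for x in layer_outputs:
        ok = mx.logical_and(ok, mx.all(mx.isfinite(x)))
    return ok
```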
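
The sampler sketch referenced in the RNG fix: a minimal uncompiled sampler in the spirit of omlx/utils/sampling.py, not the shipped code.

```python
# Without the @partial(mx.compile, inputs=mx.random.state,
# outputs=mx.random.state) decorator, mx.random.state advances normally
# between calls, so identical prompts are not pinned to identical samples.
import mlx.core as mx

def sample(logits: mx.array, temperature: float = 1.0) -> mx.array:
    if temperature <= 0.0:
        # Greedy decode when sampling is effectively disabled.
        return mx.argmax(logits, axis=-1)
    # categorical() draws from softmax(logits); dividing by temperature
    # sharpens (<1) or flattens (>1) the distribution.
    return mx.random.categorical(logits / temperature, axis=-1)
```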
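
The normalization sketch referenced in the logits_processors fix; mlx-lm's GenerationBatch internals are paraphrased, not quoted.

```python
# Every per-row slot becomes a list, possibly empty, never None, so a
# downstream `for proc in slot:` cannot raise
# TypeError: 'NoneType' object is not iterable.
def normalize_logits_processors(per_row):
    return [slot if slot is not None else [] for slot in per_row]

identity = lambda tokens, logits: logits
print(normalize_logits_processors([[identity], None, []]))
# -> [[<function <lambda> at 0x...>], [], []]
```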
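
The conceptual sketch referenced in the RotatingKVCache fix; the class shape and attribute names here are illustrative, not mlx-lm's exact internals.

```python
# On SSD restore the old code zero-padded the rotating buffer to max_size,
# and softmax over those zero positions diluted attention. Clamping the
# reported size to what was actually restored keeps padded slots out of
# the attention window.
class PrefillReadyRotatingKVCache:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self._restored_len = 0  # tokens actually present in the buffer

    def restore(self, keys, values):
        # keys/values may be shorter than max_size; remember the real
        # length instead of padding up to max_size with zeros.
        self._restored_len = keys.shape[2]  # (batch, heads, seq, head_dim)
        self.keys, self.values = keys, values

    def size(self) -> int:
        # Never report more positions than the buffer actually holds.
        return min(self._restored_len, self.max_size)
```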
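
The validator sketch referenced in the Gemma 4 tool-calling fix; coerce_tool_input() is an illustrative name.

```python
import json

def coerce_tool_input(raw) -> dict:
    # Native parsers hand over JSON-object *strings* (per the OpenAI
    # spec), so accept any string that parses back to a dict instead of
    # coercing it to {}.
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {}
        if isinstance(parsed, dict):
            return parsed
    return {}

print(coerce_tool_input('{"city": "Berlin"}'))  # {'city': 'Berlin'}
```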
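
The selection sketch referenced in the updater fix, assuming the packaging library; fetching the tag list from /releases is left out, and highest_stable() is an illustrative name.

```python
from packaging.version import InvalidVersion, Version

def highest_stable(tags: list[str]) -> str | None:
    stable = []
    for tag in tags:
        try:
            v = Version(tag.lstrip("v"))
        except InvalidVersion:
            continue
        # v0.3.8.dev3 and v0.3.8rc1 are pre-releases under PEP 440 even
        # when GitHub's prerelease flag was left unset on the tag.
        if not v.is_prerelease:
            stable.append((v, tag))
    return max(stable)[1] if stable else None

print(highest_stable(["v0.3.8.dev3", "v0.3.8rc1", "v0.3.7", "v0.3.6"]))
# -> v0.3.7
```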
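
The filter sketch referenced in the oQ uploader fix; the folder-scanning context is paraphrased.

```python
def is_oq_folder(name: str) -> bool:
    # Old check: "oQ" in name[-5:], which misses "Qwen3.6-27B-oQ3.5e"
    # because its last five characters are "Q3.5e".
    return "oQ" in name

print(is_oq_folder("Qwen3.6-27B-oQ3.5e"))  # True
```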

Dependency Updates

  • mlx-vlm 1bf7742 → e41cd25 (continuous batching, torch-free Qwen VL processors, thread-local generation stream now upstream)
  • transformers upper pin <5.4.0 removed
  • mistral_common ReasoningEffort stub for Gemma4 processor loading
  • ProcessorMixin video_processor kwarg shim for HF fallback
