This is a release candidate for testing before the official v0.3.8 release. It may contain bugs. If you run into any issues, please open an issue.
## Highlights
**oMLX 0.3.8 performance vs mlx-lm and mlx-vlm**
Apple Silicon, 512 GB unified memory. pp16384 is prefill tok/s; tg512 is decode tok/s. Cache misses were forced on every request.
| Model | Metric | oMLX (Cache off) | oMLX (Cache on) | mlx-lm | mlx-vlm |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-6bit | pp16384 | 2469.9 | 2383.7 | 2512.2 | 2491.6 |
| | tg512 | 75.0 | 77.3 | 85.3 | 75.9 |
| Qwen3.6-27B-oQ6 | pp16384 | 401.1 | 395.1 | 402.2 | 401.6 |
| | tg512 | 26.4 | 26.4 | 26.9 | 26.3 |
| gemma-4-26b-a4b-it-8bit | pp16384 | 2148.6 | 1868.3 | 2163.3 | 2156.6 |
| | tg512 | 78.7 | 79.1 | 81.9 | 79.0 |
| gemma-4-31b-it-4bit | pp16384 | 318.8 | 306.1 | 319.3 | 318.8 |
| | tg512 | 31.1 | 30.9 | 31.6 | 31.0 |
A note on the numbers: oMLX is tuned for agent workloads with shared context, not single-shot benchmark wins. Cache snapshotting and `BatchGenerator` add a few percent of overhead, so the slightly higher latency versus the other engines is expected.
gemma-4's cache-on dip on pp16384 is a known overhead on hybrid-attention models with `RotatingKVCache` SSD storage and is being tracked separately.
## New Features
- VLM continuous batching: mlx-vlm bumped to `e41cd25`, picking up native continuous batching and torch-free Qwen VL processors. `VLMModelAdapter` dropped its offset/cache glue code since mlx-vlm now handles batched decode natively
- Native TTS streaming: `/v1/audio/speech` now streams audio as it is generated instead of buffering the full clip, dropping time-to-first-byte significantly on long inputs. Thanks @apetersson (#951)
- Per-model "Trust Remote Code" toggle in admin Advanced Settings, off by default. `trust_remote_code` flips to False everywhere with explicit per-model opt-in, and is excluded from profiles/templates so the flag never propagates implicitly (#926)
- Russian admin localization. Thanks @DrMaks22 (#977)
- VLM SpecPrefill skips system prompt re-tokenization. Thanks @brettp (#916, #918)
- Anthropic `/v1/messages` reports prefix cache usage via `cache_creation_input_tokens` and `cache_read_input_tokens`, splitting `prompt_tokens` into the Anthropic disjoint triple (see the sketch after this list). Closes #912
- OpenAI-compatible usage details now include `prompt_tokens_details.cached_tokens`. Thanks @chulanpro5 (#922)
- MCP and audio routers gated behind `verify_api_key`
- Clean errors when chat completions hit a non-LLM engine (returns 400 with the right endpoint suggestion) or when STT models are missing `preprocessor_config.json` (mlx-community Whisper variants, Qwen3-ASR-MLX). Thanks @ethannortharc. Closes #507, #800
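A minimal sketch of the disjoint-triple accounting behind the `/v1/messages` usage report above. The helper name and arguments are illustrative, not omlx internals; it only shows how an OpenAI-style `prompt_tokens` count maps onto Anthropic's three non-overlapping buckets.

```python
def split_prompt_tokens(prompt_tokens: int, cache_read: int, cache_creation: int) -> dict:
    """Split an OpenAI-style prompt_tokens count into Anthropic's disjoint
    triple. cache_read is the prefix served from an existing cache entry;
    cache_creation is the prefix newly written to the cache. (Illustrative
    helper, not omlx's actual code.)"""
    uncached = prompt_tokens - cache_read - cache_creation
    assert uncached >= 0, "buckets must be disjoint"
    return {
        "input_tokens": uncached,                       # fresh, uncached input
        "cache_read_input_tokens": cache_read,          # read from prefix cache
        "cache_creation_input_tokens": cache_creation,  # written to prefix cache
    }

# 16384-token prompt, 12288 tokens hit the cache, 3584 newly snapshotted:
usage = split_prompt_tokens(16384, cache_read=12288, cache_creation=3584)
assert sum(usage.values()) == 16384  # the triple always sums back
```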
## Bug Fixes
- Fix `RuntimeError: There is no Stream(gpu, 0) in current thread` follow-up: drop per-layer `mx.all(...).item()` GPU syncs in Qwen3.5 text-only decode. Each `.item()` was forcing a GPU sync, costing 5-10 ms per generated token on 40-64 layer Qwen models (sketched below)
- Fix mlx-lm sample_utils `mx.compile` RNG state bug that pinned identical prompts to character-identical output even at temperature > 1. omlx now ships its own sampler at `omlx/utils/sampling.py` without the `@partial(mx.compile, inputs=mx.random.state, outputs=mx.random.state)` decorator (sketched below). Surfaced as the 0.3.7 "infinite loop" reports
- Fix request hang from mlx-lm `GenerationBatch._step` raising `TypeError: 'NoneType' object is not iterable` when a per-row `logits_processors` slot was None. Always pass a per-row list, possibly empty, never None. Pattern added to cache-corruption recovery as a safety net (#934)
- Fix attention dilution on `RotatingKVCache` SSD restore for sliding-window models (Gemma3, gpt-oss-120b, Gemma4 sliding layers). The old restore zero-padded the buffer to `max_size`, leaking zero positions into attention via softmax. A new `PrefillReadyRotatingKVCache` subclass clamps `size()` by the actual buffer length (sketched below), and the SSD cache format is bumped to v2
- Fix `PrefillReadyRotatingKVCache` falling through to `DefaultCacheHandler` on SSD restore because its class name was missing from the cache type registry, reintroducing the dilution bug for restored caches
- Fix Qwen3.5/3.6 mlx-vlm forward diverging from mlx-lm cache semantics (missing `cache.advance(S)` and `mx.contiguous` on `cache[0]`). Greedy outputs no longer diverge between cold prefill and cache-hit prefill on Qwen3.6-35B-A3B
- Fix mlx-vlm `GatedDeltaNet` missing `cache.advance(S)` and `mx.contiguous` from mlx-lm 0.31.3. The patch now wraps both libraries with an idempotent, source-inspecting forward-compatibility shim
- Fix Anthropic `Content block is not a text block` SDK errors from broken text/thinking block transitions when models emitted multiple `<think>` sections; non-streaming tool-only responses no longer fall back to `regular_content` with raw `<|tool_call>` markup
- Fix Anthropic SDKs rejecting thinking blocks that had an empty or missing `signature`. omlx now emits a stable placeholder (`omlx-reasoning`) so signature-presence checks pass. Strict signature-verifying clients still reject, as expected
- Fix Gemma 4 tool calling on `/v1/messages` returning empty `tool_use.input` because the strict dict-only validator coerced JSON-object strings (valid per the OpenAI spec from native parsers) to `"{}"`. Any string that parses back to a dict is now accepted (sketched below)
- Fix Gemma 4 leaking a bare `<channel|>` close token into visible text on long multi-turn contexts; both channel markers are now tracked unconditionally and stray closes dropped
- Fix the updater picking dev/rc tags (e.g. `v0.3.8.dev3`) as "latest" because `/releases/latest` only honors the prerelease flag and our dev tags shipped with that flag unset. Both the menubar and admin update checks now scan `/releases` and select the highest stable PEP 440 tag (#981, sketched below)
- Fix the oQ uploader excluding folders like `Qwen3.6-27B-oQ3.5e` because the suffix filter only checked `name[-5:]` for "oQ". "oQ" is now matched anywhere in the folder name (#968)
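The sketches below illustrate a few of the fixes above; names and surrounding code are assumptions for illustration, not omlx's actual implementation. First, the per-layer `.item()` sync cost: each `.item()` blocks on a GPU sync, so folding per-layer checks into one lazy array and syncing once keeps the decode loop on-device (the actual fix simply drops the per-layer checks; the finiteness check here is an assumed stand-in).

```python
import mlx.core as mx

def check_layers_per_layer_sync(activations: list[mx.array]) -> bool:
    # Anti-pattern from the fix above: every .item() forces a blocking
    # GPU sync, so a 48-layer model pays ~48 syncs per decoded token.
    return all(mx.all(mx.isfinite(a)).item() for a in activations)

def check_layers_single_sync(activations: list[mx.array]) -> bool:
    # Keep the reduction lazy and on-device, then sync exactly once.
    ok = mx.array(True)
    for a in activations:
        ok = mx.logical_and(ok, mx.all(mx.isfinite(a)))
    return ok.item()  # one sync per token instead of one per layer
```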
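Second, the RNG-state bug: the compiled shape named in the fix, next to an uncompiled replacement in the spirit of `omlx/utils/sampling.py`. Both are simplified illustrations, not the verbatim code from either library.

```python
from functools import partial
import mlx.core as mx

# The compiled shape described above: mx.random.state is threaded through
# the compiled graph as implicit input/output, which is where the reported
# state bug surfaced.
@partial(mx.compile, inputs=mx.random.state, outputs=mx.random.state)
def compiled_categorical(logprobs: mx.array, temp: float) -> mx.array:
    return mx.random.categorical(logprobs * (1 / temp))

# The uncompiled replacement shape: mx.random advances its global state
# normally on every call, so identical prompts can diverge at temp > 0.
def categorical(logprobs: mx.array, temp: float = 1.0) -> mx.array:
    if temp == 0:
        return mx.argmax(logprobs, axis=-1)
    return mx.random.categorical(logprobs * (1 / temp))
```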
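Third, the attention-dilution clamp. This assumes mlx-lm's `RotatingKVCache` layout (keys shaped `(B, n_kv_heads, seq, head_dim)`) and a `size()` accessor consulted when building the attention mask; treat it as a sketch of the clamping rule named in the fix, not the real subclass.

```python
from mlx_lm.models.cache import RotatingKVCache

class PrefillReadyRotatingKVCache(RotatingKVCache):
    """Sketch: never report more positions than the buffer really holds."""

    def size(self) -> int:
        # After an SSD restore the buffer can hold fewer than max_size real
        # positions. Reporting max_size lets the zero-padded slots reach
        # softmax, diluting attention; clamp to what was actually restored.
        restored = 0 if self.keys is None else self.keys.shape[2]
        return min(self.offset, restored)
```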
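Fourth, the relaxed `tool_use.input` validation: instead of coercing every non-dict value to `"{}"`, a string gets one chance to parse back into a JSON object. The helper name is illustrative.

```python
import json

def coerce_tool_input(value) -> dict:
    # Illustrative version of the relaxed rule above, not omlx's validator.
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            return parsed  # a JSON-object string round-trips to a dict
    raise ValueError("tool input must be a dict or a JSON-object string")

assert coerce_tool_input('{"city": "Oslo"}') == {"city": "Oslo"}
```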
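Finally, the updater's tag selection. A sketch of the rule using `packaging` (the real check also talks to the GitHub `/releases` API, omitted here): parse every tag as PEP 440 and keep the highest non-prerelease.

```python
from packaging.version import InvalidVersion, Version

def latest_stable_tag(tags: list[str]) -> str | None:
    """Highest stable PEP 440 tag, skipping dev/rc/alpha/beta releases."""
    best: Version | None = None
    best_tag: str | None = None
    for tag in tags:
        try:
            v = Version(tag.removeprefix("v"))
        except InvalidVersion:
            continue  # not a PEP 440 tag, ignore it
        if v.is_prerelease:  # True for .devN, rcN, aN, bN
            continue
        if best is None or v > best:
            best, best_tag = v, tag
    return best_tag

# v0.3.8.dev3 ships with GitHub's prerelease flag unset, but PEP 440 still
# classifies it as a pre-release, so v0.3.7 wins:
assert latest_stable_tag(["v0.3.8.dev3", "v0.3.7", "v0.3.6"]) == "v0.3.7"
```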
## Dependency Updates
- mlx-vlm `1bf7742 → e41cd25` (continuous batching, torch-free Qwen VL processors, thread-local generation stream now upstream)
- transformers upper pin `<5.4.0` removed
- mistral_common `ReasoningEffort` stub for Gemma4 processor loading
- ProcessorMixin `video_processor` kwarg shim for HF fallback