jundot/omlx v0.4.2rc1 on GitHub

This release candidate focuses on native MarkItDown document processing, Qwen throughput recovery (x1.48), Gemma 4 unified multimodal support, and cache, scheduler, and server stability improvements.

Highlights

Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style file_data support.
Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings (currently web dashboard only).
Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
Added Gemma 4 unified audio input support. Gemma 4 unified models can now accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
Improved model and server controls. A server-wide context window cap policy was added, embedding requests now respect the effective model context length, and server processes now show an omlx-server process title.

Performance

Internal tg512 measurements show the main Qwen regression recovered while keeping Gemma performance stable:

Fixes

Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of BatchKVCache until a second request is actually appended.
Fixed prompt-prefix token seeding in BatchGenerator so penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed.
Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
Fixed SSD cache pressure handling by unlinking LRU files outside the bounded write queue and preserving capped-eviction observability. by @cfbraun (#1451)
Fixed cache-store backpressure and aborted prefill cleanup so new prefills wait safely while cache cleanup is full.
Fixed an engine-pool acquire-vs-use eviction race and active-request counter leak for embedding and rerank engines. by @Cmerrill1713 (#1668)
Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
Fixed response_format downgrade visibility by returning a client-visible Warning header when grammar-constrained output cannot be enforced. by @richgoodson (#1564)
Fixed embedding context length handling so /v1/embeddings uses the effective model context window instead of falling back to 512 tokens. by @jackwh (#1694)
Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
Fixed mlx-audio resample export compatibility for input audio.
Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)

macOS App

Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
Added the server-wide context window cap to the admin settings UI.
Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
Fixed the Settings menu so it stays available when the server is stopped.
Improved process naming so the server appears as omlx-server. by @iamckun (#1658)

New Contributors

Thank you to everyone making their first contribution in this release:

@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.

jundot/omlx v0.4.2rc1 0.4.2rc1 on GitHub

Highlights

Performance

Fixes

macOS App

New Contributors

jundot/omlx v0.4.2rc1
0.4.2rc1

on GitHub