github jundot/omlx v0.4.2rc1
0.4.2rc1

6 hours ago

This release candidate focuses on native MarkItDown document processing, Qwen throughput recovery (x1.48), Gemma 4 unified multimodal support, and cache, scheduler, and server stability improvements.

Highlights

  • Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style file_data support.
  • Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings (currently web dashboard only).
  • Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
  • Added Gemma 4 unified audio input support. Gemma 4 unified models can now accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
  • Improved model and server controls. A server-wide context window cap policy was added, embedding requests now respect the effective model context length, and server processes now show an omlx-server process title.

Performance

Internal tg512 measurements show the main Qwen regression recovered while keeping Gemma performance stable:

HKJjNP8aMAA9UJ-

Fixes

  • Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of BatchKVCache until a second request is actually appended.
  • Fixed prompt-prefix token seeding in BatchGenerator so penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed.
  • Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
  • Fixed SSD cache pressure handling by unlinking LRU files outside the bounded write queue and preserving capped-eviction observability. by @cfbraun (#1451)
  • Fixed cache-store backpressure and aborted prefill cleanup so new prefills wait safely while cache cleanup is full.
  • Fixed an engine-pool acquire-vs-use eviction race and active-request counter leak for embedding and rerank engines. by @Cmerrill1713 (#1668)
  • Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
  • Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
  • Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
  • Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
  • Fixed response_format downgrade visibility by returning a client-visible Warning header when grammar-constrained output cannot be enforced. by @richgoodson (#1564)
  • Fixed embedding context length handling so /v1/embeddings uses the effective model context window instead of falling back to 512 tokens. by @jackwh (#1694)
  • Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
  • Fixed mlx-audio resample export compatibility for input audio.
  • Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
  • Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
  • Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
  • Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
  • Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
  • Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
  • Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)

macOS App

  • Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
  • Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
  • Added the server-wide context window cap to the admin settings UI.
  • Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
  • Fixed the Settings menu so it stays available when the server is stopped.
  • Improved process naming so the server appears as omlx-server. by @iamckun (#1658)

New Contributors

Thank you to everyone making their first contribution in this release:

@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.

Don't miss a new omlx release

NewReleases is sending notifications on new releases.