jundot/omlx v0.3.0rc1 on GitHub

This is a pre-release of oMLX 0.3.0. It will go through 1 day of testing before the official release. If you find any bugs, please report them on the issues page!

Highlights

Audio support — STT, TTS, STS

mlx-audio integration brings three new engine types for audio models on Apple Silicon. by @ethannortharc

STT (Speech-to-Text): Whisper, Qwen3-ASR, Parakeet, Voxtral
TTS (Text-to-Speech): Qwen3-TTS, Kokoro, F5-TTS, Sesame CSM, Dia, Spark, CosyVoice
STS (Speech-to-Speech): DeepFilterNet, MossFormer2, SAMAudio, LFM2.5-Audio

Three new OpenAI-compatible endpoints: /v1/audio/transcriptions, /v1/audio/speech, /v1/audio/process (oMLX-specific). Audio models are automatically detected from the mlx-audio registry and show up in the admin dashboard alongside LLM/VLM models.

Optional dependency — pip install 'omlx[audio]'. Homebrew and DMG builds include mlx-audio by default. (#365)

XGrammar constrained decoding

xgrammar-based structured output generation by @leuski. Enforces grammar constraints at the logit level using bitmasks, running in parallel with the model forward pass via mx.async_eval.

Supported grammar types:

json — JSON schema validation
regex — regular expression patterns
grammar — EBNF/GBNF grammars
choice — allowed string lists

Uses vLLM-compatible structured_outputs field (in extra_body). Per-model reasoning_parser config maps xgrammar's structural tags to model protocols (Qwen, Harmony, DeepSeek, Llama, etc.). Performance overhead is 9–24% on decode (no TTFT impact), decreasing with larger models.

Optional dependency — pip install 'omlx[grammar]'. Homebrew and DMG builds include xgrammar by default. Without it, response_format falls back to prompt injection and structured_outputs returns a 400 with install instructions. (#335)

New Features

XTC sampler — XTC (eXclude Top Choices) sampling support. Pass xtc_probability and xtc_threshold through any API endpoint. Defaults to 0.0 (disabled) (#337 by @blightbow)
MCP Streamable HTTP — MCP now supports Streamable HTTP transport in addition to stdio (#286 by @tianfeng98)
Multimodal embedding items — /v1/embeddings accepts structured items with text + image input. Tested with Qwen3-VL-Embedding-2B-mxfp8 (#373 by @MasakiMu319)
Custom processor embedding support — Embedding requests route through custom processor hooks when available, fixing models like Qwen3-VL-Embedding that broke on the generic tokenizer path (#369 by @MasakiMu319)
System prompt support in chat UI — Chat interface now accepts system prompts
Clear all SSD cache button — Admin dashboard has a new button to clear all SSD cache blocks
SSD cache size display — Shows SSD cache size even when no models are loaded
Responsive admin dashboard — Admin dashboard now works on mobile devices
Real-time menubar updates — macOS menubar status (Stopped → Starting → Running) and button states now update in real time while the menu is open, without needing to close and reopen (#426 by @EmotionalAmo)
App bundle size reduction — Stripped torch, cv2, pyarrow, pandas, sympy from the app bundle (~780MB saved)
mlx-vlm bump — Updated to v0.4.2 (7f7a04c)
mlx-embeddings bump — Updated to v0.1.0 (32981fa)

Bug Fixes

Fix ContentPart list content causing 500 error when Qwen Code CLI sends [{"type": "text", ...}] instead of string (#433 by @mbauer)
Fix prefix index storing mutable reference to block_ids, causing cache corruption on CoW block changes (#391 by @andafterall)
Fix Anthropic document content blocks (base64 PDF) rejected by Pydantic validation (#434)
Fix "Total Tokens Processed" metric ignoring reasoning/thinking tokens — renamed to "Total Prefill Tokens" for clarity (#430)
Fix MCP client resource leak on partial connect failure
Fix ValueError from processor.apply_chat_template in VLM engine
Fix fake <think> tag prepended in TRACE log output
Fix VLM cache proxy using wrong offset for batched mask sizing
Fix specprefill_threshold not propagating from model_settings to scheduler
Fix max_position_embeddings not detected in nested text_config
Fix voice param not mapped to instruct for VoiceDesign TTS models
Fix 8-bit quantization inconsistency — unified to affine/gs64 (removed mxfp8/gs32)
Fix admin dashboard polling continuing when browser tab is hidden (#352)
Fix _prefix_index type annotations to match tuple storage
Fix menubar menu status duplication and _build_menu guard during menu open
Fix mlx-audio dependency conflicts with mlx-lm version pinning
Fix auto-generate README when frontmatter-only stub exists
Sync 25 stale tests with current implementation

New Contributors

@ethannortharc — Audio integration (STT/TTS/STS) (#365)
@leuski — XGrammar constrained decoding (#335)
@tianfeng98 — MCP Streamable HTTP (#286)
@MasakiMu319 — Multimodal embedding items (#373, #369)
@mbauer — ContentPart list type error fix (#433)
@andafterall — Prefix index mutable reference fix (#391)
@EmotionalAmo — Menubar real-time update (#426)

Full changelog: v0.2.24...v0.3.0rc1