github jundot/omlx v0.3.0rc1

7 hours ago

This is a pre-release of oMLX 0.3.0. It will go through 1 day of testing before the official release. If you find any bugs, please report them on the issues page!

Highlights

Audio support — STT, TTS, STS

mlx-audio integration brings three new engine types for audio models on Apple Silicon. by @ethannortharc

  • STT (Speech-to-Text): Whisper, Qwen3-ASR, Parakeet, Voxtral
  • TTS (Text-to-Speech): Qwen3-TTS, Kokoro, F5-TTS, Sesame CSM, Dia, Spark, CosyVoice
  • STS (Speech-to-Speech): DeepFilterNet, MossFormer2, SAMAudio, LFM2.5-Audio

Three new OpenAI-compatible endpoints: /v1/audio/transcriptions, /v1/audio/speech, /v1/audio/process (oMLX-specific). Audio models are automatically detected from the mlx-audio registry and show up in the admin dashboard alongside LLM/VLM models.

Audio Integration

Optional dependency — pip install 'omlx[audio]'. Homebrew and DMG builds include mlx-audio by default. (#365)

XGrammar constrained decoding

xgrammar-based structured output generation by @leuski. Enforces grammar constraints at the logit level using bitmasks, running in parallel with the model forward pass via mx.async_eval.

Supported grammar types:

  • json — JSON schema validation
  • regex — regular expression patterns
  • grammar — EBNF/GBNF grammars
  • choice — allowed string lists

Uses vLLM-compatible structured_outputs field (in extra_body). Per-model reasoning_parser config maps xgrammar's structural tags to model protocols (Qwen, Harmony, DeepSeek, Llama, etc.). Performance overhead is 9–24% on decode (no TTFT impact), decreasing with larger models.

Optional dependency — pip install 'omlx[grammar]'. Homebrew and DMG builds include xgrammar by default. Without it, response_format falls back to prompt injection and structured_outputs returns a 400 with install instructions. (#335)

New Features

  • XTC sampler — XTC (eXclude Top Choices) sampling support. Pass xtc_probability and xtc_threshold through any API endpoint. Defaults to 0.0 (disabled) (#337 by @blightbow)
  • MCP Streamable HTTP — MCP now supports Streamable HTTP transport in addition to stdio (#286 by @tianfeng98)
  • Multimodal embedding items/v1/embeddings accepts structured items with text + image input. Tested with Qwen3-VL-Embedding-2B-mxfp8 (#373 by @MasakiMu319)
  • Custom processor embedding support — Embedding requests route through custom processor hooks when available, fixing models like Qwen3-VL-Embedding that broke on the generic tokenizer path (#369 by @MasakiMu319)
  • System prompt support in chat UI — Chat interface now accepts system prompts
  • Clear all SSD cache button — Admin dashboard has a new button to clear all SSD cache blocks
  • SSD cache size display — Shows SSD cache size even when no models are loaded
  • Responsive admin dashboard — Admin dashboard now works on mobile devices
  • Real-time menubar updates — macOS menubar status (Stopped → Starting → Running) and button states now update in real time while the menu is open, without needing to close and reopen (#426 by @EmotionalAmo)
  • App bundle size reduction — Stripped torch, cv2, pyarrow, pandas, sympy from the app bundle (~780MB saved)
  • mlx-vlm bump — Updated to v0.4.2 (7f7a04c)
  • mlx-embeddings bump — Updated to v0.1.0 (32981fa)

Bug Fixes

  • Fix ContentPart list content causing 500 error when Qwen Code CLI sends [{"type": "text", ...}] instead of string (#433 by @mbauer)
  • Fix prefix index storing mutable reference to block_ids, causing cache corruption on CoW block changes (#391 by @andafterall)
  • Fix Anthropic document content blocks (base64 PDF) rejected by Pydantic validation (#434)
  • Fix "Total Tokens Processed" metric ignoring reasoning/thinking tokens — renamed to "Total Prefill Tokens" for clarity (#430)
  • Fix MCP client resource leak on partial connect failure
  • Fix ValueError from processor.apply_chat_template in VLM engine
  • Fix fake <think> tag prepended in TRACE log output
  • Fix VLM cache proxy using wrong offset for batched mask sizing
  • Fix specprefill_threshold not propagating from model_settings to scheduler
  • Fix max_position_embeddings not detected in nested text_config
  • Fix voice param not mapped to instruct for VoiceDesign TTS models
  • Fix 8-bit quantization inconsistency — unified to affine/gs64 (removed mxfp8/gs32)
  • Fix admin dashboard polling continuing when browser tab is hidden (#352)
  • Fix _prefix_index type annotations to match tuple storage
  • Fix menubar menu status duplication and _build_menu guard during menu open
  • Fix mlx-audio dependency conflicts with mlx-lm version pinning
  • Fix auto-generate README when frontmatter-only stub exists
  • Sync 25 stale tests with current implementation

New Contributors

Full changelog: v0.2.24...v0.3.0rc1

Don't miss a new omlx release

NewReleases is sending notifications on new releases.