github jundot/omlx v0.2.18


Download the DMG that matches your macOS version (Sequoia or Tahoe).
If you're on an M5 Mac, you must use the macos26-tahoe DMG to enable the M5 Neural Accelerator.

Highlights: thinking budget support

  • Thinking budget for reasoning models. You can now limit how many tokens a model spends on reasoning. Set it per model in the admin panel or per request via the API. When the budget is exceeded, thinking is force-closed and the model transitions to the actual response.
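As a rough sketch of what a per-request budget looks like on the wire, the payloads below show the `thinking_budget` parameter (OpenAI-style API) and `thinking.budget_tokens` (Anthropic-style API) named in these notes. The model ids, prompt, and numeric values are placeholders, not defaults from omlx:

```python
import json

# OpenAI-style request: thinking_budget caps reasoning tokens for this call.
payload = {
    "model": "qwen3-8b",  # placeholder model id
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "thinking_budget": 512,  # cap reasoning at 512 tokens
}

# Anthropic-style request: the cap is nested under "thinking".
anthropic_payload = {
    "model": "qwen3-8b",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "thinking": {"budget_tokens": 512},
}

print(json.dumps(payload, indent=2))
```

Omitting the parameter entirely leaves budgeting disabled, matching the zero-overhead default described below.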

New Features

Thinking budget (#285)

  • Per-model thinking budget toggle and token count in the admin panel (advanced settings)
  • Per-request thinking_budget parameter for the OpenAI API, thinking.budget_tokens for the Anthropic API
  • Uses a logits processor to force the close-think sequence when the budget is exceeded (same approach as vLLM/SGLang)
  • Auto-detects the correct </think> transition pattern from each model's chat template (handles Qwen3, DeepSeek, GLM, MiniMax, Step, etc.)
  • Suppresses duplicate </think> tokens after a forced close
  • Zero overhead when the budget is not set; near-zero overhead when active
  • Works for both LLMs and VLMs; no impact on embedding, reranker, or any cache system
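The logits-processor technique mentioned above can be sketched in a few lines: once the count of thinking tokens reaches the budget, every logit except the next token of the close-think sequence is masked out, so the model has no choice but to emit `</think>`. This is an illustrative toy (tiny vocabulary, made-up token id), not omlx's actual implementation:

```python
import math

CLOSE_THINK_IDS = [3]  # toy token id standing in for the model's "</think>"

def budget_logits_processor(tokens_since_think_open, budget, logits):
    """Force the close-think sequence once the thinking budget is exhausted."""
    if budget is None or tokens_since_think_open < budget:
        return logits  # budget unset or not yet exceeded: no-op
    step = tokens_since_think_open - budget  # which close-sequence token is due
    if step < len(CLOSE_THINK_IDS):
        forced = CLOSE_THINK_IDS[step]
        # Mask everything except the forced token.
        return [0.0 if i == forced else -math.inf for i in range(len(logits))]
    return logits  # close sequence fully emitted; generation proceeds normally

logits = [0.1] * 8          # toy vocabulary of 8 tokens
under = budget_logits_processor(50, 100, logits)   # budget not reached
forced = budget_logits_processor(100, 100, logits)  # budget just exceeded
```

The no-op early return is what makes the unset-budget path free: the processor touches nothing unless a budget is configured.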

Bug Fixes

  • disable mx.compile after a runtime failure, preventing repeated warnings on every subsequent call

Notes

  • Tip for Qwen3.5-35B-A3B users: if reasoning (enable_thinking) is true, the model may emit EOS during tool calling and stop generation mid-turn. If you're using Qwen3.5 for agentic coding, go to model settings → Chat Template Kwargs, set enable_thinking to false, and check Force.
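The tip above configures this in the admin panel; as an assumption, if your OpenAI-compatible client also accepts per-request chat template kwargs (a common server extension, e.g. in vLLM, not confirmed for omlx here), the equivalent request body would carry the same flag. Model id and prompt are placeholders:

```python
# Hypothetical per-request equivalent of the admin-panel setting:
# pass enable_thinking=false through chat template kwargs.
payload = {
    "model": "qwen3.5-35b-a3b",  # placeholder model id
    "messages": [{"role": "user", "content": "Refactor utils.py to remove dead code."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
```

With thinking disabled, the mid-turn EOS during tool calling described above should no longer occur.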

full changelog: v0.2.17...v0.2.18
