github jundot/omlx v0.3.8.dev3

8 hours ago

Sorry this patch took a while. Several bugs turned out to be interconnected, which made the fixes take longer than I'd hoped. I'm a solo developer with a separate full-time job, but I'm putting in whatever spare time I have to work through everyone's reports. Thanks so much for everyone's continued support.

Once core stability is verified with this release, I'll move on to working through the pending issues and PRs.

⚠️ This is a dev release intended for testing. There may be bugs. Please report issues you run into.

Highlights

Critical fixes for several 0.3.7 regressions: identical output across requests, attention dilution on sliding-window models, Qwen3.5 / 3.6 output divergence between cold prefill and cache-hit prefill, and Gemma 4 tool calling breaking under strict Anthropic SDKs (Claude Code, etc.). For security, trust_remote_code now defaults to false.

Bug Fixes

  • Sampling determinism regression: every request returned the same tokens at temperature > 0 due to an interaction between mlx-lm's compiled sampler and mlx's thread-local random state. omlx now uses its own sampler that bypasses the bad path. Affects all models.
  • Sliding-window attention dilution (Gemma3, gpt-oss-120b): SSD-restored RotatingKVCache leaked zero positions into attention. Cache restore now clamps to the actual buffer length (#934).
  • Qwen3.5 / Qwen3.6 forward alignment with mlx-lm: greedy output diverged between cold prefill and cache-hit prefill on Qwen3.6-35B-A3B because mlx-vlm's GatedDeltaNet and attention drifted from mlx-lm. Both forward bodies now mirror mlx-lm semantics, with mRoPE preserved for genuinely multimodal inputs.
  • Anthropic streaming compatibility: thinking blocks now carry a placeholder signature so strict clients (Claude Code, etc.) accept them. Streaming closes the text block before opening a follow-up thinking block (previously, two <think> sections in one turn collided on the same content-block index), and non-streaming no longer falls back to raw markup when the cleaned content is empty.
  • Gemma 4 tool calling regression: tool_use.input was emptied because the validator only accepted dicts while mlx-vlm / mlx-lm parsers return JSON-object strings (OpenAI spec). It now accepts both. Stray <channel|> tokens outside thought blocks no longer reach visible content.
  • Heterogeneous batch crash (#934): logits_processors is now always passed as a per-row list.
  • Security: trust_remote_code defaults to false, opt-in per model (#926).
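The sampler fix above comes down to giving each request its own random state instead of sharing thread-local state across compiled calls. A minimal sketch of that idea in plain Python (numpy stands in for mlx here; all names are illustrative, not omlx's actual code):

```python
import numpy as np

def make_sampler(seed: int, temperature: float):
    """Build a per-request sampler that owns a private RNG, so concurrent
    requests can never clobber a shared global/thread-local random state."""
    rng = np.random.default_rng(seed)  # private state per request

    def sample(logits: np.ndarray) -> int:
        # Temperature-scaled softmax, shifted by the max for stability.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    return sample

logits = np.array([0.1, 2.0, 0.3, 1.5])
tok_a = make_sampler(seed=1, temperature=0.8)(logits)
tok_b = make_sampler(seed=2, temperature=0.8)(logits)  # independent draw
```

Because the RNG lives in the closure, two samplers built with the same seed reproduce each other exactly, while samplers for different requests draw independently.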
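The sliding-window fix (#934) amounts to clamping the restored cache length so attention never reads unwritten, zero-valued slots. A toy sketch of the invariant, assuming hypothetical names (not omlx's real cache code):

```python
def clamp_restored_length(saved_offset: int, buffer_len: int, window: int) -> int:
    """After restoring a rotating KV cache from disk, the logical offset
    recorded in metadata can exceed what the ring buffer actually holds.
    Attending over that gap mixes zero keys/values into the softmax and
    dilutes attention, so clamp to what is really materialized."""
    return min(saved_offset, buffer_len, window)

# A cache that logically saw 4096 tokens but only persisted 1024 slots
# of a 2048-token window must attend over 1024 positions, not 4096.
restored = clamp_restored_length(saved_offset=4096, buffer_len=1024, window=2048)
```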
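The Gemma 4 tool-calling fix is a validator change: accept tool arguments either as an already-parsed dict or as the JSON-object string that OpenAI-style parsers emit. A hedged sketch of such a normalizer (the function name is hypothetical):

```python
import json

def normalize_tool_input(raw):
    """Accept tool_use.input as a dict or as a JSON-object string,
    instead of silently emptying the arguments when it isn't a dict."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    raise ValueError(f"tool input must be a JSON object, got {type(raw).__name__}")
```

Both `{"city": "Oslo"}` and the string `'{"city": "Oslo"}'` normalize to the same dict; anything else raises rather than producing an empty `input`.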
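The heterogeneous-batch fix normalizes logits_processors into one list per batch row, so rows with and without processors can share a batch. A sketch under assumed shapes (not omlx's internals):

```python
def per_row_processors(processors, batch_size):
    """Normalize `processors` into a list of per-row processor lists:
    None -> an empty list for every row; a flat list of callables ->
    that list applied to every row; an already-nested list -> validated
    to have exactly one entry per batch row."""
    if processors is None:
        return [[] for _ in range(batch_size)]
    if processors and isinstance(processors[0], list):
        if len(processors) != batch_size:
            raise ValueError("need one processor list per batch row")
        return processors
    return [list(processors) for _ in range(batch_size)]
```

With this shape the decode loop can always index `processors[row]`, regardless of how the caller passed them in.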
