jundot/omlx v0.2.20rc1

Pre-release · 2 days ago

This is a release candidate for v0.2.20. Please test and report any issues before the final release.

Highlights

oQ — oMLX universal dynamic quantization

Quantize any model directly from the web dashboard. oQ produces standard mlx-lm-compatible models that work everywhere with no custom loader required: oMLX, mlx-lm, and any app that supports the MLX safetensors format.

oQ analyzes weight distributions per-layer and applies the optimal quantization format (mxfp4, mxfp8, affine) automatically. See the oQ documentation for details.

  • oQ2-oQ8 levels with calibration datasets and CLIP-based evaluation
  • oQ3.5: 3-bit base weights with 4-bit expert down_proj (~3.9 bpw)
  • Hybrid quantization: per-layer mxfp4/mxfp8/affine format selection for better quality-size tradeoffs
  • AutoAWQ weight equalization with per-expert scaling and data-driven sensitivity analysis for improved accuracy
  • VLM support: quantize vision-language models with vision weight preservation
  • FP8 model support: use native FP8 models (MiniMax, DeepSeek) as the quantization source
  • Clip optimization speedup: GPU batch size setting for faster AWQ-style clipping
  • Inference requests are blocked during quantization to prevent conflicts
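
To make the hybrid mode concrete, here is a minimal sketch of the kind of per-layer format selection it describes. The heuristic, thresholds, and function name below are illustrative assumptions, not oMLX's actual logic: the idea is that layers with heavy-tailed or outlier-rich weight distributions benefit from block-scaled floating-point formats (mxfp4/mxfp8), while near-Gaussian layers quantize well on an affine integer grid.

```python
import numpy as np

def pick_format(weights: np.ndarray, outlier_z: float = 6.0) -> str:
    """Illustrative per-layer format picker (NOT oMLX's actual heuristic).

    Returns "mxfp8" for layers with many extreme outliers (need dynamic
    range), "mxfp4" for heavy-tailed layers, "affine" otherwise.
    """
    w = weights.ravel().astype(np.float64)
    z = np.abs(w - w.mean()) / (w.std() + 1e-12)
    outlier_frac = float((z > outlier_z).mean())
    if outlier_frac > 1e-3:
        return "mxfp8"  # many extreme values: keep more dynamic range
    # Excess tail weight (kurtosis ~3 for Gaussian) favors mx formats.
    kurtosis = float(((w - w.mean()) ** 4).mean() / (w.var() + 1e-12) ** 2)
    return "mxfp4" if kurtosis > 3.5 else "affine"
```

A real implementation would also weigh calibration-set sensitivity (as the AutoAWQ bullet above suggests), not just the raw weight distribution.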

Intelligence benchmark suite

Evaluate model intelligence across knowledge, reasoning, math, and coding benchmarks. All datasets bundled locally for offline use.

  • Knowledge: MMLU, ARC-Challenge, KMMLU (Korean), CMMLU (Chinese), JMMLU (Japanese)
  • Reasoning: HellaSwag, Winogrande, TruthfulQA, GSM8K
  • Coding: HumanEval (164 function completions, pass@1), MBPP
  • Benchmark queue for sequential multi-model evaluation with persistent results
  • Comparison table with mode/sample columns and text export
  • Sample size options: 30/50/100/200/300/500/1000/2000/Full
  • Batch processing: 1x/2x/4x/8x/16x/32x
  • Download raw results as JSON
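
For reference, the pass@1 metric named for HumanEval above is conventionally computed with the unbiased pass@k estimator from Chen et al. (2021); this sketch shows that standard formula, not oMLX's exact code. With one sample per task (k = n = 1) it reduces to the plain fraction of tasks solved.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples for a task, c of them
    correct, the probability that at least one of k drawn samples is
    correct is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw passes
    return 1.0 - comb(n - c, k) / comb(n, k)
```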

New Features

  • Prefill memory guard. Prevents kernel panics on large contexts by detecting the O(n²) SDPA fallback for head_dim > 128 and enforcing safe prefill chunk sizes
  • Native BERT/XLMRoBERTa embedding. Load BERT-family embedding models (bge-m3, mxbai-embed) without mlx-embeddings fallback (#330 by @yes999zc)
  • Jina v3 reranker. Reranking via <|score_token|> logits for jinaai/jina-reranker-v3-mlx (#331 by @yes999zc)
  • Partial mode. Assistant message prefill support for Moonshot/Kimi K2 models (partial field + name field passthrough) (#306 by @blightbow)
  • Codex smart config merging. Non-destructive config merge with reasoning model auto-detection (#249 by @JasonYeYuhe)
  • i18n normalization. Normalize translation files against en.json with missing key detection (#247 by @xiaoran007)
  • Web dashboard generating status. Shows a generating status for active requests once prefill completes
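
The prefill memory guard above amounts to capping the chunk size when the attention path materializes an O(n²) score matrix. This sketch shows the shape of that policy; the chunk sizes, the head_dim > 128 trigger aside, and the function name are assumptions for illustration, not oMLX's actual values.

```python
def safe_prefill_chunks(n_tokens: int, head_dim: int,
                        default_chunk: int = 2048,
                        guarded_chunk: int = 512) -> list[tuple[int, int]]:
    """Illustrative prefill chunking guard (chunk sizes are assumed).

    When head_dim > 128 forces the O(n^2) SDPA fallback, each chunk's
    score matrix is chunk x n_tokens, so a smaller chunk caps peak
    memory. Returns (start, end) token ranges covering the prompt.
    """
    chunk = guarded_chunk if head_dim > 128 else default_chunk
    chunks, start = [], 0
    while start < n_tokens:
        end = min(start + chunk, n_tokens)
        chunks.append((start, end))
        start = end
    return chunks
```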

Experimental Features

  • SpecPrefill. Attention-based sparse prefill for MoE models. Reduces prefill compute by skipping low-attention tokens. System prompt is protected from token dropping to preserve instruction following.
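
The token-selection step in SpecPrefill can be pictured as follows. This is a hedged sketch of the general technique (score tokens by aggregate attention from a cheap draft pass, keep the top fraction, never drop the system prompt); the interface, keep_ratio parameter, and scoring are assumptions, not oMLX's implementation.

```python
import numpy as np

def select_prefill_tokens(attn_scores, system_len: int,
                          keep_ratio: float = 0.5) -> np.ndarray:
    """Illustrative attention-based sparse prefill selection.

    attn_scores[i]: aggregate attention received by token i in a draft
    pass. The first system_len positions (the system prompt) are always
    kept to preserve instruction following; the remainder is pruned to
    the highest-scoring keep_ratio fraction. Returns kept indices, sorted.
    """
    scores = np.asarray(attn_scores, dtype=np.float64)
    keep = np.zeros(scores.shape[0], dtype=bool)
    keep[:system_len] = True                    # system prompt is protected
    rest = np.arange(system_len, scores.shape[0])
    n_keep = int(round(keep_ratio * rest.size))
    if n_keep > 0:
        keep[rest[np.argsort(scores[rest])[-n_keep:]]] = True
    return np.flatnonzero(keep)
```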

Bug Fixes

  • Fix chat streaming failure not sending error message to client (#342)
  • Fix TTL auto-unload during benchmark causing Metal GPU crash
  • Fix dtype normalization on enhanced path causing OOM on large models
  • Fix oQ bf16→fp16 weight conversion causing 41% quantized value corruption
  • Fix oQ mxfp4 uint8 scales being force-cast to fp16
  • Fix oQ clip optimization mask dtype and position_ids for Qwen3.5
  • Fix oQ streaming quantization accuracy and VLM support
  • Fix MC benchmarks (MMLU, HellaSwag, TruthfulQA) always scoring 0% due to max_tokens=1
  • Fix HumanEval scoring. Prepend prompt imports when model returns function only
  • Fix MBPP scoring. Include test cases in prompt so model uses correct function name
  • Fix benchmark code extraction. Extract last answer/code block instead of first
  • Fix benchmark penalties. Force neutral presence_penalty=0 and repetition_penalty=1
  • Fix think prefix false positive for disabled thinking patterns (<think></think>)
  • Fix responses API image support for VLM + missing prompt_tokens in completions usage
  • Fix SSE streaming behind nginx reverse proxy (X-Accel-Buffering header) (#309)
  • Fix CausalLM-based embedding model detection (Qwen3-Embedding) (#327)
  • Fix web dashboard unload tooltip clipping in active models box (#314)
  • Fix web dashboard 401 warning log spam from dashboard polling
  • Fix web dashboard model settings not showing for embedding/reranker models
  • Fix PEP 735 dependency-groups for uv sync --dev (#305 by @blightbow)

Full changelog: v0.2.19...v0.2.20rc1
