jundot/omlx v0.2.20rc1

Pre-release · 2 days ago

This is a release candidate for v0.2.20. Please test and report any issues before the final release.

Highlights

oQ — oMLX universal dynamic quantization

Quantize any model directly from the web dashboard. oQ produces standard mlx-lm-compatible models that work everywhere with no custom loader required: oMLX, mlx-lm, and any app that supports the MLX safetensors format.

oQ analyzes weight distributions per-layer and applies the optimal quantization format (mxfp4, mxfp8, affine) automatically. See the oQ documentation for details.

  • oQ2-oQ8 levels with calibration datasets and CLIP-based evaluation
  • oQ3.5: 3-bit base weights with 4-bit expert down_proj (~3.9 bpw)
  • Hybrid quantization: per-layer mxfp4/mxfp8/affine format selection for better quality-size tradeoffs
  • AutoAWQ weight equalization with per-expert scaling and data-driven sensitivity analysis for improved accuracy
  • VLM support: quantize vision-language models with vision weight preservation
  • FP8 model support: use native FP8 models (MiniMax, DeepSeek) as the quantization source
  • Clip optimization speedup: GPU batch size setting for faster AWQ-style clipping
  • Inference requests are blocked during quantization to prevent conflicts
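
To make the hybrid mode concrete, here is a minimal sketch of the kind of per-layer format selection it describes. The heuristic, thresholds, and function name below are illustrative assumptions, not oMLX's actual logic: the idea is that layers with heavy-tailed or outlier-rich weight distributions benefit from block-scaled floating-point formats (mxfp4/mxfp8), while near-Gaussian layers quantize well on an affine integer grid.

```python
import numpy as np

def pick_format(weights: np.ndarray, outlier_z: float = 6.0) -> str:
    """Illustrative per-layer format picker (NOT oMLX's actual heuristic).

    Returns "mxfp8" for layers with many extreme outliers (need dynamic
    range), "mxfp4" for heavy-tailed layers, "affine" otherwise.
    """
    w = weights.ravel().astype(np.float64)
    z = np.abs(w - w.mean()) / (w.std() + 1e-12)
    outlier_frac = float((z > outlier_z).mean())
    if outlier_frac > 1e-3:
        return "mxfp8"  # many extreme values: keep more dynamic range
    # Excess tail weight (kurtosis ~3 for Gaussian) favors mx formats.
    kurtosis = float(((w - w.mean()) ** 4).mean() / (w.var() + 1e-12) ** 2)
    return "mxfp4" if kurtosis > 3.5 else "affine"
```

A real implementation would also weigh calibration-set sensitivity (as the AutoAWQ bullet above suggests), not just the raw weight distribution.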

Intelligence benchmark suite

Evaluate model intelligence across knowledge, reasoning, math, and coding benchmarks. All datasets bundled locally for offline use.

  • Knowledge: MMLU, ARC-Challenge, KMMLU (Korean), CMMLU (Chinese), JMMLU (Japanese)
  • Reasoning: HellaSwag, Winogrande, TruthfulQA, GSM8K
  • Coding: HumanEval (164 function completions, pass@1), MBPP
  • Benchmark queue for sequential multi-model evaluation with persistent results
  • Comparison table with mode/sample columns and text export
  • Sample size options: 30/50/100/200/300/500/1000/2000/Full
  • Batch processing: 1x/2x/4x/8x/16x/32x
  • Download raw results as JSON
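
For reference, the pass@1 metric named for HumanEval above is conventionally computed with the unbiased pass@k estimator from Chen et al. (2021); this sketch shows that standard formula, not oMLX's exact code. With one sample per task (k = n = 1) it reduces to the plain fraction of tasks solved.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples for a task, c of them
    correct, the probability that at least one of k drawn samples is
    correct is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw passes
    return 1.0 - comb(n - c, k) / comb(n, k)
```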

New Features

  • Prefill memory guard. Prevents kernel panics on large contexts by detecting the O(n²) SDPA fallback for head_dim > 128 and enforcing safe prefill chunk sizes
  • Native BERT/XLMRoBERTa embedding. Load BERT-family embedding models (bge-m3, mxbai-embed) without mlx-embeddings fallback (#330 by @yes999zc)
  • Jina v3 reranker. Reranking via <|score_token|> logits for jinaai/jina-reranker-v3-mlx (#331 by @yes999zc)
  • Partial mode. Assistant message prefill support for Moonshot/Kimi K2 models (partial field + name field passthrough) (#306 by @blightbow)
  • Codex smart config merging. Non-destructive config merge with reasoning model auto-detection (#249 by @JasonYeYuhe)
  • i18n normalization. Normalize translation files against en.json with missing key detection (#247 by @xiaoran007)
  • Web dashboard generating status. Shows a generating status for active requests once prefill completes
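
The prefill memory guard above amounts to capping the chunk size when the attention path materializes an O(n²) score matrix. This sketch shows the shape of that policy; the chunk sizes, the head_dim > 128 trigger aside, and the function name are assumptions for illustration, not oMLX's actual values.

```python
def safe_prefill_chunks(n_tokens: int, head_dim: int,
                        default_chunk: int = 2048,
                        guarded_chunk: int = 512) -> list[tuple[int, int]]:
    """Illustrative prefill chunking guard (chunk sizes are assumed).

    When head_dim > 128 forces the O(n^2) SDPA fallback, each chunk's
    score matrix is chunk x n_tokens, so a smaller chunk caps peak
    memory. Returns (start, end) token ranges covering the prompt.
    """
    chunk = guarded_chunk if head_dim > 128 else default_chunk
    chunks, start = [], 0
    while start < n_tokens:
        end = min(start + chunk, n_tokens)
        chunks.append((start, end))
        start = end
    return chunks
```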

Experimental Features

  • SpecPrefill. Attention-based sparse prefill for MoE models. Reduces prefill compute by skipping low-attention tokens. System prompt is protected from token dropping to preserve instruction following.
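
The token-selection step in SpecPrefill can be pictured as follows. This is a hedged sketch of the general technique (score tokens by aggregate attention from a cheap draft pass, keep the top fraction, never drop the system prompt); the interface, keep_ratio parameter, and scoring are assumptions, not oMLX's implementation.

```python
import numpy as np

def select_prefill_tokens(attn_scores, system_len: int,
                          keep_ratio: float = 0.5) -> np.ndarray:
    """Illustrative attention-based sparse prefill selection.

    attn_scores[i]: aggregate attention received by token i in a draft
    pass. The first system_len positions (the system prompt) are always
    kept to preserve instruction following; the remainder is pruned to
    the highest-scoring keep_ratio fraction. Returns kept indices, sorted.
    """
    scores = np.asarray(attn_scores, dtype=np.float64)
    keep = np.zeros(scores.shape[0], dtype=bool)
    keep[:system_len] = True                    # system prompt is protected
    rest = np.arange(system_len, scores.shape[0])
    n_keep = int(round(keep_ratio * rest.size))
    if n_keep > 0:
        keep[rest[np.argsort(scores[rest])[-n_keep:]]] = True
    return np.flatnonzero(keep)
```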

Bug Fixes

  • Fix chat streaming failure not sending error message to client (#342)
  • Fix TTL auto-unload during benchmark causing Metal GPU crash
  • Fix dtype normalization on enhanced path causing OOM on large models
  • Fix oQ bf16→fp16 weight conversion causing 41% quantized value corruption
  • Fix oQ mxfp4 uint8 scales being force-cast to fp16
  • Fix oQ clip optimization mask dtype and position_ids for Qwen3.5
  • Fix oQ streaming quantization accuracy and VLM support
  • Fix MC benchmarks (MMLU, HellaSwag, TruthfulQA) always scoring 0% due to max_tokens=1
  • Fix HumanEval scoring. Prepend prompt imports when model returns function only
  • Fix MBPP scoring. Include test cases in prompt so model uses correct function name
  • Fix benchmark code extraction. Extract last answer/code block instead of first
  • Fix benchmark penalties. Force neutral presence_penalty=0 and repetition_penalty=1
  • Fix think prefix false positive for disabled thinking patterns (<think></think>)
  • Fix responses API image support for VLM + missing prompt_tokens in completions usage
  • Fix SSE streaming behind nginx reverse proxy (X-Accel-Buffering header) (#309)
  • Fix CausalLM-based embedding model detection (Qwen3-Embedding) (#327)
  • Fix web dashboard unload tooltip clipping in active models box (#314)
  • Fix web dashboard 401 warning log spam from dashboard polling
  • Fix web dashboard model settings not showing for embedding/reranker models
  • Fix PEP 735 dependency-groups for uv sync --dev (#305 by @blightbow)

Full changelog: v0.2.19...v0.2.20rc1
