This is a release candidate for v0.2.20. Please test and report any issues before the final release.
Highlights
oQ — oMLX universal dynamic quantization
Quantize any model directly from the web dashboard. oQ produces standard mlx-lm-compatible models that work everywhere with no custom loader required: in oMLX, in mlx-lm, and in any app that supports the MLX safetensors format.
oQ analyzes weight distributions per-layer and applies the optimal quantization format (mxfp4, mxfp8, affine) automatically. See the oQ documentation for details.
- oQ2-oQ8 levels with calibration datasets and CLIP-based evaluation
- oQ3.5 level: 3-bit base weights plus 4-bit expert down_proj (~3.9 bpw)
- Hybrid quantization. Per-layer mxfp4/mxfp8/affine format selection for better quality-size tradeoffs
- AutoAWQ weight equalization with per-expert scaling and data-driven sensitivity analysis for improved accuracy
- VLM support. Quantize vision-language models with vision weight preservation
- FP8 model support. Use native FP8 models (MiniMax, DeepSeek) as quantization source
- Clip optimization speedup. GPU batch size setting for faster AWQ-style clipping
- Inference request blocking during quantization to prevent conflicts
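As a rough illustration of the per-layer idea (a hypothetical heuristic, not oQ's actual selection logic), a format picker could branch on simple weight statistics such as the outlier-to-typical magnitude ratio:

```python
import random
import statistics

def pick_format(weights):
    # Hypothetical heuristic, not oQ's algorithm: choose a per-layer
    # quantization format from simple weight statistics.
    mags = [abs(w) for w in weights]
    # Ratio of the extreme outlier to the typical magnitude; heavy
    # outliers favor formats that can absorb them.
    outlier_ratio = max(mags) / (statistics.fmean(mags) + 1e-12)
    if outlier_ratio > 100:
        return "affine"   # outlier-heavy: keep per-group zero points
    if outlier_ratio > 20:
        return "mxfp8"    # moderate spread: 8-bit microscaling floats
    return "mxfp4"        # well-behaved: compress to 4-bit

random.seed(0)
smooth = [random.gauss(0.0, 1.0) for _ in range(4096)]  # near-Gaussian layer
spiky = smooth[:-1] + [500.0]                           # inject one outlier
print(pick_format(smooth), pick_format(spiky))          # mxfp4 affine
```

The thresholds here are made up for the example; the point is only that different layers can justify different formats.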
Intelligence benchmark suite
Evaluate model intelligence across knowledge, reasoning, math, and coding benchmarks. All datasets bundled locally for offline use.
- Knowledge: MMLU, ARC-Challenge, KMMLU (Korean), CMMLU (Chinese), JMMLU (Japanese)
- Reasoning: HellaSwag, Winogrande, TruthfulQA, GSM8K
- Coding: HumanEval (164 function completions, pass@1), MBPP
- Benchmark queue for sequential multi-model evaluation with persistent results
- Comparison table with mode/sample columns and text export
- Sample size options: 30/50/100/200/300/500/1000/2000/Full
- Batch processing: 1x/2x/4x/8x/16x/32x
- Download raw results as JSON
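For context on the pass@1 metric used for HumanEval above: with a single sampled completion per task, pass@1 reduces to the fraction of tasks whose completion passes all unit tests. A minimal sketch of that scoring loop (illustrative, not the suite's actual harness):

```python
def passes(candidate_src, test_src):
    # Run candidate and tests in a shared namespace; any exception
    # (including AssertionError) counts as a failed task.
    ns = {}
    try:
        exec(candidate_src, ns)
        exec(test_src, ns)
        return True
    except Exception:
        return False

def pass_at_1(flags):
    # With one sample per task, pass@1 is just the pass fraction.
    return sum(flags) / len(flags)

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
flags = [passes(src, tests) for src in (good, bad)]
print(pass_at_1(flags))  # one of two toy tasks passes -> 0.5
```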
New Features
- Prefill memory guard. Prevents kernel panics on large contexts by detecting the head_dim>128 O(n²) SDPA fallback and enforcing safe prefill chunk sizes
- Native BERT/XLMRoBERTa embedding. Load BERT-family embedding models (bge-m3, mxbai-embed) without mlx-embeddings fallback (#330 by @yes999zc)
- Jina v3 reranker. Reranking via `<|score_token|>` logits for jinaai/jina-reranker-v3-mlx (#331 by @yes999zc)
- Partial mode. Assistant message prefill support for Moonshot/Kimi K2 models (`partial` field + `name` field passthrough) (#306 by @blightbow)
- Codex smart config merging. Non-destructive config merge with reasoning model auto-detection (#249 by @JasonYeYuhe)
- i18n normalization. Normalize translation files against en.json with missing key detection (#247 by @xiaoran007)
- Web dashboard generating status. Show generating status for active requests after prefill completes
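On the prefill memory guard above: a fallback SDPA path that materializes the full attention score matrix grows with query length times KV length, which is what chunking bounds. A back-of-the-envelope sketch with illustrative numbers (not oMLX's actual guard logic):

```python
def attn_scores_bytes(q_len, kv_len, n_heads, bytes_per_el=2):
    # A fallback SDPA path materializes an (n_heads, q_len, kv_len)
    # fp16 score matrix; memory grows with q_len * kv_len.
    return n_heads * q_len * kv_len * bytes_per_el

def safe_chunk(kv_len, n_heads, budget_bytes):
    # Largest prefill chunk whose score matrix fits the budget.
    return max(1, budget_bytes // (n_heads * kv_len * 2))

ctx = 131_072   # 128k-token context
heads = 64      # illustrative head count
full = attn_scores_bytes(ctx, ctx, heads)
print(full / 2**30)                        # 2048.0 GiB unchunked
print(safe_chunk(ctx, heads, 2 * 2**30))   # 128-token chunks fit 2 GiB
```

The guard's real thresholds and detection are internal to oMLX; the sketch only shows why an unchunked large-context prefill can exhaust memory on the quadratic path.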
Experimental Features
- SpecPrefill. Attention-based sparse prefill for MoE models. Reduces prefill compute by skipping low-attention tokens. System prompt is protected from token dropping to preserve instruction following.
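A toy sketch of the SpecPrefill idea (not oMLX's implementation): rank prompt tokens by an attention-importance score and keep only the top fraction, while unconditionally keeping every system-prompt position.

```python
def sparse_prefill_keep(scores, sys_len, keep_ratio):
    # scores: per-token attention importance from a cheap pass.
    # The first sys_len positions (system prompt) are always kept
    # so instruction following is preserved.
    n = len(scores)
    n_keep = max(sys_len, int(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = set(range(sys_len))
    for i in ranked:
        if len(keep) >= n_keep:
            break
        keep.add(i)
    return sorted(keep)

scores = [0.9, 0.8, 0.1, 0.05, 0.7, 0.02, 0.6, 0.3]
print(sparse_prefill_keep(scores, sys_len=2, keep_ratio=0.5))
# [0, 1, 4, 6]: system tokens 0-1 plus the two highest-scoring rest
```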
Bug Fixes
- Fix chat streaming failure not sending error message to client (#342)
- Fix TTL auto-unload during benchmark causing Metal GPU crash
- Fix dtype normalization on enhanced path causing OOM on large models
- Fix oQ bf16→fp16 weight conversion causing 41% quantized value corruption
- Fix oQ mxfp4 uint8 scales being force-cast to fp16
- Fix oQ clip optimization mask dtype and position_ids for Qwen3.5
- Fix oQ streaming quantization accuracy and VLM support
- Fix MC benchmarks (MMLU, HellaSwag, TruthfulQA) always scoring 0% due to max_tokens=1
- Fix HumanEval scoring. Prepend prompt imports when model returns function only
- Fix MBPP scoring. Include test cases in prompt so model uses correct function name
- Fix benchmark code extraction. Extract last answer/code block instead of first
- Fix benchmark penalties. Force neutral presence_penalty=0 and repetition_penalty=1
- Fix think prefix false positive for disabled thinking patterns (`<think></think>`)
- Fix responses API image support for VLM + missing prompt_tokens in completions usage
- Fix SSE streaming behind nginx reverse proxy (X-Accel-Buffering header) (#309)
- Fix CausalLM-based embedding model detection (Qwen3-Embedding) (#327)
- Fix web dashboard unload tooltip clipping in active models box (#314)
- Fix web dashboard 401 warning log spam from dashboard polling
- Fix web dashboard model settings not showing for embedding/reranker models
- Fix PEP 735 dependency-groups for `uv sync --dev` (#305 by @blightbow)
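On the benchmark code-extraction fix above (grading the last code block instead of the first): models often emit a draft block before the final answer, so scoring the last fenced block avoids grading the draft. A minimal sketch of that extraction, not the suite's actual parser:

```python
import re

FENCE = "`" * 3  # triple backtick, built up to keep this snippet readable

def extract_last_code_block(text):
    # Capture every fenced block (optional language tag), return the last.
    pattern = FENCE + r"(?:\w+)?\n(.*?)" + FENCE
    blocks = re.findall(pattern, text, re.DOTALL)
    return blocks[-1].strip() if blocks else None

reply = (
    "Draft:\n" + FENCE + "python\nreturn a - b\n" + FENCE + "\n"
    "Final:\n" + FENCE + "python\nreturn a + b\n" + FENCE + "\n"
)
print(extract_last_code_block(reply))  # return a + b
```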
New Contributors
- @blightbow made their first contribution in #305
- @yes999zc made their first contribution in #330
- @JasonYeYuhe made their first contribution in #249
- @xiaoran007 made their first contribution in #247
Full changelog: v0.2.19...v0.2.20rc1