jundot/omlx v0.2.20 on GitHub

Highlights

oQ — oMLX universal dynamic quantization

Quantization should not be exclusive to any particular inference server. oQ produces standard mlx-lm compatible models that work everywhere — oMLX, mlx-lm, and any app that supports MLX safetensors. No custom loader required.

oQ is a data-driven mixed-precision quantization system for Apple Silicon. Instead of assigning bits by fixed rules or tensor type, oQ measures each layer's actual quantization sensitivity through calibration and allocates bits where the data says they matter most. See the oQ documentation for details.

Benchmarks (Qwen3.5-35B-A3B)

Benchmark	Samples	2-bit mlx-lm	2-bit oQ	3-bit mlx-lm	3-bit oQ	4-bit mlx-lm	4-bit oQ
MMLU	300	14.0%	64.0%	76.3%	85.0%	79.7%	83.3%
TRUTHFULQA	300	17.0%	80.0%	81.7%	86.7%	87.7%	88.0%
HUMANEVAL	164 (full)	0.0%	78.0%	84.8%	86.6%	87.2%	85.4%
MBPP	300	0.3%	63.3%	69.0%	72.0%	71.7%	74.3%

oQ2-oQ8 levels with sensitivity-driven mixed-precision bit allocation
oQ3.5 base 3-bit + routed expert down_proj 4-bit (Super Weights protection)
AWQ weight equalization rewritten from scratch following the llm-compressor reference implementation. fixed critical double-scaling bug on hybrid attention models (Qwen3.5) and added per-layer mask-aware calibration
Sensitivity-driven budget plan. mandatory lm_head 8-bit protection, then data-driven tier allocation (+4/+2/+1 bits) with greedy fallback. no hardcoded tensor-type priorities — calibration data decides which layers matter
Proxy sensitivity model. select a quantized version of the source model for layer sensitivity analysis with ~4x less memory. 90% top-10 overlap with full-precision measurement validated on Qwen3.5-35B
New calibration dataset. 600 samples from codeparrot/self-instruct-starcoder (real code), allenai/c4 (web text), Open-Orca (conversation), gsm8k (reasoning), and wikipedia multilingual. replaces the old HumanEval/MBPP-only code samples
VLM support. quantize vision-language models with vision weight preservation (fp16)
FP8 model support. use native FP8 models (MiniMax, DeepSeek) as quantization source
MiniMax M2.5 support. block_sparse_moe architecture with SwitchGLU fused experts
DeepSeek V3.2 support. shared_experts (plural) + MLA projections. MLP AWQ works, MLA attention AWQ planned
Nemotron support. backbone.embeddings path detection for sensitivity measurement on hybrid Mamba+MoE+Attention architecture
AWQ grid size setting. configurable n_grid (10 fast / 20 recommended) from the web UI
HuggingFace Hub uploader. upload quantized models directly from the dashboard
blocks inference requests during quantization to prevent conflicts

Intelligence benchmark suite

Evaluate model intelligence across knowledge, reasoning, math, and coding benchmarks. All datasets bundled locally for offline use.

Knowledge: MMLU, ARC-Challenge, KMMLU (Korean), CMMLU (Chinese), JMMLU (Japanese)
Reasoning: HellaSwag, Winogrande, TruthfulQA, GSM8K
Coding: HumanEval (164 function completions, pass@1), MBPP
benchmark queue for sequential multi-model evaluation with persistent results
comparison table with mode/sample columns and text export
sample size options: 30/50/100/200/300/500/1000/2000/Full
batch processing: 1x/2x/4x/8x/16x/32x
download raw results as JSON

New Features

Prefill memory guard. prevents kernel panics on large context by detecting head_dim>128 O(n^2) SDPA fallback and enforcing safe prefill chunk sizes
Native BERT/XLMRoBERTa embedding. load BERT-family embedding models (bge-m3, mxbai-embed) without mlx-embeddings fallback (#330 by @yes999zc)
Jina v3 reranker. reranking via <|score_token|> logits for jinaai/jina-reranker-v3-mlx (#331 by @yes999zc)
Partial mode. assistant message prefill support for Moonshot/Kimi K2 models (partial field + name field passthrough) (#306 by @blightbow)
Codex smart config merging. non-destructive config merge with reasoning model auto-detection (#249 by @JasonYeYuhe)
i18n normalization. normalize translation files against en.json with missing key detection (#247 by @xiaoran007)
Web dashboard generating status. show generating status for active requests after prefill completes

Experimental Features

SpecPrefill. attention-based sparse prefill for MoE models. reduces prefill compute by skipping low-attention tokens. system prompt is protected from token dropping to preserve instruction following.

Bug Fixes

Fix lucide icon rendering race condition with Alpine.js microtask
Fix chat streaming failure not sending error message to client (#342)
Fix TTL auto-unload during benchmark causing Metal GPU crash
Fix MC benchmarks (MMLU, HellaSwag, TruthfulQA) always scoring 0% due to max_tokens=1
Fix HumanEval scoring. prepend prompt imports when model returns function only
Fix MBPP scoring. include test cases in prompt so model uses correct function name
Fix benchmark code extraction. extract last answer/code block instead of first
Fix benchmark penalties. force neutral presence_penalty=0 and repetition_penalty=1
Fix think prefix false positive for disabled thinking patterns (<think></think>)
Fix responses API image support for VLM + missing prompt_tokens in completions usage
Fix SSE streaming behind nginx reverse proxy (X-Accel-Buffering header) (#309)
Fix CausalLM-based embedding model detection (Qwen3-Embedding) (#327)
Fix web dashboard unload tooltip clipping in active models box (#314)
Fix web dashboard 401 warning log spam from dashboard polling
Fix web dashboard model settings not showing for embedding/reranker models
Fix PEP 735 dependency-groups for uv sync --dev (#305 by @blightbow)

New Contributors

@blightbow made their first contribution in #305
@yes999zc made their first contribution in #330
@JasonYeYuhe made their first contribution in #249
@xiaoran007 made their first contribution in #247

Full changelog: v0.2.19...v0.2.20