github jundot/omlx v0.2.20

7 hours ago

Highlights

oQ — oMLX universal dynamic quantization

Quantization should not be exclusive to any particular inference server. oQ produces standard mlx-lm compatible models that work everywhere — oMLX, mlx-lm, and any app that supports MLX safetensors. No custom loader required.

oQ is a data-driven mixed-precision quantization system for Apple Silicon. Instead of assigning bits by fixed rules or tensor type, oQ measures each layer's actual quantization sensitivity through calibration and allocates bits where the data says they matter most. See the oQ documentation for details.
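As a rough illustration of what "measuring quantization sensitivity through calibration" can mean, the sketch below round-trips each layer's weights through a simple uniform quantizer and scores the layer by the output error it causes on calibration inputs. This is not oQ's implementation: the quantizer, the dot-product "layer", and the names `fake_quantize` and `sensitivity` are assumptions chosen to keep the example self-contained.

```python
import random

def fake_quantize(weights, bits):
    """Round-trip floats through uniform affine quantization at `bits` bits."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(weights)
    scale = (hi - lo) / levels
    return [lo + round((w - lo) / scale) * scale for w in weights]

def layer_output(weights, x):
    """Stand-in for a layer: dot product of a weight vector and an input."""
    return sum(w * xi for w, xi in zip(weights, x))

def sensitivity(weights, calib_inputs, bits):
    """Mean squared output error caused by quantizing this layer to `bits` bits."""
    wq = fake_quantize(weights, bits)
    errs = [(layer_output(weights, x) - layer_output(wq, x)) ** 2
            for x in calib_inputs]
    return sum(errs) / len(errs)

rng = random.Random(0)
calib = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(32)]

# Two toy layers (hypothetical): one well-behaved, one with a single large
# outlier weight that stretches the quantization range.
layers = {
    "smooth": [rng.gauss(0, 0.02) for _ in range(64)],
    "outlier": [rng.gauss(0, 0.02) for _ in range(63)] + [1.0],
}

scores = {name: sensitivity(w, calib, bits=3) for name, w in layers.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # most sensitive layer first
```

In this toy setup the layer with the outlier weight ranks as more sensitive, which is the kind of signal a data-driven allocator can use to decide where extra bits go.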

Benchmarks (Qwen3.5-35B-A3B)

Benchmark   Samples     2-bit mlx-lm  2-bit oQ  3-bit mlx-lm  3-bit oQ  4-bit mlx-lm  4-bit oQ
MMLU        300         14.0%         64.0%     76.3%         85.0%     79.7%         83.3%
TRUTHFULQA  300         17.0%         80.0%     81.7%         86.7%     87.7%         88.0%
HUMANEVAL   164 (full)  0.0%          78.0%     84.8%         86.6%     87.2%         85.4%
MBPP        300         0.3%          63.3%     69.0%         72.0%     71.7%         74.3%

  • oQ2-oQ8 levels with sensitivity-driven mixed-precision bit allocation
  • oQ3.5: 3-bit base with routed expert down_proj at 4-bit (Super Weights protection)
  • AWQ weight equalization rewritten from scratch, following the llm-compressor reference implementation. Fixes a critical double-scaling bug on hybrid attention models (Qwen3.5) and adds per-layer mask-aware calibration
  • Sensitivity-driven budget plan. Mandatory lm_head 8-bit protection, then data-driven tier allocation (+4/+2/+1 bits) with greedy fallback. No hardcoded tensor-type priorities: calibration data decides which layers matter
  • Proxy sensitivity model. Select a quantized version of the source model for layer sensitivity analysis with ~4x less memory. Validated on Qwen3.5-35B, where it reached 90% top-10 overlap with the full-precision measurement
  • New calibration dataset. 600 samples from codeparrot/self-instruct-starcoder (real code), allenai/c4 (web text), Open-Orca (conversation), gsm8k (reasoning), and Wikipedia (multilingual). Replaces the old HumanEval/MBPP-only code samples
  • VLM support. Quantize vision-language models with vision weights preserved in fp16
  • FP8 model support. Use native FP8 models (MiniMax, DeepSeek) as the quantization source
  • MiniMax M2.5 support. Handles the block_sparse_moe architecture with SwitchGLU fused experts
  • DeepSeek V3.2 support. Handles shared_experts (plural) and MLA projections. AWQ works for MLP layers; AWQ for MLA attention is planned
  • Nemotron support. Detects the backbone.embeddings path for sensitivity measurement on the hybrid Mamba+MoE+Attention architecture
  • AWQ grid size setting. Configurable n_grid (10 fast / 20 recommended) from the web UI
  • HuggingFace Hub uploader. Upload quantized models directly from the dashboard
  • Inference requests are blocked during quantization to prevent conflicts
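
The budget plan described in the list above (lm_head pinned to 8-bit, then +4/+2/+1 bit tiers handed out by sensitivity) can be pictured with the sketch below. It is one possible reading of that description, not the oQ source: the function name `plan_bits`, the average-bits budget, and the interpretation of "greedy fallback" as "try the next smaller tier when the larger one does not fit" are all assumptions.

```python
def plan_bits(layers, base_bits, target_avg_bits, boosts=(4, 2, 1), max_bits=8):
    """layers: name -> (num_params, sensitivity). Returns name -> bit width."""
    total_params = sum(n for n, _ in layers.values())
    budget = target_avg_bits * total_params      # total bits we may spend

    plan = {name: base_bits for name in layers}
    if "lm_head" in plan:
        plan["lm_head"] = 8                      # mandatory protection
    used = sum(plan[name] * layers[name][0] for name in layers)

    # Most sensitive layers get first pick of the remaining budget.
    order = sorted((n for n in layers if n != "lm_head"),
                   key=lambda n: layers[n][1], reverse=True)
    for name in order:
        n_params = layers[name][0]
        for boost in boosts:                     # fall back to smaller tiers
            new_bits = min(base_bits + boost, max_bits)
            extra = (new_bits - base_bits) * n_params
            if extra > 0 and used + extra <= budget:
                plan[name] = new_bits
                used += extra
                break
    return plan

# Hypothetical layer sizes and sensitivity scores.
layers = {
    "lm_head": (100, 0.0),
    "layers.0.down_proj": (1000, 0.9),
    "layers.0.up_proj": (1000, 0.5),
    "layers.1.down_proj": (1000, 0.1),
}
plan = plan_bits(layers, base_bits=3, target_avg_bits=4.2)
print(plan)
# {'lm_head': 8, 'layers.0.down_proj': 5, 'layers.0.up_proj': 4, 'layers.1.down_proj': 3}
```

Here the most sensitive layer cannot afford the +4 tier within the budget, so it falls back to +2; the next one falls back to +1, and the least sensitive layer stays at the base width.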

Intelligence benchmark suite

Evaluate model intelligence across knowledge, reasoning, math, and coding benchmarks. All datasets bundled locally for offline use.

  • Knowledge: MMLU, ARC-Challenge, KMMLU (Korean), CMMLU (Chinese), JMMLU (Japanese)
  • Reasoning: HellaSwag, Winogrande, TruthfulQA, GSM8K
  • Coding: HumanEval (164 function completions, pass@1), MBPP
  • Benchmark queue for sequential multi-model evaluation with persistent results
  • Comparison table with mode/sample columns and text export
  • Sample size options: 30/50/100/200/300/500/1000/2000/Full
  • Batch processing: 1x/2x/4x/8x/16x/32x
  • Download raw results as JSON

New Features

  • Prefill memory guard. Prevents kernel panics on large contexts by detecting the O(n^2) SDPA fallback that occurs when head_dim > 128 and enforcing safe prefill chunk sizes
  • Native BERT/XLMRoBERTa embedding. Load BERT-family embedding models (bge-m3, mxbai-embed) without the mlx-embeddings fallback (#330 by @yes999zc)
  • Jina v3 reranker. Reranking via <|score_token|> logits for jinaai/jina-reranker-v3-mlx (#331 by @yes999zc)
  • Partial mode. Assistant message prefill support for Moonshot/Kimi K2 models (partial field + name field passthrough) (#306 by @blightbow)
  • Codex smart config merging. Non-destructive config merge with reasoning model auto-detection (#249 by @JasonYeYuhe)
  • i18n normalization. Normalize translation files against en.json with missing key detection (#247 by @xiaoran007)
  • Web dashboard generating status. Show generating status for active requests after prefill completes
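
To make the prefill memory guard more concrete, the sketch below picks a prefill chunk size so that a materialized attention score buffer stays under a memory budget whenever head_dim exceeds 128. The function name `safe_prefill_chunk`, the 2 GiB budget, the default chunk of 2048, and the buffer-size formula are illustrative assumptions, not values taken from oMLX.

```python
def safe_prefill_chunk(head_dim, n_heads, context_len,
                       default_chunk=2048,
                       score_budget_bytes=2 * 1024**3,
                       bytes_per_score=4,
                       min_chunk=32):
    """Return a prefill chunk size (in tokens) that bounds score-buffer memory."""
    if head_dim <= 128:
        # Assumed: the fused SDPA path is used, so no quadratic score buffer.
        return default_chunk
    # Assumed fallback cost: a (n_heads, chunk, context_len) score tensor.
    bytes_per_query_row = n_heads * context_len * bytes_per_score
    chunk = score_budget_bytes // bytes_per_query_row
    return max(min_chunk, min(default_chunk, chunk))

print(safe_prefill_chunk(head_dim=128, n_heads=32, context_len=131072))  # 2048
print(safe_prefill_chunk(head_dim=256, n_heads=32, context_len=131072))  # 128
```

With these assumed numbers, a 128K context at head_dim 256 is limited to 128-token prefill chunks, while models on the fused path keep the default.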

Experimental Features

  • SpecPrefill. Attention-based sparse prefill for MoE models. Reduces prefill compute by skipping low-attention tokens. The system prompt is protected from token dropping to preserve instruction following.
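
The token-selection idea behind SpecPrefill can be sketched as follows: given a per-token importance score, keep only the highest-scoring fraction of the prompt while always keeping the system-prompt prefix. How oMLX actually derives the scores is not covered here; `select_prefill_tokens`, `keep_ratio`, and `protected_len` are hypothetical names for illustration.

```python
def select_prefill_tokens(scores, keep_ratio, protected_len):
    """Return indices of tokens to prefill, in original order.

    scores:        per-token importance (e.g. aggregated attention weights)
    keep_ratio:    fraction of non-protected tokens to keep
    protected_len: number of leading tokens (system prompt) never dropped
    """
    n = len(scores)
    protected = list(range(min(protected_len, n)))
    candidates = range(len(protected), n)
    n_keep = int(len(candidates) * keep_ratio)
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:n_keep]
    return protected + sorted(top)

scores = [0.0, 0.0, 0.9, 0.1, 0.5, 0.2]   # first two tokens: system prompt
kept = select_prefill_tokens(scores, keep_ratio=0.5, protected_len=2)
print(kept)  # [0, 1, 2, 4]
```

The system-prompt tokens survive even though their scores are the lowest, which mirrors the protection described above.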

Bug Fixes

  • Fix lucide icon rendering race condition with Alpine.js microtask
  • Fix chat streaming failure not sending error message to client (#342)
  • Fix TTL auto-unload during benchmark causing Metal GPU crash
  • Fix multiple-choice (MC) benchmarks (MMLU, HellaSwag, TruthfulQA) always scoring 0% due to max_tokens=1
  • Fix HumanEval scoring. Prepend prompt imports when the model returns only the function
  • Fix MBPP scoring. Include test cases in the prompt so the model uses the correct function name
  • Fix benchmark code extraction. Extract the last answer/code block instead of the first
  • Fix benchmark penalties. Force neutral presence_penalty=0 and repetition_penalty=1
  • Fix think prefix false positive for disabled thinking patterns (<think></think>)
  • Fix responses API image support for VLMs and missing prompt_tokens in completions usage
  • Fix SSE streaming behind nginx reverse proxy (X-Accel-Buffering header) (#309)
  • Fix CausalLM-based embedding model detection (Qwen3-Embedding) (#327)
  • Fix web dashboard unload tooltip clipping in active models box (#314)
  • Fix web dashboard 401 warning log spam from dashboard polling
  • Fix web dashboard model settings not showing for embedding/reranker models
  • Fix PEP 735 dependency-groups for uv sync --dev (#305 by @blightbow)
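
For the benchmark code-extraction fix listed above, a minimal sketch of "take the last fenced block rather than the first" looks like this. It is an illustration using Python's standard `re` module, not the oMLX code; `extract_last_code_block` is a hypothetical name, and the fence marker is built with string multiplication only to keep the example readable.

```python
import re

TICKS = "`" * 3  # fence marker: three backticks
FENCE = re.compile(TICKS + r"[\w+-]*\n(.*?)" + TICKS, re.DOTALL)

def extract_last_code_block(text):
    """Return the body of the last fenced block, or the stripped text if none."""
    blocks = FENCE.findall(text)
    return blocks[-1].strip() if blocks else text.strip()

# A model reply that drafts an answer first and then corrects itself.
reply = (
    "Draft:\n" + TICKS + "python\ndef f(): return 1\n" + TICKS +
    "\nFinal:\n" + TICKS + "python\ndef f(): return 2\n" + TICKS
)
print(extract_last_code_block(reply))  # def f(): return 2
```

Models that reason before answering often emit a draft block first, so scoring the first block would grade the draft rather than the final answer.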

New Contributors

Full changelog: v0.2.19...v0.2.20
