jundot/omlx v0.3.9rc1 on GitHub

This is the release candidate for 0.3.9. Big thanks to all the contributors who landed PRs since dev2. If you hit a bug, please open an issue. Stable release planned after ~24h of testing.

Highlights

Stronger memory management for low-memory Macs

A new phys_footprint-based memory enforcer with prefill admission control measures the same metric the OS uses for jetsam decisions, so the server stops accepting work before it gets killed instead of after. In the same area, the hot-cache eviction race is fixed (#1298), SSD↔hot block preloading is parallelized (#1301), and per-model cache hit-rate is now observable (#1183), all thanks to @ivaniguarans. A real-time memory bar also lands on the admin dashboard (#1278, thanks @beamivalice).
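As a rough illustration of what prefill admission control means, here is a minimal sketch. It is not omlx's implementation: the `PrefillAdmission` class, its fields, and the byte figures are hypothetical, and the footprint reading is injected as a plain callable rather than read from the OS.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PrefillAdmission:
    """Hypothetical admission gate, for illustration only."""

    read_footprint: Callable[[], int]  # returns the current footprint in bytes
    limit_bytes: int                   # ceiling the process should stay under

    def admit(self, estimated_prefill_bytes: int) -> bool:
        # Admit the request only if the projected footprint stays under the ceiling.
        projected = self.read_footprint() + estimated_prefill_bytes
        return projected <= self.limit_bytes


GIB = 1024**3

# A fake reading of 6 GiB against an 8 GiB ceiling (made-up numbers).
gate = PrefillAdmission(read_footprint=lambda: 6 * GIB, limit_bytes=8 * GIB)
small_ok = gate.admit(1 * GIB)  # projected 7 GiB, fits
large_ok = gate.admit(3 * GIB)  # projected 9 GiB, rejected before any work starts
```

The point of checking before prefill starts is that a rejected request costs an error response, while an over-budget one can cost the whole process.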

Chunked prefill (#1224)

A long-context prompt no longer blocks decode for other in-flight requests: prefill advances one chunk per scheduler step, so concurrent requests keep streaming tokens through it. Off by default, toggleable from admin. Thanks @drumtorben.
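The scheduling idea described above can be sketched with a toy loop. This is an assumption-laden simplification, not the scheduler from #1224: `Request`, `scheduler_step`, and `chunk_size` are hypothetical names, and real prefill and decode are replaced by counters.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request record for the toy scheduler."""

    name: str
    prompt_len: int                      # total prompt tokens to prefill
    prefilled: int = 0                   # prompt tokens processed so far
    decoded: list = field(default_factory=list)

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


def scheduler_step(requests, chunk_size):
    """One step: a prefilling request advances one chunk, any other request decodes one token."""
    for req in requests:
        if req.in_prefill:
            req.prefilled = min(req.prompt_len, req.prefilled + chunk_size)
        else:
            req.decoded.append(f"tok{len(req.decoded)}")


long_req = Request("long", prompt_len=10)
short_req = Request("short", prompt_len=2, prefilled=2)  # already past prefill

steps = 0
while long_req.in_prefill:
    scheduler_step([long_req, short_req], chunk_size=4)
    steps += 1
```

With a 10-token prompt and a chunk size of 4, the long request needs three steps to finish prefill, and the short request emits a token on each of them instead of waiting.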

Chat multi-tasking (#1231)

Multiple chats can now run in parallel from the admin chat, and the "wrong response stitched into the wrong chat" bug that came along with concurrent streaming is fixed at the same time. Thanks @beamivalice.

New Features

dflash-mlx bumped to 1ba6713 (v0.1.7), bringing a Qwen thinking / GDN exactness fix, GQA SDPA reshape, DDTree + CopySpec decode path, prefix cache identity hardening, and FP16 draft on older Apple chips. Thanks to @bstnxbt for dflash-mlx. dflash also exposes draft_window_size / draft_sink_size / verify_mode in model settings (#1276).
Native reasoning support for the Responses API so reasoning blocks survive across tool-call round-trips (#1245). Thanks @a4501150.
Split store_cache_main_prep into sub-phase timers so the slowest sub-step is visible in the scheduler logs (#1243). Thanks @ivaniguarans.
Load pre-saved oQ sensitivity map from the source model folder so re-quantization can skip the measurement pass (#1295). Thanks @deepsweet.
Sortable Browse Models table plus a mobile-friendly layout (#830). Thanks @omniwired.
Minimum display-level filter in the log viewer so the admin log panel can hide low-priority lines (#1251). Thanks @fqx.
Hermes Agent quick launch added to the launch picker (#1250). Thanks @shannonsands.
word_timestamps on /v1/audio/transcriptions for per-word timing in transcription output (#1214). Thanks @alexferrao.
/v1/mcp/execute accepts both tool and tool_name so clients written against either alias work (#1285). Thanks @mvanhorn.
total_time populated in non-streaming usage so non-stream responses match the stream-side accounting (#1269). Thanks @richgoodson.
Forward extra CLI args to codex from omlx launch (#1255). Thanks @EdenGottlieb.
Forward extra CLI args to claude from omlx launch (#1223).
Respect PI_CODING_AGENT_DIR env var so omlx launch pi honors a custom agent directory (#1282). Thanks @SuperGregM.
Re-check for updates from the periodic health timer so the menubar app picks up new releases without a restart (#1088). Thanks @jprado.
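
For the /v1/mcp/execute change listed above (#1285), accepting two aliases for one field usually comes down to a small normalization step. The sketch below is an assumption, not the server's code: `resolve_tool_name` is a hypothetical helper, and the choice to prefer `tool` when both keys are present is a guess made for illustration.

```python
def resolve_tool_name(payload: dict) -> str:
    """Return the tool name from either accepted key (hypothetical helper)."""
    # Assumption: "tool" wins if both keys are sent; the release notes do not say.
    name = payload.get("tool") or payload.get("tool_name")
    if not name:
        raise ValueError("request must include 'tool' or 'tool_name'")
    return name


from_tool = resolve_tool_name({"tool": "search", "arguments": {}})
from_alias = resolve_tool_name({"tool_name": "search", "arguments": {}})
```

Either request shape resolves to the same name, so clients written against either alias need no changes.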

Bug Fixes

Preserve role=tool for chat templates beyond the mlx-lm marker set so non-default templates with tool turns render correctly.
dflash OutputParserSession wired for gemma4 channel markers so dflash-decoded gemma4 streams parse cleanly.
VLM loading fixes for oQ-quantized checkpoints plus MTP head attach restored for VLM sensitivity (#1247). Thanks @a4501150.
HF downloader preserves finalized shards on cancel and only wipes partial temp files (#1284). Thanks @mvanhorn.
Engine pool surfaces the original load error when fallback also fails instead of masking it with the fallback error (#1283). Thanks @contrapuntal.
Profile classification for 9 new ModelSettings fields so they land in the right config bucket (#1268). Thanks @richgoodson.
Responses API preserves reasoning round-trip on tool-call turns so reasoning isn't dropped between turns.
Streaming bubble re-renders when the chat switches so the in-progress stream lands in the correct chat view.
Move Server Settings restart button next to Save in the admin settings panel (#1288).
Speculative toggle mutual exclusion made symmetric so toggling one engine off cleanly releases the other (#1227). Thanks @fish0710.
Admin browse: quant detection wired in trending / popular and sort dropdown options aligned with the table.
CLI launch connects via localhost when the server binds 0.0.0.0 so omlx chat works after a permissive bind (#1219). Thanks @fish0710.
Mobile chat input bar stays on-screen by dropping min-height: 100vh (#1218). Thanks @Luunae.
Packaging reuses the health-check Session so the menubar app doesn't exhaust ephemeral ports (#1211). Thanks @arthware-dev.
Surface chunked prefill RuntimeError as a request error instead of swallowing it inside the scheduler.
oQ loading progress callback hoisted out of the sensitivity-map cache branch so progress lands even on cache hits.
python_version marker on the paroquant optional dependency so non-3.11 environments resolve cleanly (#1228). Thanks @ivaniguarans.
Pre-existing upstream test failures and import guards repaired so the test suite is green on a clean checkout (#1244). Thanks @a4501150.
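
The engine pool fix above (#1283) follows a common pattern: when both the primary load and the fallback fail, report the first error, since it is usually the one the user can act on. A minimal sketch of that pattern, with `load_with_fallback` and both loader functions being hypothetical stand-ins rather than omlx APIs:

```python
def load_with_fallback(primary, fallback):
    """Try primary(), then fallback(); if both fail, raise the original error."""
    try:
        return primary()
    except Exception as original:
        try:
            return fallback()
        except Exception as fallback_error:
            # Surface the original failure; the fallback error is attached
            # as __cause__ so it still appears in the traceback.
            raise original from fallback_error


def broken_primary():
    raise ValueError("bad weights")


def broken_fallback():
    raise RuntimeError("fallback engine unavailable")


try:
    load_with_fallback(broken_primary, broken_fallback)
    surfaced = None
except Exception as exc:
    surfaced = exc
```

Here the caller sees the `ValueError` from the primary load rather than the less informative fallback failure.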

New Contributors

@drumtorben: chunked prefill (#1224)
@mvanhorn: HF download cleanup on cancel, /v1/mcp/execute alias, scheduler/memory test alignment (#1284, #1285, #1286)
@richgoodson: profile classification for new ModelSettings fields, total_time in non-streaming usage, stale test alignment (#1268, #1269, #1287)
@contrapuntal: engine pool fallback error surfacing (#1283)
@omniwired: sortable Browse Models table (#830)
@shannonsands: Hermes Agent quick launch (#1250)
@alexferrao: word_timestamps on /v1/audio/transcriptions (#1214)
@EdenGottlieb: forward extra CLI args to codex (#1255)
@SuperGregM: respect PI_CODING_AGENT_DIR env var (#1282)
@jprado: re-check for updates from periodic health timer (#1088)
@Luunae: keep mobile chat input bar on-screen (#1218)
@arthware-dev: reuse health-check Session to prevent port exhaustion (#1211)