This is the release candidate for 0.3.9. Big thanks to all the contributors who landed PRs since dev2. If you hit a bug, please open an issue. Stable release planned after ~24h of testing.
Highlights
Stronger memory management for low-memory Macs
A new phys_footprint-based memory enforcer with prefill admission control measures the same metric the OS uses for jetsam decisions, so the server stops accepting work before it gets killed instead of after. Hot-cache eviction race fixed (#1298), SSD↔hot block preloading parallelized (#1301), per-model cache hit-rate now observable (#1183, all thanks @ivaniguarans), and a real-time memory bar lands on the admin dashboard (#1278, thanks @beamivalice).
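The admission-control idea can be sketched as a simple projection check: compare the current footprint plus the estimated cost of the incoming prefill against a ceiling, and refuse the request if it would not fit. This is a minimal illustration only. `MemoryBudget`, `admit_prefill`, the headroom margin, and the byte values are hypothetical names and numbers chosen for the example, not the enforcer's actual API or thresholds.

```python
from dataclasses import dataclass


@dataclass
class MemoryBudget:
    """Hypothetical budget; all values are in bytes."""
    limit_bytes: int          # ceiling the enforcer tries to stay under
    headroom_bytes: int = 0   # safety margin kept free below the ceiling


def admit_prefill(footprint_bytes: int, estimated_prefill_bytes: int,
                  budget: MemoryBudget) -> bool:
    """Return True if the request still fits under the budget after prefill."""
    projected = footprint_bytes + estimated_prefill_bytes
    return projected + budget.headroom_bytes <= budget.limit_bytes


GIB = 1024 ** 3
budget = MemoryBudget(limit_bytes=16 * GIB, headroom_bytes=1 * GIB)

# 12 GiB in use plus a 2 GiB prefill still leaves the 1 GiB headroom: admit.
small_ok = admit_prefill(12 * GIB, 2 * GIB, budget)
# 12 GiB in use plus a 4 GiB prefill would eat into the headroom: reject.
large_ok = admit_prefill(12 * GIB, 4 * GIB, budget)
```

The point of checking before prefill starts is that a rejected request costs nothing, whereas a request admitted and then killed mid-prefill takes the whole server down with it.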
Chunked prefill (#1224)
A long-context prompt no longer blocks decode for other in-flight requests: prefill advances one chunk per scheduler step, so concurrent requests keep streaming tokens through it. Off by default, toggleable from admin. Thanks @drumtorben.
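A toy scheduler makes the interleaving concrete. In this sketch each step advances at most one prefilling request by a single chunk, then emits one token for every request that was already decoding. `Request`, `step`, and `run` are hypothetical names for illustration, and the policy (first prefilling request wins, fixed chunk size) is an assumption rather than the scheduler's real behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    name: str
    prompt_tokens: int            # total prompt length to prefill
    decode_tokens: int            # number of tokens to generate
    prefilled: int = 0            # prompt tokens processed so far
    output: list = field(default_factory=list)

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_tokens

    @property
    def done(self) -> bool:
        return not self.in_prefill and len(self.output) >= self.decode_tokens


def step(requests, chunk_size):
    """One scheduler step: one prefill chunk, then one token per decoder."""
    # Snapshot the requests that were already decoding before this step.
    decoding = [r for r in requests if not r.in_prefill and not r.done]
    # Advance at most one prefilling request by a single chunk.
    for req in requests:
        if req.in_prefill:
            req.prefilled = min(req.prompt_tokens, req.prefilled + chunk_size)
            break
    # Every decoding request still gets its token this step.
    for req in decoding:
        req.output.append(len(req.output))


def run(requests, chunk_size):
    """Step until every request is finished; return the number of steps."""
    steps = 0
    while not all(r.done for r in requests):
        step(requests, chunk_size)
        steps += 1
    return steps


long_req = Request("long", prompt_tokens=10, decode_tokens=2)
short_req = Request("short", prompt_tokens=4, decode_tokens=3, prefilled=4)
reqs = [long_req, short_req]

# The long prompt needs three chunks of 4 tokens to finish prefill.
for _ in range(3):
    step(reqs, chunk_size=4)
tokens_during_prefill = len(short_req.output)

remaining_steps = run(reqs, chunk_size=4)
```

Here the short request streams one token on each of the three steps the long prompt spends in prefill, instead of waiting for the whole prompt to be processed first.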
Chat multi-tasking (#1231)
Multiple chats can now run in parallel from the admin chat, and the "wrong response stitched into the wrong chat" bug that came along with concurrent streaming is fixed at the same time. Thanks @beamivalice.
New Features
- dflash-mlx bumped to `1ba6713` (v0.1.7), thanks to @bstnxbt's dflash-mlx: Qwen thinking / GDN exactness fix, GQA SDPA reshape, DDTree + CopySpec decode path, prefix cache identity hardening, and FP16 draft on older Apple chips. dflash also exposes `draft_window_size` / `draft_sink_size` / `verify_mode` in model settings (#1276).
- Native reasoning support for the Responses API so reasoning blocks survive across tool-call round-trips (#1245). Thanks @a4501150.
- Split `store_cache_main_prep` into sub-phase timers so the slowest sub-step is visible in the scheduler logs (#1243). Thanks @ivaniguarans.
- Load pre-saved oQ sensitivity map from the source model folder so re-quantization can skip the measurement pass (#1295). Thanks @deepsweet.
- Sortable Browse Models table plus a mobile-friendly layout (#830). Thanks @omniwired.
- Minimum display-level filter in the log viewer so the admin log panel can hide low-priority lines (#1251). Thanks @fqx.
- Hermes Agent quick launch added to the launch picker (#1250). Thanks @shannonsands.
- `word_timestamps` on `/v1/audio/transcriptions` for per-word timing in transcription output (#1214). Thanks @alexferrao.
- `/v1/mcp/execute` accepts both `tool` and `tool_name` so clients written against either alias work (#1285). Thanks @mvanhorn.
- `total_time` populated in non-streaming usage so non-stream responses match the stream-side accounting (#1269). Thanks @richgoodson.
- Forward extra CLI args to `codex` from `omlx launch` (#1255). Thanks @EdenGottlieb.
- Forward extra CLI args to `claude` from `omlx launch` (#1223).
- Respect `PI_CODING_AGENT_DIR` env var so `omlx launch pi` honors a custom agent directory (#1282). Thanks @SuperGregM.
- Re-check for updates from the periodic health timer so the menubar app picks up new releases without a restart (#1088). Thanks @jprado.
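For the `/v1/mcp/execute` change, accepting two spellings of the same field usually comes down to a small normalization step before dispatch. The sketch below is one way to do that. `resolve_tool_name` is a hypothetical helper, and the handling of a missing or conflicting pair is an assumption, since the notes only say that both `tool` and `tool_name` are accepted.

```python
def resolve_tool_name(payload: dict) -> str:
    """Return the tool name from either the `tool` or `tool_name` field."""
    tool = payload.get("tool")
    tool_name = payload.get("tool_name")
    # Assumption: if both are present they must agree.
    if tool and tool_name and tool != tool_name:
        raise ValueError("'tool' and 'tool_name' disagree")
    name = tool or tool_name
    # Assumption: a request with neither field is rejected.
    if not name:
        raise ValueError("request must include 'tool' or 'tool_name'")
    return name


from_tool = resolve_tool_name({"tool": "search", "arguments": {}})
from_alias = resolve_tool_name({"tool_name": "search", "arguments": {}})
```

Both calls resolve to the same name, so the rest of the handler never needs to know which alias the client used.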
Bug Fixes
- Preserve `role=tool` for chat templates beyond the mlx-lm marker set so non-default templates with tool turns render correctly.
- dflash `OutputParserSession` wired for gemma4 channel markers so dflash-decoded gemma4 streams parse cleanly.
- VLM loading fixes for oQ-quantized checkpoints plus MTP head attach restored for VLM sensitivity (#1247). Thanks @a4501150.
- HF downloader preserves finalized shards on cancel and only wipes partial temp files (#1284). Thanks @mvanhorn.
- Engine pool surfaces the original load error when fallback also fails instead of masking it with the fallback error (#1283). Thanks @contrapuntal.
- Profile classification for 9 new ModelSettings fields so they land in the right config bucket (#1268). Thanks @richgoodson.
- Responses API preserves reasoning round-trip on tool-call turns so reasoning isn't dropped between turns.
- Streaming bubble re-renders when the chat switches so the in-progress stream lands in the correct chat view.
- Move Server Settings restart button next to Save in the admin settings panel (#1288).
- Speculative toggle mutual exclusion made symmetric so toggling one engine off cleanly releases the other (#1227). Thanks @fish0710.
- Admin browse: quant detection wired in trending / popular and sort dropdown options aligned with the table.
- CLI launch connects via localhost when the server binds `0.0.0.0` so `omlx chat` works after a permissive bind (#1219). Thanks @fish0710.
- Mobile chat input bar stays on-screen by dropping `min-height: 100vh` (#1218). Thanks @Luunae.
- Packaging reuses the health-check `Session` so the menubar app doesn't exhaust ephemeral ports (#1211). Thanks @arthware-dev.
- Surface chunked prefill `RuntimeError` as a request error instead of swallowing it inside the scheduler.
- oQ loading progress callback hoisted out of the sensitivity-map cache branch so progress lands even on cache hits.
- `python_version` marker on the `paroquant` optional dependency so non-3.11 environments resolve cleanly (#1228). Thanks @ivaniguarans.
- Pre-existing upstream test failures and import guards repaired so the test suite is green on a clean checkout (#1244). Thanks @a4501150.
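The engine pool fix (#1283) follows a common pattern: when a fallback also fails, re-raise the first error so the root cause is what the caller sees. A minimal sketch using Python's exception chaining, where `load_with_fallback` and the two loader functions are hypothetical stand-ins rather than the engine pool's actual code:

```python
def load_with_fallback(primary, fallback):
    """Try `primary`, then `fallback`; if both fail, surface the first error."""
    try:
        return primary()
    except Exception as original:
        try:
            return fallback()
        except Exception as fallback_error:
            # Re-raise the original error and attach the fallback failure
            # as its cause, so neither is lost from the traceback.
            raise original from fallback_error


def bad_primary():
    raise RuntimeError("weights missing")


def bad_fallback():
    raise ValueError("fallback engine unsupported")


try:
    load_with_fallback(bad_primary, bad_fallback)
except RuntimeError as exc:
    caught = exc
```

The caller catches the `RuntimeError` from the first load attempt, while the fallback's `ValueError` stays reachable through `caught.__cause__`.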
New Contributors
- @drumtorben: chunked prefill (#1224)
- @mvanhorn: HF download cleanup on cancel, `/v1/mcp/execute` alias, scheduler/memory test alignment (#1284, #1285, #1286)
- @richgoodson: profile classification for new ModelSettings fields, `total_time` in non-streaming usage, stale test alignment (#1268, #1269, #1287)
- @contrapuntal: engine pool fallback error surfacing (#1283)
- @omniwired: sortable Browse Models table (#830)
- @shannonsands: Hermes Agent quick launch (#1250)
- @alexferrao: `word_timestamps` on `/v1/audio/transcriptions` (#1214)
- @EdenGottlieb: forward extra CLI args to `codex` (#1255)
- @SuperGregM: respect `PI_CODING_AGENT_DIR` env var (#1282)
- @jprado: re-check for updates from periodic health timer (#1088)
- @Luunae: keep mobile chat input bar on-screen (#1218)
- @arthware-dev: reuse health-check `Session` to prevent port exhaustion (#1211)