EXO v1.0.69 Release Notes
This release ships continuous batching, Qwen3.5 support, and support for Apple M5 Pro/Max chips, along with a host of quality-of-life improvements and bug fixes.
Continuous batching is on by default, enabling you to run multiple requests in parallel for significantly higher throughput. EXO automatically batches inference requests together, on both single-node and multi-node instances, including RDMA instances. This is particularly useful for agentic workflows, where multiple agents can run in parallel.
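As a minimal sketch, firing concurrent requests at a node's OpenAI-compatible chat completions endpoint is enough to benefit from continuous batching; the port (52415) and model id below are assumptions, not part of this release's documented interface:

```python
# A sketch of parallel requests against a local EXO node's
# OpenAI-compatible API. Port and model id are assumptions.
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

EXO_URL = "http://localhost:52415/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    return {
        "model": "qwen-3.5",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
    }

def send(prompt: str) -> str:
    req = request.Request(
        EXO_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_all(prompts: list[str]) -> list[str]:
    # Concurrent requests are batched together server-side,
    # so this should be much faster than sending them one by one.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(send, prompts))

# run_all([f"Summarise document {i}" for i in range(8)])  # needs a running node
```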
Models
- Add support for Qwen3.5 (#1644)
- Add support for Nemotron sharding (#1693)
- Add default model cards for DeepSeek v3.2 (#1769)
API
- Add POST /v1/cancel/{command_id} endpoint for cancelling ongoing text generations (#1579)
- Add reasoning params to chat completions and responses APIs (#1654)
- Add `repetition_penalty` and `repetition_context_size` to chat completions (#1665)
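A quick sketch of how the new API surface might be used together, assuming the base URL and model id (the reasoning params from #1654 are omitted here since their exact field names aren't listed above):

```python
# Sketch: a chat completion using the new sampling params, plus a
# cancellation request. Base URL and model id are assumptions.
import json
from urllib import request

BASE = "http://localhost:52415"

def chat_payload(prompt: str) -> dict:
    # repetition_penalty / repetition_context_size are the new params (#1665).
    return {
        "model": "qwen-3.5",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        "repetition_penalty": 1.1,
        "repetition_context_size": 64,
    }

def cancel(command_id: str) -> request.Request:
    # POST /v1/cancel/{command_id} cancels an in-flight generation (#1579).
    # How the command id is obtained is not shown here.
    return request.Request(f"{BASE}/v1/cancel/{command_id}", method="POST")
```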
Performance
- Continuous batching (#1642, #1632, #1777)
- Better pipeline parallel prefill that splits the prompt into chunks and overlaps computation and communication. This makes pipeline parallel prefill up to 1.98x faster on 2 nodes (#1587, #1629)
Quality of Life
- Support trace deletion in dashboard (#1628)
- Make mini topology sidebar navigate to home on click (#1616)
- Show feedback that the model was successfully added when adding custom models from Huggingface (#1661)
- Enable global model search from Hugging Face, not just `mlx-community` (#1661)
- Mobile-friendly UI (#1677)
- Include power usage in exo-bench responses, enabling benchmarks to capture energy usage (#1693)
- Prefer nodes with more of the model downloaded when placing an instance (#1767, #1795)
- Sync custom model cards across nodes (#1768)
- Add `--bootstrap-peers` and `--libp2p-port` for static peer discovery, bypassing mDNS; this is useful in environments where mDNS is unavailable (#1690)
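A sketch of how the static peer discovery flags might be combined; the port value and the multiaddr format are assumptions, and `<peer-id>` is a placeholder:

```shell
# Bypass mDNS: listen on a fixed libp2p port and dial a known peer directly.
exo --libp2p-port 4001 \
    --bootstrap-peers /ip4/192.168.1.10/tcp/4001/p2p/<peer-id>
```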
Bug Fixes
- Upgrade macmon to fix macmon errors on M5 Pro/Max. This fixes an issue where M5 Pro/Max would not report memory or GPU usage stats and therefore could not participate in clusters (#1747, #1797)
- Use tmpdir for MLX distributed coordination file, preventing local network access permission issues (#1624)
- Fix BrokenResourceError crash when immediately loading a model on start (#1637)
- Emit error chunks when a runner crashes in the middle of a request, preventing streams hanging forever when runners crash (#1645)
- Fix copy code button not working in dashboard (#1659)
- Fix re-downloads so that models can be downloaded again after being deleted via the dashboard (#1658)
- Reset download status to `DownloadPending` when a download is cancelled so that the API and dashboard reflect the correct model download status (#1674)
- Increase gossipsub message limit to 8MB, fixing requests with very large prompts (#1671)
- Clean up stale `state.runners` state when runners shut down (#1684)
- Fix emoji rendering in chat responses (#1691)
- Fix placement validation for tensor sharding and show an error message with the constraints when no valid placement is found (#1669)
- Normalise responses API tool call format. This fixes tool calling with n8n (#1704)
- Show partial download progress on initial dashboard load (#1706)
- Fix a runner crash where very fast requests would cause a `ZeroDivisionError` (#1707)
- Make prefill more consistent on slow machines (#1748)
- Fix a race condition during master re-election that caused a full node hang with `IOConnectUnmapMemory failed: kr=0xe00002bc` (#1801)
- Use `mlx_generate` for warmup, preventing occasional issues with warmup when using `stream_generate` (#1794)
- Fix DeepSeek v3.2 warmup crash and tool calling (#1769)
New Contributors
Thank you to everyone who made contributions to exo for the first time:
- @onel
- @nakheel77
- @asprooo
- @zeus247
- @sigkill
- @EthyMoney
- @Luckystrike561
- @saidulahmed2266-cloud
- @tiawl
Full Changelog: v1.0.68...v1.0.69