EXO v1.0.69 Release Notes
This release ships continuous batching, Qwen3.5 support, and support for Apple M5 Pro/Max chips, along with a host of quality-of-life improvements and bug fixes.
Continuous batching is on by default, enabling you to run multiple requests in parallel for significantly higher throughput. EXO automatically batches inference requests together, on both single-node and multi-node instances, including RDMA instances. This is particularly useful for agentic workflows, where multiple agents can run in parallel.
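As a minimal sketch, firing concurrent requests at a node's OpenAI-compatible chat completions endpoint is enough to benefit from continuous batching; the port (52415) and model id below are assumptions, not part of this release's documented interface:

```python
# A sketch of parallel requests against a local EXO node's
# OpenAI-compatible API. Port and model id are assumptions.
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

EXO_URL = "http://localhost:52415/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    return {
        "model": "qwen-3.5",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
    }

def send(prompt: str) -> str:
    req = request.Request(
        EXO_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_all(prompts: list[str]) -> list[str]:
    # Concurrent requests are batched together server-side,
    # so this should be much faster than sending them one by one.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(send, prompts))

# run_all([f"Summarise document {i}" for i in range(8)])  # needs a running node
```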
Models
- Add support for Qwen3.5 (#1644)
- Add support for Nemotron sharding (#1693)
- Add default model cards for DeepSeek v3.2 (#1769)
API
- Add POST /v1/cancel/{command_id} endpoint for cancelling ongoing text generations (#1579)
- Add reasoning params to chat completions and responses APIs (#1654)
- Add `repetition_penalty` and `repetition_context_size` to chat completions (#1665)
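A quick sketch of how the new API surface might be used together, assuming the base URL and model id (the reasoning params from #1654 are omitted here since their exact field names aren't listed above):

```python
# Sketch: a chat completion using the new sampling params, plus a
# cancellation request. Base URL and model id are assumptions.
import json
from urllib import request

BASE = "http://localhost:52415"

def chat_payload(prompt: str) -> dict:
    # repetition_penalty / repetition_context_size are the new params (#1665).
    return {
        "model": "qwen-3.5",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        "repetition_penalty": 1.1,
        "repetition_context_size": 64,
    }

def cancel(command_id: str) -> request.Request:
    # POST /v1/cancel/{command_id} cancels an in-flight generation (#1579).
    # How the command id is obtained is not shown here.
    return request.Request(f"{BASE}/v1/cancel/{command_id}", method="POST")
```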
Performance
- Continuous batching (#1642, #1632, #1777)
- Better pipeline parallel prefill that splits the prompt into chunks and overlaps computation and communication. This makes pipeline parallel prefill up to 1.98x faster on 2 nodes (#1587, #1629)
Quality of Life
- Support trace deletion in dashboard (#1628)
- Make mini topology sidebar navigate to home on click (#1616)
- Show feedback that the model was successfully added when adding custom models from Huggingface (#1661)
- Enable global model search from Hugging Face, not just `mlx-community` (#1661)
- Mobile-friendly UI (#1677)
- Include power usage in exo-bench responses, enabling benchmarks to capture energy usage (#1693)
- Prefer nodes with more of the model downloaded when placing an instance (#1767, #1795)
- Sync custom model cards across nodes (#1768)
- Add `--bootstrap-peers` and `--libp2p-port` for static peer discovery, bypassing mDNS; this is useful in environments where mDNS is unavailable (#1690)
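A sketch of how the static peer discovery flags might be combined; the port value and the multiaddr format are assumptions, and `<peer-id>` is a placeholder:

```shell
# Bypass mDNS: listen on a fixed libp2p port and dial a known peer directly.
exo --libp2p-port 4001 \
    --bootstrap-peers /ip4/192.168.1.10/tcp/4001/p2p/<peer-id>
```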
Bug Fixes
- Upgrade macmon to fix macmon errors on M5 Pro/Max. This fixes an issue where M5 Pro/Max would not report memory or GPU usage stats and therefore could not participate in clusters (#1747, #1797)
- Use tmpdir for MLX distributed coordination file, preventing local network access permission issues (#1624)
- Fix BrokenResourceError crash when immediately loading a model on start (#1637)
- Emit error chunks when a runner crashes in the middle of a request, preventing streams hanging forever when runners crash (#1645)
- Fix copy code button not working in dashboard (#1659)
- Fix re-downloads so that models can be downloaded again after being deleted via the dashboard (#1658)
- Reset download status to `DownloadPending` when a download is cancelled so that the API and dashboard reflect the correct model download status (#1674)
- Increase gossipsub message limit to 8MB, fixing requests with very large prompts (#1671)
- Clean up stale `state.runners` state when runners shut down (#1684)
- Fix emoji rendering in chat responses (#1691)
- Fix placement validation for tensor sharding and show an error message with the constraints when no valid placement is found (#1669)
- Normalise responses API tool call format. This fixes tool calling with n8n (#1704)
- Show partial download progress on initial dashboard load (#1706)
- Fix a runner crash where very fast requests would cause a `ZeroDivisionError` (#1707)
- Make prefill more consistent on slow machines (#1748)
- Fix a race condition during master re-election that caused a full node hang with `IOConnectUnmapMemory failed: kr=0xe00002bc` (#1801)
- Use `mlx_generate` for warmup, preventing occasional issues with warmup when using `stream_generate` (#1794)
- Fix DeepSeek v3.2 warmup crash and tool calling (#1769)
New Contributors
Thank you to everyone who made contributions to exo for the first time:
- @onel
- @nakheel77
- @asprooo
- @zeus247
- @sigkill
- @EthyMoney
- @Luckystrike561
- @saidulahmed2266-cloud
- @tiawl
Full Changelog: v1.0.68...v1.0.69