The biggest Voicebox release yet. Three new TTS engines bring the lineup to seven — HumeAI TADA, Kokoro 82M, and Qwen CustomVoice join Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and Chatterbox Turbo. GPU support broadens to Intel Arc (XPU) and NVIDIA Blackwell (RTX 50-series), with runtime diagnostics that warn when your PyTorch build doesn't match your GPU. The CUDA backend is now split into independently versioned server and library archives, so upgrading no longer redownloads 4 GB of PyTorch/CUDA DLLs.
This release also marks a big community moment: 13 new contributors shipped fixes and features in 0.4.0. Thirty-plus bug fixes target the most-reported issues in the tracker — numpy 2.x TTS crashes, Windows background-server reliability, macOS 11 launch failures, audio playback silence, Stories clip-splitting races, history status staleness, and more.
New TTS Engines
HumeAI TADA — Expressive English & Multilingual (#296)
- Added `tada-1b` (English) and `tada-3b-ml` (multilingual) backends
- Replaced `descript-audio-codec` with a lightweight DAC shim to cut dependencies
- Switched audio decoding to `soundfile` to sidestep `torchcodec` bundling issues
- Redirected gated Llama tokenizer lookups to an ungated mirror so model loading works out of the box
- Fixed tokenizer patch that was corrupting `AutoTokenizer` for other engines
- Fixed TorchScript error in frozen builds
Kokoro 82M — Fast Lightweight TTS (#325)
- Added Kokoro 82M engine with a new voice profile type system that distinguishes preset voices from cloned profiles
- Profile grid now handles engine compatibility directly — removed redundant dropdown filtering
- Tightened Kokoro profile handling so preset voices can't be edited like cloned profiles
Qwen CustomVoice (#328)
- Added `qwen-custom-voice` preset engine backed by Qwen3-TTS
- Enforced preset/profile engine compatibility across the generation flow
- Floating generator now shows all engines instead of silently filtering
Voice Profile UX
Until 0.4, every engine in Voicebox was a cloning model, so every voice profile was usable with every engine and the profile grid just showed them all. Introducing Kokoro and Qwen CustomVoice — which work from preset voices rather than cloned samples — broke that assumption for the first time. An early cut on main filtered the grid by the selected engine, which left users running pre-release builds thinking their cloned voices had vanished whenever they switched to a preset-only engine.
This release ships the resolution before it ever reaches a tagged version:
- Grey-out instead of filter — all profiles are always visible; unsupported ones render dimmed with a compatibility hint at the bottom of the grid
- Auto-switch on selection — clicking a greyed-out profile selects it AND switches the engine to a compatible one, instead of silently doing nothing
- Instruct toggle restored for Qwen CustomVoice — the floating generate box now reveals a delivery-instructions input (tone, emotion, pace) when CustomVoice is selected. The input had been hidden across the board while the new multi-engine lineup was stabilizing, since most engines don't honor the kwarg; it is now conditionally exposed only for the one engine actually trained for instruction-based style control
- Supported profiles sort first; the grid scrolls the selected profile into view after engine/sort changes
- Fixed engine desync on tab navigation — the form now initializes its engine from the store
- Fixed the disabled-and-selected card click edge case by bouncing selection to re-trigger the auto-switch
- Cleaned up scroll effect timers (requestAnimationFrame + setTimeout) to prevent stale DOM writes on unmount or rapid selection changes
GPU & Platform
Intel Arc (XPU) Support (#320)
- First-class Intel Arc support across all PyTorch-based backends
- Device-aware seeding, XPU detection in the GPU status panel, and setup flow detection
- Reports correct device name and VRAM in settings
Blackwell / RTX 50-series Support (#316, #401)
- Upgraded the CUDA backend from cu126 → cu128 for RTX 50-series support
- Added `sm_120+PTX` to the CUDA build via `TORCH_CUDA_ARCH_LIST` for forward-compatibility with Blackwell architectures (closes 5 open reports: #386, #395, #396, #399, #400)
- GPU settings UI fixes around install/uninstall state
GPU Compatibility Diagnostics (#367, adapted)
- New `check_cuda_compatibility()` compares the current device's compute capability against the bundled PyTorch's architecture list
- Health endpoint exposes a `gpu_compatibility_warning` field so the UI can surface mismatches
- Startup logs a `WARN` when the installed PyTorch build doesn't support the detected GPU
- GPU status label shows `[UNSUPPORTED - see logs]` — no more silent "no kernel image" failures
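The check above reduces to comparing a compute capability against a build's target list. Here is a simplified, dependency-free sketch of that comparison; the real `check_cuda_compatibility()` queries `torch.cuda`, whereas this version takes both inputs explicitly and reduces the PTX forward-compatibility rule to "any `compute_*` entry at or below the device's capability can be JIT-compiled":

```python
def check_cuda_compatibility(device_capability, arch_list):
    """Return a warning string if the device looks unsupported, else None.

    device_capability: (major, minor), e.g. (12, 0) for Blackwell sm_120
    arch_list: the PyTorch build's targets, e.g. ["sm_90", "compute_90"]
    """
    major, minor = device_capability
    sm = f"sm_{major}{minor}"
    if sm in arch_list:
        return None  # native kernels shipped for this GPU
    # A compute_XY (PTX) entry can be JIT-compiled forward onto newer GPUs
    for arch in arch_list:
        if arch.startswith("compute_") and int(arch.split("_")[1]) <= major * 10 + minor:
            return None
    return (f"PyTorch build targets {arch_list} but GPU is {sm}; "
            "kernels may fail with 'no kernel image is available'")
```

For example, an RTX 50-series card, capability `(12, 0)`, paired with a `cu126` build targeting only `sm_90` would produce a warning, while a `cu128` build listing `sm_120` (or a PTX fallback) would not.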
Split CUDA Backend (#298)
- CUDA backend now ships as two independently versioned archives: a small server binary and a large libs archive (the ~4 GB of PyTorch/CUDA DLLs)
- Upgrading Voicebox no longer redownloads the libs archive when only the server binary changed
- Added `asyncio.Lock` around `download_cuda_binary()` so auto-update and manual download can't race on the same temp file (#428)
- Updated `package_cuda.py` for PyInstaller 6.18 onedir layout
- Temp archives are always cleaned up on failure, even when the install aborts mid-extract
Bug Fixes
Critical: TTS Generation
- numpy 2.x `torch.from_numpy` crash (#361) — torch compiled against the numpy 1.x ABI fails when paired with numpy 2.x, causing `RuntimeError: Numpy is not available` / `Unable to create tensor` on every TTS request in bundled macOS Intel / Rosetta builds. Pinned `numpy<2.0` in requirements and added a PyInstaller runtime hook with a `ctypes.memmove` fallback as belt-and-suspenders. Hardened afterward to raise on unknown dtypes instead of silently reinterpreting bytes as float32.
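A hypothetical sketch of that dtype hardening: rather than falling back to reinterpreting raw bytes as float32, unknown dtypes are rejected up front. The supported set and function name here are illustrative, not Voicebox's actual code:

```python
# Dtypes the fallback path knows how to convert safely (illustrative set)
SUPPORTED_AUDIO_DTYPES = {"float32", "float64", "int16", "int32"}

def validate_audio_dtype(dtype_name: str) -> str:
    """Fail loudly on an unrecognized dtype instead of silently
    reinterpreting its bytes as float32 and producing garbage audio."""
    if dtype_name not in SUPPORTED_AUDIO_DTYPES:
        raise TypeError(
            f"unsupported audio dtype {dtype_name!r}; refusing to "
            "reinterpret its bytes as float32"
        )
    return dtype_name
```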
Platform Reliability
- Windows background server (#402) — "keep server running after close" now actually keeps the server running. The HTTP `/watchdog/disable` request could lose the race against process exit on Windows; added a `.keep-running` sentinel file as a synchronous fallback, with stale-sentinel cleanup on startup to avoid orphan server processes
- macOS 11 launch crash (#424) — weak-linked ScreenCaptureKit so the app can launch on macOS < 12.3 instead of crashing at dyld resolution. Gated system audio capture behind a real `sw_vers` version check so unsupported systems cleanly advertise "not available" rather than crashing at runtime
- macOS Intel (x86_64) setup (#416) — relaxed `torch>=2.7.0` → `torch>=2.2.0`. PyTorch dropped pre-built x86_64 wheels after 2.2.2, so Intel Mac devs could no longer `pip install`. Now resolves to the latest compatible torch per platform
- Offline model loading (#318) — Qwen TTS and Whisper force offline mode when loading cached models, so startup works without network access
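The sentinel-file fallback for the background server can be sketched as follows. This is a POSIX-only illustration (signal 0 probes liveness without delivering a signal; Windows needs a different check), and the file format, an ASCII PID, is an assumption:

```python
import os
from pathlib import Path

def _pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 performs no action."""
    try:
        os.kill(pid, 0)
    except OSError:
        return False
    return True

def should_keep_running(sentinel: Path) -> bool:
    """Read the .keep-running sentinel on startup. A sentinel whose
    recorded PID is gone, or whose contents are garbage, is stale and
    gets removed so a crashed client can't leave an orphan server."""
    if not sentinel.exists():
        return False
    try:
        pid = int(sentinel.read_text().strip())
    except ValueError:
        pid = -1
    if pid <= 0 or not _pid_alive(pid):
        sentinel.unlink(missing_ok=True)  # stale-sentinel cleanup
        return False
    return True
```

Because the file is written synchronously before the UI process exits, it can't lose the race the way an in-flight HTTP request to `/watchdog/disable` could.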
- GUI startup with external server (#319) — fixed GUI launch when pointed at a remote/external server, and added data refresh on server switch; hardened health validation and error handling
- Qwen3-TTS cache split on Windows (adapted from #218) — route `Qwen3TTSModel.from_pretrained` through `hf_constants.HF_HUB_CACHE` so the speech tokenizer and `preprocessor_config.json` resolve from a single cache root
- Qwen3-TTS bundling (#305) — bundle `qwen_tts` source files in the PyInstaller build to fix `inspect.getsource` errors in frozen builds
- Backend import paths (#345) — moved lazy imports to top-level with absolute paths to resolve the "Failed to Save" preset error caused by `ModuleNotFoundError` in production builds
- Effects service import (#384) — fixed `ModuleNotFoundError` on preset create/update by switching to relative imports (#349)
Audio & Playback
- cpal stream silent playback (#405) — `cpal::Stream` was dropped on function return immediately after `play()`, causing every playback to fall silent. Now holds the stream until either the buffer drains or the stop flag fires (#404)
Stories & History
- Clip-splitting race (#403) — rapid double-clicks on split could race through `split_story_item` with inconsistent state. Added `with_for_update()` row locking on the backend and an `isPending` guard on the frontend (#366)
- History `status` staleness (#394) — `GET /history/{id}` was hardcoding `status="completed"` regardless of the DB row, breaking any client polling for job completion. Now returns `status`, `error`, `engine`, `model_size`, and `is_favorited` from the actual row
- "Clear failed" bulk button (#412) — new `DELETE /history/failed` endpoint and a header strip showing "N failed generations" with a Clear button, complementing the per-row trash icon added in #321 (#410)
- Delete failed generations (#321) — added a trash icon next to the retry button so failed entries can be cleaned up without having to retry first
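The real fix uses SQLAlchemy's `with_for_update()`, which emits `SELECT ... FOR UPDATE`. As a self-contained analog of the same lock-before-read idea, here is a plain-`sqlite3` version: SQLite has no `FOR UPDATE`, so `BEGIN IMMEDIATE` takes the write lock up front instead. Table and column names are invented for illustration:

```python
import sqlite3

def split_story_item(conn: sqlite3.Connection, item_id: int) -> int:
    """Take the write lock *before* reading, so two concurrent splits
    can't both observe the same pre-split clip_count.

    Expects a connection opened with isolation_level=None (autocommit),
    so the explicit BEGIN below controls the transaction.
    """
    conn.execute("BEGIN IMMEDIATE")  # lock first, then read-modify-write
    try:
        (clips,) = conn.execute(
            "SELECT clip_count FROM story_items WHERE id = ?", (item_id,)
        ).fetchone()
        conn.execute(
            "UPDATE story_items SET clip_count = ? WHERE id = ?",
            (clips + 1, item_id),
        )
        conn.execute("COMMIT")
        return clips + 1
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Without the up-front lock, two overlapping calls could both read the same `clip_count` and one increment would be lost, which is the shape of the double-click race described above.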
Security & Safety
- Voice prompt cache hardening (#429) — load cached voice prompts with `torch.load(weights_only=True)` per the PyTorch 2.6 recommendation; replaced the string-based SPA path guard with `Path.is_relative_to()` for more robust path-traversal protection
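To illustrate why `Path.is_relative_to()` beats a string prefix, here is a hedged sketch of such a guard (the function name and root layout are assumptions, not Voicebox's actual API):

```python
from pathlib import Path

def resolve_spa_path(root: Path, requested: str) -> Path:
    """Containment check via Path.is_relative_to (Python 3.9+).

    A string guard like str(p).startswith("/srv/spa") wrongly admits
    "/srv/spa-evil", and misses ".." traversal unless the path is
    resolved first; resolving then checking containment handles both.
    """
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root.resolve()):
        raise PermissionError(f"path escapes SPA root: {requested!r}")
    return candidate
```

Note that `resolve()` also follows symlinks, so a symlink inside the root pointing outside it is rejected too.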
Infrastructure & Docker
- Docker web build (#344) — include `CHANGELOG.md` in the Docker web build so the in-app changelog page works in Docker deployments
- Docker numba cache (#425) — set `NUMBA_CACHE_DIR` in docker-compose so numba can write its JIT cache at container runtime (#308)
- Relative media paths (#332) — media paths are now stored relative to the configured data dir rather than resolved against the CWD, so the data directory is portable between installs
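The relative-path change boils down to a store-relative, resolve-on-load pattern. A minimal sketch, with function names invented for illustration:

```python
from pathlib import Path

def to_stored_path(data_dir: Path, media_file: Path) -> str:
    """Persist only the portion under the data dir; relative_to raises
    ValueError if the file lives outside it, rather than silently
    recording an absolute, machine-specific path."""
    return media_file.resolve().relative_to(data_dir.resolve()).as_posix()

def from_stored_path(data_dir: Path, stored: str) -> Path:
    """Re-anchor at load time against whatever data dir is configured now,
    so moving the directory between installs keeps paths valid."""
    return data_dir / stored
```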
Developer Tooling
- New `triage-prs` agent skill — encodes the end-to-end PR-speedrun workflow (classification → triage doc → rebase → squash-merge → follow-ups) so future release cycles can reproduce it
- Rewrote the TTS engine guide with the patterns learned from adding TADA and Kokoro
- Added the API refactor plan and CUDA libs addon design doc
- Fixed broken links in the Get Started section (#332)
New Contributors
Huge thank you to everyone who contributed their first PR to Voicebox in this release:
@liorshahverdi, @nicoschtein, @ArfianID, @aimaaaimaa, @maxmcoding, @Khalodddd, @LuisSambrano, @shaun0927, @malletfils, @mvanhorn, @kuishou68, @txhno, @MukundaKatta