The best local voice cloning tool just got better.
See the new website: https://voicebox.sh
Released 2026-03-15 — v0.2.1 on GitHub (version bump due to an immutable release tag on GitHub)
Voicebox v0.1.x was a single-engine voice cloning app built around Qwen3-TTS. v0.2.0 is a ground-up rethink: four TTS engines, 23 languages, paralinguistic emotion controls, a post-processing effects pipeline, unlimited generation length, an async generation queue, and support for every major GPU vendor. Plus Docker.
New TTS Engines
Multi-Engine Architecture
Voicebox now runs four independent TTS engines behind a thread-safe per-engine backend registry. Switch engines per-generation from a single dropdown — no restart required.
| Engine | Languages | Size | Key Strengths |
|---|---|---|---|
| Qwen3-TTS 1.7B | 10 | ~3.5 GB | Highest quality, delivery instructions ("speak slowly", "whisper") |
| Qwen3-TTS 0.6B | 10 | ~1.2 GB | Lighter, faster variant |
| LuxTTS | English | ~300 MB | CPU-friendly, 48 kHz output, 150x realtime |
| Chatterbox Multilingual | 23 | ~3.2 GB | Broadest language coverage, zero-shot cloning |
| Chatterbox Turbo | English | ~1.5 GB | 350M params, low latency, paralinguistic tags |
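To make the "thread-safe per-engine backend registry" idea concrete, here is a minimal sketch of one way such a registry could work. The class name, method names, and the lazy-loading behavior are assumptions for illustration and are not taken from the Voicebox codebase.

```python
import threading
from typing import Callable, Dict


class EngineRegistry:
    """Hypothetical registry: lazily creates and caches one backend per engine."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._backends: Dict[str, object] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the bookkeeping dictionaries

    def register(self, name: str, factory: Callable[[], object]) -> None:
        with self._guard:
            self._factories[name] = factory
            self._locks[name] = threading.Lock()

    def get(self, name: str) -> object:
        with self._guard:
            if name not in self._factories:
                raise KeyError(f"unknown engine: {name}")
            factory = self._factories[name]
            lock = self._locks[name]
        # Per-engine lock: loading one engine does not block lookups of another.
        with lock:
            if name not in self._backends:
                self._backends[name] = factory()
            return self._backends[name]


# Usage: engines are picked by name per generation, and each loads at most once.
registry = EngineRegistry()
registry.register("luxtts", lambda: object())  # stand-in for a real model loader
backend = registry.get("luxtts")
```

Keeping one lock per engine is what would let the app switch engines without a restart: a slow first load of one model only delays requests for that model.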
Chatterbox Multilingual — 23 Languages (#257)
Zero-shot voice cloning in Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, and Turkish. The language dropdown dynamically filters to show only languages supported by the selected engine.
LuxTTS — Lightweight English TTS (#254)
A fast, CPU-friendly English engine. ~300 MB download, 48 kHz output, runs at 150x realtime on CPU. Good for quick drafts and machines without a GPU.
Chatterbox Turbo — Expressive English (#258)
A fast 350M-parameter English model with inline paralinguistic tags.
Paralinguistic Tags Autocomplete (#265)
Type / in the text input with Chatterbox Turbo selected to open an autocomplete for 9 expressive tags that the model synthesizes inline with speech:
[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]
Tags render as inline badges in a rich text editor and serialize cleanly to the API.
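As a rough illustration of how tagged text might be tokenized before it reaches the model, the sketch below splits input into text and tag segments and offers prefix-based autocomplete. The function names are hypothetical, and the actual editor and API serialization may work differently.

```python
import re
from typing import List, Tuple

# The nine tags listed above.
TAGS = ("laugh", "chuckle", "gasp", "cough", "sigh",
        "groan", "sniff", "shush", "clear throat")
TAG_RE = re.compile(r"\[(" + "|".join(re.escape(t) for t in TAGS) + r")\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Return ("text", ...) and ("tag", ...) segments in reading order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments


def complete(prefix: str) -> List[str]:
    """Autocomplete candidates for whatever was typed after '/'."""
    return [tag for tag in TAGS if tag.startswith(prefix.lower())]
```

Bracketed words that are not in the tag list are left as ordinary text, so a stray `[note]` in a script would be spoken rather than silently dropped.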
Generation
Unlimited Generation Length — Auto-Chunking (#266)
Long text is now automatically split at sentence boundaries, generated per-chunk, and crossfaded back together. Engine-agnostic — works with all four engines.
- Auto-chunking limit slider — 100–5,000 chars (default 800)
- Crossfade slider — 0–200ms (default 50ms), or 0 for a hard cut
- Max text length raised to 50,000 characters
- Smart splitting respects abbreviations (Dr., e.g., a.m.) and CJK punctuation, and never breaks inside `[tags]`
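The split, generate, and crossfade flow described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the abbreviation list, function names, and the linear fade curve are invented here, a sentence longer than the limit is kept whole, and chunks are rejoined with a space (which a real CJK-aware splitter would avoid).

```python
import re
from typing import List

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "a.m.", "p.m."}
# A sentence may end at Latin punctuation followed by whitespace, or at CJK punctuation.
BOUNDARY_RE = re.compile(r"[.!?]+(?=\s)|[。！？]")


def split_sentences(text: str) -> List[str]:
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        end = match.end()
        words = text[start:end].split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # "Dr." does not end a sentence
        if text.count("[", start, end) > text.count("]", start, end):
            continue  # never break inside an unclosed [tag]
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def chunk_text(text: str, limit: int = 800) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join two sample lists with a linear fade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b  # hard cut
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming chunk
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[:-overlap] + mixed + b[overlap:]
```

With a crossfade setting in milliseconds, the overlap in samples would be `int(sample_rate * ms / 1000)`, so 50 ms at 48 kHz is 2,400 samples.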
Asynchronous Generation Queue (#269)
Generation is now fully non-blocking. Submit a generation and start typing the next one immediately.
- Serial execution queue prevents GPU contention
- Real-time SSE status streaming (`generating` → `completed`/`failed`)
- Failed generations can be retried without re-entering text
- Stale generations from crashes are auto-recovered on startup
- Generating status pill shown inline in the story editor
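A serial, non-blocking queue of this kind can be approximated with `asyncio`. The sketch below is illustrative only: the class and method names are hypothetical, the `queued` state is an assumption added for clarity, and the `events` list stands in for a real SSE stream.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class GenerationQueue:
    """Hypothetical queue: runs jobs one at a time and records status changes."""

    def __init__(self) -> None:
        self._jobs: asyncio.Queue = asyncio.Queue()
        self.status: Dict[str, str] = {}
        self.events: List[Tuple[str, str]] = []  # stand-in for an SSE stream

    def _set(self, job_id: str, state: str) -> None:
        self.status[job_id] = state
        self.events.append((job_id, state))

    def submit(self, job_id: str, work: Callable[[], Awaitable[None]]) -> None:
        # Returns immediately, so the caller can queue the next job right away.
        self._set(job_id, "queued")
        self._jobs.put_nowait((job_id, work))

    async def run_worker(self) -> None:
        # A single worker means two jobs never compete for the GPU.
        while True:
            job_id, work = await self._jobs.get()
            self._set(job_id, "generating")
            try:
                await work()
                self._set(job_id, "completed")
            except Exception:
                self._set(job_id, "failed")
            finally:
                self._jobs.task_done()

    async def drain(self) -> None:
        await self._jobs.join()


async def demo() -> List[Tuple[str, str]]:
    queue = GenerationQueue()
    worker = asyncio.create_task(queue.run_worker())

    async def ok() -> None:
        await asyncio.sleep(0)

    async def boom() -> None:
        raise RuntimeError("synthesis failed")

    queue.submit("a", ok)
    queue.submit("b", boom)
    await queue.drain()
    worker.cancel()
    return queue.events
```

Because a failed job keeps its ID and status, a retry could simply resubmit the stored request under the same ID without the user re-entering text.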
Generation Versions
Every generation now supports multiple versions with provenance tracking:
- Original — the unprocessed TTS output, always preserved
- Effects versions — apply different effects chains to create new versions from any source
- Takes — regenerate with the same text/voice but a new seed
- Source tracking — each version records which version it was derived from
- Version pinning in stories — pin a specific version to a story track clip
- Favorites — star generations for quick access
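One way to picture source tracking is as a chain of parent links between versions. The data model below is a hypothetical sketch, not the app's actual schema; the field names and integer IDs are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    id: int
    kind: str                        # "original", "effects", or "take"
    source_id: Optional[int] = None  # the version this one was derived from


@dataclass
class Generation:
    versions: Dict[int, Version] = field(default_factory=dict)

    def add(self, kind: str, source_id: Optional[int] = None) -> Version:
        if source_id is not None and source_id not in self.versions:
            raise ValueError(f"unknown source version: {source_id}")
        version = Version(id=len(self.versions) + 1, kind=kind, source_id=source_id)
        self.versions[version.id] = version
        return version

    def lineage(self, version_id: int) -> List[int]:
        """Walk source links back to the root, newest first."""
        chain: List[int] = []
        current: Optional[int] = version_id
        while current is not None:
            chain.append(current)
            current = self.versions[current].source_id
        return chain
```

Since the original is never overwritten, any effects version can be traced back to the unprocessed audio it started from.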
Language Parameter Fix
Qwen TTS models now correctly receive the selected language. The generation form syncs with the voice profile's language setting.
Post-Processing Effects (#271)
A full audio effects system powered by Spotify's pedalboard library. Apply effects after generation, preview in real time, and build reusable presets.
| Effect | Description |
|---|---|
| Pitch Shift | ±12 semitones |
| Reverb | Room size, damping, wet/dry mix |
| Delay | Adjustable time, feedback, mix |
| Chorus / Flanger | Modulated delay — short for metallic, long for lush |
| Compressor | Threshold, ratio, attack, release |
| Gain | -40 to +40 dB |
| High-Pass Filter | Configurable cutoff frequency |
| Low-Pass Filter | Configurable cutoff frequency |
- 4 built-in presets — Robotic, Radio, Echo Chamber, Deep Voice
- Custom presets — create unlimited drag-and-drop effect chains
- Per-profile default effects — assign a chain to a voice profile, auto-applies to every generation
- Live preview — audition effects against existing audio before committing
- Source version selection — apply effects to any version of a generation, not just the latest
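An effects chain can be thought of as an ordered list of steps with bounded parameters. The sketch below validates a chain against the two ranges stated in the table (pitch shift and gain). The effect keys, parameter names, and the example preset contents are made up for illustration, and nothing here calls pedalboard itself.

```python
from typing import Dict, List, Tuple

Effect = Tuple[str, Dict[str, float]]

# Ranges from the table above; the key and parameter names are hypothetical.
LIMITS = {
    ("pitch_shift", "semitones"): (-12.0, 12.0),
    ("gain", "gain_db"): (-40.0, 40.0),
}


def validate_chain(chain: List[Effect]) -> List[str]:
    """Return human-readable problems; an empty list means the chain is valid."""
    problems = []
    for index, (name, params) in enumerate(chain):
        for key, value in params.items():
            bounds = LIMITS.get((name, key))
            if bounds is None:
                continue  # no documented range for this parameter
            low, high = bounds
            if not low <= value <= high:
                problems.append(
                    f"step {index}: {name}.{key}={value} outside [{low}, {high}]"
                )
    return problems


# A hypothetical custom chain: lower the pitch, then add a little room.
example_chain: List[Effect] = [
    ("pitch_shift", {"semitones": -4.0}),
    ("reverb", {"room_size": 0.2}),
]
```

Validating the chain as data before building any audio objects keeps presets easy to store, reorder, and attach to a voice profile.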
Platform Support
Windows Support (#272)
Full Windows support with CUDA GPU detection, cross-platform justfile, and clean server shutdown using taskkill /T for the process tree.
Linux (#262)
Pre-built Linux binaries are not available for this release — the release CI is still broken on Linux and we're working on fixing it. However, this release includes significant Linux improvements that make compiling from source much easier:
- AMD ROCm GPU acceleration with automatic `HSA_OVERRIDE_GFX_VERSION` for unlisted GPUs
- NVIDIA GBM buffer crash fix (#210)
- WebKitGTK microphone access for voice sample recording
- Cross-platform justfile with Linux-specific setup targets
- See the README for build-from-source instructions — we'll ship Linux CI builds as soon as we can
NVIDIA CUDA Backend Swap (#252)
The CPU-only release can download and swap in a CUDA-accelerated backend from within the app. It downloads the backend in split parts to work around GitHub's 2 GB asset limit, verifies SHA-256 checksums, and restarts the server automatically.
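For readers curious how split downloads and checksum verification fit together, here is a minimal sketch that concatenates parts while hashing them. The function name is hypothetical, and whether Voicebox checks each part or only the joined file is not stated here; this version checks the joined file.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def join_and_verify(parts: Iterable[Path], dest: Path, expected_sha256: str) -> None:
    """Concatenate parts into dest; delete dest and raise if the hash differs."""
    digest = hashlib.sha256()
    with dest.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                while True:
                    block = src.read(1024 * 1024)  # stream in 1 MiB blocks
                    if not block:
                        break
                    digest.update(block)
                    out.write(block)
    if digest.hexdigest() != expected_sha256.lower():
        dest.unlink()
        raise ValueError(f"checksum mismatch for {dest.name}")
```

Hashing while copying avoids reading a multi-gigabyte file twice, and removing the output on mismatch ensures a corrupt backend is never swapped in.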
Intel Arc (XPU) and DirectML
The PyTorch backend supports Intel Arc GPUs via IPEX/XPU and any GPU on Windows via DirectML.
Docker + Web Deployment (#161)
Run Voicebox headless:
`docker compose up`

3-stage build, non-root runtime, health checks, persistent model cache. Binds to localhost only by default.
Whisper Turbo
Added openai/whisper-large-v3-turbo as a transcription model option.
Model Management (#268)
- Per-model unload — free GPU memory without deleting downloaded models
- Custom models directory — set `VOICEBOX_MODELS_DIR` to store models anywhere
- Model folder migration — move all models to a new location with progress tracking
- Download cancel/clear UI — cancel in-progress downloads, VS Code-style problems panel for errors (#238)
- Restructured settings UI — server settings and model management split into cleaner sections
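As a small illustration of the custom models directory setting, the sketch below resolves the directory from `VOICEBOX_MODELS_DIR` and falls back to a default. The function name and the fallback path are assumptions; the app's real default location is not given in these notes.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_models_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the models directory, preferring the environment override."""
    env = os.environ if env is None else env
    override = env.get("VOICEBOX_MODELS_DIR", "").strip()
    if override:
        return Path(override).expanduser()
    # Hypothetical fallback, used only for this example.
    return Path.home() / ".voicebox" / "models"
```

Passing the environment in as a parameter keeps the lookup easy to test without touching the real process environment.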
Security & Reliability
- CORS hardening — explicit allowlist of local origins instead of wildcard `*`; extensible via `VOICEBOX_CORS_ORIGINS` (#88)
- Network access toggle — fully disable outbound requests for air-gapped deployments (#133)
- Offline crash fix — Voicebox no longer crashes when HuggingFace is unreachable (#152)
- Atomic audio saves — two-phase write prevents corrupted files on crash or disk-full (#263)
- Filesystem health endpoint — proactive disk space and directory writability checks
- Errno-specific error messages — clear feedback for permission denied, disk full, missing directory
- Chatterbox float64 dtype fix — patches S3Tokenizer and VoiceEncoder to cast float64→float32, preventing crashes on certain audio inputs (#264)
- Watchdog respects keep-server-running — `/watchdog/disable` endpoint prevents the server from shutting down when the app window closes, if configured
- Server shutdown on Windows — clean process tree termination with `taskkill /T` and `os._exit` fallback
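The two-phase write behind atomic audio saves follows a common pattern: write to a temporary file, then rename it over the target. The sketch below shows that general pattern with the standard library; the function name is hypothetical and the project's own implementation may differ in detail.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write data to a temp file beside path, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # phase 1: make sure the bytes reach the disk
        os.replace(tmp_name, path)   # phase 2: atomic rename over the target
    except BaseException:
        # On any failure (for example disk full), leave the old file untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

The temp file is created in the same directory as the target because a rename is only atomic within a single filesystem.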
Accessibility (#243)
- Screen reader support (tested with NVDA/Narrator) across all major UI surfaces
- Keyboard navigation for voice cards, history rows, model management, and story editor
- State-aware `aria-label` attributes on all interactive controls
UI Polish
- Redesigned landing page with animated ControlUI hero, multi-engine copy, model cards, and voice creator section (#274)
- Glassmorphic active state for sidebar buttons with accent border shine
- Voices tab overhaul with inline inspector
- Responsive layout — pointer-events fix on animations, sticky header with scroll fade, horizontal-scroll voice cards on mobile
- Auto-select first story when navigating to Stories tab
- App version shown in sidebar
- Voice card heights normalized
- Audio player title hidden at narrow widths
- Duplicate profile name validation with clear error messages (#175)
- Model loaded icon uses accent-colored CircleCheck; loaded models show size
Developer Experience
- Justfile — streamlined dev setup and workflow (replaces Makefile)
- Cross-platform justfile — works on macOS, Linux, and Windows
- Updated README dev quick start with `just` commands
- Tauri prerequisite docs updated (#215)
Bug Fixes
- Fix generate box overlapping audio player on Stories route
- Fix model management JSX closing tag mismatch
- Fix LuxTTS generation failures and preserve model selection after generate
- Fix download progress tracking for all engines
- Fix `chatterbox-tts` install with `--no-deps` to avoid numpy pin conflict
- Fix WAV format specification for atomic save temp files
- Fix piper-phonemize find-links for LuxTTS install
- Remove unused `TTS_MODE` env var from docker-compose
- Fix Linux release build exceeding 4GB PyInstaller limit
- Fix player not loading new version after applying effects
- Fix window close loop on server shutdown
Downloads
| Platform | File |
|---|---|
| macOS (Apple Silicon) | Voicebox_0.2.1_aarch64.dmg |
| macOS (Intel) | Voicebox_0.2.1_x64.dmg |
| Windows | Voicebox_0.2.1_x64_en-US.msi |
| Linux | Build from source (see README) — CI builds coming soon |
| Docker | `docker compose up` |
Community Contributors
Thanks to everyone who contributed to this release:
- @haosenwang1018 — README grammar fixes (#230)
- @Balneario-de-Cofrentes — CORS origin restriction (#88)
- @ageofalgo — Docker + web deployment (#161)
- @mikeswann — Tauri prerequisite updates (#215)
- @rayl15 — Network access toggle (#133)
- @mpecanha — Offline mode crash fix (#152)
- @ways2read — Accessibility improvements (#243)
- @ieguiguren — Linux NVIDIA GBM buffer fix (#210)
- @Vaibhavee89 — Duplicate profile name validation (#175)
- @pandego — API port documentation fix (#250)
- @luminest-llc — Download cancel/clear UI (#238)
