The best local voice cloning tool just got better.

See the new website: https://voicebox.sh

Released 2026-03-15 as v0.2.1 on GitHub (the version was bumped from v0.2.0 because of an immutable release tag)

Voicebox v0.1.x was a single-engine voice cloning app built around Qwen3-TTS. v0.2.0 is a ground-up rethink: four TTS engines, 23 languages, paralinguistic emotion controls, a post-processing effects pipeline, unlimited generation length, an async generation queue, and support for every major GPU vendor. Plus Docker.

New TTS Engines

Multi-Engine Architecture

Voicebox now runs four independent TTS engines behind a thread-safe per-engine backend registry. Switch engines per-generation from a single dropdown — no restart required.
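As a rough illustration of what a thread-safe per-engine registry can look like, here is a minimal sketch. The class name `BackendRegistry`, its methods, and the lazy-loading behavior are assumptions made for this example, not Voicebox's actual code.

```python
import threading
from typing import Callable, Dict


class BackendRegistry:
    """Hypothetical registry: one lazily created backend per engine name."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._backends: Dict[str, object] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._registry_lock = threading.Lock()  # guards the dictionaries above

    def register(self, engine: str, factory: Callable[[], object]) -> None:
        with self._registry_lock:
            self._factories[engine] = factory
            self._locks[engine] = threading.Lock()

    def get(self, engine: str) -> object:
        with self._registry_lock:
            if engine not in self._factories:
                raise KeyError(f"unknown engine: {engine}")
            lock = self._locks[engine]
        # Only callers asking for the same engine wait on each other here,
        # so loading one model does not block requests for another.
        with lock:
            if engine not in self._backends:
                self._backends[engine] = self._factories[engine]()
            return self._backends[engine]
```

With this shape, switching engines per generation is just a different key passed to `get`, and each backend is constructed at most once.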

Engine	Languages	Size	Key Strengths
Qwen3-TTS 1.7B	10	~3.5 GB	Highest quality, delivery instructions ("speak slowly", "whisper")
Qwen3-TTS 0.6B	10	~1.2 GB	Lighter, faster variant
LuxTTS	English	~300 MB	CPU-friendly, 48 kHz output, 150x realtime
Chatterbox Multilingual	23	~3.2 GB	Broadest language coverage, zero-shot cloning
Chatterbox Turbo	English	~1.5 GB	350M params, low latency, paralinguistic tags

Chatterbox Multilingual — 23 Languages (#257)

Zero-shot voice cloning in Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, and Turkish. The language dropdown dynamically filters to show only languages supported by the selected engine.
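The dropdown filtering can be pictured as a lookup from engine to supported languages. The sketch below is illustrative only: the mapping name, the function `dropdown_options`, and the fallback to the first supported language are assumptions, and Qwen3-TTS is left out because its 10 languages are not listed in these notes.

```python
from typing import Dict, List, Optional, Tuple

CHATTERBOX_MULTILINGUAL = [
    "Arabic", "Chinese", "Danish", "Dutch", "English", "Finnish", "French",
    "German", "Greek", "Hebrew", "Hindi", "Italian", "Japanese", "Korean",
    "Malay", "Norwegian", "Polish", "Portuguese", "Russian", "Spanish",
    "Swahili", "Swedish", "Turkish",
]

# Hypothetical mapping built from the engine table in these notes.
ENGINE_LANGUAGES: Dict[str, List[str]] = {
    "LuxTTS": ["English"],
    "Chatterbox Turbo": ["English"],
    "Chatterbox Multilingual": CHATTERBOX_MULTILINGUAL,
}


def dropdown_options(engine: str, current: str) -> Tuple[List[str], Optional[str]]:
    """Return the languages to show and the language that stays selected."""
    options = ENGINE_LANGUAGES.get(engine, [])
    if current in options:
        return options, current
    # Assumption: fall back to the first supported language when the current
    # selection is not available for the newly selected engine.
    return options, (options[0] if options else None)
```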

LuxTTS — Lightweight English TTS (#254)

A fast, CPU-friendly English engine. ~300 MB download, 48 kHz output, runs at 150x realtime on CPU. Good for quick drafts and machines without a GPU.

Chatterbox Turbo — Expressive English (#258)

A fast 350M-parameter English model with inline paralinguistic tags.

Paralinguistic Tags Autocomplete (#265)

Type / in the text input with Chatterbox Turbo selected to open an autocomplete for 9 expressive tags that the model synthesizes inline with speech:

[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]

Tags render as inline badges in a rich text editor and serialize cleanly to the API.
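One way to think about the badge-to-text round trip is a tokenizer that separates the nine tags from plain text and a serializer that joins them back. This is a sketch under assumptions: the function names and the `(kind, value)` token shape are hypothetical, not the editor's real data model.

```python
import re
from typing import List, Tuple

TAGS = ("laugh", "chuckle", "gasp", "cough", "sigh",
        "groan", "sniff", "shush", "clear throat")
_TAG_RE = re.compile(r"\[(" + "|".join(re.escape(t) for t in TAGS) + r")\]")

Token = Tuple[str, str]  # ("text", "...") or ("tag", "sigh")


def tokenize(text: str) -> List[Token]:
    """Split input into plain-text runs and recognized tag tokens."""
    tokens: List[Token] = []
    pos = 0
    for match in _TAG_RE.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("tag", match.group(1)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def serialize(tokens: List[Token]) -> str:
    """Turn tokens back into the bracketed string sent to the API."""
    return "".join(f"[{value}]" if kind == "tag" else value
                   for kind, value in tokens)
```

Bracketed words that are not one of the nine tags stay as plain text, so `serialize(tokenize(s))` returns `s` unchanged.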

Generation

Unlimited Generation Length — Auto-Chunking (#266)

Long text is now automatically split at sentence boundaries, generated per-chunk, and crossfaded back together. Engine-agnostic — works with all four engines.

Auto-chunking limit slider — 100–5,000 chars (default 800)
Crossfade slider — 0–200ms (default 50ms), or 0 for a hard cut
Max text length raised to 50,000 characters
Smart splitting respects abbreviations (Dr., e.g., a.m.), CJK punctuation, and never breaks inside [tags]
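The splitting and crossfading steps above can be sketched in plain Python. Everything here is an approximation for illustration: the function names, the short abbreviation list, and the linear fade are assumptions, and audio is modeled as a list of float samples rather than whatever buffer type Voicebox uses.

```python
from typing import List

# Only the abbreviations named in these notes; a real list would be longer.
ABBREVIATIONS = {"dr.", "e.g.", "a.m."}


def split_sentences(text: str) -> List[str]:
    sentences, start, depth = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(0, depth - 1)
        if depth:
            continue  # never split inside a [tag]
        cjk_end = ch in "。！？"
        latin_end = ch in ".!?" and (i + 1 == len(text) or text[i + 1].isspace())
        if not (cjk_end or latin_end):
            continue
        if text[start:i + 1].split()[-1].lower() in ABBREVIATIONS:
            continue  # "Dr." or "a.m." does not end a sentence
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def chunk_text(text: str, limit: int = 800) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `limit` chars.

    A single sentence longer than `limit` is kept as one oversized chunk,
    and sentences are rejoined with a space (a simplification for CJK text).
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join two sample lists with a linear fade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b  # hard cut
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming chunk
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[:-overlap] + mixed + b[overlap:]
```

To map the slider value to samples, the overlap would be `int(sample_rate * crossfade_ms / 1000)`; at 0 ms the function falls through to the hard-cut branch.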

Asynchronous Generation Queue (#269)

Generation is now fully non-blocking. Submit a generation and start typing the next one immediately.

Serial execution queue prevents GPU contention
Real-time SSE status streaming (generating → completed / failed)
Failed generations can be retried without re-entering text
Stale generations from crashes are auto-recovered on startup
Generating status pill shown inline in the story editor
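A serial, non-blocking queue of this kind can be approximated with `asyncio`: submissions return immediately, and one worker processes jobs in order. The sketch below is a simplified model under assumptions. `GenerationQueue`, its methods, and the `events` list (standing in for the SSE stream) are hypothetical names, and crash recovery is not shown.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class GenerationQueue:
    """Hypothetical single-worker queue; create it inside a running loop."""

    def __init__(self, synthesize: Callable[[str], Awaitable[bytes]]) -> None:
        self._synthesize = synthesize
        self._jobs: asyncio.Queue = asyncio.Queue()
        self._texts: Dict[int, str] = {}
        self.status: Dict[int, str] = {}
        self.events: List[Tuple[int, str]] = []  # stand-in for SSE messages

    def submit(self, text: str) -> int:
        job_id = len(self._texts)
        self._texts[job_id] = text
        self._jobs.put_nowait(job_id)  # returns immediately, never blocks
        return job_id

    def retry(self, job_id: int) -> None:
        # The stored text is reused, so nothing has to be re-entered.
        if self.status.get(job_id) == "failed":
            self._jobs.put_nowait(job_id)

    def _set(self, job_id: int, state: str) -> None:
        self.status[job_id] = state
        self.events.append((job_id, state))

    async def join(self) -> None:
        await self._jobs.join()

    async def worker(self) -> None:
        # One worker means one generation at a time on the GPU.
        while True:
            job_id = await self._jobs.get()
            self._set(job_id, "generating")
            try:
                await self._synthesize(self._texts[job_id])
                self._set(job_id, "completed")
            except Exception:
                self._set(job_id, "failed")
            finally:
                self._jobs.task_done()


async def demo() -> List[Tuple[int, str]]:
    async def fake_synthesize(text: str) -> bytes:
        await asyncio.sleep(0)
        if text == "boom":
            raise RuntimeError("engine error")
        return text.encode()

    queue = GenerationQueue(fake_synthesize)
    task = asyncio.create_task(queue.worker())
    queue.submit("hello")
    queue.submit("boom")
    await queue.join()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return queue.events
```

Running `asyncio.run(demo())` yields the status transitions in submission order, with the second job ending in `failed` and remaining eligible for `retry`.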

Generation Versions

Every generation now supports multiple versions with provenance tracking:

Original — the unprocessed TTS output, always preserved
Effects versions — apply different effects chains to create new versions from any source
Takes — regenerate with the same text/voice but a new seed
Source tracking — each version records which version it was derived from
Version pinning in stories — pin a specific version to a story track clip
Favorites — star generations for quick access
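Source tracking can be modeled as each version holding the id of the version it was derived from. The following is a minimal sketch with hypothetical class and field names, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    id: int
    kind: str                        # "original", "effects", or "take"
    source_id: Optional[int] = None  # version this one was derived from


@dataclass
class Generation:
    versions: Dict[int, Version] = field(default_factory=dict)

    def add(self, kind: str, source_id: Optional[int] = None) -> Version:
        if source_id is not None and source_id not in self.versions:
            raise KeyError(f"unknown source version: {source_id}")
        version = Version(id=len(self.versions), kind=kind, source_id=source_id)
        self.versions[version.id] = version
        return version

    def lineage(self, version_id: int) -> List[int]:
        """Walk back from a version to the one it ultimately came from."""
        chain: List[int] = []
        current: Optional[int] = version_id
        while current is not None:
            chain.append(current)
            current = self.versions[current].source_id
        return chain
```

Because effects versions point at their source rather than replacing it, the original output is never overwritten, and a story clip can pin any id in the chain.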

Language Parameter Fix

Qwen TTS models now correctly receive the selected language. The generation form syncs with the voice profile's language setting.

Post-Processing Effects (#271)

A full audio effects system powered by Spotify's pedalboard library. Apply effects after generation, preview in real time, and build reusable presets.

Effect	Description
Pitch Shift	±12 semitones
Reverb	Room size, damping, wet/dry mix
Delay	Adjustable time, feedback, mix
Chorus / Flanger	Modulated delay — short for metallic, long for lush
Compressor	Threshold, ratio, attack, release
Gain	-40 to +40 dB
High-Pass Filter	Configurable cutoff frequency
Low-Pass Filter	Configurable cutoff frequency

4 built-in presets — Robotic, Radio, Echo Chamber, Deep Voice
Custom presets — create unlimited drag-and-drop effect chains
Per-profile default effects — assign a chain to a voice profile, auto-applies to every generation
Live preview — audition effects against existing audio before committing
Source version selection — apply effects to any version of a generation, not just the latest
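A preset is essentially an ordered list of effects with parameters. As a hedged sketch, a chain could be stored as plain data and checked against the ranges in the table before it is handed to pedalboard. The effect type strings, parameter names, and `validate_chain` function are assumptions for this example; only the ±12 semitone and -40 to +40 dB limits come from these notes.

```python
from typing import Dict, List, Tuple

KNOWN_EFFECTS = {
    "pitch_shift", "reverb", "delay", "chorus",
    "compressor", "gain", "high_pass", "low_pass",
}

# (effect type, parameter) -> (min, max); names are hypothetical.
LIMITS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("pitch_shift", "semitones"): (-12.0, 12.0),
    ("gain", "gain_db"): (-40.0, 40.0),
}


def validate_chain(chain: List[dict]) -> List[str]:
    """Return a list of problems; an empty list means the chain is usable."""
    errors: List[str] = []
    for index, effect in enumerate(chain):
        kind = effect.get("type")
        if kind not in KNOWN_EFFECTS:
            errors.append(f"effect {index}: unknown type {kind!r}")
            continue
        for name, value in effect.get("params", {}).items():
            bounds = LIMITS.get((kind, name))
            if bounds and not bounds[0] <= value <= bounds[1]:
                errors.append(
                    f"effect {index}: {name}={value} is outside "
                    f"{bounds[0]}..{bounds[1]}"
                )
    return errors
```

Keeping presets as ordered data makes drag-and-drop reordering a list operation and lets the same chain be attached to a voice profile as its default.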

Platform Support

Windows Support (#272)

Full Windows support with CUDA GPU detection, cross-platform justfile, and clean server shutdown using taskkill /T for the process tree.

Linux (#262)

Pre-built Linux binaries are not available for this release — the release CI is still broken on Linux and we're working on fixing it. However, this release includes significant Linux improvements that make compiling from source much easier:

AMD ROCm GPU acceleration with automatic HSA_OVERRIDE_GFX_VERSION for unlisted GPUs
NVIDIA GBM buffer crash fix (#210)
WebKitGTK microphone access for voice sample recording
Cross-platform justfile with Linux-specific setup targets
See the README for build-from-source instructions — we'll ship Linux CI builds as soon as we can

NVIDIA CUDA Backend Swap (#252)

The CPU-only release can download and swap in a CUDA-accelerated backend from within the app. It downloads the backend in split parts to work around GitHub's 2 GB asset limit, verifies SHA-256 checksums, and restarts the server automatically.
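Reassembling split parts and checking a SHA-256 digest can be done in a single streaming pass. The sketch below shows the general technique only. The function name, the `.partial` suffix, and the choice to verify one checksum over the joined file are assumptions; the notes do not say whether Voicebox verifies per part or per assembled file.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def join_and_verify(parts: Iterable[Path], dest: Path, expected_sha256: str) -> None:
    """Concatenate downloaded parts into dest, keeping it only if the hash matches."""
    digest = hashlib.sha256()
    tmp = dest.with_suffix(dest.suffix + ".partial")
    with tmp.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream in 1 MiB blocks so multi-gigabyte files never sit in RAM.
                while chunk := src.read(1 << 20):
                    digest.update(chunk)
                    out.write(chunk)
    if digest.hexdigest() != expected_sha256.lower():
        tmp.unlink()
        raise ValueError("checksum mismatch, download discarded")
    tmp.replace(dest)  # only a verified file ever appears at dest
```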

Intel Arc (XPU) and DirectML

The PyTorch backend supports Intel Arc GPUs via IPEX/XPU, and any GPU on Windows via DirectML.

Docker + Web Deployment (#161)

Run Voicebox headless:

docker compose up

3-stage build, non-root runtime, health checks, persistent model cache. Binds to localhost only by default.

Whisper Turbo

Added openai/whisper-large-v3-turbo as a transcription model option.

Model Management (#268)

Per-model unload — free GPU memory without deleting downloaded models
Custom models directory — set VOICEBOX_MODELS_DIR to store models anywhere
Model folder migration — move all models to a new location with progress tracking
Download cancel/clear UI — cancel in-progress downloads, VS Code-style problems panel for errors (#238)
Restructured settings UI — server settings and model management split into cleaner sections

Security & Reliability

CORS hardening — explicit allowlist of local origins instead of wildcard *; extensible via VOICEBOX_CORS_ORIGINS (#88)
Network access toggle — fully disable outbound requests for air-gapped deployments (#133)
Offline crash fix — Voicebox no longer crashes when HuggingFace is unreachable (#152)
Atomic audio saves — two-phase write prevents corrupted files on crash or disk-full (#263)
Filesystem health endpoint — proactive disk space and directory writability checks
Errno-specific error messages — clear feedback for permission denied, disk full, missing directory
Chatterbox float64 dtype fix — patches S3Tokenizer and VoiceEncoder to cast float64→float32, preventing crashes on certain audio inputs (#264)
Watchdog respects keep-server-running — /watchdog/disable endpoint prevents the server from shutting down when the app window closes, if configured
Server shutdown on Windows — clean process tree termination with taskkill /T and os._exit fallback
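For the atomic audio saves in the list above, a common two-phase pattern is to write to a temporary file in the same directory and then rename it over the target. This is a generic sketch of that pattern, not the code from #263; the function name `atomic_write` and the `.tmp` suffix are assumptions.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write(dest: Path, data: bytes) -> None:
    """Phase 1: write a temp file. Phase 2: rename it over the target."""
    # The temp file lives next to dest so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_name, dest)  # atomic swap on POSIX and Windows
    except BaseException:
        # On disk-full or a crash mid-write, the old file at dest is untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A reader therefore sees either the previous complete file or the new complete file, never a half-written one.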

Accessibility (#243)

Screen reader support (tested with NVDA/Narrator) across all major UI surfaces
Keyboard navigation for voice cards, history rows, model management, and story editor
State-aware aria-label attributes on all interactive controls

UI Polish

Redesigned landing page with animated ControlUI hero, multi-engine copy, model cards, and voice creator section (#274)
Glassmorphic active state for sidebar buttons with accent border shine
Voices tab overhaul with inline inspector
Responsive layout — pointer-events fix on animations, sticky header with scroll fade, horizontal-scroll voice cards on mobile
Auto-select first story when navigating to Stories tab
App version shown in sidebar
Voice card heights normalized
Audio player title hidden at narrow widths
Duplicate profile name validation with clear error messages (#175)
Model loaded icon uses accent-colored CircleCheck; loaded models show size

Developer Experience

Justfile — streamlined dev setup and workflow (replaces Makefile)
Cross-platform justfile — works on macOS, Linux, and Windows
Updated README dev quick start with just commands
Tauri prerequisite docs updated (#215)

Bug Fixes

Fix generate box overlapping audio player on Stories route
Fix model management JSX closing tag mismatch
Fix LuxTTS generation failures and preserve model selection after generate
Fix download progress tracking for all engines
Fix chatterbox-tts install with --no-deps to avoid numpy pin conflict
Fix WAV format specification for atomic save temp files
Fix piper-phonemize find-links for LuxTTS install
Remove unused TTS_MODE env var from docker-compose
Fix Linux release build exceeding 4GB PyInstaller limit
Fix player not loading new version after applying effects
Fix window close loop on server shutdown

Downloads

Platform	File
macOS (Apple Silicon)	`Voicebox_0.2.1_aarch64.dmg`
macOS (Intel)	`Voicebox_0.2.1_x64.dmg`
Windows	`Voicebox_0.2.1_x64_en-US.msi`
Linux	Build from source (see README) — CI builds coming soon
Docker	`docker compose up`

Community Contributors

Thanks to everyone who contributed to this release:

@haosenwang1018 — README grammar fixes (#230)
@Balneario-de-Cofrentes — CORS origin restriction (#88)
@ageofalgo — Docker + web deployment (#161)
@mikeswann — Tauri prerequisite updates (#215)
@rayl15 — Network access toggle (#133)
@mpecanha — Offline mode crash fix (#152)
@ways2read — Accessibility improvements (#243)
@ieguiguren — Linux NVIDIA GBM buffer fix (#210)
@Vaibhavee89 — Duplicate profile name validation (#175)
@pandego — API port documentation fix (#250)
@luminest-llc — Download cancel/clear UI (#238)

jamiepine/voicebox v0.2.1 on GitHub