github jamiepine/voicebox v0.2.1
voicebox v0.2.1


The best local voice cloning tool just got better.

See the new website: https://voicebox.sh

Released 2026-03-15 as v0.2.1 on GitHub (the version was bumped from v0.2.0 because of an immutable release tag on GitHub).

Voicebox v0.1.x was a single-engine voice cloning app built around Qwen3-TTS. v0.2.0 is a ground-up rethink: four TTS engines, 23 languages, paralinguistic emotion controls, a post-processing effects pipeline, unlimited generation length, an async generation queue, and support for every major GPU vendor. Plus Docker.


New TTS Engines

Multi-Engine Architecture

Voicebox now runs four independent TTS engines behind a thread-safe per-engine backend registry. Switch engines per-generation from a single dropdown — no restart required.
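As an illustration of what a thread-safe per-engine registry can look like, here is a minimal sketch. It is not Voicebox's actual code: the class name `BackendRegistry`, its methods, and the factory callables are all hypothetical. The idea it shows is one lock per engine, so loading one engine does not block requests for another.

```python
import threading
from typing import Callable, Dict


class BackendRegistry:
    """Hypothetical registry: lazily creates one backend per engine name."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._backends: Dict[str, object] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._registry_lock = threading.Lock()  # guards the dictionaries

    def register(self, engine: str, factory: Callable[[], object]) -> None:
        with self._registry_lock:
            self._factories[engine] = factory
            self._locks[engine] = threading.Lock()

    def get(self, engine: str) -> object:
        with self._registry_lock:
            if engine not in self._factories:
                raise KeyError(f"unknown engine: {engine}")
            engine_lock = self._locks[engine]
            factory = self._factories[engine]
        # Only this engine's lock is held during the (possibly slow) load.
        with engine_lock:
            with self._registry_lock:
                backend = self._backends.get(engine)
            if backend is None:
                backend = factory()
                with self._registry_lock:
                    self._backends[engine] = backend
            return backend
```

Switching engines per generation then amounts to calling `get()` with a different name; a backend is built on first use and reused afterwards.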

| Engine | Languages | Size | Key Strengths |
| --- | --- | --- | --- |
| Qwen3-TTS 1.7B | 10 | ~3.5 GB | Highest quality, delivery instructions ("speak slowly", "whisper") |
| Qwen3-TTS 0.6B | 10 | ~1.2 GB | Lighter, faster variant |
| LuxTTS | English | ~300 MB | CPU-friendly, 48 kHz output, 150x realtime |
| Chatterbox Multilingual | 23 | ~3.2 GB | Broadest language coverage, zero-shot cloning |
| Chatterbox Turbo | English | ~1.5 GB | 350M params, low latency, paralinguistic tags |

Chatterbox Multilingual — 23 Languages (#257)

Zero-shot voice cloning in Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, and Turkish. The language dropdown dynamically filters to show only languages supported by the selected engine.

LuxTTS — Lightweight English TTS (#254)

A fast, CPU-friendly English engine. ~300 MB download, 48 kHz output, runs at 150x realtime on CPU. Good for quick drafts and machines without a GPU.

Chatterbox Turbo — Expressive English (#258)

A fast 350M-parameter English model with inline paralinguistic tags.

Paralinguistic Tags Autocomplete (#265)

Type / in the text input with Chatterbox Turbo selected to open an autocomplete for 9 expressive tags that the model synthesizes inline with speech:

[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]

Tags render as inline badges in a rich text editor and serialize cleanly to the API.
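To make the tag handling concrete, the following sketch splits input text into plain-text and tag segments and filters tag suggestions by prefix. The tag list comes from the notes above; the function names `split_tags` and `suggest` and the segment format are assumptions for illustration, not the app's real serialization.

```python
import re
from typing import List, Tuple

# The nine tags listed above for Chatterbox Turbo.
TAGS = ("laugh", "chuckle", "gasp", "cough", "sigh",
        "groan", "sniff", "shush", "clear throat")
TAG_RE = re.compile(r"\[(" + "|".join(re.escape(t) for t in TAGS) + r")\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into ("text", ...) and ("tag", ...) segments, in order."""
    parts: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            parts.append(("text", text[pos:match.start()]))
        parts.append(("tag", match.group(1)))
        pos = match.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts


def suggest(prefix: str) -> List[str]:
    """Tags that start with what the user typed after '/'."""
    return [tag for tag in TAGS if tag.startswith(prefix.lower())]
```

Bracketed words that are not in the tag list are left as ordinary text, so unknown tags are never sent to the model as tags.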


Generation

Unlimited Generation Length — Auto-Chunking (#266)

Long text is now automatically split at sentence boundaries, generated per-chunk, and crossfaded back together. Engine-agnostic — works with all four engines.

  • Auto-chunking limit slider — 100–5,000 chars (default 800)
  • Crossfade slider — 0–200ms (default 50ms), or 0 for a hard cut
  • Max text length raised to 50,000 characters
  • Smart splitting respects abbreviations (Dr., e.g., a.m.), CJK punctuation, and never breaks inside [tags]
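The chunking and crossfade steps above can be sketched as follows. This is an assumption-laden illustration, not the shipped splitter: the abbreviation set is a small sample, the function names are hypothetical, and audio is modeled as plain lists of float samples.

```python
import re
from typing import List

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "a.m.", "p.m."}  # sample only
# A boundary is ASCII end punctuation plus whitespace, or a CJK full stop/mark.
BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text: str) -> List[str]:
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        if not candidate.strip():
            continue
        last_word = candidate.split()[-1].lower()
        inside_tag = candidate.count("[") > candidate.count("]")
        if last_word in ABBREVIATIONS or inside_tag:
            continue  # not a real boundary, keep extending the sentence
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def chunk_text(text: str, limit: int = 800) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters.

    A single sentence longer than `limit` becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + 1 + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join two sample lists with a linear crossfade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b  # hard cut
    mixed = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)
        mixed.append(a[len(a) - overlap + i] * (1 - weight) + b[i] * weight)
    return a[:-overlap] + mixed + b[overlap:]
```

The overlap is given in samples, so a 50 ms crossfade at 48 kHz corresponds to `overlap=2400`.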

Asynchronous Generation Queue (#269)

Generation is now fully non-blocking. Submit a generation and start typing the next one immediately.

  • Serial execution queue prevents GPU contention
  • Real-time SSE status streaming (generating → completed / failed)
  • Failed generations can be retried without re-entering text
  • Stale generations from crashes are auto-recovered on startup
  • Generating status pill shown inline in the story editor
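A serial, non-blocking queue of this kind can be approximated with `asyncio`. The sketch below is a simplified model under stated assumptions: `GenerationQueue`, `fake_tts`, and the in-memory `events` list (standing in for an SSE stream) are hypothetical names, and retry and crash recovery are left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class GenerationQueue:
    """Hypothetical queue: submit() returns at once, one worker runs jobs in order."""

    def __init__(self, synthesize: Callable[[str], Awaitable[bytes]]) -> None:
        self._synthesize = synthesize
        self._jobs: asyncio.Queue = asyncio.Queue()
        self.status: Dict[int, str] = {}
        self.events: List[Tuple[int, str]] = []  # stand-in for SSE messages
        self._next_id = 0

    def _set(self, job_id: int, state: str) -> None:
        self.status[job_id] = state
        self.events.append((job_id, state))

    def submit(self, text: str) -> int:
        job_id = self._next_id
        self._next_id += 1
        self._set(job_id, "generating")
        self._jobs.put_nowait((job_id, text))
        return job_id

    async def worker(self) -> None:
        # A single worker means one job at a time, which avoids GPU contention.
        while True:
            job_id, text = await self._jobs.get()
            try:
                await self._synthesize(text)
                self._set(job_id, "completed")
            except Exception:
                self._set(job_id, "failed")
            finally:
                self._jobs.task_done()

    async def join(self) -> None:
        await self._jobs.join()


async def demo() -> Dict[int, str]:
    async def fake_tts(text: str) -> bytes:
        if not text:
            raise ValueError("empty text")
        await asyncio.sleep(0)
        return b""

    queue = GenerationQueue(fake_tts)
    worker = asyncio.create_task(queue.worker())
    queue.submit("hello")
    queue.submit("")  # fails inside the worker, not at submit time
    await queue.join()
    worker.cancel()
    return queue.status
```

Running `asyncio.run(demo())` shows that a failed job does not stop the worker; later jobs still run.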

Generation Versions

Every generation now supports multiple versions with provenance tracking:

  • Original — the unprocessed TTS output, always preserved
  • Effects versions — apply different effects chains to create new versions from any source
  • Takes — regenerate with the same text/voice but a new seed
  • Source tracking — each version records which version it was derived from
  • Version pinning in stories — pin a specific version to a story track clip
  • Favorites — star generations for quick access
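One way to picture source tracking is a small tree of versions in which each entry points at the version it came from. The data model below is purely illustrative; the `Version` and `Generation` classes and their fields are assumptions, not Voicebox's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    id: int
    kind: str                  # "original", "effects", or "take"
    source_id: Optional[int]   # version this one was derived from, if any


@dataclass
class Generation:
    text: str
    versions: Dict[int, Version] = field(default_factory=dict)

    def add(self, kind: str, source_id: Optional[int] = None) -> Version:
        if source_id is not None and source_id not in self.versions:
            raise KeyError(f"unknown source version: {source_id}")
        version = Version(len(self.versions), kind, source_id)
        self.versions[version.id] = version
        return version

    def lineage(self, version_id: int) -> List[int]:
        """Walk back from a version to the root it was derived from."""
        chain: List[int] = []
        current: Optional[int] = version_id
        while current is not None:
            chain.append(current)
            current = self.versions[current].source_id
        return chain
```

Pinning a version to a story clip would then only need to store the version's `id`, since the original output is never overwritten.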

Language Parameter Fix

Qwen TTS models now correctly receive the selected language. The generation form syncs with the voice profile's language setting.


Post-Processing Effects (#271)

A full audio effects system powered by Spotify's pedalboard library. Apply effects after generation, preview in real time, and build reusable presets.

| Effect | Description |
| --- | --- |
| Pitch Shift | ±12 semitones |
| Reverb | Room size, damping, wet/dry mix |
| Delay | Adjustable time, feedback, mix |
| Chorus / Flanger | Modulated delay — short for metallic, long for lush |
| Compressor | Threshold, ratio, attack, release |
| Gain | -40 to +40 dB |
| High-Pass Filter | Configurable cutoff frequency |
| Low-Pass Filter | Configurable cutoff frequency |

  • 4 built-in presets — Robotic, Radio, Echo Chamber, Deep Voice
  • Custom presets — create unlimited drag-and-drop effect chains
  • Per-profile default effects — assign a chain to a voice profile, auto-applies to every generation
  • Live preview — audition effects against existing audio before committing
  • Source version selection — apply effects to any version of a generation, not just the latest
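To show how a preset might be represented as data, here is a standalone sketch that clamps parameters to the ranges stated in the effects table and applies a gain in decibels. It deliberately does not call pedalboard; the dictionary layout, key names, and function names are assumptions for illustration only.

```python
from typing import Dict, List

# Ranges taken from the effects table; effects without a stated range are not checked.
LIMITS = {
    "pitch_shift": ("semitones", -12.0, 12.0),
    "gain": ("db", -40.0, 40.0),
}


def validate_chain(chain: List[Dict]) -> List[Dict]:
    """Return a copy of the chain with range-limited parameters clamped."""
    checked = []
    for effect in chain:
        effect = dict(effect)  # do not mutate the caller's preset
        limit = LIMITS.get(effect["type"])
        if limit:
            key, low, high = limit
            effect[key] = max(low, min(high, float(effect[key])))
        checked.append(effect)
    return checked


def apply_gain(samples: List[float], db: float) -> List[float]:
    """Scale samples by a decibel amount (amplitude ratio = 10 ** (dB / 20))."""
    factor = 10 ** (db / 20)
    return [sample * factor for sample in samples]
```

Because a chain is an ordered list, reordering entries changes the processing order, which is what a drag-and-drop chain editor would manipulate.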

Platform Support

Windows Support (#272)

Full Windows support with CUDA GPU detection, cross-platform justfile, and clean server shutdown using taskkill /T for the process tree.
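For readers unfamiliar with `taskkill /T`, the sketch below builds such a command and uses it to stop a child process tree on Windows, falling back to `terminate()` elsewhere. The helper names are hypothetical and this is not the app's shutdown code; `/PID`, `/T`, and `/F` are standard `taskkill` switches.

```python
import subprocess
import sys
from typing import List


def tree_kill_command(pid: int, force: bool = False) -> List[str]:
    """Build a taskkill call; /T includes child processes, /F forces termination."""
    command = ["taskkill", "/PID", str(pid), "/T"]
    if force:
        command.append("/F")
    return command


def stop_server(process: subprocess.Popen) -> None:
    """Stop a server process and, on Windows, the children it spawned."""
    if sys.platform == "win32":
        subprocess.run(tree_kill_command(process.pid), check=False)
    else:
        process.terminate()
```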

Linux (#262)

Pre-built Linux binaries are not available for this release — the release CI is still broken on Linux and we're working on fixing it. However, this release includes significant Linux improvements that make compiling from source much easier:

  • AMD ROCm GPU acceleration with automatic HSA_OVERRIDE_GFX_VERSION for unlisted GPUs
  • NVIDIA GBM buffer crash fix (#210)
  • WebKitGTK microphone access for voice sample recording
  • Cross-platform justfile with Linux-specific setup targets
  • See the README for build-from-source instructions — we'll ship Linux CI builds as soon as we can

NVIDIA CUDA Backend Swap (#252)

The CPU-only release can download and swap in a CUDA-accelerated backend from within the app. It downloads the backend in split parts to work around GitHub's 2 GB asset limit, verifies SHA-256 checksums, and restarts the server automatically.
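A rough sketch of reassembling split parts and checking a SHA-256 digest follows. It assumes a single checksum over the joined file, which may differ from how Voicebox verifies its parts; the function name and the `.partial` suffix are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def join_and_verify(parts: Iterable[Path], dest: Path, expected_sha256: str) -> None:
    """Concatenate parts into dest, keeping it only if the SHA-256 matches."""
    digest = hashlib.sha256()
    partial = dest.with_name(dest.name + ".partial")
    with open(partial, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                while True:
                    block = src.read(1024 * 1024)  # stream in 1 MiB blocks
                    if not block:
                        break
                    digest.update(block)
                    out.write(block)
    if digest.hexdigest() != expected_sha256.lower():
        partial.unlink()
        raise ValueError("checksum mismatch, download discarded")
    partial.replace(dest)  # only a verified file appears at the final path
```

Hashing while copying avoids reading the joined file a second time, which matters for multi-gigabyte backends.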

Intel Arc (XPU) and DirectML

The PyTorch backend supports Intel Arc GPUs via IPEX/XPU and any GPU on Windows via DirectML.

Docker + Web Deployment (#161)

Run Voicebox headless:

docker compose up

3-stage build, non-root runtime, health checks, persistent model cache. Binds to localhost only by default.

Whisper Turbo

Added openai/whisper-large-v3-turbo as a transcription model option.


Model Management (#268)

  • Per-model unload — free GPU memory without deleting downloaded models
  • Custom models directory — set VOICEBOX_MODELS_DIR to store models anywhere
  • Model folder migration — move all models to a new location with progress tracking
  • Download cancel/clear UI — cancel in-progress downloads, VS Code-style problems panel for errors (#238)
  • Restructured settings UI — server settings and model management split into cleaner sections
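The custom models directory can be pictured as a simple environment lookup. Only the variable name `VOICEBOX_MODELS_DIR` comes from the notes above; the function name and the fallback path are placeholders, not the app's real default.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_models_dir(env: Optional[Mapping[str, str]] = None,
                       default: Optional[Path] = None) -> Path:
    """Prefer VOICEBOX_MODELS_DIR when set, otherwise fall back to a default."""
    env = os.environ if env is None else env
    override = env.get("VOICEBOX_MODELS_DIR", "").strip()
    if override:
        return Path(override).expanduser()
    # Hypothetical fallback location, used only for this illustration.
    return default or Path.home() / ".cache" / "voicebox" / "models"
```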

Security & Reliability

  • CORS hardening — explicit allowlist of local origins instead of wildcard *; extensible via VOICEBOX_CORS_ORIGINS (#88)
  • Network access toggle — fully disable outbound requests for air-gapped deployments (#133)
  • Offline crash fix — Voicebox no longer crashes when HuggingFace is unreachable (#152)
  • Atomic audio saves — two-phase write prevents corrupted files on crash or disk-full (#263)
  • Filesystem health endpoint — proactive disk space and directory writability checks
  • Errno-specific error messages — clear feedback for permission denied, disk full, missing directory
  • Chatterbox float64 dtype fix — patches S3Tokenizer and VoiceEncoder to cast float64→float32, preventing crashes on certain audio inputs (#264)
  • Watchdog respects keep-server-running: the /watchdog/disable endpoint prevents the server from shutting down when the app window closes, if configured
  • Server shutdown on Windows — clean process tree termination with taskkill /T and os._exit fallback
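The two-phase write and errno-specific messages mentioned in the list above can be sketched with the standard library. This is a generic pattern rather than Voicebox's implementation; the function names and message wording are assumptions.

```python
import errno
import os
import tempfile

MESSAGES = {
    errno.EACCES: "Permission denied",
    errno.ENOSPC: "Disk is full",
    errno.ENOENT: "Directory does not exist",
}


def explain(err: OSError) -> str:
    """Map common errno values to a short, user-facing message."""
    return MESSAGES.get(err.errno, f"Unexpected I/O error: {err.strerror}")


def atomic_write(path: str, data: bytes) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    # Phase 1: write everything to a temp file in the same directory.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # Phase 2: swap the finished file into place in one step.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If the process dies or the disk fills during phase 1, the destination is untouched, so a reader sees either the old file or the complete new one.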

Accessibility (#243)

  • Screen reader support (tested with NVDA/Narrator) across all major UI surfaces
  • Keyboard navigation for voice cards, history rows, model management, and story editor
  • State-aware aria-label attributes on all interactive controls

UI Polish

  • Redesigned landing page with animated ControlUI hero, multi-engine copy, model cards, and voice creator section (#274)
  • Glassmorphic active state for sidebar buttons with accent border shine
  • Voices tab overhaul with inline inspector
  • Responsive layout — pointer-events fix on animations, sticky header with scroll fade, horizontal-scroll voice cards on mobile
  • Auto-select first story when navigating to Stories tab
  • App version shown in sidebar
  • Voice card heights normalized
  • Audio player title hidden at narrow widths
  • Duplicate profile name validation with clear error messages (#175)
  • Model loaded icon uses accent-colored CircleCheck; loaded models show size

Developer Experience

  • Justfile — streamlined dev setup and workflow (replaces Makefile)
  • Cross-platform justfile — works on macOS, Linux, and Windows
  • Updated README dev quick start with just commands
  • Tauri prerequisite docs updated (#215)

Bug Fixes

  • Fix generate box overlapping audio player on Stories route
  • Fix model management JSX closing tag mismatch
  • Fix LuxTTS generation failures and preserve model selection after generate
  • Fix download progress tracking for all engines
  • Fix chatterbox-tts install with --no-deps to avoid numpy pin conflict
  • Fix WAV format specification for atomic save temp files
  • Fix piper-phonemize find-links for LuxTTS install
  • Remove unused TTS_MODE env var from docker-compose
  • Fix Linux release build exceeding 4GB PyInstaller limit
  • Fix player not loading new version after applying effects
  • Fix window close loop on server shutdown

Downloads

| Platform | File |
| --- | --- |
| macOS (Apple Silicon) | Voicebox_0.2.1_aarch64.dmg |
| macOS (Intel) | Voicebox_0.2.1_x64.dmg |
| Windows | Voicebox_0.2.1_x64_en-US.msi |
| Linux | Build from source (see README); CI builds coming soon |
| Docker | docker compose up |

Community Contributors

Thanks to everyone who contributed to this release:
