Voicebox v0.1.0
The first public release of Voicebox — an open-source voice synthesis studio powered by Qwen3-TTS.
Download
| Platform | Status |
|---|---|
| macOS (Apple Silicon) | Available |
| macOS (Intel) | Available |
| Windows (x64) | Available |
| Linux | Coming soon* |
*Linux builds are delayed due to GitHub Actions CI issues. We're working on it and will release Linux support in v0.1.1.
What's in this release
Voice Cloning with Qwen3-TTS
Clone any voice from just a few seconds of audio using Alibaba's Qwen3-TTS model.
- Automatic model download — Models download from HuggingFace on first use
- Multiple model sizes — Support for 1.7B and 0.6B parameter models
- Voice prompt caching — Regenerate instantly without reprocessing audio
- Multi-language — English and Chinese support
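
As a rough illustration of the voice prompt caching idea above, the sketch below memoizes an expensive prompt-building step, keyed by a hash of the reference audio and its transcript. This is not Voicebox's actual implementation: `VoicePromptCache`, `build_prompt`, and the key scheme are hypothetical names chosen for the example.

```python
import hashlib
from typing import Callable, Dict


class VoicePromptCache:
    """Hypothetical in-memory cache for voice prompts (illustration only)."""

    def __init__(self, build_prompt: Callable[[bytes, str], object]) -> None:
        # build_prompt stands in for whatever expensive step turns
        # reference audio + transcript into a reusable voice prompt.
        self._build_prompt = build_prompt
        self._entries: Dict[str, object] = {}
        self.misses = 0

    @staticmethod
    def key_for(audio: bytes, reference_text: str) -> str:
        # Hash audio and text together; the zero byte separates the two fields.
        digest = hashlib.sha256()
        digest.update(audio)
        digest.update(b"\x00")
        digest.update(reference_text.encode("utf-8"))
        return digest.hexdigest()

    def get(self, audio: bytes, reference_text: str) -> object:
        key = self.key_for(audio, reference_text)
        if key not in self._entries:
            # Only build the prompt the first time this sample is seen.
            self.misses += 1
            self._entries[key] = self._build_prompt(audio, reference_text)
        return self._entries[key]
```

With a cache like this, regenerating speech from the same sample skips the audio processing step, and changing either the audio or the transcript produces a new key.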
Voice Profile Management
- Create profiles from audio files or record directly in the app
- Multiple samples per profile — Combine samples for higher quality cloning
- Import/Export — Share profiles or back them up
- Automatic transcription — Whisper extracts reference text from samples
Speech Generation
- Simple text-to-speech — Select a profile, type text, generate
- Seed control — Reproducible generations with optional seed input
- Long-form support — Generate up to 5,000 characters at once
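
For scripts longer than the 5,000-character limit noted above, one option is to split the text into chunks before generating. The following is a minimal sketch, assuming the limit counts characters the same way Python's `len` does and that splitting at sentence-ending punctuation is acceptable. `split_for_generation` is a hypothetical helper, not part of Voicebox.

```python
import re
from typing import List

MAX_CHARS = 5000  # per-generation limit stated in these release notes


def split_for_generation(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into the app as a separate generation.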
Generation History
- Full history — Every generation is saved with metadata
- Search — Find past generations by text content
- Inline playback — Listen without leaving the app
- Download — Export audio files to your system
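
Searching history by text content, as described above, can be pictured as a simple SQLite query. The sketch below uses Python's standard `sqlite3` module with a hypothetical `generations` table. The real Voicebox schema is not documented in these notes, so the table name, columns, and `search_generations` function are assumptions for illustration.

```python
import sqlite3
from typing import List, Tuple


def search_generations(conn: sqlite3.Connection, query: str) -> List[Tuple[int, str]]:
    """Return (id, text) rows whose text contains query, newest first."""
    # Escape LIKE wildcards so the user's input is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT id, text FROM generations "
        "WHERE text LIKE ? ESCAPE '\\' ORDER BY id DESC",
        (f"%{escaped}%",),
    )
    return rows.fetchall()


# Hypothetical schema and sample data, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE generations (id INTEGER PRIMARY KEY, text TEXT, seed INTEGER)"
)
conn.executemany(
    "INSERT INTO generations (text, seed) VALUES (?, ?)",
    [("Hello world", 1), ("Goodbye", 2), ("hello again", 3)],
)
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, so a search for `hello` here matches both "Hello world" and "hello again".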
Flexible Deployment
- Local mode — Backend runs alongside the desktop app
- Remote mode — Connect to a GPU server on your network
- One-click server — Turn any machine into a Voicebox server
Desktop Experience
- Native performance — Built with Tauri (Rust), not Electron
- Cross-platform — Same experience on macOS and Windows
- Bundled backend — No Python installation required
Tech Stack
- Desktop: Tauri v2 (Rust)
- Frontend: React, TypeScript, Tailwind CSS
- Backend: FastAPI (Python)
- Voice Model: Qwen3-TTS
- Transcription: Whisper
- Database: SQLite
Known Issues
- First launch is slow — Model downloads (2-7GB) on first use
- Apple Silicon performance — Generation takes ~10s per paragraph on M1/M2 chips; CUDA is significantly faster
- Linux not available — CI pipeline issues; coming in v0.1.1
What's Next
We're already working on the next release. Here's a preview:
- Linux support — Top priority
- Real-time synthesis — Stream audio as it generates
- Voice effects — Pitch shift, reverb, and more
- Timeline editor — Word-level precision audio editing
- Conversation mode — Multi-speaker dialogue generation
- More models — XTTS, Bark, and other open-source voice models
Feedback
Found a bug? Have a feature request? Open an issue on GitHub or reach out at voicebox.sh.
Thank you for trying Voicebox!
P.S. This was originally released yesterday. Note to self: don't let Claude manage GitHub tags with bypass permissions turned on.