New models! 🎉 🎉 🎉
Credit: @vvolhejn
This release adds new models. A new --language argument selects the pre-trained model from the CLI, and TTSModel.load_model() accepts a matching language= argument. Here is the list of all available models/languages:
- english_2026-01: The only model that was available until now. 6 layers.
- english_2026-04: The new and improved English model. It handles short sentences better and has better voice cloning. 6 layers.
- english: This is just an alias for english_2026-04.
- italian: Our new pocket-tts in Italian! 6 layers.
- italian_24l: The undistilled Italian model. We would love reports if you find bugs that are present in the italian model but not in the italian_24l model. 24 layers.
- german: Our new pocket-tts in German! 6 layers.
- german_24l: The undistilled German model. We would love reports if you find bugs that are present in the german model but not in the german_24l model. 24 layers.
- spanish: Our new pocket-tts in Spanish! 6 layers.
- spanish_24l: The undistilled Spanish model. We would love reports if you find bugs that are present in the spanish model but not in the spanish_24l model. 24 layers.
- portuguese: Our new pocket-tts in Portuguese! 6 layers.
- portuguese_24l: The undistilled Portuguese model. We would love reports if you find bugs that are present in the portuguese model but not in the portuguese_24l model. 24 layers.
- french_24l: The undistilled French model. Distilling the French model has been more painful than anticipated because of data quality. While we fix those issues, we want to unblock the French pocket-tts community, which is why we are releasing the undistilled version here. 24 layers.
If the 24-layer models are too slow to run in real time on your CPU, you can try the new --quantize option! You can expect roughly 30% better performance in most cases.
The pre-defined voices are all English. For other languages, we recommend using voice cloning with a voice prompt that corresponds to your language.
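To make the naming scheme in the model list easier to script against, here is a minimal sketch of a lookup table with alias resolution. Everything in it (`MODEL_LAYERS`, `ALIASES`, `resolve_model_name`) is a hypothetical helper written for illustration and is not part of pocket-tts; it only mirrors the names and layer counts listed in these notes.

```python
# Hypothetical helper, NOT part of pocket-tts: it only mirrors the model
# names and layer counts listed in these release notes.
MODEL_LAYERS = {
    "english_2026-01": 6,
    "english_2026-04": 6,
    "italian": 6,
    "italian_24l": 24,
    "german": 6,
    "german_24l": 24,
    "spanish": 6,
    "spanish_24l": 24,
    "portuguese": 6,
    "portuguese_24l": 24,
    "french_24l": 24,
}

# "english" is an alias for the newest English model.
ALIASES = {"english": "english_2026-04"}


def resolve_model_name(language: str, prefer_undistilled: bool = False) -> str:
    """Return a concrete model name for a language or alias.

    Falling back to the 24-layer model when no distilled model exists
    (currently the case for French) is a choice made by this sketch.
    """
    name = ALIASES.get(language, language)
    undistilled = f"{language}_24l"
    if prefer_undistilled and undistilled in MODEL_LAYERS:
        return undistilled
    if name in MODEL_LAYERS:
        return name
    if undistilled in MODEL_LAYERS:
        return undistilled
    raise ValueError(f"Unknown model or language: {language!r}")
```

The returned name is the kind of value you would pass to --language on the CLI or as language= to TTSModel.load_model(). Checking `MODEL_LAYERS[name] == 24` is one way to decide whether --quantize is worth trying on a slow CPU.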
Note for maintainers of alternative implementations
This section should be especially helpful to @LaurentMazare @KevinAHM @babybirdprd @ekzhang @jishnuvenugopal @VolgaGerm @csukuangfj @TheAjaykrishnanR
The pocket-tts community has been amazing! We were blown away by the number of alternative implementations of pocket-tts in other languages and frameworks! We want to make it easy for them to adapt their code to the new models. I added comments to the commit that made the architecture changes. If you port the changes made next to each comment, that should be enough to make your alternative implementation work!
Here is the list:
- Beginning of sequence weight
- Upsample and downsample take more args
- Speaker proj weight has configurable dimensions
- Optional: a trick to make sure mimi is working
- Prompting a new beginning of sequence before the voice prompt
- Making the downsample more flexible
- Making the upsample more flexible
Notable pull requests:
- Changed the implementation to fuse the transformers by @darknight054 in #85
- Raise minimum huggingface_hub to 0.13.0 for consistent offline behavior by @joshwhiton in #137
- Add int8 dynamic quantization support by @nabil-tazi in #147
- Split long sentences on commas to prevent skipped words by @costajohnt in #143
- Add french, italian, portuguese, spanish, german by @gabrieldemarmiesse in #155
New Contributors
- @ai-joe-git made their first contribution in #136
- @joshwhiton made their first contribution in #137
- @alkmei made their first contribution in #138
- @dooart made their first contribution in #139
- @tonelord made their first contribution in #140
- @VolgaGerm made their first contribution in #142
- @csukuangfj made their first contribution in #150
- @nabil-tazi made their first contribution in #147
- @costajohnt made their first contribution in #143
- @markd89 made their first contribution in #157
- @TheAjaykrishnanR made their first contribution in #159
- @dodgyrabbit made their first contribution in #160
Many thanks to the community for being so awesome! ❤️
Full Changelog: v1.1.1...v2.0.0