🚀 Audiobook Creator v2.0 Release 🎧

This release significantly improves character identification accuracy and overall audiobook quality. The shift to LLM-based processing provides much more reliable character recognition and speaker attribution, especially for complex narratives. It ships alongside the latest release of my Orpheus-TTS-FastAPI repo, which provides much improved stability and automatic detection and correction of audio issues. Users upgrading will need to configure the new LLM environment variables for full functionality.

Major Features

Two-Step LLM-Based Character Identification

Replaced NLP pipeline with advanced two-pass LLM approach for maximum accuracy
Pass 1: Extracts all characters from entire text with intelligent merge/insert/update operations
Pass 2: Attributes speakers to dialogue using pure matching (no character creation)
Significantly improved character recognition and speaker attribution accuracy
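
To make the two passes concrete, here is a minimal sketch of how the merge/insert/update operations and the matching-only attribution could fit together. It is not the project's actual code: the `Character` class, the operation dictionary format, the function names, and the fallback to "narrator" for unmatched speakers are all assumptions made for illustration. In the real app the operations would come from LLM responses, while here they are hard-coded.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    # Hypothetical record for one character found in pass 1.
    name: str
    aliases: set[str] = field(default_factory=set)
    gender: str = "unknown"


def apply_operations(registry: dict[str, Character], operations: list[dict]) -> None:
    """Pass 1: apply insert/update/merge operations to the character registry."""
    for op in operations:
        kind = op["op"]
        if kind == "insert":
            # Add the character only if it is not already known.
            registry.setdefault(op["name"], Character(op["name"]))
        elif kind == "update":
            # Fill in attributes learned later in the text.
            character = registry[op["name"]]
            character.gender = op.get("gender", character.gender)
            character.aliases |= set(op.get("aliases", []))
        elif kind == "merge":
            # Fold a duplicate entry into the canonical one.
            duplicate = registry.pop(op["source"])
            target = registry[op["target"]]
            target.aliases |= duplicate.aliases | {duplicate.name}
            if target.gender == "unknown":
                target.gender = duplicate.gender
        else:
            raise ValueError(f"unknown operation: {kind}")


def attribute_speaker(registry: dict[str, Character], proposed: str) -> str:
    """Pass 2: match a proposed speaker against known characters only."""
    wanted = proposed.strip().lower()
    for character in registry.values():
        names = {character.name} | character.aliases
        if wanted in {n.lower() for n in names}:
            return character.name
    # Pure matching: never create a character here (the fallback is an assumption).
    return "narrator"
```

The point of the split is that only pass 1 may change the registry. Pass 2 can resolve an alias to its canonical character, but an unknown name never becomes a new entry.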

Enhanced Emotion Tagging

Improved emotion tag processing for Orpheus TTS engine
Better integration with character identification workflow
Enhanced expressiveness for audiobook narration

Enhanced Orpheus TTS FastAPI package
Check out the latest version of the Orpheus TTS FastAPI package, which provides these improvements for advanced audio quality assurance:

Multi-stage error detection prevents audio artifacts and quality issues
Automatic retry logic with parameter adjustment for failed generations
Audio quality analysis detects silence, clipping, repetition, and stretching
Duration outlier detection identifies abnormally slow generations
Improved token repetition detection prevents infinite audio loops
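
As a rough illustration of two of the ideas above (token repetition detection and retrying with adjusted parameters), the sketch below checks whether a token sequence ends in a repeated pattern and regenerates with different sampling settings if it does. This is not the Orpheus TTS FastAPI implementation: the function names, the window and repeat thresholds, the `temperature` and `repetition_penalty` parameters, and the step sizes are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def has_repeating_tail(tokens: Sequence[int], window: int = 4, min_repeats: int = 3) -> bool:
    """Return True if the last `window` tokens repeat back to back `min_repeats` times."""
    needed = window * min_repeats
    if window <= 0 or len(tokens) < needed:
        return False
    tail = list(tokens[-needed:])
    pattern = tail[-window:]
    # Compare every window-sized chunk of the tail against the final chunk.
    return all(tail[i:i + window] == pattern for i in range(0, needed, window))


def generate_with_retry(
    generate: Callable[[float, float], list[int]],
    max_attempts: int = 3,
    temperature: float = 0.6,
    repetition_penalty: float = 1.1,
) -> Optional[list[int]]:
    """Call `generate` until its output passes the repetition check."""
    for _ in range(max_attempts):
        tokens = generate(temperature, repetition_penalty)
        if not has_repeating_tail(tokens):
            return tokens
        # Adjust sampling parameters before retrying (illustrative step sizes).
        temperature += 0.1
        repetition_penalty += 0.1
    # Every attempt looked degenerate; the caller decides what to do next.
    return None
```

A real pipeline would likely combine a check like this with the audio-level checks listed above (silence, clipping, stretching, duration outliers) before accepting a generation.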

Welcoming new contributors

Big thanks to @kimnzl @PatrickGardiner @purohitdeep for their PRs fixing bugs and improving the app.

⚠️ Migration Notes

New Environment Variables Required:

CHARACTER_IDENTIFICATION_LLM_BASE_URL - LLM endpoint for character identification
CHARACTER_IDENTIFICATION_LLM_API_KEY - API key for character identification LLM
CHARACTER_IDENTIFICATION_LLM_MODEL_NAME - Model name (requires ≥20K context window)
EMOTION_TAG_ADDITION_LLM_BASE_URL - LLM endpoint for emotion tagging
EMOTION_TAG_ADDITION_LLM_API_KEY - API key for emotion tagging LLM
EMOTION_TAG_ADDITION_LLM_MODEL_NAME - Model name (requires ≥8K context window)
EMOTION_TAG_ADDITION_LLM_MAX_PARALLEL_REQUESTS_BATCH_SIZE - Parallel processing setting

Check .env.sample and the instructions in README.md for the config changes.
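
As a quick pre-flight check after upgrading, the variables listed above can be verified before starting the app. The following is a minimal sketch and not part of the project: the helper names `missing_llm_settings` and `parallel_batch_size`, and the fallback batch size of 1, are assumptions for illustration. Refer to .env.sample for the actual defaults.

```python
import os
from typing import Mapping

# Variable names copied from the migration notes above.
REQUIRED_VARS = (
    "CHARACTER_IDENTIFICATION_LLM_BASE_URL",
    "CHARACTER_IDENTIFICATION_LLM_API_KEY",
    "CHARACTER_IDENTIFICATION_LLM_MODEL_NAME",
    "EMOTION_TAG_ADDITION_LLM_BASE_URL",
    "EMOTION_TAG_ADDITION_LLM_API_KEY",
    "EMOTION_TAG_ADDITION_LLM_MODEL_NAME",
)
BATCH_SIZE_VAR = "EMOTION_TAG_ADDITION_LLM_MAX_PARALLEL_REQUESTS_BATCH_SIZE"


def missing_llm_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def parallel_batch_size(env: Mapping[str, str] = os.environ, fallback: int = 1) -> int:
    """Parse the batch size as a positive integer (the fallback value is hypothetical)."""
    raw = env.get(BATCH_SIZE_VAR, "").strip()
    if not raw:
        return fallback
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError(f"{BATCH_SIZE_VAR} must be >= 1, got {value}")
    return value
```

Calling `missing_llm_settings()` with no arguments inspects the current process environment and returns an empty list when all six LLM variables are set.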

New docker package

Use the latest docker package below. The older packages have been deprecated: there is no need for separate CPU/GPU versions anymore, since character identification no longer requires Gliner NLP and relies solely on LLMs. I have also removed Kokoro from docker compose; it is a standalone component now and is no longer bundled with the app.

Update to latest version of Orpheus TTS FastAPI package
Link

📦 Docker Image

You can pull the latest image with:

docker pull ghcr.io/prakharsr/audiobook_creator:v2.0

Deprecated old packages: ghcr.io/prakharsr/audiobook_creator_cpu and ghcr.io/prakharsr/audiobook_creator_gpu

What's Changed

Changes for running on Windows by @kimnzl in #20
Fix sanitize_filename based on pattern in run_shell_command_secure by @kimnzl in #21
Fix: Add libnss3 dependency to fix Calibre PDF conversion in Docker by @purohitdeep in #22
Problem with commas (#2) by @PatrickGardiner in #25
Two step llm based character identification by @prakharsr in #27

New Contributors

@kimnzl made their first contribution in #20
@purohitdeep made their first contribution in #22
@PatrickGardiner made their first contribution in #25

Full Changelog: v1.5...v2.0

prakharsr/audiobook-creator v2.0 on GitHub