A large release (168 commits since v0.3.5). The headline is the Longform suite — produce full audiobooks and multi-voice stories from text, EPUB, or PDF — alongside a real engine-routing layer that tells you up front when an engine will fall back to CPU instead of finding out mid-synth. Dubbing, first-run, and install reliability all get a pass too.
Added
- Longform: Stories + Audiobook editors. Two new tabs turn long text into finished audio. Audiobook takes a script (or imports plain text / EPUB / PDF), auto-splits it into chapters, and renders a chaptered
.m4bwith metadata, cover art, and per-chapter preview/resume. Stories is a multi-voice editor — assign a different voice per line, preview, and export the whole thing through the same server-side renderer. Both share one render core (loudness, metadata, cover art) and one live SSE progress stream, and you can convert a project between Story and Audiobook in place. (#402, #403, #404, #408, #409, #411, #412, #413, #426, #435, #436, #447) - Longform: PDF & EPUB ingest. "Import" on the Audiobook tab accepts EPUB and PDF (not just plain text) and auto-chapters the result, so an existing ebook becomes an audiobook without manual copy-paste. (#412, #459)
- Longform: two-pass loudnorm mastering. Audiobook/Story exports now run a measure-then-normalize loudnorm pass for accurate ACX/podcast loudness targets. A slow or broken measure pass degrades gracefully to single-pass rather than aborting the render. (#449, #455)
- Longform: crash-resume. An interrupted render is resumable without re-submitting the original input — the compiled plan is persisted to the job dir and finished chapters are reused, so a crash mid-book doesn't cost you the whole render. (#470)
- Longform: pronunciation control + SSML-lite prosody. A per-render pronunciation lexicon (word respelling) plus an in-app pronunciation editor and markup reference, and inline prosody markers —
[slow]/[fast]/[emphasis]/[spell]— for fine-grained delivery. (#419, #421, #422) - Stories: global reading-speed control. A toolbar slider (0.5–2.0×) sets one speed for every line that doesn't have its own per-line override; the per-line slider still wins. Persisted as a UI preference. (#415, #416)
- Unified LongformProject store. Audiobook metadata, scripts, and prefs persist in a single project store (with a
v4→v5migration), and finished books/stories now show up alongside other work in Projects. (#417, #443, #444) - Portable personas (
.ovsvoice). Export any voice as a self-contained, fully-local persona bundle — identity, optional reference clip, consent attestation, SPDX license, and a watermarked preview — and import it back into another OmniVoice install. A privacy toggle ships a preview-only bundle so no raw recording of your voice has to travel. Verified-own-voice status can't be forged by hand-editing a bundle (real recording + consent text + attestation required). Legacy.omnivoicefiles still import. See docs/persona-format.md. (#29) - Engine routing — no more silent CPU fallback. A host device probe and routing resolver now decide where each engine actually runs, and the verdict is surfaced before you hit Synthesize: the Settings → Engines picker shows a per-engine compatibility matrix, and preflight / diagnose report the active engine's GPU verdict (accelerated / caveat / CPU-fallback / unavailable). At synth time every TTS entry point (
/generate,/v1/audio/speech) enforces the same routing — an engine that can't use this host's GPU returns an explicit error or anX-OmniVoice-Routingheader instead of silently dropping to CPU or dying mid-synth. (#21) - Diagnostics suite. New self-check tooling for when something's wrong: a
/system/diagnosereport (and matching backend--diagnose), a persistent error journal surfaced in Settings, and a scrubbed diagnostic bundle (home dirs stripped to~/, no tokens/keys) you can attach to a bug report. Paired with structured GitHub Issue Forms (bug / install / feature) for cleaner reports. (#433, #456) - Dubbing: multi-speaker per-speaker voice assignment. When diarization detects multiple speakers, each segment is now bound to its speaker's cloned voice automatically instead of landing on "Default" and needing manual fixes; per-segment reference clips are still preferred for quality where present. Also adds an optional speaker-count hint for diarization. (#275, #486, #490)
- Dubbing: Smart Fit timing + second-pass QC. A Smart Fit timing strategy (planner, fingerprints, per-segment video retime + drift absorption + fitted subtitles) plus a second-pass ASR QC that flags lines whose dub drifts from the target timing — wired into the dub editor UI. Includes a timeline segment editor (drag, snap-to-onset, keyboard a11y), speech-onset alignment, regional dialect targeting, and per-segment clone references. (#280, #347, #350, #369, #370, #458)
- Dubbing: dedicated Dub home. A projects/history landing for dubbing with project rename. (#435)
- Voice Console workspace. Clone and Design are consolidated into one Voice workspace with right-side panels, a shared waveform player, an identity recipe line / Active-voice card, and a free-text "describe your voice" field that maps natural language to design parameters. (#317, #374, #376, #378, #395, #396, #397)
- Unified first-run setup. Nothing installs until you confirm a plan: pick an install mode (installed / portable), a storage location, and (on restricted networks) custom PyPI/HF/python-build-standalone mirrors — with a minimum-free-space gate before anything downloads. Followed by a guided studio-console wizard with platform-aware hints, resume reassurance, and download ETAs. (#286, #295, #297, #298)
- Dictation: local-LLM refinement. Opt-in local-LLM cleanup of final transcripts (collapsing Whisper hallucination loops), available on both live dictation and the REST
/transcribepath; plus opt-in NLMS acoustic echo cancellation for dictating over playback. Configure a remote LLM endpoint (Ollama / vLLM / LM Studio) in Settings. (#356, #357, #363, #399, #400, #457) - Unlimited-length TTS + streaming. Sentence-boundary chunking with crossfade removes the per-generation length cap, and a new sentence-by-sentence
/ws/ttsstreams audio as it's produced. An inline[pause Nms]marker inserts measured silence in generated speech. (#276, #357, #358) - MCP server v1. OmniVoice mounts an MCP server on
/mcp(with a stdio shim and per-agent voice binding) so it can act as a local TTS/STT provider for agentic pipelines. (#368) - Remote-backend access. Point the desktop UI at a remote backend URL with a bearer key (Tailscale-documented), and an opt-in Hugging Face token field in the setup flow. (#303, #364)
- "Fund Claude Max" support experience. The donate page gets a real goal bar with a "Join N supporters" social-proof line and suggested amounts, plus Pip the mascot and a non-blocking "postcard" toast that appears only after a success (a finished dub, a saved clone, a longform export) — never on errors, setup, or first run — with escalating cooldowns and a one-click "don't ask again". (#494)
Fixed
- Transcription/dubbing failed when ffmpeg wasn't on
PATH(notably on Windows). WhisperX now decodes audio through OmniVoice's own validated ffmpeg binary instead of a barePATHlookup, so ASR works without a system ffmpeg install. (#479) - Translation defaulted the source language to English. Dubbing/translation now guesses the source language from the text instead of assuming
en, fixing wrong-direction translations. (#478) - Cinematic / LLM dubbing features failed out of the box because
openaiwasn't bundled. The client is now a runtime dependency, so those paths work on a fresh install. (#484) pkg_resources missinginstall dead-end (#248). The auto-repair ranuv pip install setuptools, whichuvtreated as a no-op when setuptools metadata was present but its files had been removed (commonly by Windows Defender quarantine or a partial extract). Both repair sites now use--reinstallto force re-extraction, and the error/hint text suggests the working command plus an antivirus-exclusion note. (#248)- A stuck backend trapped users on a buttonless splash (#474). The bootstrap splash now has a per-stage stall watchdog: if a non-terminal stage sits past its budget (20 min for dep install, 120 s otherwise), it flips to the failed state with actionable hints, the live log, and Retry / Clean-&-Retry — instead of polling forever with no way out. (#474)
- Changing the model-download location in Settings had no effect (#480). The desktop launcher injected a stale models dir that overrode the per-user value, so new downloads kept going to the old folder and "Effective location" stayed wrong. The per-user env file now wins, so the in-app Settings path is authoritative. (#480)
- Backend crashed on app upgrade with a stale venv (#307). Dependencies are now synced on upgrade, and a structurally broken venv self-heals instead of exiting
106.scalar_fastapiis now optional so its absence can't break startup. (#307, #314) /generateignored the selected TTS engine (#312) and GGUF speech-control parameters weren't forwarded — both now honored. (#306, #312)- TTS generation failed on some GPUs.
torch.compilefailures now fall back to eager execution so generation never hard-fails on unsupported GPUs, and cudagraph-compiled inference is pinned to one dedicated thread to avoid crashes. (#278, #315) - Re-dub ignored transcript edits (#281). Fingerprints are canonicalized, the preview cache is busted, and the mux is atomic, so editing the transcript and re-dubbing actually reflects your changes. Translated subtitles now burn in correctly and subtitle save no longer throws a JSON error. (#281, #309)
- macOS: app wouldn't open without using Terminal. Builds are now ad-hoc signed (with signing/notarization verification), so the app launches normally. (#290)
- macOS dictation auto-paste stole focus; it now writes the clipboard natively without grabbing focus, and microphone-permission handling adds OS usage descriptions, a WebView grant handler, and an actionable denied-state UI. (#287, #323)
- Clone-reference transcription was broken (it used a removed transformers pipeline); it now routes through the ASR registry. A crash-isolated faster-whisper subprocess backend keeps an ASR crash from taking down the app. (#308, #393)
- Realtime status probe hit a gated route. It now probes the auth-exempt
/healthinstead of the gated/model/status, and the UI polls the backend over HTTP before opening the WebSocket to avoid startupECONNREFUSED. (#439, #450) - Non-executable or unreachable engine binaries showed cryptic errors — these now produce actionable messages. (#437, #438, #454, #466)
- Design-profile save was coupled to a TTS render (#476), so saving a profile needlessly triggered synthesis; the two are now decoupled. (#476)
- UI scale / black bands. The app shell now scales via
transform: scaleand always fills the viewport, fixing the WebKitGTK black-band issue on Linux and cramped/black layouts at narrow widths — a permanent fix across platforms. (#445, #452) - Clone popover/CTA clipping and a non-resizable textarea are fixed, the WaveformPlayer no longer pauses itself on play or ignores clicks, and several layout/history-display issues (phantom sidebar gap, title clamping, flicker) are cleaned up. (#379, #384, #398, #481)
- Windows:
desktop-prodnow runs from cmd/PowerShell via a cross-platform launcher,tqdmis disabled on non-TTY to avoid anOSError, and ffmpeg validation guards againstWinError 193. (#282, #305, #377) - MLX import hardened against PyInstaller dylib failures, with a proper platform gate so it's only loaded where it works. (#390)
Changed
- Restricted-network support. A Hugging Face mirror (
HF_ENDPOINT) setting, custom PyPI / HF / python-build-standalone mirrors in first-run setup, and region presets help installs complete behind restrictive networks. (#286, #391) - Engine memory management. Subprocess-engine sidecars now unload on demand and idle-reap to free VRAM. (#401, #406)
- Faster, more accurate model downloads via a Xet fast path with accurate progress reporting, plus a model-management cleanup pass. (#424, #428)
- Voice profiles unified under one model with a
kinddiscriminator and stored design params, and consent-locked profiles (verified_own_voice+ spoken-consent flow). (#354, #376) - Updater preview channel now offers the newest build across channels, and preview versions carry an MSI-legal numeric pre-release stamp. (#293, #326)
- Performance. Voice-clone prompt embeddings are cached, and dub retime batches seek to their window instead of decoding from frame 0. (#387, #427)
License
- Relicensed from FSL-1.1-ALv2 to AGPL-3.0 (open-core). The project is now under the GNU Affero General Public License v3, with a paid commercial license retained for proprietary/closed-source use without AGPL obligations. The bundled
omnivoice/TTS model package stays Apache-2.0 upstream (AGPL-compatible). Manifests declareAGPL-3.0-only; the in-app Commercial License copy and README are updated, and the old "converts to Apache 2.0 after two years" FAQ is removed. In-app commercial-license strings are translated across all 20 locales. (#292)
CI
- macOS Intel (x86_64) build target reinstated on
macos-15-intel, so Intel Mac users get installers again. (#342) - Docker Hub publishing. Images now also publish to Docker Hub (
palashdeb/omnivoice-studio), with the Docker Hub overview maintained in-repo and auto-synced frommain(sync is non-fatal so it can't redden a build). (#375, #410, #414) - Docs-drift guard. A daily job compares the canonical feature inventory against README / docs / registries to catch stale docs. (#353)
- Security scans never cancel on
main, so merge trains no longer leave red ✗ on intermediate commits. (#340)
Downloads & checksums
Linux x64 artifacts
34fc42364f95662b7b1c7164db79531ea0b7cc265be2cca7bba4f85b54ee674c OmniVoice Studio_0.3.6_amd64.AppImage
f898611f9f899d3b9f5f5d08eee303e25aa32902d647e05788b239a5d0fc6ae5 OmniVoice Studio_0.3.6_amd64.AppImage.sig
Windows x64 artifacts
f7761bb556cea145f5f0d38ff24547dc1e20ece508b445f65e4802356ceb6557 *OmniVoice Studio_0.3.6_x64_en-US.msi
ba7720a2f31ece91ba6cd615631807768b0ff6cc7b96e3381c3efe4fd17d0302 *OmniVoice Studio_0.3.6_x64_en-US.msi.sig
macOS Intel artifacts
51fb4137b500f34a17940a541fa242d48e80cfd5dc89a051dcf41bcec002c923 OmniVoice Studio_0.3.6_x64.dmg
4bd0acfa9e11fada2410d4df1ea306cd222404b36aff4ce63abed9fb12b5fae0 OmniVoice Studio.app.tar.gz
6b007592d7fa24a639f3811dc16e276b0f69cf554708f9f7140c566053451f6b OmniVoice Studio.app.tar.gz.sig