github debpalash/OmniVoice-Studio v0.3.8
OmniVoice Studio v0.3.8

A stability-focused release that makes first-run and Windows "just work," ships
live, faster-than-real-time local dictation and a user pronunciation dictionary,
and gives Settings a full redesign. It clears the wave of "Can't reach the local
backend" reports at the source: the 8 GB-card OOM crash, the slow-load
future-scheduling break, a Windows-only WhisperX load failure, an ASR engine
that couldn't load CTranslate2 on newer Linux/WSL, and both transcription and
generation stalls that looked like a dead backend are all fixed or now fail with
a clear, actionable message (a wedged GPU job now resets the worker pool and
returns an actionable timeout). macOS gets native file drag-and-drop back
(including macOS 26 Tahoe). Downloads are faster out of the box (parallel
segmented transfer on by default) and the Hugging Face token that speeds them up
is front-and-center on setup. Plus multi-voice story casting, faster long-form
previews on Windows, and a friendlier, more honest batch of error messages
across dub, generate, and design (a corrupt-binary failure no longer poses as
"out of memory," a bad model id self-heals, and a stale dub job resets cleanly).

Added

  • "Autofit" translation quality — the dub keeps the video's timing. A new
    quality alongside Fast and Cinematic: the LLM rewrites each translated line so
    its target-language reading time fits within the segment's slot (a strict
    "never overrun" bound, per-language pronunciation-speed aware), so long
    translations no longer force the audio into a stressed >1.3× time-stretch.
    Cinematic still applies its reflect/adapt polish; Autofit adds the hard
    fit-to-slot pass on top. Needs an LLM (below); falls back to Fast with a clear
    notice if none is set. (#838)

  • A new LLM Providers settings page — bring your own high-quality LLM.
    Settings → System → LLM Providers configures the LLM that powers Cinematic
    and Autofit translation. One page for 16 providers — OpenAI, OpenRouter,
    Groq, Cerebras, Google AI (Gemini), Mistral, Cohere, NVIDIA, GitHub Models,
    Cloudflare, Hugging Face, SambaNova, SiliconFlow, plus local Ollama / LM
    Studio (fully offline, no key) and a Custom OpenAI-compatible endpoint.
    Paste a key, pick a model, Test the connection in one click, and "use for
    translation" to make it active. Keys are stored encrypted (the same
    at-rest protection as the HF token) and never leave the machine unless you
    choose a cloud provider; env vars still override for power users. The dub
    translate menu now routes you straight here when you pick a high-quality
    style without an LLM, instead of dead-ending on a toast. (#838)

  • A dedicated Network pane. The HTTP/SOCKS proxy and FFmpeg-path controls
    (previously buried in General → Advanced) are promoted to their own category.

  • Factory reset in Storage. A confirm-dialog-guarded action that clears the
    locally-saved UI preferences and reloads — without touching your voices,
    projects, or generated audio on disk.

  • Proactive, highlighted "Install" affordance for translation engines. When
    you pick a Dub translation engine whose optional package isn't installed yet
    (e.g. Google / DeepL via deep_translator), the Engine selector now surfaces a
    bright accent Install button before you hit Translate — no more
    discovering the missing package only via a translate-time 400. On a from-source
    install it one-click installs into the backend's own interpreter; on a
    read-only packaged build it opens a popover with the exact uv pip install …
    command (copy-to-clipboard), a one-click Switch to Argos (bundled, offline)
    escape hatch, and a docs link. The install command is single-sourced in the
    backend registry, so the button and the 400 error can never disagree. New guide:
    docs/dubbing/translation-engines.md.

  • A user pronunciation dictionary that actually changes the audio. Settings →
    General → Pronunciation lets you teach the engine how to say tricky words —
    each entry replaces a term with a respelling (GIF → jiff) right before
    synthesis, so it works on every engine, not just one. Scope an entry
    Global or to a single language (a German rule never fires on an English
    render), with longest-match-first, word-boundary-aware, case-insensitive
    substitution. For one-offs, write [[word|respelling]] inline in your text —
    it overrides the dictionary for that occurrence and never persists. A built-in
    Test field previews the substitution with no model call. Pure text transform,
    identical on macOS/Windows/Linux; plain text stays byte-identical, existing
    data upgrades cleanly via an additive migration. (Expressive-TTS Spec 01)

  • Live, faster-than-real-time dictation via a new sherpa-onnx ASR engine.
    Pick one of seven small ONNX speech-to-text models (Parakeet TDT v3/v2,
    streaming Zipformer EN/ZH/bilingual, streaming Paraformer, multilingual
    Whisper Tiny) for dictation, and watch text appear as you speak. Streaming
    models emit partials frame-by-frame and commit a sentence on natural silence;
    offline models surface live partials too by re-decoding a growing buffer.
    Runs CPU-only and identically on macOS, Windows, and Linux — no GPU, no cloud,
    no extra setup beyond a ~75–180 MB one-time model download. Parakeet TDT v3 is
    the recommended default; existing Whisper/MLX/NeMo dictation engines are
    untouched and still the fallback.

  • New "Voice" settings panel for live dictation. Settings → Capture now
    leads with a Voice card: an Enable Voice Dictation toggle (showing your real
    registered shortcut), a Toggle/Hold mode switch, and a Speech Model dropdown
    that lists all seven models with offline/streaming + recommended badges, size,
    one-line descriptions, the installed checkmark, and inline download/delete —
    reusing the model-store download progress. Picking an uninstalled model starts
    its download and switches to it once ready. Toggle vs Hold is wired for
    both the desktop global hotkey and the in-app Ctrl/Cmd+Shift+Space fallback, so
    the behaviour is identical on macOS, Windows, and Linux. While you speak, the
    dictation pill shows the transcript building live, and words type straight
    into the focused field as you speak — self-correcting with backspaces as the
    streaming recognizer refines, with clipboard-paste as an automatic fallback.

  • Tagged scripts auto-cast into a multi-voice podcast/audiobook. Paste a
    [Alice] … [Bob] … script into Stories and hit Auto-cast: it now recognizes
    the [Name] tag format (alongside the existing NAME: screenplay and quoted
    prose), builds the cast, and assigns a voice per character automatically.
    Editing one line only re-synthesizes that line on export (the chapter cache
    is content-addressed), and inline markers like [pause] / [voice:…] are
    never mistaken for speakers. (#487)

  • A dedicated Contact page. Discord, email, GitHub issues, and the project
    website (palash.dev) as clean one-tap rows, reachable from the footer — so
    reaching the maker is never more than a click away.

  • Live download speed, remaining size, and ETA on first-run setup. The
    Models & Engines step now shows 38% · 5.2 MB/s · 1.2 GB left · ~3m while a
    model downloads, instead of a bare "downloading…". (#657)

  • Turn off auto-play of the preview after a render. New Settings →
    Appearance toggle, "Auto-play preview" (on by default) — switch it off so a
    finished clip doesn't start playing on its own, ideal when batch-generating
    segments. (#666)

  • App version in the status bar, one click from updates. A v<version>
    badge sits by the network icon in the bottom bar; clicking it opens Settings →
    Updates, and it grows a pulsing dot the moment a new version is ready to
    install. (#671)
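
The pronunciation-dictionary rules described in the list above (Global or per-language scope, longest-match-first, word-boundary-aware, case-insensitive substitution, and the inline [[word|respelling]] override) can be sketched as a pure text transform. This is a minimal illustration, not OmniVoice's implementation: the names Entry and apply_pronunciations are hypothetical, and letting a language-scoped entry win over a Global one for the same term is an assumption made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    term: str
    respelling: str
    language: Optional[str] = None  # None means Global scope (assumed encoding)


# Inline one-off override: [[word|respelling]]
INLINE = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")


def apply_pronunciations(text: str, entries: List[Entry], language: str) -> str:
    # Keep only entries in scope; Global entries are applied first so that a
    # language-scoped entry for the same term overrides them (assumption).
    table = {}
    for e in sorted(entries, key=lambda e: e.language is not None):
        if e.term and e.language in (None, language):
            table[e.term.lower()] = e.respelling

    pattern = None
    if table:
        # Longest terms first, so "SQL Server" is tried before "SQL".
        terms = sorted(table, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in terms)
        # (?<!\w) / (?!\w) keep matches on word boundaries.
        pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

    def substitute(chunk: str) -> str:
        if pattern is None:
            return chunk
        return pattern.sub(lambda m: table.get(m.group(0).lower(), m.group(0)), chunk)

    # Inline markers are replaced by their own respelling and never consult
    # the dictionary; the text between markers goes through the dictionary.
    out, pos = [], 0
    for m in INLINE.finditer(text):
        out.append(substitute(text[pos:m.start()]))
        out.append(m.group(2))
        pos = m.end()
    out.append(substitute(text[pos:]))
    return "".join(out)
```

With entries for GIF and SQL Server, apply_pronunciations("A gif in SQL Server", entries, "en") would rewrite both terms, while "gifted" and any text without a matching term pass through unchanged. Because nothing here calls a model, the same function could back a Test field like the one the notes describe.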

Changed

  • Settings is now a sidebar-nav hub instead of an 11-tab strip. The whole
    page was rebuilt from scratch as a grouped left-rail navigator (with a
    search/filter box) plus a scrollable content pane — the macOS System Settings /
    VS Code layout. Settings are organized into four groups and sixteen
    categories: General (Appearance · General), Voice & Engines (Engines ·
    Models · Dictation · Pronunciation · Translation), System (Performance &
    Device · Storage · Network · Sharing & Remote · Credentials), and App
    (Updates · Privacy & Reporting · Logs · About). Every existing control keeps
    its behavior and store/API bindings — this is a reorganization, not a rewrite.
    Typing in the search box filters the category list and jumps to the first
    match, and the rail collapses to a dropdown navigator below 760px so the full
    IA stays reachable on a narrow window. Categories whose changes need a backend
    restart (Models, Performance & Device, Sharing & Remote) carry a "restart
    required" badge.

  • The Settings pages got a full redesign — cleaner, denser, responsive. A
    shared design system replaces the old patchwork: a left icon nav-rail,
    sentence-case section titles (no more debug-log uppercase), exactly one muted
    description per row, unified toggles/inputs, full-width content with proper
    padding, and horizontal font/theme pickers. Premium and compact instead of
    sparse and cluttered, and it adapts cleanly to window width. (#686, #690, #696)

  • Adding a Hugging Face token on first-run is now a one-line input right by
    Continue. It was a bulky card buried at the bottom of the model list; it's
    now a compact "paste a token, Save" bar pinned next to the "Waiting for
    required models…" button, so you can add it (for faster, authenticated
    downloads) without scrolling. (#687, #688)

  • First-run setup is calmer and surfaces the best models for your machine.
    Dimmed and tightened the setup descriptions (less wordy, more compact). The
    "Models & engines" step now shows the platform-tuned optional models up-front
    with a green "recommended" tag and their catalog note — e.g. MLX Whisper on
    Apple Silicon, CUDA-tuned variants on NVIDIA — instead of burying every optional
    model behind the fold (the universal long tail still folds).

  • Donations now go through Ko-fi or PayPal (GitHub Sponsors removed). GitHub
    Sponsors isn't available, so the Support page no longer routes there: pick an
    amount (now $10 / $20 / $50) and then choose Ko-fi or PayPal — PayPal carries
    the amount straight into checkout. .github/FUNDING.yml and the README badges
    were updated to match.

  • Simplified the Commercial License page. Trimmed the six-tile benefit grid
    and FAQ down to the three things that actually drive the decision (you own the
    output, no per-minute cost, direct support) plus one clear "request a quote"
    contact — less wall-of-text, faster to act on.

  • Model downloads are faster out of the box. The built-in multi-connection
    (segmented) downloader — parallel byte-ranges with live speed/ETA — is now on
    by default, so the legacy-LFS path is no longer single-stream and slow. It
    falls back to the normal download on any error, so it can never compromise a
    correct install (OMNIVOICE_SEGMENTED_DOWNLOAD=0 to disable). (#669)

  • The Hugging Face token is now front-and-center on first-run. It was a
    collapsed "advanced" fold almost nobody opened; it's now a prominent card right
    above Continue, framed around what it actually buys you — authenticated, faster,
    more reliable downloads (higher rate limits, fewer stalls) — with a one-click
    "get a free token" link. (#657, #669)

Fixed

  • Bug reports redact more secrets and every Windows username casing. The
    opt-in bug-report scrubber now catches more credential shapes (JWT/Bearer,
    Google, Slack, AWS keys, and ?token=/?api_key= URL secrets), redacts
    Windows home paths regardless of Users/users casing, and stops a superstring
    username (/Users/john vs /Users/johnny) from leaking a fragment. The
    prefilled-issue URL is now bounded by its encoded length so a large report
    can't silently truncate. Nothing new leaves the machine — this only makes the
    existing local-first, user-reviewed report stricter. (#856)

  • A hung TTS generate can no longer brick the backend ("Can't reach the local
    backend"). A GPU job that wedges on some Windows + CUDA setups occupies its
    worker forever — Python can't cancel the thread — so on the 1–2 worker pools we
    ship, one stuck job starved every other request and the next action surfaced as
    the misleading "Can't reach the local backend" even though the process was
    alive. ASR/dub/model-load already bounded and reset the pool on hang (#730); but
    every generate path — Studio synthesis, the streaming path, batch, the dub
    per-segment + preview render, archetype previews, and the OpenAI-compatible
    /v1/audio/speech API — was still an unguarded GPU dispatch, and the residual
    reports all failed on generate:start (audio). Every one is now bounded by the
    same wall-clock guard (OMNIVOICE_GENERATE_TIMEOUT_S, default 300s) that
    abandons the wedged worker and rebuilds the pool, so capacity is restored
    automatically and you get an actionable timeout instead of a dead backend.
    Closes the whole class of GPU-job-hang reports (#851, #850, #802, #755, #723,
    #721, and the 0.3.7 cohort, all tracked in #730).
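
The guard described in the entry above (bound the job by wall-clock time, abandon the wedged worker, rebuild the pool) can be sketched with the standard library. This is a minimal sketch, not the backend's code: SelfHealingPool and its methods are hypothetical names, and only the OMNIVOICE_GENERATE_TIMEOUT_S variable and its 300s default come from the notes. A Python thread cannot be cancelled, so the stuck worker is simply left behind while new work goes to a fresh pool.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Default taken from the release notes; the lookup itself is illustrative.
DEFAULT_TIMEOUT_S = float(os.environ.get("OMNIVOICE_GENERATE_TIMEOUT_S", "300"))


class SelfHealingPool:
    """A small worker pool that replaces itself when a job hangs."""

    def __init__(self, workers: int = 1) -> None:
        self._workers = workers
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def run(self, fn, *args, timeout_s: float = DEFAULT_TIMEOUT_S):
        # Always submit through the current pool, never a captured reference.
        with self._lock:
            pool = self._pool
            future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            if future.done():
                raise  # the job itself raised a timeout; the worker is fine
            self._abandon(pool)
            raise TimeoutError(
                f"job exceeded {timeout_s:g}s; worker pool was reset"
            ) from None

    def _abandon(self, stale: ThreadPoolExecutor) -> None:
        with self._lock:
            if self._pool is stale:  # rebuild once, even if several jobs time out
                self._pool = ThreadPoolExecutor(max_workers=self._workers)
        stale.shutdown(wait=False)  # do not block on the wedged thread
```

Callers hold the SelfHealingPool object rather than the executor inside it, so a reset never leaves a handler submitting to a pool that has already been shut down.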

  • An unsupported GPU now falls back to CPU instead of 500-ing every generate.
    When the installed PyTorch build has no kernels for your GPU's compute
    capability — a too-old card (Pascal / GTX 10-series) or a too-new one
    (Blackwell RTX 50-series on pre-cu128 wheels) — CUDA failed at launch with the
    cryptic CUDA error: no kernel image is available for execution. The backend
    now detects that up front and runs on CPU (slower, but it works), and any raw
    occurrence is reported as "your GPU isn't supported — switch to CPU or install a
    matching PyTorch," not a Flush-the-memory dead end. Force the GPU anyway with
    OMNIVOICE_FORCE_CUDA=1. (#756)

  • The "TRANSLATION FAILED" banner now dismisses and clears itself. The Dub
    translation-error banner used to be sticky — it survived a successful re-try and
    never went away. It now has a close (×), auto-clears on the next corrective
    action (re-translating, changing the engine, or installing the package), and
    self-clears after a short timeout — fixing the whole class of translate/pipeline
    banners that outlived the state that caused them.

  • Dubbing a video URL no longer fails with "ffmpeg is not installed." yt-dlp
    downloads video and audio as separate streams and muxes them with ffmpeg, but
    it only looked on PATH — so on Windows (where OmniVoice's ffmpeg is a bundled
    sidecar / imageio-ffmpeg binary off PATH) the merge aborted before the dub
    could start. yt-dlp is now pointed at the same ffmpeg OmniVoice resolves. (#712)

  • A synth that succeeded no longer 500s because of a history-logging hiccup.
    If the local database somehow missed schema init, recording the clip to
    generation history failed with "no such table: generation_history" and
    surfaced as a 500 — even though the audio had already been generated and saved.
    The write now self-heals the schema and retries, and a history-logging failure
    never fails the generation: you get your audio regardless. (#710)

  • Long-video dubs no longer spike RAM during assembly. Dub generation used
    to hold every segment's audio in memory until the whole track was mixed, so a
    50-video batch or a single feature-length dub could exhaust RAM and crash. Each
    segment now streams to disk as it's rendered and the final track is assembled
    from those files via a 30s-chunk memmap writer, keeping memory flat regardless
    of video length. Per-segment download WAVs and the final track stay correctly
    watermarked (marked once at synthesis, no double-mark), and zero/negative-length
    segments no longer crash the run. (#639)

  • A corrupt or wrong-architecture native component no longer masquerades as
    "out of memory." A synth failure caused by a bad .dll/.pyd/.exe on
    Windows ([WinError 193] %1 is not a valid Win32 application — e.g. torch,
    ffmpeg, or an engine binary) was labelled "ran out of memory — try Flush,"
    sending users down the wrong path. It now says the component is corrupt or
    built for the wrong architecture and to reinstall/repair it. (#705)

  • A "[Errno 32] Broken pipe" mid-generation no longer poses as "out of
    memory." When the desktop app that launched the backend closes or relaunches,
    the backend's output pipe breaks and a synth can fail with [Errno 32] Broken
    pipe. That was labelled "ran out of memory — try Flush," which never helps;
    it now tells you the backend lost its pipe and to restart the app. (#715)

  • Settings content no longer sprawls or spills out of view. The content
    column capped at 1280px, so on wide windows rows stretched edge-to-edge with a
    big empty gap between each label and its control ("too spread out"), and a few
    panels (API keys, the shared button rows, appearance scale) used rigid pixel
    widths that pushed controls past the card's padding on narrow content. Now the
    content sits at a readable measure (a single --settings-measure token), the
    shared button/badge rows wrap instead of overflowing, rigid widths can shrink,
    and rows decide whether to sit side-by-side or stack based on their actual
    width (a container query) — not the viewport, which the 168px nav rail skews.
    Everything stays inside its padding, edge to edge, on every width. (#696)

  • File drag-and-drop works on macOS again. The app's drop zones use HTML5
    file drops, but Tauri intercepts OS drag-and-drop by default (dragDropEnabled)
    and swallowed the files before the webview saw them — most visibly on macOS
    WKWebView, and fully broken on macOS 26 (Tahoe), where dropping a file did
    nothing. Disabled the interception so the webview handles native HTML5 drops
    on every platform. (#700)

  • A misconfigured OMNIVOICE_MODEL no longer bricks model load with a 500.
    A stale or leaked TTS engine id (e.g. omnivoice) reaching the model loader
    used to fail every launch with "omnivoice is not a local folder and is not a
    valid model identifier." It now self-heals: only a real HF repo id
    (org/repo) or an explicit local path is honored; anything else falls back to
    the default with a logged warning. Every consumer of the setting routes through
    the same resolver, so a bad value also can't silently disable model warm-up,
    mislabel the Settings checkpoint, or get baked into an exported persona bundle.
    (#693)

  • ASR no longer crashes the dub/transcribe preflight when CTranslate2's native
    library can't load. On hardened kernels / newer glibc (e.g. WSL2) the
    CTranslate2 .so is rejected with "cannot enable executable stack" — an
    OSError the WhisperX/faster-whisper checks didn't catch, so it took down the
    whole preflight. They now report the engine as unavailable and auto-detect
    falls back to PyTorch-Whisper instead of dead-ending. (#692)

  • A wedged transcription can no longer take the whole backend offline ("Can't
    reach the local backend"). On some Windows + CUDA setups a whisperx/CTranslate2
    transcribe hangs hard and never returns. Because ASR shares a small (1–2 worker)
    GPU pool with TTS, one stuck worker starved every other request — so the next
    thing you did (often a TTS generate) failed with "can't reach backend" even
    though the process was alive. Two fixes: every transcribe path — whole-file
    (dub whole-file, batch, live dictation) and the chunked dub stream — is now
    wall-clock bounded like the dub QC / dictation / OpenAI paths already were;
    and on timeout the poisoned GPU worker is abandoned and the pool rebuilt, so
    capacity is restored without restarting the app. You still get an actionable
    message (Flush VRAM / pick a smaller ASR model) for the durable fix. (#730)

  • The stale-dub-session recovery now also covers the first upload/ingest, not
    just retry/import. A dubbing job that vanished server-side during the initial
    transcribe flow showed the scary "Job not found … report a bug" toast; it
    now resets gracefully and invites a fresh upload, like the other paths. (#695)

  • In-app preview of finished audiobooks/stories now plays on Windows.
    The preview decoded the entire render into one in-memory PCM buffer via Web
    Audio decodeAudioData, which fails on long-form .m4b/AAC under WebView2
    (EncodingError: Unable to decode audio data), and the blob-URL fallback can't
    play in a Tauri <audio> element — so nothing played. The fallback now uploads
    to the preview endpoint (ffmpeg-extracts a streamable WAV) and plays the HTTP
    URL, the same path video previews use. Short TTS previews are unchanged. (#653)

  • First-run setup splash no longer shows a raw bootstrap.lines key in English.
    The log-line counter string was present in 4 locales but missing from the en
    reference, so English (and 16 other locales falling back to it) rendered the
    literal key instead of "{{count}} lines". Added it to en. Also removed 160
    dead gallery.cat_* keys (renamed to archetypes.use_* long ago) orphaned
    across 20 non-English locales, clearing the i18n orphan-key advisory.

  • Backend no longer hangs on startup (unreachable, no error) on Apple-Silicon Macs.
    The MCP session manager could hang on its anyio task group during lifespan
    startup (observed on M1, #632); because that start was awaited before the server
    began serving, "Application startup complete" never fired and the whole backend
    was unreachable. The MCP start is now timeout-bounded (OMNIVOICE_MCP_START_TIMEOUT_S,
    default 30s) — a hang becomes a logged warning and the backend serves normally
    without MCP, instead of wedging. (#632)

  • Dubbing a URL no longer fails with [Errno 22] Invalid argument on Windows.
    yt-dlp stamps the downloaded file's modified-time with the video's upload
    date; an out-of-range/invalid timestamp makes the os.utime call raise
    [Errno 22] and aborts the whole URL ingest. OmniVoice downloads to a throwaway
    file and never uses its mtime, so it now skips the stamp entirely
    (updatetime=False). (#642)

  • Dubbing a YouTube link that 403s now retries with a different player
    client. Some videos serve their formats signature-protected to the default
    player client, so the media download fails with HTTP Error 403: Forbidden
    even though extraction worked — and a plain retry keeps 403ing. The URL
    download now escalates the YouTube player client (tv → android → web_safari)
    on a 403, which commonly bypasses it, before surfacing the actionable error.
    (#625)

  • A synth glitch that produced unreadable audio is now caught instead of being
    mislabelled "out of memory". A numerical glitch in the model (seen on Apple
    Silicon/MPS) could leave NaN/∞ samples, which wrote a WAV that then failed
    decoding with an opaque ffmpeg returned error code: 183 / Invalid data — and
    the generic error handler labelled it "ran out of memory". Non-finite samples
    are now sanitized to silence before any encode (so the WAV is always
    decodable), and a genuine decode failure is reported as "unreadable audio —
    Flush and regenerate", not OOM. (#629)

  • A silent startup hang now leaves a diagnostic instead of nothing. On some
    setups the backend could load all model weights and then hang forever before
    "Application startup complete" — no error, no crash, an unusable app (reported
    as a Mac M1 hang after Loading weights: 527/527, #632). A startup watchdog
    now dumps every thread's stack to the error log if startup stalls past a
    window (default 5 min, OMNIVOICE_STARTUP_WATCHDOG_S to tune, 0 to disable),
    so the deadlock is captured rather than invisible. It's disarmed the instant
    startup finishes, so a normal (even slow-first-download) boot never trips it.
    (#632)

  • First-run demo voice is back. The bundled demo clip
    (backend/assets/samples/demo_voice.wav) was a build artifact that never got
    committed, so it shipped absent — onboarding logged "Demo audio not found" and
    seeded nothing, leaving a brand-new install with an empty Launchpad and no
    /demo_audio route. The clip is now committed (it's already un-ignored and
    bundled via the Tauri backend resource), so first-run seeds the demo voice
    on every platform; onboarding still degrades gracefully (with a regenerate
    hint) if it's ever absent. (#621)

  • Multi-speaker dubbing: two speakers' turns merged onto one line are now
    split apart. Segmentation groups words into sentences before diarization
    runs, so a back-and-forth exchange could land in a single segment; the speaker
    pass then only relabelled that segment with its majority speaker, losing the
    turn boundary (the second half of #486; the per-speaker voice auto-assign was
    fixed earlier in #490). A new post-diarization pass re-splits any segment whose
    words span more than one speaker at the word-level boundary, assigning each
    piece its own speaker. Single-speaker segments pass through byte-for-byte
    unchanged, so single-speaker dubs and their timing never move, and a lone
    mis-attributed word (diarization noise) is smoothed rather than causing a
    spurious split. (#486)

  • Designed voices saved with a bad style no longer render wrong or crash
    generation. A designed voice could persist an instruct the engine
    validator rejects — either the literal "[object Object]" from an old build,
    or freeform prose typed into the style field — which made every generation or
    dub that used the voice fail with Unsupported instruct items found in …
    (surfacing to users as a 400/500 and, when it tore down mid-render, "Can't
    reach the local backend"). The previous fix only blanked "[object Object]",
    which silently dropped the design — so an Indonesian female voice came out
    male. Now the stored instruct is sanitized down to valid tags at every
    seam (save, edit, and when a profile drives Generate or Dub), and when the
    stored value is unusable the tags are rebuilt from the design's saved
    category picks (vd_states) so the intended gender/age/pitch/accent survive.
    A migration (0007) heals existing poisoned profiles in place — no reinstall,
    no manual fix. (#550 #571 #594 #596)

  • "Transcribe stream dropped … Likely ASR backend failed to load" now shows
    the real reason. When transcription failed to load its ASR model (the
    reported case was WhisperX on Windows — typically a faster-whisper /
    CTranslate2-cuDNN mismatch, a missing model download, or the torch-2.6
    weights-only VAD regression), the UI dead-ended on a generic "stream dropped"
    message with no actionable cause. Two root causes: (1) WhisperX loads lazily
    inside transcription, so the load failure was buried in per-chunk errors and
    retried on every chunk; the transcribe pre-flight now eagerly loads the ASR
    model (new ASRBackend.ensure_loaded()), surfacing the genuine cause once, up
    front, as a structured error. (2) Pre-flight and audio-load errors closed the
    SSE stream with a bare error and no terminal done, so the browser's native
    EventSource connection-drop could race and win against the structured error —
    discarding the real cause and falling back to the generic message; every
    terminal error now emits done, and the frontend latches the structured cause
    so a connection drop can't overwrite it. Net: WhisperX load failures are
    diagnosable instead of a silent dead-end. Fail-before/pass-after regression
    test included. (#578)

  • Dubbing: the PLAY button on the dubbed-video preview did nothing. Same
    autoplay-policy trap that #510 fixed for the standalone audio player, but the
    dub editor's timeline player was missed. WaveSurfer builds its AudioContext
    at mount — before any user gesture — so on Windows WebView2 (and Linux
    Firefox/Chrome, Android Chrome) it stays "suspended"; playPause() then
    resolves with no sound and the preview just sits there. Every playback entry
    point in the dub timeline (the toolbar Play button and the per-segment "play
    this slot") now resumes the context via the shared unlockAudio() on the
    click before starting playback, and swallowed play() rejections are logged
    instead of hidden. A source-contract regression test pins the invariant so a
    future refactor can't quietly reintroduce a silent play path. macOS is
    unaffected (its context was never blocked). (#595)

  • Voice design: the script text field couldn't be expanded. The Script
    textarea was a flex: 1 item inside a flex column, so flex-grow recomputed
    its height on every reflow and snapped the user's drag back (resize: vertical
    is silently ignored on a flex-grown item in Chromium/WebView2). The
    field now owns its own height (starts taller, and the corner grip grows it
    reliably on every platform). (#595)

  • An interrupted model download now self-repairs instead of dead-ending.
    When the OmniVoice TTS cache was missing weight shards (the usual aftermath of
    an interrupted first download), the next synthesize failed with a 500 and a
    "delete the model and install it again" instruction — a manual dead-end. The
    backend now detects the truncated-cache error on load, re-fetches just the
    missing files via snapshot_download (already-present blobs are skipped, so a
    near-complete cache repairs in seconds and a healthy cache is never touched),
    and retries the load automatically. Offline mode (HF_HUB_OFFLINE) is
    respected — repair never makes a network call the user opted out of — and if
    the re-fetch still can't fix it, the actionable delete-and-reinstall message
    is preserved as the fallback. (#581) The repair now also retries the
    re-fetch (3 attempts, resuming each time) so a single transient blip — the very
    thing that interrupts a download in the first place — doesn't bounce you back
    to a manual reinstall; tune with OMNIVOICE_MODEL_REPAIR_RETRIES. And if a
    resume-repair still won't load — the signature of a corrupt file that kept
    its size, which a resume trusts and never re-fetches — it now force
    re-downloads the model files once before giving up, so even a bit-rotted
    cache self-heals without a manual reinstall. (#739)

  • Dubbing a YouTube URL no longer dies on a transient "Broken pipe."
    Pasting a video link could fail outright with download: Unable to download
    video: [Errno 32] Broken pipe, an error raised when the reading side of a
    pipe closes mid-stream (a killed ffmpeg merge child, a CDN reset during
    muxing). yt-dlp's own per-fragment retries don't cover that case, so a single
    transient blip aborted the whole ingest. The URL download now retries up to
    twice on broken-pipe / network-drop failures, wiping the partial download
    between attempts, and only surfaces the (already-actionable) "connection
    dropped — just retry" hint after the retries are exhausted. Unsupported links
    still fail fast with their own hint — no wasted retries. (#579, #598)

  • No module named 'omnivoice' on installs whose venv lost its editable
    record. An interrupted or offline uv sync (common during an in-place
    upgrade) could install all dependencies yet never lay the editable install of
    the project's own omnivoice package — or an antivirus quarantine could
    remove it. The venv still started uvicorn, so the bootstrap's health gate
    passed it through, and the app only failed at the first generate/dub with
    No module named 'omnivoice'. The bootstrap now also verifies omnivoice is
    importable (via a cheap find_spec, no torch load) and forces a repair
    uv sync that re-lays the editable install when it isn't; the backend also
    resolves omnivoice from its bundled source tree at runtime as a safety net.
    No reinstall needed — relaunch and it self-repairs. (#564)
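
    The importability check can be illustrated with the standard library's
    importlib.util.find_spec, which locates a top-level package without
    executing it. The helper names below are hypothetical; how the real
    bootstrap wires the check to its repair uv sync is not shown in these
    notes.

```python
import importlib.util


def is_importable(name):
    """True if `name` can be located, without importing or executing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def needs_repair(package="omnivoice"):
    # Hypothetical gate: a real bootstrap would trigger the repair sync here.
    return not is_importable(package)
```

    Because nothing is imported, the check stays cheap even when the package
    would pull in heavy dependencies such as torch.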

  • "cannot schedule new futures after shutdown" no longer breaks generate/dub
    after a slow first load. When a model load timed out, the backend reset its
    GPU worker pool to recover — but several request handlers had captured the old
    pool object at import time and kept submitting to it, so every subsequent
    generate, dub, transcribe, or translate failed with "cannot schedule new
    futures after shutdown" (a 500, or "Can't reach the local backend" when it
    took the worker down). The GPU pool is now a single self-healing handle whose
    worker pool is rebuilt on demand, so a reset can never strand an in-flight or
    later request. No settings change; the recovery is automatic. (#589 #599)
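
    One way to picture the self-healing handle, as a sketch rather than the
    project's actual class: callers keep a reference to a stable wrapper, and
    only the wrapper knows which executor is current. PoolHandle and gpu_pool
    are hypothetical names; cancel_futures needs Python 3.9 or newer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PoolHandle:
    """Stable handle whose inner executor can be reset and lazily rebuilt.

    Callers hold the handle, never the executor, so a reset cannot leave
    them submitting to a pool that has been shut down.
    """

    def __init__(self, max_workers=1):
        self._max_workers = max_workers
        self._lock = threading.Lock()
        self._pool = None

    def submit(self, fn, *args, **kwargs):
        with self._lock:
            if self._pool is None:  # first use, or rebuilt after a reset
                self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
            return self._pool.submit(fn, *args, **kwargs)

    def reset(self):
        with self._lock:
            old, self._pool = self._pool, None
        if old is not None:
            # Do not block on a wedged job; drop work that is still queued.
            old.shutdown(wait=False, cancel_futures=True)


gpu_pool = PoolHandle()             # hypothetical module-level handle
```

    Importing gpu_pool at module load is now harmless: the object a handler
    captured is the handle, which survives every reset.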

  • Transcription / dubbing works on Windows again. WhisperX failed to load on
    Windows because speechbrain's guard that suppresses stray optional-integration
    imports used a POSIX-only path check, so a k2_fsa import error aborted the
    whole transcription. Fixed cross-platform — covers the entire class of optional
    integrations, not just k2. (#630 #611 #647)
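
    The class of bug is easy to reproduce in isolation. The sketch below is
    not speechbrain's code; the function names and file paths are made up to
    show why a substring test on "/" separators misses Windows paths while a
    comparison on path components does not.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def is_integration_posix_only(path):
    # Brittle: assumes "/" separators, so it never matches on Windows.
    return "/integrations/" in path


def is_integration(path, flavour=PurePath):
    # Separator-agnostic: compare path components instead of substrings.
    # `flavour` defaults to the native path type; it is a parameter only so
    # both platforms can be exercised from one machine.
    return "integrations" in flavour(path).parts
```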

  • A slow transcription no longer looks like a dead backend. Whole-file
    transcribe paths (dub QC, dictation, OpenAI-compat) ran unbounded, so a
    VRAM-starved large-v3 could spin for minutes and hold a GPU worker — surfacing
    as "Can't reach the local backend". They're now time-bounded and return a clear,
    actionable 504 (free VRAM / pick a smaller ASR model / use CPU) instead of
    hanging. A new troubleshooting section documents it. (#656)
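
    A time-bounded call of this kind might look like the sketch below. The
    run_bounded name, the 300-second default and the response shape are
    assumptions for illustration; only the 504 status and the three
    suggestions come from the notes above.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_bounded(pool, fn, *args, timeout_s=300.0):
    """Run `fn` on `pool`; return (status, payload) instead of hanging."""
    future = pool.submit(fn, *args)
    try:
        return 200, future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()             # only helps if the job has not started
        return 504, {
            "error": "transcription timed out",
            "hint": "free VRAM, pick a smaller ASR model, or use CPU",
        }
```

    A timeout here only stops the caller from waiting. A job that is already
    running keeps its worker, which is why a wedged GPU job also needs the
    worker pool reset.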

  • Windows preview playback fixed. The audiobook/clone preview's streaming
    fallback fetched localhost, which on Windows resolves to IPv6 and missed the
    IPv4-only backend — so previews failed with "decode error" / "no supported
    sources". The preview API now targets 127.0.0.1 (matching the main client),
    and the expected decode→stream fallback is logged calmly instead of as a scary
    error. (#653 #659)

  • A stale dub session resets cleanly instead of erroring. Reopening the Dub
    tab after the backend restarted tried to resume a job that no longer existed and
    surfaced "Job not found" as a bug-report error. It now quietly clears the dead
    session and invites a fresh upload. (#660)

  • A bad voice-style instruct is a clear 400, not a scary 500. Typing free-form
    prose (or a non-English description) into the style/instruct field returned a
    500 telling you to Flush for memory you never ran out of; it now returns a clean
    400 that lists the valid style tags. The Voice Clone UI also drops unrecognized
    style text locally and generates anyway. (#664 #612)

  • The ⊕ Insert token popover stays on screen. On Voice Clone it could grow
    tall enough to clip off the top of the window; it's now a compact, scrollable
    box anchored above the button. (#672)

  • First-run no longer hangs on Apple Silicon. The MCP session-manager startup
    is now timeout-bounded so a slow/stuck mount can't wedge the whole backend boot
    on M1. (#632)
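
    Bounding a startup step with a timeout is a small change in asyncio. In
    this sketch start_with_timeout and the 10-second default are
    hypothetical, and continuing the boot without the component is an
    assumption about the intended behavior.

```python
import asyncio


async def start_with_timeout(start, timeout_s=10.0):
    """Await `start()`, giving up after `timeout_s` so boot can continue.

    `start` is a hypothetical coroutine function standing in for the
    session-manager startup.
    """
    try:
        await asyncio.wait_for(start(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False                # degrade: boot without this component
```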

CI

  • Feature-coverage test system. A backend route-inventory test diffs all 213
    HTTP/WebSocket endpoints against a committed snapshot (plus a critical-endpoint
    guard and a route-count floor), and a frontend feature-coverage test asserts
    every app mode is wired to a page and every feature has its i18n namespace — so
    an endpoint or page silently disappearing now fails CI on every PR.

  • bun desktop no longer kills its own dev backend. The dev launcher runs the
    API and the Tauri app side-by-side, but the app's backend manager would "take
    ownership" of port 3900 and kill the API the moment it booted (before it was
    healthy), tearing the whole session down. The dev app now sets
    TAURI_SKIP_BACKEND so it attaches to the running API instead of fighting it —
    production launch is unaffected. (#745)
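
The route-inventory check in the first CI item combines three guards: a diff against a snapshot, a critical-endpoint list, and a count floor. A minimal sketch of that logic, with a hypothetical check_routes helper and made-up route names, could be:

```python
def check_routes(current, snapshot, critical=(), floor=0):
    """Return a list of human-readable failures (empty means pass)."""
    current, snapshot = set(current), set(snapshot)
    failures = []
    for route in sorted(snapshot - current):
        failures.append(f"removed: {route}")
    for route in sorted(current - snapshot):
        failures.append(f"not in snapshot: {route}")
    for route in sorted(set(critical) - current):
        failures.append(f"critical endpoint missing: {route}")
    if len(current) < floor:
        failures.append(f"route count {len(current)} below floor {floor}")
    return failures
```

Flagging additions as well as removals forces the snapshot to be updated in the same PR that adds a route, which keeps the committed inventory honest.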

Linux x64 artifacts

7be5f1829a9b6540a06f35181d214f8b1e4eb4d6f20906ff94bff93ebbb89d84  OmniVoice Studio_0.3.8_amd64.AppImage
01f847d8be001add07c4bf83bbe4ca18f49c7a50a1f5aa7b5500abf4df1c5397  OmniVoice Studio_0.3.8_amd64.AppImage.sig

Windows x64 artifacts

a5679d964ec726f294da2393c8640b806c30363439f6af8e47c856b0712191e7 *OmniVoice Studio_0.3.8_x64_en-US.msi
cf63f4346692b9ac475d64dee75591121be2ee852a6834554f0bcdcd788b48be *OmniVoice Studio_0.3.8_x64_en-US.msi.sig

macOS Intel artifacts

977ce4813ec3f347075cccc64311fef6f08e1707dd4d73c6e7703f33a72a51de  OmniVoice Studio_0.3.8_x64.dmg
86dd4c2345938b8e2f84de05f4775eb81c833f621f9a5e182aac6cd13975c3d5  OmniVoice Studio.app.tar.gz
fc40b30bd97a20bdffce41235f9ed2313e7f28efbdabe85ee5f33c9796644b18  OmniVoice Studio.app.tar.gz.sig
