TGSpeechBox v3.0-beta2
Changes since v3.0-beta1.
New Features
- Per-voice engine settings: All 13 voice quality sliders, 5 FrameEx character sliders, pitch mode, inflection scale, and inflection are now stored independently per voice (Adam, Benjamin, Caleb, David, Robert). Switch voices and each one remembers its own tuning. Available on iOS, macOS, and Android. Output settings (sample rate, pause mode, volume) remain global.
- SAPI static-link refactor: TGSpeechSapi.dll is now a single self-contained binary — speechPlayer, nvspFrontend, and eSpeak are all statically linked in. No more separate speechPlayer.dll, nvspFrontend.dll, or libespeak-ng.dll in the installer. The GPL3/MIT licensing boundary is now crystal clear: source is MIT, the combined SAPI binary is GPL3. The runtime code was cut nearly in half (tgsb_runtime.cpp: 1826 → 883 lines) by removing 30 function pointer typedefs, all GetProcAddress calls, SEH crash wrappers, and shadow struct definitions.
- Quote-aware clause type detection: All platforms (NVDA, SAPI, Linux, iOS, Android) now detect sentence vs. question intonation correctly even through quoted speech. Previously, `She said "hello." The` was treated as one continuous clause because the closing quote hid the period from the clause splitter. Now the scanner skips closing quotes and brackets to find the real punctuation underneath.
- Pause mode on Android: Configurable inter-clause silence (Off / Short / Long) in the Output section, matching iOS. Short inserts 35ms after sentences and 25ms after commas; Long inserts 60ms/50ms. Default is Short.
- Reset to Defaults on Android: One-tap reset of all engine settings with confirmation dialog, matching iOS. Resets all sliders, pitch mode, sample rate, volume, pause mode, and language filter.
- Adaptive diphthong onset hold: The onset vowel hold duration now scales based on how far the formants need to travel. Narrow diphthongs (GOAT, where F1/F2 move <300 Hz) get full hold so the onset establishes clearly. Wide diphthongs (PRICE, where F1 drops 260+ Hz and F2 rises 700+ Hz) reduce hold to start the glide sooner, avoiding a perceptible "stall". A 40ms minimum offglide duration (per Gay 1968) ensures the offset vowel is always audible before stops or silence.
- Velar word-final boost: Word-final /k/ in "rock", "back", "blank" was too quiet after the unreleased word-final frication cut. A new `velar_word_final_boost` allophone rule (same pattern as the existing labial boost) restores mid-range burst presence with aspirationAmplitude 1.40, pa2 1.35, pa3 1.30. Applied to en-us, en-gb, and en-au.
- GOAT monophthongization: Both en-us (oʊ → oː) and en-gb (əʊ → əː) GOAT vowels are now monophthongs. The diphthong collapse glide from onset to /ʊ/ created a noticeable "ow" sweep, making words like "combobox" and "social" sound drawn-out. The monophthong keeps the onset quality without the distracting glide.
- Sample rate picker on iOS: Choose between 11025, 16000, 22050, and 44100 Hz with live DSP switching. Only speechPlayer is recreated — eSpeak and the frontend are rate-agnostic.
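
The quote-aware clause detection described above can be illustrated with a small sketch. This is not the project's scanner: the function name `clause_type`, the set of closing characters, and the returned labels are all assumptions made for illustration. The idea is to walk backwards from the end of a clause, skip whitespace and any closing quotes or brackets, and classify the clause by the first punctuation mark found underneath.

```python
# Illustrative only: names, character sets, and labels are assumptions,
# not the identifiers used in the TGSpeechBox frontend.
CLOSERS = frozenset("\"'\u201d\u2019)]}\u00bb")  # straight/curly quotes, brackets, guillemet
TERMINATORS = {
    ".": "statement",
    "?": "question",
    "!": "exclamation",
    ",": "comma",
}


def clause_type(clause: str) -> str:
    """Classify a clause by its final punctuation, ignoring closing quotes/brackets."""
    i = len(clause) - 1
    # Skip trailing whitespace and closing quotes or brackets.
    while i >= 0 and (clause[i].isspace() or clause[i] in CLOSERS):
        i -= 1
    if i >= 0 and clause[i] in TERMINATORS:
        return TERMINATORS[clause[i]]
    return "none"


if __name__ == "__main__":
    print(clause_type('She said "hello."'))   # statement
    print(clause_type('Did he say "now?")'))  # question
```

A splitter built on this idea would end the clause at the closing quote, so that `She said "hello."` and the following `The` land in separate clauses.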
Join TestFlight for macOS and iOS!
Join TestFlight here by clicking this link from your mobile device.
Bug Fixes
- Diphthong bandwidth interpolation: The diphthong collapse pass was sweeping formant frequencies but holding bandwidths constant at the onset vowel's values. This caused "shaking" and "squished" quality — e.g., PRICE /aɪ/ onset B1=116 Hz is natural for F1=677, but way too broad for offset F1=413 where natural B1 is around 55–70 Hz. The resonance couldn't focus, smearing the vowel. Fix: bandwidths (cb1/2/3, pb1/2/3) now cosine-interpolate alongside frequencies during micro-frame sweeps. Explicit bandwidth values added to ɪ and ʊ vowel aliases in phonemes.yaml.
- Limiter pitch-rate shimmer: Phrases like "same page" sounded gritty at speech rates 70–75 but not 80–85. The 0.1ms limiter attack was tracking individual glottal pulses as transients — at certain pitches, harmonics crossing formant peaks created per-period amplitude spikes, and the limiter did pitch-synchronous gain pumping. Fix: attack widened to 2ms (tracks multi-cycle envelope), threshold raised from 3.86 to 4.0 (keeps limiter out of steady voiced speech).
- Stop frication too high → affricate confusion: /t/ sounded like /t͡ʃ/ ("eight" → "eitch"), /p/ sounded like /f/ at lower volumes. Base fricationAmplitude was too high, and the `unreleased_word_final` allophone rule cut pa1/pa2 but left pa3/pa4/pa5 at 1.0, creating upward spectral tilt into affricate territory. Fix: reduced base frication (/t/ 0.92→0.80, /p/ 0.75→0.50, /k/ 0.65→0.62) and added high-frequency rolloff (pa3 0.60, pa4 0.50, pa5 0.40) to unreleased word-final stops in en-us, en-gb, and en-au.
- Syllabic nasal diphthong guard: "tightening" (/tˈaɪʔn̩ɪŋ/) lost its entire "-ing". Syllabic n̩ has `_isVowel: true` for duration purposes, so `autoTieDiphthongs` saw n̩+ɪ as a vowel pair and `diphthong_collapse` erased /ɪ/. Fix: `!prevIsNasal` guard in `autoTieDiphthongs`. (thank you @fastfinge for the report!)
- iOS AudioUnit clicking/popping: Rapid VoiceOver swiping caused audible clicks from DAC transients when the AU render session opened and closed. Root cause was rate-dependent: 44100 Hz ASBD triggered it, 22050 Hz does not. Fix: ASBD set to 22050 Hz (matching eSpeak-NG-mobile), DispatchSemaphore for thread safety, and AVAudioConverter (sinc interpolation) replacing the linear resampler that introduced aliasing at lower DSP rates. Thank you @kaveinthran!
- SAPI trailing silence: Some SAPI hosts (e.g. Balabolka) stop audio playback immediately when `Speak()` returns, clipping the final syllable. Added 50ms silence padding so the audio pipeline has time to drain. Thank you @rommix0 for reporting this one!
- Post-stop fricative masking: /kʃ/ in words like "action": the fricative attack ramp was completely skipped after stops, so the fricative slammed in at full amplitude and masked the preceding burst. Fix: shortened 2.5ms ramp gives the stop burst headroom to register as a separate articulation. Thank you Aaron for reporting that.
- Diphthong onset settle time: Replaced the experimental bandwidth widening approach (any value high enough to help made diphthongs sound squishy) with `diphthongOnsetSettleMs`: extra duration on the first micro-frame so IIR resonators can establish the onset vowel before the glide begins. en-us: 12ms. Total diphthong duration unchanged. Root cause was temporal (resonators never settled), not spectral.
- TalkBack traversal order on Android: Material3 `ExposedDropdownMenuBox` creates internal `Surface` nodes with `isTraversalGroup=true`, scrambling TalkBack linear navigation order for pitch mode and pause mode dropdowns. Replaced with plain `Box` + `DropdownMenu`.
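
To make the diphthong bandwidth interpolation fix above concrete, here is a minimal sketch of cosine interpolation applied to frequencies and bandwidths together during a micro-frame sweep. It is not the engine's code. The helper names `cosine_interp` and `sweep`, the frequency key `cf1`, the five-frame count, and the offset B1 of 60 Hz (picked from the 55–70 Hz range quoted above) are assumptions for illustration. Only the onset F1/B1 and offset F1 figures come from the note.

```python
import math


def cosine_interp(start: float, end: float, t: float) -> float:
    """Ease from start (t=0) to end (t=1) along a raised-cosine curve."""
    weight = (1.0 - math.cos(math.pi * t)) / 2.0
    return start + (end - start) * weight


def sweep(onset: dict, offset: dict, frames: int) -> list:
    """Return `frames` micro-frames moving every parameter from onset to offset."""
    if frames < 2:
        raise ValueError("need at least two micro-frames")
    result = []
    for i in range(frames):
        t = i / (frames - 1)
        # Frequencies and bandwidths share the same curve, so the
        # resonance narrows as the formant moves instead of staying
        # stuck at the onset vowel's bandwidth.
        result.append({k: cosine_interp(onset[k], offset[k], t) for k in onset})
    return result


# PRICE /aɪ/ figures from the note; offset cb1 is an assumed value.
price_onset = {"cf1": 677.0, "cb1": 116.0}
price_offset = {"cf1": 413.0, "cb1": 60.0}
price_frames = sweep(price_onset, price_offset, 5)

if __name__ == "__main__":
    for frame in price_frames:
        print(round(frame["cf1"], 1), round(frame["cb1"], 1))
```

Before the fix, the equivalent of `cb1` would have stayed at 116 Hz in every micro-frame while `cf1` moved, which is the mismatch the note describes.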
Platform Improvements
- SAPI build is now opt-in: Default CMake build produces only speechPlayer.dll and nvspFrontend.dll. SAPI requires explicit `-DTGSB_BUILD_SAPI=ON -DESPEAK_NG_DIR=<path>`, so forks don't break on a missing eSpeak checkout.
- iOS sinc resampler: Linear interpolation for DSP→ASBD upsampling replaced with AVAudioConverter (sinc), eliminating aliasing artifacts at 11025 and 16000 Hz DSP rates.
- Android per-voice settings in TalkBack path: The TTS Service now reads per-voice preference keys (`adv_key.voiceName`) with fallback to global keys, so TalkBack speech also respects per-voice tuning.
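
The per-voice lookup with global fallback can be sketched as follows. This is a language-neutral illustration in Python, not the Android service code. The function name `read_setting`, the plain dictionary standing in for the preference store, and the example key `adv_breathiness` are hypothetical. Only the `key.voiceName` shape and the fallback to the global key come from the note above.

```python
def read_setting(prefs: dict, key: str, voice: str, default=None):
    """Return the per-voice value if stored, else the global value, else default."""
    per_voice_key = f"{key}.{voice}"  # e.g. "adv_breathiness.Adam" (hypothetical key)
    if per_voice_key in prefs:
        return prefs[per_voice_key]
    # No per-voice entry yet: fall back to the global key.
    return prefs.get(key, default)


# Hypothetical stored preferences: one global value, one per-voice override.
example_prefs = {
    "adv_breathiness": 50,
    "adv_breathiness.Adam": 72,
}

if __name__ == "__main__":
    print(read_setting(example_prefs, "adv_breathiness", "Adam"))      # 72
    print(read_setting(example_prefs, "adv_breathiness", "Benjamin"))  # 50
```

Checking the per-voice key first means a voice that has never been tuned keeps behaving as it did under the older global settings.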