TGSpeechBox v2.97
New Features
Diphthong collapse pass — Diphthongs like /eɪ/, /aɪ/, /aʊ/, /oʊ/, /ɔɪ/ are no longer rendered as two separate vowel plateaus connected by a crossfade. The new diphthong collapse pass merges tied vowel pairs into a single token and emits cosine-smoothed micro-frame waypoints (default 8ms interval, pitch-adaptive down to 3ms at higher F0). The result is a continuous formant sweep instead of a "two-beat" artifact. Configurable settings include durationFloorMs, microFrameIntervalMs, onsetHoldExponent (linger at onset before sweeping), and amplitudeDipFactor (natural midpoint weakening). The pass is aware of microprosody — diphthongs before voiceless consonants skip pre-voiceless shortening so words like "eight" and "state" don't get crushed.
This pass should make a big difference for languages like English and Portuguese, where diphthongs are very common.
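The waypoint idea can be sketched roughly as follows. This is an illustrative Python sketch, not the engine's code: the function name `diphthong_waypoints` and the exact easing and dip formulas are assumptions, and only the parameter names mirror the settings listed above. It uses a fixed interval and does not model the pitch-adaptive spacing or `durationFloorMs`.

```python
import math

def diphthong_waypoints(start_hz, end_hz, duration_ms,
                        interval_ms=8.0, onset_hold_exponent=1.0,
                        amplitude_dip_factor=0.0):
    """Return (time_ms, formant_hz, gain) waypoints for one formant sweep.

    Hypothetical helper: the formulas are assumptions chosen to show the
    idea of a cosine-smoothed sweep with an onset hold and a midpoint dip.
    """
    steps = max(1, round(duration_ms / interval_ms))
    points = []
    for i in range(steps + 1):
        t = i / steps                              # normalized time, 0..1
        held = t ** onset_hold_exponent            # exponent > 1 lingers at the onset
        s = 0.5 - 0.5 * math.cos(math.pi * held)   # cosine ease from 0 to 1
        hz = start_hz + (end_hz - start_hz) * s
        # Gain is lowest at the temporal midpoint and 1.0 at both ends.
        gain = 1.0 - amplitude_dip_factor * math.sin(math.pi * t)
        points.append((t * duration_ms, hz, gain))
    return points

if __name__ == "__main__":
    # An F1-like glide from 700 Hz down to 300 Hz over 160 ms.
    for time_ms, hz, gain in diphthong_waypoints(700.0, 300.0, 160.0,
                                                 onset_hold_exponent=1.5,
                                                 amplitude_dip_factor=0.1):
        print(f"{time_ms:6.1f} ms  {hz:6.1f} Hz  gain {gain:.2f}")
```

Because every waypoint lies on one smooth curve, there is no flat segment at either vowel target, which is what removes the "two-beat" impression.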
Two new pitch models:
- legacyPitchMode: "impulse_style" — Additive impulse pitch model. Linear declination baseline with count-based stress peaks that diminish across the utterance (first stress gets the largest boost, subsequent stresses progressively smaller). Terminal gestures shape the final vowel by clause type. A two-pole IIR smoothing filter removes discontinuities, producing warm rounded pitch bumps.
- legacyPitchMode: "klatt_style" — Klatt 1987 hat-pattern intonation model. A three-state machine (BEFORE_HAT / ON_HAT / AFTER_HAT): pitch starts at a declining baseline, rises sharply on the first primary-stressed syllable, sustains a raised plateau with diminishing per-stress peaks, then falls back below baseline on the final stressed syllable. Statements get glottal lowering on the final vowel; questions rise instead of falling. Single-pole IIR smoothing.
These join the existing espeak_style, legacy, and fujisaki_style modes. Each language pack can select whichever model suits its prosody.
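For readers who want a feel for the hat pattern, here is a simplified Python sketch of the three-state idea. Everything except the state names is an assumption made for illustration: the function names, the per-syllable granularity, and all Hz values are hypothetical. As a simplification, the fall is applied to syllables after the final stressed one rather than within it, and glottal lowering is not modelled.

```python
def hat_pattern_targets(stresses, base_hz=120.0, declination_hz=2.0,
                        hat_rise_hz=20.0, peak_hz=10.0, peak_decay=0.7,
                        terminal_hz=8.0, is_question=False):
    """Return one pitch target (Hz) per syllable.

    stresses: list of bools, True for a primary-stressed syllable.
    All numeric defaults are made-up illustration values.
    """
    stressed = [i for i, s in enumerate(stresses) if s]
    first = stressed[0] if stressed else None
    last = stressed[-1] if stressed else None
    state = "BEFORE_HAT"
    peaks_seen = 0
    targets = []
    for i, is_stressed in enumerate(stresses):
        baseline = base_hz - declination_hz * i   # linear declination
        if state == "BEFORE_HAT" and i == first:
            state = "ON_HAT"                      # sharp rise onto the hat
        if state == "BEFORE_HAT":
            target = baseline
        elif state == "ON_HAT":
            target = baseline + hat_rise_hz       # raised plateau
            if is_stressed:
                # Each successive stress peak is smaller than the last.
                target += peak_hz * peak_decay ** peaks_seen
                peaks_seen += 1
        elif is_question:
            target = baseline + hat_rise_hz + terminal_hz  # rise instead of fall
        else:
            target = baseline - terminal_hz       # fall below the baseline
        targets.append(target)
        if state == "ON_HAT" and i == last:
            state = "AFTER_HAT"
    return targets

def smooth_one_pole(values, alpha=0.5):
    """Single-pole IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    if not values:
        return []
    out, y = [], values[0]
    for x in values:
        y += alpha * (x - y)
        out.append(y)
    return out

if __name__ == "__main__":
    contour = hat_pattern_targets([False, True, False, True, False, True, False])
    print([round(v, 1) for v in smooth_one_pole(contour)])
```

The smoothing step is what turns the stepwise targets into a continuous contour; a smaller `alpha` gives slower, rounder transitions.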
Post-cascade additive radiation model
Lip radiation (+6 dB/oct spectral tilt) has been moved from the pre-cascade source to a post-cascade additive stage (kPostRadiationMix). In the old model, the derivative was mixed into the glottal source before it entered the formant cascade, which boosted upper formants disproportionately and contributed to the classic "buzzy Klatt" character. The new model applies a first-difference differentiator after the cascade output, preserving the natural formant amplitude balance the cascade produces (where F1 carries proper weight) while still adding the physically correct lip radiation slope. The result is warmer, more chest-voice vowel quality. Thanks to @Simon818 for pointing out the spectral gaps in the 3–5 kHz region that led to this investigation.
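In signal terms, the post-cascade stage amounts to adding a scaled first difference of the cascade output back onto that output. The Python sketch below illustrates this under stated assumptions: `apply_post_radiation` is a hypothetical name, and the `mix` default is a placeholder, since the actual value of kPostRadiationMix is not given here.

```python
def apply_post_radiation(cascade_out, mix=0.5):
    """Add a scaled first difference to the cascade output.

    y[n] = x[n] + mix * (x[n] - x[n-1])

    The first difference approximates a +6 dB/oct tilt at frequencies well
    below Nyquist. `mix` stands in for kPostRadiationMix; 0.5 is a made-up
    value used only for illustration.
    """
    out = []
    prev = 0.0                      # assume silence before the first sample
    for x in cascade_out:
        diff = x - prev             # first-difference differentiator
        out.append(x + mix * diff)  # additive: the original signal is kept
        prev = x
    return out

if __name__ == "__main__":
    # A constant (DC) input passes through unchanged after the first sample,
    # while a sample-rate alternation is boosted.
    print(apply_post_radiation([1.0, 1.0, 1.0, 1.0]))
    print(apply_post_radiation([1.0, -1.0, 1.0, -1.0]))
```

Because the stage is additive rather than a pure differentiator, low-frequency energy around F1 passes through at full level and only the high-frequency emphasis is layered on top.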
Bug Fixes
Auto-tie diphthongs semivowel bug — Fixed a bug where autoTieDiphthongs was incorrectly converting diphthong offglides to semivowels in contexts where they should remain vowels. The offglide-to-semivowel conversion (autoDiphthongOffglideToSemivowel) now correctly respects tie-bar boundaries.
Removed duplicated trajectory limiting code — Cleaned up redundant trajectory limit logic that had accumulated during earlier refactoring. The single canonical implementation now lives in one place, reducing maintenance risk and ensuring consistent behavior across all formant fields.
Mexican Spanish seseo (θ→s_mx) fixed in NVDA driver.
Language Tuning
Spanish (es) and Mexican Spanish (es-mx) — Major tuning improvements from @rmcpantoja (Mateo Cedillo), now live in both packs. Normalization rules for lenition (word-final /b/→/β/, word-final /d/→/ð/), word-initial velar reinforcement (/ɣ/→/ɡ/), consonant cluster separation with micro-schwas (βɾ, ðɾ, ɡɾ, kɾ, pɾ, tɾ), palatal nasal normalization, and vowel-to-semivowel diphthong handling. Language-specific phoneme variants tuned for the Spanish five-vowel triangle (a_es, e_es, o_es, u_es), softer /d/ (d_es), Castilian /x/ (x_es), brighter /s/ (s_es for Castilian apical, s_mx for Mexican laminal), and clearer affricates (t͡ʃ_es). Mexican Spanish inherits all shared rules from es.yaml and uses skipReplacements to suppress Castilian-specific phonemes, adding its own jj → ʝ pre-replacement for intervocalic ll handling.
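To make three of these rules concrete, here is a small Python sketch that applies them to a single IPA-transcribed word. It is only an illustration: the packs express their rules in YAML, the function name `normalize_spanish_ipa` is hypothetical, and the sketch assumes one word per call with no leading stress mark. Cluster separation, palatal nasal normalization, and diphthong handling are left out.

```python
import re

def normalize_spanish_ipa(word):
    """Apply three of the listed normalization rules to one IPA word.

    Assumes `word` is a single word with no stress marks or spaces;
    this is a simplification for illustration only.
    """
    word = re.sub(r"b$", "β", word)   # word-final /b/ -> /β/ (lenition)
    word = re.sub(r"d$", "ð", word)   # word-final /d/ -> /ð/ (lenition)
    word = re.sub(r"^ɣ", "ɡ", word)   # word-initial /ɣ/ -> /ɡ/ (reinforcement)
    return word

if __name__ == "__main__":
    for w in ["berdad", "klub", "ɣato", "dedo"]:
        print(w, "->", normalize_spanish_ipa(w))
```

Anchoring each pattern to the start or end of the word keeps the rules from touching the same sounds in other positions, so a word like "dedo" passes through unchanged.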
Portuguese — Separated Brazilian Portuguese-specific vowel remapping rules that were incorrectly placed in the shared pt.yaml during v2.95. European Portuguese speakers will no longer hear Brazilian-shifted vowels.