TGSpeechBox v2.96 Release Notes
Overview
v2.96 is a major quality release focused on naturalness, consonant identity, and duration intelligence. The synthesizer now produces speech with noticeably better rhythmic flow, crisper consonant distinctions, and more natural phrase boundaries. Many of these improvements draw on classic formant synthesis research from the 1980s–90s, adapted and tuned for modern use across all 27 language packs. This release also includes a critical normalization engine fix discovered during Spanish language pack development.
Normalization Engine: Cascade Protection Fix
A bug in the normalization replacement engine caused cascade corruption when multiple rules targeted single-character phonemes with suffixed variants. When rules like a → a_es and e → e_es fired in sequence, the e inside a_es would be caught by the next rule, and the s by the rule after that — each rule's output feeding unprotected into subsequent rules. The result was garbled phoneme streams that spelled out suffixes literally instead of mapping to language-specific phonemes.
The replacement engine now tracks protected positions using a length-based heuristic: when a replacement produces more characters than it consumed (e.g. a → a_es), the output is marked protected and invisible to subsequent rules. Same-length swaps (e.g. u → ᵾ) remain unprotected, preserving intentional chaining where one rule's output feeds the next. This fixes the clone-for-language workflow in the phoneme editor and unblocks any language pack that remaps multiple single-character phonemes — which is the standard approach for language-specific tuning.
With this fix, rule ordering becomes more significant: conditional/specific rules (e.g. ɣ → ɡ when atWordStart) must come before their unconditional counterparts (e.g. ɣ → ɣ_es), since the first matching rule now definitively claims the position.
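The protection heuristic and its effect on rule ordering can be modelled in a few lines of Python. This is only an illustrative sketch, not the engine's actual code: the `Rule` class, the `apply_rules` function, the character-level matching, and the whitespace-based word-start test are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    src: str                     # phoneme text to match (assumed non-empty)
    dst: str                     # replacement text
    at_word_start: bool = False  # hypothetical stand-in for the atWordStart condition


def apply_rules(text: str, rules: list[Rule], protect: bool = True) -> str:
    # Each cell is (character, protected flag).
    cells = [(ch, False) for ch in text]
    for rule in rules:
        n = len(rule.src)
        # Length heuristic: only output longer than the match is protected.
        grows = protect and len(rule.dst) > n
        out, i = [], 0
        while i < len(cells):
            window = cells[i:i + n]
            hit = (
                len(window) == n
                and not any(flag for _, flag in window)
                and "".join(ch for ch, _ in window) == rule.src
                and (not rule.at_word_start or i == 0 or cells[i - 1][0].isspace())
            )
            if hit:
                out.extend((ch, grows) for ch in rule.dst)
                i += n
            else:
                out.append(cells[i])
                i += 1
        cells = out
    return "".join(ch for ch, _ in cells)


remap = [Rule("a", "a_es"), Rule("e", "e_es"), Rule("s", "s_es")]
print(apply_rules("ae", remap, protect=False))  # old behaviour: suffix letters get rewritten
print(apply_rules("ae", remap))                 # a_ese_es

specific_first = [Rule("ɣ", "ɡ", at_word_start=True), Rule("ɣ", "ɣ_es")]
print(apply_rules("ɣaɣ", specific_first))        # ɡaɣ_es
print(apply_rules("ɣaɣ", specific_first[::-1]))  # ɣ_esaɣ_es
```

In the last call the unconditional rule runs first and protects its output, so the word-start rule never gets a chance to fire. That is why the specific rule has to be listed first.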
Found and reported by Mateo (@rmcpantoja), who persisted through two rounds of "not a bug" before the cascade mechanism was identified. Community testing at its finest.
Micro-Event Frame Emission
The frame emission pipeline has been rewritten around a micro-event architecture. Instead of emitting uniform frames for each phoneme, the synthesizer now generates context-sensitive sub-segments — voice bars, bursts, aspiration, and steady-state portions — each with independently tuned amplitudes and durations. This eliminates several classes of artifacts:
- Voice bar clicks at stop onsets are gone
- Pitch discontinuities across segment boundaries are smoothed
- Fricative clusters no longer produce spurious energy spikes
- Affricates have properly shaped burst-to-frication transitions
- Burst amplitude scales naturally with phoneme context
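As a rough picture of what "micro-events" means, the sketch below splits a single stop consonant into sub-segments. Everything in it is hypothetical: the `MicroEvent` type, the `emit_stop` function, and all durations and amplitudes are placeholder values chosen for illustration, not the synthesizer's tuned numbers.

```python
from dataclasses import dataclass


@dataclass
class MicroEvent:
    kind: str           # "voice_bar", "burst", "aspiration", or "steady"
    duration_ms: float
    amplitude: float    # relative gain, 0.0 to 1.0


def emit_stop(total_ms: float, voiced: bool, before_vowel: bool) -> list[MicroEvent]:
    """Split one stop consonant into sub-segments instead of uniform frames."""
    burst_ms = min(10.0, total_ms * 0.2)
    # Illustrative rule: only voiceless stops released into a vowel are aspirated.
    asp_ms = total_ms * 0.3 if (not voiced and before_vowel) else 0.0
    closure_ms = total_ms - burst_ms - asp_ms

    events = []
    if voiced:
        events.append(MicroEvent("voice_bar", closure_ms, 0.3))
    else:
        events.append(MicroEvent("steady", closure_ms, 0.0))  # silent closure
    # Burst amplitude depends on context: weaker when no vowel follows.
    events.append(MicroEvent("burst", burst_ms, 0.8 if before_vowel else 0.5))
    if asp_ms > 0.0:
        events.append(MicroEvent("aspiration", asp_ms, 0.4))
    return events


for event in emit_stop(80.0, voiced=False, before_vowel=True):
    print(event)
```

Because each sub-segment carries its own amplitude and duration, a click at a stop onset or a spike inside a cluster can be tuned away in one event without touching the others.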
Duration Model Improvements
The duration system has been significantly expanded, informed by analysis of historical formant synthesizer duration tables and Klatt (1979) timing research.
Voiceless coda lengthening — When a voiceless consonant follows a voiced segment, the consonant stretches to compensate for pre-voiceless vowel shortening. In "bat," the vowel shortens but the /t/ expands — the syllable stays balanced. Time shifts from vowel to consonant, giving voiceless codas a crisper, more defined presence.
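The time transfer is easy to see with numbers. The sketch below is a minimal illustration with made-up parameters: `rebalance_vc`, `shorten`, and `transfer` are hypothetical names, and the durations are not taken from the actual duration tables.

```python
def rebalance_vc(vowel_ms: float, coda_ms: float,
                 shorten: float = 0.75, transfer: float = 1.0) -> tuple[float, float]:
    """Shorten a vowel before a voiceless coda and hand the time to the coda.

    shorten  -- fraction of the vowel that is kept (hypothetical value)
    transfer -- share of the removed time given to the consonant (hypothetical)
    """
    removed = vowel_ms * (1.0 - shorten)
    return vowel_ms - removed, coda_ms + removed * transfer


vowel, coda = rebalance_vc(160.0, 80.0)
print(vowel, coda)  # 120.0 120.0
```

With `transfer=1.0` the syllable keeps its total length (240 ms in this example). A smaller value would let the syllable shrink slightly.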
Phrase-final nasal sustain — Nasals in phrase-final position now sustain longer than other coda consonants. "Column." and "reason." have a natural linger on the final /n/ that was previously absent. Nasals are sonorant and continuous — they can sustain without sounding artificial.
Per-place stop duration scaling — Stops now carry individual duration multipliers based on place of articulation. Velars are inherently longer than alveolars, which are longer than labials. This matches articulatory reality — the tongue body moves more slowly than the tongue tip.
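In code terms this is just a per-place multiplier on the base stop duration. The multipliers below are illustrative placeholders that respect the labial < alveolar < velar ordering. The table names and the `stop_duration` helper are assumptions, and real packs would set the scale on each phoneme definition.

```python
# Placeholder multipliers; only their ordering reflects the text above.
PLACE_SCALE = {"labial": 0.95, "alveolar": 1.0, "velar": 1.1}
STOP_PLACE = {
    "p": "labial", "b": "labial",
    "t": "alveolar", "d": "alveolar",
    "k": "velar", "ɡ": "velar",
}


def stop_duration(phoneme: str, base_ms: float) -> float:
    """Scale a base stop duration by place of articulation."""
    return base_ms * PLACE_SCALE[STOP_PLACE[phoneme]]


for ph in ("p", "t", "k"):
    print(ph, stop_duration(ph, 60.0))
```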
Phrase-final lengthening retuning — Nucleus, onset, and coda scaling at phrase boundaries have been recalibrated against research data. Diphthongs are now handled separately from monophthongs, receiving appropriate lengthening without over-stretching.
Pre-voiceless shortening calibration — Vowel shortening before voiceless consonants has been tuned to match measured ratios from acoustic phonetics literature. The effect is stronger before stops than fricatives, matching natural speech patterns.
Prominence-based reduction — Unstressed vowels reduce more aggressively, with calibration informed by historical synthesizer duration tables.
NVDA Driver: Language now in synth ring, moved up.

A small change with a big usability payoff, as requested by Muchanchoasado: "language" now sits next to "voice" in the speech dialog and has been added to the synth settings ring for easy switching.
Tuning work
Below are general notes on tuning done this round.
Consonant Identity
Nasal place distinctions — The five nasals (/m/, /n/, /n̩/, /ŋ/, /ɲ/) now have distinct spectral identities through properly differentiated nasal anti-resonance frequencies. Previously, /n/ and /ŋ/ shared identical anti-resonance values, making "ban" and "bang" nearly indistinguishable. Each nasal now has a unique spectral notch reflecting its oral cavity length — from labial (lowest) through alveolar, palatal, to velar (highest).
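A pack audit for this can be quite small. The sketch below checks that nasals at different places of articulation have strictly increasing anti-resonance values in the order given above. The function name, the place table, and the frequencies are assumptions for illustration, and the sketch treats /n/ and /n̩/ as the same place without comparing them to each other.

```python
# Places ordered from lowest to highest expected anti-resonance.
PLACE_RANK = {"labial": 0, "alveolar": 1, "palatal": 2, "velar": 3}
NASAL_PLACE = {"m": "labial", "n": "alveolar", "n̩": "alveolar",
               "ɲ": "palatal", "ŋ": "velar"}


def audit_nasals(cfn0: dict[str, float]) -> list[str]:
    """Return a warning for every nasal pair whose notch order contradicts place."""
    problems = []
    for a, fa in cfn0.items():
        for b, fb in cfn0.items():
            rank_a = PLACE_RANK[NASAL_PLACE[a]]
            rank_b = PLACE_RANK[NASAL_PLACE[b]]
            if rank_a < rank_b and fa >= fb:
                problems.append(f"/{a}/ at {fa} Hz is not below /{b}/ at {fb} Hz")
    return problems


# Placeholder frequencies, not values from any shipped pack.
print(audit_nasals({"m": 400, "n": 500, "ŋ": 500}))            # flags /n/ vs /ŋ/
print(audit_nasals({"m": 400, "n": 500, "ɲ": 600, "ŋ": 700}))  # []
```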
Velar burst compactness — Velar stops (/k/, /ɡ/) now have a tighter, lower-pitched burst shape. Energy is concentrated in the F2–F3 pinch region with minimal high-frequency leakage, giving velars their characteristic "compact" quality distinct from the bright, diffuse bursts of alveolars.
Velar vowel anticipation — Vowels preceding velar consonants now show anticipatory F2/F3 convergence via targeted special coarticulation rules. This velar pinch on the vowel is the primary perceptual cue that tells a listener "a velar is coming" before the consonant itself arrives. Affects /ŋ/, /k/, and /ɡ/ in coda position.
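One simple way to picture the pinch is to pull F2 and F3 toward their midpoint over the last part of the vowel. This is a sketch under stated assumptions, not the actual coarticulation rule: `velar_pinch`, `start`, and `strength` are hypothetical, and the linear ramp is just the simplest shape to show.

```python
def velar_pinch(f2: float, f3: float, progress: float,
                start: float = 0.6, strength: float = 0.5) -> tuple[float, float]:
    """Converge F2 and F3 toward their midpoint late in the vowel.

    progress -- position in the vowel, 0.0 (onset) to 1.0 (offset)
    start    -- where the anticipation begins (hypothetical value)
    strength -- fraction of the F2-F3 gap closed at the vowel offset (hypothetical)
    """
    if progress <= start:
        return f2, f3
    weight = strength * (progress - start) / (1.0 - start)
    mid = (f2 + f3) / 2.0
    return f2 + weight * (mid - f2), f3 + weight * (mid - f3)


print(velar_pinch(1800.0, 2600.0, 0.5))  # unchanged early in the vowel
print(velar_pinch(1800.0, 2600.0, 1.0))  # gap narrows toward the velar
```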
Context-dependent velar locus — Velar coarticulation now adapts to vowel context. Before back vowels ("go," "cool"), the velar locus sits low, producing a natural glide as the tongue body pulls away from the soft palate. Before front vowels ("geese," "key"), the locus stays high. This split, grounded in decades of acoustic phonetics research, was the missing piece that made "go" and "doe" sound nearly identical in previous versions.
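The front/back split can be sketched as a locus-style interpolation in which the F2 transition starts partway between a place-specific locus and the vowel target. All numbers and names below (`velar_f2_onset`, the two locus frequencies, the front/back threshold, and the factor `k`) are illustrative assumptions rather than the shipped settings.

```python
def velar_f2_onset(vowel_f2: float,
                   front_locus: float = 2300.0, back_locus: float = 1200.0,
                   front_threshold: float = 1500.0, k: float = 0.5) -> float:
    """Pick a high or low velar locus from the vowel's F2, then interpolate.

    k is the fraction of the way from the locus to the vowel target at which
    the transition starts; 0.0 starts on the locus, 1.0 on the vowel.
    """
    locus = front_locus if vowel_f2 >= front_threshold else back_locus
    return locus + k * (vowel_f2 - locus)


print(velar_f2_onset(2200.0))  # front vowel: onset stays high
print(velar_f2_onset(900.0))   # back vowel: onset sits low
```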
Affricate and Fricative Tuning
- /d͡ʒ/ voice amplitude reduced for cleaner frication onset
- Fricative cluster guards prevent energy spikes when multiple fricatives abut
- GOAT vowel (/oʊ/) rewritten for more natural diphthong trajectory
Fujisaki Pitch Model
- Reset behavior fixed to prevent pitch carry-over across utterances
- Phrase commands now clear properly at sentence boundaries
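For readers unfamiliar with the model: in the Fujisaki formulation each phrase command contributes a slowly decaying bump, Gp(t) = α² · t · e^(−αt), to log F0. A command that is never cleared therefore keeps bending the pitch of whatever comes next. The sketch below shows only the phrase component, and its class and method names are hypothetical rather than the synthesizer's own.

```python
import math


class PhraseComponent:
    """Phrase-command part of a Fujisaki-style F0 contour (accent part omitted)."""

    def __init__(self, alpha: float = 2.0):
        self.alpha = alpha    # decay rate in 1/s (illustrative default)
        self.commands = []    # list of (onset_time_s, amplitude)

    def add(self, onset_s: float, amplitude: float) -> None:
        self.commands.append((onset_s, amplitude))

    def reset(self) -> None:
        # Drop stale commands, e.g. at a sentence or utterance boundary.
        self.commands.clear()

    def value(self, t: float) -> float:
        total = 0.0
        for onset_s, amplitude in self.commands:
            dt = t - onset_s
            if dt >= 0.0:
                total += amplitude * self.alpha ** 2 * dt * math.exp(-self.alpha * dt)
        return total


def f0(base_hz: float, phrase: PhraseComponent, t: float) -> float:
    """F0 in Hz: the phrase component is added in the log domain."""
    return base_hz * math.exp(phrase.value(t))


phrase = PhraseComponent()
phrase.add(0.0, 0.5)
print(f0(100.0, phrase, 0.5))  # raised above the 100 Hz baseline
phrase.reset()
print(f0(100.0, phrase, 0.5))  # 100.0
```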
Language & Voice Tuning
Nasal bandwidth retuning — Cascade bandwidths (cb1–cb3) for nasals have been tightened across all language packs. Narrower bandwidths sharpen the spectral peaks that distinguish nasal place of articulation, making /m/ vs /n/ vs /ŋ/ more perceptually distinct. Anti-resonance bandwidths (cbN0, cbNP) adjusted per nasal to match oral cavity coupling differences.
Vowel bandwidth calibration — Close vowels (/i/, /u/) now have tighter cb1 values than open vowels (/a/, /æ/), reflecting the inverse relationship between jaw opening and F1 bandwidth in natural speech. Mid vowels sit between. This subtle change improves vowel identity at high speaking rates where formant trajectories compress.
UK English PALM vowel separation — UK RP BATH/PALM words now map directly to their own vowel symbol (ᵅː, F2=1200) instead of sharing ɑː with US English. Previously, the US LOT-backing rule (ɑː → ᵅː) was inadvertently catching UK BATH output, creating a dependency between the two dialects. Each dialect now owns its vowel mapping independently.
What This Means for Language Packs
Many of these features are controlled by YAML settings and automatically benefit all language packs:
- Voiceless coda lengthening — any language with microprosodyPreVoicelessShortenEnabled gets coda compensation. One YAML line, every language benefits.
- Nasal sustain — French phrase-final nasals, Portuguese nasal codas, Hungarian sentence-final nasals all improve with a single scale value.
- Nasal anti-resonance — language packs should audit their nasal cfN0 values to ensure proper place distinctions.
- Velar locus splitting — Germanic, Romance, and Slavic languages all have velar stops that behave differently before front vs. back vowels. The front/back locus settings are available to all packs.
- Per-place stop duration — the durationScale field on phoneme definitions works for any language.
- Cascade-safe normalization — language packs can now freely use single-character phoneme remapping (a → a_xx, e → e_xx) without cascade corruption. This is the recommended pattern for language-specific phoneme tuning.
Future tuning work
Not everything is green and gold. The Brazilian rules still need to be separated out of PT.yaml. English words like "firefox" and "dot" need longer vowels. Spanish needs more work so that Mexican Spanish sounds less Castilian. 2.97 is coming soon with more tuning, since for the first time in months I feel like we have a solid framework to build robust languages on.