TGSpeechBox v2.98 Release Notes
A minor point release
As we inch ever closer to the next milestone of 3.0, this release focuses on small but meaningful enhancements that help intelligibility and reading.
High-Rate Intelligibility (DSP Layer)
At high speech rates (NVDA 85–100%), short phonemes like the American English flapped /ɾ/ in "batteries" were vanishing — the word sounded like "barries." Root cause analysis using DECTalk Mini source code revealed that the cascade resonators couldn't express formant identity fast enough at 3x+ speed, and tap segments had 100% crossfade with zero steady-state samples at every rate.
Six changes, three in the DSP layer, two in the frontend, one in the normalization engine:
- Rate-adaptive bandwidth widening (frame_emit.cpp): Cascade resonator bandwidths (cb1–cb3) scale up above highRateThreshold, reducing resonator settling time τ ≈ 1/(π×BW). Gated with a linear ramp from threshold to ceiling so normal-rate speech is untouched. Configurable per language pack via highRateBandwidthWideningFactor.
- Speed-scaled formant alpha (frame.cpp): The endCf exponential smoothing time constant now clamps to 15% of frame duration, so formant ramps reach their target even in short high-speed frames. Previously the fixed alpha (0.004) meant formant trajectories covered only 1/3 of their path at 3x speed.
- Tap steady-state floor (frame_emit.cpp): Taps (_isTap) now have fade ratio capped at 0.5, guaranteeing a sliver of unblended formant identity in the middle of the segment. Previously taps had fade == min (ratio 1.00) at every speed: the tap was 100% crossfade with no moment where its formant signature existed uncrossfaded.
- Coarticulation strength scaling (coarticulation.cpp): Rate-dependent rateScale reduces coarticulation displacement at speeds above highRateThreshold, so formants don't need to travel as far.
- Fade ratio capping (boundary_smoothing.cpp): maxFadeRatio tightens from 0.75 to 0.40 at high speeds, preserving more steady-state time per segment.
- Threshold tuning: highRateThreshold raised to 2.5 (from 2.0) so rates below NVDA 85% (2.3x speed) get zero compensation, which means no muffling at comfortable listening rates.
Result: Intelligible through NVDA rate 85 (2.5x speed). Rate 90+ shows improvement but approaches fundamental limits of formant synthesis at those durations.
Normalization Engine: Tie-Bar Replacement Guard
Bug: Spanish _es suffix phonemes (a_es, s_es, etc.) were being spoken as literal letters in certain words. "WhatsApp" → "whatsesapp." Reported by @rmcpantoja.
Root cause: The global alias table fuses ts → t͡s (affricate with tie bar) before language replacement rules run. When s → s_es then fires, it matches the bare s that follows the affricate, producing t͡s_es. The greedy tokenizer correctly consumes t͡s as a single phoneme, leaving _es as orphaned characters — underscore dropped, e and s spoken literally.
Fix: Seven-line guard in applyRules() (ipa_engine.cpp): if the character immediately before a match start is a tie bar (U+0361 or U+035C), the match is rejected. A tied phoneme is bound to its predecessor and must not be replaced independently. This protects every X͡Y affricate across all language packs — no per-language YAML workarounds needed.
The same t͡s_es pattern also affected "IBMTTS".
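The guard can be illustrated with a small standalone replacement routine. This is a sketch of the idea, not the seven lines in applyRules(): replaceUnlessTied and isTieBar are hypothetical names, and operating on std::u32string code points is an assumption about the text representation, made to keep the example short.

```cpp
#include <cstddef>
#include <string>

// True for the combining tie bars: U+0361 (above) and U+035C (below).
bool isTieBar(char32_t c) {
    return c == 0x0361 || c == 0x035C;
}

// Replace every occurrence of `from` with `to`, except where the match starts
// immediately after a tie bar. A tied phoneme belongs to its predecessor, so
// such a match is rejected and the text is copied through unchanged.
std::u32string replaceUnlessTied(const std::u32string& text,
                                 const std::u32string& from,
                                 const std::u32string& to) {
    if (from.empty()) {
        return text;
    }
    std::u32string out;
    std::size_t i = 0;
    while (i < text.size()) {
        bool matches = text.compare(i, from.size(), from) == 0;
        bool tied = i > 0 && isTieBar(text[i - 1]);
        if (matches && !tied) {
            out += to;
            i += from.size();
        } else {
            out += text[i];
            ++i;
        }
    }
    return out;
}
```

With the rule s → s_es, a bare s is still rewritten, but the s in t͡s is left alone because the code point before it is U+0361, so the tokenizer never sees an orphaned _es.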
Prominence Pass Updates
- Full-vowel floor (fullVowelFloor: 0.4): Non-schwa vowels with zero prominence now get boosted to a minimum floor instead of being reduced like schwas. Fixes compound word second elements ("firefox", "laptop", "desktop") where the unstressed-but-full vowel (/ɒ/, /æ/, /ɛ/) was getting hit by reducedCeiling and amplitudeReductionDb, making them sound swallowed. At 0.4 they escape the reduced path and land on the prominent floor instead.
- Monosyllable skip: Both the stress dictionary and the prominence pass now skip monosyllabic words entirely, so eSpeak's stress placement wins. Function words like "for", "is", "or" can sound flat in phrases because eSpeak strips their stress mark, leaving Fujisaki with no accent command and prominence with nothing to boost. Solving this properly requires either Fujisaki baseline accent commands for all voiced words or a prominence floor for lone-word stress groups. Both are deferred to a future release.
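The full-vowel floor described above amounts to a one-line clamp, sketched here. The struct and function names are hypothetical, the reducedCeiling value of 0.3 is an assumption chosen only so the example has something to compare against, and treating "at or below the ceiling" as the reduced path is likewise an assumption about how the prominence pass decides.

```cpp
#include <algorithm>

struct ProminenceParams {
    double fullVowelFloor = 0.4;  // value from the release notes
    double reducedCeiling = 0.3;  // assumed value, for illustration only
};

// Raise non-schwa vowels to the floor. Schwas and non-vowels pass through.
double flooredProminence(double prominence, bool isVowel, bool isSchwa,
                         const ProminenceParams& p) {
    if (isVowel && !isSchwa) {
        return std::max(prominence, p.fullVowelFloor);
    }
    return prominence;
}

// Assumed rule: anything at or below reducedCeiling is treated as reduced.
bool takesReducedPath(double prominence, const ProminenceParams& p) {
    return prominence <= p.reducedCeiling;
}
```

Under these assumptions, the second vowel of "laptop" arrives with prominence 0.0, is raised to 0.4, and no longer qualifies for reduction, while a true schwa with prominence 0.0 is still reduced.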
Engine Internals
- Formant alpha scaling uses frame-duration-relative capping (maxTau = minNumSamples × 0.15) rather than requiring speedQuotient passthrough: simpler architecture, same effect.
- Bandwidth widening is applied to the base[] array after the prevBase save, so voice bar bandwidths stay at natural width while all micro-frame paths (diphthong, stop burst, fricative) inherit widened values automatically.
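The frame-duration-relative cap on the formant smoothing time constant can be worked through numerically. The sketch below assumes a one-pole smoother of the form y += alpha × (target − y) and the approximation alpha ≈ 1/tau with tau in samples. Both are assumptions about frame.cpp, as are the function names and the 100-sample frame used in the example.

```cpp
#include <algorithm>
#include <cmath>

// Per-sample smoothing coefficient after clamping the time constant to 15% of
// the frame length (maxTau = frameSamples * 0.15). Assumes baseAlpha > 0 and
// alpha ~= 1 / tau, with tau measured in samples.
double clampedAlpha(double baseAlpha, int frameSamples) {
    double baseTau = 1.0 / baseAlpha;
    double maxTau = std::max(1.0, frameSamples * 0.15);
    double tau = std::min(baseTau, maxTau);
    return 1.0 / tau;
}

// Fraction of the distance to the target covered after n samples of
// y += alpha * (target - y).
double coveredFraction(double alpha, int n) {
    return 1.0 - std::pow(1.0 - alpha, n);
}
```

For a hypothetical 100-sample frame, the fixed alpha of 0.004 covers about a third of the path, which is consistent with the figure quoted for 3x speed. Clamping tau to 15 samples raises alpha to 1/15 and covers more than 99% of the path. Long frames are unaffected because their 15% cap exceeds the base time constant.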
Spanish Language Pack
- Word-final tap allophone (ɾ_wf): fricationAmplitude 0.03 with voicing 0.95 and raised F2 (1720 Hz) distinguishes utterance-final /ɾ/ from /n/. Fixes "perder" being heard as "perdén" and "amor" as "amón". Confirmed by Dreamburguer.
- es.yaml normalization: Intervocalic tap micro-schwa insertion (V-ɾ-V → V-ᵊɾ-V), consonant cluster separation (βɾ, ðɾ, ɡɾ, kɾ, pɾ, tɾ + sonorant/velar clusters), word-final lenition (b→β, d→ð), word-start velar hardening (ɣ→ɡ), trill as doubled tap (r → ɾᵊɾ).
NVDA Driver Profile saving bug fix
Fixed a bug that prevented the currently chosen voice from being saved. This was reported as issue #17 and is resolved in 2.98.