github tgeczy/TGSpeechBox v-290
TG SpeechBox with phoneme editor, NVDA Addon, SAPI5, Speech Dispatcher module version 2.90

latest releases: v-301, v-300, v-300rc2...
one month ago

TGSpeechBox v2.90 Release Notes

Release date: February 2026

This is a major feature release bringing substantially improved speech quality through continuous formant movement, place-aware transitions, syllable-aware processing, a stress correction dictionary, and multilingual support improvements including NVDA driver localization and script-aware language switching for non-Latin languages. The phoneme editor and tuning guide receive continuous updates with the new settings, so it is your one-stop-shop for tuning phonemes and language settings.


Speech quality

Subglottal coupling — chest voice warmth

A single Sg1 pole-zero pair (630 Hz pole / 590 Hz zero) is now inserted into the cascade filter chain before the oral formants. This models the acoustic coupling between the trachea and the oral cavity that gives real voices their "chest resonance." The effect is subtle but addresses the hollow, nasal quality that users reported in previous versions. The frequencies are fixed (tracheal length doesn't change with articulation) and will be exposed as voicingTone parameters in a future release for per-voice tuning.

Cluster blend — consonant-to-consonant coarticulation

New frontend pass that shapes formant trajectories between adjacent consonants in clusters. Where cluster timing (v2.80) shortened consonants to prevent stacking, cluster blend modifies the spectral trajectory so consonants anticipate each other's place of articulation instead of jumping abruptly.

The pass classifies consonant pairs by manner (stop, fricative, nasal, liquid) and applies pair-specific blend strengths:

Pair type Scale Example
Nasal → Stop 1.30 /ŋk/ in "blank", /nt/ in "sent"
Fricative → Stop 0.85 /st/ in "start", /sk/ in "desk"
Stop → Fricative 0.70 /ts/, /ks/
Liquid → Stop 0.85 /lt/ in "belt", /rk/ in "work"
Stop → Stop 0.55 /kt/ in "act", /pt/ in "kept"

Context modifiers reduce blending for homorganic pairs (same place of articulation — less shift needed) and across word boundaries (weaker articulatory planning). F1 blend uses a reduced strength since F1 reflects jaw height, not place.

All parameters are YAML-configurable per language pack under the clusterBlend: block. See Tuning.md for the full reference.

Forward drift — continuous formant movement

The cluster blend pass now includes a second stage that fills endCf targets on any token that doesn't already have them. Previously, consonants adjacent to vowels had frozen formants during their hold phase — vowels glided smoothly (thanks to coarticulation), but consonants between them sat dead still. The ear heard glide-freeze-glide-freeze at every phoneme boundary.

Forward drift fixes this by looking ahead to the next real phoneme and setting endCf to drift partway toward it. With the default forwardDriftStrength: 0.30, a /b/ (cf2=900) before /ə/ (cf2=1400) now drifts 150 Hz toward the schwa during its hold phase. F1 drift is scaled down by f1Scale to prevent a tight-jaw sound.

This is not coarticulation (which computes where formants should be using locus math) — it's a simple "don't freeze" mechanism that ensures formants are always moving toward the next target. The result is speech that flows continuously rather than stepping between discrete segments.

Boundary smoothing — place-aware transition speeds

Boundary smoothing has been significantly expanded with physics-based per-place transition speeds. Different articulators have different physical mass and inertia, so their formant transitions should arrive at different rates:

Place Articulator F2 Speed Rationale
Labial Lips (light, fast) 0.60 Lips move independently of the tongue — F2/F3 drift rather than snap
Alveolar Tongue tip (light, precise) 0.40 Tip flicks fast — formants arrive quickly
Palatal Tongue blade (medium) 0.55 F3 must linger in the postalveolar region — F3 IS the /ʃ/ vs /s/ distinction
Velar Tongue body (heavy) 0.65 Tongue body is the heaviest articulator — F2 transition IS the velar cue

Additional improvements:

  • V→C departure slowdown (1.4×): when leaving a vowel, formant transitions are 40% slower — the ear expects the vowel to hold its identity before consonant influence takes over.
  • Pre-silence vowel protection: utterance-final vowels skip transScale entirely, preventing formants from bending into silence and sounding metallic (e.g. the /iː/ in "three").
  • Utterance-final consonant gentling (1.5×): word-final consonants before silence get softer transition speeds — there's no following target to rush toward.
  • Within-syllable gentling: transitions between tokens sharing the same syllable index get 1.5× slower transScale and 1.3× longer fade, since they belong to a single articulatory gesture.
  • All 13 boundary-type fade times are now YAML-configurable per language pack.

Prominence pass improvements

The prominence pass now includes syllable-position duration shaping (Pass 2b). After the main stressed/unstressed vowel scaling, a second stage adjusts durations based on position within the syllable:

  • Onset consonants: +10% duration (they initiate the articulatory gesture)
  • Coda consonants: −15% duration (they trail off)
  • Unstressed open-syllable vowels: −10% duration (stacks with prominence reduction)

Word-final syllables are skipped since they're already shaped by phrase-final lengthening. Monosyllabic words are skipped (no positional contrast). Configurable via the syllableDuration: block in language YAML.


Syllable detection

The engine now assigns a syllableIndex to every token, enabling passes to distinguish within-syllable transitions (one articulatory gesture, smoother) from cross-syllable transitions (separate gestures, crisper).

Onset maximization

For words found in the stress dictionary, the text parser inserts IPA . syllable boundary markers at linguistically correct positions using a legal onset table. The algorithm finds the longest legal onset suffix in a consonant cluster between two vowel nuclei.

Example: "abstract" /æbstɹækt/ — the cluster /bstɹ/ between vowels is split as /æb.stɹækt/ because /stɹ/ is a legal English onset. Without onset maximization, the heuristic would produce /æbstɹ.ækt/.

The legal onset table is defined in the language pack:

settings:
  syllableStructure:
    legalOnsets:
      - pl
      - 
      - bl
      - 
      - 
      - kl
      - 
      - fl
      - 
      - sl
      - sm
      - sn
      - sp
      - st
      - sk
      - sw
      - ʃɹ
      - spl
      - spɹ
      - stɹ
      - skɹ
      - skw

Languages without a syllableStructure: block use the heuristic fallback (consonant→vowel transition = syllable start).


Stress dictionary

The text parser now corrects eSpeak's stress placement using a dictionary derived from CMU Pronouncing Dictionary data. The dictionary contains stress-only digit patterns (no phonemes, no ARPABET) for 109,659 words, tuned by Rommix. From his work, we were able to get the placement for not just stress, but audit against Espeak's own to come up with the En-US table. Huge help.

The approach: for each word, the parser counts vowel nuclei in eSpeak's IPA output. If the count matches the dictionary entry, stress marks (ˈ/ˌ) are repositioned to the correct syllables. eSpeak's phoneme choices stay untouched — no flapped T's, no "sundee", no ARPABET-to-IPA alignment nightmares. Safety guards prevent placing primary stress on reduced vowels (ə, ɐ, ᵻ) and skip monosyllables entirely.

Verification: 96,879 of 109,660 words pass eSpeak alignment (nucleus count matches). The remaining ~12% are words where eSpeak segments differently than CMU Dict expected — these are silently skipped.


SAPI5 engine

The SAPI5 wrapper (tgsbSAPI5.dll) is now included in release builds. Windows applications that use the SAPI5 interface can access TGSpeechBox directly without the NVDA add-on.


NVDA driver

Localization

The NVDA driver now supports localization via standard .po / .mo gettext files. Three translations ship with this release:

  • Spanish
  • Brazilian Portuguese
  • Hungarian

The tools/compile_mo.py script in the repository compiles .po source files into the .mo binary format needed by NVDA. Community translators can add new languages by creating a .po file and running the compile script.

MultiLang addon compatibility

The driver now fires synthDoneSpeaking events correctly, fixing compatibility with the MultiLang-V.2025.10.04.nvda-addon. Thank you to @Sevapopov@mastodon.ml for reporting this.

Script-aware language switching

When the active language uses a non-Latin script (Russian, Bulgarian, Greek, Arabic, etc.) and the text contains Latin-script words, the driver now detects script boundaries and temporarily switches eSpeak to English for those segments. This produces correct IPA for Latin words instead of routing them through the wrong phonology.

Example: "Привет Hello мир" with Russian active → eSpeak processes "Привет" and "мир" as Russian, "Hello" as English, and the frontend receives correct IPA for all three words.

The Latin fallback language defaults to British English (en-gb). The detection covers Cyrillic, Greek, Arabic, Hebrew, Georgian, Armenian, CJK, Indic, Thai, and other non-Latin script families. Performance impact on Latin-script languages (English, Spanish, etc.) is effectively zero — the function exits on the third line after a single set lookup.

Language packs

  • Mexican Spanish (es-mx) seseo fix: θ → s normalization rule added.
  • Syllable-aware processing enabled for English with the legal onset table above.
  • Forward drift and cluster blend enabled by default for English (en-us, en-gb).

Summary of new YAML settings

All new settings are documented in Tuning.md with full parameter references. Key additions:

Block Key settings
clusterBlend: strength, per-manner-pair scales, homorganicScale, wordBoundaryScale, f1Scale, forwardDriftStrength
boundarySmoothing: Per-place transition scales (labial/alveolar/palatal/velar × F1/F2/F3), withinSyllableScale, withinSyllableFadeScale, per-boundary fade times
syllableDuration: onsetScale, codaScale, unstressedOpenNucleusScale
syllableStructure: legalOnsets list
prominence: durationReducedCeiling, durationProminentFloorMs (refinements)


Don't miss a new TGSpeechBox release

NewReleases is sending notifications on new releases.