github tgeczy/TGSpeechBox v-280
TG SpeechBox with phoneme editor and NVDA Addon, Speech Dispatcher module version 2.80

Overview of Version 2.80

TGSpeechBox v2.80 is a major update to the DSP and frontend, introducing DSP v7 with per-formant transition control, a YAML-driven allophone rule engine, boundary smoothing, cluster timing, Fujisaki pitch model improvements, and expanded platform support including Linux AARCH64.


DSP v7

FrameEx Expansion (18 → 23 Parameters)

Five new per-frame parameters give the frontend fine-grained control over how transitions between phonemes are rendered:

  • transF1Scale, transF2Scale, transF3Scale, transNasalScale — per-formant-group transition speed. Values below 1.0 make formants reach their target faster within the fade window. This allows formant frequencies to lead the amplitude crossfade, preventing artifacts where full-amplitude audio plays with incorrect formant positions.
  • transAmplitudeMode — selects the interpolation curve for amplitude parameters. 0.0 = linear (default), 1.0 = equal-power (sin/cos). The equal-power path maintains constant energy across source transitions like voiced→voiceless. Currently dormant (default 0.0) but the full infrastructure is in place for future use.
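
As a rough illustration of how a per-formant transition scale could let formants lead the amplitude crossfade, here is a minimal sketch. The function names and the exact curve (a linear ramp compressed into the first `scale` portion of the fade window) are assumptions for illustration, not the DSP's actual code.

```python
def formant_progress(t: float, scale: float) -> float:
    """Fraction of the formant move completed at fade position t in [0, 1].

    Assumption: a scale below 1.0 compresses the move into the first
    `scale` portion of the fade window. The real DSP curve may differ.
    """
    if scale <= 0.0:
        return 1.0
    return min(1.0, max(0.0, t) / scale)


def interpolate_formant(start_hz: float, target_hz: float,
                        t: float, scale: float = 1.0) -> float:
    """Linear formant interpolation driven by the scaled progress."""
    p = formant_progress(t, scale)
    return start_hz + (target_hz - start_hz) * p


# Halfway through the fade, a scale of 0.5 has already reached the target,
# while a scale of 1.0 is only halfway there.
halfway_default = interpolate_formant(500.0, 1500.0, 0.5, 1.0)
halfway_fast = interpolate_formant(500.0, 1500.0, 0.5, 0.5)
```

With a scale of 0.5, the formant sits at its target for the second half of the window, so the incoming amplitude never plays at full level over a formant that is still in motion.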

Resonator Decay During Silence

Cascade and parallel resonator state now decays during silence gaps (when preFormantGain drops below threshold). This prevents residual resonator energy from coloring the onset of the next phoneme — correct acoustic behavior that eliminates a class of subtle onset artifacts.
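
The idea can be sketched with a textbook two-pole (Klatt-style) resonator whose stored outputs are attenuated whenever the gain falls below a threshold. The threshold, the decay factor, and all names below are hypothetical. The release notes do not give the real values or the DSP's internal structure.

```python
import math

SILENCE_THRESHOLD = 0.01  # hypothetical threshold on preFormantGain
STATE_DECAY = 0.9         # hypothetical per-sample attenuation in silence


class Resonator:
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""

    def __init__(self, freq_hz: float, bandwidth_hz: float,
                 sample_rate: float = 22050.0):
        r = math.exp(-math.pi * bandwidth_hz / sample_rate)
        self.c = -r * r
        self.b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
        self.a = 1.0 - self.b - self.c
        self.y1 = 0.0  # previous output
        self.y2 = 0.0  # output before that

    def process(self, x: float) -> float:
        y = self.a * x + self.b * self.y1 + self.c * self.y2
        self.y2, self.y1 = self.y1, y
        return y

    def decay_state(self, factor: float) -> None:
        """Attenuate the stored outputs so ringing dies out faster."""
        self.y1 *= factor
        self.y2 *= factor


def render_sample(res: Resonator, excitation: float,
                  pre_formant_gain: float) -> float:
    """Render one sample, draining resonator state during silence."""
    if pre_formant_gain < SILENCE_THRESHOLD:
        res.decay_state(STATE_DECAY)
    return res.process(excitation * pre_formant_gain)
```

Without the extra decay, a narrow-bandwidth resonator keeps ringing for hundreds of samples after its input stops, and that leftover energy would colour the next onset.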

Equal-Power Crossfade Infrastructure

A complete frontend→DSP transition hint system. The frontend detects source type changes (voiced↔voiceless) using frame-level amplitude tracking and signals the DSP via transAmplitudeMode. The DSP then applies sin/cos interpolation curves, whose squared gains sum to one, so total energy stays constant across the transition. While dormant by default, this architecture enables per-frame crossfade control for any future transition type (glottal stops, aspiration contrasts, tense/lax distinctions) without DSP code changes.
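
To make the difference between the two amplitude modes concrete, the sketch below computes the outgoing and incoming gains for a linear and an equal-power crossfade. The function name and the rule of treating transAmplitudeMode as an on/off switch at 0.5 are assumptions. The DSP may blend between the modes differently.

```python
import math


def crossfade_gains(t: float, mode: float = 0.0) -> tuple[float, float]:
    """Return (outgoing_gain, incoming_gain) at fade position t in [0, 1].

    mode 0.0 -> linear gains (they sum to 1)
    mode 1.0 -> equal-power gains (their squares sum to 1)
    Assumption: any mode >= 0.5 selects the equal-power curve.
    """
    t = min(1.0, max(0.0, t))
    if mode >= 0.5:
        angle = t * math.pi / 2.0
        return math.cos(angle), math.sin(angle)
    return 1.0 - t, t


# At the midpoint the linear curve carries half the power of either
# endpoint, while the equal-power curve keeps it at 1.
lin_out, lin_in = crossfade_gains(0.5, 0.0)
eq_out, eq_in = crossfade_gains(0.5, 1.0)
linear_power = lin_out ** 2 + lin_in ** 2      # 0.5
equal_power = eq_out ** 2 + eq_in ** 2         # approximately 1.0
```

The midpoint power dip of the linear curve is what an equal-power fade avoids when the outgoing and incoming sources are uncorrelated, as with a voiced source fading into noise.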


Frontend

Allophone Rule Engine

A new YAML-driven system for expressing phonological rules that depend on position, stress, and phonetic context. Rules are defined per language pack and support five action types:

  • replace — swap a phoneme for another (e.g., intervocalic flapping: /t/ → [ɾ])
  • scale — multiply duration, fade, or field values (e.g., unreleased word-final stops)
  • shift — blend formant values toward a target (e.g., dark /l/ with F2 → 900 Hz)
  • insert-before / insert-after — add phonemes at boundaries (e.g., glottal reinforcement)

Rules can match by phoneme key, feature flags, position (word-initial, word-final, intervocalic, pre-vocalic, post-vocalic, syllabic), stress level, and neighboring phoneme identity. The engine skips closure, aspiration, and gap tokens when determining neighbors, so rules match on phonological context rather than synthesis artifacts.
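
A minimal sketch of neighbor lookup that skips synthesis-only tokens, applied to a simplified flapping rule, might look like the following. The `Token` fields and function names are hypothetical. Stress and feature-flag matching are left out, and the word-initial check mirrors the guard described under the flapping fix in these notes.

```python
from dataclasses import dataclass, replace as dc_replace

# Token kinds that exist only for synthesis and are ignored as neighbors.
SKIPPED_KINDS = {"closure", "aspiration", "gap"}


@dataclass(frozen=True)
class Token:
    key: str
    kind: str = "phoneme"       # "phoneme", "closure", "aspiration", "gap"
    is_vowel: bool = False
    word_initial: bool = False


def neighbor(tokens, index, step):
    """Nearest phonological neighbor in direction step (-1 or +1)."""
    i = index + step
    while 0 <= i < len(tokens):
        if tokens[i].kind not in SKIPPED_KINDS:
            return tokens[i]
        i += step
    return None


def is_intervocalic(tokens, index):
    """True between two vowels, unless the token starts a word."""
    if tokens[index].word_initial:
        return False
    prev, nxt = neighbor(tokens, index, -1), neighbor(tokens, index, +1)
    return bool(prev and nxt and prev.is_vowel and nxt.is_vowel)


def apply_flapping(tokens):
    """Simplified 'replace' action: intervocalic /t, d/ become a flap."""
    out = []
    for i, tok in enumerate(tokens):
        if (tok.kind == "phoneme" and tok.key in ("t", "d")
                and is_intervocalic(tokens, i)):
            tok = dc_replace(tok, key="ɾ")
        out.append(tok)
    return out
```

Because the closure token is skipped, a /t/ preceded by its own closure still sees the vowel before it as its left neighbor.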

Boundary Smoothing

A new pass that adjusts formant transition speeds at segment boundaries to reduce the "concatenated phoneme" effect. Rather than modifying amplitude crossfade timing, the pass computes per-boundary-type transition ratios and applies them through transF1Scale/transF2Scale/transF3Scale — formants arrive at their target early in the fade window while amplitude timing remains untouched.

The hybrid architecture includes:

  • Per-transition-type fade targets — 13 boundary types (vowel→stop, stop→vowel, nasal→fricative, etc.) each with YAML-configurable timing
  • Aspiration bypass — tokens with high aspiration and low voicing keep crisp onsets
  • Voicing flip guard — prevents buzz from stretching fades across voiced↔voiceless boundaries, with exemptions for stops (which have natural closure gaps)
  • Formants-lead design — formant frequencies reach targets before or with the amplitude crossfade, never after

Currently enabled for Hungarian; disabled for English pending per-language tuning.

Special Coarticulation Pass

A new YAML-driven pass that applies targeted formant shifts on vowels adjacent to specific consonant classes. Runs after the general MITalk-K locus coarticulation, stacking additional context-dependent adjustments:

  • Rhotic F3 lowering — vowels next to /ɹ/ get F3 pulled down, creating the American English rhotic color
  • Labial F2 lowering — vowels next to /w/, /b/, /p/, /m/ get rounder
  • Alveolar back-vowel fronting — back vowels next to /t/, /d/, /n/, /s/, /z/ get subtle F2 raising

Rules support cumulative application, per-side control (before/after/both), unstressed vowel scaling, and phrase-final stressed scaling.
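
One way to picture cumulative, per-side formant shifts is the sketch below. The `CoartRule` fields, the blend arithmetic, and the reading of "before" as "trigger consonant precedes the vowel" are all assumptions made for illustration. The actual YAML schema is not shown in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoartRule:
    triggers: frozenset          # consonant keys that trigger the rule
    formant: str                 # "F1", "F2" or "F3"
    target_hz: float             # value the formant is pulled toward
    blend: float                 # 0..1 fraction of the distance to move
    side: str = "both"           # "before", "after" or "both"
    unstressed_scale: float = 1.0


def apply_rules(formants, prev_key, next_key, rules, stressed=True):
    """Return a shifted copy of a vowel's formants.

    Cumulative application: each matching side applies the blend once,
    so a vowel flanked by triggers on both sides is shifted twice.
    """
    out = dict(formants)
    for rule in rules:
        blend = rule.blend * (1.0 if stressed else rule.unstressed_scale)
        sides = []
        if rule.side in ("before", "both"):
            sides.append(prev_key)   # trigger precedes the vowel
        if rule.side in ("after", "both"):
            sides.append(next_key)   # trigger follows the vowel
        for key in sides:
            if key in rule.triggers:
                current = out[rule.formant]
                out[rule.formant] = current + (rule.target_hz - current) * blend
    return out


# Hypothetical rhotic rule: pull F3 halfway toward 1600 Hz.
rhotic = CoartRule(frozenset({"ɹ"}), "F3", 1600.0, 0.5)
vowel = {"F1": 600.0, "F2": 1700.0, "F3": 2600.0}
one_side = apply_rules(vowel, "ɹ", "t", [rhotic])    # F3 -> 2100.0
both_sides = apply_rules(vowel, "ɹ", "ɹ", [rhotic])  # F3 -> 1850.0
```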

Cluster Timing

A new pass for shortening consonants within clusters so words like "adjust," "texts," and "strengths" sound natural rather than dictated one consonant at a time. YAML-configurable scales for:

  • Fricative before stop, stop before fricative, fricative before fricative, stop before stop
  • Triple cluster middle consonant compression
  • Affricate in cluster scaling
  • Word-medial consonant and word-final obstruent shortening
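
The pairwise and triple-cluster scaling can be sketched as a single pass over the segment list. The scale values, the `Seg` type, and the choice to shorten the first consonant of each pair are hypothetical. The real pass and its YAML keys may be organized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    key: str
    manner: str          # "stop", "fricative", "vowel", ...
    duration_ms: float


# Hypothetical scales for (this consonant, following consonant).
PAIR_SCALES = {
    ("fricative", "stop"): 0.85,
    ("stop", "fricative"): 0.85,
    ("fricative", "fricative"): 0.80,
    ("stop", "stop"): 0.75,
}
TRIPLE_MIDDLE_SCALE = 0.8   # hypothetical extra compression
OBSTRUENTS = {"stop", "fricative"}


def apply_cluster_timing(segs):
    """Return new segments with cluster consonants shortened."""
    scaled = []
    for i, seg in enumerate(segs):
        prev = segs[i - 1] if i > 0 else None
        nxt = segs[i + 1] if i + 1 < len(segs) else None
        scale = 1.0
        if nxt is not None:
            scale *= PAIR_SCALES.get((seg.manner, nxt.manner), 1.0)
        # Middle consonant of a three-obstruent cluster shrinks further.
        if prev is not None and nxt is not None and all(
                s.manner in OBSTRUENTS for s in (prev, seg, nxt)):
            scale *= TRIPLE_MIDDLE_SCALE
        scaled.append(Seg(seg.key, seg.manner, seg.duration_ms * scale))
    return scaled


# "texts": /t ɛ k s t s/ with a flat 100 ms per segment for clarity.
texts = [Seg("t", "stop", 100.0), Seg("ɛ", "vowel", 100.0),
         Seg("k", "stop", 100.0), Seg("s", "fricative", 100.0),
         Seg("t", "stop", 100.0), Seg("s", "fricative", 100.0)]
timed = apply_cluster_timing(texts)
```

In this example the initial /t/, the vowel, and the final /s/ keep their durations, /k/ is shortened by the stop-before-fricative scale, and the two cluster-internal consonants receive both the pair scale and the triple-cluster compression.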

Fujisaki Pitch Model Rewrite

The Fujisaki-Bartman pitch contour generator in the frontend has been rewritten for more accurate phrase and accent contour generation. Improvements include proper declination modeling, corrected filter implementation, and better interaction with eSpeak's base pitch values. The result is more natural intonation curves, particularly noticeable in longer utterances.
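
For orientation, the textbook Fujisaki model sums phrase and accent components in the log-F0 domain. The sketch below implements that classical formulation only. It is not the rewritten TGSpeechBox generator, and the parameter defaults (alpha, beta, gamma) are conventional illustrative values rather than the project's settings.

```python
import math


def phrase_response(t: float, alpha: float = 2.0) -> float:
    """Impulse response of the phrase control mechanism."""
    if t < 0.0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t: float, beta: float = 20.0, gamma: float = 0.9) -> float:
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrases, accents, alpha=2.0, beta=20.0):
    """F0 in Hz at time t (seconds).

    phrases: list of (onset_time, amplitude)
    accents: list of (start_time, end_time, amplitude)
    """
    log_f0 = math.log(base_hz)
    for onset, amp in phrases:
        log_f0 += amp * phrase_response(t - onset, alpha)
    for start, end, amp in accents:
        log_f0 += amp * (accent_response(t - start, beta)
                         - accent_response(t - end, beta))
    return math.exp(log_f0)


# One phrase command at t = 0 and one accent from 0.2 s to 0.4 s.
contour = [fujisaki_f0(i * 0.05, 120.0, [(0.0, 0.5)], [(0.2, 0.4, 0.4)])
           for i in range(40)]
```

The phrase component rises quickly and then decays back toward the base frequency, which gives the gradual downward drift over an utterance that declination modeling is meant to capture.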

Flapping Stress Fix

Intervocalic flapping (/t,d/ → [ɾ]) in American English no longer fires across word boundaries. Previously, "number to" incorrectly flapped the /t/ because the rule only checked for surrounding vowels without verifying the consonant was word-internal. The allophone engine's intervocalic position check now rejects word-initial tokens.

YAML-Configurable Boundary Smoothing Fade Times

All 13 boundary transition types now read their fade times from the language pack YAML rather than using hardcoded values. Languages can override individual transitions:

boundarySmoothing:
  enabled: true
  vowelToStopFadeMs: 22.0
  stopToVowelFadeMs: 20.0
  stopToFricFadeMs: 14.0
  # ... etc.

Defaults match previous hardcoded behavior, so existing packs are unaffected.
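
The override behaviour can be pictured as a per-key merge of pack values over built-in defaults. In the sketch below the default table reuses the three values from the snippet purely as placeholders, the `enabled` default of false is an assumption, and the function name is hypothetical. The pack is assumed to be already parsed into a dict.

```python
# Placeholder defaults: only three of the 13 keys are shown, and the
# numbers are borrowed from the snippet, not confirmed as the defaults.
DEFAULT_FADES_MS = {
    "vowelToStopFadeMs": 22.0,
    "stopToVowelFadeMs": 20.0,
    "stopToFricFadeMs": 14.0,
}


def resolve_boundary_smoothing(pack: dict) -> dict:
    """Merge a pack's boundarySmoothing section over the defaults."""
    section = pack.get("boundarySmoothing") or {}
    resolved = {"enabled": bool(section.get("enabled", False))}
    for key, default in DEFAULT_FADES_MS.items():
        resolved[key] = float(section.get(key, default))
    return resolved


# A pack that overrides a single transition keeps every other default.
pack = {"boundarySmoothing": {"enabled": True, "stopToVowelFadeMs": 16}}
settings = resolve_boundary_smoothing(pack)
```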


Deprecated Settings

The following YAML settings are still parsed for backward compatibility but are no longer functional. They can be safely removed from language pack files. No warnings are emitted — existing packs continue to load without error.

Coarticulation (replaced by MITalk-K locus model + special coarticulation)
  • coarticulationTransitionExtent — replaced by endCf1-3 ramping within vowel duration
  • coarticulationFadeIntoConsonants — new model modifies vowel start formants instead of fading into consonants
  • coarticulationWordInitialFadeScale — no longer used
  • coarticulationAlveolarBackVowelEnabled — replaced by specialCoarticulation rules (e.g., alveolar-back-vowel-fronting)
  • coarticulationLabializedFricativeFrontingEnabled — replaced by specialCoarticulation rules

Fujisaki Pitch (replaced by multi-phrase exponential declination)
  • fujisakiDeclinationScale — replaced by fujisakiDeclinationRate (exponential decay, no kink)
  • fujisakiDeclinationMax — no longer needed; exponential declination asymptotes naturally
  • fujisakiDeclinationPostFloor — no longer needed; exponential curve has no floor discontinuity
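
Pack maintainers who want to clean up old files could filter the deprecated keys out of a parsed settings mapping. This is a small sketch that assumes the settings sit in one flat dict. The real pack layout may nest them, and the helper name is hypothetical.

```python
# Key names taken from the deprecated-settings list above.
DEPRECATED_KEYS = frozenset({
    "coarticulationTransitionExtent",
    "coarticulationFadeIntoConsonants",
    "coarticulationWordInitialFadeScale",
    "coarticulationAlveolarBackVowelEnabled",
    "coarticulationLabializedFricativeFrontingEnabled",
    "fujisakiDeclinationScale",
    "fujisakiDeclinationMax",
    "fujisakiDeclinationPostFloor",
})


def strip_deprecated(settings: dict) -> tuple[dict, list]:
    """Return (settings without deprecated keys, sorted removed keys)."""
    kept = {k: v for k, v in settings.items() if k not in DEPRECATED_KEYS}
    removed = sorted(k for k in settings if k in DEPRECATED_KEYS)
    return kept, removed


old = {"fujisakiDeclinationScale": 1.2, "fujisakiDeclinationRate": 0.4}
cleaned, removed = strip_deprecated(old)
```

Since the deprecated settings are still parsed and simply ignored, this cleanup is optional and does not change synthesis output.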

Language Packs

Hungarian

  • Intervocalic /h/ → [ɦ] allophone rule for natural voicing continuity between vowels
  • Boundary smoothing enabled with tuned per-transition values
  • Vowel length contrast system with short ceiling / long floor enforcement
  • Geminate consonant timing (closure scale, release scale, pre-geminate vowel shortening)

English (US)

  • Intervocalic flapping with word-boundary guard
  • Dark /l/ by position (post-vocalic, syllabic, pre-vocalic with graduated blending)
  • Unreleased word-final voiceless stops (duration and frication scaling)
  • Rhotic, labial, and alveolar special coarticulation rules
  • Liquid dynamics with rhotic F3 dip

Platform Support

Linux AARCH64

TGSpeechBox now ships native Linux AARCH64 binaries for Raspberry Pi and other 64-bit ARM (ARMv8) machines. The feature set matches x86_64: full DSP v7, all language packs, and Speech Dispatcher integration.

Linux x86_64

Updated to support DSP v7. Binaries: tgsbRender, libtgspeechbox.so, libtgsbFrontend.so, tgsp wrapper. Symlinks maintained for backward compatibility with older names.


Tools

tgSBPhonemeEditor

  • New menu items for editing allophone rules and special coarticulation rules directly in the GUI
  • Full support for the allophone rule system: replace, scale, shift, insert actions with position/stress/neighbor filtering
  • Special coarticulation rule editing with trigger phonemes, vowel filters, formant targets, and per-side control

NVDA Driver

  • Race conditions removed when canceling speech — the driver no longer randomly interrupts utterances mid-playback
  • Compatible with NVDA 2023.2 through 2026.1
