github tgeczy/TGSpeechBox v-300b3
TG SpeechBox with phoneme editor, NVDA Addon, SAPI5, Linux, Android, iOS, Mac OS, version 3.0 public beta 3

Pre-release, one month ago

TGSpeechBox v3.0-beta3

Changes since v3.0-beta2.

New Features

  • Fujisaki clause-type prosody overrides: The hardcoded clause-type multipliers (question, exclamation, comma) are now exposed as 18 YAML settings via FujisakiClauseOverrides — 6 fields (phraseAmpScale, accentBoost, declinationScale, basePitchScale, finalRiseScale, finalDropScale) × 3 clause types. Language pack authors can now fine-tune intonation contours without touching C++.
  • Per-diphthong duration scaling: The global diphthongDurationScale is replaced by per-pair pairScales in the diphthong collapse YAML section. Wide diphthongs (PRICE, MOUTH, CHOICE) get longer scales; narrow diphthongs (FACE, GOAT) get shorter. No more one-size-fits-all tradeoff between starving PRICE or bloating GOAT.
  • Diphthong rate compensation: New diphthongRateCompensation setting partially undoes speech-rate compression on diphthong glides. At NVDA rate 80+, bare diphthongs like "I" and "Y" were losing their identity. en-us tuned to 0.15 — enough to preserve glide character at high rates.
  • Unreleased first stop in clusters: New stop_before_stop_unreleased allophone rule suppresses the burst on the first stop in clusters like /kt/ ("locked"), /pt/ ("kept"), /gd/ ("bagged"). In natural speech the first stop is unreleased — the vocal tract moves directly from one closure to the next. Only the last stop gets released. Applied to en-us, en-gb, and en-au.
  • Stop burst spectral templates: Alveolar, labial, and velar stops now have research-based burst spectra from Stevens & Blumstein (1978) and Blumstein & Stevens (1979). Alveolar = diffuse-rising (energy toward high frequencies), velar = compact (mid-frequency peak), labial = diffuse-falling (flat/low). This makes /d/ clearly distinct from /g/ and /t/ from /k/.
  • Clause-final sonorant hold: New clauseFinalHoldMs setting (30ms for English) extends the final voiced sonorant in multi-word utterances, preventing clipping of word-final nasals, liquids, and vowels in connected speech.
  • NVDA exponential glottal sharpness slider: Wider usable range via exponential mapping.
  • Python synthesis tools upgraded: formant_trajectory.py now has 9 micro-event emitter functions matching the C++ frame_emit.cpp dispatch. Resonator class matches the bilinear-transform DF1 architecture. Glottal source upgraded from symmetric cosine to hybrid LF model with SR-adaptive sharpness, breathiness/creakiness modulation, and anti-alias filtering.
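
As a quick check on the "18 YAML settings" figure for FujisakiClauseOverrides, the sketch below enumerates the 6 fields across the 3 clause types. The field and clause-type names come from the notes above; the dotted `clause.field` key format and the `override_keys` helper are hypothetical, used only for illustration, and may not match the actual YAML layout.

```python
from itertools import product

# Names taken from the release notes above.
FIELDS = (
    "phraseAmpScale", "accentBoost", "declinationScale",
    "basePitchScale", "finalRiseScale", "finalDropScale",
)
CLAUSE_TYPES = ("question", "exclamation", "comma")


def override_keys():
    """Return one illustrative key per (clause type, field) pair.

    The "clause.field" format is an assumption, not the real YAML schema.
    """
    return [f"{clause}.{field}" for clause, field in product(CLAUSE_TYPES, FIELDS)]


if __name__ == "__main__":
    keys = override_keys()
    print(len(keys))  # 6 fields x 3 clause types
    for key in keys:
        print(key)
```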

Join TestFlight for macOS and iOS!

Join TestFlight here by clicking this link from your mobile device.

Bug Fixes

  • MOUTH/NURSE vowel overlap: The MOUTH diphthong onset (ä) had F2=1430, only 30 Hz from schwa/NURSE (F2=1400), making "outside" sound like "ertside". Retuned to Hillenbrand (1995) GenAm data: F1=780, F2=1300. Now 100 Hz from schwa — clearly "ah" not "er".
  • en-gb GOAT/NURSE merger: Monophthongizing GOAT əʊ→əː merged it with NURSE (ɜː) because both phonemes share identical F1=500, F2=1400. Fix: restore GOAT as a diphthong. Also lowered onset hold exponents (1.8/1.7→1.3) so the offglide is actually audible.
  • en-gb word-final vowel cutoff: Word-final schwa in "paper", "never" cut out abruptly. Added singleWordFinalHoldMs (15ms) and clauseFinalFadeMs (10ms) for en-gb.
  • Stop cluster merging: Two consecutive stops (e.g. /kt/ in "locked") sounded like one consonant. Widened cluster closure gap (en.yaml 16→24ms, en-us 20→26ms) and added the unreleased-first-stop rule. Cluster-final stops are now exempt from the unreleased_word_final rule so the released stop keeps its full burst.
  • Cascade resonator pops: Audible clicks during fast formant transitions in connected speech. Fixed with transition bandwidth widening (sine-envelope Q reduction during crossfades), voice bar formant pre-positioning, and a minimum 10ms transition floor.
  • Cascade nasal quality: Series cascade architecture produced nasal-murmur character. Fixed with high-shelf boost (+5.5 dB at 2 kHz), per-formant cascade bandwidth sharpening (F1×0.75, F2×0.88), and reduced subglottal coupling (0.6→0.3).
  • Diphthong glide regression (v2.97→v2.98): Three bugs — diphthong collapse skipped semivowel offglides, rate-adaptive fade ratio starved tied offglide crossfades, and prevTokenWasTap flag leaked through diphthong continue paths.
  • False diphthong tying on lengthened vowels: "going" /oː/+/ɪ/ was falsely tied after GOAT monophthongization, creating an /o→ɪ/ glide that sounded like CHOICE. Fix: require onset lengthened == 0.
  • Tap "spy-(gap)-der" at slow rates: The flapping allophone kept 50% of the original stop closure gap, which at slow rates produced a 15ms silence before the tap. Fix: replaceClosureScale is now 0 (no gap).
  • Tap "fourgy" transient: Taps shorter than 15ms compressed the 3-phase notch into an affricate-like burst. Fix: taps under the 15ms floor now fall through to single-frame emission.
  • Intervocalic tap inaudibility: Tap /ɾ/ between similar vowels (/iː/→/ɪ/ in "eating") was nearly silent. Raised fricationAmplitude 0.25→0.55 and micro-event notch threshold 15→8ms.
  • Text parser number misalignment: Numbers like "6402" expand to multiple IPA words but are one text word, misaligning all subsequent stress corrections. Fixed with multi-stress chunk splitting and greedy-merge look-ahead anchoring.
  • Text parser time format stress: "07:06:23 PM" had PM stress mangled by over-eager look-ahead. Reverted to first-match anchoring which handles both numbers and time formats.
  • Purge handler pitch bleed: On purge, stale low-pitch from utterance-final declination was snapshotted and crossfaded into the next utterance. Fix: mark as silence so from-silence transition fires.
  • Cross-utterance trajectory leak: hasPrevBase and hasPrevFrameEx in TrajectoryState were never reset between emitFrames calls, causing the first stop of a new utterance to inherit stale pitch.
  • Diphthong micro-frame shimmer at low pitch: At 80 Hz, 8ms micro-frames don't span a full glottal cycle, so resonators never settle. Fix: pitch-proportional interval with 2-cycle floor.
  • Allophone replace was cosmetic-only: The replace action only changed the phoneme key — now overwrites all fields from the new PhonemeDef.
  • NEAR tie bar for en-us: eSpeak API produces i͡ə with tie bar for words like "medium". Strip via preReplacement — GenAm doesn't use the centering diphthong.
  • Cross-phase preReplacement corruption: preReplacement output could be mangled by the replacements phase. Fix: escape into PUA-A codepoints.
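
The arithmetic behind the micro-frame shimmer fix can be sketched as follows. At 80 Hz one glottal cycle lasts 12.5ms, so an 8ms micro-frame is shorter than a single cycle. This is a minimal sketch of one way to read "pitch-proportional interval with 2-cycle floor": the function name, the parameters, and the choice to keep 8ms as the base interval at higher pitch are assumptions, not the C++ implementation.

```python
def micro_frame_interval_ms(f0_hz, base_ms=8.0, min_cycles=2.0):
    """Return a micro-frame interval that spans at least `min_cycles` glottal cycles.

    Hypothetical helper: base_ms=8.0 and min_cycles=2.0 mirror the numbers in
    the note above, but the real engine may combine them differently.
    """
    if f0_hz <= 0:
        # Unvoiced or unknown pitch: there is no cycle length to respect.
        return base_ms
    period_ms = 1000.0 / f0_hz
    return max(base_ms, min_cycles * period_ms)


if __name__ == "__main__":
    for f0 in (80.0, 120.0, 400.0):
        print(f0, micro_frame_interval_ms(f0))
```

Under these assumptions the interval only grows at low pitch, where two cycles take longer than 8ms. At higher pitch, two cycles already fit inside the base interval, so it is left unchanged.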

Language Pack Improvements

  • Spanish dialect-specific approximants: Castilian vs Latin American /d/ allophone rules now split — Castilian uses interdental [θ̠] in intervocalic position, Latin American uses standard [ð]. Also fixed base Spanish approximant phonemes (β, ð, ɣ, ʝ, ɲ).
  • en-au Australian vowel recovery: Recovered the 2016 hand-tuned KIT, STRUT, and LOT formants that were overwritten during the v2.96 en-gb refactor. Reduced ɒ_au voiceAmplitude (0.9→0.7).
  • en-gb RP tuning: Diphthong offglide targets, labial burst shaping, boundary smoothing refinements. Place-specific voiced stop burst boost (labial/alveolar/velar). Word-final voiced alveolar softening.
  • en-us alveolar burst reshape: /d/ and /t/ burst spectra reshaped to diffuse-rising template. /d/ cf2 1600→1700, burstDurationMs 7→9. Word-final /d/ softened via allophone rule.
  • VOT tuning: Aspiration duration standardized at 35ms, voice bar amplitude at 0.50.

Platform Improvements

  • Android per-voice engine settings: Voice quality sliders are now stored per-voice, with fallback to the global keys for migration. Added an Editing Voice dropdown on the Advanced tab and a reset dialog with an all-voices toggle.
  • iOS AudioUnit version sync: Extension version now tracks CURRENT_PROJECT_VERSION so TestFlight updates are discovered correctly. Default volume lowered to 80% to prevent clipping.
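
The per-voice lookup with fallback to global keys described for Android can be sketched like this. It is an illustration in Python, not the app's code: the `voice_id/name` key format, the function name, and the plain dict standing in for the settings store are all assumptions.

```python
def get_voice_setting(store, voice_id, name, default):
    """Look up a per-voice setting, falling back to the older global key.

    Hypothetical sketch: `store` is a plain dict, and the "voice_id/name"
    key format is invented for this example.
    """
    per_voice_key = f"{voice_id}/{name}"
    if per_voice_key in store:
        return store[per_voice_key]
    # Migration path: values saved before per-voice storage used the bare name.
    return store.get(name, default)


if __name__ == "__main__":
    store = {"breathiness": 0.2, "adam/breathiness": 0.5}
    print(get_voice_setting(store, "adam", "breathiness", 0.0))
    print(get_voice_setting(store, "benjamin", "breathiness", 0.0))
```

Reading the global key only as a fallback means existing users keep their old slider values until they edit a specific voice.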

Known Issues (Active Tuning)

  • en-gb PRICE vowel still sounds like a snarky teenager: "Five" and "nine" in en-gb are improved but the glide doesn't curve down the way RP speakers naturally let their tongue sweep through the diphthong. Right now it lands more like a thick "ey" — think Stewie Griffin doing an impression of a British punk. The onset-to-offglide shaping is on the workbench for beta4.
