TGSpeechBox v3.10 Beta 4 — Spanish + Croatian phonetic tuning, ear-tested
A tuning-focused beta. No DSP architecture changes (the planned
endCb1/2/3 bandwidth sweep work is deferred to a later beta), but
substantial improvements to specific mispronunciations across two
languages — every change validated against the linguist-tuned eSpeak NG
output and confirmed by ear during a live A/B test session.
Croatian — three concrete fixes from Mario's "sounds German" feedback
Mario Perčinić reported on Mastodon (2026-04-25) that Croatian "rules
sound OK with eSpeak but a bit unnatural with TGSpeechBox. Vowels need
to be improved... sounds more like a German person talking Croatian."
A live ear-test session 2026-04-26 isolated three concrete bugs that
together produced exactly that "Germanic" percept:
1. Word-final stops dropped. "let" rendered as "lé", "put" as "pú".
Croatian /t/, /k/, and /d/ at word end weren't releasing — the frame engine
cut to silence abruptly before the burst could decay. hr.yaml had no
single-word tuning settings, so it inherited 0ms hold + 0ms fade.
Other Slavic-adjacent languages set ~30ms / ~18ms (de.yaml, bg.yaml).
Adopted matching values. "let", "put", "pet" all now release final /t/
audibly. This was likely the dominant trigger of Mario's "German"
percept since German final stops devoice — same end-state to a
non-Germanic listener.
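In hr.yaml terms, the adopted values look roughly like this. A sketch only: singleWordFinalHoldMs is the key named later in these notes, but the fade key name and the surrounding layout are assumptions, not the verified schema.

```yaml
# Sketch of the hr.yaml single-word tuning block. Only the
# singleWordFinalHoldMs key name is attested; the fade key is assumed.
singleWordFinalHoldMs: 30   # hold the final frame so the stop burst can decay
singleWordFinalFadeMs: 18   # fade out instead of cutting to silence abruptly
```

These match the ~30ms / ~18ms values already used by de.yaml and bg.yaml.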
2. /e/ vowel rendered as Hungarian "é" (closed). Croatian /e/ is
open-mid front, F1≈470 / F2≈1900 per Bašić 2023. hr.yaml mapped /e/ to
e_fi (395 / 2070, Finnish closed /e/) — F1 75 Hz too low. eSpeak hr
emits /ɛ/ but hr.yaml had no /ɛ/ replacement so it stayed as base /ɛ/
(F2 too low). Both now route to ɛ_hu (555 / 1870), which is the
closest existing borrow to the Bašić target. Words affected: "let",
"pet", "selo".
3. /a/-vs-/æ/ stress distinction collapsed. eSpeak hr emits /æ/
for word-final unstressed /a/ ("kava" → kˈavæ), giving a subtle
stress-induced vowel raise Croatian listeners actually hear. hr.yaml
had collapsed both /a/ and /æ/ → a_open assuming "Croatian wants
plain /a/", erasing the distinction. /æ/ now routes to base /a/ for a
subtle central raise without going front like English /æ/.
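Taken together, fixes 2 and 3 amount to a three-entry change in the hr.yaml replacement routing. A sketch of the resulting table (the dict-and-lookup shape is an assumption for illustration; only the mappings themselves come from the text):

```python
# Hypothetical view of the hr.yaml phoneme replacements after b4.
# Keys are eSpeak-emitted phonemes, values are TGSB voice phonemes.
hr_replacements = {
    "e": "ɛ_hu",   # was e_fi: too closed, read as Hungarian "é"
    "ɛ": "ɛ_hu",   # was unmapped, fell through to base /ɛ/ (F2 too low)
    "æ": "a",      # was a_open; base /a/ keeps the unstressed-final raise
}

def route(phoneme):
    """Apply dialect replacement, passing unmapped phonemes through."""
    return hr_replacements.get(phoneme, phoneme)

print(route("æ"))  # -> a
```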
Bonus: word-final unstressed vowel truncation in disyllabic words
("novo", "voda" had final /o/ /a/ cut off) is incidentally fixed by
the same singleWordFinalHoldMs settings.
Spanish — issue #81 cluster clarity
Greg de Jesus filed issue #81, a word-by-word list of rs/rc/zz
pronunciation issues. A live ear-test on b3+ es-mx output isolated two
distinct fixes:
1. /s_mx/ /ʃ/-thickness in /sj/ and /ts/ contexts. Words affected:
"pizza" rendering as "pee-z-sh-a"; "percepción" rendering "perCep-shón"
on the second /s/. /s_mx/ had pa4=0.15 (parallel formant energy at
3300 Hz — right in /ʃ/ peak territory, 2500-3500 Hz). The "Mexican
brightness via spread" intent was creating /ʃ/-coloration in palatalized
or affricated contexts. Matched /s_mx/ parallel amplitude profile to
base /s/ (pa4=0, pa5=0, pa6=0.75) keeping pf2=1700 as the laminal
distinguishing cue. Mexican /s/ now reads as bright high-frequency
sibilant in all contexts.
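As a parameter diff the fix is small. A sketch using the values quoted above (the pa/pf field names and numbers are from the text; representing a phoneme as a flat dict of synthesis fields is an assumption):

```python
# Base /s/ parallel amplitudes, as quoted in the text: pa4=0, pa5=0,
# pa6=0.75. pa4's band (~3300 Hz) sits inside the 2500-3500 Hz /ʃ/ peak,
# which is why any energy there reads as /ʃ/-coloration.
s_base = {"pa4": 0.0, "pa5": 0.0, "pa6": 0.75}

# /s_mx/ after b4: the base profile plus the laminal pf2 cue.
s_mx = dict(s_base, pf2=1700)
print(s_mx)
```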
2. /sɣ/ schwa-and-stop too separated. "hasgo" rendering as "as-Go"
with hard /g/ separation; eSpeak renders smooth "asgo" with continuous
approximant. The es.yaml sɣ → sᵊɡ rule (insert schwa, route to plain
/ɡ/'s 30ms closure) was over-articulating. Removed the rule entirely;
/sɣ/ now flows through dialect replacement as /s_mx/ + /ɡ_es/ (8ms
closure + full burst from b201's intervocalic lenition fix). Result
matches eSpeak es-mx /ˈasɣo/ continuous shape. Other cluster rules
(/lɣ/, /ɾɣ/) keep their schwa-insertion since they were unaffected.
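The resulting rule table can be sketched as follows. Hypothetical: the expansions shown for /lɣ/ and /ɾɣ/ mirror the removed sɣ → sᵊɡ pattern and are illustrative, not copied from es.yaml:

```python
# Sketch of the es.yaml cluster rules after b4. /sɣ/ has no entry, so it
# falls through to plain dialect replacement (/s_mx/ + /ɡ_es/, 8ms
# closure); /lɣ/ and /ɾɣ/ keep their schwa-insertion.
schwa_insertion_rules = {
    "lɣ": "lᵊɡ",
    "ɾɣ": "ɾᵊɡ",
    # "sɣ": "sᵊɡ",  # removed in b4: over-articulated "as-Go"
}

def expand(cluster):
    """None means: no special rule, use dialect replacement instead."""
    return schwa_insertion_rules.get(cluster)

print(expand("sɣ"))  # -> None
```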
Methodology — live ear-test harness via PowerShell SoundPlayer
This beta introduced a new way of working that paid off immediately.
Where ear-testing isn't available (engineer doesn't speak the language
fluently — Croatian here), the methodology now triangulates against
three reference points:
- eSpeak NG output — decades of linguist-tuned phonemization,
public, reproducible. Mario's "eSpeak sounds OK" gives the
calibration anchor.
- Peer-reviewed formant targets — Bašić 2023 for Croatian, Quilis
1981 for Mexican Spanish, etc. via Consensus paper search.
- TGSB rendered output — measured via librosa LPC formant
extraction on captured PCM.
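The harness uses librosa's LPC for that last measurement; the same autocorrelation-method LPC can be sketched in plain numpy. This is a self-contained toy, not the harness itself: it synthesizes a two-formant impulse response (not captured TGSB PCM), fits an order-4 all-pole model, and reads formant frequencies off the pole angles.

```python
import numpy as np

FS = 10_000  # Hz, toy sample rate

def resonator(x, f_hz, bw_hz):
    """Two-pole resonator (one formant), applied sample by sample."""
    r = np.exp(-np.pi * bw_hz / FS)
    c1, c2 = 2 * r * np.cos(2 * np.pi * f_hz / FS), -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + c1 * y[n - 1] + c2 * y[n - 2]
    return y

def lpc_formants(y, order):
    """Autocorrelation-method LPC; formants from pole angles."""
    R = np.array([np.dot(y[: len(y) - k], y[k:]) for k in range(order + 1)])
    toeplitz = np.array([[R[abs(i - j)] for j in range(order)]
                         for i in range(order)])
    a = np.linalg.solve(toeplitz, R[1:])            # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))   # A(z) = 1 - sum a_k z^-k
    freqs = [np.angle(z) * FS / (2 * np.pi) for z in roots if z.imag > 0]
    return sorted(freqs)

impulse = np.zeros(1024)
impulse[0] = 1.0
signal = resonator(resonator(impulse, 500, 80), 1500, 80)
f1, f2 = lpc_formants(signal, order=4)
print(round(f1), round(f2))  # close to the true 500 / 1500 Hz
```

On real vowel PCM you would window a steady-state slice and use a higher order (roughly 2 poles per expected formant plus a couple spare), but the pole-angle readout is the same.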
Where the engineer DOES speak the language (Spanish, Hungarian, English
here), the same A/B-against-eSpeak-NG harness runs but with the human
ear as the gating signal. WAVs render through tgsb_unit_tests audit
cases, eSpeak parallel-renders the same words, and PowerShell
SoundPlayer plays them back-to-back during the iteration loop. Every
change in this beta was validated this way.
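The render-and-playback loop can be sketched in a few lines. The espeak-ng `-v`/`-w` flags and the PowerShell SoundPlayer call are real; the tgsbRender argument shape shown here is an assumption for illustration:

```python
def ab_commands(word, voice, espeak_wav, tgsb_wav):
    """Build the render + back-to-back playback commands for one word.

    espeak-ng's flags are real; the tgsbRender CLI arguments are an
    assumed shape, not the documented interface.
    """
    render = [
        ["espeak-ng", "-v", voice, "-w", espeak_wav, word],
        ["tgsbRender", "--voice", voice, "--out", tgsb_wav, word],  # assumed
    ]
    play = [
        ["powershell", "-Command",
         f"(New-Object Media.SoundPlayer '{wav}').PlaySync()"]
        for wav in (espeak_wav, tgsb_wav)
    ]
    return render + play

for cmd in ab_commands("let", "hr", "espeak_let.wav", "tgsb_let.wav"):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True) would execute it, given the real tools
```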
This generalizes wherever a render pipeline can produce WAVs — including
SAPI engines on Windows, sd_generic via tgsbRender on Linux, etc. New
language tuning can now ship with measurable, defensible acoustic claims
("TGSB lands within X Hz of published Croatian targets, X Hz closer than
eSpeak") instead of vibes-based "we tweaked something and it sounds
better maybe."
Carried over from b3 / b3.01
The DSP v9 fricationTiltDb fix from b3 (#95 Bug 1, fast-rate /ɡ/
intelligibility) and the b301 hotfix that gated it on Place::Velar
remain in. Those tags are unified under b4 here — there's no separate
b3.01; lineage is b1, b2, b3, b4.
Deferred to a later beta
- endCb1/2/3 per-phoneme bandwidth evolution (Stevens 1998) for /l/
steady-state distinction and geminates cross-linguistically
- Croatian: /vr/ cluster onset transition cue, /k/ initial burst
weak frication, "uvod" silence on vowel-initial words (filed as
issue #99)
- Spanish: "siguiente" /ʝ/ over-articulation (the same /ʝ/ phoneme is
doing two jobs — full LL-fricative for "yo"/"calle" and softer
glide for /ɣj/ coalescence — needs a phoneme split, tracked in #20)
- /o-ɣ-o/ Kingston perceptual integration trap (diálogo) — phonetically
hard, may need a distinct DSP-level approach
Testers, you've been amazing
@mariopercinic — Mastodon feedback drove the entire Croatian arc. The
"sounds German" framing was diagnostic gold once we triangulated against
eSpeak.
@gregodejesus2 — #81's word-by-word "what should sound like what" list
was exactly the test corpus we needed. Perceptual reports that map to
specific phonotactic contexts make this work tractable.
@29-Bloo, @yaresDg, @rmcpantoja, @dgomez42 — your accumulated reports
across #84/#95/#74/#81 set the methodology bar that this beta tries to
meet. Specific, ear-truthful, paired with example words. Thank you.
— Tamas + Claudeo (Opus 4.7, 1M context)