tgeczy/TGSpeechBox v-310b5 on GitHub

TGSpeechBox v3.10 Beta 5 — the Android /s/→/z/ fix, and a post-mortem

This beta closes the two bugs that have defined the Android experience
for Spanish users since Beta 4: the /s/ /c/ /z/ sounds going muddy or
voiced on many devices (#100), and speech going silent when navigating
quickly with a screen reader. Both root causes are now understood end
to end, and we owe the testers on #100 a full accounting, so here it
is.

The /s/→/z/ leak — what actually happened (issues #95, #100)

Two things had to combine to produce this bug, and we shipped one of
them ourselves.

Part 1 — a regression in Beta 4. Thirteen minutes before the b4
tag, we landed a fix for a real problem: Mexican /s/ sounded /ʃ/-thick
("pizza" → "pizsha", "percepción" → "perCep-shón"). That fix removed
parallel formant energy at 3300 Hz — correct — but it also removed the
3850 Hz band and cut total /s/ frication energy by about a third,
leaving a single spectral peak where there had been a broad plateau.
Desktop listening said it was fine. Issue #100 was filed ten hours
after the release. Beta 3's /s/, with its spread-out energy, had been
surviving your devices all along; gregodejesus2's report that "this
didn't occur in previous versions" was exactly right, and it was the
clue that eventually cracked the case.

Part 2 — device audio processing. Many phones (mid-range models
especially) run voice-clarity / noise-suppression DSP on the live
screen-reader audio path. That processing treats high-frequency hiss
as noise and ducks it. Measured on the b4 /s/: essentially zero energy
above 6 kHz survived — and what remains below reads to the ear as /z/.
eSpeak NG's /s/ is brighter and louder up top, so it survives the same
processing on the same devices. That's why eSpeak was clean in every
A/B you ran while TGSB wasn't.

The fix. Spanish /s/ (both Castilian apical s_es and Mexican
laminal s_mx — base English /s/ untouched) now peaks at 6500 Hz with
wider bandwidth and more parallel energy (pf6 5250→6500, pb6→2000,
pa6→0.9). In a simulated -15 dB high-frequency duck, the new /s/
retains about twice the high-frequency energy of the old one and stays
unambiguously an /s/. Ear-validated across both dialects against the
baseline and an alternative candidate: clearest of the three, no
/ʃ/-thickness regression on pizza/percepción, no hiss on the Castilian
apical /s/.
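
The arithmetic behind a simulated duck can be sketched in a few lines of Python. This is a toy model, not the engine's DSP. It assumes a Lorentzian-shaped peak in place of the real parallel resonator, a duck applied uniformly above 6 kHz, and an upper limit of 11025 Hz (Nyquist for the 22050 Hz default). The old peak's bandwidth and amplitude are not stated above, so the sketch reuses the new values for both peaks. All function names are hypothetical.

```python
import math

def db_to_power(db: float) -> float:
    """Convert a gain in decibels to a linear power ratio."""
    return 10.0 ** (db / 10.0)

def peak_power(freq_hz, center_hz, bandwidth_hz, amplitude):
    """Toy Lorentzian peak: power halves at center +/- bandwidth / 2."""
    half = bandwidth_hz / 2.0
    return amplitude ** 2 / (1.0 + ((freq_hz - center_hz) / half) ** 2)

def energy_above(cutoff_hz, center_hz, bandwidth_hz, amplitude,
                 duck_db=0.0, nyquist_hz=11025.0, step_hz=5.0):
    """Sum the peak's power from the cutoff up to Nyquist, then apply the duck."""
    total = 0.0
    freq = cutoff_hz
    while freq < nyquist_hz:
        total += peak_power(freq, center_hz, bandwidth_hz, amplitude) * step_hz
        freq += step_hz
    return total * db_to_power(duck_db)

# ASSUMPTION: only the center frequencies (5250 and 6500 Hz) come from the
# notes above; the old peak's bandwidth and amplitude are placeholders.
OLD = dict(center_hz=5250.0, bandwidth_hz=2000.0, amplitude=0.9)
NEW = dict(center_hz=6500.0, bandwidth_hz=2000.0, amplitude=0.9)

old_ducked = energy_above(6000.0, duck_db=-15.0, **OLD)
new_ducked = energy_above(6000.0, duck_db=-15.0, **NEW)
print(f"-15 dB keeps {db_to_power(-15.0):.1%} of the power")
print(f"new / old energy above 6 kHz: {new_ducked / old_ducked:.2f}")
```

The ratio printed here depends entirely on the placeholder values, so it illustrates the direction of the change (a higher peak leaves more energy above the cutoff) and is not a substitute for the on-device measurements.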

What we need from you: A/B the words where you hear the leak today
(Restablecer, Seleccionado, Español, suspendido…) on the same phones,
same screen reader, same settings. Report back on #100 — better, worse,
gone. That's the final gate.

Fast-navigation speech loss — a second real bug (#100 side report)

29-Bloo reported that scrolling quickly between elements sometimes
silences speech entirely, and that it didn't happen in earlier betas.
He also proposed a theory — that TGSB routes audio through the media
path instead of the accessibility path. We took it seriously and
compared our audio code against eSpeak NG's line by line: provably not
it (neither engine touches stream routing; the framework and screen
reader own it). But the report itself was pointing at something real:

Android delivers stop requests with no utterance identity. Our
service reset its stop flag at the top of synthesis, roughly 90 lines
(and potentially a full language-pack reload) before audio started. A
stop meant for the PREVIOUS utterance that landed in that window
killed the new utterance unheard. The flag is now armed at the last
possible moment, shrinking the race window from tens of milliseconds
to microseconds.

Worse, a stopped utterance was reported to the framework as an
ERROR. A stop is an interruption, not a failure, and some screen
readers go silent on error instead of retrying. We now report
completion, exactly as eSpeak does. This is very likely why eSpeak
never showed the symptom.
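
The two changes can be illustrated with a deterministic toy model in Python. It is a sketch under assumptions, not the service code: the real service is an Android component, and `ToyTtsService`, `arm_late`, `stop_is_error`, and the two event flags are hypothetical names that replay a fixed order of events instead of using real threads.

```python
from enum import Enum

class Result(Enum):
    DONE = "done"
    ERROR = "error"

class ToyTtsService:
    """Hypothetical model of a service whose stop requests carry no identity."""

    def __init__(self, arm_late: bool, stop_is_error: bool):
        self.arm_late = arm_late
        self.stop_is_error = stop_is_error
        self.stop_requested = False

    def on_stop(self) -> None:
        # A stop only sets a flag; it cannot say which utterance it targets.
        self.stop_requested = True

    def synthesize(self, text, stale_stop_in_setup=False, stop_during_audio=False):
        if not self.arm_late:
            self.stop_requested = False   # old behavior: reset at the top
        # ... slow setup would run here (settings, language pack) ...
        if stale_stop_in_setup:
            self.on_stop()                # stop aimed at the PREVIOUS utterance
        if self.arm_late:
            self.stop_requested = False   # new behavior: reset just before audio
        if stop_during_audio:
            self.on_stop()                # a genuine stop for this utterance
        if self.stop_requested:
            status = Result.ERROR if self.stop_is_error else Result.DONE
            return status, ""             # nothing was spoken
        return Result.DONE, text

old = ToyTtsService(arm_late=False, stop_is_error=True)
new = ToyTtsService(arm_late=True, stop_is_error=False)
print(old.synthesize("hola", stale_stop_in_setup=True))
print(new.synthesize("hola", stale_stop_in_setup=True))
print(new.synthesize("hola", stop_during_audio=True))
```

In this model the old configuration loses the utterance to the stale stop and reports an error, while the new one speaks it. A genuine stop during audio still interrupts, but is reported as completion.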

Why it "didn't happen in previous betas" even though this code hadn't
changed since v3.00: the race window kept widening underneath it —
phonemes.yaml grew, settings were re-applied on every utterance, and
the DSP got heavier. Slow devices finally saw it regularly. Which
leads to:

Per-utterance overhead cut (all Android users, biggest on slow devices)

The service was re-applying voice presets, advanced settings, phoneme
overrides, and dictionary overlays — and in some paths reloading the
entire language pack — on EVERY utterance. All of that is engine state
that persists; it now re-applies only when something actually changed.
Time-to-first-audio drops accordingly, and this keeps the engine
scalable as the phoneme inventory grows.
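
One way to picture "re-apply only when something changed" is a gate that remembers the last settings it pushed to the engine. The Python sketch below is an assumption about the general technique, not the service's actual code: `ToyEngine`, `SettingsGate`, and the settings keys are hypothetical, and the real state (presets, overrides, dictionary overlays) is richer than a flat dictionary.

```python
class ToyEngine:
    """Stand-in for the synthesis engine; counts expensive re-applies."""

    def __init__(self):
        self.apply_count = 0

    def apply_settings(self, settings: dict) -> None:
        self.apply_count += 1

class SettingsGate:
    """Pushes settings to the engine only when they differ from the last push."""

    def __init__(self, engine: ToyEngine):
        self.engine = engine
        self._last_applied = None

    def prepare(self, settings: dict) -> bool:
        # An order-independent snapshot, so equal settings always compare equal.
        snapshot = tuple(sorted(settings.items()))
        if snapshot == self._last_applied:
            return False                  # engine state persists; skip the work
        self.engine.apply_settings(settings)
        self._last_applied = snapshot
        return True

engine = ToyEngine()
gate = SettingsGate(engine)
voice = {"preset": "default", "rate": 1.0}   # hypothetical setting names
for _ in range(3):
    gate.prepare(voice)                      # applied once, skipped twice
gate.prepare({"preset": "default", "rate": 1.5})
print(engine.apply_count)
```

Comparing a snapshot costs far less than re-applying, which is why the saving grows with the size of the state being guarded.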

Platform parity audit

All four platform bridges (Android, iOS, Windows SAPI, NVDA) were
audited this cycle for engine-integration divergence: same 22050 Hz
default, same init ordering, same text preprocessing, zero
platform-specific DSP differences in the core. The engine is the same
engine everywhere — the per-platform differences that remain are in
the audio delivery layer, and they're now on the books:

iOS had its own (different) cancel race: a cancel arriving while an
utterance was being prepared got silently eaten (the utterance spoke
in full), and a stale cancel could wipe the NEXT utterance's audio.
Both are fixed in this beta via a request-generation counter, and iOS
gets the same per-utterance overhead cut as Android.

iOS output is currently capped at 22050 Hz while Windows can run at
44100 Hz. At 44100 Hz the engine is genuinely brighter (whiter
aspiration noise, active 6.5/7.5 kHz presence resonators). If Windows
sounds "crisper" or "more American" to you than iOS, this is why. The
cap is a deliberate Feb-2026 workaround for a rate-dependent DAC
click/pop on render-session open (44100 clicked, 22050 didn't;
thanks @kaveinthran). Nothing architectural requires it, and retesting
44100 Hz on current iOS is on the books.
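
A request-generation counter is commonly built along the following lines. This Python sketch shows the general pattern only; the iOS bridge may differ in detail, and `RequestGate` and its method names are hypothetical.

```python
import threading

class RequestGate:
    """Tags each request with a generation so a cancel has a clear scope."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generation = 0          # generation of the newest request
        self._cancelled_through = 0   # every generation <= this is cancelled

    def begin(self) -> int:
        """Start a new request and return its generation number."""
        with self._lock:
            self._generation += 1
            return self._generation

    def cancel(self) -> None:
        """Cancel every request that has begun so far, prepared or not."""
        with self._lock:
            self._cancelled_through = self._generation

    def is_cancelled(self, generation: int) -> bool:
        with self._lock:
            return generation <= self._cancelled_through

gate = RequestGate()
first = gate.begin()
gate.cancel()                 # arrives while `first` is still being prepared
second = gate.begin()
print(gate.is_cancelled(first), gate.is_cancelled(second))
```

Because the cancel is recorded against a generation instead of a shared boolean, it is neither lost during preparation nor carried over to a request that begins afterwards.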

Known issue — carried into this beta, root cause isolated

On-device validation caught one straggler: words whose IPA begins
with a secondary stress mark — including, ironically, "Español"
itself, and "escuchar" — do not yet pick up the new brighter /s/.
Their /s/ falls through to the base (English) /s/ because of a
text-to-IPA alignment quirk in the platform text path (the quirk
predates this beta and is unchanged: output measured byte-identical
to b4). casa, rosa,
mismo, suspendido, Restablecer, Seleccionado, espuela all get the
full fix — measured 3x brighter on-device. If you A/B "Español" and
hear less improvement than on other words, that's this, not a failed
fix. Frontend fix lands next beta and benefits all platforms.

Deferred (unchanged from b4)

endCb1/2/3 per-phoneme bandwidth evolution
Croatian #99 items; Spanish "siguiente" /ʝ/ split (#20)
/o-ɣ-o/ Kingston perceptual integration trap (diálogo)

Testers, this one was yours

@gregodejesus2 — "it didn't occur in previous versions" turned out to
be the single most important sentence in the whole investigation. Plus
the sample-rate sweep and the eSpeak A/B that ruled out half the
hypothesis space.

@29-Bloo — the recordings, the enhancement-toggle tests, and the
audio-path theory. The theory was wrong about the mechanism but right
that a second bug existed — your scroll report led us straight to a
race condition that's been latent since v3.00.

@rmcpantoja, @dgomez42 — the accumulated #95/#100 reports kept the
signal alive across months.

— Tamas + Claudeo (Fable 5)
