Matcher hardening: ports three normalize_name input-cleaning fixes from Stream-Mapparr (the matcher template) so noisy provider stream names normalize to the same form as clean channel-database names. Purely additive to fuzzy_matcher.py (122 insertions, 0 deletions) — no existing matching logic changed. See docs/MATCHER-NORMALIZATION-PORT.md.
Matching
- Stylized-Unicode decoration stripping (bug-048) — Drops whole tokens that are pure stylized decoration (superscript / small-capital tier markers, bullets) before the ASCII tag pipeline, detected by Unicode character name rather than code-point range. A superscript "RAW" suffix no longer blocks a match to
WeatherNation. Real ASCII tier words (Gold/VIP) and non-Latin scripts (Arabic/Cyrillic/CJK) are preserved. - Emoji-as-letter normalization (bug-051) — Maps an emoji used as a letter inside a word (
SP⚽RTS→SPoRTS, the beIN family) to its letter when flanked by ASCII letters, and strips emoji used purely as decoration plus zero-width selectors.beIN SP⚽RTSnow matchesbeIN Sports. - Numeric resolution markers (bug-055) — Strips
720p/1080p/2160p/3840P-style markers (a 3–4 digit run glued to p/i) that the keyword quality list misses, while keeping bare numbers (Channel 4,Studio 1080), 5-digit runs, and spaced standalone roman numerals (Volume 100 I) intact. Gated by the same tag-handling flag as the other quality tags.
Beneficial side effect: the NFKD canonicalization in the stylized-strip step unifies accented and ASCII spellings of the same channel, so UniMás/UniMas and TeleFórmula/TeleFormula now match where they previously did not. Verified: 0 changes to any ASCII channel name across all 42,246 database names; no different-channel false-merges.
Data
- Deduplicated channel databases — removed 651 fully-identical rows across 7 country files (UK 168, MX 206, DE 136, CA 62, BR 43, FR 19, ES 17); all
*_channels.jsonnormalized to LF. - Norwegian channel database (
NO_channels.json) — 94 channels; registeredNO → norwayinCOUNTRY_DIR_MAPso the per-channel logo action resolves them. Coverage is now 12 countries.
Tests
tests/test_normalization_port.py(48 cases) locks all three fixes at the helper, regex, and full-pipeline levels, with editor-proof escaped Unicode constants. Newfuzzy_moduleconftest fixture exposes the module-level helpers. A corpus no-regression test asserts the fixes never alter any of the ~41.5K ASCII channel names (CI-enforced, baseline-free). Full suite: 149 passing.