This pull request improves the robustness of the AudibleSearchService by enhancing its ability to detect and filter out irrelevant page content, such as locale/redirect messages and generic site labels, which can interfere with accurate title extraction. The changes also improve the reliability of title parsing by prioritizing metadata fields.
Noise and redirect detection improvements:
- Added a new `RedirectNoisePhrases` array to detect common locale/geo redirect messages in multiple languages, helping to filter out non-content pages.
- Updated the `IsHeaderNoise` method to treat any title containing known redirect phrases as noise, preventing false positives in title extraction.
- Enhanced the HTML parsing logic to skip a product page if the fetched HTML contains redirect or locale noise, and to log a warning with a snippet for debugging.
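
The noise checks described above can be sketched roughly as follows. This is an illustrative Python sketch, not the service's actual code: the phrase list, the function names (`contains_redirect_noise`, `is_header_noise`, `should_skip_page`), and the 120-character snippet length are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("audible_search_sketch")

# Hypothetical entries; the real RedirectNoisePhrases contents are not shown in this PR text.
REDIRECT_NOISE_PHRASES = (
    "you are being redirected",
    "choose your country",
    "sie werden weitergeleitet",
)


def contains_redirect_noise(text: str) -> bool:
    """Return True if the text contains any known redirect/locale phrase."""
    lowered = text.casefold()
    return any(phrase.casefold() in lowered for phrase in REDIRECT_NOISE_PHRASES)


def is_header_noise(title: str) -> bool:
    """Treat empty titles and titles containing redirect phrases as noise."""
    stripped = title.strip()
    return not stripped or contains_redirect_noise(stripped)


def should_skip_page(html: str, snippet_length: int = 120) -> bool:
    """Skip a fetched page when its HTML looks like a redirect/locale notice."""
    if contains_redirect_noise(html):
        # Collapse whitespace so the logged snippet stays on one line.
        snippet = " ".join(html.split())[:snippet_length]
        logger.warning("Skipping page with redirect/locale noise: %s", snippet)
        return True
    return False
```

Matching is done on case-folded text so that differences in capitalization between locales do not cause a phrase to be missed.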
Title extraction reliability:
- Improved title extraction by preferring the `og:title` meta tag and the `<title>` element before falling back to `<h1>`, making parsing more reliable across different locales and page templates. Also added logic to strip site suffixes from `<title>`.
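
The fallback order described above (`og:title`, then `<title>` with the site suffix stripped, then `<h1>`) can be illustrated with a minimal Python sketch built on the standard-library `html.parser`. The suffix list and all names here (`SITE_SUFFIXES`, `strip_site_suffix`, `extract_title`) are hypothetical and are not taken from the PR's code.

```python
from html.parser import HTMLParser
from typing import Optional

# Hypothetical suffixes; the real list used by the service is not shown.
SITE_SUFFIXES = (" | Audible.com", " - Audible")


class _TitleCollector(HTMLParser):
    """Collects the first og:title, <title>, and <h1> found in a document."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.og_title: Optional[str] = None
        self.title_text: Optional[str] = None
        self.h1_text: Optional[str] = None
        self._current: Optional[str] = None
        self._buffer: list = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:title":
            if self.og_title is None:
                self.og_title = (attr.get("content") or "").strip() or None
        elif tag in ("title", "h1") and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Normalize internal whitespace; empty text becomes None.
            text = " ".join("".join(self._buffer).split()) or None
            if tag == "title" and self.title_text is None:
                self.title_text = text
            elif tag == "h1" and self.h1_text is None:
                self.h1_text = text
            self._current = None


def strip_site_suffix(title: str) -> str:
    """Remove a trailing site label such as ' | Audible.com' if present."""
    for suffix in SITE_SUFFIXES:
        if title.endswith(suffix):
            return title[: -len(suffix)].rstrip()
    return title


def extract_title(html: str) -> Optional[str]:
    """Prefer og:title, then <title> without its site suffix, then <h1>."""
    collector = _TitleCollector()
    collector.feed(html)
    collector.close()
    if collector.og_title:
        return collector.og_title
    if collector.title_text:
        stripped = strip_site_suffix(collector.title_text)
        if stripped:
            return stripped
    return collector.h1_text
```

Preferring metadata first means a page whose `<h1>` holds a generic site label or banner can still yield the correct title, as long as the `og:title` or `<title>` value is present.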
Automated canary build