- Readability drill-down: when `<main>` or `<article>` wraps >90% of body, the extractor now searches inside for narrower content elements (`.main-page-content`, `.article-content`, `.entry-content`, etc.) instead of discarding. Fixes MDN pages returning 35 chars and StackOverflow returning only the question
- Base64 image stripping: `data:` URI images removed in both HTML cleaning (lol_html) and markdown post-processing (regex safety net). Eliminates massive base64 blobs from Reddit and similar sites
- Select/dropdown removal: `<select>` elements removed in `onlyMainContent` mode; dropdown/city-selector/location-selector noise patterns added. Fixes Hürriyet city dropdown leaking into content
- Extended scored selectors: added `.main-page-content`, `.js-post-body`, `.s-prose`, `#question`, `.page-content`, `#page-content`, `[role="article"]` for better MDN, StackOverflow, and generic site coverage
- Smarter fallback chain: when primary extraction produces too-short markdown, both fallbacks are tried and the longer result is picked
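
Two of the items above can be sketched in plain Rust to make the behavior concrete. This is an illustration only, not the project's code: the function names, the `min_chars` parameter, and the length measure are all hypothetical, and the markdown pass here uses a small hand-written scanner where the changelog mentions a regex. Whether the short primary result also competes with the fallbacks is not stated, so this sketch compares only the two fallbacks.

```rust
/// Hypothetical length measure: characters left after trimming whitespace.
fn content_len(markdown: &str) -> usize {
    markdown.trim().chars().count()
}

/// Run `primary`; if its output is shorter than `min_chars`, run both
/// fallbacks and return whichever produced the longer markdown.
/// All names and the threshold are assumptions made for this sketch.
fn extract_with_fallbacks(
    html: &str,
    primary: impl Fn(&str) -> String,
    fallback_a: impl Fn(&str) -> String,
    fallback_b: impl Fn(&str) -> String,
    min_chars: usize,
) -> String {
    let first = primary(html);
    if content_len(&first) >= min_chars {
        return first;
    }
    // Primary was too short: try both fallbacks and keep the longer one.
    let a = fallback_a(html);
    let b = fallback_b(html);
    if content_len(&a) >= content_len(&b) { a } else { b }
}

/// Safety-net pass over markdown: drop `![alt](data:...)` images and keep
/// every other image or link untouched. A simplified scanner, not a regex.
fn strip_data_uri_images(markdown: &str) -> String {
    let mut out = String::with_capacity(markdown.len());
    let mut rest = markdown;
    while let Some(start) = rest.find("![") {
        let candidate = &rest[start..];
        if let Some(mid) = candidate.find("](") {
            let alt = &candidate[2..mid];
            let target = &candidate[mid + 2..];
            // Only treat this as one image if the alt text is on a single
            // line and does not open another bracket.
            let alt_ok = !alt.contains(|c: char| c == '[' || c == '\n');
            if alt_ok && target.starts_with("data:") {
                if let Some(end) = target.find(')') {
                    out.push_str(&rest[..start]);
                    rest = &target[end + 1..];
                    continue;
                }
            }
        }
        // Not a data-URI image: copy the "![" marker and keep scanning.
        out.push_str(&rest[..start + 2]);
        rest = &rest[start + 2..];
    }
    out.push_str(rest);
    out
}
```

Passing the extractors as closures keeps the fallbacks lazy: they only run when the primary result falls below the threshold. The scanner deliberately leaves unterminated or multi-line constructs alone, since a safety net should not delete text it cannot parse with confidence.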