github us/crw v0.0.12

3 months ago
  • Readability drill-down — when <main> or <article> wraps >90% of body, the extractor now searches inside for narrower content elements (.main-page-content, .article-content, .entry-content, etc.) instead of discarding. Fixes MDN pages returning 35 chars and StackOverflow returning only the question
  • Base64 image stripping — data: URI images are removed in both HTML cleaning (lol_html) and markdown post-processing (regex safety net). Eliminates massive base64 blobs from Reddit and similar sites
  • Select/dropdown removal — <select> elements are removed in onlyMainContent mode; dropdown/city-selector/location-selector noise patterns added. Fixes Hürriyet city dropdown leaking into content
  • Extended scored selectors — added .main-page-content, .js-post-body, .s-prose, #question, .page-content, #page-content, [role="article"] for better MDN, StackOverflow, and generic site coverage
  • Smarter fallback chain — when primary extraction produces too-short markdown, both fallbacks are tried and the longer result is picked
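
The fallback behaviour in the last item can be sketched roughly as below. This is a minimal illustration, not crw's actual code: the function name `pick_markdown`, the helper `visible_len`, and the `MIN_MARKDOWN_CHARS` threshold are all hypothetical, and the sketch assumes the comparison is made between the two fallback results only.

```rust
/// Hypothetical threshold: primary output shorter than this counts as "too short".
/// The real value used by crw is not stated in the release notes.
const MIN_MARKDOWN_CHARS: usize = 100;

/// Length of the markdown ignoring surrounding whitespace, counted in chars.
fn visible_len(markdown: &str) -> usize {
    markdown.trim().chars().count()
}

/// Returns the primary result if it is long enough. Otherwise runs BOTH
/// fallbacks and returns whichever produced the longer markdown
/// (the first fallback wins a tie).
fn pick_markdown<F1, F2>(primary: String, fallback_a: F1, fallback_b: F2) -> String
where
    F1: FnOnce() -> String,
    F2: FnOnce() -> String,
{
    if visible_len(&primary) >= MIN_MARKDOWN_CHARS {
        return primary;
    }
    // Assumption: both fallbacks are always evaluated, rather than
    // stopping at the first one that returns something.
    let a = fallback_a();
    let b = fallback_b();
    if visible_len(&a) >= visible_len(&b) {
        a
    } else {
        b
    }
}
```

Passing the fallbacks as closures means they only run when the primary extraction falls short, so pages that extract cleanly pay no extra cost.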
