Improvements
- Async extraction support, e.g. for X URLs
- Generic footnote detection fallback and backref cleanup (#138, #120)
- Substack app support
- Better Wikidot support
- Better heading/code/pre preservation
- Shiki language detection for code blocks
- Improved scoring around code blocks and bios
- Fixed nested list indentation
Fixes
- Fix HTML element with `id="menu"` breaking content extraction (#106)
- Fix page content not being able to start with a divider (#114)
- Fix invalid CSS selector `span.leading-tight,, img` (#128)
- Fix `[href*="/category"]` exact selector removing legitimate page content (#131)
- Fix `.hero` exact selector removing primary content on documentation landing pages (#132)
- Fix content of `<time>` element being removed (#136)
- Fix DOMParser is not defined when running via defuddle/node (#137)
- Fix content sanitization bypass via schema.org text fallback (#139)
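
For context on the invalid selector fix (#128): a doubled comma leaves an empty entry in a selector list, which makes the whole list invalid. The sketch below shows one way to guard against that when joining selectors. It is illustrative only; `joinSelectors` is a hypothetical helper and not necessarily how the library fixed the issue.

```typescript
// Hypothetical helper (an assumption, not the library's code):
// build a selector list from parts, skipping blank entries so the
// result never contains an empty selector such as "a,, b".
function joinSelectors(parts: string[]): string {
  return parts
    .map((part) => part.trim())          // normalize surrounding whitespace
    .filter((part) => part.length > 0)   // drop empty entries
    .join(", ");
}

// A list with a stray empty entry, similar in shape to the one in #128.
const selector = joinSelectors(["span.leading-tight", "", "img"]);
console.log(selector);
```

Taking the parts as an array avoids splitting a raw string on commas, which would mishandle commas inside attribute values or functional pseudo-classes.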
Security
- Fix XSS via attribute injection in image handling
- Sanitize HTML to prevent unsafe elements in schema text fallback (#139)
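
As a general illustration of the attribute injection class of bug, here is a minimal sketch of escaping untrusted values before placing them in image attributes. The helpers `escapeAttr` and `buildImg` are hypothetical and are not the library's actual fix, which may work differently (for example by setting attributes through the DOM rather than building strings).

```typescript
// Hypothetical helper (an assumption, not the library's code):
// escape characters that could terminate or break out of a quoted
// HTML attribute value. "&" is replaced first so entities produced
// by the later replacements are not escaped a second time.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: build an <img> tag from untrusted src and alt text.
function buildImg(src: string, alt: string): string {
  return `<img src="${escapeAttr(src)}" alt="${escapeAttr(alt)}">`;
}

// Without escaping, the quote in this alt text would close the attribute
// and let the rest be parsed as a new onerror attribute.
console.log(buildImg("a.png", 'x" onerror="alert(1)'));
```

Escaping covers the quoting problem only; URL schemes such as `javascript:` in `src` would need separate validation.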
Other
- New website (#133), playground updates, README updates