jsoup 1.22.2 is out now, with fixes and refinements across the library. It makes editing the DOM during traversal more predictable, refreshes the default HTML tag definitions with newer elements and better text boundaries, and improves reliability in parsing and HTTP transport. The release also fixes a number of edge cases in cleaning, stream parsing, XML doctype handling, and Android packaging.
jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Expanded and clarified
NodeTraversorsupport for in-place DOM rewrites duringNodeVisitor.head(). Current-node edits such asremove,replace, andunwrapnow recover more predictably, while traversal stays within the original root subtree. This makes single-pass tree cleanup and normalization visitors easier to write, for example when unwrapping presentational elements or replacing text nodes as you walk the DOM. #2472 - Documentation: clarified that a configured
Cleanermay be reused across concurrent threads, and that sharedSafelistinstances should not be mutated while in use. #2473 - Updated the default HTML
TagSetfor current HTML elements: addeddialog,search,picture, andslot; madeins,del,button,audio,video, andcanvasinline by default (Tag#isInline(), aligned to phrasing content in the spec); and added readableElement.text()boundaries for controls and embedded objects via the newTag.TextBoundaryoption. This improves pretty-printing and keeps normalized text from running adjacent words together. #2493
Bug Fixes
- Android (R8/ProGuard): added a rule to ignore the optional
re2jdependency when not present. #2459 - Fixed a
NodeTraversorregression in 1.21.2 where removing or replacing the current node duringhead()could revisit the replacement node and loop indefinitely. The traversal docs now also clarify which inserted nodes are visited in the current pass. #2472 - Parsing during charset sniffing no longer fails if an advisory
available()call throwsIOException, as seen on JDK 8HttpURLConnection. #2474 Cleanerno longer makes relative URL attributes in the input document absolute when cleaning or validating aDocument. URL normalization now applies only to the cleaned output, andSafelist.isSafeAttribute()is side effect free. #2475Cleanerno longer duplicates enforced attributes when the inputDocumentpreserves attribute case. A case-variant source attribute is now replaced by the enforced attribute in the cleaned output. #2476- If a per-request SOCKS proxy is configured, jsoup now avoids using the JDK
HttpClient, because the JDK would silently ignore that proxy and attempt to connect directly. Those requests now fall back to the legacyHttpURLConnectiontransport instead, which does support SOCKS. #2468 Connection.Response.streamParser()andDataUtil.streamParser(Path, ...)could fail on small inputs without a declared charset, if the initial 5 KB charset sniff fully consumed the input and closed it before the stream parse began. #2483- In XML mode, doctypes with an internal subset, such as
<!DOCTYPE root [<!ENTITY name "value">]>, now round-trip correctly. The subset is preserved as raw text only; entities are not expanded and external DTDs are not loaded. #2486
Build Changes
- Migrated the integration test server from Jetty to Netty, which actively maintains support for our minimum JDK target (8). #2491
My sincere thanks to everyone who contributed to this release!
If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.