github jhy/jsoup jsoup-1.22.2
jsoup Java HTML Parser release 1.22.2


jsoup 1.22.2 is out now, with fixes and refinements across the library. It makes editing the DOM during traversal more predictable, refreshes the default HTML tag definitions with newer elements and better text boundaries, and improves reliability in parsing and HTTP transport. The release also fixes a number of edge cases in cleaning, stream parsing, XML doctype handling, and Android packaging.

jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
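
As a rough illustration of the extraction API described above, the following sketch parses a small HTML string and collects absolute link URLs with a CSS selector. The class name `ExtractLinks`, the helper `extractLinks`, and the example.com base URI are made up for this example; it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ExtractLinks {
    // Hypothetical helper: returns the absolute URL of every <a href> in the HTML.
    public static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs via absUrl().
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p>See <a href='/docs'>the docs</a> and "
                + "<a href='https://jsoup.org/'>jsoup</a>.</p>";
        for (String url : extractLinks(html, "https://example.com/")) {
            System.out.println(url);
        }
    }
}
```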

Download jsoup now.

Improvements

  • Expanded and clarified NodeTraversor support for in-place DOM rewrites during NodeVisitor.head(). Current-node edits such as remove, replace, and unwrap now recover more predictably, while traversal stays within the original root subtree. This makes single-pass tree cleanup and normalization visitors easier to write, for example when unwrapping presentational elements or replacing text nodes as you walk the DOM. #2472
  • Documentation: clarified that a configured Cleaner may be reused across concurrent threads, and that shared Safelist instances should not be mutated while in use. #2473
  • Updated the default HTML TagSet for current HTML elements: added dialog, search, picture, and slot; made ins, del, button, audio, video, and canvas inline by default (Tag#isInline(), aligned to phrasing content in the spec); and added readable Element.text() boundaries for controls and embedded objects via the new Tag.TextBoundary option. This improves pretty-printing and keeps normalized text from running adjacent words together. #2493
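
To give a feel for the single-pass cleanup visitors mentioned in the NodeTraversor note, here is a minimal sketch that unwraps presentational elements from inside NodeVisitor.head(). The class name, the helper `unwrapPresentational`, and the choice of `font` and `center` as the tags to unwrap are assumptions for illustration only. Which of the unwrapped element's children are visited in the same pass is governed by the traversal docs, so this sketch does not rely on it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.NodeTraversor;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UnwrapPresentational {
    // Illustrative set of tags to unwrap; adjust to your own cleanup rules.
    private static final Set<String> PRESENTATIONAL =
            new HashSet<>(Arrays.asList("font", "center"));

    // Hypothetical helper: parses the HTML and unwraps presentational elements in one walk.
    public static Document unwrapPresentational(String html) {
        Document doc = Jsoup.parse(html);
        // NodeVisitor used as a head-only lambda; tail() keeps its default no-op.
        NodeTraversor.traverse((node, depth) -> {
            if (node instanceof Element
                    && PRESENTATIONAL.contains(((Element) node).normalName())) {
                // Replaces the element with its children, editing the current node in place.
                node.unwrap();
            }
        }, doc.body());
        return doc;
    }

    public static void main(String[] args) {
        Document doc = unwrapPresentational(
                "<div><font color=red>Hello <b>world</b></font> again</div>");
        System.out.println(doc.body().html());
    }
}
```

The same visitor shape can be used for replacing text nodes or removing elements. Doing the edit in head() rather than collecting nodes for a second pass is the pattern this release aims to make more predictable.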

Bug Fixes

  • Android (R8/ProGuard): added a rule to ignore the optional re2j dependency when not present. #2459
  • Fixed a NodeTraversor regression in 1.21.2 where removing or replacing the current node during head() could revisit the replacement node and loop indefinitely. The traversal docs now also clarify which inserted nodes are visited in the current pass. #2472
  • Parsing during charset sniffing no longer fails if an advisory available() call throws IOException, as seen on JDK 8 HttpURLConnection. #2474
  • Cleaner no longer makes relative URL attributes in the input document absolute when cleaning or validating a Document. URL normalization now applies only to the cleaned output, and Safelist.isSafeAttribute() is side effect free. #2475
  • Cleaner no longer duplicates enforced attributes when the input Document preserves attribute case. A case-variant source attribute is now replaced by the enforced attribute in the cleaned output. #2476
  • If a per-request SOCKS proxy is configured, jsoup now avoids using the JDK HttpClient, because the JDK would silently ignore that proxy and attempt to connect directly. Those requests now fall back to the legacy HttpURLConnection transport instead, which does support SOCKS. #2468
  • Fixed Connection.Response.streamParser() and DataUtil.streamParser(Path, ...) failing on small inputs without a declared charset. Previously, the initial 5 KB charset sniff could fully consume the input and close it before the stream parse began. #2483
  • In XML mode, doctypes with an internal subset, such as <!DOCTYPE root [<!ENTITY name "value">]>, now round-trip correctly. The subset is preserved as raw text only; entities are not expanded and external DTDs are not loaded. #2486
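
The Cleaner notes above (#2473, #2475) can be illustrated with a small sketch: one configured Cleaner is built once and reused, and the cleaned output carries the normalized URL. The class name `CleanerSketch`, the helper `clean`, and the example.com base URI are assumptions for this example. The last line of main() prints the href from the input document; per the #2475 note it should remain relative on 1.22.2, though older versions may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

public class CleanerSketch {
    // Built once and shared; the Safelist is not mutated after construction.
    private static final Cleaner CLEANER = new Cleaner(Safelist.basic());

    // Hypothetical helper: returns a new, cleaned Document.
    public static Document clean(Document dirty) {
        return CLEANER.clean(dirty);
    }

    public static void main(String[] args) {
        Document dirty = Jsoup.parse(
                "<p><a href='/page'>link</a><script>alert(1)</script></p>",
                "https://example.com/");
        Document clean = clean(dirty);

        // Cleaned output: script dropped, href made absolute, rel enforced by Safelist.basic().
        System.out.println(clean.body().html());
        // Input document: the href attribute as it stands after cleaning.
        System.out.println(dirty.select("a").attr("href"));
    }
}
```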

Build Changes

  • Migrated the integration test server from Jetty to Netty, which actively maintains support for our minimum JDK target (8). #2491

My sincere thanks to everyone who contributed to this release!
If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.

You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.
