jsoup 1.21.2 is out now, adding support for custom SSLContext in HTTP/2 connections and improving consistency in how user data is handled in attributes. It also brings performance gains in DOM manipulation and fragment parsing, and fixes several edge cases in stream parsing, traversal, cloning, and concurrent reads.
jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Changes
- Deprecated internal (yet visible) methods Normalizer#normalize(String, bool) and Attribute#shouldCollapseAttribute(Document.OutputSettings). These will be removed in a future version.
- Deprecated Connection#sslSocketFactory(SSLSocketFactory) in favor of the new Connection#sslContext(SSLContext). Using sslSocketFactory will force the use of the legacy HttpURLConnection implementation, which does not support HTTP/2. #2370
Improvements
- When pretty-printing, if there are consecutive text nodes (via DOM manipulation), the non-significant whitespace between them will be collapsed. #2349
- Updated Connection.Response#statusMessage() to return a simple loggable string message (e.g. "OK") when using the HttpClient implementation, which doesn't otherwise return any server-set status message. #2356
- Attributes#size() and Attributes#isEmpty() now exclude any internal attributes (such as user data) from their count. This aligns with the attributes' serialized output and iterator. #2369
- Added Connection#sslContext(SSLContext) to provide a custom SSL (TLS) context to requests, supporting both the HttpClient and the legacy HttpURLConnection implementations. #2370
- Performance optimizations for DOM manipulation methods, including when repeatedly removing an element's first child (element.child(0).remove()), and when using Parser#parseBodyFragment() to parse a large number of direct children. #2373
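
To make the new Connection#sslContext(SSLContext) option more concrete, here is a minimal sketch of building a context with the standard JDK APIs. The class name CustomSslContext and its build method are hypothetical helpers for illustration and are not part of jsoup. Passing a null trust store is assumed to be acceptable for your use case, in which case the JVM's default trust store is used.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper (not part of jsoup): builds an SSLContext from a trust store.
public class CustomSslContext {

    /**
     * Returns a TLS context that trusts the certificates in the given store.
     * A null store falls back to the JVM's default trust store.
     */
    public static SSLContext build(KeyStore trustStore) {
        try {
            TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(trustStore);

            SSLContext context = SSLContext.getInstance("TLS");
            // null key managers and null SecureRandom select the platform defaults
            context.init(null, tmf.getTrustManagers(), null);
            return context;
        } catch (GeneralSecurityException e) {
            // Wrapped so callers do not need to handle checked exceptions
            throw new IllegalStateException("Could not build SSLContext", e);
        }
    }
}
```

The resulting context could then be supplied to a request along the lines of Jsoup.connect(url).sslContext(CustomSslContext.build(null)).get(), where url is a placeholder for your own target.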
Bug Fixes
- When parsing from an InputStream and a multibyte character happened to straddle a buffer boundary, the stream would not be completely read. #2353
- In NodeTraversor, if a last child element was removed during the head() call, the parent would be visited twice. #2355
- Cloning an Element that has an Attributes object would add an empty internal user-data attribute to that clone, which would cause unexpected results for Attributes#size() and Attributes#isEmpty(). #2356
- In a multithreaded application where multiple threads are calling Element#children() on the same element concurrently, a race condition could happen when the method was generating the internal child element cache (a filtered view of its child nodes). Since concurrent reads of DOM objects should be threadsafe without external synchronization, this method has been updated to execute atomically. #2366
- Malformed HTML could throw an IndexOutOfBoundsException during the adoption agency algorithm. #2377
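
The Element#children() fix concerns a lazily built cache that several reader threads may trigger at once. The following is a generic sketch of one way to make such a lazy cache safe to read concurrently. It is not jsoup's actual implementation: the CachedChildren, Node, and Elem types are hypothetical stand-ins, and double-checked locking on a volatile field is just one assumed technique.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a DOM element with a lazily cached child-element view.
public class CachedChildren {
    public static class Node {}                 // any child node (e.g. text)
    public static class Elem extends Node {}    // an element child

    private final List<Node> childNodes;
    private volatile List<Elem> elementCache;   // null until first requested

    public CachedChildren(List<Node> childNodes) {
        this.childNodes = new ArrayList<>(childNodes);
    }

    /** Returns the filtered view of element children, building it at most once. */
    public List<Elem> children() {
        List<Elem> cached = elementCache;
        if (cached == null) {
            synchronized (this) {
                cached = elementCache;          // re-check under the lock
                if (cached == null) {
                    List<Elem> built = new ArrayList<>();
                    for (Node n : childNodes) {
                        if (n instanceof Elem) built.add((Elem) n);
                    }
                    cached = Collections.unmodifiableList(built);
                    elementCache = cached;      // publish only the fully built list
                }
            }
        }
        return cached;
    }
}
```

The point of the design is that no thread can observe a partially populated cache: the list is filled in a local variable and only assigned to the shared field once complete.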