jsoup Java HTML Parser release 1.15.4
jsoup 1.15.4 is out now, with a number of improvements (particularly to HTML pretty-printing) and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Added the ability to escape CSS selectors (tags, IDs, classes) to match elements that don't follow regular CSS syntax. For example, to match by classname <p class="one.two">, use document.select("p.one\\.two"); #838
- When pretty-printing, wrap text that follows a <br> tag. #1858
- When pretty-printing, normalize newlines that follow self-closing tags in custom tags. #1852
- When pretty-printing, collapse non-significant whitespace between a block and an inline tag. #1802
- In Element.forEach() and Node.forEachNode(), use java.util.function.Consumer instead of the previous Android compatibility shim org.jsoup.helper.Consumer. Subsequently, the latter has been deprecated. #1870
- Added a new method Document.forms(), to conveniently retrieve a List<FormElement> containing the <form> elements in a document.
- Added a new method Document.expectForm(), to find the first matching FormElement, or blow up trying.
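As a rough sketch of how the API additions above might be used together (assuming jsoup 1.15.4 or later on the classpath): the class name, method names, and sample markup below are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.FormElement;

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ImprovementsSketch {
    // Hypothetical sample markup: one class name contains a dot, plus a single form.
    static final String HTML =
        "<div><p class=\"one.two\">Dotted</p><p class=\"one\">Plain</p></div>"
        + "<form id=\"login\" action=\"/login\"><input name=\"user\"></form>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // The backslash escapes the dot, so it is read as part of the class name
    // rather than as the start of a second class selector.
    static String selectEscaped() {
        return parse().select("p.one\\.two").text();
    }

    // Element.forEach() with a java.util.function.Consumer. The variable is
    // typed explicitly so the non-deprecated overload is the one selected.
    static List<String> tagNames() {
        List<String> names = new ArrayList<>();
        Consumer<Element> collect = el -> names.add(el.tagName());
        parse().body().forEach(collect);
        return names;
    }

    // Document.forms() returns every <form> in the document as a FormElement.
    static int formCount() {
        List<FormElement> forms = parse().forms();
        return forms.size();
    }

    // Document.expectForm() throws if nothing matches; this helper returns
    // null in that case so the caller can see both outcomes.
    static String expectFormId(String cssQuery) {
        try {
            FormElement form = parse().expectForm(cssQuery);
            return form.id();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(selectEscaped());
        System.out.println(tagNames());
        System.out.println(formCount());
        System.out.println(expectFormId("#login"));
        System.out.println(expectFormId("#missing"));
    }
}
```

Catching IllegalArgumentException here is only to make the failure case visible; in real code, letting expectForm() throw is usually the point of choosing it over a nullable lookup.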
Bug Fixes
- URLs containing certain characters were not escaped correctly, and would throw a MalformedURLException when fetched. #1873
- Element.cssSelector() would create invalid selectors for elements where the tag name, ID, or classnames needed to be escaped (e.g. if a class name contained a : or .). #1742
- Element.text() should have a space between a block and an inline element. #1877
- Form data on a previous request was copied to a new request in newRequest(), resulting in an accumulation of form data when executing multi-step form submissions, or data sent to later requests incorrectly. Now, newRequest() only copies session related settings (cookies, proxy settings, user-agent, etc.) but not the request data nor the body. #1778
- Fixed an issue in Safelist.removeAttributes() which could throw a ConcurrentModificationException when using the :all pseudo-attribute.
- Given extremely deeply nested HTML, a number of methods in Element could throw a StackOverflowError due to excessive recursion. Namely: #data(), #hasText(), #parents(), and #wrap(html). #1864
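A few of the fixes above can be checked with a small sketch like the following. It assumes jsoup 1.15.4 or later; the class name, method names, sample markup, URL, and user-agent string are all hypothetical. No network request is made, since the connection is only configured and never executed.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class BugFixSketch {
    // cssSelector() should produce a selector that finds the same element
    // again, even when the class name contains a character needing escaping.
    static boolean cssSelectorRoundTrips() {
        Document doc = Jsoup.parse("<div><p class=\"one.two\">Dotted</p></div>");
        Element p = doc.selectFirst("p");
        String selector = p.cssSelector();
        return doc.selectFirst(selector) == p;
    }

    // Removing an attribute via the ":all" pseudo-tag after it was allowed
    // on individual tags; this is the call that could previously throw.
    static String cleanWithoutTitle() {
        Safelist safelist = Safelist.basic()
            .addAttributes("p", "title")
            .addAttributes("a", "title")
            .removeAttributes(":all", "title");
        return Jsoup.clean("<p title=\"tip\">Hi</p>", safelist);
    }

    // newRequest() is expected to keep session-style settings such as the
    // user-agent, but not the form data set on the earlier request.
    static boolean newRequestDropsFormData() {
        Connection first = Jsoup.connect("https://example.com/login")
            .userAgent("sketch-agent")
            .data("step", "1");
        Connection second = first.newRequest();
        return second.request().data().isEmpty()
            && "sketch-agent".equals(second.request().header("User-Agent"));
    }

    public static void main(String[] args) {
        System.out.println(cssSelectorRoundTrips());
        System.out.println(cleanWithoutTitle());
        System.out.println(newRequestDropsFormData());
    }
}
```

These checks only describe the behavior stated in the notes above; they are not taken from jsoup's own test suite.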
Changes
- Deprecated the unused Document.normalise() method. Normalization occurs during the HTML tree construction, and no longer as a distinct phase.
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.