Added
- You can now call the new `useHeadlessBrowser` method on the `HttpLoader` class to load pages with a headless Chrome browser. This is enough to get the HTML after JavaScript has been executed in the browser. For more sophisticated tasks, it is better to create a separate Loader and/or Steps.
- With the `maxOutputs()` method of the abstract `Step` class you can now limit how many outputs a step yields at most. This is helpful, for example, during development, when you want to run the crawler with only a small subset of the data/requests it will have to process once you remove the limits. When a step has reached its limit, it no longer even calls the `invoke()` method until the step is reset after a run.
- With the new `outputHook()` method of the abstract `Crawler` class you can set a closure that receives all the outputs from all the steps. This should be used only for debugging.
- The `extract()` method of the `Html` and `Xml` steps (children of `Dom`) now also works with a single selector instead of an array with a mapping. Sometimes you just want a simple string output, e.g. for a next step, instead of an array with mapped extracted data.
- In addition to `uniqueOutputs()` there is now also `uniqueInputs()`. It works exactly the same as `uniqueOutputs()`, but filters duplicate input values instead. Optionally, it can also filter by a key when the expected input is an array or an object.
- To make it possible to get absolute links when using the `extract()` method of Dom steps, the abstract `DomQuery` class now has a `toAbsoluteUrl()` method. The Dom step automatically provides the `DomQuery` instance with the base URL, provided that the input was an instance of the `RespondedRequest` class, and resolves the selected value against that base URL.
Changed
- Remove some less important log messages.
- Improve the behavior of the group step's `combineToSingleOutput()`. When steps yield multiple outputs, don't combine all yielded outputs into one. Instead, combine the first output from the first step with the first output from the second step, and so on.
- When results are not explicitly composed, but the outputs of the last step are arrays with string keys, those keys are set on the Result object instead of setting a key `unnamed` with the whole array as value.
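
The position-wise combining described for `combineToSingleOutput()` can be sketched without the library as follows. `combineByPosition()` is a hypothetical function name, and the sketch assumes that every output is an array with string keys; how the library handles steps that yield different numbers of outputs is not stated here and is not modeled.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not the library's implementation: merges the outputs
 * of several steps position by position.
 *
 * @param array<int, array<int, array<string, mixed>>> $outputsPerStep
 * @return array<int, array<string, mixed>>
 */
function combineByPosition(array $outputsPerStep): array
{
    $combined = [];

    foreach ($outputsPerStep as $stepOutputs) {
        foreach (array_values($stepOutputs) as $position => $output) {
            // The n-th output of each step ends up in the n-th combined output.
            $combined[$position] = array_merge($combined[$position] ?? [], $output);
        }
    }

    return $combined;
}

$combined = combineByPosition([
    [['title' => 'First'], ['title' => 'Second']],   // outputs of the first step
    [['url' => '/first'], ['url' => '/second']],     // outputs of the second step
]);
// $combined is:
// [
//     ['title' => 'First', 'url' => '/first'],
//     ['title' => 'Second', 'url' => '/second'],
// ]
```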
Fixed
- The static methods `Html::getLink()` and `Html::getLinks()` now also work without an argument, like the `GetLink` and `GetLinks` classes.
- When a `DomQuery` (CSS selector or XPath query) doesn't match anything, its `apply()` method now returns `null` (instead of an empty string). When the `Html(/Xml)::extract()` method is used with a single selector/query that doesn't match, nothing is yielded. When it's used with an array with a mapping, it yields an array with null values. If the selector for one of the methods `Html(/Xml)::each()`, `Html(/Xml)::first()` or `Html(/Xml)::last()` doesn't match anything, this no longer causes an error; the step just won't yield anything.
- Removed the (unnecessary) second argument from the `Loop::withInput()` method, because when `keepLoopingWithoutOutput()` is called and `withInput()` is called afterwards, it resets the behavior.
- Issue when the date format for the expires date in a cookie doesn't have dashes in `d-M-Y` (so `d M Y`).
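
To illustrate the cookie date issue from the last item, here is a minimal sketch of accepting both spellings of the expires date. It only uses PHP's built-in `DateTimeImmutable::createFromFormat()`; `parseCookieExpires()`, the full format strings around `d-M-Y` / `d M Y`, and the sample values are assumptions made for illustration and are not taken from the library's code.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not the library's code: parses a cookie expires date
 * written either with dashes (d-M-Y) or with spaces (d M Y).
 */
function parseCookieExpires(string $value): ?DateTimeImmutable
{
    // Assumed surrounding format (day name, time, timezone abbreviation).
    $formats = ['D, d-M-Y H:i:s T', 'D, d M Y H:i:s T'];

    foreach ($formats as $format) {
        $date = DateTimeImmutable::createFromFormat($format, $value);

        if ($date !== false) {
            return $date;
        }
    }

    return null; // None of the known formats matched.
}

$withDashes = parseCookieExpires('Wed, 21-Oct-2015 07:28:00 GMT');
$withSpaces = parseCookieExpires('Wed, 21 Oct 2015 07:28:00 GMT');
$invalid = parseCookieExpires('not a date');
```

Trying the formats one after another keeps the parsing strict for each variant while still accepting both.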