Added
- `Http` steps can now receive body and headers from input data (instead of statically defining them via arguments like `Http::method(headers: ...)`) using the new methods `useInputKeyAsBody(<key>)` and `useInputKeyAsHeader(<key>, <asHeader>)` or `useInputKeyAsHeaders(<key>)`. Further, when invoked with associative array input data, the step will by default use the value from `url` or `uri` for the request URL. If the input array contains the URL in a key with a different name, you can use the new `useInputKeyAsUrl(<key>)` method. That was basically already possible with the existing `useInputKey(<key>)` method, because the URL is the main input argument for the step. But if you want to use it in combination with the other new `useInputKeyAsXyz()` methods, you have to use `useInputKeyAsUrl()`, because using `useInputKey(<key>)` would invoke the whole step with that key only.
- `Crawler::runAndDump()` as a simple way to just run a crawler and dump all results, each as an array.
- `addToResult()` now also works with serializable objects.
- If you know certain keys that the output of a step will contain, you can now also define aliases for those keys, to be used with `addToResult()`. The output of an `Http` step (`RespondedRequest`) contains the keys `requestUri` and `effectiveUri`. The aliases `url` and `uri` refer to `effectiveUri`, so `addToResult(['url'])` will add the `effectiveUri` as `url` to the result object.
- The `GetLink` (`Html::getLink()`) and `GetLinks` (`Html::getLinks()`) steps, as well as the abstract `DomQuery` (parent of `CssSelector` (/`Dom::cssSelector`) and `XPathQuery` (/`Dom::xPath`)) now have a method `withoutFragment()` to get links or URLs without their fragment part.
- The `HttpCrawl` step (`Http::crawl()`) has a new method `useCanonicalLinks()`. If you call it, the step will not yield a response if its canonical link URL was already yielded. And if it discovers a link, and some document pointing to that URL via canonical link was already loaded, it treats the link as if it was already loaded. Further, this feature also sets the canonical link URL as the `effectiveUri` of the response.
- All filters can now be negated by calling the `negate()` method, so the `evaluate()` method will return the opposite bool value when called. The `negate()` method returns an instance of `NegatedFilter` that wraps the original filter.
- New method `cacheOnlyWhereUrl()` in the `HttpLoader` class, which takes an instance of the `FilterInterface` as argument. If you define one or multiple filters using this method, the loader will cache only responses for URLs that match all the filters.
Fixed
- The `HttpCrawl` step (`Http::crawl()`) by default now removes the fragment part of URLs to not load the same page multiple times, because in almost any case, servers won't respond with different content based on the fragment. That's why this change is considered non-breaking. For the rare cases when servers respond with different content based on the fragment, you can call the new `keepUrlFragment()` method of the step.
- Although the `HttpCrawl` step (`Http::crawl()`) already respected the limit of outputs defined via the `maxOutputs()` method, it actually didn't stop loading pages. The limit had no effect on loading, only on passing on outputs (responses) to the next step. This is fixed in this version.
- A so-called byte order mark at the beginning of a file (or string) can cause issues. It is now removed when a step's input string starts with a UTF-8 BOM.
- There seems to be an issue in Guzzle when it gets a PSR-7 request object with a header with multiple string values (as an array, like `['accept-encoding' => ['gzip', 'deflate', 'br']]`). In testing, it only sent the last part (in this case `br`). Therefore the `HttpLoader` now prepares headers before sending (in this case to `['accept-encoding' => ['gzip, deflate, br']]`).
- You can now also use the output key aliases when filtering step outputs. You can even use keys that are only present in the serialized version of an output object.
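
The string handling behind three of these fixes can be sketched in plain PHP. This is a rough illustration under stated assumptions, not the library's actual code: the function names `removeUrlFragment()`, `stripUtf8Bom()` and `joinHeaderValues()` are hypothetical and exist only in this example.

```php
<?php

// The UTF-8 byte order mark consists of the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Drops everything from the first "#" on, so URLs differing only in fragment compare equal.
function removeUrlFragment(string $url): string
{
    return explode('#', $url, 2)[0];
}

// Removes a leading UTF-8 BOM, if present; other input is returned unchanged.
function stripUtf8Bom(string $input): string
{
    return str_starts_with($input, UTF8_BOM)
        ? substr($input, strlen(UTF8_BOM))
        : $input;
}

/**
 * Joins multiple values of each header into a single comma-separated value.
 *
 * @param array<string, string[]> $headers
 * @return array<string, string[]>
 */
function joinHeaderValues(array $headers): array
{
    $prepared = [];

    foreach ($headers as $name => $values) {
        $prepared[$name] = [implode(', ', $values)];
    }

    return $prepared;
}
```

For example, `joinHeaderValues(['accept-encoding' => ['gzip', 'deflate', 'br']])` yields `['accept-encoding' => ['gzip, deflate, br']]`, the prepared form mentioned in the Guzzle entry above.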