github crwlrsoft/crawler v1.1.0

18 months ago

Added

  • Http steps can now receive the request body and headers from input data (instead of defining them statically via arguments, like Http::method(headers: ...)) using the new methods useInputKeyAsBody(<key>), useInputKeyAsHeader(<key>, <asHeader>) and useInputKeyAsHeaders(<key>). Further, when invoked with associative array input data, the step by default uses the value from the key url or uri as the request URL. If the input array contains the URL under a different key, you can use the new useInputKeyAsUrl(<key>) method. This was basically already possible with the existing useInputKey(<key>) method, because the URL is the main input argument for the step. But if you want to combine it with the other new useInputKeyAsXyz() methods, you have to use useInputKeyAsUrl(), because useInputKey(<key>) would invoke the whole step with only the value of that key.
  • Crawler::runAndDump() as a simple way to just run a crawler and dump all results, each as an array.
  • addToResult() now also works with serializable objects.
  • If you know certain keys that the output of a step will contain, you can now also define aliases for those keys, to be used with addToResult(). The output of an Http step (RespondedRequest) contains the keys requestUri and effectiveUri. The aliases url and uri refer to effectiveUri, so addToResult(['url']) will add the effectiveUri as url to the result object.
  • The GetLink (Html::getLink()) and GetLinks (Html::getLinks()) steps, as well as the abstract DomQuery class (parent of CssSelector / Dom::cssSelector and XPathQuery / Dom::xPath), now have a method withoutFragment() to get links or URLs without their fragment part.
  • The HttpCrawl step (Http::crawl()) has a new method useCanonicalLinks(). If you call it, the step will not yield a response if its canonical link URL was already yielded. If it discovers a link and a document pointing to that URL via canonical link was already loaded, it treats the link as already loaded. Further, this feature also sets the canonical link URL as the effectiveUri of the response.
  • All filters can now be negated by calling the negate() method, so the evaluate() method will return the opposite bool value when called. The negate() method returns an instance of NegatedFilter that wraps the original filter.
  • New method cacheOnlyWhereUrl() in the HttpLoader class, which takes an instance of the FilterInterface as argument. If you define one or multiple filters using this method, the loader will cache only responses for URLs that match all the filters.
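
The last two items (negatable filters and caching only URLs that match all filters) can be illustrated with a small language-agnostic sketch. It is written in Python and is not the library's PHP implementation. StringContains and should_cache are hypothetical names chosen for this example. Only negate(), evaluate() and NegatedFilter mirror names from the notes above.

```python
from typing import Iterable, Protocol


class Filter(Protocol):
    """Anything with an evaluate() method counts as a filter in this sketch."""

    def evaluate(self, value: str) -> bool: ...


class NegatedFilter:
    """Wraps another filter and returns the opposite of its result."""

    def __init__(self, inner: Filter) -> None:
        self.inner = inner

    def evaluate(self, value: str) -> bool:
        return not self.inner.evaluate(value)


class StringContains:
    """Hypothetical example filter: passes when the value contains a substring."""

    def __init__(self, needle: str) -> None:
        self.needle = needle

    def evaluate(self, value: str) -> bool:
        return self.needle in value

    def negate(self) -> NegatedFilter:
        # negate() does not modify the filter, it returns a wrapper around it.
        return NegatedFilter(self)


def should_cache(url: str, filters: Iterable[Filter]) -> bool:
    """Hypothetical helper: cache a response only if its URL matches ALL filters.

    With no filters defined, every URL qualifies.
    """
    return all(f.evaluate(url) for f in filters)


# Cache only example.com URLs that are not under /private/.
filters = [StringContains("example.com"), StringContains("/private/").negate()]
```

With these filters, should_cache("https://example.com/blog/post", filters) is true, while a URL under /private/ or on another host is not cached.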

Fixed

  • The HttpCrawl step (Http::crawl()) by default now removes the fragment part of URLs to avoid loading the same page multiple times, because in almost all cases servers won't respond with different content based on the fragment. That's why this change is considered non-breaking. For the rare cases where servers do respond with different content based on the fragment, you can call the new keepUrlFragment() method of the step.
  • Although the HttpCrawl step (Http::crawl()) already respected the limit of outputs defined via the maxOutputs() method, it actually didn't stop loading pages. The limit had no effect on loading, only on passing on outputs (responses) to the next step. This is fixed in this version.
  • A byte order mark (BOM) at the beginning of a file or string can cause issues, so it is now removed when a step's input string starts with a UTF-8 BOM.
  • There seems to be an issue in Guzzle when it gets a PSR-7 request object with a header that has multiple string values (as an array, like: ['accept-encoding' => ['gzip', 'deflate', 'br']]). In testing, it sent only the last value (in this case br). Therefore, the HttpLoader now prepares headers before sending (in this case to: ['accept-encoding' => ['gzip, deflate, br']]).
  • You can now also use the output key aliases when filtering step outputs. You can even use keys that are only present in the serialized version of an output object.
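
Three of the fixes above are small input normalizations: dropping the URL fragment, stripping a leading UTF-8 BOM, and joining multi-value headers into one string. The following is a minimal Python sketch of those ideas, not the library's PHP code. The function names are hypothetical and chosen only for this illustration.

```python
import codecs
from typing import Dict, List
from urllib.parse import urldefrag

# The UTF-8 BOM bytes (EF BB BF) decode to the single character U+FEFF.
UTF8_BOM = codecs.BOM_UTF8.decode("utf-8")


def strip_utf8_bom(text: str) -> str:
    """Remove a leading UTF-8 byte order mark, if present."""
    return text[len(UTF8_BOM):] if text.startswith(UTF8_BOM) else text


def without_fragment(url: str) -> str:
    """Return the URL without its #fragment part."""
    return urldefrag(url).url


def prepare_headers(headers: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Join multiple values of a header into one comma-separated value."""
    return {name: [", ".join(values)] for name, values in headers.items()}


prepared = prepare_headers({"accept-encoding": ["gzip", "deflate", "br"]})
# prepared == {"accept-encoding": ["gzip, deflate, br"]}
```

Joining the values keeps the header semantically the same, since a comma-separated list is the usual way to write multiple values in a single header line.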
