Added

- New functionality to paginate: There is the new `Paginate` child class of the `Http` step class (easy access via `Http::get()->paginate()`). It takes an instance of the `PaginatorInterface` and uses it to iterate through pagination links. There is one implementation of that interface, the `SimpleWebsitePaginator`. The `Http::get()->paginate()` method uses it by default when called with just a CSS selector to get pagination links. Paginators receive all loaded pages and implement the logic to find pagination links. The paginator class is also called before sending a request, with the request object that is about to be sent as an argument (`prepareRequest()`). This way, it should even be possible to implement more complex pagination functionality, for example when pagination is built using POST requests with query strings in the request body.
- New methods `stopOnErrorResponse()` and `yieldErrorResponses()` that can be used with `Http` steps. By calling `stopOnErrorResponse()`, the step will throw a `LoadingException` when a response has a 4xx or 5xx status code. By calling `yieldErrorResponses()`, even error responses will be yielded and passed on to the next steps (this was the default behaviour until this version; see the breaking change below).
- The body of HTTP responses with a `Content-Type` header containing `application/x-gzip` is automatically decoded when `Http::getBodyString()` is used. Therefore, `ext-zlib` was added to the suggested dependencies in `composer.json`.
- New methods `addToResult()` and `addLaterToResult()`. `addToResult()` is a single replacement for `setResultKey()` and `addKeysToResult()` (they are removed, see "Changed" below) that can be used for array and non-array output. `addLaterToResult()` is a new method that does not create a Result object immediately, but instead adds the output of the current step to all the Results that will later be created originating from the current output.
- New methods `outputKey()` and `keepInputData()` that can be used with any step. Using the `outputKey()` method, the step will convert non-array output to an array and use the key provided as an argument to this method as the array key for the output value. The `keepInputData()` method allows you to forward data from the step's input to the output. If the input is non-array, you can define a key using the method's argument. This is useful, e.g., if you have data in the initial inputs that you also want to add to the final crawling results.
- New method `createsResult()` that can be used with any step, so you can differentiate whether a step creates a Result object or just keeps data to add to results later (new `addLaterToResult()` method). Primarily relevant for library-internal use.
- The `FileCache` class can now compress the cache data to save disk space. Use the `useCompression()` method to do so.
- New method `retryCachedErrorResponses()` in `HttpLoader`. When called, the loader will only use successful responses (status code < 400) from the cache and therefore retry already cached error responses.
- New method `writeOnlyCache()` in `HttpLoader` to only write to, but not read from, the response cache. Can be used to renew cached responses.
- `Filter::urlPathMatches()` to filter URL paths using a regex.
- Option to provide a Chrome executable name to the `chrome-php/chrome` library via `HttpLoader::setChromeExecutable()`.
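A minimal sketch of the new pagination feature. The URL and the `.pagination a` selector are hypothetical, and the setup via `HttpCrawler::make()->withBotUserAgent()` follows the library's documented usage (depending on your version you may instead subclass `HttpCrawler`):

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler->input('https://www.example.com/articles');

// Loads the first page, then lets the default SimpleWebsitePaginator
// follow all links matching the given CSS selector.
$crawler->addStep(Http::get()->paginate('.pagination a'));

// Extract data from every loaded page.
$crawler->addStep(
    Html::each('article')
        ->extract(['title' => 'h2'])
        ->addToResult()
);

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```

The new `stopOnErrorResponse()` could be chained onto the same `Http` step to abort crawling as soon as a 4xx or 5xx response is received.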
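The new caching options might be combined like this; a sketch assuming the loader is accessible via the crawler's `getLoader()` method and accepts a cache via `setCache()`, as in the library's docs:

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\Cache\FileCache;
use Crwlr\Crawler\HttpCrawler;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$cache = new FileCache(__DIR__ . '/cachedir');
$cache->useCompression(); // compress cached response data to save disk space

$loader = $crawler->getLoader();
$loader->setCache($cache);

// Only reuse cached responses with status code < 400;
// already cached error responses are loaded again.
$loader->retryCachedErrorResponses();

// Or: write new responses to the cache but never read from it,
// e.g. to renew everything that is currently cached.
// $loader->writeOnlyCache();
```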
Changed

- BREAKING: Group steps can now only produce combined outputs, as previously done when the `combineToSingleOutput()` method was called. The method is removed.
- BREAKING: `setResultKey()` and `addKeysToResult()` are removed. Calls to those methods can both be replaced with calls to the new `addToResult()` method.
- BREAKING: `getResultKey()` is also removed along with `setResultKey()`. It's removed without replacement, as it no longer really makes sense.
- BREAKING: Error responses (4xx as well as 5xx) will, by default, no longer produce any step outputs. If you want to receive error responses, use the new `yieldErrorResponses()` method.
- BREAKING: Removed the `httpClient()` method in the `HttpCrawler` class. If you want to provide your own HTTP client, implement a custom `loader` method instead, passing your client to the `HttpLoader`.
- Deprecated the loop feature (class `Loop` and the `Crawler::loop()` method). Probably the only use case is iterating over paginated list pages, which can now be done using the new Paginator functionality. It will be removed in v1.0.
- In case of a 429 (Too Many Requests) response, the `HttpLoader` now automatically waits and retries. By default, it retries twice, waiting 10 seconds before the first retry and a minute before the second one. In case the response also contains a `Retry-After` header with a value in seconds, it complies with that. Exception: by default it waits at most `60` seconds (you can set your own limit if you want); if the `Retry-After` value is higher, it stops crawling. If all the retries also receive a `429`, it also throws an Exception.
- Removed the logger from the `Throttler` class, as it doesn't log anything.
- Fail silently when `robots.txt` can't be parsed.
- Default timeout configuration for the default Guzzle HTTP client: `connect_timeout` is `10` seconds and `timeout` is `60` seconds.
- The `validateAndSanitize...()` methods in the abstract `Step` class, when called with an array with one single element, now automatically try to use that array element as the input value.
- With the `Html` and `Xml` data extraction steps you can now add layers to the data being extracted, by adding further `Html`/`Xml` data extraction steps as values in the mapping array that you pass as an argument to the `extract()` method.
- The base `Http` step can now also be called with an array of URLs as a single input. Crawl and Paginate steps still require a single URL input.
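To migrate away from the removed methods, calls can typically be swapped one-to-one. A sketch with hypothetical selectors, assuming `addToResult()` accepts an array of keys for array output and a single key name for scalar output, as described above:

```php
use Crwlr\Crawler\Steps\Html;

// Before this version:
// Html::root()->extract(['title' => 'h1'])->addKeysToResult(['title']);
// Html::getLink('a.next')->setResultKey('nextLink');

// Now:
Html::root()->extract(['title' => 'h1'])->addToResult(['title']); // array output
Html::getLink('a.next')->addToResult('nextLink');                 // scalar output
```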
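The new nested extraction in the `Html`/`Xml` steps might look like the following sketch; the selectors are hypothetical:

```php
use Crwlr\Crawler\Steps\Html;

$step = Html::each('#books .book')
    ->extract([
        'title' => 'h3',
        // A further Html step as a mapping value adds a nested
        // layer to the extracted data:
        'chapters' => Html::each('.chapter')->extract([
            'name' => '.chapter-title',
        ]),
    ])
    ->addToResult();
```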
Fixed

- The `CookieJar` now also works with `localhost` or other hosts without a registered domain name.
- Improved the `Sitemap::getUrlsFromSitemap()` step to also work when the `<urlset>` tag contains attributes that would otherwise cause the Symfony DomCrawler to not find any elements.
- Fixed the possibility of infinite redirects in the `HttpLoader` by adding a redirect limit of 10.