crwlrsoft/crawler v1.0.0


Added

  • New method Step::refineOutput() to manually refine step output values. It takes either a Closure or an instance of the new RefinerInterface as argument. If the step produces array output, you can pass the key of the output array element to refine as the first argument and the refiner as the second. You can call the method multiple times, and all refiners are applied to the outputs in the order they were added. If you want to refine multiple output array keys with one Closure, skip the key and the Closure receives the full output array for refinement. There are already a few RefinerInterface implementations: StringRefiner::afterFirst(), StringRefiner::afterLast(), StringRefiner::beforeFirst(), StringRefiner::beforeLast(), StringRefiner::betweenFirst(), StringRefiner::betweenLast() and StringRefiner::replace(). See the first sketch after this list.
  • New method Step::excludeFromGroupOutput() to exclude a normal step's output from the combined output of a group that it's part of (see the group sketch after this list).
  • New method HttpLoader::setMaxRedirects() to customize the limit of redirects to follow. It only takes effect when loading via the HTTP client (see the loader sketch after this list).
  • New filters for string length, with the same options as the comparison filters (equal, not equal, greater than, ...).
  • New Filter::custom() that you can use with a Closure, so you're not limited to the built-in filters. Both the string length filters and Filter::custom() appear in the filter sketch after this list.
  • New method DomQuery::link() as a shortcut for DomQuery::attribute('href')->toAbsoluteUrl() (see the extraction sketch after this list).
  • New static method HttpCrawler::make() returning an instance of the new class AnonymousHttpCrawlerBuilder. This makes it possible to create your own Crawler instance with a one-liner like: HttpCrawler::make()->withBotUserAgent('MyCrawler'). There's also a withUserAgent() method to create an instance with a normal (non-bot) user agent (sketch after this list).
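
A minimal sketch of Step::refineOutput() with one of the new StringRefiner implementations. The URL, the .price selector and the refiner arguments are made up for illustration, and the namespaces are given as expected in v1.0:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Refiners\StringRefiner;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/products/123') // hypothetical URL
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract(['price' => '.price']) // hypothetical selector
            // Refine only the 'price' key of the array output. Multiple
            // refineOutput() calls would be applied in the order added.
            ->refineOutput('price', StringRefiner::betweenFirst('Price:', '€'))
    );

$crawler->runAndTraverse();
```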
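
A sketch of excluding one step's output from a group's combined output; the steps inside the group and their selectors are illustrative:

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Crawler::group()
        ->addStep(Html::root()->extract(['title' => 'h1']))
        ->addStep(
            // This step still runs, but its output is left out of the
            // group's combined output:
            Html::root()
                ->extract(['tracking' => '.tracking-pixel']) // hypothetical selector
                ->excludeFromGroupOutput()
        )
);
```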
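
Raising the redirect limit on a manually created loader; a standalone sketch, assuming the HttpLoader and BotUserAgent namespaces shown below:

```php
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\UserAgents\BotUserAgent;

$loader = new HttpLoader(new BotUserAgent('MyCrawler'));

// Follow up to ten redirects. As noted above, this only takes effect
// when loading via the HTTP client.
$loader->setMaxRedirects(10);
```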
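
The string length filters and Filter::custom() in one sketch. The method name stringLengthGreaterThan() is an assumption following the comparison-filter naming; selectors and output keys are made up:

```php
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::each('#list .item') // hypothetical selector
        ->extract(['title' => 'a.title', 'price' => '.price'])
        // Keep only outputs whose title is longer than five characters
        // (filter name assumed, by analogy with the comparison filters):
        ->where('title', Filter::stringLengthGreaterThan(5))
        // Arbitrary logic via Closure when no built-in filter fits:
        ->where('price', Filter::custom(fn ($value) => str_contains((string) $value, '€')))
);
```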
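
DomQuery::link() in an extraction mapping; the selectors are illustrative:

```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::each('#list .item')->extract([
        'title' => 'a.title',
        // Short for Dom::cssSelector('a.title')->attribute('href')->toAbsoluteUrl():
        'url' => Dom::cssSelector('a.title')->link(),
    ])
);
```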
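
The new one-liners for getting a crawler instance without defining your own class; the user agent strings are placeholders:

```php
use Crwlr\Crawler\HttpCrawler;

// With a bot user agent:
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

// Or with a normal (non-bot) user agent string:
$crawler = HttpCrawler::make()->withUserAgent('Mozilla/5.0 (compatible; MyBrowser)');
```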

Changed

  • BREAKING: The FileCache now also respects the ttl (time to live) argument, and by default it is one hour (3600 seconds). If you're using the cache and expect items to live (basically) forever, provide a high enough value for the default time to live (see the cache sketch after this list). When you try to get a cache item that has already expired, its file is deleted immediately.
  • BREAKING: The TooManyRequestsHandler (and with that also the corresponding constructor argument of the HttpLoader) was renamed to RetryErrorResponseHandler. It now reacts to 503 (Service Unavailable) responses the same way as to 429 (Too Many Requests) responses. If you're actively passing your own instance to the HttpLoader, you need to update it (see the sketch after this list).
  • You can now have multiple different loaders in a Crawler. To use this, return an array containing your loaders from the protected Crawler::loader() method, with keys to name them. You can then selectively use them by calling Step::useLoader() on a loading step with the key of the loader it should use (see the multi-loader sketch after this list).
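
If you rely on long-lived cache items, raise the default TTL. A sketch assuming FileCache takes the storage path in its constructor, exposes a ttl() setter (in seconds) and is attached via HttpLoader::setCache():

```php
use Crwlr\Crawler\Cache\FileCache;

// ttl() assumed to set the default time to live in seconds.
$cache = (new FileCache(__DIR__ . '/cache'))->ttl(30 * 24 * 60 * 60); // ~30 days

// setCache() assumed to attach the cache to the loader.
$loader->setCache($cache);
```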
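
Updating a manually passed handler instance; the constructor parameter name retryErrorResponseHandler and the Politeness namespace are assumptions, so check the actual signature in the library:

```php
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\Http\Politeness\RetryErrorResponseHandler;
use Crwlr\Crawler\UserAgents\BotUserAgent;

// Previously you would have passed a TooManyRequestsHandler here.
$loader = new HttpLoader(
    new BotUserAgent('MyCrawler'),
    retryErrorResponseHandler: new RetryErrorResponseHandler(), // parameter name assumed
);
```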
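
A sketch of a crawler with two named loaders. The array return shape and Step::useLoader() follow the description above; the useHeadlessBrowser() switch on the second loader and the exact loader() signature are assumptions:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyCrawler');
    }

    // Return an array with keys to define multiple named loaders.
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        return [
            'http' => new HttpLoader($userAgent, logger: $logger),
            'browser' => (new HttpLoader($userAgent, logger: $logger))->useHeadlessBrowser(),
        ];
    }
}

$crawler = new MyCrawler();

// Pick a loader per loading step, by key:
$crawler->addStep(Http::get()->useLoader('browser'));
```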

Removed

  • BREAKING: The loop feature. Its only real-world use case should be paginating listings, and that is now covered by the Paginator feature (see the sketch after this list).
  • BREAKING: Step::dontCascade() and Step::cascades(), because since the change in v0.7 that groups can only produce combined output, there should be no use case for them anymore. If you want to exclude one step's output from the combined group output, use the new Step::excludeFromGroupOutput() method.
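
Pagination via the Paginator feature instead of the removed loop; a sketch assuming paginate() on the Http step accepts a CSS selector pointing to the pagination links:

```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->addStep(
    // Loads the listing page and then follows the pagination links
    // found via the (hypothetical) .pagination selector.
    Http::get()->paginate('.pagination')
);
```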
