Initial version, containing:
- `Crawler` class as the main unit that executes all the steps you add to it, handling the input and output of each step.
- `HttpCrawler` class using the `PoliteHttpLoader` (a version of `HttpLoader` that sticks to `robots.txt` rules), which works with any PSR-18 HTTP client under the hood and ships its own cookie jar implementation.
- Some ready-to-use steps for HTTP, HTML, XML, JSON and CSV.
- Loops and Groups.
- The Crawler takes a PSR-3 `LoggerInterface` and passes it on to all steps. The included steps log messages about what they're doing. The package includes a simple `CliLogger`.
- The Crawler requires a user agent, and the included `BotUserAgent` class provides an easy interface for bot user-agent strings.
- Stores to save the final results can be added to the Crawler. A simple CSV file store is shipped with the package.
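
Put together, basic usage might look roughly like the following sketch. This is not taken verbatim from the package; class and method names such as `BotUserAgent::make()`, `Http::get()` and `Html::getLink()` follow the package's general API shape but may differ in detail between versions, and the CSS selector is purely illustrative:

```php
<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

// HttpCrawler asks subclasses to define a user agent; BotUserAgent
// builds a bot user-agent string from a product name.
class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

$crawler = new MyCrawler();

$crawler->input('https://www.example.com')     // initial input for the first step
    ->addStep(Http::get())                     // load the page (politely, via robots.txt rules)
    ->addStep(Html::getLink('a.article'));     // hypothetical step extracting matching links

$crawler->runAndTraverse();
```

Each step receives the output of the previous one as its input, which is what makes the loops and groups mentioned above composable.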