crwlrsoft/crawler v0.6.0

Added

  • New step Http::crawl() (class HttpCrawl, extending the normal Http step class) for conventional crawling: it loads all pages of a website (same host or domain) by following links. There are also many options, such as crawl depth, filtering by paths, and so on (see the pipeline sketch after this list).
  • New steps Sitemap::getSitemapsFromRobotsTxt() (GetSitemapsFromRobotsTxt) and Sitemap::getUrlsFromSitemap() (GetUrlsFromSitemap) to get the sitemap URLs listed in a robots.txt file and to get all the URLs contained in those sitemaps (see the sitemap sketch after this list).
  • New step Html::metaData() to get data from meta tags (and the title tag) in HTML documents.
  • New step Html::schemaOrg() (SchemaOrg) to get schema.org structured data in JSON-LD format from HTML documents.
  • The abstract DomQuery class (parent of the CssSelector and XPathQuery classes) now has methods to further narrow the selected matches: first(), last(), nth(n), even(), odd() (see the example after this list).
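
As a rough illustration of how the new crawl and extraction steps fit together, here is a minimal sketch of a crawler pipeline. Only Http::crawl(), Html::metaData() and Html::schemaOrg() come from this release; the surrounding wiring (the MyCrawler class, input(), addStep(), run() and the namespaces in the use statements) is assumed from the package's general usage pattern, so check the documentation for the exact API.

    <?php

    use Crwlr\Crawler\Steps\Html;
    use Crwlr\Crawler\Steps\Loading\Http;

    // MyCrawler is assumed to extend the package's Crawler class and to define
    // a user agent and an HttpLoader, as described in the package documentation.
    $crawler = new MyCrawler();

    $crawler
        ->input('https://www.example.com/')  // hypothetical start URL
        ->addStep(Http::crawl())             // new in 0.6: load all pages of the site by following links
        ->addStep(Html::metaData());         // new in 0.6: title and meta tag data for each loaded page
                                             // (or Html::schemaOrg() for schema.org JSON-LD data instead)

    foreach ($crawler->run() as $result) {
        var_dump($result); // one result per crawled page
    }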
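
The two new Sitemap steps can be chained in a similar way. This note doesn't spell out the exact wiring, so the intermediate Http::get() step and the overall pipeline below are assumptions; only the two Sitemap step names are from this release.

    <?php

    use Crwlr\Crawler\Steps\Loading\Http;
    use Crwlr\Crawler\Steps\Sitemap;

    $crawler = new MyCrawler(); // same assumed crawler class as above

    $crawler
        ->input('https://www.example.com/')            // hypothetical site
        ->addStep(Sitemap::getSitemapsFromRobotsTxt()) // sitemap URLs listed in the site's robots.txt
        ->addStep(Http::get())                         // load each sitemap (assumed intermediate step)
        ->addStep(Sitemap::getUrlsFromSitemap());      // all URLs contained in those sitemaps

    foreach ($crawler->run() as $result) {
        // one result per URL found in the sitemaps
    }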
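
The new DomQuery narrowing methods are called on a CssSelector or XPathQuery instance. The namespace and constructor argument below are assumptions for illustration; the narrowing methods themselves (first(), last(), nth(n), even(), odd()) are the ones introduced in this release.

    <?php

    use Crwlr\Crawler\Steps\Html\CssSelector; // namespace assumed

    // Narrow what a selector matches before extracting data from it.
    $first = (new CssSelector('li.product'))->first(); // only the first matched element
    $third = (new CssSelector('li.product'))->nth(3);  // only the third matched element
    $even  = (new CssSelector('li.product'))->even();  // only the even matches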

Changed

  • BREAKING: Removed PoliteHttpLoader and the traits WaitPolitely and CheckRobotsTxt. The traits were converted to the classes Throttler and RobotsTxtHandler, which are dependencies of the HttpLoader; the loader internally creates default instances of both. The RobotsTxtHandler respects robots.txt rules by default if you use a BotUserAgent and ignores them if you use a normal UserAgent. You can access the loader's RobotsTxtHandler via HttpLoader::robotsTxt(). You can also pass your own Throttler instance to the loader and access it via HttpLoader::throttle() to change its settings (see the sketch below).
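
A short sketch of the new loader dependencies. The HttpLoader::robotsTxt() and HttpLoader::throttle() accessors and the BotUserAgent behaviour are described above; the namespaces and constructor arguments are assumptions, so check the documentation for the exact signatures.

    <?php

    use Crwlr\Crawler\Loader\Http\HttpLoader;   // namespace assumed
    use Crwlr\Crawler\UserAgents\BotUserAgent;  // namespace assumed

    // With a BotUserAgent, the loader's RobotsTxtHandler respects robots.txt by default.
    $loader = new HttpLoader(new BotUserAgent('MyBot'));

    // Access the default RobotsTxtHandler instance the loader created internally.
    $robotsTxtHandler = $loader->robotsTxt();

    // Access the default Throttler instance to change its settings,
    // or pass your own Throttler instance to the loader instead.
    $throttler = $loader->throttle();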

Fixed

  • The GetLink and GetLinks steps and the toAbsoluteUrl() method of the CssSelector and XPathQuery classes now also look for <base> tags in HTML documents when resolving relative URLs to absolute ones (see the illustration after this list).
  • The SimpleCsvFileStore can now also save results with nested data (but only one level of nesting): the nested values are concatenated into a single cell, separated by a | (see the example after this list).
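
To illustrate the <base> tag fix: the HTML and URLs below are made up, and no specific API call is shown; the point is just how a relative link resolves once the base href is taken into account.

    <?php

    // A document with a <base> tag:
    $html = <<<HTML
    <html>
      <head><base href="https://www.example.com/products/"></head>
      <body><a href="item-1">Item 1</a></body>
    </html>
    HTML;

    // GetLink/GetLinks and toAbsoluteUrl() now resolve the relative href against
    // the base href, yielding https://www.example.com/products/item-1
    // instead of resolving it against the URL the document was loaded from.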
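
And a small illustration of how the SimpleCsvFileStore now handles nested result data. The column layout is an assumption; the | concatenation is what this release describes.

    <?php

    // A result with one level of nested data:
    $resultData = [
        'title'      => 'Example product',
        'categories' => ['books', 'fiction', 'paperback'],
    ];

    // Stored CSV row (conceptually):
    // "Example product","books|fiction|paperback"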
