Added
- New step `Http::crawl()` (class `HttpCrawl`, extending the normal `Http` step class) for conventional crawling. It loads all pages of a website (same host or domain) by following links. There are also a lot of options like depth, filtering by paths, and so on (sketch below this list).
- New steps `Sitemap::getSitemapsFromRobotsTxt()` (`GetSitemapsFromRobotsTxt`) and `Sitemap::getUrlsFromSitemap()` (`GetUrlsFromSitemap`) to get the sitemap URLs from a robots.txt file and to get all the URLs from those sitemaps (sketch below).
- New step `Html::metaData()` to get data from meta tags (and the title tag) in HTML documents (sketch below).
- New step `Html::schemaOrg()` (`SchemaOrg`) to get schema.org structured data in JSON-LD format from HTML documents (sketch below).
- The abstract `DomQuery` class (parent of the `CssSelector` and `XPathQuery` classes) now has some methods to narrow the selected matches further: `first()`, `last()`, `nth(n)`, `even()`, `odd()` (sketch below).
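
Below are a few usage sketches for the new steps. They are illustrative only: the overall crawler setup is how the library is normally used, but option and method names not mentioned above (for example `pathStartsWith()`) are assumptions and should be checked against the docs. First, the new crawl step:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        // A BotUserAgent means robots.txt rules are respected by default (see "Changed" below).
        return new BotUserAgent('MyBot');
    }
}

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/');

$crawler->addStep(
    Http::crawl()
        ->depth(2)                // follow links only two levels deep from the start page
        ->pathStartsWith('/blog') // assumed name of a path filter option
);

// Usually you'd add further extraction steps (e.g. Html::metaData()) after the crawl step.

foreach ($crawler->run() as $result) {
    var_dump($result->toArray()); // one result per loaded page
}
```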
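
The sitemap steps chain naturally with an `Http::get()` step; the assumption here is that `getSitemapsFromRobotsTxt()` takes the site's base URL as input and that the found sitemap URLs are loaded before `getUrlsFromSitemap()` parses them:

```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler = new MyCrawler(); // the crawler class from the first sketch

$crawler->input('https://www.example.com/');

$crawler->addStep(Sitemap::getSitemapsFromRobotsTxt()) // sitemap URLs listed in robots.txt
    ->addStep(Http::get())                             // load each sitemap XML
    ->addStep(Sitemap::getUrlsFromSitemap());          // all URLs contained in the sitemaps

foreach ($crawler->run() as $result) {
    var_dump($result->get('url')); // assuming the output property is called 'url'
}
```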
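
For the meta data step, a minimal sketch; the `only()` filter shown is an assumption about the step's API, and without it the step presumably returns all meta data it finds:

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler(); // the crawler class from the first sketch

$crawler->input('https://www.example.com/some/article');

$crawler->addStep(Http::get())
    ->addStep(
        Html::metaData()->only(['title', 'description', 'og:image']) // only() is an assumption
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray()); // e.g. ['title' => '...', 'description' => '...', 'og:image' => '...']
}
```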
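
The schema.org step can be used the same way; it is shown here without any filtering or extraction helpers to avoid guessing at method names:

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler(); // the crawler class from the first sketch

$crawler->input('https://www.example.com/some/product');

$crawler->addStep(Http::get())
    ->addStep(Html::schemaOrg()); // yields the JSON-LD structured data objects found in the page

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```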
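
The new `DomQuery` narrowing methods are called on `CssSelector` or `XPathQuery` instances. The `Dom::cssSelector()` factory and the `Html::root()->extract()` usage below are assumptions about how such queries are typically passed to an extraction step:

```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler(); // the crawler class from the first sketch

$crawler->input('https://www.example.com/list');

$crawler->addStep(Http::get())
    ->addStep(
        Html::root()->extract([
            'first' => Dom::cssSelector('#list li')->first(),
            'last'  => Dom::cssSelector('#list li')->last(),
            'third' => Dom::cssSelector('#list li')->nth(3),
            'even'  => Dom::cssSelector('#list li')->even(),
            'odd'   => Dom::cssSelector('#list li')->odd(),
        ])
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```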
Changed
- BREAKING: Removed `PoliteHttpLoader` and the traits `WaitPolitely` and `CheckRobotsTxt`. The traits were converted to the classes `Throttler` and `RobotsTxtHandler`, which are dependencies of the `HttpLoader`. The `HttpLoader` internally gets default instances of those classes. The `RobotsTxtHandler` will respect robots.txt rules by default if you use a `BotUserAgent`, and it won't if you use a normal `UserAgent`. You can access the loader's `RobotsTxtHandler` via `HttpLoader::robotsTxt()`. You can pass your own instance of the `Throttler` to the loader and also access it via `HttpLoader::throttle()` to change settings (sketch below).
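
A sketch of how the new loader dependencies can be reached from a crawler instance; `getLoader()` is an assumed accessor name, and no configuration methods of `Throttler`/`RobotsTxtHandler` are shown because their names aren't stated above:

```php
use Crwlr\Crawler\Loader\Http\HttpLoader;

$crawler = new MyCrawler(); // as in the sketches above, using a BotUserAgent

/** @var HttpLoader $loader */
$loader = $crawler->getLoader(); // assumed accessor for the crawler's loader

// Because the crawler uses a BotUserAgent, this handler obeys robots.txt rules by default.
$robotsTxtHandler = $loader->robotsTxt();

// The Throttler controls how long the loader waits between requests.
$throttler = $loader->throttle();

// As noted above, you can also pass your own Throttler instance to the HttpLoader
// (e.g. when building the loader in your crawler class) instead of the default one.
```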
Fixed
- Getting absolute links via the `GetLink` and `GetLinks` steps and the `toAbsoluteUrl()` method of the `CssSelector` and `XPathQuery` classes now also takes `<base>` tags in HTML into account when resolving the URLs (example below this list).
- The `SimpleCsvFileStore` can now also save results with nested data (though only nested one level deep). It just concatenates the nested values, separated with a `|` (example below).
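
To illustrate the `<base>` tag fix, an invented example page and the resolution it now gets:

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Suppose https://www.example.com/some/page contains:
//   <base href="https://www.example.com/docs/" />
//   <a href="getting-started">Getting started</a>
//
// The GetLinks step now resolves the link against the <base> href, yielding
// https://www.example.com/docs/getting-started instead of resolving it against
// the document URL (which would give https://www.example.com/some/getting-started).

$crawler = new MyCrawler(); // the crawler class from the sketches above

$crawler->input('https://www.example.com/some/page');

$crawler->addStep(Http::get())
    ->addStep(Html::getLinks());
```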
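
And a small sketch of what the `SimpleCsvFileStore` change means in practice; the constructor arguments (a storage directory and a file name prefix) are an assumption from memory:

```php
use Crwlr\Crawler\Stores\SimpleCsvFileStore;

$crawler = new MyCrawler(); // the crawler class from the sketches above

// Assumed constructor arguments: a directory to store files in and a file name prefix.
$crawler->setStore(new SimpleCsvFileStore(__DIR__ . '/store', 'results'));

// A result like ['title' => 'Some page', 'keywords' => ['crawling', 'scraping']]
// now ends up as a CSV row where the nested array becomes a single cell:
//   "Some page","crawling|scraping"
```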