github crwlrsoft/crawler v1.8.0

one month ago

Added

  • New methods Step::keep() and Step::keepAs(), as well as Step::keepFromInput() and Step::keepInputAs(), as alternatives to Step::addToResult() (or Step::addLaterToResult()). The keep() method can be called without any argument to keep all of the output data. It can also be called with a string to keep a certain key, or with an array to keep a list of keys. If the step yields scalar outputs (not an associative array or an object with keys), you need to use the keepAs() method with the key that the output value should have in the kept data. The methods keepFromInput() and keepInputAs() work the same way, but use the input (not the output) that the step receives. They are most likely only needed with a first step, to keep data from initial inputs (or in a sub crawler, see below). Kept properties can also be accessed with the Step::useInputKey() method, so you can easily reuse properties from multiple steps ago as input.
  • New method Step::outputType() with a default implementation returning StepOutputType::Mixed. Please consider implementing this method yourself in all your custom steps, because it is going to be required in v2 of the library. It allows (potential) problems in crawling procedures to be detected immediately when starting a run, instead of failing after the run has already been going for a while.
  • New method Step::subCrawlerFor(), which lets you fill output properties from an actual full child crawling procedure. As the first argument, you give it a key from the step's output, whose value(s) the child crawler uses as input(s). As the second argument, you need to provide a Closure that receives a clone of the current Crawler, without steps and with initial inputs set from the current output. In the Closure you then define the crawling procedure by adding steps as you usually do, and return the crawler. This makes it possible to build nested output data, scraped from different (sub-)pages, more flexibly and with less complexity than with the usual linear crawling procedure and Step::addToResult().
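
The following is a minimal sketch of how keep() and subCrawlerFor() might be combined, based only on the descriptions above. The URL, the CSS selectors, the property names and the user agent name are made up for illustration. The surrounding calls (HttpCrawler::make(), Http::get(), Html::each(), Html::root(), extract(), Dom::cssSelector()) are assumed from the library's general API and are not part of these release notes, so please check them against the library's documentation before relying on this.

```php
<?php

declare(strict_types=1);

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Assumes the library was installed with composer.
require __DIR__ . '/vendor/autoload.php';

// 'ExampleBot' is a placeholder user agent name.
$crawler = HttpCrawler::make()->withBotUserAgent('ExampleBot');

$crawler
    ->input('https://www.example.com/articles') // hypothetical listing page
    ->addStep(Http::get())
    ->addStep(
        Html::each('.article') // hypothetical selector
            ->extract([
                'title' => 'h2',
                'url' => Dom::cssSelector('a')->link(),
            ])
            // The 'url' value from this step's output becomes the input
            // of the child crawling procedure defined in the Closure.
            ->subCrawlerFor('url', function (Crawler $crawler) {
                return $crawler
                    ->addStep(Http::get())
                    ->addStep(
                        Html::root()->extract(['author' => '.author'])
                    );
            })
            // Called without arguments: keep all of the output data.
            ->keep()
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```

For a step that yields scalar outputs, the notes above suggest calling keepAs() with the desired key instead of keep().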

Deprecated

  • The methods Step::addToResult(), Step::addLaterToResult() and Step::keepInputData() are deprecated. Please use the new keep methods instead. Migrating for v2 can take some work, because the add-to-result methods in particular are a pretty central piece of functionality, but the new "keep" methodology (plus the new sub crawler feature) will make a lot of things easier and less complex, and the library will most likely work more efficiently in v2.

Fixed

  • When a cache file was generated with compression and you try to read it with a FileCache instance that does not have compression enabled, reading it now works as well. If unserializing the file content fails, the cache tries decoding the string first and then unserializes the decoded content.
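
The fallback described in this fix can be sketched in plain PHP as follows. This is not the library's actual code: the function name readCacheContent() is hypothetical, and gzip (gzencode()/gzdecode(), which need the zlib extension) is an assumption, because the notes do not say which compression the cache uses.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the library: unserialize cache content,
 * falling back to decoding it first when plain unserializing fails.
 */
function readCacheContent(string $content): mixed
{
    // First attempt: treat the content as an uncompressed serialized value.
    $value = @unserialize($content);

    // unserialize() returns false on failure, but also for a serialized
    // false value, so that case is checked explicitly.
    if ($value !== false || $content === serialize(false)) {
        return $value;
    }

    // Fallback: assume the content is gzip compressed (an assumption).
    $decoded = @gzdecode($content);

    if ($decoded === false) {
        throw new RuntimeException('Cache content could not be read.');
    }

    return unserialize($decoded);
}

// A cache item written with compression, then with plain serialization.
$data = ['url' => 'https://www.example.com'];

$fromCompressed = readCacheContent(gzencode(serialize($data)));
$fromPlain = readCacheContent(serialize($data));
```

Both calls are expected to return the original array, regardless of whether the content was compressed when it was written.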
