github apify/crawlee v0.22.1

3 years ago

This is the last major release before SDK v1.0.0. We're committed to delivering v1 at the
end of 2020, so stay tuned. Besides Playwright integration via a new BrowserPool,
it will be the first SDK release that we'll support for an extended period of time.
We will not make any breaking changes until 2.0.0, which will come at the end of
2021. But enough about v1, let's see the changes in 0.22.0.

In this release we've changed a lot of code, but you may not even notice.
We've updated the underlying apify-client package, which powers all communication with
the Apify API, to version 1.0.0. This means a completely new API for all internal calls.
If you use Apify.client calls in your code, this will be a large breaking change for you.
Visit the client docs to see what's new in the client, but also note that we removed the
default client available under Apify.client and replaced it with the Apify.newClient()
function. We think it's better to have separate clients for users and internal use.
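
A minimal migration sketch for the client change. It assumes that apify-client 1.0.0
addresses resources with calls such as client.dataset(id).listItems(); check the client
docs for the exact shape. The helper name listDatasetItems is hypothetical, and the SDK
is passed in as an argument only to keep the snippet self-contained.

```javascript
// Hypothetical helper: reads the items of one dataset through a fresh client.
// `Apify` is the SDK module (const Apify = require('apify')), injected here so
// that the sketch does not depend on an installed package.
async function listDatasetItems(Apify, datasetId) {
    // Before 0.22.0 a shared client was available as Apify.client.
    // From 0.22.0 on, each caller creates its own instance instead.
    const client = Apify.newClient();

    // Assumption: apify-client 1.0.0 exposes a dataset as client.dataset(id)
    // and returns a page object with an `items` array from listItems().
    const { items } = await client.dataset(datasetId).listItems();
    return items;
}
```

In real code you would call the helper's body directly after requiring the SDK,
keeping the client instance around for as long as you need it.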

Until now, local emulation of Apify Storages has been a part of the SDK. We moved the logic
into a separate package, @apify/storage-local, which shares its interface with apify-client.
RequestQueue is now powered by SQLite3 instead of the file system, which improves
reliability and performance quite a bit. Dataset and KeyValueStore still use the file
system, for easy browsing of data. The structure of the apify_storage folder remains unchanged.

After collecting common developer mistakes, we've decided to make argument validation stricter.
You will no longer be able to pass extra arguments to functions and constructors. This is
to alleviate the frustration of mistakenly passing useChrome to PuppeteerPoolOptions
instead of LaunchPuppeteerOptions and not realizing it. Before this version, the SDK wouldn't
let you know and would silently continue with Chromium. Now, it will throw an error saying
that useChrome is not an allowed property of PuppeteerPoolOptions.
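
The idea behind the stricter validation can be sketched in a few lines. This is an
illustration only, not the SDK's actual validation code: the function name
assertNoExtraProperties and the shortened key list are assumptions made for the example.

```javascript
// Illustrative only: rejects any property that is not on the allowed list.
function assertNoExtraProperties(options, allowedKeys, typeName) {
    for (const key of Object.keys(options)) {
        if (!allowedKeys.includes(key)) {
            throw new Error(`${key} is not an allowed property of ${typeName}`);
        }
    }
}

// Hypothetical, shortened list of keys accepted by PuppeteerPoolOptions.
const POOL_KEYS = ['maxOpenPagesPerInstance', 'retireInstanceAfterRequestCount'];

try {
    // useChrome belongs to LaunchPuppeteerOptions, so this call is rejected
    // instead of being silently ignored.
    assertNoExtraProperties({ useChrome: true }, POOL_KEYS, 'PuppeteerPoolOptions');
} catch (err) {
    console.log(err.message);
}
```

Failing early like this turns a silent misconfiguration into an error that names
the offending property.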

Based on developer feedback, we decided to remove --no-sandbox from the default Puppeteer
launch args. It will only be used on the Apify Platform. This gives you the chance to use
your own sandboxing strategy.
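
If your environment still needs the flag (for example some Docker setups), one way to
restore the old behavior is to add it back yourself. The sketch below assumes that
launch options accept Puppeteer's standard args array; the helper name withNoSandbox
is hypothetical.

```javascript
// Hypothetical helper: returns launch options with --no-sandbox added back.
// The input object is left untouched.
function withNoSandbox(launchOptions = {}) {
    const args = launchOptions.args || [];
    if (args.includes('--no-sandbox')) return launchOptions;
    return { ...launchOptions, args: [...args, '--no-sandbox'] };
}

const launchOptions = withNoSandbox({ headless: true });

// With the SDK installed, the result could then be passed on, for example:
// const browser = await Apify.launchPuppeteer(launchOptions);
```

Keeping the flag in one helper makes it easy to drop again once you have a
sandboxing setup of your own.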

LiveViewServer and puppeteerPoolOptions.useLiveView were never very user-friendly
or performant solutions, due to the inherent performance issues with rapidly taking many
screenshots in Puppeteer. We've decided to remove them. If you need similar functionality,
try the devtools-server NPM package, which utilizes the Chrome DevTools Frontend for
screen-casting a live view of the running browser.

Full list of changes:

  • BREAKING: Updated apify-client to 1.0.0 with a completely new interface.
    We also removed the Apify.client property and replaced it with an Apify.newClient()
    function that creates a new ApifyClient instance.

  • BREAKING: Removed --no-sandbox from default Puppeteer launch arguments.
    This will most likely be breaking for Linux and Docker users.

  • BREAKING: Function argument validation is now more strict and will not accept extra
    parameters which are not defined by the functions' signatures.

  • DEPRECATED: puppeteerPoolOptions.useLiveView is now deprecated.
    Use the devtools-server NPM package instead.

  • Added postResponseFunction to CheerioCrawlerOptions. It allows you to override
    properties on the HTTP response before processing by CheerioCrawler.

  • Added HTTP2 support to utils.requestAsBrowser(). Set useHttp2 to true
    in RequestAsBrowserOptions to enable it.

  • Fixed handling of XML content types in CheerioCrawler.

  • Fixed capitalization of headers when using utils.puppeteer.addInterceptRequestHandler.

  • Fixed utils.puppeteer.saveSnapshot() overwriting screenshots with HTML on local.

  • Updated puppeteer to version 5.4.1 with Chrom(ium) 87.

  • Removed RequestQueueLocal in favor of @apify/storage-local API emulator.

  • Removed KeyValueStoreLocal in favor of @apify/storage-local API emulator.

  • Removed DatasetLocal in favor of @apify/storage-local API emulator.

  • Removed the userData option from Apify.utils.enqueueLinks (deprecated in Jun 2019).
    Use transformRequestFunction instead.

  • Removed instanceKillerIntervalMillis and killInstanceAfterMillis (deprecated in Feb 2019).
    Use instanceKillerIntervalSecs and killInstanceAfterSecs instead.

  • Removed the memory option from Apify.call options (deprecated in 2018).
    Use memoryMbytes instead.

  • Removed delete() methods from Dataset, KeyValueStore and RequestQueue (deprecated in Jul 2019).
    Use .drop().

  • Removed utils.puppeteer.hideWebDriver() (deprecated in May 2019).
    Use LaunchPuppeteerOptions.stealth.

  • Removed utils.puppeteer.enqueueRequestsFromClickableElements() (deprecated in 2018).
    Use utils.puppeteer.enqueueLinksByClickingElements.

  • Removed request.doNotRetry() (deprecated in June 2019).
    Use request.noRetry = true.

  • Removed RequestListOptions.persistSourcesKey (deprecated in Feb 2020).
    Use persistRequestsKey.
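
Two of the replacements listed above can be sketched as plain functions. This is a
hedged illustration: the 'DETAIL' label and the helper name failWithoutRetry are
made up for the example, and the requests are treated as plain objects so the
snippet runs without the SDK.

```javascript
// Stands in for the removed userData option of enqueueLinks: the function
// receives each request about to be enqueued and returns it, possibly modified.
function transformRequestFunction(request) {
    request.userData = { ...request.userData, label: 'DETAIL' };
    return request;
}

// Stands in for the removed request.doNotRetry(): set the flag, then throw so
// that the current attempt fails and is not retried.
function failWithoutRetry(request, message) {
    request.noRetry = true;
    throw new Error(message);
}

// With the SDK installed, the first function would be passed along these lines:
// await Apify.utils.enqueueLinks({ $, requestQueue, baseUrl, transformRequestFunction });
```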
