github apify/crawlee v1.3.0


Navigation hooks in CheerioCrawler

CheerioCrawler downloads web pages using the requestAsBrowser utility function.
Unlike the browser-based crawlers, which encode URLs automatically, the
requestAsBrowser function does not. We either need to encode the URLs manually
via the encodeURI() function, or set forceUrlEncoding: true in requestAsBrowserOptions,
which will automatically encode all URLs before they are accessed.

We can use either forceUrlEncoding or manual encoding, but not both: combining
them would double-encode the URLs and therefore produce invalid ones.
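To see why double encoding breaks URLs, here is a plain Node.js illustration (no Crawlee APIs involved, the example URL is made up):

```javascript
// A URL containing a space must be encoded exactly once.
const url = 'https://example.com/search?q=hello world';

// First pass: the space becomes %20, producing a valid URL.
const encodedOnce = encodeURI(url);

// Second pass: the % of %20 is itself escaped to %25,
// so the server would receive the literal text "hello%20world".
const encodedTwice = encodeURI(encodedOnce);

console.log(encodedOnce);  // https://example.com/search?q=hello%20world
console.log(encodedTwice); // https://example.com/search?q=hello%2520world
```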

We can use the preNavigationHooks option to adjust requestAsBrowserOptions:

preNavigationHooks: [
    (crawlingContext, requestAsBrowserOptions) => {
        requestAsBrowserOptions.forceUrlEncoding = true;
    }
]
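For context, a minimal CheerioCrawler setup with this hook might look like the following sketch (the start URL, request list name, and handlePageFunction body are placeholders, not part of this release note):

```js
const Apify = require('apify');

Apify.main(async () => {
    // placeholder source of requests
    const requestList = await Apify.openRequestList('start-urls', [
        'https://example.com/some page', // contains a character that needs encoding
    ]);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        // the hook runs before each request and can tweak requestAsBrowserOptions
        preNavigationHooks: [
            (crawlingContext, requestAsBrowserOptions) => {
                requestAsBrowserOptions.forceUrlEncoding = true;
            },
        ],
        handlePageFunction: async ({ request, $ }) => {
            console.log(`Visited ${request.url}`);
        },
    });

    await crawler.run();
});
```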

Apify class and Configuration

Adds two new named exports:

  • Configuration class that serves as the main configuration holder, replacing explicit usage of
    environment variables.
  • Apify class that allows configuring the SDK. Env vars still have precedence over the SDK configuration.

When using the Apify class, there should be no side effects.
Also adds new configuration for WAL mode in ApifyStorageLocal.

As an alternative to the global helper functions like Apify.main(), we can use the Apify class.
It has mostly the same API, but the methods on an Apify instance use the configuration provided in its constructor.
Environment variables still take precedence over this configuration.

const { Apify } = require('apify'); // use named export to get the class

const sdk = new Apify({ token: '123' });
console.log(sdk.config.get('token')); // '123'

// the token will be passed to the `call` method automatically
// note: `await` requires an async context, e.g. inside Apify.main()
const run = await sdk.call('apify/hello-world', { myInput: 123 });
console.log(`Received message: ${run.output.body.message}`);

Another example shows how the default dataset name can be changed:

const { Apify } = require('apify'); // use named export to get the class

const sdk = new Apify({ defaultDatasetId: 'custom-name' });
await sdk.pushData({ myValue: 123 });

is equivalent to:

const Apify = require('apify'); // use default export to get the helper functions

const dataset = await Apify.openDataset('custom-name');
await dataset.pushData({ myValue: 123 });

Full list of changes:

  • Add Configuration class and Apify named export, see above.
  • Fix proxyUrl without a port throwing an error when launching browsers.
  • Fix maxUsageCount of a Session not being persisted.
  • Update puppeteer and playwright to match stable Chrome (90).
  • Fix support for building TypeScript projects that depend on the SDK.
  • Add taskTimeoutSecs to allow control over the timeout of AutoscaledPool tasks.
  • Add forceUrlEncoding to requestAsBrowser options.
  • Add preNavigationHooks and postNavigationHooks to CheerioCrawler.
  • Deprecate the prepareRequestFunction and postResponseFunction options of CheerioCrawler.
  • Add new aborting event for gracefully handling a run aborted from the Apify platform.
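The new aborting event can be consumed through the Apify.events emitter; a minimal sketch (the persistence logic is a placeholder):

```js
const Apify = require('apify');

// fires when the run is gracefully aborted from the Apify platform,
// giving the actor a chance to persist its state before exiting
Apify.events.on('aborting', () => {
    console.log('Run is being aborted, persisting state...');
    // placeholder: save crawling progress here
});
```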
