Release v0.4


The biggest release of Scrapling yet — introducing the Spider framework, proxy rotation, and major parser improvements

This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.

🕷️ Spider Framework

A new async crawling framework built on top of anyio for structured, large-scale scraping:

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()

  • Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, Request/Response objects, and priority queue.
  • Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
  • Multi-Session Support: Unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID. Supports lazy session initialization.
  • Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to gracefully shut down; then restart to resume from where you left off.
  • Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats - ideal for UI, pipelines, and long-running crawls.
  • Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
  • Built-in Export: Export results through hooks and your own pipeline or the built-in JSON/JSONL with result.items.to_json() / result.items.to_jsonl() respectively.
  • Lifecycle hooks: on_start(), on_close(), on_error(), on_scraped_item(), and more hooks for full control over the crawl lifecycle.
  • Detailed crawl stats: track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log level counts, and more.
  • uvloop support: Pass use_uvloop=True to spider.start() for faster async execution when available.

A new section with the full details has been added to the website.

🔄 Proxy Rotation

  • New ProxyRotator class with thread-safe rotation. Works with all fetchers and sessions:
    from scrapling import ProxyRotator
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    Fetcher.get(url, proxy_rotator=rotator)
  • Custom rotation strategies: Implement your own proxy rotation logic.
  • Per-request proxy override: Pass proxy= to any individual get()/post()/fetch() call to override the session proxy for that request.

🌐 Browser Fetcher Improvements

  • Domain blocking: New blocked_domains parameter on DynamicFetcher/StealthyFetcher to block requests to specific domains (subdomains matched automatically).
  • Automatic retries: Browser fetchers now retry on failure with retries (default: 3) and retry_delay (default: 1s) parameters. Includes proxy-aware error detection.
  • Response metadata: Response.meta dict automatically stores the proxy used, and merges request metadata.
  • Response.follow(): Create follow-up Request objects with automatic referer flow, designed for the spider system.
  • No autoplay: Browser sessions now block autoplay content, which previously caused issues.
  • Stealth & speed: Improved both by adjusting browser flags.

🔧 Bug Fixes & Improvements

  • Parser optimization: Optimized the parser for repeated operations.
  • Errored pages: Fixed a bug that caused the browser to not close when pages gave errors.
  • Empty body: Responses with an empty body are now handled correctly.
  • Playwright loop: Fixed an issue where the Playwright loop was left open when the CDP connection failed.
  • Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.

⚠️ Breaking Changes

  • css_first/xpath_first removed: Use css('.selector').first, css('.selector')[0], or css('.selector').get() instead.
  • All selection now returns Selectors: css('::text'), xpath('//text()'), css('::attr(href)'), and xpath('//@href') now return Selectors (wrapping text nodes in Selector objects with tag="#text") instead of TextHandlers. This makes the API consistent across all selection methods and the type hints.
  • Response.body is always bytes: Previously could be str or bytes, now always returns bytes.
  • get()/getall() behavior: On Selector: get() returns TextHandler (serialized HTML or text value), getall() returns TextHandlers. Aliases: extract_first = get, extract = getall. Old get_all() on Selectors is removed.
  • Selectors.first/.last: Safe accessors that return Selector | None instead of raising IndexError.
  • Internal constants renamed: DEFAULT_FLAGS → DEFAULT_ARGS, DEFAULT_STEALTH_FLAGS → STEALTH_ARGS, HARMFUL_DEFAULT_ARGS → HARMFUL_ARGS, DEFAULT_DISABLED_RESOURCES → EXTRA_RESOURCES.

🔨 Other Changes

  • Dependency changes: Replaced tldextract with tld, removed internal _html_utils.py in favor of w3lib.html.replace_entities, added typing_extensions as a hard requirement.
  • Docs overhaul: Full switch from MkDocs to Zensical, new spider documentation section, updated all existing pages, and added new API references.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors
