The biggest release of Scrapling yet — introducing the Spider framework, proxy rotation, and major parser improvements
This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.
## 🕷️ Spider Framework

A new async crawling framework built on top of anyio for structured, large-scale scraping:
```python
from scrapling.spiders import Spider, Response


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}


MySpider().start()
```

- Scrapy-like Spider API: Define spiders with `start_urls`, async `parse` callbacks, `Request`/`Response` objects, and a priority queue.
- Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
- Multi-Session Support: A unified interface for HTTP requests and stealthy headless browsers in a single spider - route requests to different sessions by ID. Supports lazy session initialization.
- Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to shut down gracefully, then restart to resume from where you left off.
- Streaming Mode: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats - ideal for UIs, pipelines, and long-running crawls.
- Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
- Built-in Export: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with `result.items.to_json()`/`result.items.to_jsonl()` respectively.
- Lifecycle hooks: `on_start()`, `on_close()`, `on_error()`, `on_scraped_item()`, and more hooks for full control over the crawl lifecycle.
- Detailed crawl stats: Track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log-level counts, and more.
- uvloop support: Pass `use_uvloop=True` to `spider.start()` for faster async execution when available.
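The pause-and-resume idea can be illustrated with a minimal checkpoint sketch in plain Python. This is a conceptual stand-in, not Scrapling's actual persistence code; the `checkpoint.json` file name and the helper functions are hypothetical:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical checkpoint location


def load_done() -> set[str]:
    """Load the set of already-crawled URLs, if a checkpoint exists."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()


def save_done(done: set[str]) -> None:
    """Persist crawl progress so a later run can resume."""
    CHECKPOINT.write_text(json.dumps(sorted(done)))


def crawl(urls: list[str]) -> list[str]:
    """Process URLs, skipping any that a previous run already finished."""
    done = load_done()
    processed = []
    for url in urls:
        if url in done:
            continue  # already handled in a previous run
        processed.append(url)  # a real spider would fetch and parse here
        done.add(url)
        save_done(done)  # checkpoint after every URL, so Ctrl+C is safe
    return processed
```

Interrupting mid-run loses at most the URL currently in flight; restarting picks up from the last saved checkpoint.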
A new section covering the full details has been added to the website.
## 🔄 Proxy Rotation

- New `ProxyRotator` class with thread-safe rotation. Works with all fetchers and sessions:

  ```python
  from scrapling import Fetcher, ProxyRotator

  rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
  Fetcher.get(url, proxy_rotator=rotator)
  ```

- Custom rotation strategies: Make your own proxy rotation logic.
- Per-request proxy override: Pass `proxy=` to any individual `get()`/`post()`/`fetch()` call to override the session proxy for that request.
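The rotation concept can be sketched in plain Python as a lock-guarded round-robin over the proxy list. This `RoundRobinRotator` is illustrative only, not the library's `ProxyRotator` implementation:

```python
import threading
from itertools import cycle


class RoundRobinRotator:
    """Minimal thread-safe round-robin proxy rotation (illustrative only)."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = cycle(proxies)
        self._lock = threading.Lock()

    def next(self) -> str:
        # The lock keeps rotation consistent when several fetcher threads
        # share one rotator instance.
        with self._lock:
            return next(self._cycle)
```

Each call hands out the next proxy in order and wraps around at the end of the list; a custom strategy would swap the round-robin for, say, weighted or health-based selection.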
## 🌐 Browser Fetcher Improvements

- Domain blocking: New `blocked_domains` parameter on `DynamicFetcher`/`StealthyFetcher` to block requests to specific domains (subdomains are matched automatically).
- Automatic retries: Browser fetchers now retry on failure, controlled by the `retries` (default: 3) and `retry_delay` (default: 1s) parameters. Includes proxy-aware error detection.
- Response metadata: The `Response.meta` dict automatically stores the proxy used and merges request metadata.
- `Response.follow()`: Create follow-up `Request` objects with automatic referer flow, designed for the spider system.
- No autoplay: Browser sessions now block autoplay content, which previously caused issues.
- Speed: Improved stealth and speed by adjusting browser flags.
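The subdomain matching described for `blocked_domains` can be sketched with a small helper: a host is blocked if it equals a blocked domain or ends with `.` followed by that domain. This is an illustrative function, not Scrapling's internal matching code:

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """Return True if the URL's host is a blocked domain or one of its
    subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in blocked_domains
    )
```

The `.` prefix in the suffix check is what prevents false positives such as `nottracker.io` matching a blocked `tracker.io`.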
## 🔧 Bug Fixes & Improvements

- Parser optimization: Optimized the parser for repeated operations, improving performance.
- Errored pages: Fixed a bug that prevented the browser from closing when pages returned errors.
- Empty body: Responses with an empty body are now handled correctly.
- Playwright loop: Fixed an issue where the Playwright loop was left open when the CDP connection failed.
- Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.
## ⚠️ Breaking Changes

- `css_first`/`xpath_first` removed: Use `css('.selector').first`, `css('.selector')[0]`, or `css('.selector').get()` instead.
- All selection now returns `Selectors`: `css('::text')`, `xpath('//text()')`, `css('::attr(href)')`, and `xpath('//@href')` now return `Selectors` (wrapping text nodes in `Selector` objects with `tag="#text"`) instead of `TextHandlers`. This makes the API and the type hints consistent across all selection methods.
- `Response.body` is always `bytes`: Previously it could be `str` or `bytes`; now it always returns `bytes`.
- `get()`/`getall()` behavior: On `Selector`, `get()` returns a `TextHandler` (serialized HTML or text value) and `getall()` returns `TextHandlers`. Aliases: `extract_first = get`, `extract = getall`. The old `get_all()` on `Selectors` is removed.
- `Selectors.first`/`.last`: Safe accessors that return `Selector | None` instead of raising `IndexError`.
- Internal constants renamed: `DEFAULT_FLAGS` → `DEFAULT_ARGS`, `DEFAULT_STEALTH_FLAGS` → `STEALTH_ARGS`, `HARMFUL_DEFAULT_ARGS` → `HARMFUL_ARGS`, `DEFAULT_DISABLED_RESOURCES` → `EXTRA_RESOURCES`.
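The safe `.first`/`.last` semantics can be mimicked with a small list subclass. This `SafeList` is a conceptual sketch of the behavior, not the actual `Selectors` class:

```python
class SafeList(list):
    """List whose .first/.last return None when empty, like Selectors."""

    @property
    def first(self):
        """First element, or None if the list is empty (no IndexError)."""
        return self[0] if self else None

    @property
    def last(self):
        """Last element, or None if the list is empty."""
        return self[-1] if self else None
```

This is why migrated code should prefer `css('.selector').first` over `css('.selector')[0]` when a match is not guaranteed: the former degrades to `None` instead of raising.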
## 🔨 Other Changes

- Dependency changes: Replaced `tldextract` with `tld`, removed the internal `_html_utils.py` in favor of `w3lib.html.replace_entities`, and added `typing_extensions` as a hard requirement.
- Docs overhaul: Full switch from MkDocs to Zensical, a new spider documentation section, updates to all existing pages, and new API references.
🙏 Special thanks to our Discord community for all the continuous testing and feedback