github D4Vinci/Scrapling v0.4.5
Release v0.4.5


A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉

🚀 New Stuff and Quality-of-Life Changes

  • Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:

    class MySpider(Spider):
        name = "my_spider"
        start_urls = ["https://example.com"]
        development_mode = True  # cache responses on the first run, replay from disk after

        async def parse(self, response):
            yield {"title": response.css("title::text").get("")}

    The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, let you see how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True -- it's a development tool, not a production cache. See the docs for the full story.

  • Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.
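The kind of check behind "safe" redirects can be approximated with the standard library alone: resolve the redirect target's hostname and reject it if any resulting address is loopback, private, or link-local. A hedged sketch of that idea, not Scrapling's actual code (the function name is hypothetical):

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_unsafe_redirect(location: str) -> bool:
    """Return True if a redirect target resolves to an internal address
    (illustrative sketch of an SSRF guard, not Scrapling's implementation)."""
    host = urlparse(location).hostname
    if host is None:
        return True  # no hostname at all: refuse to follow
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # unresolvable: treat as unsafe
    for *_, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        # Loopback (127.0.0.0/8), private (RFC 1918), and link-local
        # (169.254.0.0/16, where cloud metadata endpoints live) are all rejected
        if ip.is_loopback or ip.is_private or ip.is_link_local:
            return True
    return False
```

Note that resolving the name and checking every returned address matters: a DNS name for a user-supplied URL can point anywhere, including 127.0.0.1.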

🐛 Bug Fixes

  • Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.

🙏 Special thanks to the community for all the continuous testing and feedback



