A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉
🚀 New Stuff and quality of life changes
- Spider Development Mode: Iterating on a spider's `parse()` logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays it from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:

  ```python
  class MySpider(Spider):
      name = "my_spider"
      start_urls = ["https://example.com"]
      development_mode = True

      async def parse(self, response):
          yield {"title": response.css("title::text").get("")}
  ```

  The cache lives in `.scrapling_cache/{spider.name}/` by default and can be redirected anywhere with `development_cache_dir`. Two new stat counters, `cache_hits` and `cache_misses`, let you see how the cache performed. Cache replay bypasses `download_delay`, rate limiting, and the blocked-request retry path, so iteration is as fast as the disk allows. Don't ship a spider with `development_mode = True` -- it's a development tool, not a production cache. See the docs for the full story.
- Safer redirects by default: `follow_redirects` now defaults to `"safe"` across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass `follow_redirects="all"` to get the old behavior, or `False` to disable redirects entirely.
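For a feel of what the cache does under the hood, here is a minimal, Scrapling-independent sketch of the cache-and-replay idea. The `ResponseCache` class, its `fetch` callable, and the hashing scheme are illustrative assumptions, not Scrapling's actual implementation:

```python
import hashlib
import pathlib


class ResponseCache:
    """Toy dev-mode cache: the first request for a URL goes to the
    network and is stored on disk; every later request replays the
    stored body without touching the network."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = pathlib.Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch          # real network fetcher: url -> bytes
        self.cache_hits = 0
        self.cache_misses = 0

    def get(self, url: str) -> bytes:
        key = hashlib.sha256(url.encode()).hexdigest()
        path = self.cache_dir / key
        if path.exists():
            self.cache_hits += 1    # replay from disk, no network request
            return path.read_bytes()
        self.cache_misses += 1      # first run: hit the network, then store
        body = self.fetch(url)
        path.write_bytes(body)
        return body
```

With a setup like this, a second run of the same spider produces only cache hits, which is why replay can safely skip delays and rate limiting: no server ever sees the traffic.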
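The kind of check `"safe"` mode performs can be sketched with the standard library alone. `redirect_is_safe` below is a hypothetical helper, not Scrapling's internal function, and a real fetcher also has to worry about scheme changes, redirect loops, and DNS rebinding:

```python
import ipaddress
import socket
from urllib.parse import urlparse


def redirect_is_safe(location: str) -> bool:
    """Reject redirect targets that resolve to loopback, private,
    or link-local addresses -- the classic SSRF escape hatches."""
    host = urlparse(location).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False                # unresolvable: refuse to follow
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_loopback or ip.is_private or ip.is_link_local:
            return False            # internal target: block the redirect
    return True                     # public address: safe to follow
```

Note that the check runs on every address the hostname resolves to, so a public-looking hostname that resolves to `127.0.0.1` is still rejected.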
🐛 Bug Fixes
- Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with `crawldir` enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving `paused=False` and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling `cancel_scope.cancel()`, so a force-stop always preserves the latest pending state. By @voidborne-d in #230.
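The ordering fix can be illustrated with a small asyncio sketch. The `crawl`/`force_stop` names and the checkpoint format here are invented for illustration; Scrapling's engine uses its own cancel scope and pickle layout:

```python
import asyncio
import pickle


async def crawl(state):
    # Stand-in for the crawl loop: keeps making progress until cancelled.
    while True:
        state["pages_done"] += 1
        await asyncio.sleep(0)


async def force_stop(task, state, checkpoint_path):
    # The fix in a nutshell: persist the checkpoint BEFORE cancelling,
    # so teardown can never race the write and discard the progress.
    checkpoint_path.write_bytes(pickle.dumps({**state, "paused": True}))
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


async def main(checkpoint_path):
    state = {"pages_done": 0}
    task = asyncio.create_task(crawl(state))
    await asyncio.sleep(0.01)       # let the crawl make some progress
    await force_stop(task, state, checkpoint_path)
    return pickle.loads(checkpoint_path.read_bytes())
```

Had the cancel come first (the old order), the writer could be torn down mid-pickle and the `paused` flag never set, which is exactly the state that used to trigger the checkpoint-deleting cleanup path.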
🙏 Special thanks to the community for all the continuous testing and feedback