A big spider update that takes the crawling framework to the next level 🕷️
🚀 New Stuff and quality of life changes
- Added a `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` to pull URLs out of a `Response`. There are a lot of controls (check the docs):

  ```python
  from scrapling.spiders import LinkExtractor

  extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
  ```
- Added `CrawlSpider` and `CrawlRule` generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override `rules()` to return a list of `CrawlRule` objects, each pairing a `LinkExtractor` with an optional callback (check the docs):

  ```python
  from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

  class QuotesSpider(CrawlSpider):
      name = "blog"
      start_urls = ["https://quotes.toscrape.com/"]

      def rules(self):
          return [
              CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
              CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
          ]

      async def parse_author(self, response):
          yield {
              "name": response.css(".author-title::text").get(),
              "birthday": response.css(".author-born-date::text").get(),
              "url": response.url,
          }
  ```
- Added a `SitemapSpider` template that seeds a crawl directly from sitemap or `robots.txt` URLs. It handles gzip-compressed sitemaps and comes with a lot of controls and options; discovered URLs are dispatched through the crawl rules, as shown above for `CrawlSpider` (check the docs):

  ```python
  from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

  class NewsSitemap(SitemapSpider):
      name = "news"
      sitemap_urls = ["https://example.com/robots.txt"]

      def rules(self):
          return [
              CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
          ]

      async def parse_article(self, response):
          yield {"url": response.url, "title": response.css("h1::text").get()}
  ```
- Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, which makes the adaptive feature work better out of the box. When nothing crosses the threshold, a warning now tells you the top score it did see, so you can lower `percentage` deliberately if needed.
- Updated all browsers and fingerprints. Run `scrapling install --force` after updating to refresh them.
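To illustrate what the gzip handling in `SitemapSpider` involves, here is a minimal standard-library sketch (not Scrapling's actual implementation; `extract_sitemap_urls` is an illustrative name) that detects a gzip payload by its magic bytes and pulls the `<loc>` URLs out of a sitemap:

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(raw: bytes) -> list[str]:
    """Return the <loc> URLs from a sitemap payload, decompressing gzip if needed."""
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

# Example: a tiny gzip-compressed sitemap.
xml = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/articles/1</loc></url>"
    b"<url><loc>https://example.com/articles/2</loc></url>"
    b"</urlset>"
)
print(extract_sitemap_urls(gzip.compress(xml)))
# ['https://example.com/articles/1', 'https://example.com/articles/2']
```

The spider itself does this for you; the sketch only shows why feeding it a `.xml.gz` sitemap URL works the same as a plain one.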
🐛 Bug Fixes
- Fixed `Fetcher.configure(...)` not applying to per-request calls. The same fix was applied to `AsyncFetcher`.
- Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
- Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
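The threshold-plus-warning behavior can be sketched with the standard library. Here `difflib` stands in for Scrapling's internal similarity scoring, and `relocate`/`DEFAULT_PERCENTAGE` are illustrative names rather than the library's API:

```python
import difflib
import warnings

DEFAULT_PERCENTAGE = 0.40  # the new default similarity threshold (40%)

def relocate(target: str, candidates: list[str], percentage: float = DEFAULT_PERCENTAGE):
    """Return the best-scoring candidate, or warn and return None when
    nothing crosses the threshold (instead of a misleading best guess)."""
    scored = [(difflib.SequenceMatcher(None, target, c).ratio(), c) for c in candidates]
    best_score, best = max(scored)
    if best_score < percentage:
        # Report the top score seen, so `percentage` can be lowered deliberately.
        warnings.warn(
            f"No match crossed the {percentage:.0%} threshold; "
            f"top score seen was {best_score:.0%}"
        )
        return None
    return best

# A close variant relocates; an unrelated candidate only triggers the warning.
print(relocate("div.product-title", ["div.product-title ", "footer"]))  # close match found
```

With the old default of 0, the weakest candidate would always have been returned silently; the warning plus non-zero threshold is what makes failed relocations visible.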
Docs
- Refreshed older code examples across the documentation to match the current version.
- Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.
🙏 Special thanks to the community for all the continuous testing and feedback
Big shoutout to our Platinum Sponsors