github jaypyles/Scraperr v1.0.6
v1.0.6 (Media Collection)

3 months ago

✨ Add Media Collection to Scraping Pipeline

Summary

This PR introduces the collect_media function, which automatically detects and downloads media assets from a web page during a Selenium-controlled browser session.

🔧 Features

Supported Media Types:

  • Images (<img>)
  • Videos (<video>)
  • Audio files (<audio>)
  • PDFs (<a href="*.pdf">)
  • Documents (.doc, .docx, .txt, .rtf)
  • Presentations (.ppt, .pptx)
  • Spreadsheets (.xls, .xlsx, .csv)
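One way to picture the list above is as a table mapping each media type to a CSS selector. The sketch below is illustrative only: MEDIA_EXTENSIONS, MEDIA_SELECTORS, and selector_for_extensions are hypothetical names, and the selectors are assumptions rather than the ones Scraperr actually uses.

```python
# Hypothetical mapping of media types to CSS selectors.
# None of these names or selectors are taken from Scraperr's source.

def selector_for_extensions(extensions):
    """Build a CSS selector matching <a> tags whose href ends in any extension."""
    return ", ".join(f'a[href$="{ext}"]' for ext in extensions)

# Link-based media types, keyed by the file extensions listed in the notes.
MEDIA_EXTENSIONS = {
    "pdfs": [".pdf"],
    "documents": [".doc", ".docx", ".txt", ".rtf"],
    "presentations": [".ppt", ".pptx"],
    "spreadsheets": [".xls", ".xlsx", ".csv"],
}

# Tag-based media types use the element name directly.
MEDIA_SELECTORS = {
    "images": "img",
    "videos": "video",
    "audio": "audio",
}
MEDIA_SELECTORS.update(
    {name: selector_for_extensions(exts) for name, exts in MEDIA_EXTENSIONS.items()}
)
```

Note that an attribute selector such as a[href$=".pdf"] only matches links that end with the extension, so a URL carrying a query string after the filename would be missed by this sketch.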

Functionality:

  • Uses CSS selectors to find elements containing media links.
  • Downloads each media file whose URL uses HTTP or HTTPS.
  • Saves all assets to a structured media/ directory, grouped by media type.
  • Writes a download_summary.txt with the original URLs and their local file paths.
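The steps above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: is_downloadable, target_path, summary_lines, and write_summary are hypothetical helpers, and the "url -> path" summary format is an assumption, since the notes do not specify the layout of download_summary.txt.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_downloadable(url: str) -> bool:
    """Accept only HTTP/HTTPS URLs (skips data:, blob:, ftp:, and so on)."""
    return urlparse(url).scheme in ("http", "https")


def target_path(base_dir: Path, media_type: str, url: str) -> Path:
    """Place the file under base_dir/<media_type>/, named after the URL path."""
    name = Path(urlparse(url).path).name
    return base_dir / media_type / name


def summary_lines(downloads):
    """Format (url, local_path) pairs; the 'url -> path' layout is assumed."""
    return [f"{url} -> {Path(path).as_posix()}" for url, path in downloads]


def write_summary(base_dir: Path, downloads) -> Path:
    """Write download_summary.txt into base_dir and return its path."""
    base_dir.mkdir(parents=True, exist_ok=True)
    summary = base_dir / "download_summary.txt"
    summary.write_text("\n".join(summary_lines(downloads)) + "\n", encoding="utf-8")
    return summary
```

For example, target_path(Path("media"), "images", "https://example.com/img/a.png") yields a path under media/images/, which matches the grouped directory layout described above.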

Error Handling:

  • Skips failed downloads and logs the error.
  • Generates fallback filenames when none are detected in the URL.
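A rough sketch of both behaviours is shown below. The names filename_from_url and download_all, the "<media_type>_<index>" fallback pattern, and the injected fetch callable are all assumptions made for illustration; the release notes do not describe how Scraperr names fallback files or performs the download.

```python
import logging
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

LOG = logging.getLogger(__name__)


def filename_from_url(url: str, media_type: str, index: int) -> str:
    """Use the last path segment of the URL, or a generated fallback name."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    if name:
        return name
    # Assumed fallback pattern: media type plus a running index.
    return f"{media_type}_{index}"


def download_all(urls, fetch):
    """Call fetch(url) for each URL; log and skip any URL that raises.

    fetch is a caller-supplied callable (hypothetical) that downloads one
    URL and returns the local path it was saved to.
    """
    saved, failed = [], []
    for url in urls:
        try:
            saved.append((url, fetch(url)))
        except Exception as exc:  # one bad URL should not stop the batch
            LOG.error("Failed to download %s: %s", url, exc)
            failed.append(url)
    return saved, failed
```

Passing the download step in as a callable keeps the skip-and-log logic separate from the HTTP client, which makes the error path easy to exercise without network access.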
