Release: Improved Offline Mirroring and External Asset Handling
This release makes offline site mirroring more complete: asset discovery is broader, link rewriting covers more cases, and handling is safer for modern websites that rely on external resources such as CDNs, manifests, responsive images, and inline CSS references.
✨ Highlights
Added external domain whitelisting
You can now limit external asset downloads to only approved domains using:
--external-domains
This also automatically enables external asset downloading, making it easier to mirror CDN-heavy websites in a more controlled way.
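As an illustration of what a domain whitelist check can look like, here is a minimal sketch. The function name `is_allowed_external` and the choice to also accept subdomains of an approved domain are assumptions made for this example, not details confirmed by the release.

```python
from urllib.parse import urlsplit


def is_allowed_external(url: str, allowed_domains: set[str]) -> bool:
    """Return True if the URL's host is an approved domain or a subdomain of one.

    Hypothetical helper: accepting subdomains is an assumption of this sketch.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = {"cdn.example.com", "fonts.example.net"}
print(is_allowed_external("https://cdn.example.com/app.css", allowed))
print(is_allowed_external("https://tracker.example.org/p.js", allowed))
```

Matching on the parsed hostname rather than on a substring of the URL avoids accidentally approving hosts such as `cdn.example.com.evil.test`.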
Improved external asset support
External assets are now handled more cleanly for offline use:
- external resources can be stored under cdn/<domain>/...
- rewritten references now better support localized CDN assets
- external assets can be selectively allowed instead of downloading everything
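The cdn/<domain>/... layout mentioned above could be produced along these lines. This is a sketch only: `external_local_path` is a hypothetical name, and the `index.html` fallback for empty paths is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def external_local_path(url: str) -> str:
    """Map an external asset URL to a relative path under cdn/<domain>/.

    Hypothetical helper; the "index.html" fallback is an assumption.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "unknown-host").lower()
    path = parts.path.lstrip("/") or "index.html"
    return str(PurePosixPath("cdn") / host / path)


print(external_local_path("https://cdn.example.com/lib/app.min.js"))
```

A real implementation would also need to sanitize the path segments and guard against `..` components before writing to disk.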
Expanded rewrite coverage
Offline rewriting now supports more resource types and locations, including:
- src
- href
- data-src
- poster
- srcset
- inline style URLs
- inline <style> blocks
- CSS url(...)
- CSS @import
- common static asset references inside downloaded JS files
- og:image
- twitter:image
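Of the locations listed above, srcset is the one that needs a little parsing, because each candidate pairs a URL with an optional width or density descriptor. A minimal sketch of rewriting it is shown below, where `rewrite_srcset` and the `rewrite` callback are hypothetical names, and the comma split assumes no URL in the attribute contains a comma.

```python
from typing import Callable


def rewrite_srcset(srcset: str, rewrite: Callable[[str], str]) -> str:
    """Apply rewrite(url) to each candidate URL, keeping its descriptor.

    Assumption: URLs contain no commas (data: URIs would break this split).
    """
    rewritten = []
    for candidate in srcset.split(","):
        fields = candidate.split()
        if not fields:
            continue  # skip empty candidates such as a trailing comma
        fields[0] = rewrite(fields[0])
        rewritten.append(" ".join(fields))
    return ", ".join(rewritten)


print(rewrite_srcset("a.png 1x, b.png 2x", lambda u: "img/" + u))
```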
Broader modern web resource support
Support has been expanded for more <link rel> resource types, including:
- stylesheet
- icon
- shortcut
- apple-touch-icon
- preload
- modulepreload
- manifest
Better URL normalization and protocol-relative handling
Improved canonicalization and protocol-relative URL handling now help reduce malformed paths and duplicate fetches.
Examples:
- protocol-relative URLs such as //cdn.example.com/file.css
- default port normalization
- fragment removal for stable deduplication
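The three normalizations above can be sketched with the standard library. `normalize_url` and `DEFAULT_PORTS` are names chosen for this example and may not match the script's own helpers; the sketch also ignores userinfo and IPv6 literals.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(raw: str, base: str) -> str:
    """Resolve raw against base, drop default ports, and strip the fragment."""
    absolute = urljoin(base, raw)  # //host/path inherits the base scheme
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    # An empty fragment means two URLs differing only by #anchor deduplicate.
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


base = "https://example.com/"
print(normalize_url("//cdn.example.com/file.css", base))
print(normalize_url("https://Example.com:443/a#top", base))
```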
Safer filesystem handling
Path handling has been hardened to reduce failures across platforms:
- illegal character replacement
- reserved filename protection
- segment sanitization
- long-path shortening
- hashed fallbacks for overly long filenames
- safer handling of query-string collisions
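To make several of these measures concrete, here is a minimal sketch of sanitizing a single path segment. The character set, the reserved-name list (Windows device names), the 120-character limit, and the name `safe_segment` are all assumptions for illustration. Query-string collisions are not covered here.

```python
import hashlib
import re

ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')  # characters Windows rejects
RESERVED = {"CON", "PRN", "AUX", "NUL",
            *(f"COM{i}" for i in range(1, 10)),
            *(f"LPT{i}" for i in range(1, 10))}
MAX_LEN = 120  # assumed limit, not taken from the release


def safe_segment(name: str, max_len: int = MAX_LEN) -> str:
    """Replace illegal characters, protect reserved names, shorten long names."""
    cleaned = ILLEGAL.sub("_", name).rstrip(" .") or "_"
    stem, dot, ext = cleaned.rpartition(".")
    if not dot:  # no extension present
        stem, ext = cleaned, ""
    if stem.split(".")[0].upper() in RESERVED:
        stem = "_" + stem
    result = stem + dot + ext
    if len(result) > max_len:
        # Hashed fallback: truncate the stem and append a digest of the
        # original name so distinct long names stay distinct.
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:12]
        suffix = "-" + digest + dot + ext
        result = stem[: max(1, max_len - len(suffix))] + suffix
    return result


print(safe_segment("con.txt"))
print(safe_segment("a?b=1.css"))
```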
Improved CSS and JS asset discovery
Downloaded CSS files are now scanned and rewritten for asset references such as fonts, images, and imports.
Downloaded JS files can also rewrite obvious static asset URLs when they point to known file types, improving offline compatibility for some frontend bundles.
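As a rough illustration of CSS scanning, the sketch below pulls url(...) and quoted @import targets out of a stylesheet with regular expressions. The patterns and the name `css_references` are assumptions for this example, and a regex pass like this does not handle every CSS edge case (escaped parentheses, comments).

```python
import re

CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
CSS_IMPORT = re.compile(r"""@import\s+(['"])(.*?)\1""", re.IGNORECASE)


def css_references(css: str) -> list[str]:
    """Collect asset references from url(...) and quoted @import rules."""
    refs = [m.group(2) for m in CSS_URL.finditer(css)]
    refs += [m.group(2) for m in CSS_IMPORT.finditer(css)]
    # Inline data: URIs are already embedded, so there is nothing to fetch.
    return [r for r in refs if r and not r.startswith("data:")]


sample = (
    '@import "base.css"; @import url(theme.css); '
    'body { background: url("img/bg.png"); } '
    "@font-face { src: url('fonts/f.woff2'); }"
)
print(css_references(sample))
```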
Better non-fetchable URL handling
The crawler now skips more unsupported schemes safely, including:
- mailto:
- tel:
- sms:
- javascript:
- data:
- geo:
- blob:
- about:
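A scheme filter for the list above can be as small as the following sketch. `SKIPPED_SCHEMES` and `is_fetchable` are hypothetical names, and treating any URL without a colon as fetchable is an assumption of this example.

```python
SKIPPED_SCHEMES = {"mailto", "tel", "sms", "javascript", "data", "geo", "blob", "about"}


def is_fetchable(url: str) -> bool:
    """Return False for URLs whose scheme is on the skip list."""
    scheme, sep, _ = url.strip().partition(":")
    if not sep:
        return True  # no colon: a relative or protocol-relative reference
    return scheme.lower() not in SKIPPED_SCHEMES


for link in ("https://example.com/", "mailto:a@example.com", "JavaScript:void(0)"):
    print(link, is_fetchable(link))
```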
Optional Brotli-aware request handling
The script now detects Brotli support and adjusts Accept-Encoding automatically when available.
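One way such detection can work is to check whether a Brotli decoder package is importable before advertising `br`. The sketch below assumes the `brotli` or `brotlicffi` packages are the ones that matter; the release does not say how the script performs this check.

```python
import importlib.util


def accept_encoding() -> str:
    """Advertise br only when a Brotli decoder package can be imported."""
    has_brotli = any(
        importlib.util.find_spec(name) is not None
        for name in ("brotli", "brotlicffi")  # assumed package names
    )
    return "gzip, deflate, br" if has_brotli else "gzip, deflate"


headers = {"Accept-Encoding": accept_encoding()}
print(headers)
```

Advertising `br` without a decoder installed would risk receiving responses the client cannot decompress, which is why the header is built conditionally.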
Improved offline compatibility for localized external assets
When external assets are rewritten locally, integrity and crossorigin attributes are removed from localized <script> and <link> tags where needed to avoid offline loading problems.
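The reason this matters: a localized copy served from disk has no CORS headers, so a tag that still carries integrity or crossorigin can be blocked by the browser. A regex-based sketch of stripping the two attributes from a single tag string follows. `strip_sri` is a hypothetical name, and a real implementation would more likely edit attributes through an HTML parser than through a regex.

```python
import re

# Matches integrity/crossorigin with a quoted, unquoted, or missing value.
SRI_ATTR = re.compile(
    r"""\s+(?:integrity|crossorigin)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?""",
    re.IGNORECASE,
)


def strip_sri(tag: str) -> str:
    """Remove integrity and crossorigin attributes from one tag string."""
    return SRI_ATTR.sub("", tag)


print(strip_sri(
    '<script src="cdn/cdn.example.com/app.js" '
    'integrity="sha384-abc" crossorigin="anonymous"></script>'
))
```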
🔧 CLI
New
--external-domains
Existing
- --url
- --destination
- --max-pages
- --threads
- --download-external-assets
📌 Notes
This release is focused on making the downloader more reliable for modern websites with:
- CDN-hosted assets
- responsive images
- inline style-based assets
- CSS imports
- social preview images
- manifest and icon resources
It should provide a more complete offline mirror than the previous release, especially for sites that depend on external static assets.
⚠️ Limitations
This is still best suited for static or mostly server-rendered websites.
Some sites may still require additional handling if they depend heavily on:
- authentication flows
- JavaScript-driven navigation
- API-loaded content
- dynamic tokens or runtime state
🙌 Feedback
If you test this release on a site that previously had missing assets or broken offline rendering, feel free to open an issue with:
- target URL
- command used
- what improved
- what still failed
- relevant log snippets
What's Changed
- Update website-downloader.py by @PKHarsimran in #34
- Update website-downloader.py by @PKHarsimran in #35
- Update website-downloader.py by @PKHarsimran in #36
- Update website-downloader.py by @PKHarsimran in #37
- Update website-downloader.py by @PKHarsimran in #38
Full Changelog: v2.3.2...v2.4.0