Warning
This is a pre-release for the upcoming ArchiveBox v0.9.x line. The latest stable release is still v0.7.4. Please test on a backup first and report regressions before using this for a production collection.
v0.9.x is a large architectural upgrade from v0.8.x: plugin execution has moved out into a standalone plugin ecosystem, archiving runs through the new one-shot abx-dl CLI, and the old pluggy path has been replaced by an event-driven, append-only-log architecture designed to support future browser-extension capture and p2p sync.
⬇️ RC Instructions: use the :dev Docker image with the repo docker-compose.yml
git clone https://github.com/ArchiveBox/ArchiveBox && cd ArchiveBox
ARCHIVEBOX_IMAGE=archivebox/archivebox:dev docker compose up -d
Highlights
- 🔌 New plugin system split out into abx-plugins
  The extractor/plugin catalog now lives in github.com/ArchiveBox/abx-plugins, with per-plugin config, dependencies, hooks, docs, and install metadata.
- ⚡ New one-shot CLI powered by abx-dl
  ArchiveBox now builds on the standalone downloader at github.com/ArchiveBox/abx-dl, so plugin-based archiving can run inside ArchiveBox or independently as a focused CLI.
- 🧱 No more pluggy; new event-driven runtime
  v0.9.x replaces the old pluggy-style in-process plugin system with an event-driven, append-only-log flow. This gives us cleaner resumability, auditability, and a path toward future browser extension capture, distributed workers, and p2p sync.
- 🗃️ Safer data layout and migrations
  The new layout keeps user-owned snapshot/crawl data under data/archive/users/{username}/..., creates crawl records for migrated snapshots, and preserves legacy ArchiveResult data instead of rediscovering everything from slow filesystem scans.
- 🖥️ Improved browser isolation and output serving
  Chrome-based extractors now use crawl-scoped browser sessions, clean up cloned profiles after each crawl, and serve snapshot outputs through web.archivebox.io / snap-*.archivebox.io style isolation for safer replay.
- 🧭 Tons of UI, CLI, REST API, and database improvements
  Faster snapshot/admin list views, better live progress visibility, improved ArchiveResult detail pages, cleaner REST API paths, and more durable Process/Crawl/Snapshot database state.
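For readers unfamiliar with the append-only-log idea behind the new runtime, here is a minimal sketch of how such a log can give resumability. It is not ArchiveBox's actual implementation: the record fields (type, id, status), the status values, and the helper names are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def append_event(log_path, record):
    """Append one JSON record as a single line; earlier lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def replay(log_path):
    """Fold the log into the latest recorded status per (type, id)."""
    state = {}
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Later lines win, so the last event for a key is its current status.
            state[(record["type"], record["id"])] = record["status"]
    return state


def unfinished(state):
    """Keys whose last status is not 'succeeded' are candidates for resuming."""
    return sorted(key for key, status in state.items() if status != "succeeded")


def demo():
    # Hypothetical records: field names and statuses are assumptions of this sketch.
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "events.jsonl")
        append_event(log_path, {"type": "Snapshot", "id": "s1", "status": "started"})
        append_event(log_path, {"type": "ArchiveResult", "id": "r1", "status": "started"})
        append_event(log_path, {"type": "ArchiveResult", "id": "r1", "status": "succeeded"})
        return unfinished(replay(log_path))
```

Because state is derived by replaying the file rather than by mutating rows in place, an interrupted run can be resumed by replaying the log and picking up whatever is still unfinished, and the file itself doubles as an audit trail.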
What's Changed
- 🔌 Extractor plugins moved into abx-plugins, with generated plugin docs at archivebox.github.io/abx-plugins.
- ⚡ Archiving execution now goes through abx-dl, which can also be used directly: uvx abx-dl --plugins=title,screenshot,singlefile 'https://example.com'
- 🧾 Runtime execution now writes structured append-only JSONL records for Crawls, Processes, Snapshots, and ArchiveResults.
- 🔄 Migrations now preserve legacy Snapshot, Tag, ArchiveResult, and filesystem output data across 0.7.x / 0.8.x → 0.9.x upgrades.
- 📁 Heavy crawl/snapshot data stays under data/archive/users/... for Docker volume compatibility.
- 🌐 Added subdomain-aware replay for public web, admin, API, and isolated snapshot hosts.
- 🔒 Added stricter public/admin/API separation and safer default demo deployment options, including disabling public add flows by default.
- 🧩 Browser extractors now share crawl-scoped Chrome sessions and centralize profile cleanup / lock handling through chrome_utils.js.
- 📸 SingleFile, screenshot, DOM, readability, media, git, gallery, forum, and other extractors now run as plugin hooks with explicit outputs.
- 🛠️ Improved archivebox add, archivebox run, archivebox update, archivebox version, archivebox status, and dependency install flows.
- 🚀 Docker image now includes the new unified runtime stack, with Sonic search managed inside the main container.
- 🧪 Validated against large migrated collections, including the public demo dataset, with resumable migration behavior and preserved DB/file outputs.
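If you want to script the standalone abx-dl invocation quoted above, something like the following sketch may help. It only assembles the exact command shown in these notes (uvx abx-dl --plugins=... URL); the helper name build_abx_dl_command and the empty-list check are assumptions of this example, not part of abx-dl.

```python
import shlex


def build_abx_dl_command(url, plugins):
    """Build the argv for the one-shot invocation quoted in the release notes."""
    # Assumption of this sketch: refuse an empty plugin list rather than
    # guess what abx-dl does when --plugins= is given no names.
    if not plugins:
        raise ValueError("at least one plugin name is required")
    return ["uvx", "abx-dl", "--plugins=" + ",".join(plugins), url]


cmd = build_abx_dl_command(
    "https://example.com",
    ["title", "screenshot", "singlefile"],
)
# Show the command as a shell-quoted string for copy/paste.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would run it, which requires uvx to be installed and network access; building the argv as a list avoids shell-quoting problems with URLs that contain characters like & or ?.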
Helpful related projects and resources:
- SingleFile for self-contained HTML captures
- yt-dlp for media extraction
- gallery-dl for gallery/media sites
- Django Ninja for the REST API
- Sonic for lightweight search indexing
Full Changelog: v0.8.5-rc...v0.9.31-rc