New features
- new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
- ability to save multiple snapshots of the same URL over time using the new Re-snapshot button
- add init --quick and server --quick-init options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
- add new archivebox setup command and archivebox init --setup flag to aid in automatically installing dependencies and creating a superuser during initial setup
- new SNAPSHOTS_PER_PAGE=40 and MEDIA_MAX_SIZE=750m config options
- allow hotlinking directly to specific extractor output on the snapshot detail page using URL #hash, e.g. /archive/<timestamp>/index.html#git
- add ability to view the snapshot matching a given URL by visiting /archive/https://example.com/some/url -> redirects to -> /archive/<timestamp>/index.html (also works without scheme: /archive/example.com)
- #660 add ability to tag URLs while adding them via the web UI and via the CLI using archivebox add --tag=tag1,tag2,tag3 ...
- #659 add back ability to override visual styling with custom HTML and CSS using new config option CUSTOM_TEMPLATES_DIR
- ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown
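The URL-based snapshot lookup described above can be illustrated with a minimal sketch. This is not ArchiveBox's actual implementation: the `SNAPSHOTS` mapping, the timestamp value, and both function names are hypothetical, and the only behavior taken from the notes is that `/archive/<url>` redirects to `/archive/<timestamp>/index.html` with or without a scheme.

```python
from typing import Dict, Optional

# Hypothetical in-memory index: URL -> snapshot timestamp (illustrative values only).
SNAPSHOTS: Dict[str, str] = {
    "https://example.com/some/url": "1611111111",
}


def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so URLs compare scheme-insensitively."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def resolve_archive_path(path: str, snapshots: Dict[str, str] = SNAPSHOTS) -> Optional[str]:
    """Map /archive/<url> to /archive/<timestamp>/index.html, or None if unknown."""
    prefix = "/archive/"
    if not path.startswith(prefix):
        return None
    wanted = strip_scheme(path[len(prefix):])
    for url, timestamp in snapshots.items():
        if strip_scheme(url) == wanted:
            return f"/archive/{timestamp}/index.html"
    return None
```

Comparing URLs after stripping the scheme is one simple way to make `/archive/example.com/some/url` and `/archive/https://example.com/some/url` land on the same snapshot; a real server would query its database rather than loop over a dict.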
Enhancements
- lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
- full text search now works on the public snapshot list
- dates and times are now localized to your browser's timezone instead of showing in UTC
- integrity and correctness improvements to readability, mercury, warc, and other extractors
- video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
- log all errors with full tracebacks to new data/logs/errors.log file (so users no longer have to run in --debug mode to see error details)
- better archivebox schedule logging and changed logfile location to ./logs/schedule.log
- better docker-compose setup experience with sonic config example in docker-compose.yml
- add Django Debug Toolbar + djdt_flamegraph for developers to profile UI performance
- add --overwrite flag support to archivebox schedule, archived urls get added similarly to add --overwrite
- #644 remove network requests to CDNs for bootstrap and jquery by inlining them instead
- #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
- #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
- 3276434 add new SEARCH_BACKEND_TIMEOUT config option to tune amount of time search backend can take before it gives up
- more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
- make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
- better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
- added Cache-Control headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
- new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io
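As a rough illustration of the errors.log change listed above, the sketch below uses Python's standard `logging` module to append full tracebacks to a file instead of printing them. It is a sketch under assumptions, not ArchiveBox's code: `make_error_logger`, `run_logged`, and the message format are hypothetical names chosen for the example.

```python
import logging
from pathlib import Path


def make_error_logger(log_path: Path) -> logging.Logger:
    """Return a logger that appends errors, with tracebacks, to log_path."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"errors:{log_path}")
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # do not pass records on to the root logger
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run_logged(logger: logging.Logger, func, *args):
    """Call func; on failure record the full traceback and return None."""
    try:
        return func(*args)
    except Exception:
        # logger.exception() logs at ERROR level and appends the current traceback.
        logger.exception("call failed: %s", getattr(func, "__name__", func))
        return None
```

Calling `run_logged(make_error_logger(Path("data/logs/errors.log")), some_function)` would leave the traceback in the file while the caller carries on, which is the general idea behind not needing a debug mode to see error details.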
Bugfixes
- #673 fix searching by URL substring in Snapshot admin list
- #658 fix Snapshot admin action buttons not working in Safari and some other browsers
- #678 fix AssertionError when archivebox would attempt to archive with CHROME_BINARY=None when Chrome was not found on host system
- #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
- #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful y'all)
- #433 fix deleted items sometimes reappearing on next import/update
- #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
- fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose
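The Windows encoding fix above reflects a common Python pitfall: `open()` without an `encoding` argument falls back to the platform's locale encoding, which on Windows is often not UTF-8. A minimal sketch of the usual remedy follows; the helper names are hypothetical and this is not the patch that was applied in #674.

```python
from pathlib import Path


def write_text_utf8(path: Path, text: str) -> None:
    """Write text as UTF-8 regardless of the platform's default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path: Path) -> str:
    """Read text as UTF-8 regardless of the platform's default encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Passing the encoding explicitly on both the write and the read side keeps non-ASCII titles and page text intact when an archive is moved between Linux, macOS, and Windows hosts.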