xiph/rav1e v0.4.0-alpha on GitHub

This is a big new release of rav1e after 7 months of work, making the encoder noticeably faster and better.

Video	PSNR	PSNR HVS	SSIM	CIEDE 2000	APSNR	MS SSIM	VMAF
Average	-2.38	-2.02	-3.06	-3.04	-2.51	-2.68	-1.84

Since 0.3 there have been 435 new commits, with around 50,000 additions and 17,000 deletions, from 29 contributors.

Improvements

Enable Open-Partition on frame boundaries, giving ~2% RD gains.
Use av-metrics in CLI to compute PSNR, PSNR-HVS, SSIM, MS-SSIM,
CIEDE2000 (see --metrics)
Unwaffle rebase for loop filter: deblocking is now part of loopfilter RDO,
giving 0.5 to 1.5% gains
Thread CDEF with tiles, giving a ~1.2% performance improvement using 2x2 tiles
New Rate Control API that is less error-prone to use.
Full Monochrome Support
Enable CDEF and Restoration Filter for 4:2:2, decreasing encoding time by ~37%
and giving overall improvements between 0.8% and 5%
Added compound prediction mode variants for drl=2 and drl=3
Enable NEAR_NEAR1MV, NEAR_NEAR2MV Compound mode
Support arbitrary SAR anamorphic video
Enforce a frame limit of 1 in STILL_PICTURE_MODE
Quiet Mode in CLI with -q or --quiet
Ensure all mv predictors are converted to fullpel
Update non-broken Motion Estimation Predictors giving ~0.28% gains
Substantially rework initial motion estimation: 9% improved performance
Optimise predictors for multipass motion estimation, giving 0.3-0.4% gains
Optimize chroma quantizer offsets for subset3 4:4:4, giving a 31% improvement
for luma metrics and a 14% BD-rate improvement for CIEDE2000 on 4:4:4 clips
Opaque data can be pinned to frames and retrieved from the matching packet.
Merge of dav1d 0.6.0, 0.7.0, and 0.7.1 assembly for both x86 and AArch64
Naive x86_64 intrinsics for get_satd HBD
Added NEON assembly for dist::get_sad on aarch64 giving ~66% improved encoding time
Integration of 200+ 16BPC AArch64 functions from dav1d, resulting in an
overall speedup of ~9.5%
Added x86 SIMD for weighted SSE computation giving 5-7% speedup on PSNR
Derive quantizers using linear models giving ~0.7 to 1.7% gains in metrics for
4:2:0
Pruned intra mode list by SATD, reducing encoding time by between 5.5% and 12.2%
at default speed level
Optimization of rdo_loop_decision, reducing total allocation count by 25% and
encoding time by 1%
Removal of Initial Allocation for lookahead_intra_costs
Avoid temporary allocation for inter pruning, significantly reducing
allocations
Reduce manual indexing in for_each in TileBlocks giving 1.5% speedup
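
The opaque-data item above can be pictured with a small self-contained model. This is only a sketch of the idea, not rav1e's API: `ToyEncoder`, `ToyPacket`, `send_frame` and `receive_packet` are hypothetical names invented for illustration, the "encoding" step is a placeholder, and packets come out in input order, which a real encoder does not guarantee.

```rust
use std::any::Any;
use std::collections::VecDeque;

/// Caller-owned data of any type, attached to one input frame.
pub type Opaque = Box<dyn Any + Send>;

/// Hypothetical output unit: one packet per input frame.
pub struct ToyPacket {
    pub input_frameno: u64,
    pub data: Vec<u8>,
    pub opaque: Option<Opaque>,
}

/// Hypothetical encoder that only models the opaque-data round trip.
#[derive(Default)]
pub struct ToyEncoder {
    next_frameno: u64,
    queue: VecDeque<(u64, Vec<u8>, Option<Opaque>)>,
}

impl ToyEncoder {
    /// Queue a frame together with optional caller data.
    pub fn send_frame(&mut self, pixels: Vec<u8>, opaque: Option<Opaque>) {
        self.queue.push_back((self.next_frameno, pixels, opaque));
        self.next_frameno += 1;
    }

    /// Return the next packet; the opaque value travels with its frame.
    pub fn receive_packet(&mut self) -> Option<ToyPacket> {
        let (input_frameno, pixels, opaque) = self.queue.pop_front()?;
        // Placeholder "encoding": the pixels are passed through untouched.
        Some(ToyPacket { input_frameno, data: pixels, opaque })
    }
}

/// Usage: attach a timestamp string to a frame and read it back.
pub fn opaque_round_trip() -> Option<String> {
    let mut enc = ToyEncoder::default();
    enc.send_frame(vec![0u8; 16], Some(Box::new(String::from("pts=40ms"))));
    let packet = enc.receive_packet()?;
    let opaque = packet.opaque?;
    // The caller knows which type it stored and downcasts back to it.
    opaque.downcast::<String>().ok().map(|s| *s)
}
```

The value is type-erased on the way in, so the encoder never needs to know what the caller attached; the caller recovers it with a checked downcast.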

Bug Fixes

Fixed the rebuild with fresh assembly output
Fixed the Chroma Desync for narrow-frames
Abort pass encoding without a bitrate target in CLI
Fixed the -v cli option
Fixed a crash when using 4 tiles for 1080p 4:2:2 input
Fixed the 4:2:0 assumption in IEF block context selection
Fixed the symbol redefinition error for AArch64 builds using Clang
Fixed LRF choosing different LRU sizes in Y and UV when not 4:2:0
Fixed the broken borrow checker for tile_blocks
Fixed the quantizer index clamping
Fixed cross-compiling from macOS to mingw-w64
Avoid a buffer underflow condition in CDEF pad_into_tmp16()
Properly validate minimum rdo_lookahead_frames value

Changes

Bumped minimum version of NASM to 2.14.0
Updated Speed Preset Settings
- Full SGR Search is enabled for Speed Levels up to 4 instead of 8
- Enabled Fine Directional Intra Preset for all speed levels
- Removed Diamond Motion Estimation
- Reduced TX_Set preset is now enabled from Speed 6 instead of Speed 5
- Disabled TX-Type RDO for inter frames.
Renamed the native CPU feature level to rust: from rav1e 0.4.0-alpha on, use
RAV1E_CPU_TARGET=rust instead of RAV1E_CPU_TARGET=NATIVE
Removed in-library psnr computation facility
Moved Frame related data structures to a separate crate (v_frame)
Extended dump_lookahead_data
- Now the frame_subtype is exported
- Use the RAV1E_DATA_PATH env to place the output file.
Major refactoring in CDEF, both to allow easier import of dav1d CDEF
assembly and to simplify bitdepth and [re-]buffering requirements in LR.
Removal of leftover libaom code
Remove unused diamond motion estimation
Reduced Build Time:
- do not enable LTO by default,
- use as many codegen units as possible
- allow incremental builds for the release profile
- inlined various functions
- removed large stack allocation, improved HBD SATD for x86 targets
- split large modules into multiple submodules
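
With the in-library PSNR facility removed and the CLI relying on av-metrics, library users who want a quick sanity check need their own implementation. For orientation only, the textbook PSNR definition for a single plane can be sketched as follows. The function name `psnr_plane` and the flat `u16` sample layout are assumptions made for this example and are not taken from rav1e or av-metrics.

```rust
/// PSNR in dB between two planes stored as flat sample slices.
///
/// Returns `None` when the planes are empty or differ in length, and
/// `f64::INFINITY` when they are identical (zero error).
/// `bit_depth` is the sample bit depth, e.g. 8 or 10.
pub fn psnr_plane(reference: &[u16], distorted: &[u16], bit_depth: u32) -> Option<f64> {
    if reference.is_empty() || reference.len() != distorted.len() {
        return None;
    }

    // Sum of squared errors, accumulated in 64 bits to avoid overflow.
    let sse: u64 = reference
        .iter()
        .zip(distorted)
        .map(|(&a, &b)| {
            let d = i64::from(a) - i64::from(b);
            (d * d) as u64
        })
        .sum();

    if sse == 0 {
        return Some(f64::INFINITY);
    }

    // Peak sample value for the given bit depth, e.g. 255 for 8-bit.
    let peak = ((1u64 << bit_depth) - 1) as f64;
    let mse = sse as f64 / reference.len() as f64;
    Some(10.0 * (peak * peak / mse).log10())
}
```

This covers one plane only; how per-plane and per-frame values are combined into a single score is a separate choice that this sketch does not make.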

Unstable features

Channel-based API
A means to use a pre-allocated thread pool and share it across multiple encoders.
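
To make the thread pool item concrete, here is one way a pre-allocated pool can be shared between several encoder instances. It is a standard-library-only sketch of the ownership pattern, not the rav1e API: `SharedPool`, `PooledEncoder`, `encode_all` and the byte-sum "encoding" job are all hypothetical.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool, created once up front.
pub struct SharedPool {
    sender: Mutex<mpsc::Sender<Job>>,
}

impl SharedPool {
    /// Spawn `threads` workers (at least one) that pull jobs from a queue.
    pub fn new(threads: usize) -> Arc<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..threads.max(1) {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The lock guard is a temporary, released before the job runs.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok(job) => job(),
                    Err(_) => break, // pool dropped, no more work
                }
            });
        }
        Arc::new(SharedPool { sender: Mutex::new(tx) })
    }

    fn spawn(&self, job: Job) {
        self.sender.lock().unwrap().send(job).expect("workers are alive");
    }
}

/// Hypothetical encoder that borrows worker threads from a shared pool.
pub struct PooledEncoder {
    pool: Arc<SharedPool>,
}

impl PooledEncoder {
    pub fn new(pool: Arc<SharedPool>) -> Self {
        Self { pool }
    }

    /// Stand-in for encoding: sums each frame's bytes on the pool and
    /// returns the results in input order.
    pub fn encode_all(&self, frames: Vec<Vec<u8>>) -> Vec<u64> {
        let (tx, rx) = mpsc::channel();
        let n = frames.len();
        for (i, frame) in frames.into_iter().enumerate() {
            let tx = tx.clone();
            self.pool.spawn(Box::new(move || {
                let sum: u64 = frame.iter().map(|&b| u64::from(b)).sum();
                let _ = tx.send((i, sum));
            }));
        }
        let mut out = vec![0u64; n];
        for (i, sum) in rx.iter().take(n) {
            out[i] = sum;
        }
        out
    }
}

/// Usage: two encoders, one pool of two threads.
pub fn two_encoders_one_pool() -> (Vec<u64>, Vec<u64>) {
    let pool = SharedPool::new(2);
    let a = PooledEncoder::new(Arc::clone(&pool));
    let b = PooledEncoder::new(Arc::clone(&pool));
    (a.encode_all(vec![vec![1, 2, 3], vec![4]]), b.encode_all(vec![vec![255; 4]]))
}
```

Handing each encoder an `Arc` to the same pool keeps the total thread count fixed no matter how many encoders exist, and the workers exit on their own once the last handle is dropped.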