This is a big new release of rav1e, arriving after 7 months, that makes the encoder noticeably faster and better.
Video | PSNR | PSNR HVS | SSIM | CIEDE 2000 | APSNR | MS SSIM | VMAF |
---|---|---|---|---|---|---|---|
Average | -2.38 | -2.02 | -3.06 | -3.04 | -2.51 | -2.68 | -1.84 |
Since 0.3 there have been 435 new commits, with around 50,000 additions and 17,000 deletions from 29 contributors.
Improvements
- Enable Open-Partition on frame boundary, gives ~2% rd gains.
- Use av-metrics in CLI to compute PSNR, PSNR-HVS, SSIM, MS-SSIM, CIEDE2000 (see `--metrics`)
- Unwaffle Rebase for Loop-filter: Now deblocking is enabled to loopfilter RDO giving 0.5 to 1.5% gains
- Thread CDEF with tiles giving ~1.2% performance using 2x2 Tiles
- New Rate Control API that is less error-prone to use.
- Full Monochrome Support
- Enabled CDEF and Restoration Filter for 4:2:2, decreasing encoding time by ~37% and giving overall improvements between 0.8% and 5%
- Added compound prediction mode variants for drl=2 and drl=3
- Enable NEAR_NEAR1MV, NEAR_NEAR2MV Compound mode
- Support arbitrary SAR anamorphic video
- Enforce a frame limit of 1 in STILL_PICTURE_MODE
- Quiet Mode in CLI with -q or --quiet
- Ensure all mv predictors are converted to fullpel
- Update non-broken Motion Estimation Predictors giving ~0.28% gains
- Substantially rework initial motion estimation: 9% improved performance
- Optimise Predictors for multipass motion estimation giving 0.3-0.4% gains
- Optimize Chroma quantizer offsets for subset3 4:4:4 giving 31% for Luma Metrics and 14% BD-Rate Improvement for CIEDE2000 for 4:4:4 clips
- Opaque data can be pinned to frames and retrieved from the matching packet.
- Merge of dav1d 0.6.0, 0.7.0 and 0.7.1 Assembly for both x86 and AArch64
- Naive x86_64 intrinsics for get_satd HBD
- Added NEON assembly for dist::get_sad on aarch64 giving ~66% improved encoding time
- Integration of 200+ 16BPC AArch64 Functions from dav1d resulting in an overall speedup of ~9.5%
- Added x86 SIMD for weighted SSE computation giving 5-7% speedup on PSNR
- Derive quantizers using linear models giving ~0.7 to 1.7% gains in metrics for 4:2:0
- Pruned Intra Mode list by SATD reducing encoding time by between 5.5% and 12.2% at default speed level
- Optimization of rdo_loop_decision reducing total allocation count by 25% and encoding time by 1%
- Removal of Initial Allocation for lookahead_intra_costs
- Avoid temporary allocation for inter pruning, significantly reducing allocations
- Reduce manual indexing in for_each in TileBlocks giving 1.5% speedup
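
To make the SATD-based intra pruning item above more concrete, here is a minimal, self-contained sketch of the general idea: score each candidate prediction with a 4x4 Hadamard SATD and keep only the cheapest few for the expensive RDO step. This is not rav1e's code. The function names (`hadamard4`, `satd4x4`, `prune_by_satd`), the fixed 4x4 block size, the unnormalized cost, and the candidate labels are all assumptions made for illustration.

```rust
/// One 4-point Hadamard butterfly (unnormalized).
fn hadamard4(v: [i32; 4]) -> [i32; 4] {
    let (a0, a1, a2, a3) = (v[0] + v[1], v[0] - v[1], v[2] + v[3], v[2] - v[3]);
    [a0 + a2, a1 + a3, a0 - a2, a1 - a3]
}

/// Sum of absolute transformed differences for a 4x4 block stored row-major.
/// Hypothetical helper: real encoders also normalize and handle larger blocks.
fn satd4x4(src: &[i32; 16], pred: &[i32; 16]) -> u32 {
    // Residual between source and prediction.
    let mut d = [0i32; 16];
    for i in 0..16 {
        d[i] = src[i] - pred[i];
    }
    // Transform each row.
    for r in 0..4 {
        let t = hadamard4([d[4 * r], d[4 * r + 1], d[4 * r + 2], d[4 * r + 3]]);
        d[4 * r..4 * r + 4].copy_from_slice(&t);
    }
    // Transform each column.
    for c in 0..4 {
        let t = hadamard4([d[c], d[4 + c], d[8 + c], d[12 + c]]);
        for r in 0..4 {
            d[4 * r + c] = t[r];
        }
    }
    d.iter().map(|x| x.unsigned_abs()).sum()
}

/// Keep the `keep` candidates with the lowest SATD cost, cheapest first.
/// `candidates` pairs an illustrative mode label with its predicted block.
fn prune_by_satd<'a>(
    src: &[i32; 16],
    candidates: &[(&'a str, [i32; 16])],
    keep: usize,
) -> Vec<&'a str> {
    let mut scored: Vec<(u32, &'a str)> = candidates
        .iter()
        .map(|(name, pred)| (satd4x4(src, pred), *name))
        .collect();
    // Stable sort, so ties keep their original candidate order.
    scored.sort_by_key(|&(cost, _)| cost);
    scored.into_iter().take(keep).map(|(_, name)| name).collect()
}
```

Only the modes returned by `prune_by_satd` would then go through full rate-distortion evaluation. That trade is where the encoding-time saving in a scheme like this comes from.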
Bug Fixes
- Fixed the rebuild with fresh assembly output
- Fixed the Chroma Desync for narrow-frames
- Abort pass encoding without a bitrate target in CLI
- Fixed the `-v` cli option
- Fixed a crash when using 4 tiles for 1080p 4:2:2 input
- Fixed the 4:2:0 assumption in IEF block context selection
- Fixed the symbol redefinition error for AArch64 builds using Clang
- Fixed LRF choosing different LRU sizes in Y and UV when not 4:2:0
- Fixed the broken borrow checker for tile_blocks
- Fixed the quantizer index clamping
- Fixed cross-compiling from macOS to mingw-w64
- Avoids a buffer underflow condition in CDEF pad_into_tmp16()
- Properly validate minimum rdo_lookahead_frames value
Changes
- Bumped minimum version of NASM to 2.14.0
- Updated Speed Preset Settings
- Full SGR Search is enabled for Speed Levels till 4 instead of 8
- Enabled Fine Directional Intra Preset for all speed levels
- Removed Diamond Motion Estimation
- Reduced TX_Set preset is now enabled from Speed 6 instead of Speed 5
- Disabled TX-Type RDO for inter frames.
- Rename of Native CPU Feature level to Rust: Use RAV1E_CPU_TARGET=rust from rav1e 0.4.0-alpha instead of RAV1E_CPU_TARGET=NATIVE
- Removed in-library psnr computation facility
- Moved Frame related data structures to a separate crate (v_frame)
- Extended `dump_lookahead_data`
  - Now the `frame_subtype` is exported
  - Use the `RAV1E_DATA_PATH` env to place the output file.
- Major Refactoring in CDEF, both towards allowing easier import of dav1d CDEF assembly and towards simplifying bitdepth and [re-]buffering requirements in LR.
- Removal of leftover libaom code
- Remove unused diamond motion estimation
- Reduced Build Time:
  - do not enable LTO by default,
  - use as many codegen units
  - allow incremental builds for the release profile
- in-lined various functions
- removed large stack allocation, improved HBD SATD for x86 targets
- split large modules in multiple submodules
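
As an illustration of the `RAV1E_DATA_PATH` item above, the sketch below shows one common way to let an environment variable choose an output directory. It is not taken from rav1e. The helper names, the fallback to the current directory when the variable is unset, and the file name used in the usage note are assumptions.

```rust
use std::env;
use std::path::PathBuf;

/// Build the output path from an optional directory and a file name.
/// Assumption: with no directory given, fall back to the current directory.
fn dump_path(data_dir: Option<String>, file_name: &str) -> PathBuf {
    let mut path = data_dir
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("."));
    path.push(file_name);
    path
}

/// Same as `dump_path`, reading the directory from RAV1E_DATA_PATH.
fn dump_path_from_env(file_name: &str) -> PathBuf {
    dump_path(env::var("RAV1E_DATA_PATH").ok(), file_name)
}
```

With `RAV1E_DATA_PATH=/tmp/out` exported, `dump_path_from_env("lookahead.bin")` resolves to `/tmp/out/lookahead.bin`, where `lookahead.bin` is a made-up file name. Keeping the pure `dump_path` separate from the environment lookup makes the logic easy to test.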
Unstable features
- Channel-based API
- A means to use a pre-allocated threadpool and share it across multiple encoders.
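
For readers unfamiliar with the shape of a channel-based encoder interface, here is a toy sketch built only on the standard library. It does not use or reproduce rav1e's unstable API. `Frame`, `Packet` and `spawn_encoder` are hypothetical names, and the "encoding" step is a stand-in. The sketch also carries an opaque payload from each frame to its packet, in the spirit of the opaque-data item listed under Improvements.

```rust
use std::any::Any;
use std::sync::mpsc;
use std::thread;

/// Hypothetical input frame: raw bytes plus optional caller-owned data.
struct Frame {
    data: Vec<u8>,
    opaque: Option<Box<dyn Any + Send>>,
}

/// Hypothetical output packet: carries the frame's opaque data back.
struct Packet {
    size: usize,
    opaque: Option<Box<dyn Any + Send>>,
}

/// Start a worker thread and return the two channel ends the caller uses:
/// one to submit frames, one to receive packets.
fn spawn_encoder() -> (mpsc::Sender<Frame>, mpsc::Receiver<Packet>) {
    let (frame_tx, frame_rx) = mpsc::channel::<Frame>();
    let (packet_tx, packet_rx) = mpsc::channel::<Packet>();
    thread::spawn(move || {
        // The loop ends once every frame sender has been dropped.
        for frame in frame_rx {
            // Stand-in for real encoding: report the input size only.
            let packet = Packet { size: frame.data.len(), opaque: frame.opaque };
            if packet_tx.send(packet).is_err() {
                break; // the receiving side went away
            }
        }
    });
    (frame_tx, packet_rx)
}
```

A caller sends frames on the returned `Sender`, drops it to signal end of stream, and then drains the `Receiver`. The opaque payload is recovered with `downcast_ref` on the packet side.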