8.7.3 (2023-10-30)
Behavior Changes
- Deleting stale files upon recovery is delegated to `SstFileManager` if available so that the deletions can be rate limited.
8.7.2 (2023-10-25)
Public API Changes
- Add new Cache APIs `GetSecondaryCacheCapacity()` and `GetSecondaryCachePinnedUsage()` to return the configured capacity of, and the cache reservation charged to, the secondary cache.
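A minimal sketch of querying these APIs on a block cache backed by a compressed secondary cache. The construction below and the assumed `Status`-returning, out-parameter signatures are illustrative; check rocksdb/cache.h and rocksdb/advanced_cache.h for the exact interface:

```cpp
#include <cstdio>
#include <memory>

#include "rocksdb/cache.h"

int main() {
  // Illustrative sizes: 128MB primary LRU cache, 64MB compressed secondary.
  rocksdb::LRUCacheOptions lru_opts;
  lru_opts.capacity = 128 * 1024 * 1024;

  rocksdb::CompressedSecondaryCacheOptions sec_opts;
  sec_opts.capacity = 64 * 1024 * 1024;
  lru_opts.secondary_cache = rocksdb::NewCompressedSecondaryCache(sec_opts);

  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(lru_opts);

  // Assumed signatures: Status GetSecondaryCacheCapacity(size_t&) and
  // Status GetSecondaryCachePinnedUsage(size_t&).
  size_t sec_capacity = 0;
  size_t sec_pinned = 0;
  if (cache->GetSecondaryCacheCapacity(sec_capacity).ok() &&
      cache->GetSecondaryCachePinnedUsage(sec_pinned).ok()) {
    std::printf("secondary capacity=%zu pinned usage=%zu\n", sec_capacity,
                sec_pinned);
  }
  return 0;
}
```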
Bug Fixes
- Fixed a possible underflow when computing the compressed secondary cache share of memory reservations while updating the compressed secondary to total block cache ratio.
- Fix an assertion failure when `UpdateTieredCache()` is called in an idempotent manner.
8.7.1 (2023-10-20)
Bug Fixes
- Fix a bug in auto_readahead_size where the `first_internal_key` of index blocks wasn't copied properly, resulting in a corruption error when `first_internal_key` was used for comparison.
- Add bounds check in `WBWIIteratorImpl` and make `BaseDeltaIterator`, `WriteUnpreparedTxn` and `WritePreparedTxn` respect the upper and lower bounds in `ReadOptions`. See #11680. A sketch of supplying such bounds follows this list.
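For context, a minimal sketch of supplying iterator bounds through `ReadOptions` when using `WriteBatchWithIndex`; the `NewIteratorWithBase()` overload that accepts a `ReadOptions*` and the exact bound semantics are assumptions to verify against write_batch_with_index.h:

```cpp
#include <memory>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/utilities/write_batch_with_index.h"

// With the fix, the base/delta iterator is expected to honor the same
// bounds as a plain DB iterator. The key range here is illustrative.
void ScanBounded(rocksdb::DB* db) {
  rocksdb::WriteBatchWithIndex wbwi;
  wbwi.Put("k1", "v1");  // pending write, visible through the iterator

  // Bounds live in ReadOptions; the Slices must outlive the iterator.
  std::string lower = "a";
  std::string upper = "m";
  rocksdb::Slice lower_slice(lower), upper_slice(upper);
  rocksdb::ReadOptions ro;
  ro.iterate_lower_bound = &lower_slice;
  ro.iterate_upper_bound = &upper_slice;

  std::unique_ptr<rocksdb::Iterator> base(db->NewIterator(ro));
  std::unique_ptr<rocksdb::Iterator> it(wbwi.NewIteratorWithBase(
      db->DefaultColumnFamily(), base.release(), &ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Keys outside [lower, upper) should not be returned here.
  }
}
```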
8.7.0 (2023-09-22)
New Features
- Added an experimental new "automatic" variant of HyperClockCache that does not require a prior estimate of the average size of cache entries. This variant is activated when `HyperClockCacheOptions::estimated_entry_charge = 0` and has essentially the same concurrency benefits as the existing HyperClockCache (see the sketch after this list).
- Add a new statistic `COMPACTION_CPU_TOTAL_TIME` that records cumulative compaction cpu time. This ticker is updated regularly while a compaction is running.
- Add `GetEntity()` API for ReadOnly DB and Secondary DB.
- Add a new iterator API `Iterator::Refresh(const Snapshot *)` that allows an iterator to be refreshed while using the input snapshot to read.
- Added a new read option `merge_operand_count_threshold`. When the number of merge operands applied during a successful point lookup exceeds this threshold, the query will return a special OK status with a new subcode `kMergeOperandThresholdExceeded`. Applications might use this signal to take action to reduce the number of merge operands for the affected key(s), for example by running a compaction.
- For `NewRibbonFilterPolicy()`, made the `bloom_before_level` option mutable through the Configurable interface and the SetOptions API, allowing dynamic switching between all-Bloom and all-Ribbon configurations, and configurations in between. See comments on `NewRibbonFilterPolicy()`.
- RocksDB now allows the block cache to be stacked on top of a compressed secondary cache and a non-volatile secondary cache, thus creating a three-tier cache. To set it up, use the `NewTieredCache()` API in rocksdb/cache.h.
- Added a new wide-column aware full merge API called `FullMergeV3` to `MergeOperator`. `FullMergeV3` supports wide columns both as base value and merge result, which enables the application to perform more general transformations during merges. For backward compatibility, the default implementation implements the earlier logic of applying the merge operation to the default column of any wide-column entities. Specifically, if there is no base value or the base value is a plain key-value, the default implementation falls back to `FullMergeV2`. If the base value is a wide-column entity, the default implementation invokes `FullMergeV2` to perform the merge on the default column, and leaves any other columns unchanged.
- Add wide column support to ldb commands (scan, dump, idump, dump_wal) and sst_dump tool's scan command.
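As a concrete illustration of the "automatic" HyperClockCache item above, a minimal sketch; the 1GB capacity is arbitrary and the surrounding option plumbing is just one way to wire the cache into a DB:

```cpp
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptionsWithAutoHyperClockCache() {
  // estimated_entry_charge = 0 selects the automatic variant, so no prior
  // estimate of the average cache entry size is needed.
  rocksdb::HyperClockCacheOptions hcc_opts(
      /*_capacity=*/1024 * 1024 * 1024,
      /*_estimated_entry_charge=*/0);
  std::shared_ptr<rocksdb::Cache> cache = hcc_opts.MakeSharedCache();

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = cache;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return options;
}
```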
Public API Changes
- Expose more information about input files used in table creation (if any) in `CompactionFilter::Context`. See `CompactionFilter::Context::input_start_level` and `CompactionFilter::Context::input_table_properties` for more.
- `Options::compaction_readahead_size`'s default value is changed from 0 to 2MB.
- When using LZ4 compression, the `acceleration` parameter is configurable by setting the negated value in `CompressionOptions::level`. For example, `CompressionOptions::level = -10` will set `acceleration = 10` (a sketch follows this list).
- The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache`, is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.
- The `NewTieredVolatileCache()` API in rocksdb/cache.h has been renamed to `NewTieredCache()`.
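A small sketch of the LZ4 acceleration item above; the value -10 is only an example:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeLZ4Options() {
  rocksdb::Options options;
  options.compression = rocksdb::kLZ4Compression;
  // A negative level is interpreted as the negated LZ4 acceleration,
  // i.e. level = -10 means acceleration = 10.
  options.compression_opts.level = -10;
  return options;
}
```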
Behavior Changes
- Compaction read performance will regress when `Options::compaction_readahead_size` is explicitly set to 0 (see the sketch after this list).
- Universal size amp compaction will conditionally exclude some of the newest L0 files when selecting input, with a small negative impact to size amp. This is to prevent a large number of L0 files from being locked by a size amp compaction, potentially leading to write stop with a few more flushes.
- Change ldb scan command delimiter from ':' to '==>'.
- For non direct IO, eliminate the file system prefetching attempt for compaction read when `Options::compaction_readahead_size` is 0.
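A small sketch of the readahead settings referenced above, assuming the value is simply set on `Options`; 2MB mirrors the new default and 0 disables the prefetching attempt:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeCompactionReadaheadOptions(bool disable_readahead) {
  rocksdb::Options options;
  // 0 disables the file system prefetching attempt for compaction reads
  // (non direct IO) and can regress compaction read performance;
  // 2MB is the new default.
  options.compaction_readahead_size =
      disable_readahead ? 0 : 2 * 1024 * 1024;
  return options;
}
```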
Bug Fixes
- Fix a bug where, if there is an error reading from offset 0 of a file from L1+ and the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
- Fix a bug with atomic_flush=true that can cause the DB to become stuck after a flush fails (#11872).
- Fix a bug where RocksDB (with atomic_flush=false) can delete output SST files of pending flushes when a previous concurrent flush fails (#11865). This can result in DB entering read-only state with error message like `IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst`.
- Fix an assertion failure during seek with async_io when readahead trimming is enabled.
- When the compressed secondary cache capacity is reduced to 0, it should be completely disabled. Before this fix, inserts and lookups would still go to the backing `LRUCache` before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
- Updating the tiered cache (cache allocated using NewTieredCache()) by calling SetCapacity() on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But SetCapacity() would just set the primary cache capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, SetCapacity() will distribute the new capacity across the two caches by the same ratio.
- Fixed a bug in `MultiGet` for cleaning up SuperVersion acquired with locking db mutex.
- Fix a bug where row cache can falsely return kNotFound even though row cache entry is hit.
- Fixed a race condition in `GenericRateLimiter` that could cause it to stop granting requests.
- Fix a bug (Issue #10257) where DB can hang after write stall since no compaction is scheduled (#11764).
- Add a fix for async_io where during seek, when reading a block for seeking a target key in a file without any readahead, the iterator aligned the read on a page boundary and read more than necessary. This increased the storage read bandwidth usage.
- Fix an issue in sst dump tool to handle bounds specified for data with user-defined timestamps.
- When auto_readahead_size is enabled, update readahead upper bound during readahead trimming when reseek changes iterate_upper_bound dynamically.
- Fixed a bug where `rocksdb.file.read.verify.file.checksums.micros` is not populated.
- Fixed a bug where compaction read under non direct IO still falls back to RocksDB internal prefetching after file system's prefetching returns non-OK status other than `Status::NotSupported()`.
Performance Improvements
- Added additional improvements in tuning `readahead_size` during scans when `auto_readahead_size` is enabled. However, it is not recommended for backward scans and might impact performance. More details in options.h. (A sketch follows this list.)
- During async_io, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. Previously, both phases looked up the data block in the block cache before checking the prefetch buffer. This is now optimized to do the block cache lookup only in the first phase, saving some CPU.
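As a rough illustration of the readahead tuning above, a sketch of a forward scan with `auto_readahead_size` enabled; the key range and upper bound are illustrative:

```cpp
#include <memory>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

void ScanWithAutoReadahead(rocksdb::DB* db) {
  // The upper bound Slice must outlive the iterator.
  std::string upper = "prefix~";
  rocksdb::Slice upper_slice(upper);

  rocksdb::ReadOptions ro;
  ro.auto_readahead_size = true;        // let RocksDB tune readahead_size
  ro.iterate_upper_bound = &upper_slice;

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek("prefix"); it->Valid(); it->Next()) {
    // process it->key() / it->value()
  }
}
```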