7.4.3 (07/13/2022)
Behavior Changes
- For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).
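The tracking rule above can be sketched with a small standard-library model. This is not RocksDB code: the type aliases, the function name `VerifyTrackedWals`, and the exact check (a tracked WAL must exist and be at least as large as its recorded synced size) are assumptions made for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model: WAL number -> synced size recorded in the MANIFEST.
// Only closed WALs get an entry; the live WAL is deliberately not tracked.
using TrackedWals = std::map<uint64_t, uint64_t>;
// Hypothetical model: WAL number -> file size found on disk at open time.
using OnDiskWals = std::map<uint64_t, uint64_t>;

// Returns an error message for the first tracked WAL that is missing or
// shorter than its recorded synced size, or an empty string if all pass.
std::string VerifyTrackedWals(const TrackedWals& tracked,
                              const OnDiskWals& on_disk) {
  for (const auto& entry : tracked) {
    const uint64_t number = entry.first;
    const uint64_t synced_size = entry.second;
    const auto it = on_disk.find(number);
    if (it == on_disk.end()) {
      return "missing WAL " + std::to_string(number);
    }
    if (it->second < synced_size) {
      return "WAL " + std::to_string(number) +
             " is shorter than its synced size";
    }
  }
  return std::string();
}
```

In this model, a WAL that appears on disk without a tracked entry (for example, the live WAL) is simply not checked, which mirrors the "only closed WALs are tracked" behavior described above.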
7.4.2 (06/30/2022)
Bug Fixes
- Fix a bug in Logger where, if dbname and db_log_dir are on different filesystems, dbname creation would fail with respect to the db_log_dir path, return an error, and fail to open the DB.
7.4.1 (06/28/2022)
Bug Fixes
- Pass rate_limiter_priority through filter block reader functions to FileSystem.
7.4.0 (06/19/2022)
Bug Fixes
- Fixed a bug in calculating key-value integrity protection for users of in-place memtable updates. In particular, the affected users would be those who configure protection_bytes_per_key > 0 on WriteBatch or WriteOptions, and configure inplace_callback != nullptr.
- Fixed a bug where a snapshot taken during SST file ingestion would be unstable.
- Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and for TransactionDB where, in case of a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on the second recovery. As a solution, RocksDB persists the new MANIFEST only after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, the new WAL was successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If a future recovery starts from the old MANIFEST, writing the new MANIFEST failed, and there is no "SST ahead of WAL" error.
- Fixed a bug where DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.
- Fix a race condition in WAL size tracking caused by an unsafe iterator access after the container is changed.
- Fix unprotected concurrent accesses to WritableFileWriter::filesize_ by DB::SyncWAL() and DB::Put() in two write queue mode.
- Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the db will not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
- Fix a bug that could return wrong results with index_type=kHashSearch and using SetOptions to change the prefix_extractor.
- Fixed a bug in WAL tracking with wal_compression. WAL compression writes a kSetCompressionType record which is not associated with any sequence number. As a result, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to the caller, e.g. Checkpoint, Backup, causing the operations to fail.
- Avoid a crash if the IDENTITY file is accidentally truncated to empty. A new DB ID will be generated and written on Open.
- Fixed a possible corruption for users of manual_wal_flush and/or FlushWAL(true /* sync */), together with track_and_verify_wals_in_manifest == true. For those users, losing unsynced data (e.g., due to power loss) could make future DB opens fail with a Status::Corruption complaining about missing WAL data.
- Fixed a bug in WriteBatchInternal::Append() where WAL termination point in write batch was not considered and the function appends an incorrect number of checksums.
- Fixed a crash bug introduced in 7.3.0 affecting users of MultiGet with kDataBlockBinaryAndHash.
- Add some fixes in async_io which was doing extra prefetching in shorter scans.
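The WAL size tracking fix in the list above concerns iterator use after a container has changed. As a generic illustration of that class of problem (not the RocksDB code; `WalSizes` and `DropObsoleteWals` are hypothetical names), the sketch below erases entries from a `std::map` while iterating and always continues from the iterator that `erase()` returns.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for per-WAL bookkeeping: WAL number -> tracked size.
using WalSizes = std::map<uint64_t, uint64_t>;

// Removes every entry whose WAL number is below min_wal_to_keep and
// returns how many entries were removed.
std::size_t DropObsoleteWals(WalSizes& wals, uint64_t min_wal_to_keep) {
  std::size_t removed = 0;
  for (auto it = wals.begin(); it != wals.end();) {
    if (it->first < min_wal_to_keep) {
      // erase() invalidates `it`, so continue from the iterator it returns
      // instead of incrementing the stale one.
      it = wals.erase(it);
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

The sketch is single-threaded. When several threads touch the same container, every access additionally has to happen under a common lock, since an iterator obtained before another thread modifies the map cannot be assumed to remain usable.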
Public API changes
- Add new API GetUnixTime in Snapshot class which returns the unix time at which Snapshot is taken.
- Add transaction get_pinned and multi_get to C API.
- Add two-phase commit support to C API.
- Add rocksdb_transaction_get_writebatch_wi and rocksdb_transaction_rebuild_from_writebatch to C API.
- Add rocksdb_options_get_blob_file_starting_level and rocksdb_options_set_blob_file_starting_level to C API.
- Add blobFileStartingLevel and setBlobFileStartingLevel to Java API.
- Add SingleDelete for DB in C API
- Add User Defined Timestamp in C API.
  - rocksdb_comparator_with_ts_create to create timestamp aware comparator
  - Put, Get, Delete, SingleDelete, MultiGet APIs have corresponding timestamp aware APIs with suffix with_ts
  - Also add C APIs for Transaction, SstFileWriter, Compaction as mentioned here
- The contract for implementations of Comparator::IsSameLengthImmediateSuccessor has been updated to work around a design bug in auto_prefix_mode.
- The API documentation for auto_prefix_mode now notes some corner cases in which it returns different results than total_order_seek, due to design bugs that are not easily fixed. Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
- Obsoleted the NUM_DATA_BLOCKS_READ_PER_LEVEL stat and introduced the NUM_LEVEL_READ_PER_MULTIGET and MULTIGET_COROUTINE_COUNT stats
- Introduced WriteOptions::protection_bytes_per_key, which can be used to enable key-value integrity protection for live updates.
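To make the idea behind per-key integrity protection concrete, here is a minimal sketch under stated assumptions. It is not how RocksDB computes protection info: the FNV-1a hash is a stand-in chosen for brevity, and `Fnv1a`, `ComputeProtection`, and `VerifyProtection` are hypothetical helpers. The point shown is only that a checksum derived from the key and value travels with the entry and is re-computed later to detect corruption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// FNV-1a over a byte string, continuing from `hash`. Used here only as a
// simple stand-in (an assumption; it is not the hash RocksDB uses).
uint64_t Fnv1a(const std::string& data,
               uint64_t hash = 14695981039346656037ULL) {
  for (unsigned char c : data) {
    hash ^= c;
    hash *= 1099511628211ULL;
  }
  return hash;
}

// Hypothetical helper: checksum of a key-value pair, truncated to
// `protection_bytes` bytes (1 to 8). The key length is hashed first so that
// ("ab", "c") and ("a", "bc") do not produce the same input stream.
uint64_t ComputeProtection(const std::string& key, const std::string& value,
                           std::size_t protection_bytes) {
  assert(protection_bytes >= 1 && protection_bytes <= 8);
  uint64_t h = Fnv1a(std::to_string(key.size()) + ":");
  h = Fnv1a(key, h);
  h = Fnv1a(value, h);
  if (protection_bytes < 8) {
    h &= (1ULL << (8 * protection_bytes)) - 1;  // keep the low bytes only
  }
  return h;
}

// Re-computes the checksum and compares it with the stored one.
bool VerifyProtection(const std::string& key, const std::string& value,
                      std::size_t protection_bytes, uint64_t expected) {
  return ComputeProtection(key, value, protection_bytes) == expected;
}
```

A writer would compute the checksum when the entry is created and a later stage would call the verify step before trusting the bytes; fewer protection bytes cost less space but let more corruptions go undetected.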
New Features
- Add FileSystem::ReadAsync API in io_tracing
- Add blob garbage collection parameters blob_garbage_collection_policy and blob_garbage_collection_age_cutoff to both force-enable and force-disable GC, as well as selectively override age cutoff when using CompactRange.
- Add an extra sanity check in GetSortedWalFiles() (also used by GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint) to reduce risk of successfully created backup or checkpoint failing to open because of missing WAL file.
- Add a new column family option blob_file_starting_level to enable writing blob files during flushes and compactions starting from the specified LSM tree level.
- Add support for timestamped snapshots (#9879)
- Provide support for AbortIO in posix to cancel submitted asynchronous requests using io_uring.
- Add support for rate-limiting batched MultiGet() APIs
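As a rough illustration of how blob_file_starting_level might interact with the other blob settings, the sketch below models the decision as a single predicate. The `BlobOptions` struct and `ShouldWriteToBlobFile` are hypothetical, and the rule itself (blob files enabled, output level at or above the starting level, value at least min_blob_size) is an assumption based on the option's description rather than a copy of RocksDB's logic.

```cpp
#include <cstdint>

// Hypothetical subset of the blob-related column family options.
struct BlobOptions {
  bool enable_blob_files = false;
  uint64_t min_blob_size = 0;
  int blob_file_starting_level = 0;
};

// Returns true if, under the assumed rule, a value of `value_size` bytes
// written by a flush or compaction into `output_level` goes to a blob file.
bool ShouldWriteToBlobFile(const BlobOptions& opts, int output_level,
                           uint64_t value_size) {
  return opts.enable_blob_files &&
         output_level >= opts.blob_file_starting_level &&
         value_size >= opts.min_blob_size;
}
```

Under this model, a starting level of 2 keeps large values inline in the SST files of levels 0 and 1 and separates them into blob files only once they are compacted into level 2 or deeper.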
Behavior changes
- DB::Open(), DB::OpenAsSecondary() will fail if a Logger cannot be created (#9984)
- Removed support for reading Bloom filters using obsolete block-based filter format. (Support for writing such filters was dropped in 7.0.) For good read performance on old DBs using these filters, a full compaction is required.
- Per KV checksum in write batch is verified before a write batch is written to WAL to detect any corruption to the write batch (#10114).
Performance Improvements
- When compiled with folly (Meta-internal integration; experimental in open source build), improve the locking performance (CPU efficiency) of LRUCache by using folly DistributedMutex in place of standard mutex.