New Features
- Added support for follower fetching #9780
- Server-side validation of Schema IDs that a serde client library encodes into the header of a record #10650
- Tiered Storage now uses a fine-grained local disk cache that stores segment chunks as opposed to full segments #9426
- Improve scalability of Tiered Storage by offloading metadata to cloud storage and fetching it when it's needed #11150, #10560
- Redpanda now loads the controller log from a snapshot on startup. This significantly improves startup times of nodes in long-running Redpanda clusters. Enabled on new clusters, disabled on existing clusters. To enable it in 23.2.1, contact Redpanda Support. #8285
- Use multiple partitions in the transaction manager topic. This significantly improves throughput when using the Kafka Transactions API. It is configurable with the transaction_coordinator_partitions option and can be set only on cluster creation #9362
- Adds a built-in CPU profiler configured with the cpu_profiler_enabled and cpu_profiler_sample_rate_ms cluster properties #10708
- Adds support for the deleteRecords Kafka API #10061
- Adds support for rpk profile to manage multiple rpk configuration profiles #10528
- Schema Registry: Support for Protobuf ‘well known types’ #7431
- Adds a ‘force reconfiguration’ API to uncleanly reconfigure a raft group into a smaller replica set. Meant to be used as an escape hatch, or by equivalent high-level abstractions, to recover partitions from an unrecoverable state. #9096
- Kubernetes Operator: Support mTLS with a customer-provided client CA at Panda Proxy and Schema Registry #9863
- Redpanda OperatorV2 (BETA). Note: Helm-based deployment is still recommended at this point. An upcoming patch release will deliver a GA version of the OperatorV2. #10317
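
As a rough illustration of the follower fetching feature listed above: with rack-aware replica selection, a consumer that declares its rack can read from a replica in the same rack instead of always reading from the partition leader. The sketch below is a conceptual model only, not Redpanda's implementation; the Replica type, the select_read_replica function, and the rack names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    # Hypothetical, simplified view of one partition replica.
    node_id: int
    rack: Optional[str]
    is_leader: bool


def select_read_replica(replicas: Sequence[Replica],
                        client_rack: Optional[str]) -> Replica:
    """Pick the replica a consumer should fetch from.

    Assumed policy: use the leader when the client has no rack or when
    the leader is in the client's rack; otherwise use the first replica
    in the client's rack; fall back to the leader if there is none.
    """
    leader = next(r for r in replicas if r.is_leader)
    if client_rack is None or leader.rack == client_rack:
        return leader
    for replica in replicas:
        if replica.rack == client_rack:
            return replica
    return leader


replicas = [
    Replica(node_id=0, rack="rack-a", is_leader=True),
    Replica(node_id=1, rack="rack-b", is_leader=False),
    Replica(node_id=2, rack="rack-c", is_leader=False),
]
chosen = select_read_replica(replicas, "rack-b")
print(chosen.node_id)
```

A consumer without a rack, or with a rack that no replica shares, keeps reading from the leader in this model.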
RPK CLI improvements
- Adds rpk topic trim-prefix, which gives rpk support for the DeleteRecords Kafka API #11781
- rpk: adds rpk topic describe-storage, which describes the cloud storage status of a given topic #9500, #8894
- Introduce SSO to rpk cloud login. #9864
- Adds support for rpk cloud auth to manage multiple rpk cloud auths #10528
- Automatically populates a new cloud profile when you run rpk cloud login #10528
- You can now generate a sample starter application to produce and consume using rpk generate app #12037
- rpk topic trim-prefix: you can now trim records up to a specific offset in a given topic/partition
- rpk now generates a password automatically when running rpk acl user create with no password #11235
- Add the ability to pass rpk TLS configuration flags to the legacy dashboard. #5623
- rpk now downloads the Grafana dashboard from our repo https://github.com/redpanda-data/observability #9662
- rpk now supports a -X flag. This will eventually replace --brokers, --user, --password, etc., i.e., all flags that configure how rpk talks to brokers. -X supports -X list for a short list of available configuration parameters, -X help for a full list, and autocomplete. #9981
- rpk now supports --rack to specify the rack to consume from, which also opts into follower fetching #11105
- adds --print-commits / -c to rpk group describe to opt into printing only the commits & lag section #12113
- now you can update user credentials with rpk acl user update. #11289
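
To make the prefix-trimming behavior behind the DeleteRecords support concrete, here is a minimal sketch of the semantics: records with an offset lower than the requested offset are removed, and everything from that offset onward is kept. This is a toy in-memory model, not how rpk or Redpanda implement it; the Record alias and the trim_prefix function are hypothetical, and the sketch skips the bounds checks a real broker performs.

```python
from typing import List, Tuple

# Hypothetical record shape for illustration: (offset, value).
Record = Tuple[int, str]


def trim_prefix(log: List[Record], trim_offset: int) -> List[Record]:
    """Return the log with every record below trim_offset removed."""
    return [(offset, value) for offset, value in log if offset >= trim_offset]


log = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
trimmed = trim_prefix(log, 2)
print(trimmed)
```

After trimming to offset 2 in this model, the earliest readable record is the one at offset 2.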
Other Notable Improvements
- The Redpanda ‘Admin API’ is now documented publicly
- A new admin API endpoint is added at /v1/cloud_storage/manifest/{topic}/{partition} which allows for retrieving the in-memory manifest for a partition in JSON format. #9745
- Cloud storage connections are balanced between the shards according to workload #9410
- pandaproxy: Support users with SCRAM-SHA-512 for authentication #11425
- Responses to Kafka DescribeLogDirs requests now include remote tiered storage space utilization. Remote space utilization is reported as a special directory called "remote://{your_bucket_name}". #8145
- Remove ghost Redpanda Node IDs #9750
- adds usage metrics to each Kafka request handler #10623
- Add admin API command that changes --blocked-reactor-notify-ms parameter on the fly. #9331
- Adds a large_allocation_warning_threshold node config option to enable warnings to be logged on large allocations. #8596
- Add mechanism that can be used to truncate tiered-storage using kafka offset. #9994
- Hierarchical hard/soft partition allocation constraints #9767
- Log retention rules applied eagerly in low disk space scenarios. #10658
- The vectorized_memory_available_memory metric is now available on the /metrics endpoint in addition to /public_metrics. #10018
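
As an illustration of the new manifest endpoint listed above, the following sketch builds the request URL for a given topic and partition. The path template comes from the note itself; the base address http://localhost:9644 is an assumption about where the admin API is listening, and the manifest_url helper is hypothetical.

```python
from urllib.parse import quote


def manifest_url(admin_base: str, topic: str, partition: int) -> str:
    """Build the URL for /v1/cloud_storage/manifest/{topic}/{partition}."""
    base = admin_base.rstrip("/")
    # Percent-encode the topic so unusual characters cannot alter the path.
    return f"{base}/v1/cloud_storage/manifest/{quote(topic, safe='')}/{partition}"


# Assumed admin API address; adjust it to match your deployment.
url = manifest_url("http://localhost:9644", "orders", 0)
print(url)
```

The resulting URL can then be requested with any HTTP client, for example urllib.request.urlopen, and the response parsed as JSON.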
Bug Fixes
- A bug is fixed where tiered storage topics might delay uploading manifests when partition leadership changes shortly after segments are uploaded #9154
- A spurious "Leaving S3 objects behind" log message is no longer printed when deleting non-tiered-storage partitions. #9405
- Always reconcile Node ID annotation in Redpanda POD #9749
- Avoids a crash when attempting to create a read replica topic while cloud storage is not configured. #11792
- Changes to enable deployment of new operator, not by default. #11651
- Convert tm_snapshot to fragmented_vector to reduce memory pressure. #10035
- Decommission during scale down uses correct Redpanda Node ID #11312
- Disable adjacent segment merging when the partition is not a leader #10538
- Do not remove the controller broker on not_controller error from kafka::client::create_topic() #9830
- Fix a bug in the re-uploading of compacted segment that could lead to consumers getting blocked. #9404
- Fix a bug where recovered topics would use the retention settings specified at creation time (ignoring updates after that point). #10297
- Fix a rare race between segment rolling and the readers cache which caused spurious, temporary deadlocks #9676
- Fix an issue with compacted segment reupload that could cause self-concatenated segments to be reuploaded. #9576
- Fix cases where sending very large batches to compacted topics using LZ4 or gzip compression could result in bad_alloc errors. #9598
- Fix OffsetForLeaderEpoch returning a value when the requestedEpoch is larger than the highest known epoch. #12063
- Fix memory leak upon shutdown which can occur when using tiered storage with ABS and TLS. #9998
- Fix offset translation errors after topic recovery #9919
- Fix offset translation failure that could be triggered after topic recovery #9865
- Fix race between compaction and application of retention to the local log. This occurred when compaction happened below the start offset of the log. It did not have a noticeable impact upon user workloads. #9483
- Fix race condition during forced segment roll #10506
- Fix rare histogram metric reporting bug which could lead to Redpanda crashing or terminating the HTTP connection on which it was serving the metric request. This only occurred when metrics aggregation was enabled. #9192
- Fix upload housekeeping errors on log truncation #9409
- Fixed a bug in which the segment merging could run from read replicas. #9685
- Fixed a bug that would cause Redpanda to return an invalid offset when consuming from a term that falls below the beginning of the local log when reading from tiered storage is disabled. #9264
- Fixed a bug that would previously prevent read replicas from progressing if cloud_storage_enable_remote_write is false. #9684
- Fixed a potential invalid memory access when iterating through segments with timestamps in the future. #11956
- Fixed an issue where sq or sq-split mode may be chosen for NIC IRQ affinity in the redpanda tuner, resulting in core/shard 0 becoming a bottleneck for some high packet-per-second loads. #10568
- Fixed another bug where multiple configuration updates would cause an inconsistency upon next reload of usage state #9917
- Fixed a leader not being electable when the only voter is in maintenance mode #10800
- Fixes a bug in our http client that may crash redpanda in exceptional cases #9903
- Fixes a bug where usage_manager would report incorrect cloud storage metrics when queried on the leader node. #9917
- Fixes an issue with the archival upload path that could contribute to data loss when archival metadata updates took a long time to be replicated and applied. #9633
- Limits on tiered storage read concurrency are enforced more strictly, to improve stability when clients issue very large numbers of concurrent fetches, or do time queries on topics with large numbers of partitions. #11914
- Make tiered-storage metadata handling more strict during rolling upgrades #12001
- Speeds up Redpanda startup when a long controller log contains redundant topic property updates #9739
- Prevent possible data loss when the same segment is added twice to the manifest #9597
- Regenerate the console configmap after RP cluster spec is changed so that the console works properly. #11211
- Schema Registry: Fix a bug in GET /subjects/<subject>/versions/latest that would previously not find the latest non-deleted version. #10543
- Fixes an issue where the Redpanda pods would not come up if requireClientAuth is enabled and clientCACertRef is NOT set for the panda proxy or schema registry TLS listener. #10546
- The redpanda_cloud_storage_cache_op_miss metric was not showing the right value. #9333
- A bug is fixed where if a PUT to object storage was in flight during shutdown, Redpanda might incorrectly record a failed upload as successful. #10107
- A stability issue is fixed where many concurrent Produce requests using very large compressed batches could exhaust memory. #10235
- A stability issue is fixed where very large ZSTD-compressed batches could exhaust memory #10235
- Fix post-recovery Raft bootstrap #10535
- Prevent allocation failures with many idempotent producers #10250
- Fixes an assertion during archival when a leadership change occurs #10244
- Fixed an issue where the metric redpanda_kafka_consumer_group_consumers was reporting double the real count of consumers. #10320
- Fix over allocation in metadata dissemination leadership update #10585
- bugfix: add correct file metadata to files in the debug bundle generated by rpk debug bundle #10705
- Fix bug caused by unexpected exception on the remote read path which caused Kafka fetch requests to time-out and leave lingering connections. #10828
- rpk disk_irq tuner now provides a warning for a known issue introduced in kernel 5.17, where instances utilizing MSI IRQ may encounter an empty IRQs list in sysfs, which causes hwloc segmentation faults and results in a tuner failure. #10864
- An issue is fixed where clients might see a lower partition high watermark than expected if querying very soon after a new raft leader is elected for the partition. #10921
- Fixes inspecting Docker networks on macOS when using podman. #10902
- k8s: allow console to be deleted even if cluster is not configured #11026
- Fixed a bug where ephemeral credentials were stored in controller snapshots #11563
- Schema Registry: Return references for GET /schemas/ids/<id>. #11216
- Fixes a rare situation in which a consumer may get stuck due to an incorrect truncation point #11450
- Fix a possible dangling pointer issue in the storage layer #11436
- rpk: change required kernel version from 4.9 to 3.9 in rpk redpanda check and rpk redpanda start #11502
- Memory consumption for housekeeping on compacted topics is reduced #11681
- Fixed consumers not being able to clear their backlog when using incremental fetch requests and follower fetching #11748
- Fixed the replica selector returning a read replica that may be unavailable #11748
- Prevent uploaded segment from having incorrect archiver term #11892
- Pandaproxy will no longer assert if attempting to consume compressed message batches #10117
- Schema Registry will no longer assert if the _schemas topic contains compressed batches #10117
- Fix bug in calculate_unevenness_error when allocated_replicas on node goes below zero #9870
- Fixed issue where Redpanda will assert if data being consumed by Pandaproxy is not JSON serializable #9867
- The consumer used in our internal kafka client will automatically find the new consumer group coordinator on not_coordinator errors. #9967
- rpk will not modify redpanda.rpc_server_tls property when decoding the redpanda.yaml, which means that it will leave the field as a list or as an element depending on what the user has configured before executing rpk commands. #7719
- Fixed issue with offset translation in rm_stm on snapshot hydrate #10232
- Fix race condition in tm_stm take_snapshot. #9575
- Do not validate console if it's being removed #9164
- Fix a memory leak in the operator #9201
- Fixed an issue that could prevent cloud storage truncation following leadership changes. #9494
- Fixed excessive kvstore writes that could lead to memory fragmentation issues during heavy produce load. #9827
- Fixes an assertion that may happen when a Raft snapshot is delivered to recover a follower having any of the rm, tx or id_allocator stms #9656
- #9507 Fixed automatic rebalancing of replicas stopping too early, before achieving an even distribution of replicas. #9515
- Fixed a bug that would result in read replicas reporting a high watermark that was too high. #9493
- Fixed memory allocation errors when using very large batches with LZ4 compression on compacted topics #9563
- Fixes an out-of-memory condition in the replicate batcher by ensuring that only one flush task is outstanding at any time. #9966
- Fix race between re-uploaded segment compaction and local truncation that prevented the re-upload from succeeding. #9566
- Fixes a violation of the atomicity of the consume-transform-produce loop #9573
- #9602 Made Join and configuration change validation logic consistent #10325
- Improve precision of the adjacent segment merger to avoid unnecessary segment reuploads #9657
- An issue is fixed where time queries on tiered storage partitions could return -1 incorrectly if the queried timestamp was earlier than the start of the log. #9815
- An issue is fixed where time queries on tiered storage partitions using offsets close to the end of the local raft log could sometimes return offsets slightly ahead of the correct offset, if the segment containing the correct offset had already been offloaded to object storage. #9815
- fix consumer group kafka incompatibilities (update offset_fetch to return committed_leader_epoch) #9851
- redpanda_cluster_partitions is incremented per partition, not per replica, on a topic creation #9393
- rpk cluster logdirs no longer panics if there is an error getting a response from Redpanda #11900
- rpk group offset-delete no longer tries to delete offsets for all topics if no empty topics are specified #11900
- rpk group offset-delete no longer tries to delete offsets for empty-name topics #11900
- close Redpanda admin connection after reconciling the Console user and ACL. #11016
- fixed rm_stm resource leak #11597
- fixed a bug in rpk cluster storage recovery which rendered the status command unusable. #9795
- fixed stms interoperation with relaxed consistency semantics #11840
- fixes rare issue that may arise when new node id is assigned to a node with the same set of ip addresses #9254
- k8s: Status.Version is not updated until upgrade to that version is finished #10877
- net: Fix a rare crash during shutdown of a failed connection with outstanding requests #11586
- operator: fixed an issue that caused continuous reconciliations on the custom resource #9950
- rpk bugfix: now rpk debug bundle in k8s works for brokers with TLS enabled. #9320
- rpk now properly handles -o @:-1h to consume from the start to one hour ago #9667
- rpk: Fix a panic when using rpk container with Podman #9133
- rpk: In k8s, the rpk debug bundle --since flag no longer supports non-standard durations (such as day, week, years); only Go standard durations will be accepted. #10002
- rpk: fixes a bug that prevented shell autocompletion from working in zsh and improves the help text for Mac users. #10170
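
For readers unsure what counts as a Go standard duration in the --since note above: Go's time.ParseDuration accepts a sequence of decimal numbers, each with a unit suffix of ns, us (or µs), ms, s, m, or h, so "24h" and "2h45m" are valid while "1d" and "1week" are not. The sketch below approximates that grammar with a regular expression. It is a simplified illustration, not the validation code rpk uses, and the is_go_duration helper is hypothetical.

```python
import re

# Approximation of the grammar accepted by Go's time.ParseDuration:
# an optional sign, then either a bare "0" or one or more
# <decimal number><unit> pairs.
_GO_DURATION = re.compile(
    r"[+-]?(?:0|(?:(?:\d+\.?\d*|\.\d+)(?:ns|us|µs|μs|ms|s|m|h))+)"
)


def is_go_duration(text: str) -> bool:
    """Return True if text looks like a Go standard duration string."""
    return _GO_DURATION.fullmatch(text) is not None


for candidate in ["24h", "2h45m", "1.5h", "1d", "1week"]:
    print(candidate, is_go_duration(candidate))
```

To express a span of days in this format, convert it to hours, for example "72h" for three days.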