redpanda-data/redpanda v23.2.1 on GitHub

New Features

Added support for follower fetching #9780
Server-side validation of Schema IDs encoded into the header of a record encoded by a serde client library #10650
Tiered Storage now uses a fine-grained local disk cache using segment chunks as opposed to full segments #9426
Improve scalability of Tiered Storage by offloading metadata to cloud storage and fetching it when it's needed #11150, #10560
Redpanda now loads the controller log from a snapshot on startup. This significantly improves startup times of nodes in long-running Redpanda clusters. Enabled on new clusters, disabled on existing clusters. To enable in 23.2.1 contact Redpanda Support #8285
Use multiple partitions in the transaction manager topic. This significantly improves throughput when using the Kafka Transactions API. It is configurable with transaction_coordinator_partitions option and can be set only on cluster creation #9362
Adds a built-in CPU profiler configured with the cpu_profiler_enabled and cpu_profiler_sample_rate_ms cluster properties #10708
Adds support for the deleteRecords Kafka API #10061
Adds support for rpk profile to manage multiple rpk configuration profiles #10528
Schema Registry: Support for Protobuf ‘well known types’ #7431
Adds a ‘force reconfiguration’ API to uncleanly reconfigure a raft group into a smaller replica set. Meant to be used as an escape hatch, or by equivalent high-level abstractions, to recover partitions from an unrecoverable state. #9096
Kubernetes Operator: Support mTLS with customer provided client CA at Panda Proxy and Schema Registry #9863
Redpanda OperatorV2 (BETA). Note: Helm-based deployment is still recommended at this point. An upcoming patch release will deliver a GA version of the OperatorV2. #10317
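
To illustrate the server-side Schema ID validation item above, here is a minimal sketch of the kind of check involved. It assumes the widely used Confluent-style serde framing (a zero magic byte followed by a 4-byte big-endian schema ID at the start of the payload). The function names are hypothetical and this is not Redpanda's implementation.

```python
import struct

MAGIC_BYTE = 0  # assumption: Confluent-style serde framing marker


def extract_schema_id(payload: bytes) -> int:
    """Return the schema ID from a serde-framed payload, or raise ValueError."""
    if len(payload) < 5:
        raise ValueError("payload too short to carry a schema ID")
    # ">BI" = big-endian, one unsigned byte (magic) + one unsigned 32-bit int (ID)
    magic, schema_id = struct.unpack(">BI", payload[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id


def is_valid_record(payload: bytes, known_ids: set[int]) -> bool:
    """Hypothetical check: accept a record only if its schema ID is known."""
    try:
        return extract_schema_id(payload) in known_ids
    except ValueError:
        return False
```

A broker-side validator would compare the extracted ID against the Schema Registry; the `known_ids` set stands in for that lookup here.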

RPK CLI improvements

Adds rpk topic trim-prefix, rpk support for the DeleteRecords Kafka API request. You can now trim records to a specific offset in the given topic/partition. #11781
rpk: adds rpk topic describe-storage, which describes the cloud storage status of a given topic #9500, #8894
Introduce SSO to rpk cloud login. #9864
Adds support for rpk cloud auth to manage multiple rpk cloud auths #10528
Automatically populates a new cloud profile when you run rpk cloud login #10528
You can now generate a sample starter application to produce and consume using rpk generate app #12037
rpk now generates a password automatically when running rpk acl user create with no password #11235
Add the ability to pass rpk TLS configuration flags to the legacy dashboard. #5623
rpk now downloads the Grafana dashboard from our repo https://github.com/redpanda-data/observability #9662
rpk now supports a -X flag. This will eventually replace --brokers, --user, --password, etc., i.e., all flags that configure how rpk talks to brokers. -X supports -X list for a short list of available configuration parameters, -X help for a full list, and autocomplete. #9981
rpk now supports --rack to specify the rack to consume from, which also opts into follower fetching #11105
adds --print-commits / -c to rpk group describe to opt into printing only the commits & lag section #12113
now you can update user credentials with rpk acl user update. #11289

Other Notable Improvements

The Redpanda ‘Admin API’ is now documented publicly
A new admin API endpoint is added at /v1/cloud_storage/manifest/{topic}/{partition} which allows for retrieving the in-memory manifest for a partition in JSON format. #9745
Cloud storage connections are balanced between the shards according to workload #9410
pandaproxy: Support users with SCRAM-SHA-512 for authentication #11425
Responses to Kafka DescribeLogDirs requests now include remote tiered storage space utilization. Remote space utilization is reported as a special directory called "remote://{your_bucket_name}". #8145
Remove ghost Redpanda Node IDs #9750
adds usage metrics to each Kafka request handler #10623
Add admin API command that changes --blocked-reactor-notify-ms parameter on the fly. #9331
Adds a large_allocation_warning_threshold node config option to enable warnings to be logged on large allocations. #8596
Add mechanism that can be used to truncate tiered-storage using kafka offset. #9994
Hierarchical hard/soft partition allocation constraints #9767
Log retention rules applied eagerly in low disk space scenarios. #10658
The vectorized_memory_available_memory metric is now available on the /metrics endpoint in addition to /public_metrics. #10018
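
As a rough illustration of the new manifest endpoint listed above, the sketch below builds the request URL and fetches the JSON document. The endpoint path comes from the note itself. The helper names, the base URL, and the port 9644 used in the usage note are assumptions for illustration, and the response is assumed to be a JSON object.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def manifest_url(base_url: str, topic: str, partition: int) -> str:
    """Build the admin API URL for a partition's in-memory manifest."""
    # Path taken from the release note: /v1/cloud_storage/manifest/{topic}/{partition}
    return (
        f"{base_url.rstrip('/')}/v1/cloud_storage/manifest/"
        f"{quote(topic, safe='')}/{partition}"
    )


def fetch_manifest(base_url: str, topic: str, partition: int,
                   timeout: float = 5.0) -> dict:
    """Fetch and decode the manifest (hypothetical helper; needs a reachable broker)."""
    with urlopen(manifest_url(base_url, topic, partition), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_manifest("http://localhost:9644", "orders", 0)` would request the manifest for partition 0 of a topic named `orders`, assuming the admin API listens on that address.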

Bug Fixes

A bug is fixed where tiered storage topics might delay uploading manifests when partition leadership changes shortly after segments are uploaded #9154
A spurious "Leaving S3 objects behind" log message is no longer printed when deleting non-tiered-storage partitions. #9405
Always reconcile Node ID annotation in Redpanda POD #9749
Avoids a crash when attempting to create a read replica topic while cloud storage is not configured. #11792
Changes to enable deployment of new operator, not by default. #11651
Convert tm_snapshot to fragmented_vector to reduce memory pressure. #10035
Decommission during scale down uses correct Redpanda Node ID #11312
Disable adjacent segment merging when the partition is not a leader #10538
Do not remove the controller broker on not_controller error from kafka::client::create_topic #9830
Fix a bug in the re-uploading of compacted segment that could lead to consumers getting blocked. #9404
Fix a bug where recovered topics would use the retention settings specified at creation time (ignoring updates after that point). #10297
Fix a rare race between segment rolling and the readers cache which caused spurious, temporary dead-locks #9676
Fix an issue with compacted segment reupload that could cause self-concatenated segments to be reuploaded. #9576
Fix cases where sending very large batches to compacted topics using LZ4 or gzip compression could result in bad_alloc errors. #9598
Fix for OffsetForLeaderEpoch returning a value when the requestedEpoch is larger than the highest known epoch. #12063
Fix memory leak upon shutdown which can occur when using tiered storage with ABS and TLS. #9998
Fix offset translation errors after topic recovery #9919
Fix offset translation failure that could be triggered after topic recovery #9865
Fix race between compaction and application of retention to the local log. This occurred when compaction happened below the start offset of the log. It did not have a noticeable impact upon user workloads. #9483
Fix race condition during forced segment roll #10506
Fix rare histogram metric reporting bug which could lead to Redpanda crashing or terminating the HTTP connection on which it was serving the metric request. This only occurred when metrics aggregation was enabled. #9192
Fix upload housekeeping errors on log truncation #9409
Fixed a bug in which the segment merging could run from read replicas. #9685
Fixed a bug that would cause Redpanda to return an invalid offset when consuming from a term that falls below the beginning of the local log when reading from tiered storage is disabled. #9264
Fixed a bug that would previously prevent read replicas from progressing if cloud_storage_enable_remote_write is false. #9684
Fixed a potential invalid memory access when iterating through segments with timestamps in the future. #11956
Fixed an issue where sq or sq-split mode may be chosen for NIC IRQ affinity in the redpanda tuner, resulting in core/shard 0 becoming a bottleneck for some high packet-per-second loads. #10568
Fixed another bug where multiple configuration updates would cause an inconsistency upon next reload of usage state #9917
Fixed not being able to elect a leader when the only voter is in maintenance mode #10800
Fixes a bug in our http client that may crash redpanda in exceptional cases #9903
Fixes a bug where usage_manager would report incorrect cloud storage metrics when queried on the leader node. #9917
Fixes an issue with the archival upload path that could contribute to data loss when archival metadata updates took a long time to be replicated and applied. #9633
Limits on tiered storage read concurrency are enforced more strictly, to improve stability when clients issue very large numbers of concurrent fetches, or do time queries on topics with large numbers of partitions. #11914
Make tiered-storage metadata handling more strict during rolling upgrades #12001
Makes starting Redpanda faster when a long controller log contains redundant topic property updates #9739
Prevent possible data loss in situation when the same segment is added twice to the manifest #9597
Regenerate the console configmap after RP cluster spec is changed so that the console works properly. #11211
Schema Registry: Fix a bug in GET /subjects/<subject>/versions/latest that would previously not find the latest non-deleted version. #10543
Fixes an issue where, if requireClientAuth is enabled and clientCACertRef is not set for a Panda Proxy or Schema Registry TLS listener, the Redpanda pods would not come up. #10546
The redpanda_cloud_storage_cache_op_miss metric was not showing the right value. #9333
A bug is fixed where if a PUT to object storage was in flight during shutdown, Redpanda might incorrectly record a failed upload as successful. #10107
A stability issue is fixed where many concurrent Produce requests using very large compressed batches could exhaust memory. #10235
A stability issue is fixed where very large ZSTD-compressed batches could exhaust memory #10235
Fix post-recovery Raft bootstrap #10535
Prevent allocation failures with many idempotent producers #10250
Fixes an assertion during archival when a leadership change occurs #10244
Fixed an issue where the metric redpanda_kafka_consumer_group_consumers was reporting double the real count of consumers. #10320
Fix over allocation in metadata dissemination leadership update #10585
bugfix: add correct file metadata to files in the debug bundle generated by rpk debug bundle #10705
Fix bug caused by unexpected exception on the remote read path which caused Kafka fetch requests to time-out and leave lingering connections. #10828
rpk disk_irq tuner now provides a warning for a known issue introduced in kernel 5.17 where instances utilizing MSI IRQ may encounter an empty IRQs list in sysfs, which causes hwloc segmentation faults and results in a tuner failure. #10864
An issue is fixed where clients might see a lower partition high watermark than expected if querying very soon after a new raft leader is elected for the partition. #10921
Fixes inspecting Docker networks on macOS when using podman. #10902
k8s: allow console to be deleted even if cluster is not configured #11026
Fixed a bug where ephemeral credentials were stored in controller snapshots #11563
Schema Registry: Return references for GET /schemas/ids/<id>. #11216
fixes a rare situation in which a consumer may get stuck due to an incorrect truncation point #11450
Fix a possible dangling pointer issue in the storage layer #11436
rpk: change required kernel version from 4.9 to 3.9 in rpk redpanda check and rpk redpanda start #11502
Memory consumption for housekeeping on compacted topics is reduced #11681
fixed not being able to clear consumer backlog when using incremental fetch requests and follower fetching #11748
fixed returning read replica that may be unavailable from replica selector #11748
Prevent uploaded segment from having incorrect archiver term #11892
Pandaproxy will no longer assert if attempting to consume compressed message batches #10117
Schema Registry will no longer assert if the _schemas topic contains compressed batches #10117
Fix bug in calculate_unevenness_error when allocated_replicas on node goes below zero #9870
Fixed issue where Redpanda will assert if data being consumed by Pandaproxy is not JSON serializable #9867
The consumer used in our internal kafka client will automatically find the new consumer group coordinator on not_coordinator errors. #9967
rpk will not modify redpanda.rpc_server_tls property when decoding the redpanda.yaml, which means that it will leave the field as a list or as an element depending on what the user has configured before executing rpk commands. #7719
Fixed issue with offset translation in rm_stm on snapshot hydrate #10232
Fix race condition in tm_stm take_snapshot. #9575
Do not validate console if it's being removed #9164
Fix a memory leak in the operator #9201
Fixed an issue that could prevent cloud storage truncation following leadership changes. #9494
Fixed excessive kvstore writes that could lead to memory fragmentation issues during heavy produce load. #9827
fixes an assertion that may happen when a Raft snapshot is delivered to recover a follower having any of rm tx or id_allocator stms #9656
#9507 Fixed automatic rebalancing of replicas stopping too early, before achieving an even distribution of replicas. #9515
Fixed a bug that would result in read replicas reporting a high watermark that was too high. #9493
Fixed memory allocation errors when using very large batches with LZ4 compression on compacted topics #9563
This change fixes an out of memory in the replicate batcher by ensuring that only one flush task is outstanding at any time. #9966
Fix race between re-uploaded segment compaction and local truncation that prevented the re-upload from succeeding. #9566
Fixes violation of the atomicity of the consumer - transform - produce loop #9573
#9602 Made Join and configuration change validation logic consistent #10325
Improve precision of the adjacent segment merger to avoid unnecessary segment reuploads #9657
An issue is fixed where time queries on tiered storage partitions could return -1 incorrectly if the queried timestamp was earlier than the start of the log. #9815
An issue is fixed where time queries on tiered storage partitions using offsets close to the end of the local raft log could sometimes return offsets slightly ahead of the correct offset, if the segment containing the correct offset had already been offloaded to object storage. #9815
fix consumer group kafka incompatibilities (update offset_fetch to return committed_leader_epoch) #9851
redpanda_cluster_partitions is incremented per partition, not per replica, on a topic creation #9393
rpk cluster logdirs no longer panics if there is an error getting a response from Redpanda #11900
rpk group offset-delete no longer tries to delete offsets for all topics if no empty topics are specified #11900
rpk group offset-delete no longer tries to delete offsets for empty-name topics #11900
close Redpanda admin connection after reconciling the Console user and ACL. #11016
fixed rm_stm resource leak #11597
fixed a bug in rpk cluster storage recovery which rendered the status command unusable. #9795
fixed stms interoperation with relaxed consistency semantics #11840
fixes rare issue that may arise when new node id is assigned to a node with the same set of ip addresses #9254
k8s: Status.Version is not updated until upgrade to that version is finished #10877
net: Fix a rare crash during shutdown of a failed connection with outstanding requests #11586
operator: fixed an issue that caused continuous reconciliations on the custom resource #9950
rpk bugfix: now rpk debug bundle in k8s works for brokers with TLS enabled. #9320
rpk now properly handles -o @:-1h to consume from the start to one hour ago #9667
rpk: Fix a panic when using rpk container with Podman #9133
rpk: In k8s, the rpk debug bundle --since flag no longer supports non-standard durations (such as day, week, years); only Go standard durations will be accepted. #10002
rpk: fixes a bug that prevented shell autocompletion from working in zsh and improves the help text for Mac users. #10170
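
For the rpk debug bundle --since change above, "Go standard durations" means strings built from the units accepted by Go's time.ParseDuration (ns, us, ms, s, m, h), so values like 24h work while 1d does not. The sketch below approximates that rule with a regular expression. The function name is hypothetical and this is an illustration of the format, not rpk's code.

```python
import re

# One or more <number><unit> groups with an optional leading sign,
# using the unit suffixes accepted by Go's time.ParseDuration.
_GO_DURATION = re.compile(
    r"[+-]?(?:(?:\d+(?:\.\d*)?|\.\d+)(?:ns|us|µs|μs|ms|s|m|h))+"
)


def is_go_duration(text: str) -> bool:
    """Return True if text looks like a Go standard duration string."""
    if text in ("0", "+0", "-0"):  # Go accepts a bare zero without a unit
        return True
    return _GO_DURATION.fullmatch(text) is not None
```

Under this check, a week would be written as 168h rather than 1w or 7d.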

Full Changelog:

v23.1.13...v23.2.1