This release contains 164 contributions from 29 contributors. We also have 12 new contributors. Thank you all for the contributions!
Some notable changes and improvements in this release are:
- New Parquet mode for Store Gateway
- Configurable OTLP metric suffixes via `-distributor.otlp.add-metric-suffixes`
- Multiple PRW2 bug fixes for data corruption and panics
- Graduate Ruler API, Alertmanager API/sharding, tenant federation, FIFO/Redis cache, instance limits, and memcached DNS-based service discovery from experimental support
- New Overrides API module to control tenant limits via the API
- HATracker memberlist experimental support
- Tenant federation partial response experimental support
- Alertmanager upgraded to v0.31.1 with IncidentIO and Mattermost integrations
- Bucket index enabled by default
What's Changed
- [CHANGE] Ruler: Graduate Ruler API from experimental. #7312
  - Flag: Renamed `-experimental.ruler.enable-api` to `-ruler.enable-api`. The old flag is kept as deprecated.
  - Ruler API is no longer marked as experimental.
- [CHANGE] Alertmanager: Graduate Alertmanager API and sharding from experimental. #7315
  - Flag: Renamed `-experimental.alertmanager.enable-api` to `-alertmanager.enable-api`. The old flag is kept as deprecated.
  - Alertmanager sharding is no longer marked as experimental.
- [CHANGE] Blocks storage: Bucket index is now enabled by default. Disabling the bucket index (`-blocks-storage.bucket-store.bucket-index.enabled=false`) is not recommended for production. #7259
- [CHANGE] Users Scanner: Rename user index update configuration. #7180
  - Flag: Renamed `-*.users-scanner.user-index.cleanup-interval` to `-*.users-scanner.user-index.update-interval`.
  - Config: Renamed `clean_up_interval` to `update_interval` within the `users_scanner` configuration block.
- [CHANGE] Querier: Refactored parquet cache configuration naming. #7146
  - Metrics: Renamed `cortex_parquet_queryable_cache_*` to `cortex_parquet_cache_*`.
  - Flags: Renamed `-querier.parquet-queryable-shard-cache-size` to `-querier.parquet-shard-cache-size` and `-querier.parquet-queryable-shard-cache-ttl` to `-querier.parquet-shard-cache-ttl`.
  - Config: Renamed `parquet_queryable_shard_cache_size` to `parquet_shard_cache_size` and `parquet_queryable_shard_cache_ttl` to `parquet_shard_cache_ttl`.
- [FEATURE] Overrides: Add new Overrides API component and rename old overrides module to `overrides-configs`. #6975
- [FEATURE] HATracker: Add experimental support for `memberlist` and `multi` as a KV store backend. #7284
- [FEATURE] Distributor: Add `-distributor.otlp.add-metric-suffixes` flag. If true, suffixes will be added to the metrics for name normalization. #7286
- [FEATURE] StoreGateway: Introduces a new parquet mode. #7046
- [FEATURE] StoreGateway: Add a parquet shard cache to parquet mode. #7166
- [FEATURE] Distributor: Add a per-tenant flag `-distributor.enable-type-and-unit-labels` that enables adding `__unit__` and `__type__` labels for remote write v2 and OTLP requests. This is a breaking change; the `-distributor.otlp.enable-type-and-unit-labels` flag is now deprecated, operates as a no-op, and has been consolidated into this new flag. #7077
- [FEATURE] Querier: Add experimental projection pushdown support in Parquet Queryable. #7152
- [FEATURE] Ingester: Add experimental active series queried metric. #7173
- [FEATURE] Update Prometheus Alertmanager to v0.31.1 and add new IncidentIO and Mattermost integrations. #7092 #7267
- [FEATURE] Tenant Federation: Add experimental support for partial responses using the `-tenant-federation.allow-partial-data` flag. When enabled, failures from individual tenants during a federated query are treated as warnings, allowing results from successful tenants to be returned. #7232
- [FEATURE] Alertmanager: Add `-alertmanager.disable-replica-set-extension` flag to limit blast radius during config corruption incidents. #7153
- [ENHANCEMENT] Distributor: Add `cortex_distributor_push_requests_total` metric to track the number of push requests by type. #7239
- [ENHANCEMENT] Querier: Add `-querier.store-gateway-series-batch-size` flag to configure the maximum number of series to be batched in a single gRPC response message from Store Gateways. #7203
- [ENHANCEMENT] HATracker: Add `-distributor.ha-tracker.enable-startup-sync` flag. If enabled, the ha-tracker fetches all tracked keys on startup to populate the local cache. #7213
- [ENHANCEMENT] Distributor: Add validation to ensure remote write v2 requests contain at least one sample or histogram. #7201
- [ENHANCEMENT] Ingester: Add support for ingesting Native Histogram with Custom Buckets. #7191
- [ENHANCEMENT] Ingester: Optimize labels out-of-order (ooo) check by allowing the iteration to terminate immediately upon finding the first unsorted label. #7186
- [ENHANCEMENT] Distributor: Skip attaching `__unit__` and `__type__` labels when `-distributor.enable-type-and-unit-labels` is enabled, as these are appended from metadata. #7145
- [ENHANCEMENT] Distributor: Add `cortex_distributor_ingester_push_timeouts_total` metric to track the number of push requests to ingesters that were canceled due to timeout. #7155 #7229
- [ENHANCEMENT] StoreGateway: Add tracing to parquet mode. #7125
- [ENHANCEMENT] Querier: Add a `-querier.parquet-queryable-shard-cache-ttl` flag to add TTL to parquet shard cache. #7098
- [ENHANCEMENT] Ingester: Add `enable_matcher_optimization` config to apply low selectivity matchers lazily. #7063
- [ENHANCEMENT] Distributor: Add a label references validation for remote write v2 request. #7074
- [ENHANCEMENT] Distributor: Add count, spans, and buckets validations for native histogram. #7072
- [ENHANCEMENT] Alertmanager/Ruler: Introduce a user scanner to reduce the number of list calls to object storage. #6999
- [ENHANCEMENT] Ruler: Add DecodingConcurrency config flag for Thanos Engine. #7118
- [ENHANCEMENT] Query Frontend: Add query priority based on operation. #7128
- [ENHANCEMENT] Compactor: Avoid double compaction by cleaning partition files in 2 cycles. #7130 #7209 #7257
- [ENHANCEMENT] Distributor: Optimize memory usage by recycling v2 requests. #7131
- [ENHANCEMENT] Compactor: Avoid double compaction by not filtering delete blocks on real time when using bucketIndex lister. #7156
- [ENHANCEMENT] Upgrade to go 1.25.8 #7164 #7340
- [ENHANCEMENT] Upgraded container base images to `alpine:3.23`. #7163
- [ENHANCEMENT] Ingester: Instrument Ingester CPU profile with userID for read APIs. #7184
- [ENHANCEMENT] Ingester: Add fetch timeout for Ingester expanded postings cache. #7185
- [ENHANCEMENT] Ingester: Add feature flag to collect metrics of how expensive an unoptimized regex matcher is and new limits to protect Ingester query path against expensive unoptimized regex matchers. #7194 #7210
- [ENHANCEMENT] Querier: Add active API requests tracker logging to help with OOMKill troubleshooting. #7216
- [ENHANCEMENT] Compactor: Add partition group creation time to visit marker. #7217
- [ENHANCEMENT] Compactor: Add concurrency for partition cleanup and mark block for deletion #7246
- [ENHANCEMENT] Distributor: Validate metric name before removing empty labels. #7253
- [ENHANCEMENT] Ruler/Ingester: Propagate append hints to discard out of order samples on Ingester #7226
- [ENHANCEMENT] Make `cortex_ingester_tsdb_sample_ooo_delta` metric per-tenant #7278
- [ENHANCEMENT] Distributor: Add dimension `nhcb` to keep track of nhcb samples in `cortex_distributor_received_samples_total` and `cortex_distributor_samples_in_total` metrics.
- [ENHANCEMENT] Distributor: Add `-distributor.accept-unknown-remote-write-content-type` flag. When enabled, requests with unknown or invalid Content-Type header are treated as remote write v1 instead of returning 415 Unsupported Media Type. Default is false. #7293
- [ENHANCEMENT] Ingester: Added `cortex_ingester_ingested_histogram_buckets` metric to track number of histogram buckets ingested per user. #7297
- [ENHANCEMENT] Ring: Reuse timers in lifecycler and backoff loops to reduce allocations. #7270
- [ENHANCEMENT] Ring/KV: Reuse timers in DynamoDB watch loops to avoid per-poll allocations. #7266
- [ENHANCEMENT] Ring/KV: Reuse timers in memberlist client to reduce allocations. #7285
- [ENHANCEMENT] PromQL: Add `holt_winters` backwards compatibility as alias for `double_exponential_smoothing`. #7223
- [ENHANCEMENT] Query Frontend: Add logical plan fragmentation for distributed query execution. #7018
- [ENHANCEMENT] Parquet: Support sharded parquet files in parquet converter and queryable. #7189
- [ENHANCEMENT] Compactor: Add graceful period for compaction groups to prevent compacting recently written blocks. #7182
- [ENHANCEMENT] Query Engine: Add projection pushdown optimizer for improved query performance. #7141
- [ENHANCEMENT] Ruler: Allow ExternalPusher and ExternalQueryable to be specified separately. #7224
- [BUGFIX] Distributor: Add bounds checking for symbol references in Remote Write V2 requests to prevent panics when UnitRef or HelpRef exceed the symbols array length. #7290
- [BUGFIX] Distributor: If remote write v2 is disabled, explicitly return HTTP 415 (Unsupported Media Type) for Remote Write V2 requests instead of attempting to parse them as V1. #7238
- [BUGFIX] Ring: Change DynamoDB KV to retry indefinitely for WatchKey. #7088
- [BUGFIX] Ruler: Add XFunctions validation support. #7111
- [BUGFIX] Querier: Propagate Prometheus info annotations in protobuf responses. #7132
- [BUGFIX] Scheduler: Fix memory leak by properly cleaning up query fragment registry. #7148
- [BUGFIX] Compactor: Add back deletion of partition group info file even if not complete #7157
- [BUGFIX] Query Frontend: Add Native Histogram extraction logic in results cache #7167
- [BUGFIX] Alertmanager: Fix alertmanager reloading bug that removes user template files #7196
- [BUGFIX] Query Scheduler: If `max_outstanding_requests_per_tenant` is updated to a value lower than the current number of requests in the queue, the excess requests (newest ones) are dropped to prevent deadlocks. #7188
- [BUGFIX] Distributor: Return remote write V2 stats headers properly when the request is HA deduplicated. #7240
- [BUGFIX] Cache: Fix Redis Cluster EXECABORT error in MSet by using individual SET commands instead of transactions for cluster mode. #7262
- [BUGFIX] Distributor: Fix an `index out of range` panic in PRW2.0 handler caused by dirty metadata when reusing requests from `sync.Pool`. #7299
- [BUGFIX] Distributor: Fix data corruption in the push handler caused by shallow copying `Samples` and `Histograms` when converting Remote Write V2 requests to V1. #7337
- [BUGFIX] Ingester: Fix panic due to concurrent access to rand in active queried series. #7329
- [BUGFIX] Distributor: Fix request slice not being properly reused in push error paths. #7123
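
For operators updating command lines, the flag renames listed under the [CHANGE] entries can be sketched as a small lookup table. This is an illustrative helper only, not part of Cortex: the names `RENAMED_FLAGS`, `migrate_flag`, and `migrate_args` are hypothetical, the flag prefix used in the demo is a placeholder, and the sketch assumes arguments are written as `-flag=value` or as a bare `-flag`.

```python
# Hypothetical migration helper (not shipped with Cortex).
# The old -> new flag names are copied from the [CHANGE] entries in these notes.
RENAMED_FLAGS = {
    "-experimental.ruler.enable-api": "-ruler.enable-api",
    "-experimental.alertmanager.enable-api": "-alertmanager.enable-api",
    "-querier.parquet-queryable-shard-cache-size": "-querier.parquet-shard-cache-size",
    "-querier.parquet-queryable-shard-cache-ttl": "-querier.parquet-shard-cache-ttl",
}

# The users scanner rename is listed with a wildcard prefix (-*.users-scanner...),
# so it is matched by suffix rather than by full flag name.
OLD_SUFFIX = ".users-scanner.user-index.cleanup-interval"
NEW_SUFFIX = ".users-scanner.user-index.update-interval"


def migrate_flag(arg: str) -> str:
    """Return arg with a renamed flag replaced by its new name.

    Assumes the "-flag=value" or bare "-flag" form; anything else is
    returned unchanged.
    """
    name, sep, value = arg.partition("=")
    if name in RENAMED_FLAGS:
        name = RENAMED_FLAGS[name]
    elif name.endswith(OLD_SUFFIX):
        name = name[: -len(OLD_SUFFIX)] + NEW_SUFFIX
    return name + sep + value


def migrate_args(args):
    """Apply migrate_flag to every argument, preserving order."""
    return [migrate_flag(a) for a in args]


if __name__ == "__main__":
    # "-some-prefix" is a placeholder; substitute the real component prefix.
    for migrated in migrate_args([
        "-experimental.ruler.enable-api=true",
        "-some-prefix.users-scanner.user-index.cleanup-interval=15m",
        "-target=all",
    ]):
        print(migrated)
```

The deprecated `-experimental.*.enable-api` flags are still accepted in this release, so a rewrite like this is optional for them. The notes do not say whether the other old names remain accepted, so check those before upgrading.
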
New Contributors
- @thc1006 made their first contribution in #7068
- @zanderfriz made their first contribution in #7085
- @kishorekg1999 made their first contribution in #7179
- @b-wu26 made their first contribution in #7153
- @sh4shv4t made their first contribution in #7215
- @rice-junhaoyu made their first contribution in #7224
- @Shvejan made their first contribution in #7241
- @venkatchinmay made their first contribution in #7262
- @sandy2008 made their first contribution in #7266
- @archy-rock3t-cloud made their first contribution in #7320
- @siddharthahuja1 made their first contribution in #7286
Full Changelog: v1.20.0...v1.21.0-rc.0