This release contains 1433 PRs from 97 authors, including new contributors Alex Weaver, Ali Asghar, Anas, Andy Hay, Bernd Hois, Charbel Mitri, Chris, CR, Dominik Eisenberg, Elsa Adjei, Federico Torres, francoposa, Gerard van Engelen, HuanMeng, imishchuk-tsgs, Jara Suárez de Puga García, Juliette O, Justin Grothe, Kai Udo, Karl Skewes, Karol Chrapek, Kim Nylander, Kyle Fazzari, Lars Lehtonen, Laurent Dufresne, lif, Manas Srivastava, Manuel Alonso, Mariell Hoversholm, Nico Pazos, Nikolai Tikhonov, Olzhas, Ömer Çengel, Pavel Panfilov, psauvage, Q, Rodrigo Kellermann, Sander Ruitenbeek, Satyam Raj, sherinabr, Shouhei, Soya Kubodera, srpvpn, Thimo Soet. Thank you!
Grafana Mimir version 3.1.0-rc.0 release notes
Grafana Labs is excited to announce version 3.1 of Grafana Mimir.
The highlights that follow include the top features, enhancements, and bug fixes in this release. For the complete list of changes, refer to the CHANGELOG.
Features and enhancements
Grafana Mimir version 3.1 includes the following key features and enhancements.
More Kafka options for Ingest storage
Ingest storage now supports additional ways to authenticate with Kafka through the new -ingest-storage.kafka.sasl-mechanism flag, including SCRAM, OAUTHBEARER, and AWS MSK IAM authentication. In addition, new -ingest-storage.kafka.tls* flags allow connecting to Kafka clusters over TLS, including mTLS.
You can also configure multiple Kafka seed brokers via comma-separated values in -ingest-storage.kafka.address and enable rack-aware consumption with -ingest-storage.kafka.client-rack.
Separate ingestion limits by tenant metadata
Distributors can now derive limits from tenant metadata. This lets operators apply separate limits to subsets of write requests that belong to the same tenant, for example to prioritize some sources of metrics over others.
Clients may pass tenant metadata in the X-Scope-OrgID header using the format tenantID:key1=value1:key2=value2, and operators may define per-metadata overrides in the runtime configuration.
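The header format above can be illustrated with the following Go sketch, which splits an X-Scope-OrgID value into a tenant ID and a metadata map. The function name parseOrgIDWithMetadata is hypothetical. The way it handles malformed entries is also an assumption. This is not Mimir's own parser.

```go
package main

import (
	"fmt"
	"strings"
)

// parseOrgIDWithMetadata splits "tenantID:key1=value1:key2=value2" into the
// tenant ID and a key/value metadata map.
// Hypothetical helper for illustration only; not Mimir's actual parser.
func parseOrgIDWithMetadata(header string) (string, map[string]string, error) {
	parts := strings.Split(header, ":")
	tenantID := parts[0]
	if tenantID == "" {
		return "", nil, fmt.Errorf("empty tenant ID in %q", header)
	}
	metadata := make(map[string]string, len(parts)-1)
	for _, kv := range parts[1:] {
		// Each metadata entry must look like key=value (assumption of this sketch).
		key, value, ok := strings.Cut(kv, "=")
		if !ok || key == "" {
			return "", nil, fmt.Errorf("malformed metadata entry %q", kv)
		}
		metadata[key] = value
	}
	return tenantID, metadata, nil
}

func main() {
	tenant, md, err := parseOrgIDWithMetadata("team-a:source=batch:priority=low")
	if err != nil {
		panic(err)
	}
	fmt.Println(tenant, md["source"], md["priority"]) // team-a batch low
}
```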
Mimir Query Engine (MQE) improvements
MQE continues to receive significant optimizations in this release:
- Support for the experimental PromQL extended range selector modifiers smoothed and anchored, enabled with -query-frontend.enabled-promql-extended-range-selectors=smoothed,anchored.
- Optimization passes for common subexpression elimination, subset selector elimination, projection pushdown, and multi-aggregation without buffering.
- Improved per-query memory consumption limit enforcement in a variety of scenarios.
- Experimental support for splitting and caching intermediate results for functions over range vectors in instant queries.
- Support for the experimental info() PromQL function.
Zone-aware memberlist routing
A new experimental zone-aware routing feature for memberlist reduces cross-AZ data transfer by routing gossip messages within the local availability zone when possible. Configure it with -memberlist.zone-aware-routing.* flags.
Additional improvements
Grafana Mimir 3.1 also includes:
- Store-gateways now verify CRC32 checksums for 1 out of every 128 chunks read from object storage and the chunks cache to detect corruption.
- GCS uploads are now optionally retryable, with configurable max retries per storage backend.
- Disk interaction has been removed when loading ruler rules.
- Rule evaluation failures now include a reason label (operator or user) in prometheus_rule_evaluation_failures_total for better error classification.
- Query blocking via blocked_queries is now stable and no longer experimental. It supports blocking queries whose time range exceeds a duration (time_range_longer_than) and queries whose step is smaller than a threshold (step_size_shorter_than).
- The per-tenant postings-for-matchers cache is now stable.
- Out-of-order ingestion support is now stable, configured via -ingester.out-of-order-time-window.
- The -alertmanager.utf8-strict-mode-enabled flag is now stable.
- The query-scheduler now drains the queue before exiting during shutdown.
- Distributors support zone-aware rate limiting via -distributor.ring.instance-availability-zone. When this is configured, the global ingestion rate limit is divided by the number of zones and by the number of distributors in the local zone, instead of by the total number of distributors.
- Default ingest storage configuration now enables concurrency settings for improved throughput.
- Optional per-tenant max limits for label name and label value requests via max_label_names_limit and max_label_values_limit.
- Runtime configuration can now be loaded from HTTP URLs in addition to local files via -runtime-config.file. This may reduce configuration propagation times.
- Blocked queries configuration is now validated at load time.
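The Go sketch below makes the zone-aware rate limiting arithmetic concrete. It computes the share of a tenant's global ingestion rate limit that one distributor enforces, following the formula in the changelog entry for #14515. The helper name is hypothetical, and so is the example cluster: three zones with 2, 5, and 5 distributors.

```go
package main

import "fmt"

// perDistributorRateLimit returns the share of a tenant's global ingestion
// rate limit that a single distributor enforces (hypothetical helper).
//
// Without zone awareness, the limit is split evenly across all distributors.
// With zone awareness, it is first split across zones and then across the
// distributors in the local zone.
func perDistributorRateLimit(globalLimit float64, zones, localZoneDistributors, totalDistributors int, zoneAware bool) float64 {
	if zoneAware && zones > 0 && localZoneDistributors > 0 {
		return globalLimit / float64(zones) / float64(localZoneDistributors)
	}
	return globalLimit / float64(totalDistributors)
}

func main() {
	const global = 300000.0
	// Hypothetical cluster: zone-a has 2 distributors, zone-b and zone-c have 5 each.
	fmt.Println(perDistributorRateLimit(global, 3, 2, 12, true))  // 50000
	fmt.Println(perDistributorRateLimit(global, 3, 5, 12, true))  // 20000
	fmt.Println(perDistributorRateLimit(global, 3, 2, 12, false)) // 25000
}
```

With the zone-aware formula, the distributors in each zone together enforce one third of the global limit (100,000 samples per second here). If clients spread traffic evenly across zones, which is an assumption of this example, the two distributors in the smaller zone are therefore not throttled at 25,000 each.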
Important changes
Grafana Mimir 3.1 introduces several updates that change default behavior and configuration. Review these changes before upgrading:
- Experimental support for disabling ring heartbeats and heartbeat timeouts has been removed.
- The -target=flusher mode has been removed; use the /ingester/flush HTTP endpoint instead.
- Uploaded TSDB blocks must now use v2 of the index file format. Store-gateways no longer generate index-headers from v1 index format blocks.
- Per-step stats are no longer supported when MQE is enabled. The -query-frontend.cache-samples-processed-stats flag is deprecated and has no effect.
- The -querier.response-streaming-enabled flag has been removed; active series responses are now always streamed.
- The cortex_ingest_storage_writer_buffered_produce_bytes metric has been renamed to cortex_ingest_storage_writer_buffered_produce_bytes_distribution. The original name is now used by a new gauge that exports the buffer size.
- Metric cortex_ingester_owned_target_info_series has been removed.
- The cost_attribution_labels configuration option has been removed; use cost_attribution_labels_structured instead.
- -querier.prefer-availability-zone has been renamed to -querier.prefer-availability-zones and now accepts a comma-separated list.
- The per-query memory consumption limit now considers more sources of memory consumption. As a result, queries that previously succeeded may now fail due to exceeding the memory consumption limit.
- The following flags have been removed:
  - -distributor.metric-relabeling-enabled
  - -compactor.no-blocks-file-cleanup-enabled
  - -compactor.in-memory-tenant-meta-cache-size
  - -blocks-storage.bucket-store.index-header.eager-loading-startup-enabled
  - *.memcached.dns-ignore-startup-failures
Experimental features
Grafana Mimir 3.1 includes some features that are experimental. Use these features with caution and report any issues that you encounter:
- New usage-tracker component to enforce series limits before data is ingested.
- Zone-aware memberlist routing to reduce cross-AZ data transfer.
- Query planning in query-frontends with distributed execution across queriers.
- Support in MQE for experimental PromQL extended range selector modifiers (smoothed, anchored).
- Support in MQE for the experimental info() PromQL function.
- MQE optimization passes: multi-aggregation, subset selector elimination, common subexpression elimination for range vector expressions.
- Per-zone store-gateway shard size (-store-gateway.tenant-shard-size-per-zone).
- Running ingesters with no tokens in the ring when ingest storage is enabled (-ingester.ring.num-tokens=0).
- Per-sample HA deduplication (-distributor.ha-tracker.per-sample-dedupe).
- Per-tenant early head compaction for ingesters based on owned series count.
- Store-gateway excluded zones (-store-gateway.sharding-ring.excluded-zones).
- Controlling OTLP metric name suffix addition and translation strategy via request headers, gated by -api.otlp-translation-headers-enabled.
- Memberlist propagation delay tracker (-memberlist.propagation-delay-tracker.enabled).
- Reporting the number of samples read per query in MQE.
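The Go sketch below contrasts per-request HA deduplication with the experimental per-sample mode (-distributor.ha-tracker.per-sample-dedupe). In per-request mode, one decision based on the first series applies to the whole request. In per-sample mode, each series is evaluated on its own. Everything in the sketch is hypothetical and simplified:
- the Series type and the function names,
- the placeholder label names cluster and __replica__,
- the rule that the first replica seen for a cluster becomes elected.

The sketch ignores failover and any shared state the real HA tracker keeps.

```go
package main

import "fmt"

// Series is a hypothetical, simplified time series identified only by its labels.
type Series struct {
	Labels map[string]string
}

// acceptSeries reports whether one series passes HA deduplication.
// Series without both HA labels are always accepted (assumption).
// If no replica is elected yet for the cluster, the first one seen is elected.
func acceptSeries(s Series, clusterLabel, replicaLabel string, elected map[string]string) bool {
	cluster, hasCluster := s.Labels[clusterLabel]
	replica, hasReplica := s.Labels[replicaLabel]
	if !hasCluster || !hasReplica {
		return true
	}
	want, ok := elected[cluster]
	if !ok {
		elected[cluster] = replica
		return true
	}
	return want == replica
}

// dedupePerRequest makes one keep/drop decision for the whole request,
// based only on the first series.
func dedupePerRequest(req []Series, clusterLabel, replicaLabel string, elected map[string]string) []Series {
	if len(req) == 0 || acceptSeries(req[0], clusterLabel, replicaLabel, elected) {
		return req
	}
	return nil
}

// dedupePerSeries evaluates every series independently, so a request mixing
// several HA pairs keeps only the series from elected replicas.
func dedupePerSeries(req []Series, clusterLabel, replicaLabel string, elected map[string]string) []Series {
	kept := make([]Series, 0, len(req))
	for _, s := range req {
		if acceptSeries(s, clusterLabel, replicaLabel, elected) {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	elected := map[string]string{"c1": "r1", "c2": "r1"}
	// A federated request that mixes series from two HA clusters.
	req := []Series{
		{Labels: map[string]string{"cluster": "c1", "__replica__": "r1"}},
		{Labels: map[string]string{"cluster": "c2", "__replica__": "r2"}},
		{Labels: map[string]string{"cluster": "c2", "__replica__": "r1"}},
	}
	fmt.Println(len(dedupePerRequest(req, "cluster", "__replica__", elected))) // 3
	fmt.Println(len(dedupePerSeries(req, "cluster", "__replica__", elected)))  // 2
}
```

In this example, the per-request decision keeps the non-elected c2/r2 series because the first series happens to come from an elected replica. The per-series decision drops it.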
Bug fixes
For a detailed list of bug fixes, refer to the CHANGELOG.
Helm chart improvements
The Grafana Mimir Helm chart is released independently. Refer to the Grafana Mimir Helm chart documentation.
Changelog
3.1.0-rc.0
Grafana Mimir
- [CHANGE] Query-frontend: Renamed the minimum_step_size filter in the blocked_queries configuration to step_size_shorter_than to follow the naming convention of time_range_longer_than. Users with minimum_step_size in their runtime configuration must rename the field. #15081
- [CHANGE] Query-frontend: blocked_queries configuration is now validated at load time; a configuration error is returned if a rule has an empty pattern, or has regex: true with a pattern that is not a valid regular expression. #14978
- [CHANGE] Ingester: Changed default value of -include-tenant-id-in-profile-labels from false to true. #13375
- [CHANGE] Hash ring: removed experimental support for disabling heartbeats (setting -*.ring.heartbeat-period=0) and heartbeat timeouts (setting -*.ring.heartbeat-timeout=0). These configurations are now invalid. #13104
- [CHANGE] Distributor: removed experimental flag -distributor.metric-relabeling-enabled. #13143
- [CHANGE] Compactor: removed experimental flag -compactor.no-blocks-file-cleanup-enabled. Cleanup of remaining files when no blocks exist is now always enabled. #13108
- [CHANGE] Ruler: Add "unknown" alert rule state to alerts and rules on the GET <prometheus-http-prefix>/api/v1/alerts endpoint. Alerts are in the "unknown" state when they haven't yet been evaluated since the ruler started. #13060
- [CHANGE] All: remove experimental feature that allowed disabling ring heartbeats and timeouts. #13142
- [CHANGE] Store-gateway: Removed experimental -blocks-storage.bucket-store.index-header.eager-loading-startup-enabled flag. The eager loading feature is now always enabled when lazy loading is enabled. #13126
- [CHANGE] Compactor: remove experimental -compactor.in-memory-tenant-meta-cache-size. #13131
- [CHANGE] Distributor: Replace per-label-value warnings about exceeded label value length with an aggregated summary per metric and label name. #13189
- [CHANGE] Limits: removed the experimental cost_attribution_labels configuration option. Use cost_attribution_labels_structured instead. #13286
- [CHANGE] Ingester: Renamed cortex_ingest_storage_writer_buffered_produce_bytes metric to cortex_ingest_storage_writer_buffered_produce_bytes_distribution (Prometheus summary), and added cortex_ingest_storage_writer_buffered_produce_bytes metric that exports the buffer size as a Prometheus Gauge. #13414
- [CHANGE] Querier and query-frontend: Removed support for per-step stats when MQE is enabled. #13582
- [CHANGE] Querier: Make the experimental enable_delayed_name_removal setting configurable as a per-tenant limit instead of a global flag. #13926
- [CHANGE] Compactor: Require that uploaded TSDB blocks use v2 of the index file format. #13815
- [CHANGE] Store-gateway: Remove support for generating index-headers from TSDB blocks that use v1 of the index file format. #13824
- [CHANGE] Query-frontend: Removed support for calculating the 'cache-adjusted samples processed' query statistic. The -query-frontend.cache-samples-processed-stats CLI flag has been deprecated and will be removed in a future release. Setting it now has no effect. #13582
- [CHANGE] Querier: Renamed experimental flag -querier.prefer-availability-zone to -querier.prefer-availability-zones and changed it to accept a comma-separated list of availability zones. All zones in the list are given equal priority when querying ingesters and store-gateways. #13756 #13758
- [CHANGE] Ingester: Stabilize the experimental flag -ingest-storage.write-logs-fsync-before-kafka-commit-concurrency, used to fsync write logs before the offset is committed to Kafka. Remove -ingest-storage.write-logs-fsync-before-kafka-commit-enabled, since this behavior is now always enabled. #13591
- [CHANGE] Query-frontend: Remove experimental flags -query-frontend.shard-active-series-queries and -query-frontend.use-active-series-decoder. These settings are now always enabled. #15091
- [CHANGE] Ingester: Remove metric cortex_ingester_owned_target_info_series; if needed, this is better done as a custom active series tracker. #13831
- [CHANGE] Querier: Remove experimental flag -querier.response-streaming-enabled; active series responses are now always streamed to query-frontends. #14095 #14114
- [CHANGE] Store-gateway: Warn when loading index headers based on TSDB blocks that use v1 of the index file format. #13834
- [CHANGE] Cache: Remove the experimental setting -<prefix>.memcached.dns-ignore-startup-failures, which allowed a failure to discover Memcached servers to be treated as a soft error. A failure to discover Memcached servers is now always a hard error. #14038
- [CHANGE] Ingester: Removed the -target=flusher mode. If you need to flush ingester data, use the /ingester/flush HTTP endpoint instead. #14032
- [CHANGE] Limits: Add new limit -validation.max-active-series-additional-custom-trackers (default: 0) to control the maximum number of additional custom trackers per tenant. This limit only applies to active_series_additional_custom_trackers, not to -ingester.active-series-custom-trackers. Set to 0 (the default) to disable the limit. #14226 #14256
- [CHANGE] Querier and query-frontend: The experimental -querier.mimir-query-engine.enable-common-subexpression-elimination-for-range-vector-expressions-in-instant-queries and -querier.mimir-query-engine.enable-skipping-histogram-decoding flags have been removed. Both were previously enabled by default and now cannot be disabled. #14237
- [CHANGE] Query-frontend: The per-query memory consumption limit now considers the estimated memory consumed by buffered messages received from queriers but not yet consumed. #14775
- [CHANGE] Querier: The flag -querier.filter-queryables-enabled is deprecated and will be removed in Mimir 3.3. #14843
- [CHANGE] Ingest storage: The cortex_ingest_storage_writer_latency_seconds metric now tracks the latency to write an incoming request to all Kafka partitions in a single call, instead of tracking latency individually for each partition. #14898
- [CHANGE] Ingest storage: Deprecated the -ingest-storage.kafka.write-clients CLI flag. The flag is now ignored and Mimir always uses a single Kafka write client. The flag will be removed in Mimir 3.3. #14903
- [CHANGE] Alertmanager: -alertmanager.grafana-alertmanager-idle-grace-period has been renamed to -alertmanager.strict-initialization-idle-grace-period. #14960
- [CHANGE] Query-frontend: The per-query memory consumption limit now spans all time-split sub-queries when MQE is enabled, rather than applying to each split query individually. #14980
- [CHANGE] Query-frontend: Rewriting middleware now runs before user-injected middlewares. #15111
- [FEATURE] Distributor: Add experimental per-tenant limit -distributor.ha-tracker.per-sample-dedupe (per-tenant ha_tracker_per_sample_dedupe) to evaluate HA deduplication for each time series within a write request, rather than making a single decision based on the first series. This enables correct behavior for mixed-label requests (e.g. Prometheus federation, metrics proxies) without affecting standard setups that have uniform HA labels within a single request. Disabled by default. #15064
- [FEATURE] Distributor: Add -validation.enforce-out-of-order-window-on-distributor per-tenant option. When enabled and past_grace_period is 0, distributors reject samples older than out_of_order_time_window, matching ingester behavior, without relying on a small past_grace_period. #15090
- [FEATURE] Runtime config: Support loading configuration from http:// and https:// URLs in addition to local files via -runtime-config.file. Added -runtime-config.http-client-timeout (default 30s) to control the HTTP fetch timeout. Added -runtime-config.http-client-cluster-validation.label (inheritable from -common.client-cluster-validation.label) to send the X-Cluster validation header when fetching from a cluster-validated HTTP endpoint. #15052 #15244
- [FEATURE] Distributor: Derive limits based on tenant metadata. Supported limits are max_active_series_per_user, ingestion_rate, ingestion_burst_size, ingestion_burst_factor, otel_metric_suffixes_enabled, name_validation_scheme, and otel_translation_strategy. #14289
- [FEATURE] Distributor: Add cortex_distributor_request_body_compression_ratio histogram that tracks the compression of write requests. #14232
- [FEATURE] Distributor: Add -distributor.otel-label-name-underscore-sanitization and -distributor.otel-label-name-preserve-underscores, which control sanitization of underscores during OTLP translation. #13133
- [FEATURE] Query-frontends: Automatically adjust features used in query plans generated for remote execution based on what the available queriers support. #13017 #13164 #13544
- [FEATURE] Memberlist: Add experimental support for zone-aware routing in order to reduce memberlist cross-AZ data transfer. #13129 #13651 #13664
- [FEATURE] Query-frontend and querier: Add experimental support for performing query planning in query-frontends and distributing portions of the plan to queriers for execution. #13058 #13685 #13800 #14001 #14027
- [FEATURE] MQE: Add support for experimental extended range selector modifiers smoothed and anchored. You can enable these modifiers with -query-frontend.enabled-promql-extended-range-selectors=smoothed,anchored. #13398
- [FEATURE] MQE: Add support for the experimental PromQL function info. #13443
- [FEATURE] Querier: Add querier.mimir-query-engine.enable-reduce-matchers flag that enables a new MQE AST optimization pass that eliminates duplicate or redundant matchers that are part of selector expressions. #13178
- [FEATURE] Continuous test: Add a prometheus2 option to the -tests.write-protocol flag to select Prometheus Remote-Write 2.0 as a protocol. #13659 #13982
- [FEATURE] Continuous test: Write metrics metadata along with samples. #13659 #13732 #13796
- [FEATURE] Store-gateway: Add experimental per-zone shard size -store-gateway.tenant-shard-size-per-zone. When set, the total shard size is computed as this value multiplied by the number of zones. This option takes precedence over -store-gateway.tenant-shard-size. #13835
- [FEATURE] Distributor, Ingester: Add experimental reactive limiter setting -distributor.reactive-limiter.max-limit-factor-decay. #14007
- [FEATURE] Ingester: Added experimental per-tenant early head compaction. New per-tenant limits -ingester.early-head-compaction-owned-series-threshold and -ingester.early-head-compaction-min-estimated-series-reduction-percentage trigger compaction based on owned series count across all ingesters. #13980 #15056
- [FEATURE] Ingester: Added experimental support to run ingesters with no tokens in the ring when ingest storage is enabled. You can set -ingester.ring.num-tokens=0 to enable this feature. #14024
- [FEATURE] Store-gateway: Add -store-gateway.sharding-ring.excluded-zones flag to exclude specific zones from the store-gateway ring. #14120
- [FEATURE] Ingest storage: Add -ingest-storage.kafka.sasl-mechanism flag supporting more ways to authenticate with Kafka. #14307 #14344 #14540 #14674
- [FEATURE] Ingest storage: Add -ingest-storage.kafka.tls* flags to connect to Kafka using TLS. #14550
- [FEATURE] Ingest storage: Add -ingest-storage.ingestion-partition-tenant-write-shard-size to limit the number of partitions used for writes independently from reads. This allows the shard size to be reduced safely without losing query coverage during the migration. #14780
- [FEATURE] MQE: Add experimental support for splitting and caching intermediate results for functions over range vectors in instant queries. #13472 #14479 #14506 #14499 #14517 #14536 #14614 #14645 #14677 #14788
- [FEATURE] MQE: Add experimental support for reporting the number of samples read per query. #14828 #14839 #14952 #15035 #15045
- [FEATURE] Compactor: Add -compactor.ooo-split-and-merge-shards per-tenant limit to allow a separate shard count for blocks with the out-of-order external label. #14704
- [FEATURE] Distributor: Add experimental support for controlling OTLP metric name suffix addition and translation strategy via X-Mimir-OTLP-AddSuffixes and X-Mimir-OTLP-TranslationStrategy request headers on the OTLP push path, gated by -api.otlp-translation-headers-enabled (off by default). #14782
- [ENHANCEMENT] Ingest storage: Default to the more efficient -ingest-storage.kafka.producer-record-version=2, based on Remote-Write 2.0, which reduces Kafka record size and improves write throughput. #15185
- [ENHANCEMENT] Distributor: Add per-tenant -distributor.active-series-limit-response-code override to configure the HTTP response code returned when rejecting series due to the active series limit. Defaults to 429 (Too Many Requests). Set to 400 (Bad Request) to prevent clients from retrying rejected requests. #14981
- [ENHANCEMENT] Query-frontend: Add minimum_step_size filter to blocked queries config to reject range queries with a step smaller than the configured threshold. #14885
- [ENHANCEMENT] Query-frontend: Add support for blocking queries exceeding a time range duration with time_range_longer_than. #14609
- [ENHANCEMENT] Distributor: Add zone-aware rate limiting via -distributor.ring.instance-availability-zone. When configured, the global ingestion rate limit is divided by the number of zones and the number of distributors in the local zone, instead of the total number of distributors. #14515
- [ENHANCEMENT] Memberlist: Add experimental propagation delay tracker to measure gossip propagation delay across the memberlist cluster. Enable with -memberlist.propagation-delay-tracker.enabled=true. #14312 #14406
- [ENHANCEMENT] Memberlist: Add -memberlist.received-messages-queue-size to configure the size of the internal queue for messages received from other nodes. Increasing this value may help to avoid dropping messages when the node is processing a large number of messages from other nodes. #14995
- [ENHANCEMENT] Compactor: Make the compactor dashboard autoscaling panel work with non-CPU scaling metrics. #15017
- [ENHANCEMENT] Compactor: Add 0-100% jitter to the first compaction interval to spread compactions when multiple compactors start simultaneously. #14280
- [ENHANCEMENT] Compactor, Store-gateway: Remove experimental setting -compactor.upload-sparse-index-headers and always upload sparse index-headers. This improves lazy loading performance in the store-gateway. #13089 #13882
- [ENHANCEMENT] Querier: Reduce memory consumption of queries when samples for a single series are retrieved from multiple ingesters or store-gateways. #13806
- [ENHANCEMENT] Store-gateway: Verify CRC32 checksums for 1 out of every 128 chunks read from object storage and the chunks cache to detect corruption. #13151
- [ENHANCEMENT] Ingester: the per-tenant postings for matchers cache is now stable. Use the following configuration options: #13101
  - -blocks-storage.tsdb.head-postings-for-matchers-cache-ttl
  - -blocks-storage.tsdb.head-postings-for-matchers-cache-max-bytes
  - -blocks-storage.tsdb.head-postings-for-matchers-cache-force
  - -blocks-storage.tsdb.block-postings-for-matchers-cache-ttl
  - -blocks-storage.tsdb.block-postings-for-matchers-cache-max-bytes
  - -blocks-storage.tsdb.block-postings-for-matchers-cache-force
- [ENHANCEMENT] OTLP: De-duplicate target_info samples with conflicting timestamps. #13204
- [ENHANCEMENT] Query-frontend: Include the number of remote execution requests performed for a request in query stats logs emitted by query-frontends when remote execution is enabled. #13248
- [ENHANCEMENT] Update Docker base images from alpine:3.22.1 to alpine:3.22.2. #12991
- [ENHANCEMENT] Compactor, Store-gateway: Add metrics to track performance of in-memory and disk-based metadata caches. #13150
- [ENHANCEMENT] Ruler: Removed disk interaction when loading rules. #13156
- [ENHANCEMENT] Ingester: Cost-based index lookup planner accounts for query sharding when estimating cardinality and filter costs. #13374
- [ENHANCEMENT] GCS: Make uploads optionally retryable. Use the following advanced flags (default true): #13226 #13842
  - -alertmanager-storage.gcs.enable-upload-retries
  - -blocks-storage.gcs.enable-upload-retries
  - -common.storage.gcs.enable-upload-retries
  - -ruler-storage.gcs.enable-upload-retries
  - -alertmanager-storage.gcs.max-retries
  - -blocks-storage.gcs.max-retries
  - -common.storage.gcs.max-retries
  - -ruler-storage.gcs.max-retries
- [ENHANCEMENT] Usage-tracker: Improve first snapshot loading & rehash speed. #13284
- [ENHANCEMENT] Query-frontend: Return different error messages when experimental functions, aggregations, or extended range selector modifiers are used but not enabled for a tenant. #13398
- [ENHANCEMENT] Usage-tracker: Improved snapshot loading by doing it in parallel with GOMAXPROCS workers. #13608 #13622
- [ENHANCEMENT] Usage-tracker, distributor: Make usage-tracker calls asynchronous for users who are far enough from the series limits. #13427
- [ENHANCEMENT] Usage-tracker: Ensure tenant shards have enough capacity when loading a snapshot. #13607
- [ENHANCEMENT] Usage-tracker: Limit-aware map growth in tenant shards to avoid excessive memory allocation when tenants grow slightly beyond their limit. #14642
- [ENHANCEMENT] Ruler: Implemented OperatorControllableErrorClassifier for rule evaluation, allowing differentiation between operator-controllable errors (e.g., storage failures, 5xx errors, rate limiting) and user-controllable errors (e.g., bad queries, validation errors, 4xx errors). This change affects the rule evaluation failure metric prometheus_rule_evaluation_failures_total, which now includes a reason label with values operator or user to distinguish between them. #13313, #13470
- [ENHANCEMENT] Store-gateway: Added cortex_bucket_store_block_discovery_latency_seconds metric to track time from block creation to discovery by store-gateway. #13489 #13552 #13963
- [ENHANCEMENT] Alertmanager, distributor, querier, ruler: Added experimental CLI flags to configure a grace period for health checks for connections to other services or other replicas. The default value of 0 preserves the existing behaviour of immediately removing connections that have failed a health check. #13521 #13846
  - -alertmanager.alertmanager-client.health-check-grace-period
  - -distributor.ingester-health-check-grace-period
  - -querier.frontend-client.health-check-grace-period
  - -querier.store-gateway-client.health-check-grace-period
  - -ruler.client.health-check-grace-period
- [ENHANCEMENT] Query-frontend: Added more efficient encoding and decoding of JSON payloads. #13561
- [ENHANCEMENT] Querier: Add optional per-tenant max limits for label name and label value requests, max_label_names_limit and max_label_values_limit. #13654
- [ENHANCEMENT] Usage tracker: loadSnapshot() checks shard emptiness instead of using an explicit first parameter. #13534
- [ENHANCEMENT] OTLP: Add metric cortex_distributor_otlp_requests_by_content_type_total to track the content type (json or proto) of OTLP packets. #13525
- [ENHANCEMENT] Query-scheduler: Gracefully handle shutdown by draining the queue before exiting. #13603 #13976
- [ENHANCEMENT] OTLP: Add experimental metric cortex_distributor_otlp_array_lengths to better understand the layout of OTLP packets in practice. #13525
- [ENHANCEMENT] Ruler: gRPC errors without details are classified as operator errors, and rule evaluation failures (such as duplicate labelsets) are classified as user errors. #13586
- [ENHANCEMENT] Server: The /metrics endpoint now supports metrics filtering by providing one or more name[] query parameters. #13746
- [ENHANCEMENT] Distributor: Improved the performance of configuration retrieval in the validation middleware. #13807
- [ENHANCEMENT] Ingester: Make sharded active-series requests matching all series faster. #13491
- [ENHANCEMENT] Ingester: New -blocks-storage.tsdb.close-idle-tsdb-when-shipping-disabled flag to enforce closing of idle TSDBs when block shipping is disabled. #13862
- [ENHANCEMENT] Partitions ring: Add support to forcefully lock a partition state through the web UI. #13811
- [ENHANCEMENT] Usage-tracker: Serialize metrics gathering to reduce tail latency when running many partitions on a single instance. #13886
- [ENHANCEMENT] Usage-tracker: Add experimental per-user series created and removed counter metrics, gated behind -usage-tracker.enable-verbose-series-creation-deletion-prometheus-metrics. #14486
- [ENHANCEMENT] API: The /api/v1/user_limits endpoint is now stable and no longer experimental. #13218
- [ENHANCEMENT] Ingester: limiting CPU and memory utilized by the read path (-ingester.read-path-cpu-utilization-limit and -ingester.read-path-memory-utilization-limit) is now considered stable. #13167
- [ENHANCEMENT] Querier: -querier.max-estimated-fetched-chunks-per-query-multiplier is now stable and no longer experimental. #13120
- [ENHANCEMENT] Alertmanager: UTF-8 strict mode (-alertmanager.utf8-strict-mode-enabled) is now stable and no longer experimental. #13109
- [ENHANCEMENT] Promote the logger rate-limiting configuration parameters from experimental to stable. #13128
- [ENHANCEMENT] Ingester: Out-of-order ingestion support is now stable; use -ingester.out-of-order-time-window and -ingester.out-of-order-blocks-external-label-enabled to configure it. #13132
- [ENHANCEMENT] Ruler: align_evaluation_time_on_interval is now stable and no longer experimental. #13103
- [ENHANCEMENT] Query-frontend: query blocking (configured with the blocked_queries limit) is now stable and no longer experimental. #13107
- [ENHANCEMENT] Querier: -querier.active-series-results-max-size-bytes is now stable and no longer experimental. #13110
- [ENHANCEMENT] API: The /api/v1/cardinality/active_series endpoint is now stable and no longer experimental. #13111
- [ENHANCEMENT] Querier: Default to streaming active series responses to query-frontends via querier.response-streaming-enabled. #13883
- [ENHANCEMENT] Store-gateway: Add cortex_bucket_store_blocks_loaded_size_bytes metric to track per-tenant disk utilization. #13891
- [ENHANCEMENT] Compactor: If compaction fails because the result block would exceed the size limit for its postings offsets table, symbol table, or index, mark input blocks for no-compaction to avoid blocking future compactor runs. #13876 #14466 #14482
- [ENHANCEMENT] Query-frontend: add support for the range() duration expression. #13931
- [ENHANCEMENT] Add experimental flag common.instrument-reference-leaks-percentage to instrument gRPC buffers for leaked references. #13609 #14083
- [ENHANCEMENT] Querier: Add experimental flag -querier.mimir-query-engine.enable-projection-pushdown to enable an MQE optimization pass for reducing data transferred between queriers and the storage layer. #14006 #14132 #14239 #14241 #14326 #14720 #14751 #14800
- [ENHANCEMENT] MQE: Default to enabling the "eliminate deduplicate and merge" optimization pass via -querier.mimir-query-engine.enable-eliminate-deduplicate-and-merge. #14172
- [ENHANCEMENT] Ingester: Reduce the likelihood of ingestion being paused while idle TSDB compaction is in progress. #13978
- [ENHANCEMENT] Ingester: Extend cortex_ingester_tsdb_forced_compactions_in_progress metric to report a value of 1 when there's an idle or forced TSDB head compaction in progress. #13979
- [ENHANCEMENT] Usage-tracker, distributor: Distributor accumulates batches of series and sends them to usage-tracker in fewer RPCs if -distributor.usage-tracker-client.use-batched-tracking is enabled. #13966 #13983
- [ENHANCEMENT] MQE: Include metric name in histogram_quantile warning/info annotations when delayed name removal is enabled. #13905
- [ENHANCEMENT] MQE: Add metrics to track step-invariant expression usage and data point reuse savings: cortex_mimir_query_engine_step_invariant_nodes_total and cortex_mimir_query_engine_step_invariant_steps_saved_total. #13911
- [ENHANCEMENT] MQE: Add explicit error handling for unsupported Prometheus experimental binary operator modifiers fill, fill_left and fill_right. #14107
- [ENHANCEMENT] MQE: Add experimental support for computing multiple aggregations over the same data without buffering. Enable with -querier.mimir-query-engine.enable-multi-aggregation=true. #14123 #14174
- [ENHANCEMENT] Querier: Add support for the first phase of using non-opaque gRPC types between queriers and store-gateways per #14264. #14253
- [ENHANCEMENT] Querier: Optimize querying store-gateways when many of them are in a LEAVING state. #14157
- [ENHANCEMENT] Memberlist: Add "Size" column to "KV Store" table in the memberlist web page. #14200
- [ENHANCEMENT] Memberlist: Add experimental configuration option -memberlist.rejoin-seed-nodes to set custom seed nodes used by periodic rejoin (when enabled). #14208
- [ENHANCEMENT] Ingester: Check labels order and uniqueness on ingestion to protect against corruption. #14089
- [ENHANCEMENT] Ingester: Add experimental file-based Kafka consumer group offset tracking via flag -ingest-storage.kafka.consumer-group-offset-commit-file-enforced. #14110
- [ENHANCEMENT] Store-gateway: Add "OOO" column to the tenant blocks page to indicate whether each block was created from out-of-order samples. #14283
- [ENHANCEMENT] Ingester: Optimize ingestion from Kafka in clusters with mixed size tenants. #13924 #13961 #14302
- [ENHANCEMENT] Querier: Add new config flag querier.enable-delayed-name-removal-prometheus-engine to enable delayed name removal for the Prometheus engine. #14349
- [ENHANCEMENT] Ingester: Reduce heap usage during streaming chunk queries by releasing series label memory after each batch is sent rather than holding it until chunk streaming completes. #14422
- [ENHANCEMENT] Ingester: Eliminate 20-minute active series metrics loading period when custom tracker or cost attribution configuration changes. Active series counts are now immediately correct after a config reload. #14537
- [ENHANCEMENT] Ingester: Export cortex_ingester_active_series_loading gauge metric that is 1 while active series counts are still warming up after ingester startup, and 0 once they are accurate (after IdleTimeout has elapsed). #14783
- [ENHANCEMENT] Ingester: Allow /ingester/flush endpoint to be called while the ingester is starting up. This is useful during incidents where ingesters are stuck replaying from Kafka because they hit the max series limit, and an operator needs to manually trigger TSDB head compaction to free up in-memory series. #15065
- [ENHANCEMENT] Ingest storage: Skip kotel tracing hooks for unsampled traces in the franz-go Kafka client, significantly reducing CPU and memory overhead. #14852
- [ENHANCEMENT] Distributor: Reduced CPU utilization when writing to ingest storage with a large number of partitions by batching all partitions into a single Kafka produce call instead of one per partition. #14898
- [ENHANCEMENT] Ingest storage: Allow configuring multiple Kafka seed brokers via -ingest-storage.kafka.address (comma-separated). #14328
- [ENHANCEMENT] MQE: Add experimental support for eliminating selectors that are a subset of another selector. Enable with -querier.mimir-query-engine.enable-subset-selector-elimination=true. #14456 #14457 #14546 #14559 #14561 #14621
- [ENHANCEMENT] Ingest storage: Add -ingest-storage.kafka.client-rack flag to enable rack awareness. #14434
- [ENHANCEMENT] Ingester: Add cortex_ingester_queried_blocks_total metric to track TSDB block generations queried. #14572
- [ENHANCEMENT] Distributor, ingest storage: Add cortex_distributor_received_bytes_total and cortex_ingest_storage_writer_input_bytes_total metrics to measure Remote Write v2 symbols table compression effectiveness. #14453
- [ENHANCEMENT] Store-gateway: Add cortex_bucket_store_chunk_size_estimate_type_total metric to track how often the size of a chunk is inferred versus the default size being used. #14477
- [ENHANCEMENT] Block-builder: Expose per-tenant TSDB metrics. #14364 #14699
- [ENHANCEMENT] Block-builder: Add experimental -block-builder.generate-sparse-index-headers option. Construct and upload sparse index headers to object storage as part of block creation to make the sparse headers available to store-gateways when loading uncompacted blocks. #14494
- [ENHANCEMENT] Add experimental -http.response-compression-level CLI flag to set the gzip compression level used for compressed HTTP responses. #14586
- [ENHANCEMENT] Query-frontend: Add support for lookback_delta query parameter for instant and range queries. #14582 #14588
- [ENHANCEMENT] Query-frontend: Extend query blocking to optionally only apply a blocking rule if the query is an unaligned range query. Set unaligned_range_queries: true to enable. #14643
- [ENHANCEMENT] Store-gateway: Add experimental flag blocks-storage.bucket-store.partitioner-max-gap-bytes-chunks to specify the gap size for the chunks partitioner. #14649
- [ENHANCEMENT] Compactor: Add experimental -compactor.first-level-compaction-ooo-wait-period to configure a separate compaction wait period for out-of-order blocks. It's an analogue of -compactor.first-level-compaction-wait-period, which currently ignores out-of-order blocks. #14627
- [ENHANCEMENT] Block-builder: Add support for experimental -blocks-storage.tsdb.early-head-compaction-min-in-memory-series to enforce early head compaction if in-memory series reach the threshold. #14678
- [ENHANCEMENT] Usage-tracker: Improve performance of TrackSeriesBatch by preprocessing input data. #14702 #14734
- [ENHANCEMENT] MQE: Improve per-query memory consumption limit enforcement in histogram function evaluations. #14691
- [ENHANCEMENT] MQE: Improve per-query memory consumption limit enforcement within aggregation operations. #14735
- [ENHANCEMENT] Usage-tracker: Improve performance by using a special shard grouping algorithm. #14715
- [ENHANCEMENT] MQE: Support subset selector elimination for expressions where the subset is given by regex selectors. #14732
- [ENHANCEMENT] API: The activity tracker (if enabled) now covers the full request lifecycle and is used on all routes. #14777
- [ENHANCEMENT] MQE: Add metrics for tracking in-flight memory consumption: cortex_querier_inflight_query_max_estimated_memory_consumption_limit_bytes, cortex_querier_inflight_query_current_estimated_memory_consumption_bytes, cortex_querier_inflight_query_peak_estimated_memory_consumption_bytes and cortex_querier_inflight_query_sampled_count. #14807
- [ENHANCEMENT] Query-frontend: Stream JSON encoding directly to the response body to avoid a full-copy allocation of the serialized payload. #14840
- [ENHANCEMENT] Activity tracker: Add activity_tracker_unfinished_activities_loaded metric to report the number of unfinished activities detected on startup. #14860
- [ENHANCEMENT] Distributor: Use the record validation time as the Kafka record timestamp to reduce rejections among consumers. #14921
- [ENHANCEMENT] MQE: Add an optimisation pass that removes expressions that can't produce results, such as those containing comparisons with timestamp() that cannot match within the query time range, or those with conflicting matchers. #14989 #15014 #15163 #15117
- [ENHANCEMENT] Distributor: OTLP endpoint now returns partial success (HTTP 200) instead of HTTP 429 when the usage tracker rejects some series due to the active series limit but other series are successfully ingested. The RejectedDataPoints field reports the count of distributor-side rejections (usage tracker filtering). #14789
- [ENHANCEMENT] MQE: Account for the memory consumption of labels returned by binary operations earlier in the query memory consumption estimate. #15033
- [ENHANCEMENT] Query-frontend: Log the number of series and samples returned for queries in query stats log lines. #15044
- [ENHANCEMENT] Querier and query-frontend: When remote execution is enabled, send series metadata in batches, rather than in a single large message. The batch size can be configured with -query-frontend.remote-execution-series-metadata-batch-size. #15047
- [ENHANCEMENT] Ingest storage: Update the default configuration to enable ingest storage concurrency: #15072
  - -ingest-storage.kafka.fetch-concurrency-max from 0 to 12
  - -ingest-storage.kafka.ingestion-concurrency-max from 0 to 8
  - -ingest-storage.kafka.ingestion-concurrency-queue-capacity from 5 to 3
  - -ingest-storage.kafka.ingestion-concurrency-target-flushes-per-shard from 80 to 40
  - -ingest-storage.kafka.max-buffered-bytes from 100MB to 1GB
- [ENHANCEMENT] MQE: Enable narrow selectors optimisation and hints passing for and/unless binary operations. #15096
- [ENHANCEMENT] MQE: Add support for common subexpression elimination and subset selector elimination of range vector selectors in range queries. Enable with -querier.mimir-query-engine.enable-range-query-range-vector-common-subexpression-elimination=true. #15127
- [ENHANCEMENT] MQE: Use series selected for one side to reduce data selected on the other side in one-to-many and many-to-one binary operations (e.g. group_left and group_right). #15137
- [ENHANCEMENT] MQE: Reduced per-query memory overhead by no longer holding a reference to the HTTP request for the lifetime of a query. #15251
- [BUGFIX] Query-frontend: Fixed a memory leak that could occur on some error paths if MQE was enabled. #15251
- [BUGFIX] Alertmanager: Skip empty/zero config. #15184
- [BUGFIX] Tracing: Respect OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables in NewOTelFromEnv(). Previously, the sampler was always hardcoded to AlwaysSample() when no Jaeger remote sampler was configured, making it impossible to control trace volume through standard OpenTelemetry configuration. #15128
- [BUGFIX] API: Scope activity tracking middleware to query routes only, preventing it from rejecting write requests that have an unexpected Content-Type header with HTTP 500. #15129
- [BUGFIX] Ingester: Enforce a minimum 10s delay between TSDB head compaction iterations when an iteration approaches or exceeds the configured -blocks-storage.tsdb.head-compaction-interval, so ingestion is not starved by back-to-back compactions. #15061
- [BUGFIX] Update to Go v1.25.9. #15030
- [BUGFIX] Distributor: OTLP partial success responses now correctly populate RejectedDataPoints with the actual count of rejected samples, instead of always reporting 0. In classical architecture, this includes rejected samples propagated from the ingester. #14789
- [BUGFIX] Distributor: Fix race condition where the usage-tracker partition ring may not be initialized before the distributor service starts, causing a "usage-tracker partition ring is required" error on startup. #14675
- [BUGFIX] Store-gateway: Fix cortex_bucket_store_series_data_touched{data_type="series", stage="returned"} metric observing negative values when series-for-postings cache is hit and pending matchers filter out some series. #14655
- [BUGFIX] Mimir: Fix false positive in filesystem path overlap detection when one path is a string prefix of another but not an ancestor directory. #14426
- [BUGFIX] Build: Fixed config descriptor generation to correctly handle custom field types without CLI flags. #14632
- [BUGFIX] Query-frontend: Fixed blocked queries tests to use production code path instead of bypassing YAML parsing and canonicalization. #14585
- [BUGFIX] Distributor: Fix ingestion rate limit error message reporting incorrect burst size when ingestion_burst_factor is configured. #14471
- [BUGFIX] Mimir: Fix nil pointer dereference when -target is set to an empty string. #14381
- [BUGFIX] API: Fixed web UI links not respecting -server.path-prefix configuration. #14090
- [BUGFIX] API: Fixed embedded web UI static assets (CSS, JS, images) returning 404 when -server.path-prefix is configured. #15181
- [BUGFIX] Distributor: Fix issue where distributors didn't send custom values of native histograms. #13849
- [BUGFIX] Compactor: Fix potential concurrent map writes. #13053
- [BUGFIX] Query-frontend: Fix issue where queries sometimes fail with "failed to receive query result stream message: rpc error: code = Canceled desc = context canceled" if remote execution is enabled. #13084
- [BUGFIX] Query-frontend: Fix issue where query stats, such as series read, did not include the parameters to the histogram_quantile and histogram_fraction functions if remote execution was enabled. #13084
- [BUGFIX] Query-frontend: Fix issue where requests that are canceled or time out are sometimes cached if remote execution is enabled. #13098
- [BUGFIX] Querier: Fix issue where failures to send results to query-frontends in response to remote execution requests are logged as "EOF". #13099 #13121
- [BUGFIX] Usage-Tracker: Fix underflow in current limit calculation when series >= limit. #13113
- [BUGFIX] Querier: Fix issue where a problem sending a response to a query-frontend may cause all other responses from the same querier to the same query-frontend to fail or be delayed. #13123
- [BUGFIX] Ingester: fix index lookup planning with regular expressions which match empty strings on non-existent labels. #13117
- [BUGFIX] Memberlist: Fix memberlist initialization when Mimir is executed with -target=memberlist-kv. #13129
- [BUGFIX] Query-frontend: Fix issue where queriers may receive an "rpc error: code = Internal desc = cardinality violation: expected <EOF> for non server-streaming RPCs, but received another message" error while sending a query result to a query-frontend if remote execution is enabled. #13147
- [BUGFIX] Querier: Fix issue where cancelled queries may cause an "error notifying scheduler about finished query" message to be logged. #13186
- [BUGFIX] Querier: Fix issue where evaluation metrics and logs aren't emitted if remote execution is enabled. #13207
- [BUGFIX] Query-frontend: Fix issue where queries containing subqueries could fail with "slice capacity must be a power of two, but is X" if remote execution is enabled. #13211
- [BUGFIX] Query-frontend: Fix issue where queries containing duplicated shardable expressions would fail with "could not materialize query: no registered node materializer for node of type NODE_TYPE_REMOTE_EXEC" if running sharding inside MQE is enabled. #13247
- [BUGFIX] Runtime config: Fix issue where inconsistent map key types (numbers and strings) caused some runtime config files to be silently skipped during loading. #13270
- [BUGFIX] Store-gateway: Fix how out-of-order blocks are tracked in the cortex_bucket_store_series_blocks_queried metric. #13261
- [BUGFIX] Cost attribution: Fix panic when metrics are created with invalid labels. #13273
- [BUGFIX] Distributor: Fix in-flight request counter when the reactive limiter is full. #13406
- [BUGFIX] Query-frontend: Fix panic when evaluating a sharded avg expression when running sharding inside MQE. #13484
- [BUGFIX] Query-frontend: Fix incorrect annotation position information when running sharding inside MQE. #13484
- [BUGFIX] Query-frontend: Fix incorrect query results when evaluating some sharded aggregations using without when running sharding inside MQE. #13484
- [BUGFIX] Ingester: Fix panic when push and read reactive limiters are enabled with prioritization. #13482
- [BUGFIX] Usage-tracker: Prevent tracking requests from being handled by partition handlers that are not in the Running state. #13532
- [BUGFIX] MQE: Fix an issue when applying extra matchers to one side of a binary operation to avoid adding matchers for labels that do not exist. #13499 #13592
- [BUGFIX] Query-frontend: Fix excessive CPU and memory consumption when running sharding inside MQE. #13580
- [BUGFIX] Rename cortex_bucket_store_cached_postings_compression_time_seconds, cortex_query_frontend_regexp_matcher_count, and cortex_query_frontend_regexp_matcher_optimized_count to follow naming conventions. #13599
- [BUGFIX] MQE: Fix issue where the "conflicting counter resets during histogram" warning could be incorrectly emitted during sharded histogram aggregations. #13623
- [BUGFIX] Query-frontend: Fix incorrect query results when running sharding inside MQE is enabled and the query contains a subquery eligible for subquery spin-off wrapped in a shardable aggregation. #13619
- [BUGFIX] Memberlist: Fix occasional nil pointer dereference panics. #13635
- [BUGFIX] Query-scheduler: Fix issue where queries executed with remote execution could time out rather than fail immediately if the querier evaluating the request crashes after receiving the query from the query-scheduler. #13742
- [BUGFIX] Query-frontend: Fix silent panic when executing a remote read API request if the request has no matchers. #13745
- [BUGFIX] Ruler: Fixed -ruler.max-rule-groups-per-tenant-by-namespace to only count rule groups in the specified namespace instead of all namespaces. #13743
- [BUGFIX] Ruler: Fix parsing of rule expressions with leading newlines. #14947
- [BUGFIX] Query-frontend: Fix race condition that could sometimes cause unnecessary resharding of queriers if querier shuffle sharding and remote execution are enabled. #13794 #13838
- [BUGFIX] Query-frontend: Fix step() duration expression returning a value 1000x too large. #13920
- [BUGFIX] Store-gateway: Fix parent-child relationship in LabelNames and LabelValues trace spans. #13932
- [BUGFIX] MQE: Map remote execution storage errors correctly. #13944
- [BUGFIX] Ingester: Fix race condition where a new partition could reach the Active partition ring state before its ingester instances reached the Active ring state. #14025
- [BUGFIX] Ingester: Query all ingesters when shuffle sharding is disabled. #14041
- [BUGFIX] Query-frontend: Fix issue where per-query memory consumption limit is not enforced. #14086
- [BUGFIX] Ingester: Fix race condition during shutdown where TSDBs could be closed while appends are still in progress. #14094
- [BUGFIX] Store-gateway: Fix blocks being incorrectly dropped during shutdown when the store-gateway is terminated while fetching an updated bucket index. #14113
- [BUGFIX] Ingester: Defensive correctness fix for buffer reference counting in pkg/mimirpb. #14108
- [BUGFIX] Ingester: Add timeouts to wait for instance state on startup and deferred shutdown of tasks on failure. #14134, #14180
- [BUGFIX] Distributor: Fix duplicate label validation bypass when label value exceeds length limit and is handled by Truncate or Drop strategy. #14131
- [BUGFIX] Block-builder-scheduler: Fix bug where data could be skipped when partition is fully consumed at startup but later grows. #14136
- [BUGFIX] Ingester: Create TSDB directory on startup. #14112
- [BUGFIX] Querier: Fix strategy used to select partitions to query when some partitions have been Inactive for longer than the lookback period and shuffle sharding is disabled. #14261
- [BUGFIX] Block-builder-scheduler: Fix data race when reading partition state during pending jobs enqueueing. #14489
- [BUGFIX] Querier: Fix issue where queries can time out if remote execution is enabled and sending the initial message from queriers to query-frontends fails. #14557
- [BUGFIX] Querier: Fix issue where different sharded legs of a query could be evaluated with different lookback deltas if different queriers were configured with different default lookback deltas. #14575
- [BUGFIX] Query-frontend: Fixed partial cache hit returning incomplete data for native histogram series due to incorrect response ordering before merge. #14612
- [BUGFIX] Update to Go v1.25.8 to address CVE-2026-27142, CVE-2026-27139, CVE-2026-25679, CVE-2026-27138, CVE-2026-27137. #14623
- [BUGFIX] Distributor: Fix nil pointer panic in WriteRequest.Unmarshal when receiving a Remote Write 2.0 request with zero timeseries. #14698
- [BUGFIX] MQE: Fix info() incorrectly dropping inner series with no matching info series when a data label matcher matches the empty string. #14819
- [BUGFIX] MQE: Fix info() emitting un-enriched series when a data label matcher doesn't match the empty string and the info series is unavailable at some timestamps. #14812
- [BUGFIX] MQE: Fix and/unless binary operations passing matchers to the right-hand side, which could result in incorrect filtering. #14902
- [BUGFIX] MQE: Fix internal error when executing a subquery with delayed name removal enabled. #14946
- [BUGFIX] Alertmanager: Fix deadlock when trying to broadcast after stopping a tenant. #14922
- [BUGFIX] Query-frontend: Fix max total query length limit (-query-frontend.max-total-query-length) not being enforced on instant queries with subqueries or range selectors. #14985
- [BUGFIX] Compactor: Fix potential goroutine leak when compaction iteration exits early due to errors. #13420
- [BUGFIX] Query-frontend: Fix bugs in matcher propagation for binary operations, where matchers were not properly applied within nested expressions and internal label matchers were wrongly propagated. #15110
- [BUGFIX] Distributor: Cancel DoUntilQuorum in cardinality analysis API when active_series_results_max_size_bytes is breached. #15177
- [BUGFIX] MQE: Fix issue where queries with step-invariant range vector expressions (e.g. quantile_over_time(scalar(arg), metric[5m] @ 1000)) could return incorrect results. #15192
- [BUGFIX] MQE: Fix info() function not enriching series when inner series are missing one identifying label (instance/job) but matching info series exist. #14832
- [BUGFIX] MQE: Fix info() function only retaining one matcher when multiple data label matchers target the same label name. #14832
- [BUGFIX] MQE: Fix info() function silently overwriting conflicting labels from different info metrics instead of returning an error. #14832
- [BUGFIX] MQE: Fix info() function incorrectly grouping labels from replaced info series at the same evaluation timestamp due to lookback. #14832
Mixin
- [CHANGE] Alerts and rules: Replaced _config.base_alerts_range_interval_minutes and _config.recording_rules_range_interval with _config.scrape_interval (default 15s). Instead of configuring a pre-multiplied number of minutes, configure your actual Prometheus scrape interval and the mixin will compute safe rate-function windows automatically (at least 4× the scrape interval). #15174 #15176
- [CHANGE] Dashboards: Add configuration option dashboards_default_latency_mode to control the default value of the native/classic latency variable (uses 'classic' if unset). #14424
- [CHANGE] Alerts: Renamed the following alerts to fit within 40 characters: #13363
  - MimirAlertmanagerPartialStateMergeFailing → MimirAlertmanagerStateMergeFailing
  - MimirServerInvalidClusterValidationLabelRequests → MimirServerInvalidClusterLabelRequests
  - MimirClientInvalidClusterValidationLabelRequests → MimirClientInvalidClusterLabelRequests
  - MimirHighGRPCConcurrentStreamsPerConnection → MimirHighGRPCStreamsPerConnection
  - MimirDistributorReachingInflightPushRequestLimit → MimirDistributorInflightRequestsHigh
  - MimirIngesterHasNotShippedBlocks → MimirIngesterNotShippingBlocks
  - MimirIngesterHasNotShippedBlocksSinceStart → MimirIngesterNotShippingBlocksSinceStart
  - MimirIngesterTSDBCheckpointCreationFailed → MimirIngesterTSDBCheckpointCreateFailed
  - MimirIngesterTSDBCheckpointDeletionFailed → MimirIngesterTSDBCheckpointDeleteFailed
  - MimirCompactorHasNotSuccessfullyCleanedUpBlocks → MimirCompactorNotCleaningUpBlocks
  - MimirCompactorHasNotSuccessfullyRunCompaction → MimirCompactorNotRunningCompaction
  - MimirCompactorFailingToBuildSparseIndexHeaders → MimirCompactorBuildingSparseIndexFailed
  - MimirIngesterLastConsumedOffsetCommitFailed → MimirIngesterOffsetCommitFailed
  - MimirIngesterFailedToReadRecordsFromKafka → MimirIngesterKafkaReadFailed
  - MimirStartingIngesterKafkaReceiveDelayIncreasing → MimirStartingIngesterKafkaDelayGrowing
  - MimirIngesterFailsToProcessRecordsFromKafka → MimirIngesterKafkaProcessingFailed
  - MimirIngesterStuckProcessingRecordsFromKafka → MimirIngesterKafkaProcessingStuck
  - MimirStrongConsistencyOffsetNotPropagatedToIngesters → MimirStrongConsistencyOffsetMissing
  - MimirKafkaClientBufferedProduceBytesTooHigh → MimirKafkaClientProduceBufferHigh
- [CHANGE] Alerts: Replaced MimirCompactorSkippedUnhealthyBlocks with more generic MimirCompactorSkippedBlocks. #13876
- [CHANGE] Dashboards: Replace usage of container_spec_cpu_quota / container_spec_cpu_period with kube_pod_container_resource_limits for calculation of CPU limits. #14425
- [CHANGE] Dashboards: The queries used in latency panels no longer convert seconds to milliseconds. The dashboard panels now use "seconds" unit instead of "milliseconds". #14896
- [ENHANCEMENT] Dashboards: Group compactor compaction-related panels into a single collapsible "Compaction" row. #14784
- [ENHANCEMENT] Dashboards: Merge CPU and memory panels in the "Compactor resources" dashboard into a single collapsible row. #14866
- [ENHANCEMENT] Alerts: Add more native histogram versions of alerts using classic histograms. #13814
- [ENHANCEMENT] Alerts: Improve MimirCompactorNotRunningCompaction alert to be restart-resistant. Added warning severity alerts for early detection (6h threshold) and lowered the since-startup critical duration from 24h to 12h. #14282
- [ENHANCEMENT] Dashboards: Support native histograms in the Alertmanager, Compactor, Queries, Rollout operator, Reads, RemoteRuler-Reads, Ruler, and Writes dashboards. #13556 #13621 #13629 #13673 #13690 #13678 #13633 #13672
- [ENHANCEMENT] Alerts: Add MimirFewerIngestersConsumingThanActivePartitions alert. #13159
- [ENHANCEMENT] Querier and query-frontend: Add alerts for querier ring, which is used when performing query planning in query-frontends and distributing portions of the plan to queriers for execution. #13165
- [ENHANCEMENT] Alerts: Add MimirBlockBuilderSchedulerNotRunning alert. #13208
- [ENHANCEMENT] Alerts: Add MimirBlockBuilderPersistentJobFailure alert. #13278
- [ENHANCEMENT] Dashboards: Update default regular expressions to match multi-zone deployments for query-frontend, querier, distributor and ruler. #13200
- [ENHANCEMENT] Alerts: Update MimirHighVolumeLevel1BlocksQueried alert to fire on a percentage of the level 1 blocks queried. #13229
- [ENHANCEMENT] Dashboards: Plot OOMKilled events in the workingset memory panels of resources dashboards. #13377
- [ENHANCEMENT] Dashboards: Add variable to compactor and object store dashboards to switch between classic and native latencies. Use native histogram thanos_objstore_bucket_operation_duration_seconds. #12137
- [ENHANCEMENT] Recording rules: Add native histogram version of histogram recording rules. #13553
- [ENHANCEMENT] Alerts: Add MimirMemberlistBridgeZoneUnavailable alert. #13647
- [ENHANCEMENT] Alerts: Add MimirMemberlistZoneAwareRoutingAutoFailover alert that fires when memberlist zone-aware routing auto-failover triggers due to missing memberlist bridges. #13726
- [ENHANCEMENT] Dashboards and recording rules: Add usage-tracker rows to writes, writes-networking, writes-resources dashboards if the config.usage_tracker_enabled var is set. Add usage-tracker client latency recording rules. #13639 #13652 #14865
- [ENHANCEMENT] Recording rules and dashboards: Add stage label to cortex_ingester_queried_series recording rules and filter Queries dashboard "Series per query" panel to show only stage=merged_blocks. #13666
- [ENHANCEMENT] Dashboards: Add "Owned series" and "Active series" panels to the writes dashboard Headlines row. #13895
- [ENHANCEMENT] Alerts: Add IncorrectWebhookConfigurationFailurePolicy, BadZoneAwarePodDisruptionBudgetConfiguration and HighNumberInflightZpdbRequests rollout-operator alerts. #13840
- [ENHANCEMENT] Dashboards: Add additional panels to the rollout-operator dashboard related to the zone aware pod disruption budget controller. #13840
- [ENHANCEMENT] Dashboards: Sort tooltips in descending order to show main contributors to spike or query. #13827
- [ENHANCEMENT] Dashboards: Add "By store-gateway disk utilization" panel to the Top Tenants dashboard showing per-tenant disk usage and their shard size. #13917
- [ENHANCEMENT] Dashboards: Add panels showing the distribution of estimated query memory consumption and rate of fallback to Prometheus' query engine in query-frontends to the Queries dashboard. #14029
- [ENHANCEMENT] Dashboards: Add "Forced TSDB head compactions in progress" panel to "Mimir / Writes" dashboard. #14248
- [ENHANCEMENT] Dashboards: Improve "Last successful run per-compactor replica" table in the compactor dashboard to show time since process start for compactors that haven't completed their first run yet. #14285
- [ENHANCEMENT] Alerts: Add MimirUsageTrackerSnapshotUploadFailing and MimirUsageTrackerSnapshotDownloadFailing alerts to detect usage-tracker snapshot upload/download failures. #14778
- [ENHANCEMENT] Alerts: Add dashboard_url annotations to Prometheus alerts. #14458
- [ENHANCEMENT] Dashboards: Change the "Rules" panel in the "Mimir / Reads resources" dashboard to use a stacked visualization. #14707
- [ENHANCEMENT] Dashboards: Split the "All series" panel in the Tenants dashboard into "Active series" and "Owned & in-memory series" panels, and added the active series limit. #14648 #14771
- [ENHANCEMENT] Dashboards: Add "In memory series" panel to experimental "Mimir / Block-builder" dashboard. #14700
- [ENHANCEMENT] Dashboards: Unify object store rows into a single collapsible row across Alertmanager, Compactor, Reads, and Ruler dashboards. #14850
- [ENHANCEMENT] Alerts: Make MimirInconsistentRuntimeConfig alert less flaky when performing multiple configuration changes in a row in a large Kubernetes cluster. #14743 #14933 #15051 #15257
- [ENHANCEMENT] Alerts: Suppress MimirRingMembersMismatch alert during ingester rollouts. The alert now uses an unless clause to avoid false positives when ingester statefulsets are being updated. #14895
- [ENHANCEMENT] Recording rules: Add a low-cardinality recorded version of usage_tracker_active_series. #14901
- [ENHANCEMENT] Alerts: Fix MimirSchedulerQueriesStuck false positives by only looking for cases where the number of enqueued queries doesn't decrease. #14943 #15193
- [ENHANCEMENT] Dashboards: Add ephemeral storage panels to "Resources" dashboards. #14999
- [ENHANCEMENT] Dashboards: Add disk utilization panels to experimental Block-builder dashboard. #15029
- [BUGFIX] Dashboards: Fix compactor dashboard to exclude instances without the last successful run metric in the "Last successful run per-compactor replica" table. #14784
- [BUGFIX] Dashboards: Fix issue where throughput dashboard panels would group all gRPC requests that resulted in a status containing an underscore into one series with no name. #13184
- [BUGFIX] Dashboards: Filter out 0s from max_series limit on Writes Resources > Ingester > In-memory series panel. #13419
- [BUGFIX] Dashboards: Fix issue where the "Tenant gateway requests" panels on Tenants dashboard would show data from all components. #13940
- [BUGFIX] Dashboards: Fix issue where the MQE-related dashboard panels on the Queries dashboard would show data from both queriers and query-frontends, instead of just queriers. #14029
- [BUGFIX] Alerts: Fix alert definitions with short range vector selectors that did not respect the configured base_alerts_range_interval_minutes. #15083
- [BUGFIX] Dashboards: Fix mixin build failure when singleBinary is true. #15108
- [BUGFIX] Alerts: Fix alert for store-gateway object storage operation failures to alert based on percentage of failed operations, not raw number of them. #15196
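The alert renames in #13363 affect anything that matches on alertname, including Alertmanager routes, inhibition rules, silences, and external tooling. All of these need the new names. The following Go sketch shows one way to handle the migration in your own tooling. alertRenames and newAlertName are hypothetical names, not part of the Mimir mixin, and the map simply transcribes the rename list in the Mixin section.

```go
package main

import "fmt"

// alertRenames maps each pre-3.1 alert name to its new name (#13363).
var alertRenames = map[string]string{
	"MimirAlertmanagerPartialStateMergeFailing":            "MimirAlertmanagerStateMergeFailing",
	"MimirServerInvalidClusterValidationLabelRequests":     "MimirServerInvalidClusterLabelRequests",
	"MimirClientInvalidClusterValidationLabelRequests":     "MimirClientInvalidClusterLabelRequests",
	"MimirHighGRPCConcurrentStreamsPerConnection":          "MimirHighGRPCStreamsPerConnection",
	"MimirDistributorReachingInflightPushRequestLimit":     "MimirDistributorInflightRequestsHigh",
	"MimirIngesterHasNotShippedBlocks":                     "MimirIngesterNotShippingBlocks",
	"MimirIngesterHasNotShippedBlocksSinceStart":           "MimirIngesterNotShippingBlocksSinceStart",
	"MimirIngesterTSDBCheckpointCreationFailed":            "MimirIngesterTSDBCheckpointCreateFailed",
	"MimirIngesterTSDBCheckpointDeletionFailed":            "MimirIngesterTSDBCheckpointDeleteFailed",
	"MimirCompactorHasNotSuccessfullyCleanedUpBlocks":      "MimirCompactorNotCleaningUpBlocks",
	"MimirCompactorHasNotSuccessfullyRunCompaction":        "MimirCompactorNotRunningCompaction",
	"MimirCompactorFailingToBuildSparseIndexHeaders":       "MimirCompactorBuildingSparseIndexFailed",
	"MimirIngesterLastConsumedOffsetCommitFailed":          "MimirIngesterOffsetCommitFailed",
	"MimirIngesterFailedToReadRecordsFromKafka":            "MimirIngesterKafkaReadFailed",
	"MimirStartingIngesterKafkaReceiveDelayIncreasing":     "MimirStartingIngesterKafkaDelayGrowing",
	"MimirIngesterFailsToProcessRecordsFromKafka":          "MimirIngesterKafkaProcessingFailed",
	"MimirIngesterStuckProcessingRecordsFromKafka":         "MimirIngesterKafkaProcessingStuck",
	"MimirStrongConsistencyOffsetNotPropagatedToIngesters": "MimirStrongConsistencyOffsetMissing",
	"MimirKafkaClientBufferedProduceBytesTooHigh":          "MimirKafkaClientProduceBufferHigh",
}

// maxAlertNameLen is the length limit the renames were made to fit.
const maxAlertNameLen = 40

// newAlertName is a hypothetical helper: it returns the renamed alert, or the
// input unchanged if the alert was not part of the rename.
func newAlertName(name string) string {
	if renamed, ok := alertRenames[name]; ok {
		return renamed
	}
	return name
}

func main() {
	// Sanity check: report any new name longer than maxAlertNameLen.
	for _, renamed := range alertRenames {
		if len(renamed) > maxAlertNameLen {
			fmt.Printf("%s exceeds %d characters\n", renamed, maxAlertNameLen)
		}
	}
	fmt.Println(newAlertName("MimirIngesterHasNotShippedBlocks"))
	// Not part of the rename list, so it is returned unchanged.
	fmt.Println(newAlertName("MimirRingMembersMismatch"))
}
```

Keeping the mapping as data makes it straightforward to apply in bulk to silences or routing rules. The loop in main confirms that every new name fits within the 40-character limit.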
Jsonnet
- [CHANGE] Renamed the following configuration parameters to add the _per_zone suffix, to better reflect that these values apply per zone in multi-zone deployments: #13632
  - autoscaling_querier_min_replicas → autoscaling_querier_min_replicas_per_zone
  - autoscaling_querier_max_replicas → autoscaling_querier_max_replicas_per_zone
  - autoscaling_query_frontend_min_replicas → autoscaling_query_frontend_min_replicas_per_zone
  - autoscaling_query_frontend_max_replicas → autoscaling_query_frontend_max_replicas_per_zone
  - autoscaling_ruler_min_replicas → autoscaling_ruler_min_replicas_per_zone
  - autoscaling_ruler_max_replicas → autoscaling_ruler_max_replicas_per_zone
  - autoscaling_ruler_querier_min_replicas → autoscaling_ruler_querier_min_replicas_per_zone
  - autoscaling_ruler_querier_max_replicas → autoscaling_ruler_querier_max_replicas_per_zone
  - autoscaling_ruler_query_frontend_min_replicas → autoscaling_ruler_query_frontend_min_replicas_per_zone
  - autoscaling_ruler_query_frontend_max_replicas → autoscaling_ruler_query_frontend_max_replicas_per_zone
- [CHANGE] Store-gateway: The store-gateway disk class now honors the one configured via $._config.store_gateway_data_disk_class and doesn't replace fast with fast-dont-retain. #13152
- [CHANGE] Rollout-operator: Vendor jsonnet from rollout-operator repository. #13245 #13317 #13793 #13799 #13840 #14240 #14463 #14854 #14900
- [CHANGE] Ruler: Set default memory ballast to 1GiB to reduce GC pressure during startup. #13376
- [CHANGE] Zone pod disruption budget: Remove multi_zone_zpdb_enabled and replace it with multi_zone_ingester_zpdb_enabled and multi_zone_store_gateway_zpdb_enabled to allow selectively enabling the zone pod disruption budget on a per-component basis. #13813
- [CHANGE] Reduced dynamic replication factor when running store-gateways with replication factor set to a value higher than 3. #14304
- [CHANGE] Disable ingester ring tokens by default when ingest storage architecture is enabled. #14613
- [CHANGE] Querier: Set ignoreNullValues to false by default for the KEDA ScaledObject to prevent scaling down when the scaling metrics return no data. #14641
- [CHANGE] Ingester: Change the default ingestion concurrency configuration used by the ingest storage architecture to maximize throughput when consuming records from Kafka. #14668
- [CHANGE] Memberlist: When the multi-zone memberlist bridge is enabled (multi_zone_memberlist_bridge_enabled), Mimir components now use memberlist-bridge pods as seed nodes by default instead of the shared gossip ring service. This reduces inter-AZ data transfer. Use the new memberlist_bridge_seed_nodes_enabled configuration option to disable this behavior. #14994
- [CHANGE] Ruler remote evaluation: Split the ruler-query-frontend service into a ClusterIP service (for HTTP load balancing) and a headless service (for gRPC client-side load balancing by rulers). The ruler now connects to the headless service. #15001
- [CHANGE] Memberlist bridge: Changed the default value of memberlist_bridge_replicas_per_zone from 2 to 3. #14667
- [FEATURE] Add multi-zone support for read path components (memcached, querier, query-frontend, query-scheduler, ruler, and ruler remote evaluation stack). Add multi-AZ support for ingester and store-gateway multi-zone deployments. Add memberlist-bridge support for zone-aware memberlist routing. #13559 #13628 #13636 #13915 #14260 #14301
- [FEATURE] Add deletion protection support for the ingester and store-gateway StatefulSets. Enable it by setting ingester_deletion_protection_enabled and store_gateway_deletion_protection_enabled in the _config block. #13819
- [FEATURE] Shuffle sharding: Add the following configuration options to enable the experimental per-zone store-gateway shard size: #13908 #13941
  - $._config.shuffle_sharding.store_gateway_shard_size_per_zone_enabled
  - $._config.shuffle_sharding.store_gateway_shard_size_per_zone_defaults_enabled (takes precedence over store_gateway_shard_size_per_zone_enabled)
  - $._config.shuffle_sharding.store_gateway_shard_size_per_zone_overrides_enabled (takes precedence over store_gateway_shard_size_per_zone_enabled)
- [FEATURE] Ruler: Add the $._config.multi_zone_ruler_balanced_autoscaling_enabled option, which keeps replica counts equally balanced across ruler zones in multi-AZ deployments by using aggregate metrics for autoscaling. #14198
- [FEATURE] Add the query_engine_range_vector_splitting_enabled configuration option to enable experimental range vector splitting with a memcached cache. #14435
- [FEATURE] Store-gateway: Add the ability to autoscale store-gateways based on disk usage when automated downscale is enabled, configured through the following options: #15019
  - $._config.autoscaling_store_gateway_enabled
  - $._config.autoscaling_store_gateway_disk_usage_threshold
  - $._config.autoscaling_store_gateway_min_replicas_per_zone
  - $._config.autoscaling_store_gateway_max_replicas_per_zone
- [ENHANCEMENT] Ruler querier and query-frontend: Add support for the newly introduced querier ring, which query-frontends use when planning queries and distributing portions of the plan to queriers for execution. #13017
- [ENHANCEMENT] Ingester: Increase the default value of $._config.ingester_tsdb_head_early_compaction_min_in_memory_series when Mimir runs with the ingest storage architecture. #13450
- [ENHANCEMENT] Memberlist bridge: Add the memberlist_bridge_replicas_per_zone configuration option (default: 2). #13727
- [ENHANCEMENT] Update the list of OTel resource attributes used for tracing. #13469
- [ENHANCEMENT] Ingester: Set -ingester.partition-ring.delete-inactive-partition-after based on -querier.query-ingesters-within. #13550
- [ENHANCEMENT] Add an extra, experimental KEDA ScaledObject trigger that prevents scaling down during OOM kills when the memory trigger is disabled and $._config.autoscaling_oom_protection_enabled is true. #13509
- [ENHANCEMENT] Multi-zone: Make config validation exclusions configurable via multi_zone_config_validation_excluded_args and multi_zone_config_validation_excluded_env_vars, and add validation for multi-zone distributor deployments. #13728
- [ENHANCEMENT] Overrides-exporter: Include the query configuration so that query limit defaults are reported accurately. #13850
- [ENHANCEMENT] Expose the pod termination grace period for alertmanagers, ingesters, query-frontends, rulers, and store-gateways. #13852
- [ENHANCEMENT] Store-gateways in a multi-zone deployment now scale up a zone only once all replicas in the preceding zones are ready. #13879
- [ENHANCEMENT] Multi-zone: Add config options to enable multi-zone (virtual zones) and multi-AZ deployments for all write and read path components respectively: #13906
  - multi_zone_write_path_enabled
  - multi_zone_read_path_enabled
  - multi_zone_read_path_multi_az_enabled
- [ENHANCEMENT] Overrides-exporter: Add the overrides_exporter_exported_limits config option to specify the limits exposed by the exporter. The default list of limits is unchanged from the previous version. #13912
- [ENHANCEMENT] Ingester: Add the ingester_priority_class config option to customize the ingester priority class. By default, no explicit priority class is configured and the Kubernetes default class is used. #14093
- [ENHANCEMENT] Store-gateway: Add config options to enable store-gateway multi-AZ deployments on a per-zone basis: #14111
  - multi_zone_store_gateway_zone_a_multi_az_enabled
  - multi_zone_store_gateway_zone_b_multi_az_enabled
  - multi_zone_store_gateway_zone_c_multi_az_enabled
- [ENHANCEMENT] Querier: Add the autoscaling_querier_ignore_null_values option to set KEDA ignoreNullValues for querier autoscaling metrics. #14101
- [ENHANCEMENT] Multi-zone: Add config validation for the -querier.prefer-availability-zones flag on querier and ruler-querier deployments. #14539
- [ENHANCEMENT] Distributor: Render the experimental -distributor.max-active-series-per-user flag on distributors if $._config.limits.max_active_series_per_user is set. #14636
- [ENHANCEMENT] Ingester: Add $._config.ingest_storage_set_client_rack to pass -ingest-storage.kafka.client-rack when zone-aware replication is enabled. #14654
- [ENHANCEMENT] Ingester: Add $._config.multi_zone_ingester_multi_az_zone_(a|b|c)_enabled to simplify migrations that don't use a temporary zone-c. #15000
- [BUGFIX] Ingester: Fix the default value of $._config.ingest_storage_ingester_autoscaling_max_owned_series_threshold so that it's computed from the configured $._config.ingester_instance_limits.max_series. #13448
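Upgrading a Jsonnet environment means renaming the autoscaling options listed under #13632 and replacing multi_zone_zpdb_enabled (#13813). The Go sketch below illustrates that mapping on a flat key/value view of $._config. The migrateConfig helper and the map representation are hypothetical, and copying the old multi_zone_zpdb_enabled value onto both new per-component flags is an assumption about preserving previous behavior, not something these notes state.

```go
package main

import (
	"fmt"
	"sort"
)

// renamedPerZone lists the autoscaling parameters that gained a _per_zone suffix (#13632).
var renamedPerZone = []string{
	"autoscaling_querier_min_replicas",
	"autoscaling_querier_max_replicas",
	"autoscaling_query_frontend_min_replicas",
	"autoscaling_query_frontend_max_replicas",
	"autoscaling_ruler_min_replicas",
	"autoscaling_ruler_max_replicas",
	"autoscaling_ruler_querier_min_replicas",
	"autoscaling_ruler_querier_max_replicas",
	"autoscaling_ruler_query_frontend_min_replicas",
	"autoscaling_ruler_query_frontend_max_replicas",
}

// migrateConfig is a hypothetical helper: it returns a copy of cfg with the old
// key names rewritten. A new key that is already set explicitly is left untouched.
func migrateConfig(cfg map[string]any) map[string]any {
	out := make(map[string]any, len(cfg))
	for k, v := range cfg {
		out[k] = v
	}
	for _, old := range renamedPerZone {
		v, ok := out[old]
		if !ok {
			continue
		}
		delete(out, old)
		newKey := old + "_per_zone"
		if _, exists := out[newKey]; !exists {
			out[newKey] = v
		}
	}
	// Assumption: the removed flag maps onto both per-component flags.
	if v, ok := out["multi_zone_zpdb_enabled"]; ok {
		delete(out, "multi_zone_zpdb_enabled")
		for _, k := range []string{"multi_zone_ingester_zpdb_enabled", "multi_zone_store_gateway_zpdb_enabled"} {
			if _, exists := out[k]; !exists {
				out[k] = v
			}
		}
	}
	return out
}

func main() {
	migrated := migrateConfig(map[string]any{
		"autoscaling_querier_min_replicas": 3,
		"autoscaling_querier_max_replicas": 30,
		"multi_zone_zpdb_enabled":          true,
	})
	keys := make([]string, 0, len(migrated))
	for k := range migrated {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	for _, k := range keys {
		fmt.Printf("%s: %v\n", k, migrated[k])
	}
}
```

In a real environment the same renames would be applied by hand (or with a search-and-replace) in the Jsonnet files; the sketch only makes the one-to-one mapping explicit.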
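The store-gateway scale-up ordering in #13879 can be modeled as a simple gate. This is a hypothetical sketch, not the rollout-operator's implementation. It assumes zones are ordered zone-a, zone-b, zone-c and that a zone may grow only when every earlier zone reports all of its desired replicas as ready. The real check may look only at the immediately preceding zone.

```go
package main

import "fmt"

// zoneStatus is a hypothetical summary of one store-gateway zone's StatefulSet.
type zoneStatus struct {
	Name          string
	Replicas      int // desired replicas
	ReadyReplicas int // replicas currently ready
}

// canScaleUp reports whether the zone at index i may scale up: every zone
// before it must have all of its desired replicas ready.
func canScaleUp(zones []zoneStatus, i int) bool {
	for _, z := range zones[:i] {
		if z.ReadyReplicas < z.Replicas {
			return false
		}
	}
	return true
}

func main() {
	zones := []zoneStatus{
		{"store-gateway-zone-a", 6, 6},
		{"store-gateway-zone-b", 6, 4}, // still rolling out
		{"store-gateway-zone-c", 4, 4},
	}
	for i, z := range zones {
		fmt.Printf("%s can scale up: %v\n", z.Name, canScaleUp(zones, i))
	}
}
```

Gating on readiness this way limits how many zones are resizing at once, which matters because a zone that is still catching up should not be joined by another zone changing shape.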
Documentation
- [ENHANCEMENT] Runbook: Add a "Ring Failures" section to the MimirCompactorNotRunningCompaction runbook. #14391
- [ENHANCEMENT] Add an example configuration for Azure object storage with workload identity. #13135
- [ENHANCEMENT] Ruler: Clarify that the internal distributor applies to both operational modes. #13300
- [ENHANCEMENT] Native histograms: Set expectations on querying classic histograms versus NHCBs. #13689
- [ENHANCEMENT] Add a scenario to the MimirCompactorNotRunningCompaction runbook. #13874
- [ENHANCEMENT] Document how ingesters calculate their partition ID from their ring instance ID in ingest storage. #13903
- [ENHANCEMENT] Add an AWS profile authentication example to the mark-blocks tool documentation, and add a centralized section in the runbooks with examples for all cloud providers. #14281
- [BUGFIX] Distributor: Fix a type error in the multi-zone distributor container constructor's env map. #14403
- [BUGFIX] Native histograms: Fix the PromQL query example for histogram_fraction to filter out NaN results when there are no observations. #14433
- [BUGFIX] OTLP: Exponential histograms over OTLP are not experimental. #14437
- [ENHANCEMENT] Kafka: Document that Apache Kafka and Confluent Kafka require message.max.bytes=16000000 to support Mimir's default producer record size. #14875
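The partition ID documentation (#13903) covers how an ingester's ring instance ID maps to a Kafka partition. The Go sketch below assumes the partition ID is the trailing numeric sequence of the StatefulSet pod name (for example, ingester-zone-a-7 → 7). The helper partitionIDFromInstanceID is hypothetical and is not Mimir's code; check the published documentation for the exact rules.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// trailingSeq matches the numeric suffix of an instance ID such as
// "ingester-zone-a-7" or "ingester-12".
var trailingSeq = regexp.MustCompile(`-([0-9]+)$`)

// partitionIDFromInstanceID is a hypothetical helper: it assumes the partition
// ID equals the ingester's sequence number within its StatefulSet.
func partitionIDFromInstanceID(instanceID string) (int32, error) {
	m := trailingSeq.FindStringSubmatch(instanceID)
	if m == nil {
		return 0, fmt.Errorf("instance ID %q has no numeric suffix", instanceID)
	}
	seq, err := strconv.ParseInt(m[1], 10, 32)
	if err != nil {
		return 0, fmt.Errorf("parsing sequence number of %q: %w", instanceID, err)
	}
	return int32(seq), nil
}

func main() {
	for _, id := range []string{"ingester-zone-a-0", "ingester-zone-b-0", "ingester-zone-b-7", "ingester"} {
		p, err := partitionIDFromInstanceID(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> partition %d\n", id, p)
	}
}
```

Under this assumption, ingester-zone-a-0 and ingester-zone-b-0 resolve to the same partition, which is how each zone ends up consuming an identical copy of the data.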
Tools
- [FEATURE] mimir-tool: Add the validate alerts-file command, which performs checks on alert files defined as YAML. #14043
- [FEATURE] mimir-tool: Add the partition-ring add-partition and partition-ring remove-partition commands. #14265
- [FEATURE] mimir-tool: Add the partition-ring add-owner and partition-ring remove-owner commands. #14462
- [FEATURE] tsdb-index-header: Add a tool to inspect the content of a block's index or index-header. #13738 #14279 #14944
- [FEATURE] tsdb-chunks, tsdb-print-chunk: When printing samples, include the start time (ST) in the output. #14337
- [FEATURE] kafkatool: Add the create-topic command to create a Kafka topic with a specified number of partitions. #14639
- [FEATURE] kafkatool: Add the list-topics command to list all Kafka topics and their partition counts. #14639
- [ENHANCEMENT] mimir-tool: Add the __ignore_usage__="" label selector to queries used in the analyze prometheus command, so that the Adaptive Metrics recommendations service ignores them. #14474
- [ENHANCEMENT] mimir-tool: Add TLS client flags (--tls-ca-path, --tls-cert-path, --tls-key-path, --tls-server-name, --tls-insecure-skip-verify) to the remote-read subcommands so they can talk to endpoints protected by mTLS or a private CA. #15132
- [ENHANCEMENT] copyblocks: Support resolving S3 credentials from the environment (IAM roles for service accounts, ECS task roles, and EC2 instance metadata) when -s3.<source|destination>.access-key-id and -s3.<source|destination>.secret-access-key are omitted. #15075
- [BUGFIX] mimir-tool-action: Fix the base image of the GitHub action. #13303
- [BUGFIX] mimir-tool: Don't fail on the $latency_metrics dashboard variable, which is documented for native histogram migrations. #13526
- [BUGFIX] kafkatool: Fix kafkatool dump print to support RW2 records. #13848
- [BUGFIX] mimir-tool-action: Fix special character handling in the NAMESPACES input. #14247
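The new remote-read TLS flags (#15132) correspond to standard fields of Go's crypto/tls client configuration. The sketch below shows one plausible mapping using only the standard library. The tlsFlags struct and buildTLSConfig helper are hypothetical, not mimir-tool's code, and rejecting a certificate path without a key path (or the reverse) is an assumption.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// tlsFlags mirrors the remote-read flags; the struct itself is hypothetical.
type tlsFlags struct {
	CAPath             string // --tls-ca-path: private CA bundle (PEM)
	CertPath           string // --tls-cert-path: client certificate for mTLS
	KeyPath            string // --tls-key-path: client private key for mTLS
	ServerName         string // --tls-server-name: name to verify on the server certificate
	InsecureSkipVerify bool   // --tls-insecure-skip-verify: disables verification (testing only)
}

func buildTLSConfig(f tlsFlags) (*tls.Config, error) {
	// Assumption: a client certificate needs both halves of the key pair.
	if (f.CertPath == "") != (f.KeyPath == "") {
		return nil, errors.New("both --tls-cert-path and --tls-key-path must be set for mTLS")
	}
	cfg := &tls.Config{
		ServerName:         f.ServerName,
		InsecureSkipVerify: f.InsecureSkipVerify,
	}
	if f.CAPath != "" {
		caPEM, err := os.ReadFile(f.CAPath)
		if err != nil {
			return nil, fmt.Errorf("reading CA file: %w", err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, errors.New("no certificates found in CA file")
		}
		cfg.RootCAs = pool // trust the private CA instead of the system roots
	}
	if f.CertPath != "" {
		cert, err := tls.LoadX509KeyPair(f.CertPath, f.KeyPath)
		if err != nil {
			return nil, fmt.Errorf("loading client key pair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}
	return cfg, nil
}

func main() {
	cfg, err := buildTLSConfig(tlsFlags{ServerName: "mimir.example.internal"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("server name:", cfg.ServerName, "client certs:", len(cfg.Certificates))
}
```

Leaving RootCAs nil when no CA path is given keeps Go's default behavior of verifying against the system certificate pool.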
Query-tee
- [CHANGE] Added /api/v1/read as a registered route. #13227
- [CHANGE] Added the cluster validation label configuration option -query-tee.client-cluster-validation.label. If set, query-tee sets the X-Cluster header before forwarding the request to both the primary and secondary backends. #13302
- [CHANGE] Make HTTP and gRPC server options configurable through the same dskit server flags and config block as Mimir. This begins the deprecation cycle for query-tee's server.http-service-address, server.http-service-port, server.grpc-service-address, and server.grpc-service-port flags. #13328 #13355 #13360
- [ENHANCEMENT] Add a /ready endpoint that returns HTTP 200 when the proxy is running. #14478
- [BUGFIX] Fix a bug where query-tee could panic if forwarding a request failed. #14015
All changes in this release: mimir-3.0.6...mimir-3.1.0-rc.0