This release contains 735 PRs from 78 authors, including new contributors Bernaud Vincent, Carrie Edwards, danieleandreatta, David Vávra, Edgaras Giedrė, Gabija Bruzgaitė, Henrique Lourenço, Innokentii Konstantinov, Jasper Maes, Jeanette Tan, Joobi S B, Julius Hinze, Lukas Bischofberger, mihaelmiklec, mimir-vendoring[bot], Nasiel, pierremahot, rektabhi, sam clulow, Shay Pletcher, Thor K. Høgås, Toni Cárdenas, Yuran Ou, zhuoyuan-liu. Thank you!
Grafana Mimir version 2.17.0 release notes
Grafana Labs is excited to announce version 2.17 of Grafana Mimir.
The highlights that follow include the top features, enhancements, and bug fixes in this release.
For the complete list of changes, refer to the CHANGELOG.
Features and enhancements
The Mimir Query Engine (MQE) is enabled by default in queriers. MQE provides benefits over the Prometheus engine, including reduced memory and CPU consumption and improved performance. To use the Prometheus engine instead of MQE, set -querier.query-engine=prometheus.
Grafana Mimir now supports using MQE in query-frontends in addition to queriers. You can enable MQE in query-frontends by setting the experimental CLI flag -query-frontend.query-engine=mimir or through the corresponding YAML option.
You can export the cortex_ingester_attributed_active_native_histogram_series and cortex_ingester_attributed_active_native_histogram_buckets native histogram cost attribution metrics to a custom Prometheus registry with user-specified labels.
Grafana Mimir supports converting OTel explicit bucket histograms to Prometheus native histograms with custom buckets using the distributor.otel-convert-histograms-to-nhcb flag.
The following experimental features have been removed:
- The max_cost_attribution_labels_per_user cost attribution limit
- Read-write deployment mode in the mixin
Important changes
In Grafana Mimir 2.17, the following behavior has changed:
The following default configuration values now apply to the memberlist KV store:
Key | Value |
---|---|
memberlist.packet-dial-timeout | 500ms |
memberlist.packet-write-timeout | 500ms |
memberlist.max-concurrent-writes | 5 |
memberlist.acquire-writer-timeout | 1s |
These values perform better but might cause long-running packets to be dropped in high-latency networks.
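If these defaults turn out to be too aggressive on a high-latency network, each value can be overridden through its CLI flag. The following Python sketch shows one way to generate only the overrides that differ from the 2.17 defaults. The helper name memberlist_flags is illustrative and not part of Mimir, and the relaxed values in the usage note are hypothetical placeholders rather than recommendations.

```python
# Defaults applied to the memberlist KV store in Grafana Mimir 2.17,
# copied from the table above.
MEMBERLIST_DEFAULTS_2_17 = {
    "memberlist.packet-dial-timeout": "500ms",
    "memberlist.packet-write-timeout": "500ms",
    "memberlist.max-concurrent-writes": "5",
    "memberlist.acquire-writer-timeout": "1s",
}


def memberlist_flags(overrides=None):
    """Return CLI arguments for options whose value differs from the 2.17 default.

    `overrides` maps option names to string values. Option names that are not
    in the table are rejected so that a typo does not silently produce a flag.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(MEMBERLIST_DEFAULTS_2_17)
    if unknown:
        raise ValueError(f"unknown memberlist options: {sorted(unknown)}")
    return [
        f"-{name}={value}"
        for name, value in overrides.items()
        if MEMBERLIST_DEFAULTS_2_17[name] != value
    ]
```

For example, calling memberlist_flags({"memberlist.packet-dial-timeout": "2s"}) yields a single argument for the dial timeout and leaves the other three options at their new defaults.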
The -ruler-storage.cache.rule-group-enabled experimental CLI flag has been removed. Caching rule group contents is now always enabled when a cache is configured for the ruler.
The -ingester.ooo-native-histograms-ingestion-enabled CLI flag and corresponding ooo_native_histograms_ingestion_enabled runtime configuration option have been removed. Out-of-order native histograms are now enabled whenever both native histogram and out-of-order ingestion are enabled.
The -ingester.stream-chunks-when-using-blocks CLI flag and corresponding ingester_stream_chunks_when_using_blocks runtime configuration option have been deprecated and will be removed in a future release.
The cortex_distributor_label_values_with_newlines_total metric has been removed.
In the distributor, memberlist is marked as a stable option for backend storage for the high availability tracker. etcd has been deprecated for this purpose.
Experimental features
Grafana Mimir 2.17 includes some features that are experimental and disabled by default.
Use these features with caution and report any issues that you encounter:
- Prometheus Remote-Write 2.0 protocol.
- Duration expressions in PromQL. These are simple arithmetic operations on numbers in offset and range specifications. For example, rate(http_requests_total[5m * 2]).
- Promoting OTel scope metadata, including name, version, schema URL, and attributes, to metric labels, prefixed with otel_scope_. Enable this feature through the -distributor.otel-promote-scope-metadata flag.
- Allowing primitive delta metrics ingestion through the OTLP endpoint with the -distributor.otel-native-delta-ingestion option.
- Support for sort_by_label and sort_by_label_desc PromQL functions.
- Support for cluster validation in HTTP calls. When enabled, the HTTP server verifies if a request coming from an HTTP client comes from an expected cluster. You can configure this validation with the following options:
  - -server.cluster-validation.label
  - -server.cluster-validation.http.enabled
  - -server.cluster-validation.http.soft-validation
  - -server.cluster-validation.http.exclude-paths
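To illustrate how the four cluster validation options might interact, here is a minimal Python sketch of the server-side decision. It is not Mimir's implementation. It assumes the label travels in the X-Cluster request header (the header named in the Mimir Continuous Test changelog entry), that excluded paths are matched by prefix, and that soft validation lets a mismatching request through instead of rejecting it. The function name and return shape are hypothetical.

```python
def validate_cluster(path, headers, expected_label, soft_validation=False, exclude_paths=()):
    """Decide whether an HTTP request passes cluster validation.

    Returns an (allowed, reason) tuple. `headers` is a plain dict, so the
    lookup is case-sensitive here; a real HTTP server treats header names
    case-insensitively.
    """
    # Assumption: excluded paths are compared by prefix.
    if any(path.startswith(prefix) for prefix in exclude_paths):
        return True, "path excluded from validation"

    received = headers.get("X-Cluster", "")
    if received == expected_label:
        return True, "label matches"

    # Assumption: soft validation reports the mismatch but still serves the request.
    if soft_validation:
        return True, f"soft validation: got {received!r}, expected {expected_label!r}"

    return False, f"rejected: got {received!r}, expected {expected_label!r}"
```

Under these assumptions, a request carrying the wrong label is rejected with hard validation, allowed with soft validation, and never checked when its path is excluded.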
Bug fixes
For a detailed list of bug fixes, refer to the CHANGELOG.
Helm chart improvements
The Grafana Mimir and Grafana Enterprise Metrics Helm chart is released independently.
Refer to the Grafana Mimir Helm chart documentation.
Changelog
2.17.0
Grafana Mimir
- [CHANGE] Query-frontend: Ensure that cache keys generated from cardinality estimate middleware are less than 250 bytes in length by hashing the tenant IDs that are included in them. This change invalidates all cardinality estimates in the cache. #11568
- [CHANGE] Ruler: Remove experimental CLI flag -ruler-storage.cache.rule-group-enabled to enable or disable caching the contents of rule groups. Caching rule group contents is now always enabled when a cache is configured for the ruler. #10949
- [CHANGE] Ingester: Out-of-order native histograms are now enabled whenever both native histogram and out-of-order ingestion are enabled. The -ingester.ooo-native-histograms-ingestion-enabled CLI flag and corresponding ooo_native_histograms_ingestion_enabled runtime configuration option have been removed. #10956
- [CHANGE] Distributor: removed the cortex_distributor_label_values_with_newlines_total metric. #10977
- [CHANGE] Ingester/Distributor: renamed the experimental max_cost_attribution_cardinality_per_user config to max_cost_attribution_cardinality. #11092
- [CHANGE] Frontend: The subquery spin-off feature is now enabled with -query-frontend.subquery-spin-off-enabled=true instead of -query-frontend.instant-queries-with-subquery-spin-off=.* #11153
- [CHANGE] Overrides-exporter: Don't export per-tenant overrides that are set to their default values. #11173
- [CHANGE] gRPC/HTTP clients: Rename metric cortex_client_request_invalid_cluster_validation_labels_total to cortex_client_invalid_cluster_validation_label_requests_total. #11237
- [CHANGE] Querier: Use Mimir Query Engine (MQE) by default. Set -querier.query-engine=prometheus to continue using Prometheus' engine. #11501
- [CHANGE] Memcached: Ignore initial DNS resolution failure, meaning don't depend on Memcached on startup. #11602
- [CHANGE] Ingester: The -ingester.stream-chunks-when-using-blocks CLI flag and ingester_stream_chunks_when_using_blocks runtime configuration option have been deprecated and will be removed in a future release. #11711
- [CHANGE] Distributor: track cortex_ingest_storage_writer_latency_seconds metric for failed writes too. Added outcome label to distinguish between success and failure. #11770
- [CHANGE] Distributor: renamed a few metrics used by experimental ingest storage. #11766
  - Renamed cortex_ingest_storage_writer_produce_requests_total to cortex_ingest_storage_writer_produce_records_enqueued_total
  - Renamed cortex_ingest_storage_writer_produce_failures_total to cortex_ingest_storage_writer_produce_records_failed_total
- [CHANGE] Distributor: moved HA tracker timeout config to limits. #11774
  - Moved distributor.ha_tracker.ha_tracker_update_timeout to limits.ha_tracker_update_timeout.
  - Moved distributor.ha_tracker.ha_tracker_update_timeout_jitter_max to limits.ha_tracker_update_timeout_jitter_max.
  - Moved distributor.ha_tracker.ha_tracker_failover_timeout to limits.ha_tracker_failover_timeout.
- [CHANGE] Distributor: Memberlist marked as stable as an option for backend storage for the HA tracker. #11861
- [CHANGE] Distributor: etcd deprecated as an option for backend storage for the HA tracker. #12047
- [CHANGE] Memberlist: Apply new default configuration values for MemberlistKV. This unlocks using it as backend storage for the HA Tracker. We have observed better performance with these defaults across different production loads. #11874
  - memberlist.packet-dial-timeout: 500ms
  - memberlist.packet-write-timeout: 500ms
  - memberlist.max-concurrent-writes: 5
  - memberlist.acquire-writer-timeout: 1s
  These defaults perform better but may cause long-running packets to be dropped in high-latency networks.
- [CHANGE] Query-frontend: Apply query pruning and check for disabled experimental functions earlier in query processing. #11939
- [FEATURE] Distributor: Experimental support for Prometheus Remote-Write 2.0 protocol. Limitations: Created timestamp is ignored, per series metadata is merged on metric family level automatically, ingestion might fail if client sends ProtoBuf fields out of order. The label version is added to the metric cortex_distributor_requests_in_total with a value of either 1.0 or 2.0 depending on the detected Remote-Write protocol. #11100 #11101 #11192 #11143
- [FEATURE] Query-frontend: expand query-frontend.cache-errors and query-frontend.results-cache-ttl-for-errors configuration options to cache non-transient response failures for instant queries. #11120
- [FEATURE] Query-frontend: Allow use of Mimir Query Engine (MQE) via the experimental CLI flags -query-frontend.query-engine or -query-frontend.enable-query-engine-fallback or corresponding YAML. #11417 #11775
- [FEATURE] Querier, query-frontend, ruler: Enable experimental support for duration expressions in PromQL, which are simple arithmetics on numbers in offset and range specification. #11344
- [FEATURE] You can configure Mimir to export traces in OTLP exposition format through the standard OTEL_ environment variables. #11618
- [FEATURE] distributor: Allow configuring tenant-specific HA tracker failover timeouts. #11774
- [FEATURE] OTLP: Add experimental support for promoting OTel scope metadata (name, version, schema URL, attributes) to metric labels, prefixed with otel_scope_. Enable via the -distributor.otel-promote-scope-metadata flag. #11795
- [FEATURE] Distributor: Add experimental -distributor.otel-native-delta-ingestion option to allow primitive delta metrics ingestion via the OTLP endpoint. #11631
- [FEATURE] MQE: Add support for experimental sort_by_label and sort_by_label_desc PromQL functions. #11930
- [FEATURE] Ingester/Block-builder: Handle the created timestamp field for remote-write requests. #11977
- [FEATURE] Cost attribution: Labels specified in the limit configuration may specify an output label in order to override emitted label names. #12035
- [ENHANCEMENT] Dashboards: Add "Influx write requests" row to Writes Dashboard. #11731
- [ENHANCEMENT] Mixin: Add MimirHighVolumeLevel1BlocksQueried alert that fires when level 1 blocks are queried for more than 6 hours, indicating potential compactor performance issues. #11803
- [ENHANCEMENT] Querier: Make the maximum series limit for cardinality API requests configurable on a per-tenant basis with the cardinality_analysis_max_results option. #11456
- [ENHANCEMENT] Querier: Add configurable concurrency limit for remote read queries with the --querier.max-concurrent-remote-read-queries flag. Defaults to 2. Set to 0 for unlimited concurrency. #11892
- [ENHANCEMENT] Dashboards: Add "Queries / sec by read path" to Queries Dashboard. #11640
- [ENHANCEMENT] Dashboards: Add "Added Latency" row to Writes Dashboard. #11579
- [ENHANCEMENT] Ingester: Add support for exporting native histogram cost attribution metrics (cortex_ingester_attributed_active_native_histogram_series and cortex_ingester_attributed_active_native_histogram_buckets) with labels specified by customers to a custom Prometheus registry. #10892
- [ENHANCEMENT] Distributor: Add new metrics cortex_distributor_received_native_histogram_samples_total and cortex_distributor_received_native_histogram_buckets_total to track native histogram samples and bucket counts separately for billing calculations. Updated cortex_distributor_received_samples_total description to clarify it includes native histogram samples. #11728
- [ENHANCEMENT] Store-gateway: Download sparse headers uploaded by compactors. Compactors have to be configured with -compactor.upload-sparse-index-headers=true option. #10879 #11072
- [ENHANCEMENT] Compactor: Upload block index file and multiple segment files concurrently. Concurrency scales linearly with block size up to -compactor.max-per-block-upload-concurrency. #10947
- [ENHANCEMENT] Ingester: Add per-user cortex_ingester_tsdb_wal_replay_unknown_refs_total and cortex_ingester_tsdb_wbl_replay_unknown_refs_total metrics to track unknown series references during WAL/WBL replay. #10981
- [ENHANCEMENT] Added -ingest-storage.kafka.fetch-max-wait configuration option to configure the maximum amount of time a Kafka broker waits for some records before a Fetch response is returned. #11012
- [ENHANCEMENT] Ingester: Add cortex_ingester_tsdb_forced_compactions_in_progress metric reporting a value of 1 when there's a forced TSDB head compaction in progress. #11006
- [ENHANCEMENT] Ingester: Add cortex_ingest_storage_reader_records_batch_fetch_max_bytes metric reporting the distribution of MaxBytes specified in the Fetch requests sent to Kafka. #11014
- [ENHANCEMENT] All: Add experimental support for cluster validation in HTTP calls. When it is enabled, HTTP server verifies if a request coming from an HTTP client comes from an expected cluster. This validation can be configured by the following experimental configuration options: #11010 #11549
  - -server.cluster-validation.label
  - -server.cluster-validation.http.enabled
  - -server.cluster-validation.http.soft-validation
  - -server.cluster-validation.http.exclude-paths
- [ENHANCEMENT] Query-frontend: Add experimental support to include the cluster validation label in HTTP request headers. When cluster validation is enabled on the HTTP server side, cluster validation labels from HTTP request headers are compared with the HTTP server's cluster validation label. #11010 #11145
  - By setting -query-frontend.client-cluster-validation.label, you configure the query-frontend's client cluster validation label.
  - The flag -common.client-cluster-validation.label, if set, provides the default for -query-frontend.client-cluster-validation.label.
- [ENHANCEMENT] Distributor: Add ignore_ingest_storage_errors and ingest_storage_max_wait_time flags to control error handling and timeout behavior during ingest storage migration. #11291
  - -ingest-storage.migration.ignore-ingest-storage-errors
  - -ingest-storage.migration.ingest-storage-max-wait-time
- [ENHANCEMENT] Memberlist: Add -memberlist.abort-if-fast-join-fails support and retries on DNS resolution. #11067
- [ENHANCEMENT] Querier: Allow configuring all gRPC options for store-gateway client, similar to other gRPC clients. #11074
- [ENHANCEMENT] Ruler: Log the number of series returned for each query as result_series_count as part of query stats log lines. #11081
- [ENHANCEMENT] Ruler: Don't log statistics that are not available when using a remote query-frontend as part of query stats log lines. #11083
- [ENHANCEMENT] Ingester: Remove cost-attribution experimental max_cost_attribution_labels_per_user limit. #11090
- [ENHANCEMENT] Update Go to 1.24.2. #11114
- [ENHANCEMENT] Query-frontend: Add cortex_query_samples_processed_total metric. #11110
- [ENHANCEMENT] Query-frontend: Add cortex_query_samples_processed_cache_adjusted_total metric. #11164
- [ENHANCEMENT] Ingester/Distributor: Add cortex_cost_attribution_* metrics to observe the state of the cost-attribution trackers. #11112
- [ENHANCEMENT] Querier: Process multiple remote read queries concurrently instead of sequentially for improved performance. #11732
- [ENHANCEMENT] gRPC/HTTP servers: Add cortex_server_invalid_cluster_validation_label_requests_total metric, which is incremented for every request with an invalid cluster validation label. #11241 #11277
- [ENHANCEMENT] OTLP: Add support for converting OTel explicit bucket histograms to Prometheus native histograms with custom buckets using the distributor.otel-convert-histograms-to-nhcb flag. #11077
- [ENHANCEMENT] Add configurable per-tenant limited_queries, which you can only run at or less than an allowed frequency. #11097
- [ENHANCEMENT] Ingest-Storage: Add ingest-storage.kafka.producer-record-version to allow controlling Kafka record versioning. #11244
- [ENHANCEMENT] Ruler: Update <prometheus-http-prefix>/api/v1/rules and <prometheus-http-prefix>/api/v1/alerts to reply with HTTP error 422 if rule evaluation is completely disabled for the tenant. If only recording rule or alerting rule evaluation is disabled for the tenant, the response now includes a corresponding warning. #11321 #11495 #11511
- [ENHANCEMENT] Add tenant configuration block ruler_alertmanager_client_config which allows the Ruler's Alertmanager client options to be specified on a per-tenant basis. #10816
- [ENHANCEMENT] Distributor: Trace when deduplicating a metric's samples or histograms. #11159 #11715
- [ENHANCEMENT] Store-gateway: Retry querying blocks from store-gateways with dynamic replication until trying all possible store-gateways. #11354 #11398
- [ENHANCEMENT] Query-frontend: Add optional reason to blocked_queries config. #11407 #11434
- [ENHANCEMENT] Distributor: Gracefully handle type assertion of WatchPrefix in HA Tracker to continue checking for updates. #11411 #11461
- [ENHANCEMENT] Querier: Include chunks streamed from store-gateway in Mimir Query Engine memory estimate of query memory usage. #11453 #11465
- [ENHANCEMENT] Querier: Include chunks streamed from ingester in Mimir Query Engine memory estimate of query memory usage. #11457
- [ENHANCEMENT] Query-frontend: Add retry mechanism for remote reads, series, and cardinality Prometheus endpoints. #11533
- [ENHANCEMENT] Ruler: Ignore rulers in non-operation states when getting and syncing rules. #11569
- [ENHANCEMENT] Tracing: Add HTTP headers as span attributes when -server.trace-request-headers is enabled. You can configure which headers to exclude using the -server.trace-request-headers-exclude-list flag. #11655
- [ENHANCEMENT] Ruler: Add new per-tenant limit on minimum rule evaluation interval. #11665
- [ENHANCEMENT] store-gateway: download sparse headers on startup when lazy loading is enabled. #11686
- [ENHANCEMENT] Distributor: added more metrics to troubleshoot Kafka records production latency when experimental ingest storage is enabled: #11766 #11771
  - cortex_ingest_storage_writer_produce_remaining_deadline_seconds: measures the remaining deadline (in seconds) when records are requested to be produced.
  - cortex_ingest_storage_writer_produce_records_enqueue_duration_seconds: measures how long it takes to enqueue produced Kafka records in the client.
  - cortex_ingest_storage_writer_kafka_write_wait_seconds: measures the time spent waiting to write to Kafka backend.
  - cortex_ingest_storage_writer_kafka_write_time_seconds: measures the time spent writing to Kafka backend.
  - cortex_ingest_storage_writer_kafka_read_wait_seconds: measures the time spent waiting to read from Kafka backend.
  - cortex_ingest_storage_writer_kafka_read_time_seconds: measures the time spent reading from Kafka backend.
  - cortex_ingest_storage_writer_kafka_request_duration_e2e_seconds: measures the time from the start of when a Kafka request is written to the end of when the response for that request was fully read from the Kafka backend.
  - cortex_ingest_storage_writer_kafka_request_throttled_seconds: measures how long Kafka requests have been throttled by the Kafka client.
- [ENHANCEMENT] Distributor: Add per-user cortex_distributor_sample_delay_seconds to track delay of ingested samples with regard to wall clock. #11573
- [ENHANCEMENT] Distributor: added circuit breaker to not produce Kafka records at all if the context is already canceled / expired. This applies only when experimental ingest storage is enabled. #11768
- [ENHANCEMENT] Compactor: Optimize the planning phase for tenants with a very large number of blocks, such as tens or hundreds of thousands, at the cost of making it slightly slower for tenants with a very small number of blocks. #11819
- [ENHANCEMENT] Query-frontend: Accurate tracking of samples processed from cache. #11719
- [ENHANCEMENT] Store-gateway: Change level 0 blocks to be reported as 'unknown/old_block' in metrics instead of '0' to improve clarity. Level 0 indicates blocks with metadata from before compaction level tracking was added to the bucket index. #11891
- [ENHANCEMENT] Compactor, distributor, ruler, scheduler and store-gateway: Makes -<component-ring-config>.auto-forget-unhealthy-periods configurable for each component. Deprecates the -store-gateway.sharding-ring.auto-forget-enabled flag. #11923
- [ENHANCEMENT] otlp: Stick to OTLP vocabulary on invalid label value length error. #11889
- [ENHANCEMENT] Ingester: Display user grace interval in the tenant list obtained through the /ingester/tenants endpoint. #11961
- [ENHANCEMENT] kafkatool: add consumer-group delete-offset command as a way to delete the committed offset for a consumer group. #11988
- [ENHANCEMENT] Block-builder-scheduler: Detect gaps in scheduled and completed jobs. #11867
- [ENHANCEMENT] Distributor: Experimental support for Prometheus Remote-Write 2.0 protocol has been updated. Created timestamps are now supported. This feature includes some limitations. If samples in a write request aren't ordered by time, the created timestamp might be dropped. Additionally, per-series metadata is automatically merged on the metric family level. Ingestion might fail if the client sends ProtoBuf fields out-of-order. The label version is added to the metric cortex_distributor_requests_in_total with a value of either 1.0 or 2.0, depending on the detected remote-write protocol. #11977
- [ENHANCEMENT] Query-frontend: Added labels query optimizer that automatically removes redundant __name__!="" matchers from label names and label values queries, improving query performance. You can enable the optimizer per-tenant with the labels_query_optimizer_enabled runtime configuration flag. #12054 #12066 #12076 #12080
- [ENHANCEMENT] Query-frontend: Standardise non-regex patterns in query blocking upon loading of config. #12102
- [ENHANCEMENT] Ruler: Propagate GCS object mutation rate limit for rule group uploads. #12086
- [ENHANCEMENT] Stagger head compaction intervals across zones to prevent compactions from aligning simultaneously, which could otherwise cause strong consistency queries to fail when experimental ingest storage is enabled. #12090
- [ENHANCEMENT] Compactor: Add -compactor.update-blocks-concurrency flag to control concurrency for updating block metadata during bucket index updates, separate from deletion marker concurrency. #12117
- [ENHANCEMENT] Query-frontend: Allow users to set the query-frontend.extra-propagated-headers flag to specify the extra headers allowed to pass through to the rest of the query path. #12174
- [BUGFIX] OTLP: Fix response body and Content-Type header to align with spec. #10852
- [BUGFIX] Compactor: fix issue where block becomes permanently stuck when the Compactor's block cleanup job partially deletes a block. #10888
- [BUGFIX] Storage: fix intermittent failures in S3 upload retries. #10952
- [BUGFIX] Querier: return NaN from irate() if the second-last sample in the range is NaN and Prometheus' query engine is in use. #10956
- [BUGFIX] Ruler: don't count alerts towards cortex_prometheus_notifications_dropped_total if they are dropped due to alert relabelling. #10956
- [BUGFIX] Querier: Fix issue where an entire store-gateway zone leaving caused high CPU usage trying to find active members of the leaving zone. #11028
- [BUGFIX] Query-frontend: Fix blocks retention period enforcement when a request has multiple tenants (tenant federation). #11069
- [BUGFIX] Query-frontend: Fix -query-frontend.query-sharding-max-sharded-queries enforcement for instant queries with binary operators. #11086
- [BUGFIX] Memberlist: Fix hash ring updates before the full-join has been completed, when -memberlist.notify-interval is configured. #11098
- [BUGFIX] Query-frontend: Fix an issue where transient errors could be inadvertently cached. #11198
- [BUGFIX] Ingester: read reactive limiters should activate and deactivate when the ingester changes state. #11234
- [BUGFIX] Query-frontend: Fix an issue where errors from date/time parsing methods did not include the name of the invalid parameter. #11304
- [BUGFIX] Query-frontend: Fix a panic in monolithic mode caused by a clash in labels of the cortex_client_invalid_cluster_validation_label_requests_total metric definition. #11455
- [BUGFIX] Compactor: Fix issue where MimirBucketIndexNotUpdated can fire even though the index has been updated within the alert threshold. #11303
- [BUGFIX] Distributor: fix old entries in the HA Tracker with zero valued "elected at" timestamp. #11462
- [BUGFIX] Query-scheduler: Fix issue where deregistered querier goroutines can cause a panic if their backlogged dequeue requests are serviced. #11510
- [BUGFIX] Ruler: Failures during initial sync must be fatal for the service's startup. #11545
- [BUGFIX] Querier and query-frontend: Fix issue where aggregation functions like topk and quantile could return incorrect results if the scalar parameter is not a constant and Prometheus' query engine is in use. #11548
- [BUGFIX] Querier and query-frontend: Fix issue where range vector selectors could incorrectly ignore samples at the beginning of the range. #11548
- [BUGFIX] Querier: Fix rare panic if a query is canceled while a request to ingesters or store-gateways has just begun. #11613
- [BUGFIX] Ruler: Fix QueryOffset and AlignEvaluationTimeOnInterval being ignored when either recording or alerting rule evaluation is disabled. #11647
- [BUGFIX] Ingester: Fix issue where ingesters could leave read-only mode during forced compactions, resulting in write errors. #11664
- [BUGFIX] Ruler: Fix rare panic when the ruler is shutting down. #11781
- [BUGFIX] Block-builder-scheduler: Fix data loss bug in job assignment. #11785
- [BUGFIX] Compactor: start tracking -compactor.max-compaction-time after the initial compaction planning phase, to avoid rare cases where planning takes longer than -compactor.max-compaction-time and so actual compaction never runs for a tenant. #11834
- [BUGFIX] Distributor: Validate the RW2 symbols field and reject invalid requests that don't have an empty string as the first symbol. #11953
- [BUGFIX] Distributor: Check max_inflight_push_requests_bytes before decompressing incoming requests. #11967
- [BUGFIX] Query-frontend: Allow limit parameter to be 0 in label queries to explicitly request unlimited results. #12054
- [BUGFIX] Distributor: Fix a possible panic in the OTLP push path while handling a gRPC status error. #12072
- [BUGFIX] Query-frontend: Evaluate experimental duration expressions before sharding, splitting, and caching. Otherwise, the result is not correct. #12038
- [BUGFIX] Block-builder-scheduler: Fix bugs in handling of partitions with no commit. #12130
- [BUGFIX] Ingester: Fix issue where ingesters can exit read-only mode during idle compactions, resulting in write errors. #12128
- [BUGFIX] otlp: Reverts #11889 which has a pooled memory re-use bug. #12266
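The first [CHANGE] entry in this list (#11568) keeps cardinality estimate cache keys under 250 bytes by hashing the tenant IDs they contain. The following Python sketch illustrates the general technique only. The key layout, the "example" prefix, the separator, and the choice of SHA-256 are all assumptions made for illustration and do not reflect the query-frontend's actual key format or hash function.

```python
import hashlib

# The changelog entry requires keys shorter than 250 bytes.
MAX_KEY_BYTES = 250


def cardinality_cache_key(tenant_ids, suffix):
    """Build a bounded-length cache key for a (possibly federated) request.

    The tenant IDs are sorted so the same set of tenants always maps to the
    same key, then replaced by a fixed-size digest so that the key length no
    longer grows with the number or length of tenant IDs.
    """
    joined = "|".join(sorted(tenant_ids)).encode("utf-8")
    digest = hashlib.sha256(joined).hexdigest()  # always 64 hex characters
    key = f"example:{digest}:{suffix}"  # hypothetical layout
    if len(key.encode("utf-8")) >= MAX_KEY_BYTES:
        raise ValueError("cache key too long even after hashing tenant IDs")
    return key
```

Because the digest differs from the raw tenant IDs, any key written under an unhashed scheme no longer matches after such a change, which is consistent with the note that existing cardinality estimates in the cache are invalidated.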
Mixin
- [CHANGE] Alerts: Update the query for MimirBucketIndexNotUpdated to use max_over_time to prevent alert firing when pods rotate. #11311, #11426
- [CHANGE] Alerts: Make alerting threshold for DistributorGcUsesTooMuchCpu configurable. #11508
- [CHANGE] Remove support for the experimental read-write deployment mode. #11975
- [CHANGE] Alerts: Replace namespace with job label in golang_alerts. #11957
- [FEATURE] Add an alert if the block-builder-scheduler detects that it has skipped data. #12118
- [ENHANCEMENT] Dashboards: Include absolute number of notifications attempted to alertmanager in 'Mimir / Ruler'. #10918
- [ENHANCEMENT] Alerts: Make MimirRolloutStuck a critical alert if it has been firing for 6h. #10890
- [ENHANCEMENT] Dashboards: Add panels to the Mimir / Tenants and Mimir / Top Tenants dashboards showing the rate of gateway requests. #10978
- [ENHANCEMENT] Alerts: Improve MimirIngesterFailsToProcessRecordsFromKafka to not fire during forced TSDB head compaction. #11006
- [ENHANCEMENT] Alerts: Add alerts for invalid cluster validation labels. #11255 #11282 #11413
- [ENHANCEMENT] Dashboards: Improve "Kafka 100th percentile end-to-end latency when ingesters are running (outliers)" panel, computing the baseline latency on max(10, 10%) of ingesters instead of a fixed 10 replicas. #11581
- [ENHANCEMENT] Dashboards: Add "per-query memory consumption" and "fallback to Prometheus' query engine" panels to the Queries dashboard. #11626
- [ENHANCEMENT] Alerts: Add MimirGoThreadsTooHigh alert. #11836 #11845
- [ENHANCEMENT] Dashboards: Add autoscaling row for ruler query-frontends to Mimir / Remote ruler reads dashboard. #11838
- [BUGFIX] Dashboards: fix "Mimir / Tenants" legends for non-Kubernetes deployments. #10891
- [BUGFIX] Dashboards: fix Query-scheduler RPS panel legend in "Mimir / Reads". #11515
- [BUGFIX] Recording rules: fix cluster_namespace_deployment:actual_replicas:count recording rule when there's a mix of single-zone and multi-zone deployments. #11287
- [BUGFIX] Alerts: Enhance the MimirRolloutStuck alert, so it checks whether rollout groups as a whole (and not spread across instances) are changing or stuck. #11288
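The max(10, 10%) baseline in the Kafka latency panel entry above (#11581) can be read as "use at least 10 ingesters, or 10% of them if that is larger". A minimal Python sketch of that arithmetic follows. Rounding the 10% share up to a whole ingester is an assumption, and the function name is hypothetical; the dashboard query itself may round differently.

```python
def baseline_ingester_count(total_ingesters):
    """Number of ingesters used for the baseline: max(10, 10% of ingesters).

    Integer arithmetic is used for the 10% share, rounded up, to avoid
    floating-point surprises such as 300 * 0.1 evaluating slightly above 30.
    """
    ten_percent = (total_ingesters + 9) // 10  # ceiling of total / 10
    return max(10, ten_percent)
```

With 50 ingesters the baseline stays at the previous fixed value of 10, while a cell with 300 ingesters uses 30.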
Jsonnet
- [CHANGE] Increase the allowed number of rule groups for small, medium_small, and extra_small user tiers by 20%. #11152
- [CHANGE] Update rollout-operator to latest release. #11232 #11748
- [CHANGE] Memcached: Set a timeout of 500ms for the ruler-storage cache instead of the default 200ms. #11231
- [CHANGE] Ruler: If ingest storage is enabled, set the maximum buffered bytes in the Kafka client used by the ruler based on the expected maximum rule evaluation response size, clamping it between 1 GB (default) and 4 GB. #11602
- [CHANGE] All: Environment variable JAEGER_REPORTER_MAX_QUEUE_SIZE is no longer set. Components will use OTel's default value of 2048 unless explicitly configured. You can still configure JAEGER_REPORTER_MAX_QUEUE_SIZE if you configure tracing using Jaeger env vars, and you can always set the OTEL_BSP_MAX_QUEUE_SIZE OTel configuration. #11700
- [CHANGE] Removed jaeger-agent-mixin and _config.jaeger_agent_host configuration. You can configure tracing using an OTLP endpoint through _config.otlp_traces_endpoint. See tracing.libsonnet for more configuration options. #11773
- [CHANGE] Removed ingester_stream_chunks_when_using_blocks option. #11711
- [CHANGE] Enable memberlist.abort-if-fast-join-fails for ingesters using memberlist. #11931 #11950
- [CHANGE] Remove average per-pod series scaling trigger for ingest storage ingester HPA and use one based on max owned series instead. #11952
- [CHANGE] Add store_gateway_grpc_max_query_response_size_bytes config option to set the max store-gateway gRPC query response send size (and corresponding querier receive size), and set to 200MB by default. #11968
- [CHANGE] Removed support for the experimental read-write deployment mode. #11974
- [FEATURE] Make ingest storage ingester HPA behavior configurable through _config.ingest_storage_ingester_hpa_behavior. #11168
- [FEATURE] Add an alternate ingest storage HPA trigger that targets maximum owned series per pod. #11356
- [FEATURE] Make tracing of HTTP headers as span attributes configurable through _config.trace_request_headers. You can exclude certain headers from being traced using _config.trace_request_exclude_headers_list. #11655 #11714
- [FEATURE] Allow configuring tracing with OTel environment variables through $._config.otlp_traces_endpoint. When configured, the $.jaeger_mixin is no longer available for use. #11773 #11981 #12074
- [FEATURE] Updated rollout-operator to support OTEL_ environment variables for tracing. #11787
- [ENHANCEMENT] Add query_frontend_only_args option to specify CLI flags that apply only to query-frontends but not ruler-query-frontends. #11799
- [ENHANCEMENT] Make querier scale up ($_config.autoscaling_querier_scaleup_percent_cap) and scale down rates ($_config.autoscaling_querier_scaledown_percent_cap) configurable. #11862
- [ENHANCEMENT] Set resource requests and limits for the Memcached Prometheus exporter. #11933 #11946
- [ENHANCEMENT] Add assertion to ensure ingester ScaledObject has minimum and maximum replicas set to a value greater than 0. #11979
- [ENHANCEMENT] Add ingest_storage_migration_ignore_ingest_storage_errors and ingest_storage_migration_ingest_storage_max_wait_time configs to control error handling of the partition ingesters during ingest storage migrations. #12105
- [ENHANCEMENT] Add block-builder job processing duration timings and offset-skipped errors to the Block-builder dashboard. #12118
- [BUGFIX] Honor weight argument when building memory HPA query for resource scaled objects. #11935
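The Ruler entry above (#11602) clamps the Kafka client's maximum buffered bytes between 1 GB and 4 GB. The Python sketch below shows the clamping step only. How the jsonnet derives the expected maximum rule evaluation response size is not described in these notes, so it is taken here as an input. The function name is hypothetical, and treating 1 GB as 10^9 bytes is an assumption.

```python
GB = 1_000_000_000  # assumption: decimal gigabytes

MIN_BUFFERED_BYTES = 1 * GB  # lower bound, described as the default
MAX_BUFFERED_BYTES = 4 * GB  # upper bound


def ruler_kafka_max_buffered_bytes(expected_max_response_bytes):
    """Clamp the expected response size into the [1 GB, 4 GB] range."""
    return min(max(expected_max_response_bytes, MIN_BUFFERED_BYTES), MAX_BUFFERED_BYTES)
```

A small expected response size therefore leaves the buffer at the 1 GB lower bound, and a very large one is capped at 4 GB.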
Mimirtool
- [FEATURE] Add --enable-experimental-functions flag to commands that parse PromQL to allow parsing experimental functions such as sort_by_label().
- [ENHANCEMENT] Add --block-size CLI flag to remote-read export that allows setting the output block size. #12025
- [BUGFIX] Fix issue where remote-read doesn't behave like other mimirtool commands for authentication. #11402
- [BUGFIX] Fix issue where remote-read export could omit some samples if the query time range spans multiple blocks. #12025
- [BUGFIX] Fix issue where remote-read export could omit some output blocks in the list printed to the console or fail with read/write on closed pipe. #12025
Mimir Continuous Test
- [FEATURE] Add -tests.client.cluster-validation.label flag to send the X-Cluster header with queries. #11418
Query-tee
Documentation
- [ENHANCEMENT] Update Thanos to Mimir migration guide with a tip to add the __tenant_id__ label. #11584
- [ENHANCEMENT] Update the MimirIngestedDataTooFarInTheFuture runbook with a note about false positives and the endpoint to flush TSDB blocks by user. #11961
Tools
- [ENHANCEMENT] kafkatool: Add offsets command for querying various partition offsets. #11115
- [ENHANCEMENT] listblocks: Output can now also be JSON or YAML for easier parsing. #11184
- [ENHANCEMENT] mark-blocks: Allow specifying blocks from multiple tenants. #11343
- [ENHANCEMENT] undelete-blocks: Support removing S3 delete markers to avoid copying data when recovering blocks. #11256
- [BUGFIX] screenshots: Update to tar-fs v3.1.0 to address CVE-2025-48387. #12030
All changes in this release: mimir-2.16.1...mimir-2.17.0