This release contains 460 PRs from 69 authors, including new contributors Alessandro Verzicco, Alex Greenbank, André Pires, Bjorn Stout, Bruno FERNANDO, Casie Chen, Dustin Wilson, Edwin Tye, Kenny Trytek, Leszek Błażewski, Markus Opolka, Matthew Jacobson, Matt Veitas, mimir-github-bot[bot], Moustafa Baiou, Ryan Brady, TheRealNoob. Thank you!

Grafana Mimir version 2.16.0-rc.0 release notes

Grafana Labs is excited to announce version 2.16 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bug fixes in this release. For the complete list of changes, refer to the CHANGELOG.

Features and enhancements

In rulers, when rule concurrency is enabled for a rule group, its rules will now be reordered and run in batches based on their dependencies. This increases the number of rules that can potentially run concurrently. Note that the global and tenant-specific limits around the number of rule groups and rules per group still apply.

Using mimirtool to analyze Grafana dashboards now supports bar chart, pie chart, state timeline, status history, histogram, candlestick, canvas, flame graph, geomap, node graph, trend, and XY chart panels.
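
The dependency-based batching of rules described above can be pictured as a layered topological sort. The following is a minimal sketch of that idea, not the ruler's actual implementation. The function name `batch_rules`, the rule names, and the `depends_on` mapping are hypothetical; a real ruler derives dependencies from the rule expressions.

```python
def batch_rules(depends_on):
    """Group rules into ordered batches.

    depends_on maps each rule name to the set of rule names whose output it
    reads. Rules in the same batch do not depend on each other, so they are
    candidates for concurrent evaluation.
    """
    # Ignore dependencies on rules outside this group (assumption of the sketch).
    remaining = {
        rule: set(deps).intersection(depends_on)
        for rule, deps in depends_on.items()
    }
    batches = []
    while remaining:
        # A rule is ready once all of its dependencies have been evaluated.
        ready = sorted(r for r, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle between rules")
        batches.append(ready)
        for r in ready:
            del remaining[r]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


rules = {
    "job:requests:rate5m": set(),
    "job:errors:rate5m": set(),
    "job:error_ratio": {"job:requests:rate5m", "job:errors:rate5m"},
    "HighErrorRatio": {"job:error_ratio"},
}
batches = batch_rules(rules)
for index, batch in enumerate(batches):
    print(index, batch)
```

In this sketch the two independent recording rules land in the first batch, and each later batch only starts after the batch it depends on has finished.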

Important changes

In Grafana Mimir 2.16, the following behavior has changed:

Alpine Linux-based Docker images are no longer built for releases; only distroless Docker images are built.

How experimental PromQL functions are enabled has changed.

The experimental CLI flags -querier.promql-experimental-functions-enabled and -query-frontend.block-promql-experimental-functions and respective YAML configuration have been removed from query-frontends and queriers.
Experimental PromQL functions are disabled by default and can only be enabled using the per-tenant setting enabled_promql_experimental_functions.

Support for native histograms and out-of-order native histograms is enabled by default in ingesters.

Distributors discard float and histogram samples with duplicated timestamps from each timeseries in a request before the request is forwarded to ingesters. Discarded samples are tracked by cortex_discarded_samples_total metrics with the reason sample_duplicate_timestamp.
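
To make the duplicate-timestamp behavior described above concrete, here is a minimal sketch of filtering one series' samples before forwarding them. It is an illustration only: the function name is hypothetical, and keeping the first sample for each timestamp is an assumption of the sketch, not a statement about which sample the distributor keeps.

```python
def drop_duplicate_timestamps(samples):
    """Filter the samples of a single series.

    samples is a list of (timestamp_ms, value) tuples. Returns the samples
    with unique timestamps and the number of discarded samples, which a real
    system could report in a discarded-samples counter.
    """
    seen = set()
    kept = []
    discarded = 0
    for timestamp_ms, value in samples:
        if timestamp_ms in seen:
            # Assumption: the first sample seen for a timestamp wins.
            discarded += 1
            continue
        seen.add(timestamp_ms)
        kept.append((timestamp_ms, value))
    return kept, discarded


kept, discarded = drop_duplicate_timestamps(
    [(1000, 1.0), (2000, 2.0), (1000, 3.0)]
)
print(kept, discarded)
```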

Experimental features

Grafana Mimir 2.16 includes some features that are experimental and disabled by default. Use these features with caution and report any issues that you encounter:

Distributors now include experimental support for the Influx line protocol.

Query-frontends now include experimental support to "spin off" subqueries as actual range queries, so that they benefit from query acceleration techniques such as sharding, splitting, and caching.
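
One way to see why a subquery can be served as a range query: a PromQL subquery `expr[range:step]` evaluated at time `t` evaluates `expr` at step-aligned timestamps inside the window ending at `t`. The sketch below computes the parameters of an equivalent range query. It is not the query-frontend's implementation; the function name is hypothetical, and the left-open window `(t - range, t]` is an assumption based on Prometheus 3 range semantics.

```python
def subquery_as_range_query(eval_ts, range_s, step_s):
    """Return (start, end, step) in seconds for the inner expression.

    Evaluation points are multiples of step_s. The window is treated as
    left-open: a point exactly at eval_ts - range_s is excluded.
    If start > end, the window contains no evaluation point.
    """
    window_start = eval_ts - range_s
    # First multiple of step_s strictly greater than window_start.
    start = (window_start // step_s + 1) * step_s
    # Last multiple of step_s at or before the evaluation time.
    end = (eval_ts // step_s) * step_s
    return start, end, step_s


start, end, step = subquery_as_range_query(eval_ts=1000, range_s=300, step_s=60)
points = list(range(start, end + 1, step))
print(points)
```

Because the inner expression becomes an ordinary range query in this view, techniques that apply to range queries, such as splitting by time and caching, can apply to it as well.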

Bug fixes

Distributor: Use a boolean to track changes while merging the ReplicaDesc components, rather than comparing the objects directly. #10185
Querier: fix timeout responding to query-frontend when response size is very close to -querier.frontend-client.grpc-max-send-msg-size. #10154
Query-frontend and querier: show warning/info annotations in some cases where they were missing (if a lazy querier was used). #10277
Query-frontend: Fix an issue where transient errors are inadvertently cached. #10537 #10631
Ruler: fix indeterminate rules being always run concurrently (instead of never) when -ruler.max-independent-rule-evaluation-concurrency is set. prometheus/prometheus#15560 #10258
PromQL: Fix various UTF-8 bugs related to quoting. prometheus/prometheus#15531 #10258
Ruler: Fixed an issue when using the experimental -ruler.max-independent-rule-evaluation-concurrency feature, where if a rule group was eligible for concurrency, it would flap between running concurrently or not based on the time it took after running concurrently. #9726 #10189
Mimirtool: remote-read commands will now return data. #10286
PromQL: Fix deriv, predict_linear and double_exponential_smoothing with histograms. prometheus/prometheus#15686 #10383
MQE: Fix deriv with histograms. #10383
PromQL: Fix <aggr_over_time> functions with histograms. prometheus/prometheus#15711 #10400
MQE: Fix <aggr_over_time> functions with histograms. #10400
Distributor: return HTTP status 415 Unsupported Media Type instead of 200 Success for Remote Write 2.0 until we support it. #10423 #10916
Query-frontend: Add flag -query-frontend.prom2-range-compat and corresponding YAML to rewrite queries with ranges that worked in Prometheus 2 but are invalid in Prometheus 3. #10445 #10461 #10502
Distributor: Fix edge case at the HA-tracker with memberlist as KVStore, where when a replica in the KVStore is marked as deleted but not yet removed, it fails to update the KVStore. #10443
Distributor: Fix panics in DurationWithJitter util functions when computed variance is zero. #10507
Ingester: Fixed a race condition in the PostingsForMatchers cache that may have infrequently returned expired cached postings. #10500
Distributor: Report partially converted OTLP requests with status 400 Bad Request. #10588
Ruler: fix issue where rule evaluations could be missed while shutting down a ruler instance if that instance owns many rule groups. prometheus/prometheus#15804 #10762
Ingester: Add additional check on reactive limiter queue sizes. #10722
TSDB: fix unknown series errors and possible lost data during WAL replay when series are removed from the head due to inactivity and reappear before the next WAL checkpoint. prometheus/prometheus#16060 #10824
Querier: fix issue where label_join could incorrectly return multiple series with the same labels rather than failing with vector cannot contain metrics with the same labelset. prometheus/prometheus#15975 #10826
Querier: fix issue where counter resets on native histograms could be incorrectly under- or over-counted when using subqueries. prometheus/prometheus#15987 #10871
Ingester: fix goroutines and memory leak when experimental ingest storage enabled and a server-side error occurs during metrics ingestion. #10915
Mimirtool: Fix issue where MIMIR_HTTP_PREFIX environment variable was ignored and the value from MIMIR_MIMIR_HTTP_PREFIX was used instead. #10207
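
The DurationWithJitter fix listed above concerns the case where the computed variance rounds down to zero. The following Python analogue is a sketch of that class of bug and its guard, not the Go code in Mimir; the function name and the millisecond units are assumptions, and inputs are assumed to be non-negative.

```python
import random


def duration_with_jitter(duration_ms, variance_perc):
    """Return duration_ms adjusted by a random jitter of +/- variance_perc."""
    variance = int(duration_ms * variance_perc)
    if variance == 0:
        # Without this guard, random.randrange(0) raises ValueError because
        # the range is empty. Small durations hit this case after rounding.
        return duration_ms
    # randrange(2 * variance) yields 0 .. 2*variance - 1, so the jitter is
    # in the half-open interval [-variance, variance).
    jitter = random.randrange(variance * 2) - variance
    return duration_ms + jitter


print(duration_with_jitter(1, 0.1))     # variance rounds to 0, returned as-is
print(duration_with_jitter(1000, 0.1))  # some value between 900 and 1099
```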

Helm chart improvements

The Grafana Mimir and Grafana Enterprise Metrics Helm charts are released independently.
Refer to the Grafana Mimir Helm chart documentation.

Changelog

2.16.0-rc.0

Grafana Mimir

[CHANGE] Querier: pass context to queryable IsApplicable hook. #10451
[CHANGE] Distributor: OTLP and push handler replace all non-UTF8 characters with the unicode replacement character \uFFFD in error messages before propagating them. #10236
[CHANGE] Querier: pass query matchers to queryable IsApplicable hook. #10256
[CHANGE] Build: removed Mimir Alpine Docker image and related CI tests. #10469
[CHANGE] Query-frontend: Add topic label to cortex_ingest_storage_strong_consistency_requests_total, cortex_ingest_storage_strong_consistency_failures_total, and cortex_ingest_storage_strong_consistency_wait_duration_seconds metrics. #10220
[CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via -ruler.query-frontend.max-retries-rate. #10375 #10403
[CHANGE] Query-frontend: Add topic label to cortex_ingest_storage_reader_last_produced_offset_requests_total, cortex_ingest_storage_reader_last_produced_offset_failures_total, cortex_ingest_storage_reader_last_produced_offset_request_duration_seconds, cortex_ingest_storage_reader_partition_start_offset_requests_total, cortex_ingest_storage_reader_partition_start_offset_failures_total, cortex_ingest_storage_reader_partition_start_offset_request_duration_seconds metrics. #10462
[CHANGE] Ingester: Set -ingester.ooo-native-histograms-ingestion-enabled to true by default. #10483
[CHANGE] Ruler: Add user and reason labels to cortex_ruler_write_requests_failed_total and cortex_ruler_queries_failed_total; add user to cortex_ruler_write_requests_total and cortex_ruler_queries_total metrics. #10536
[CHANGE] Querier / Query-frontend: Remove experimental -querier.promql-experimental-functions-enabled and -query-frontend.block-promql-experimental-functions CLI flags and respective YAML configuration options to enable experimental PromQL functions. Instead, access to experimental PromQL functions is blocked by default. You can enable them using the per-tenant setting enabled_promql_experimental_functions. #10660 #10712
[CHANGE] Store-gateway: Include posting sampling rate in sparse index headers. When the sampling rate isn't set in a sparse index header, store gateway rebuilds the sparse header with the configured blocks-storage.bucket-store.posting-offsets-in-mem-sampling value. If the sparse header's sampling rate is set but doesn't match the configured rate, store gateway either rebuilds the sparse header or downsamples to the configured sampling rate. #10684 #10878
[CHANGE] Distributor: Return specific error message when burst size limit is exceeded. #10835
[CHANGE] Ingester: enable native histograms ingestion by default, meaning ingester.native-histograms-ingestion-enabled defaults to true. #10867
[FEATURE] Ingester/Distributor: Add support for exporting cost attribution metrics (cortex_ingester_attributed_active_series, cortex_distributor_received_attributed_samples_total, and cortex_discarded_attributed_samples_total) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269 #10702
[FEATURE] Ruler: Added /ruler/tenants endpoints to list the discovered tenants with rule groups. #10738
[FEATURE] Distributor: Add experimental Influx handler. #10153
[ENHANCEMENT] Compactor: Expose cortex_bucket_index_last_successful_update_timestamp_seconds for all tenants assigned to the compactor before starting the block cleanup job. #10569
[ENHANCEMENT] Query Frontend: Return server-side samples_processed statistics. #10103
[ENHANCEMENT] Distributor: OTLP receiver now also converts metric metadata. See also prometheus/prometheus#15416. #10168
[ENHANCEMENT] Distributor: discard float and histogram samples with duplicated timestamps from each timeseries in a request before the request is forwarded to ingesters. Discarded samples are tracked by cortex_discarded_samples_total metrics with the reason sample_duplicate_timestamp. #10145 #10430
[ENHANCEMENT] Ruler: Add cortex_prometheus_rule_group_last_rule_duration_sum_seconds metric to track the total evaluation duration of a rule group regardless of concurrency. #10189
[ENHANCEMENT] Distributor: Add native histogram support for electedReplicaPropagationTime metric in ha_tracker. #10264
[ENHANCEMENT] Ingester: More efficient CPU/memory utilization-based read request limiting. #10325
[ENHANCEMENT] OTLP: In addition to the flag -distributor.otel-created-timestamp-zero-ingestion-enabled there is now -distributor.otel-start-time-quiet-zero to convert OTel start timestamps to Prometheus QuietZeroNaNs. This flag is to make the change rollout safe between Ingesters and Distributors. #10238
[ENHANCEMENT] Ruler: When rule concurrency is enabled for a rule group, its rules will now be reordered and run in batches based on their dependencies. This increases the number of rules that can potentially run concurrently. Note that the global and tenant-specific limits still apply. #10400
[ENHANCEMENT] Query-frontend: include more information about read consistency in trace spans produced when using experimental ingest storage. #10412
[ENHANCEMENT] Ingester: Hide tokens in ingester ring status page when ingest storage is enabled. #10399
[ENHANCEMENT] Ingester: add active_series_additional_custom_trackers configuration, in addition to the already existing active_series_custom_trackers. The active_series_additional_custom_trackers configuration allows you to configure additional custom trackers that get merged with active_series_custom_trackers at runtime. #10428
[ENHANCEMENT] Query-frontend: Allow blocking raw HTTP requests with the blocked_requests configuration. Requests can be blocked based on their path, method, or query parameters. #10484
[ENHANCEMENT] Ingester: Added the following metrics exported by PostingsForMatchers cache: #10500 #10525
- cortex_ingester_tsdb_head_postings_for_matchers_cache_hits_total
- cortex_ingester_tsdb_head_postings_for_matchers_cache_misses_total
- cortex_ingester_tsdb_head_postings_for_matchers_cache_requests_total
- cortex_ingester_tsdb_head_postings_for_matchers_cache_skips_total
- cortex_ingester_tsdb_head_postings_for_matchers_cache_evictions_total
- cortex_ingester_tsdb_block_postings_for_matchers_cache_hits_total
- cortex_ingester_tsdb_block_postings_for_matchers_cache_misses_total
- cortex_ingester_tsdb_block_postings_for_matchers_cache_requests_total
- cortex_ingester_tsdb_block_postings_for_matchers_cache_skips_total
- cortex_ingester_tsdb_block_postings_for_matchers_cache_evictions_total
[ENHANCEMENT] Add support for the HTTP header X-Filter-Queryables which allows callers to decide which queryables should be used by the querier, useful for debugging and testing queryables in isolation. #10552 #10594
[ENHANCEMENT] Compactor: Shuffle users' order in BlocksCleaner. Prevents bucket indexes from going an extended period without cleanup during compactor restarts. #10513
[ENHANCEMENT] Distributor, querier, ingester and store-gateway: Add support for limit parameter for label names and values requests. #10410
[ENHANCEMENT] Ruler: Adds support for filtering results from rule status endpoint by file[], rule_group[] and rule_name[]. #10589
[ENHANCEMENT] Query-frontend: Add option to "spin off" subqueries as actual range queries, so that they benefit from query acceleration techniques such as sharding, splitting, and caching. To enable this feature, set the -query-frontend.instant-queries-with-subquery-spin-off=<comma separated list> option on the frontend or the instant_queries_with_subquery_spin_off per-tenant override with regular expressions matching the queries to enable. #10460 #10603 #10621 #10742 #10796
[ENHANCEMENT] Querier, ingester: The series API respects passed limit parameter. #10620 #10652
[ENHANCEMENT] Store-gateway: Add experimental settings under -store-gateway.dynamic-replication to allow more than the default of 3 store-gateways to own recent blocks. #10382 #10637
[ENHANCEMENT] Ingester: Add reactive concurrency limiters to protect push and read operations from overload. #10574
[ENHANCEMENT] Compactor: Add experimental -compactor.max-lookback option to limit blocks considered in each compaction cycle. Blocks uploaded prior to the lookback period aren't processed. This option helps reduce CPU utilization in tenants with large block metadata files that are processed before each compaction. #10585 #10794
[ENHANCEMENT] Distributor: Optionally expose the current HA replica for each tenant in the cortex_ha_tracker_elected_replica_status metric. This is enabled with the -distributor.ha-tracker.enable-elected-replica-metric=true flag. #10644
[ENHANCEMENT] Enable three Go runtime metrics: #10641
- go_cpu_classes_gc_total_cpu_seconds_total
- go_cpu_classes_total_cpu_seconds_total
- go_cpu_classes_idle_cpu_seconds_total
[ENHANCEMENT] All: Add experimental support for cluster validation in gRPC calls. When it is enabled, gRPC server verifies if a request coming from a gRPC client comes from an expected cluster. This validation can be configured by the following experimental configuration options: #10767
- -server.cluster-validation.label
- -server.cluster-validation.grpc.enabled
- -server.cluster-validation.grpc.soft-validation
[ENHANCEMENT] gRPC clients: Add experimental support to include the cluster validation label in gRPC metadata. When cluster validation is enabled on gRPC server side, the cluster validation label from gRPC metadata is compared with the gRPC server's cluster validation label. #10869 #10883
- By setting -<grpc-client-config-path>.cluster-validation.label, you configure the cluster validation label of a single gRPC client, whose grpcclient.Config object is configurable through -<grpc-client-config-path>.
- By setting -common.client-cluster-validation.label, you configure the cluster validation label of all gRPC clients.
[ENHANCEMENT] gRPC clients: Add cortex_client_request_invalid_cluster_validation_labels_total metrics, that are used by Mimir's gRPC clients to track invalid cluster validations. #10767
[ENHANCEMENT] Add experimental metric cortex_distributor_dropped_native_histograms_total to measure native histograms silently dropped when native histograms are disabled for a tenant. #10760
[ENHANCEMENT] Compactor: Add experimental -compactor.upload-sparse-index-headers option. When enabled, the compactor will attempt to upload sparse index headers to object storage. This prevents latency spikes after adding store-gateway replicas. #10684
[ENHANCEMENT] Memcached: Add experimental -<prefix>.memcached.addresses-provider flag to use alternate DNS service discovery backends when discovering Memcached hosts. #10895
[BUGFIX] Distributor: Use a boolean to track changes while merging the ReplicaDesc components, rather than comparing the objects directly. #10185
[BUGFIX] Querier: fix timeout responding to query-frontend when response size is very close to -querier.frontend-client.grpc-max-send-msg-size. #10154
[BUGFIX] Query-frontend and querier: show warning/info annotations in some cases where they were missing (if a lazy querier was used). #10277
[BUGFIX] Query-frontend: Fix an issue where transient errors are inadvertently cached. #10537 #10631
[BUGFIX] Ruler: fix indeterminate rules being always run concurrently (instead of never) when -ruler.max-independent-rule-evaluation-concurrency is set. prometheus/prometheus#15560 #10258
[BUGFIX] PromQL: Fix various UTF-8 bugs related to quoting. prometheus/prometheus#15531 #10258
[BUGFIX] Ruler: Fixed an issue when using the experimental -ruler.max-independent-rule-evaluation-concurrency feature, where if a rule group was eligible for concurrency, it would flap between running concurrently or not based on the time it took after running concurrently. #9726 #10189
[BUGFIX] Mimirtool: remote-read commands will now return data. #10286
[BUGFIX] PromQL: Fix deriv, predict_linear and double_exponential_smoothing with histograms. prometheus/prometheus#15686 #10383
[BUGFIX] MQE: Fix deriv with histograms. #10383
[BUGFIX] PromQL: Fix <aggr_over_time> functions with histograms. prometheus/prometheus#15711 #10400
[BUGFIX] MQE: Fix <aggr_over_time> functions with histograms. #10400
[BUGFIX] Distributor: return HTTP status 415 Unsupported Media Type instead of 200 Success for Remote Write 2.0 until we support it. #10423
[BUGFIX] Query-frontend: Add flag -query-frontend.prom2-range-compat and corresponding YAML to rewrite queries with ranges that worked in Prometheus 2 but are invalid in Prometheus 3. #10445 #10461 #10502
[BUGFIX] Distributor: Fix edge case at the HA-tracker with memberlist as KVStore, where when a replica in the KVStore is marked as deleted but not yet removed, it fails to update the KVStore. #10443
[BUGFIX] Distributor: Fix panics in DurationWithJitter util functions when computed variance is zero. #10507
[BUGFIX] Ingester: Fixed a race condition in the PostingsForMatchers cache that may have infrequently returned expired cached postings. #10500
[BUGFIX] Distributor: Report partially converted OTLP requests with status 400 Bad Request. #10588
[BUGFIX] Ruler: fix issue where rule evaluations could be missed while shutting down a ruler instance if that instance owns many rule groups. prometheus/prometheus#15804 #10762
[BUGFIX] Ingester: Add additional check on reactive limiter queue sizes. #10722
[BUGFIX] TSDB: fix unknown series errors and possible lost data during WAL replay when series are removed from the head due to inactivity and reappear before the next WAL checkpoint. prometheus/prometheus#16060 #10824
[BUGFIX] Querier: fix issue where label_join could incorrectly return multiple series with the same labels rather than failing with vector cannot contain metrics with the same labelset. prometheus/prometheus#15975 #10826
[BUGFIX] Querier: fix issue where counter resets on native histograms could be incorrectly under- or over-counted when using subqueries. prometheus/prometheus#15987 #10871
[BUGFIX] Ingester: fix goroutines and memory leak when experimental ingest storage enabled and a server-side error occurs during metrics ingestion. #10915
[BUGFIX] Alertmanager: Avoid fetching Grafana state if Grafana AM compatibility is not enabled. #10857
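
The ruler change above that caps remote query evaluation retries at 170/sec can be pictured as a token bucket. This is a minimal sketch of that general technique, not a description of how the ruler implements the cap; the class name, the `burst` parameter, and the injectable `clock` are hypothetical.

```python
import time


class RetryRateLimiter:
    """Token bucket allowing at most `rate` retries per second on average."""

    def __init__(self, rate, burst=1, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        """Return True if a retry may proceed now, consuming one token."""
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


limiter = RetryRateLimiter(rate=170)
print(limiter.allow())  # the first retry consumes the initial token
```

A caller would check `allow()` before each retry and skip or delay the retry when it returns False.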

Mixin

[CHANGE] Alerts: Only alert on errors performing cache operations if there are over 10 request/sec to avoid flapping. #10832
[FEATURE] Add compiled mixin for GEM installations in operations/mimir-mixin-compiled-gem. #10690 #10877
[ENHANCEMENT] Dashboards: clarify that the ingester and store-gateway panels on the 'Reads' dashboard show data from all query requests to that component, not just requests from the main query path (i.e., requests from the ruler query path are included as well). #10598
[ENHANCEMENT] Dashboards: add ingester and store-gateway panels from the 'Reads' dashboard to the 'Remote ruler reads' dashboard as well. #10598
[ENHANCEMENT] Dashboards: add ingester and store-gateway panels showing only requests from the respective dashboard's query path to the 'Reads' and 'Remote ruler reads' dashboards. For example, the 'Remote ruler reads' dashboard now has panels showing the ingester query request rate from ruler-queriers. #10598
[ENHANCEMENT] Dashboards: 'Writes' dashboard: show write requests broken down by request type. #10599
[ENHANCEMENT] Dashboards: clarify when query-frontend and query-scheduler dashboard panels are expected to show no data. #10624
[ENHANCEMENT] Alerts: Add warning alert DistributorGcUsesTooMuchCpu. #10641
[ENHANCEMENT] Dashboards: Add "Federation-frontend" dashboard for GEM. #10697 #10736
[ENHANCEMENT] Dashboards: Add Query-Scheduler <-> Querier Inflight Requests row to Query Reads and Remote Ruler reads dashboards. #10290
[ENHANCEMENT] Alerts: Add "Federation-frontend" alert for remote clusters returning errors. #10698
[BUGFIX] Dashboards: fix how we switch between classic and native histograms. #10018
[BUGFIX] Alerts: Ignore cache errors performing delete operations since these are expected to fail when keys don't exist. #10287
[BUGFIX] Dashboards: fix "Mimir / Rollout Progress" latency comparison when gateway is enabled. #10495
[BUGFIX] Dashboards: fix autoscaling panels when Mimir is deployed using Helm. #10473
[BUGFIX] Alerts: fix MimirAutoscalerNotActive alert. #10564

Jsonnet

[CHANGE] Update rollout-operator version to 0.23.0. #10229 #10750
[CHANGE] Memcached: Update to Memcached 1.6.34. #10318
[CHANGE] Change multi-AZ deployments default toleration value from 'multi-az' to 'secondary-az', and make it configurable via the following settings: #10596
- _config.multi_zone_schedule_toleration (default)
- _config.multi_zone_distributor_schedule_toleration (distributor's override)
- _config.multi_zone_etcd_schedule_toleration (etcd's override)
[CHANGE] Ring: relaxed the hash ring heartbeat timeout for store-gateways: #10634
- -store-gateway.sharding-ring.heartbeat-timeout set to 10m
[CHANGE] Memcached: Use 3 replicas for all cache types by default. #10739
[ENHANCEMENT] Enforce persistentVolumeClaimRetentionPolicy Retain policy on partition ingesters during migration to experimental ingest storage. #10395
[ENHANCEMENT] Allow topologySpreadConstraints to be left unconfigured by setting the following configuration options to a negative value: #10540
- distributor_topology_spread_max_skew
- query_frontend_topology_spread_max_skew
- querier_topology_spread_max_skew
- ruler_topology_spread_max_skew
- ruler_querier_topology_spread_max_skew
[ENHANCEMENT] Validate the $._config.shuffle_sharding.ingester_partitions_shard_size value when partition shuffle sharding is enabled in the ingest-storage mode. #10746
[BUGFIX] Ports in container rollout-operator. #10273
[BUGFIX] When downscaling is enabled, the components must annotate prepare-downscale-http-port with the value set in $._config.server_http_port. #10367

Mimirtool

[BUGFIX] Fix issue where MIMIR_HTTP_PREFIX environment variable was ignored and the value from MIMIR_MIMIR_HTTP_PREFIX was used instead. #10207
[ENHANCEMENT] Unify mimirtool authentication options and add extra-headers support for commands that depend on MimirClient. #10178
[ENHANCEMENT] mimirtool grafana analyze now supports custom panels. #10669
[ENHANCEMENT] mimirtool grafana analyze now supports bar chart, pie chart, state timeline, status history, histogram, candlestick, canvas, flame graph, geomap, node graph, trend, and XY chart panels. #10669

Mimir Continuous Test

Query-tee

[ENHANCEMENT] Allow skipping comparisons when preferred backend fails. Disabled by default, enable with -proxy.compare-skip-preferred-backend-failures=true. #10612

Documentation

[CHANGE] Add production tips related to cache size, heavy multi-tenancy and latency spikes. #9978
[ENHANCEMENT] Update MimirAutoscalerNotActive and MimirAutoscalerKedaFailing runbooks, with an instruction to check whether Prometheus has enough CPU allocated. #10257

Tools

[CHANGE] copyblocks: Remove /pprof endpoint. #10329
[CHANGE] mark-blocks: Replace markblocks with added features including removing markers and reading block identifiers from a file. #10597

All changes in this release: mimir-2.15.1...mimir-2.16.0-rc.0
