This release contains 460 PRs from 69 authors, including new contributors Alessandro Verzicco, Alex Greenbank, André Pires, Bjorn Stout, Bruno FERNANDO, Casie Chen, Dustin Wilson, Edwin Tye, Kenny Trytek, Leszek Błażewski, Markus Opolka, Matthew Jacobson, Matt Veitas, mimir-github-bot[bot], Moustafa Baiou, Ryan Brady, TheRealNoob. Thank you!
Grafana Mimir version 2.16.0-rc.0 release notes
Grafana Labs is excited to announce version 2.16 of Grafana Mimir.
The highlights that follow include the top features, enhancements, and bug fixes in this release. For the complete list of changes, refer to the CHANGELOG.
Features and enhancements
In rulers, when rule concurrency is enabled for a rule group, its rules will now be reordered and run in batches based on their dependencies. This increases the number of rules that can potentially run concurrently. Note that the global and tenant-specific limits around the number of rule groups and rules per group still apply.
Using `mimirtool` to analyze Grafana dashboards now supports bar chart, pie chart, state timeline, status history, histogram, candlestick, canvas, flame graph, geomap, node graph, trend, and XY chart panels.
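The dependency-based batching of rules described above can be pictured as a level-by-level topological sort. The following Go program is a minimal sketch under stated assumptions, not the ruler's actual implementation: the function `batchByDependencies`, the rule names, and the hand-written dependency map are all hypothetical, and how dependencies between rules are detected is out of scope here.

```go
package main

import (
	"fmt"
	"sort"
)

// batchByDependencies groups rules into ordered batches so that every rule
// lands in a later batch than all of the rules it depends on. Rules inside
// one batch do not depend on each other and could run concurrently.
// deps maps a (hypothetical) rule name to the rules whose output it reads.
func batchByDependencies(deps map[string][]string) ([][]string, error) {
	remaining := make(map[string]int, len(deps)) // unmet dependencies per rule
	dependents := make(map[string][]string)      // reverse edges
	for rule, ds := range deps {
		remaining[rule] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("rule %q depends on unknown rule %q", rule, d)
			}
			dependents[d] = append(dependents[d], rule)
		}
	}

	// The first batch holds every rule without dependencies.
	var current []string
	for rule, n := range remaining {
		if n == 0 {
			current = append(current, rule)
		}
	}

	var batches [][]string
	scheduled := 0
	for len(current) > 0 {
		sort.Strings(current) // deterministic order within a batch
		batches = append(batches, current)
		scheduled += len(current)

		var next []string
		for _, rule := range current {
			for _, dep := range dependents[rule] {
				remaining[dep]--
				if remaining[dep] == 0 {
					next = append(next, dep)
				}
			}
		}
		current = next
	}
	if scheduled != len(deps) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return batches, nil
}

func main() {
	deps := map[string][]string{
		"a": nil, "b": nil, "c": {"a"}, "d": {"a", "b"}, "e": {"c", "d"},
	}
	batches, err := batchByDependencies(deps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(batches) // prints "[[a b] [c d] [e]]"
}
```

In this sketch, a concurrency limit would cap how many rules of one batch run at the same time. This is analogous to the global and tenant-specific limits mentioned above, which continue to apply.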
Important changes
In Grafana Mimir 2.16, the following behavior has changed:
Alpine Linux based Docker images are no longer built for releases, only distroless Docker images.
How experimental PromQL functions are enabled has changed.
- The experimental CLI flags `-querier.promql-experimental-functions-enabled` and `-query-frontend.block-promql-experimental-functions` and respective YAML configuration have been removed from query-frontends and queriers.
- Experimental PromQL functions are disabled by default but can be enabled using only the per-tenant setting `enabled_promql_experimental_functions`.
Support for native histograms and out-of-order native histograms is enabled by default in ingesters.
Distributors discard float and histogram samples with duplicated timestamps from each timeseries in a request before the request is forwarded to ingesters. Discarded samples are tracked by `cortex_discarded_samples_total` metrics with the reason `sample_duplicate_timestamp`.
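To illustrate the duplicate-timestamp behavior described above, here is a minimal Go sketch. It is not the distributor's code: the `sample` type and the `dropDuplicateTimestamps` function are hypothetical, and keeping the first sample seen for each timestamp is an assumption made for the example. The returned count corresponds conceptually to what would be reported under the `sample_duplicate_timestamp` reason.

```go
package main

import "fmt"

// sample is a simplified, hypothetical stand-in for one float sample.
type sample struct {
	TimestampMs int64
	Value       float64
}

// dropDuplicateTimestamps keeps the first sample seen for each timestamp
// (an assumption of this sketch) and reports how many were discarded.
func dropDuplicateTimestamps(in []sample) (kept []sample, discarded int) {
	seen := make(map[int64]struct{}, len(in))
	kept = make([]sample, 0, len(in))
	for _, s := range in {
		if _, dup := seen[s.TimestampMs]; dup {
			discarded++
			continue
		}
		seen[s.TimestampMs] = struct{}{}
		kept = append(kept, s)
	}
	return kept, discarded
}

func main() {
	series := []sample{{1000, 1.0}, {2000, 2.0}, {1000, 9.0}, {3000, 3.0}}
	kept, discarded := dropDuplicateTimestamps(series)
	fmt.Println(len(kept), discarded) // prints "3 1"
}
```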
Experimental features
Grafana Mimir 2.16 includes some features that are experimental and disabled by default. Use these features with caution and report any issues that you encounter:
Distributors now include experimental support for the Influx line protocol.
Query-frontends now include experimental support to "spin off" subqueries as actual range queries, so that they benefit from query acceleration techniques such as sharding, splitting, and caching.
Bug fixes
- Distributor: Use a boolean to track changes while merging the ReplicaDesc components, rather than comparing the objects directly. #10185
- Querier: fix timeout responding to query-frontend when response size is very close to `-querier.frontend-client.grpc-max-send-msg-size`. #10154
- Query-frontend and querier: show warning/info annotations in some cases where they were missing (if a lazy querier was used). #10277
- Query-frontend: Fix an issue where transient errors are inadvertently cached. #10537 #10631
- Ruler: fix indeterminate rules being always run concurrently (instead of never) when `-ruler.max-independent-rule-evaluation-concurrency` is set. prometheus/prometheus#15560 #10258
- PromQL: Fix various UTF-8 bugs related to quoting. prometheus/prometheus#15531 #10258
- Ruler: Fixed an issue when using the experimental `-ruler.max-independent-rule-evaluation-concurrency` feature, where if a rule group was eligible for concurrency, it would flap between running concurrently or not based on the time it took after running concurrently. #9726 #10189
- Mimirtool: `remote-read` commands will now return data. #10286
- PromQL: Fix deriv, predict_linear and double_exponential_smoothing with histograms prometheus/prometheus#15686 #10383
- MQE: Fix deriv with histograms #10383
- PromQL: Fix <aggr_over_time> functions with histograms prometheus/prometheus#15711 #10400
- MQE: Fix <aggr_over_time> functions with histograms #10400
- Distributor: return HTTP status 415 Unsupported Media Type instead of 200 Success for Remote Write 2.0 until we support it. #10423 #10916
- Query-frontend: Add flag `-query-frontend.prom2-range-compat` and corresponding YAML to rewrite queries with ranges that worked in Prometheus 2 but are invalid in Prometheus 3. #10445 #10461 #10502
- Distributor: Fix edge case at the HA-tracker with memberlist as KVStore, where when a replica in the KVStore is marked as deleted but not yet removed, it fails to update the KVStore. #10443
- Distributor: Fix panics in `DurationWithJitter` util functions when computed variance is zero. #10507
- Ingester: Fixed a race condition in the `PostingsForMatchers` cache that may have infrequently returned expired cached postings. #10500
- Distributor: Report partially converted OTLP requests with status 400 Bad Request. #10588
- Ruler: fix issue where rule evaluations could be missed while shutting down a ruler instance if that instance owns many rule groups. prometheus/prometheus#15804 #10762
- Ingester: Add additional check on reactive limiter queue sizes. #10722
- TSDB: fix unknown series errors and possible lost data during WAL replay when series are removed from the head due to inactivity and reappear before the next WAL checkpoint. prometheus/prometheus#16060 #10824
- Querier: fix issue where `label_join` could incorrectly return multiple series with the same labels rather than failing with `vector cannot contain metrics with the same labelset`. prometheus/prometheus#15975 #10826
- Querier: fix issue where counter resets on native histograms could be incorrectly under- or over-counted when using subqueries. prometheus/prometheus#15987 #10871
- Ingester: fix goroutines and memory leak when experimental ingest storage enabled and a server-side error occurs during metrics ingestion. #10915
- Mimirtool: Fix issue where `MIMIR_HTTP_PREFIX` environment variable was ignored and the value from `MIMIR_MIMIR_HTTP_PREFIX` was used instead. #10207
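One of the fixes above concerns panics in `DurationWithJitter` when the computed variance is zero. The Go sketch below is not the actual utility. It shows one plausible way such a panic can arise and be guarded against, based on the documented behavior that `rand.Int63n` panics when its argument is not positive. The helper name `durationWithJitter` and its exact formula are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// durationWithJitter returns d shifted by a random amount in the range
// [-variance, +variance), where variance = d * variancePerc.
// Hypothetical helper written for this sketch.
func durationWithJitter(d time.Duration, variancePerc float64) time.Duration {
	variance := int64(float64(d) * variancePerc)
	if variance <= 0 {
		// rand.Int63n panics if its argument is <= 0, which happens here
		// when d is tiny or variancePerc is zero. Return d unchanged.
		return d
	}
	jitter := rand.Int63n(variance*2) - variance
	return d + time.Duration(jitter)
}

func main() {
	// 1ns * 0.1 truncates to a variance of zero; without the guard above,
	// rand.Int63n(0) would panic.
	fmt.Println(durationWithJitter(time.Nanosecond, 0.1)) // prints "1ns"

	// With a non-zero variance the result falls within ±10% of the input.
	fmt.Println(durationWithJitter(10*time.Second, 0.1))
}
```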
Helm chart improvements
The Grafana Mimir and Grafana Enterprise Metrics Helm charts are released independently.
Refer to the Grafana Mimir Helm chart documentation.
Changelog
2.16.0-rc.0
Grafana Mimir
- [CHANGE] Querier: pass context to queryable `IsApplicable` hook. #10451
- [CHANGE] Distributor: OTLP and push handler replace all non-UTF8 characters with the unicode replacement character `\uFFFD` in error messages before propagating them. #10236
- [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256
- [CHANGE] Build: removed Mimir Alpine Docker image and related CI tests. #10469
- [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
- [CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via `-ruler.query-frontend.max-retries-rate`. #10375 #10403
- [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_reader_last_produced_offset_requests_total`, `cortex_ingest_storage_reader_last_produced_offset_failures_total`, `cortex_ingest_storage_reader_last_produced_offset_request_duration_seconds`, `cortex_ingest_storage_reader_partition_start_offset_requests_total`, `cortex_ingest_storage_reader_partition_start_offset_failures_total`, `cortex_ingest_storage_reader_partition_start_offset_request_duration_seconds` metrics. #10462
- [CHANGE] Ingester: Set `-ingester.ooo-native-histograms-ingestion-enabled` to true by default. #10483
- [CHANGE] Ruler: Add `user` and `reason` labels to `cortex_ruler_write_requests_failed_total` and `cortex_ruler_queries_failed_total`; add `user` to `cortex_ruler_write_requests_total` and `cortex_ruler_queries_total` metrics. #10536
- [CHANGE] Querier / Query-frontend: Remove experimental `-querier.promql-experimental-functions-enabled` and `-query-frontend.block-promql-experimental-functions` CLI flags and respective YAML configuration options to enable experimental PromQL functions. Instead, access to experimental PromQL functions is blocked by default. You can enable them using the per-tenant setting `enabled_promql_experimental_functions`. #10660 #10712
- [CHANGE] Store-gateway: Include posting sampling rate in sparse index headers. When the sampling rate isn't set in a sparse index header, store gateway rebuilds the sparse header with the configured `blocks-storage.bucket-store.posting-offsets-in-mem-sampling` value. If the sparse header's sampling rate is set but doesn't match the configured rate, store gateway either rebuilds the sparse header or downsamples to the configured sampling rate. #10684 #10878
- [CHANGE] Distributor: Return specific error message when burst size limit is exceeded. #10835
- [CHANGE] Ingester: enable native histograms ingestion by default, meaning `ingester.native-histograms-ingestion-enabled` defaults to true. #10867
- [FEATURE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_distributor_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269 #10702
- [FEATURE] Ruler: Added `/ruler/tenants` endpoints to list the discovered tenants with rule groups. #10738
- [FEATURE] Distributor: Add experimental Influx handler. #10153
- [ENHANCEMENT] Compactor: Expose `cortex_bucket_index_last_successful_update_timestamp_seconds` for all tenants assigned to the compactor before starting the block cleanup job. #10569
- [ENHANCEMENT] Query Frontend: Return server-side `samples_processed` statistics. #10103
- [ENHANCEMENT] Distributor: OTLP receiver now also converts metric metadata. See also prometheus/prometheus#15416. #10168
- [ENHANCEMENT] Distributor: discard float and histogram samples with duplicated timestamps from each timeseries in a request before the request is forwarded to ingesters. Discarded samples are tracked by `cortex_discarded_samples_total` metrics with the reason `sample_duplicate_timestamp`. #10145 #10430
- [ENHANCEMENT] Ruler: Add `cortex_prometheus_rule_group_last_rule_duration_sum_seconds` metric to track the total evaluation duration of a rule group regardless of concurrency #10189
- [ENHANCEMENT] Distributor: Add native histogram support for `electedReplicaPropagationTime` metric in ha_tracker. #10264
- [ENHANCEMENT] Ingester: More efficient CPU/memory utilization-based read request limiting. #10325
- [ENHANCEMENT] OTLP: In addition to the flag `-distributor.otel-created-timestamp-zero-ingestion-enabled` there is now `-distributor.otel-start-time-quiet-zero` to convert OTel start timestamps to Prometheus QuietZeroNaNs. This flag is to make the change rollout safe between Ingesters and Distributors. #10238
- [ENHANCEMENT] Ruler: When rule concurrency is enabled for a rule group, its rules will now be reordered and run in batches based on their dependencies. This increases the number of rules that can potentially run concurrently. Note that the global and tenant-specific limits still apply #10400
- [ENHANCEMENT] Query-frontend: include more information about read consistency in trace spans produced when using experimental ingest storage. #10412
- [ENHANCEMENT] Ingester: Hide tokens in ingester ring status page when ingest storage is enabled #10399
- [ENHANCEMENT] Ingester: add `active_series_additional_custom_trackers` configuration, in addition to the already existing `active_series_custom_trackers`. The `active_series_additional_custom_trackers` configuration allows you to configure additional custom trackers that get merged with `active_series_custom_trackers` at runtime. #10428
- [ENHANCEMENT] Query-frontend: Allow blocking raw http requests with the `blocked_requests` configuration. Requests can be blocked based on their path, method or query parameters #10484
- [ENHANCEMENT] Ingester: Added the following metrics exported by `PostingsForMatchers` cache: #10500 #10525
  - `cortex_ingester_tsdb_head_postings_for_matchers_cache_hits_total`
  - `cortex_ingester_tsdb_head_postings_for_matchers_cache_misses_total`
  - `cortex_ingester_tsdb_head_postings_for_matchers_cache_requests_total`
  - `cortex_ingester_tsdb_head_postings_for_matchers_cache_skips_total`
  - `cortex_ingester_tsdb_head_postings_for_matchers_cache_evictions_total`
  - `cortex_ingester_tsdb_block_postings_for_matchers_cache_hits_total`
  - `cortex_ingester_tsdb_block_postings_for_matchers_cache_misses_total`
  - `cortex_ingester_tsdb_block_postings_for_matchers_cache_requests_total`
  - `cortex_ingester_tsdb_block_postings_for_matchers_cache_skips_total`
  - `cortex_ingester_tsdb_block_postings_for_matchers_cache_evictions_total`
- [ENHANCEMENT] Add support for the HTTP header `X-Filter-Queryables` which allows callers to decide which queryables should be used by the querier, useful for debugging and testing queryables in isolation. #10552 #10594
- [ENHANCEMENT] Compactor: Shuffle users' order in `BlocksCleaner`. Prevents bucket indexes from going an extended period without cleanup during compactor restarts. #10513
- [ENHANCEMENT] Distributor, querier, ingester and store-gateway: Add support for `limit` parameter for label names and values requests. #10410
- [ENHANCEMENT] Ruler: Adds support for filtering results from rule status endpoint by `file[]`, `rule_group[]` and `rule_name[]`. #10589
- [ENHANCEMENT] Query-frontend: Add option to "spin off" subqueries as actual range queries, so that they benefit from query acceleration techniques such as sharding, splitting, and caching. To enable this feature, set the `-query-frontend.instant-queries-with-subquery-spin-off=<comma separated list>` option on the frontend or the `instant_queries_with_subquery_spin_off` per-tenant override with regular expressions matching the queries to enable. #10460 #10603 #10621 #10742 #10796
- [ENHANCEMENT] Querier, ingester: The series API respects passed `limit` parameter. #10620 #10652
- [ENHANCEMENT] Store-gateway: Add experimental settings under `-store-gateway.dynamic-replication` to allow more than the default of 3 store-gateways to own recent blocks. #10382 #10637
- [ENHANCEMENT] Ingester: Add reactive concurrency limiters to protect push and read operations from overload. #10574
- [ENHANCEMENT] Compactor: Add experimental `-compactor.max-lookback` option to limit blocks considered in each compaction cycle. Blocks uploaded prior to the lookback period aren't processed. This option helps reduce CPU utilization in tenants with large block metadata files that are processed before each compaction. #10585 #10794
- [ENHANCEMENT] Distributor: Optionally expose the current HA replica for each tenant in the `cortex_ha_tracker_elected_replica_status` metric. This is enabled with the `-distributor.ha-tracker.enable-elected-replica-metric=true` flag. #10644
- [ENHANCEMENT] Enable three Go runtime metrics: #10641
  - `go_cpu_classes_gc_total_cpu_seconds_total`
  - `go_cpu_classes_total_cpu_seconds_total`
  - `go_cpu_classes_idle_cpu_seconds_total`
- [ENHANCEMENT] All: Add experimental support for cluster validation in gRPC calls. When it is enabled, gRPC server verifies if a request coming from a gRPC client comes from an expected cluster. This validation can be configured by the following experimental configuration options: #10767
  - `-server.cluster-validation.label`
  - `-server.cluster-validation.grpc.enabled`
  - `-server.cluster-validation.grpc.soft-validation`
- [ENHANCEMENT] gRPC clients: Add experimental support to include the cluster validation label in gRPC metadata. When cluster validation is enabled on gRPC server side, the cluster validation label from gRPC metadata is compared with the gRPC server's cluster validation label. #10869 #10883
  - By setting `-<grpc-client-config-path>.cluster-validation.label`, you configure the cluster validation label of a single gRPC client, whose `grpcclient.Config` object is configurable through `-<grpc-client-config-path>`.
  - By setting `-common.client-cluster-validation.label`, you configure the cluster validation label of all gRPC clients.
- [ENHANCEMENT] gRPC clients: Add `cortex_client_request_invalid_cluster_validation_labels_total` metrics, that are used by Mimir's gRPC clients to track invalid cluster validations. #10767
- [ENHANCEMENT] Add experimental metric `cortex_distributor_dropped_native_histograms_total` to measure native histograms silently dropped when native histograms are disabled for a tenant. #10760
- [ENHANCEMENT] Compactor: Add experimental `-compactor.upload-sparse-index-headers` option. When enabled, the compactor will attempt to upload sparse index headers to object storage. This prevents latency spikes after adding store-gateway replicas. #10684
- [ENHANCEMENT] Memcached: Add experimental `-<prefix>.memcached.addresses-provider` flag to use alternate DNS service discovery backends when discovering Memcached hosts. #10895
- [BUGFIX] Distributor: Use a boolean to track changes while merging the ReplicaDesc components, rather than comparing the objects directly. #10185
- [BUGFIX] Querier: fix timeout responding to query-frontend when response size is very close to `-querier.frontend-client.grpc-max-send-msg-size`. #10154
- [BUGFIX] Query-frontend and querier: show warning/info annotations in some cases where they were missing (if a lazy querier was used). #10277
- [BUGFIX] Query-frontend: Fix an issue where transient errors are inadvertently cached. #10537 #10631
- [BUGFIX] Ruler: fix indeterminate rules being always run concurrently (instead of never) when `-ruler.max-independent-rule-evaluation-concurrency` is set. prometheus/prometheus#15560 #10258
- [BUGFIX] PromQL: Fix various UTF-8 bugs related to quoting. prometheus/prometheus#15531 #10258
- [BUGFIX] Ruler: Fixed an issue when using the experimental `-ruler.max-independent-rule-evaluation-concurrency` feature, where if a rule group was eligible for concurrency, it would flap between running concurrently or not based on the time it took after running concurrently. #9726 #10189
- [BUGFIX] Mimirtool: `remote-read` commands will now return data. #10286
- [BUGFIX] PromQL: Fix deriv, predict_linear and double_exponential_smoothing with histograms prometheus/prometheus#15686 #10383
- [BUGFIX] MQE: Fix deriv with histograms #10383
- [BUGFIX] PromQL: Fix <aggr_over_time> functions with histograms prometheus/prometheus#15711 #10400
- [BUGFIX] MQE: Fix <aggr_over_time> functions with histograms #10400
- [BUGFIX] Distributor: return HTTP status 415 Unsupported Media Type instead of 200 Success for Remote Write 2.0 until we support it. #10423
- [BUGFIX] Query-frontend: Add flag `-query-frontend.prom2-range-compat` and corresponding YAML to rewrite queries with ranges that worked in Prometheus 2 but are invalid in Prometheus 3. #10445 #10461 #10502
- [BUGFIX] Distributor: Fix edge case at the HA-tracker with memberlist as KVStore, where when a replica in the KVStore is marked as deleted but not yet removed, it fails to update the KVStore. #10443
- [BUGFIX] Distributor: Fix panics in `DurationWithJitter` util functions when computed variance is zero. #10507
- [BUGFIX] Ingester: Fixed a race condition in the `PostingsForMatchers` cache that may have infrequently returned expired cached postings. #10500
- [BUGFIX] Distributor: Report partially converted OTLP requests with status 400 Bad Request. #10588
- [BUGFIX] Ruler: fix issue where rule evaluations could be missed while shutting down a ruler instance if that instance owns many rule groups. prometheus/prometheus#15804 #10762
- [BUGFIX] Ingester: Add additional check on reactive limiter queue sizes. #10722
- [BUGFIX] TSDB: fix unknown series errors and possible lost data during WAL replay when series are removed from the head due to inactivity and reappear before the next WAL checkpoint. prometheus/prometheus#16060 #10824
- [BUGFIX] Querier: fix issue where `label_join` could incorrectly return multiple series with the same labels rather than failing with `vector cannot contain metrics with the same labelset`. prometheus/prometheus#15975 #10826
- [BUGFIX] Querier: fix issue where counter resets on native histograms could be incorrectly under- or over-counted when using subqueries. prometheus/prometheus#15987 #10871
- [BUGFIX] Ingester: fix goroutines and memory leak when experimental ingest storage enabled and a server-side error occurs during metrics ingestion. #10915
- [BUGFIX] Alertmanager: Avoid fetching Grafana state if Grafana AM compatibility is not enabled. #10857
Mixin
- [CHANGE] Alerts: Only alert on errors performing cache operations if there are over 10 request/sec to avoid flapping. #10832
- [FEATURE] Add compiled mixin for GEM installations in `operations/mimir-mixin-compiled-gem`. #10690 #10877
- [ENHANCEMENT] Dashboards: clarify that the ingester and store-gateway panels on the 'Reads' dashboard show data from all query requests to that component, not just requests from the main query path (i.e. requests from the ruler query path are included as well). #10598
- [ENHANCEMENT] Dashboards: add ingester and store-gateway panels from the 'Reads' dashboard to the 'Remote ruler reads' dashboard as well. #10598
- [ENHANCEMENT] Dashboards: add ingester and store-gateway panels showing only requests from the respective dashboard's query path to the 'Reads' and 'Remote ruler reads' dashboards. For example, the 'Remote ruler reads' dashboard now has panels showing the ingester query request rate from ruler-queriers. #10598
- [ENHANCEMENT] Dashboards: 'Writes' dashboard: show write requests broken down by request type. #10599
- [ENHANCEMENT] Dashboards: clarify when query-frontend and query-scheduler dashboard panels are expected to show no data. #10624
- [ENHANCEMENT] Alerts: Add warning alert `DistributorGcUsesTooMuchCpu`. #10641
- [ENHANCEMENT] Dashboards: Add "Federation-frontend" dashboard for GEM. #10697 #10736
- [ENHANCEMENT] Dashboards: Add Query-Scheduler <-> Querier Inflight Requests row to Query Reads and Remote Ruler reads dashboards. #10290
- [ENHANCEMENT] Alerts: Add "Federation-frontend" alert for remote clusters returning errors. #10698
- [BUGFIX] Dashboards: fix how we switch between classic and native histograms. #10018
- [BUGFIX] Alerts: Ignore cache errors performing `delete` operations since these are expected to fail when keys don't exist. #10287
- [BUGFIX] Dashboards: fix "Mimir / Rollout Progress" latency comparison when gateway is enabled. #10495
- [BUGFIX] Dashboards: fix autoscaling panels when Mimir is deployed using Helm. #10473
- [BUGFIX] Alerts: fix `MimirAutoscalerNotActive` alert. #10564
Jsonnet
- [CHANGE] Update rollout-operator version to 0.23.0. #10229 #10750
- [CHANGE] Memcached: Update to Memcached 1.6.34. #10318
- [CHANGE] Change multi-AZ deployments default toleration value from 'multi-az' to 'secondary-az', and make it configurable via the following settings: #10596
  - `_config.multi_zone_schedule_toleration` (default)
  - `_config.multi_zone_distributor_schedule_toleration` (distributor's override)
  - `_config.multi_zone_etcd_schedule_toleration` (etcd's override)
- [CHANGE] Ring: relaxed the hash ring heartbeat timeout for store-gateways: #10634
  - `-store-gateway.sharding-ring.heartbeat-timeout` set to `10m`
- [CHANGE] Memcached: Use 3 replicas for all cache types by default. #10739
- [ENHANCEMENT] Enforce `persistentVolumeClaimRetentionPolicy` `Retain` policy on partition ingesters during migration to experimental ingest storage. #10395
- [ENHANCEMENT] Allow to not configure `topologySpreadConstraints` by setting the following configuration options to a negative value: #10540
  - `distributor_topology_spread_max_skew`
  - `query_frontend_topology_spread_max_skew`
  - `querier_topology_spread_max_skew`
  - `ruler_topology_spread_max_skew`
  - `ruler_querier_topology_spread_max_skew`
- [ENHANCEMENT] Validate the `$._config.shuffle_sharding.ingester_partitions_shard_size` value when partition shuffle sharding is enabled in the ingest-storage mode. #10746
- [BUGFIX] Ports in container rollout-operator. #10273
- [BUGFIX] When downscaling is enabled, the components must annotate `prepare-downscale-http-port` with the value set in `$._config.server_http_port`. #10367
Mimirtool
- [BUGFIX] Fix issue where `MIMIR_HTTP_PREFIX` environment variable was ignored and the value from `MIMIR_MIMIR_HTTP_PREFIX` was used instead. #10207
- [ENHANCEMENT] Unify mimirtool authentication options and add extra-headers support for commands that depend on MimirClient. #10178
- [ENHANCEMENT] `mimirtool grafana analyze` now supports custom panels. #10669
- [ENHANCEMENT] `mimirtool grafana analyze` now supports bar chart, pie chart, state timeline, status history, histogram, candlestick, canvas, flame graph, geomap, node graph, trend, and XY chart panels. #10669
Mimir Continuous Test
Query-tee
- [ENHANCEMENT] Allow skipping comparisons when preferred backend fails. Disabled by default, enable with `-proxy.compare-skip-preferred-backend-failures=true`. #10612
Documentation
- [CHANGE] Add production tips related to cache size, heavy multi-tenancy and latency spikes. #9978
- [ENHANCEMENT] Update `MimirAutoscalerNotActive` and `MimirAutoscalerKedaFailing` runbooks, with an instruction to check whether Prometheus has enough CPU allocated. #10257
Tools
- [CHANGE] `copyblocks`: Remove /pprof endpoint. #10329
- [CHANGE] `mark-blocks`: Replace `markblocks` with added features including removing markers and reading block identifiers from a file. #10597
All changes in this release: mimir-2.15.1...mimir-2.16.0-rc.0