This release contains 166 PRs from 29 authors. Thank you!

Grafana Mimir version 2.4.0-rc.0 release notes

Grafana Labs is excited to announce version 2.4 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Note: If you are upgrading from Grafana Mimir 2.3, review the list of important changes that follow.

Features and enhancements

Query-scheduler ring-based service discovery: The query-scheduler is an optional, stateless component that retains a queue of queries to execute, and distributes the workload to available queriers. To use the query-scheduler, query-frontends and queriers must discover the addresses of the query-scheduler instances.
In addition to DNS-based service discovery, Mimir 2.4 introduces ring-based service discovery for the query-scheduler. When enabled, the query-schedulers join their own hash ring (similar to other Mimir components), and the query-frontends and queriers discover query-scheduler instances via the ring.
Ring-based service discovery makes it easier to set up the query-scheduler in environments where you can’t easily define a DNS entry that resolves to the running query-scheduler instances. For more information, refer to query-scheduler configuration.
New API endpoint exposes per-tenant limits: Mimir 2.4 introduces a new API endpoint, which is available on all Mimir components that load the runtime configuration. The endpoint exposes the limits of the authenticated tenant. You can use this new API endpoint when developing custom integrations with Mimir that require looking up the actual limits that are applied to a given tenant. For more information, refer to Get tenant limits.
New TLS configuration options: Mimir 2.4 introduces new options to configure the accepted TLS cipher suites and the minimum TLS version for the HTTP and gRPC clients that are used between Mimir components, or by Mimir to communicate with external services such as Consul or etcd.
You can use these new configuration options to override the default TLS settings and meet your security policy requirements. For more information, refer to Securing Grafana Mimir communications with TLS.
Maximum range query length limit: Mimir 2.4 introduces the new configuration option -query-frontend.max-total-query-length to limit the maximum range query length, which is computed as the query’s end minus start timestamp. This limit is enforced in the query-frontend and defaults to -store.max-query-length if unset.
The new configuration option allows you to set different limits between the received query maximum length (-query-frontend.max-total-query-length) and the maximum length of partial queries after splitting and sharding (-store.max-query-length).
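As a rough illustration of how the two limits relate, the following Python sketch checks the total range of a query against one limit and each partial query against the other. The function names, the unaligned 24-hour split interval, and the example values are assumptions made for this illustration. It is not Mimir's implementation, it leaves out sharding, and it assumes the -store.max-query-length value is positive.

```python
from datetime import timedelta
from typing import List, Optional, Tuple


def effective_total_limit(max_total: Optional[timedelta], store_max: timedelta) -> timedelta:
    # Per the release notes, the total limit falls back to
    # -store.max-query-length when it is unset or set to 0.
    if max_total is None or max_total == timedelta(0):
        return store_max
    return max_total


def split_by_interval(start: int, end: int, interval: int) -> List[Tuple[int, int]]:
    # Hypothetical splitter: cut [start, end] (Unix seconds) into
    # consecutive pieces of at most `interval` seconds each.
    parts = []
    cur = start
    while cur < end:
        nxt = min(cur + interval, end)
        parts.append((cur, nxt))
        cur = nxt
    return parts


def check_query(start: int, end: int, max_total: Optional[timedelta],
                store_max: timedelta, split_interval: int = 86400) -> str:
    # First compare the received query length (end minus start) with the total limit.
    if timedelta(seconds=end - start) > effective_total_limit(max_total, store_max):
        return "rejected: total length"
    # Then compare every partial query with the per-partial limit.
    for s, e in split_by_interval(start, end, split_interval):
        if timedelta(seconds=e - s) > store_max:
            return "rejected: partial length"
    return "accepted"
```

With a 30-day total limit and a 2-day partial limit, a 7-day query passes both checks in this sketch, whereas leaving the total limit unset makes the same query fail against the 2-day fallback.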

Helm chart improvements

The mimir-distributed Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.4 release, we’re also releasing version 3.2 of the mimir-distributed Helm chart.

Notable enhancements follow. For the full list of changes, see the Helm chart changelog.

Added support for topologySpreadConstraints.
Replaced the default anti-affinity rules with topologySpreadConstraints for all components, which puts fewer restrictions on where Kubernetes can run pods.

Important: If you are not using the sizing plans (small.yaml, large.yaml, capped-small.yaml, capped-large.yaml) in production, you must reintroduce pod anti-affinity rules for the ingester and store-gateway. This also fixes a missing label selector for the ingester. Merge the following with your custom values file:

ingester:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: target
                operator: In
                values:
                  - ingester
          topologyKey: "kubernetes.io/hostname"
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - ingester
          topologyKey: "kubernetes.io/hostname"
store_gateway:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: target
                operator: In
                values:
                  - store-gateway
          topologyKey: "kubernetes.io/hostname"
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - store-gateway
          topologyKey: "kubernetes.io/hostname"

Updated the anti-affinity rules in the sizing plans (small.yaml, large.yaml, capped-small.yaml, capped-large.yaml). The sizing plans now enforce that no two pods of the ingester, store-gateway, or alertmanager StatefulSets are scheduled on the same Node. Pods from different StatefulSets can share a Node.
Added support for the OpenShift Route resource for nginx.

Important changes

In Grafana Mimir 2.4, the default values of the following configuration options have changed:

-distributor.remote-timeout has changed from 20s to 2s.
-distributor.forwarding.request-timeout has changed from 10s to 2s.
-blocks-storage.tsdb.head-compaction-concurrency has changed from 5 to 1.
The hash-ring heartbeat period for distributors, ingesters, rulers, and compactors has increased from 5s to 15s.

With Grafana Mimir 2.4, anonymous usage statistics tracking is enabled by default. Mimir maintainers use this anonymous information to learn more about how the open source community runs Mimir and what the Mimir team should focus on when working on the next features and documentation improvements. If possible, we ask you to keep the usage reporting feature enabled. If you want to opt out of anonymous usage statistics reporting, refer to Disable the anonymous usage statistics reporting.

Bug fixes

PR 2979: Fix remote write HTTP response status code returned by Mimir when failing to write only to one ingester (the quorum is still honored when running Mimir with the default replication factor of 3) and some series are not ingested because of validation errors or some limits being reached.
PR 3005: Fix the querier to re-balance its workers connections when a query-frontend or query-scheduler instance is terminated.
PR 2963: Fix the remote read endpoint to correctly support the Accept-Encoding: snappy HTTP request header.
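The changelog entry for PR 2979 gives a concrete case: with a replication factor of 3, two HTTP 400 errors and one HTTP 500 error now always result in HTTP 400, regardless of which response arrives first. A minimal Python sketch of that idea follows. It assumes the rule is "return the status code shared by a quorum of ingester responses" and uses a hypothetical helper name; the actual distributor logic may differ in detail.

```python
from collections import Counter
from typing import List, Optional


def quorum_status(statuses: List[int], replication_factor: int = 3) -> Optional[int]:
    """Return the status code reported by a quorum of replicas, or None if no code reaches quorum."""
    if not statuses:
        return None
    # A quorum is a strict majority of the replicas (2 out of 3 by default).
    quorum = replication_factor // 2 + 1
    code, count = Counter(statuses).most_common(1)[0]
    return code if count >= quorum else None
```

Because the result depends only on how often each code occurs, the order in which ingester responses arrive does not change the outcome in this sketch.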

Changelog

2.4.0-rc.0

Grafana Mimir

[CHANGE] Distributor: change the default value of -distributor.remote-timeout to 2s from 20s and -distributor.forwarding.request-timeout to 2s from 10s to improve distributor resource usage when ingesters crash. #2728 #2912
[CHANGE] Anonymous usage statistics tracking: added the -ingester.ring.store value. #2981
[CHANGE] Series metadata HELP that is longer than -validation.max-metadata-length is now truncated silently, instead of being dropped with a 400 status code. #2993
[CHANGE] Ingester: changed default setting for -ingester.ring.readiness-check-ring-health from true to false. #2953
[CHANGE] Anonymous usage statistics tracking has been enabled by default, to help Mimir maintainers make better decisions to support the open source community. #2939 #3034
[CHANGE] Anonymous usage statistics tracking: added the minimum and maximum value of -ingester.out-of-order-time-window. #2940
[CHANGE] The default hash ring heartbeat period for distributors, ingesters, rulers and compactors has been increased from 5s to 15s. Now the default heartbeat period for all Mimir hash rings is 15s. #3033
[CHANGE] Reduce the default TSDB head compaction concurrency (-blocks-storage.tsdb.head-compaction-concurrency) from 5 to 1, in order to reduce CPU spikes. #3093
[CHANGE] Ruler: the ruler's remote evaluation mode (-ruler.query-frontend.address) is now stable. #3109
[CHANGE] Limits: removed the deprecated YAML configuration option active_series_custom_trackers_config. Please use active_series_custom_trackers instead. #3110
[CHANGE] Ingester: removed the deprecated configuration option -ingester.ring.join-after. #3111
[CHANGE] Querier: removed the deprecated configuration option -querier.shuffle-sharding-ingesters-lookback-period. The value of -querier.query-ingesters-within is now used internally for shuffle sharding lookback, while you can use -querier.shuffle-sharding-ingesters-enabled to enable or disable shuffle sharding on the read path. #3111
[CHANGE] Memberlist: cluster label verification feature (-memberlist.cluster-label and -memberlist.cluster-label-verification-disabled) is now marked as stable. #3108
[CHANGE] Distributor: only a single per-tenant forwarding endpoint can be configured now. Support for per-rule endpoints has been removed. #3095
[CHANGE] Query-frontend: truncate queries based on the configured blocks retention period (-compactor.blocks-retention-period) to avoid querying past this period. #3134
[FEATURE] Query-scheduler: added an experimental ring-based service discovery support for the query-scheduler. Refer to query-scheduler configuration for more information. #2957
[FEATURE] Introduced the experimental endpoint /api/v1/user_limits exposed by all components that load runtime configuration. This endpoint exposes realtime limits for the authenticated tenant, in JSON format. #2864 #3017
[FEATURE] Query-scheduler: added the experimental configuration option -query-scheduler.max-used-instances to restrict the number of query-schedulers effectively used, regardless of how many replicas are running. This feature can be useful when using the experimental read-write deployment mode. #3005
[ENHANCEMENT] Go: updated to Go 1.19.2. #2637 #3127 #3129
[ENHANCEMENT] Runtime config: don't unmarshal runtime configuration files if they haven't changed. This can save a bit of CPU and memory on every component using runtime config. #2954
[ENHANCEMENT] Query-frontend: Add cortex_frontend_query_result_cache_skipped_total and cortex_frontend_query_result_cache_attempted_total metrics to track the reason why query results are not cached. #2855
[ENHANCEMENT] Distributor: pool more connections per host when forwarding request. Mark requests as idempotent so they can be retried under some conditions. #2968
[ENHANCEMENT] Distributor: failure to send request to forwarding target now also increments cortex_distributor_forward_errors_total, with status_code="failed". #2968
[ENHANCEMENT] Distributor: added support for forwarding push requests via gRPC, using httpgrpc messages from the weaveworks/common library. #2996
[ENHANCEMENT] Query-frontend / Querier: increase internal backoff period used to retry connections to query-frontend / query-scheduler. #3011
[ENHANCEMENT] Querier: do not log "error processing requests from scheduler" when the query-scheduler is shutting down. #3012
[ENHANCEMENT] Query-frontend: query sharding process is now time-bounded and it is cancelled if the request is aborted. #3028
[ENHANCEMENT] Query-frontend: improved Prometheus response JSON encoding performance. #2450
[ENHANCEMENT] TLS: added configuration parameters to configure the client's TLS cipher suites and minimum version. The following new CLI flags have been added: #3070
- -alertmanager.alertmanager-client.tls-cipher-suites
- -alertmanager.alertmanager-client.tls-min-version
- -alertmanager.sharding-ring.etcd.tls-cipher-suites
- -alertmanager.sharding-ring.etcd.tls-min-version
- -compactor.ring.etcd.tls-cipher-suites
- -compactor.ring.etcd.tls-min-version
- -distributor.forwarding.grpc-client.tls-cipher-suites
- -distributor.forwarding.grpc-client.tls-min-version
- -distributor.ha-tracker.etcd.tls-cipher-suites
- -distributor.ha-tracker.etcd.tls-min-version
- -distributor.ring.etcd.tls-cipher-suites
- -distributor.ring.etcd.tls-min-version
- -ingester.client.tls-cipher-suites
- -ingester.client.tls-min-version
- -ingester.ring.etcd.tls-cipher-suites
- -ingester.ring.etcd.tls-min-version
- -memberlist.tls-cipher-suites
- -memberlist.tls-min-version
- -querier.frontend-client.tls-cipher-suites
- -querier.frontend-client.tls-min-version
- -querier.store-gateway-client.tls-cipher-suites
- -querier.store-gateway-client.tls-min-version
- -query-frontend.grpc-client-config.tls-cipher-suites
- -query-frontend.grpc-client-config.tls-min-version
- -query-scheduler.grpc-client-config.tls-cipher-suites
- -query-scheduler.grpc-client-config.tls-min-version
- -query-scheduler.ring.etcd.tls-cipher-suites
- -query-scheduler.ring.etcd.tls-min-version
- -ruler.alertmanager-client.tls-cipher-suites
- -ruler.alertmanager-client.tls-min-version
- -ruler.client.tls-cipher-suites
- -ruler.client.tls-min-version
- -ruler.query-frontend.grpc-client-config.tls-cipher-suites
- -ruler.query-frontend.grpc-client-config.tls-min-version
- -ruler.ring.etcd.tls-cipher-suites
- -ruler.ring.etcd.tls-min-version
- -store-gateway.sharding-ring.etcd.tls-cipher-suites
- -store-gateway.sharding-ring.etcd.tls-min-version
[ENHANCEMENT] Store-gateway: Add -blocks-storage.bucket-store.max-concurrent-reject-over-limit option to allow requests that exceed the max number of inflight object storage requests to be rejected. #2999
[ENHANCEMENT] Query-frontend: allow setting a separate limit on the total (before splitting/sharding) query length of range queries with the new experimental -query-frontend.max-total-query-length flag, which defaults to -store.max-query-length if unset or set to 0. #3058
[ENHANCEMENT] Query-frontend: Lower TTL for cache entries overlapping the out-of-order samples ingestion window (re-using -ingester.out-of-order-allowance from ingesters). #2935
[ENHANCEMENT] Ruler: added support to forcefully disable recording and/or alerting rules evaluation. The following new configuration options have been introduced, which can be overridden on a per-tenant basis in the runtime configuration: #3088
- -ruler.recording-rules-evaluation-enabled
- -ruler.alerting-rules-evaluation-enabled
[ENHANCEMENT] Distributor: Add an age filter to the forwarding functionality, so that samples older than a defined duration are not forwarded. #3049
[ENHANCEMENT] Distributor: Improved error messages reported when the distributor fails to remote write to ingesters. #3055
[ENHANCEMENT] Improved tracing spans tracked by distributors, ingesters and store-gateways. #2879 #3099 #3089
[ENHANCEMENT] Ingester: improved the performance of label value cardinality endpoint. #3044
[ENHANCEMENT] Ruler: use backoff retry on remote evaluation. #3098
[ENHANCEMENT] Query-frontend: Include multiple tenant IDs in query logs when present instead of dropping them. #3125
[ENHANCEMENT] Alertmanager: reduced memory utilization in Mimir clusters with a large number of tenants. #3143
[ENHANCEMENT] Store-gateway: added extra span logging to improve observability. #3131
[BUGFIX] Querier: Fix 400 response while handling streaming remote read. #2963
[BUGFIX] Fix a bug that prevented the query-frontend, query-scheduler, and querier from failing when one of their internal components fails. #2978
[BUGFIX] Querier: re-balance the querier worker connections when a query-frontend or query-scheduler is terminated. #3005
[BUGFIX] Distributor: Now returns the quorum error from ingesters. For example, with replication_factor=3, two HTTP 400 errors and one HTTP 500 error, now the distributor will always return HTTP 400. Previously the behaviour was to return the error which the distributor first received. #2979
[BUGFIX] Ruler: fix panic when ruler.external_url is explicitly set to an empty string ("") in YAML. #2915
[BUGFIX] Alertmanager: Fix support for the Telegram API URL in the global settings. #3097
[BUGFIX] Alertmanager: Fix parsing of label matchers without label value in the API used to retrieve alerts. #3097
[BUGFIX] Ruler: Fix not restoring alert state for rule groups when other ruler replicas shut down. #3156
[BUGFIX] Updated golang.org/x/net dependency to fix CVE-2022-27664. #3124
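For custom integrations, the experimental /api/v1/user_limits endpoint listed among the features above can be queried with any HTTP client. The sketch below uses only Python's standard library. The base URL, the tenant ID, and the use of the X-Scope-OrgID header to identify the tenant are assumptions about the deployment (authentication may be handled differently, for example by a gateway in front of Mimir), and the function names are hypothetical. These notes do not describe the shape of the JSON response, so the sketch returns it as-is.

```python
import json
import urllib.request


def build_user_limits_request(base_url: str, tenant_id: str) -> urllib.request.Request:
    # Build a GET request for the tenant limits endpoint.
    # Assumption: the tenant is identified by the X-Scope-OrgID header.
    url = base_url.rstrip("/") + "/api/v1/user_limits"
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant_id}, method="GET")


def fetch_user_limits(base_url: str, tenant_id: str, timeout: float = 10.0):
    # Send the request and decode the JSON body without interpreting its fields.
    req = build_user_limits_request(base_url, tenant_id)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_user_limits("http://mimir.example:8080", "tenant-1") would send the request to a hypothetical Mimir address; replace both arguments with values from your own environment.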

Mixin

[CHANGE] Alerts: MimirQuerierAutoscalerNotActive is now critical and fires after 1h instead of 15m. #2958
[FEATURE] Dashboards: Added "Mimir / Overview" dashboards, providing a high-level view of a Mimir cluster. #3122 #3147 #3155
[ENHANCEMENT] Dashboards: Updated the "Writes" and "Rollout progress" dashboards to account for samples ingested via the new OTLP ingestion endpoint. #2919 #2938
[ENHANCEMENT] Dashboards: Include per-tenant request rate in "Tenants" dashboard. #2874
[ENHANCEMENT] Dashboards: Include inflight object store requests in "Reads" dashboard. #2914
[ENHANCEMENT] Dashboards: Make queries used to find job, cluster and namespace for dropdown menus configurable. #2893
[ENHANCEMENT] Dashboards: Include rate of label and series queries in "Reads" dashboard. #3065 #3074
[ENHANCEMENT] Dashboards: Fix legend showing on per-pod panels. #2944
[ENHANCEMENT] Dashboards: Use the "req/s" unit on panels showing the requests rate. #3118
[ENHANCEMENT] Dashboards: Use a consistent color across dashboards for the error rate. #3154

Jsonnet

[FEATURE] Added support for query-scheduler ring-based service discovery. #3128
[ENHANCEMENT] Querier autoscaling is now slower on scale downs: scale down 10% every 1m instead of 100%. #2962
[BUGFIX] Memberlist: gossip_member_label is now set for ruler-queriers. #3141

Mimirtool

[ENHANCEMENT] mimirtool analyze: Store the query errors instead of exiting during the analysis. #3052
[BUGFIX] mimirtool remote-read: fix code paths that returned a nil error even though an error had occurred. #3053

Documentation

[ENHANCEMENT] Added documentation on how to configure storage retention. #2970
[ENHANCEMENT] Improved gRPC clients config documentation. #3020
[ENHANCEMENT] Added documentation on how to manage alerting and recording rules. #2983
[ENHANCEMENT] Improved MimirSchedulerQueriesStuck runbook. #3006
[ENHANCEMENT] Added "Cluster label verification" section to memberlist documentation. #3096
[ENHANCEMENT] Mention compression in multi-zone replication documentation. #3107
[BUGFIX] Fixed configuration option names in "Enabling zone-awareness via the Grafana Mimir Jsonnet". #3018
[BUGFIX] Fixed mimirtool analyze parameters documentation. #3094
[BUGFIX] Fixed YAML configuration in the "Manage the configuration of Grafana Mimir with Helm" guide. #3042
[BUGFIX] Fixed Alertmanager capacity planning documentation. #3132

Tools

[BUGFIX] trafficdump: Fixed panic occurring when -success-only=true and the captured request failed. #2863

grafana/mimir mimir-2.4.0-rc.0 2.4.0-rc.0 on GitHub