2.2.0-rc.0

This release contains 214 contributions from 32 authors. Thank you!

Grafana Labs is excited to announce version 2.2 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

Highlights include the top features, enhancements, and bugfixes in this release. If you are upgrading from Grafana Mimir 2.1, there is migration-related information as well.
For the complete list of changes, see the Changelog.

Features and enhancements

Support for ingesting out-of-order samples: Grafana Mimir includes new, experimental support for ingesting out-of-order samples.
This support is configurable, with users able to set how far out-of-order Mimir will accept samples on a per-tenant basis.
Note that this feature still needs heavy testing and is not yet production-ready.
Error messages: The error messages that Mimir reports are more human readable, and the messages include error codes that are easily searchable.
Configurable prefix for object storage: Mimir can now store block data, rules, and alerts in one bucket, each under its own user-defined prefix, rather than requiring one bucket for each.
You can configure the storage prefix by using the -<storage>.storage-prefix option for the corresponding storage: ruler-storage, alertmanager-storage, or blocks-storage.
Helm Chart update: TBD
Store-gateway can now optionally prepopulate the file system cache when memory-mapping index-header files.
This can help the store-gateway avoid appearing stuck while loading index-headers.
You can enable this feature with the new experimental flag -blocks-storage.bucket-store.index-header.map-populate-enabled.
Faster ingester startup: Ingesters now replay the write-ahead log (WAL) about 50% faster, and they also re-join the ring sooner under some conditions.
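
As a rough illustration of the configurable storage prefix described above, the following Python sketch builds one -<storage>.storage-prefix flag per storage and shows how distinct prefixes keep objects apart in a single bucket. The prefix values (blocks, rules, alerts), the helper names, and the object key are made up for this example; the real object layout inside each prefix is not described in these notes.

```python
# Storage names come from the release notes; everything else is hypothetical.
STORAGES = ("blocks-storage", "ruler-storage", "alertmanager-storage")


def storage_prefix_flags(prefixes):
    """Build one -<storage>.storage-prefix flag per configured storage."""
    unknown = set(prefixes) - set(STORAGES)
    if unknown:
        raise ValueError(f"unknown storage(s): {sorted(unknown)}")
    # Sharing a bucket only works if each storage gets its own prefix.
    if len(set(prefixes.values())) != len(prefixes):
        raise ValueError("each storage needs its own prefix")
    return [
        f"-{storage}.storage-prefix={prefix}"
        for storage, prefix in sorted(prefixes.items())
    ]


def object_key(prefix, relative_key):
    """Illustrative only: place a key under a prefix in the shared bucket."""
    return f"{prefix}/{relative_key}"


flags = storage_prefix_flags({
    "blocks-storage": "blocks",
    "ruler-storage": "rules",
    "alertmanager-storage": "alerts",
})
print("\n".join(flags))
```

The duplicate-prefix check is a design choice of this sketch, not a documented Mimir validation rule.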

Upgrade considerations

We have updated default values and some parameters in Grafana Mimir 2.2 to give you a better out-of-the-box experience:

Message size limits for gRPC messages exchanged between internal Mimir components increased to 100 MiB from the previous 4 MiB.
This helps to avoid internal server errors when pushing or querying large data.
The default value of -blocks-storage.bucket-store.ignore-blocks-within changed from 0 to 10h.
The default value of -querier.query-store-after changed from 0 to 12h.
Both changes improve query performance for most-recent data by querying only the ingesters, rather than object storage.
The option -querier.shuffle-sharding-ingesters-lookback-period has been deprecated.
If you previously changed this option from its default of 0s, set -querier.shuffle-sharding-ingesters-enabled to true and specify the lookback period by setting the -querier.query-ingesters-within option.
The -memberlist.abort-if-join-fails parameter now defaults to false.
When Mimir uses memberlist as the backend store for the hash ring and fails to join the memberlist cluster, it no longer aborts startup by default.
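
The migration away from -querier.shuffle-sharding-ingesters-lookback-period can be sketched in Python as a transformation over a dictionary of CLI flags. This is a minimal sketch, not an official migration tool: the function name and the dictionary representation are assumptions, and only the case described above (a lookback period changed from its 0s default) is handled.

```python
DEPRECATED = "-querier.shuffle-sharding-ingesters-lookback-period"
ENABLED = "-querier.shuffle-sharding-ingesters-enabled"
WITHIN = "-querier.query-ingesters-within"


def migrate_querier_flags(flags):
    """Return a copy of `flags` with the deprecated lookback flag migrated."""
    migrated = dict(flags)
    lookback = migrated.get(DEPRECATED)
    # The notes only cover a non-default lookback; leave other cases untouched.
    if lookback is None or lookback in ("0", "0s"):
        return migrated
    existing = migrated.get(WITHIN)
    if existing is not None and existing != lookback:
        # Two different lookback values: a human has to pick one.
        raise ValueError(
            f"{WITHIN}={existing} conflicts with {DEPRECATED}={lookback}"
        )
    del migrated[DEPRECATED]
    migrated[ENABLED] = "true"
    migrated[WITHIN] = lookback
    return migrated


old = {DEPRECATED: "13h", "-target": "querier"}
print(migrate_querier_flags(old))
```

Raising on a conflicting -querier.query-ingesters-within value is a deliberate choice here, because the notes do not say which value should win.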

Bug fixes

PR 1883: Fixed a bug that caused the query-frontend and querier to crash when they received a user query with a special regular expression label matcher.
PR 1933: Fixed a bug in the ingester ring page, which showed incorrect status of entries in the ring.
PR 2090: Ruler in remote rule evaluation mode now applies the timeout correctly. Previously the ruler could get stuck forever, which halted rule evaluation.
PR 2036: Fixed panic at startup when Mimir is running in monolithic mode and query sharding is enabled.

CHANGELOG

Grafana Mimir

[CHANGE] Increased default configuration for -server.grpc-max-recv-msg-size-bytes and -server.grpc-max-send-msg-size-bytes from 4MB to 100MB. #1884
[CHANGE] Default values have changed for the following settings. This improves query performance for recent data (within 12h) by only reading from ingesters: #1909 #1921
- -blocks-storage.bucket-store.ignore-blocks-within now defaults to 10h (previously 0)
- -querier.query-store-after now defaults to 12h (previously 0)
[CHANGE] Alertmanager: removed support for migrating local files from Cortex 1.8 or earlier. Related to original Cortex PR cortexproject/cortex#3910. #2253
[CHANGE] The following settings are now classified as advanced because the defaults should work for most users and tuning them requires in-depth knowledge of how the read path works: #1929
- -querier.query-ingesters-within
- -querier.query-store-after
[CHANGE] Config flag category overrides can be set dynamically at runtime. #1934
[CHANGE] Ingester: deprecated -ingester.ring.join-after. Mimir now behaves as if this setting is always set to 0s. This configuration option will be removed in Mimir 2.4.0. #1965
[CHANGE] Blocks uploaded by the ingester no longer contain the __org_id__ label. The compactor now ignores this label and compacts blocks with and without it together. The mimirconvert tool removes the label from blocks, treating it as an "unknown" label. #1972
[CHANGE] Querier: deprecated -querier.shuffle-sharding-ingesters-lookback-period, instead adding -querier.shuffle-sharding-ingesters-enabled to enable or disable shuffle sharding on the read path. The value of -querier.query-ingesters-within is now used internally for shuffle sharding lookback. #2110
[CHANGE] Memberlist: -memberlist.abort-if-join-fails now defaults to false. Previously it defaulted to true. #2168
[CHANGE] Ruler: /api/v1/rules* and /prometheus/rules* configuration endpoints are removed. Use /prometheus/config/v1/rules*. #2182
[CHANGE] Ingester: -ingester.exemplars-update-period has been renamed to -ingester.tsdb-config-update-period. You can use it to update multiple, per-tenant TSDB configurations. #2187
[FEATURE] Ingester: (Experimental) Add the ability to ingest out-of-order samples up to an allowed limit. If you enable this feature, it requires additional memory and disk space. This feature also enables a write-behind log, which might lead to longer ingester-start replays. When this feature is disabled, there is no overhead on memory, disk space, or startup times. #2187
- -ingester.out-of-order-time-window, as a duration string, allows you to set how far back in time a sample can be. The default is 0s, where s is seconds.
- cortex_ingester_tsdb_out_of_order_samples_appended_total metric tracks the total number of out-of-order samples ingested by the ingester.
- cortex_discarded_samples_total has a new label reason="sample-too-old", when the -ingester.out-of-order-time-window flag is greater than zero. The label tracks the number of samples that were discarded for being too old; they were out of order, but beyond the time window allowed.
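
To make the time-window behaviour concrete, here is a simplified Python sketch of how a sample might be classified against an out-of-order window. It is not the ingester's implementation: the function name, the result strings other than sample-too-old, and the use of a single "newest timestamp" as the reference point are assumptions, and duplicate-timestamp handling is ignored.

```python
from datetime import timedelta


def classify_sample(sample_ts_ms, newest_ts_ms, window):
    """Classify a sample relative to the newest timestamp already ingested.

    `window` plays the role of -ingester.out-of-order-time-window.
    """
    window_ms = int(window.total_seconds() * 1000)
    if sample_ts_ms >= newest_ts_ms:
        return "in-order"
    if window_ms == 0:
        # Feature disabled (the default): older samples are not accepted.
        return "out-of-order-rejected"
    if sample_ts_ms >= newest_ts_ms - window_ms:
        return "out-of-order-accepted"
    # Out of order and beyond the allowed window.
    return "sample-too-old"


newest = 10_000_000  # milliseconds, arbitrary example value
window = timedelta(minutes=30)
for ts in (10_500_000, 9_000_000, 8_000_000):
    print(ts, classify_sample(ts, newest, window))
```

In this sketch, a sample in the out-of-order-accepted branch is the kind counted by cortex_ingester_tsdb_out_of_order_samples_appended_total, and one in the last branch is the kind counted under reason="sample-too-old".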
[ENHANCEMENT] Distributor: Added a limit to prevent tenants from sending an excessive number of requests: #1843
- The following CLI flags (and their respective YAML config options) have been added:
  - -distributor.request-rate-limit
  - -distributor.request-burst-limit
- The following metric is exposed to tell how many requests have been rejected:
  - cortex_discarded_requests_total
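
A rate limit paired with a burst limit is commonly modelled as a token bucket, so the Python sketch below uses that model to show how the two settings could interact. This is an assumption made for illustration; the notes do not say how the distributor implements the limit. The class name is hypothetical, and the discarded counter only mimics the role of cortex_discarded_requests_total.

```python
class RequestRateLimiter:
    """Token bucket: `rate` tokens are added per second, capped at `burst`."""

    def __init__(self, rate_per_second, burst):
        self.rate = float(rate_per_second)   # like -distributor.request-rate-limit
        self.burst = float(burst)            # like -distributor.request-burst-limit
        self.tokens = float(burst)           # start with a full bucket
        self.last = None                     # time of the previous call
        self.discarded = 0                   # rejected requests so far

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is allowed."""
        if self.last is not None:
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.discarded += 1
        return False


limiter = RequestRateLimiter(rate_per_second=2, burst=3)
first_second = [limiter.allow(0.0) for _ in range(5)]
next_second = [limiter.allow(1.0) for _ in range(3)]
print(first_second, next_second, limiter.discarded)
```

Passing the clock in explicitly keeps the example deterministic; a real limiter would read a monotonic clock instead.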
[ENHANCEMENT] Store-gateway: Add the experimental ability to run requests in a dedicated OS thread pool. This feature can be configured using -store-gateway.thread-pool-size and is disabled by default. Replaces the ability to run index header operations in a dedicated thread pool. #1660 #1812
[ENHANCEMENT] Improved error messages to make them easier to understand; each now has a unique, global identifier that you can use to look up more information in the runbooks. #1907 #1919 #1888 #1939 #1984 #2009 #2056 #2066 #2104 #2150 #2234
[ENHANCEMENT] Memberlist KV: incoming messages are now processed on a per-key goroutine. This may reduce the loss of "maintenance" packets in busy memberlist installations, but uses more CPU. The new memberlist_client_received_broadcasts_dropped_total counter tracks the number of dropped per-key messages. #1912
[ENHANCEMENT] Blocks Storage, Alertmanager, Ruler: add support for a prefix in the bucket store (*_storage.storage_prefix). This enables using the same bucket for the three components. #1686 #1951
[ENHANCEMENT] Upgrade Docker base images to alpine:3.16.0. #2028
[ENHANCEMENT] Store-gateway: Add experimental configuration option for the store-gateway to attempt to pre-populate the file system cache when memory-mapping index-header files. Enabled with -blocks-storage.bucket-store.index-header.map-populate-enabled=true. Note this flag only has an effect when running on Linux. #2019 #2054
[ENHANCEMENT] Chunk Mapper: reduce memory usage of async chunk mapper. #2043
[ENHANCEMENT] Ingester: reduce sleep time when reading WAL. #2098
[ENHANCEMENT] Compactor: Run sanity check on blocks storage configuration at startup. #2144
[ENHANCEMENT] Compactor: Add HTTP API for uploading TSDB blocks. Enabled with -compactor.block-upload-enabled. #1694 #2126
[ENHANCEMENT] Ingester: Enable querying overlapping blocks by default. #2187
[ENHANCEMENT] Distributor: Auto-forget unhealthy distributors after ten failed ring heartbeats. #2154
[ENHANCEMENT] Distributor: Add new metric cortex_distributor_forward_errors_total for error codes resulting from forwarding requests. #2077
[ENHANCEMENT] /ready endpoint now returns and logs detailed services information. #2055
[ENHANCEMENT] Memcached client: Reduce number of connections required to fetch cached keys from memcached. #1920
[ENHANCEMENT] Improved error message returned when -querier.query-store-after validation fails. #1914
[BUGFIX] Fix regexp parsing panic for regexp label matchers with start/end quantifiers. #1883
[BUGFIX] Ingester: fixed deceiving error log "failed to update cached shipped blocks after shipper initialisation", occurring for each new tenant in the ingester. #1893
[BUGFIX] Ring: fix bug where instances may appear unhealthy in the hash ring web UI even though they are not. #1933
[BUGFIX] API: gzip is now enforced when identity encoding is explicitly rejected. #1864
[BUGFIX] Fix panic at startup when Mimir is running in monolithic mode and query sharding is enabled. #2036
[BUGFIX] Ruler: report cortex_ruler_queries_failed_total metric for any remote query error except 4xx when remote operational mode is enabled. #2053 #2143
[BUGFIX] Ingester: fix slow rollout when using -ingester.ring.unregister-on-shutdown=false with long -ingester.ring.heartbeat-period. #2085
[BUGFIX] Ruler: add timeout for remote rule evaluation queries to prevent rule group evaluations getting stuck indefinitely. The duration is configurable with -querier.timeout (default 2m). #2090 #2222
[BUGFIX] Limits: The active series custom tracker configuration has been renamed back from active_series_custom_trackers_config to active_series_custom_trackers. For backwards compatibility, both names are supported until Mimir v2.4. When both fields are specified, active_series_custom_trackers_config takes precedence over active_series_custom_trackers. #2101
[BUGFIX] Ingester: fixed the order of labels applied when incrementing the cortex_discarded_metadata_total metric. #2096
[BUGFIX] Ingester: fixed bug where retrieving metadata for a metric with multiple metadata entries would return multiple copies of a single metadata entry rather than all available entries. #2096
[BUGFIX] Distributor: canceled requests are no longer accounted as internal errors. #2157
[BUGFIX] Memberlist: Fix typo in memberlist admin UI. #2202
[BUGFIX] Ruler: fixed typo in error message when ruler failed to decode a rule group. #2151
[BUGFIX] Active series custom tracker configuration is now displayed properly on /runtime_config page. #2065

Mixin

[CHANGE] Split mimir_queries rules group into mimir_queries and mimir_ingester_queries to keep number of rules per group within the default per-tenant limit. #1885
[CHANGE] Dashboards: Expose full image tag in "Mimir / Rollout progress" dashboard's "Pod per version panel." #1932
[CHANGE] Dashboards: Disabled gateway panels by default, because most users don't have a gateway exposing the metrics expected by Mimir dashboards. You can re-enable them by setting gateway_enabled: true in the mixin config and recompiling the mixin by running make build-mixin. #1955
[CHANGE] Alerts: adapt MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck to consider ruler query path components. #1949
[CHANGE] Alerts: Change MimirRulerTooManyFailedQueries severity to critical. #2165
[ENHANCEMENT] Dashboards: Add config option datasource_regex to customise the regular expression used to select valid datasources for Mimir dashboards. #1802
[ENHANCEMENT] Dashboards: Added "Mimir / Remote ruler reads" and "Mimir / Remote ruler reads resources" dashboards. #1911 #1937
[ENHANCEMENT] Dashboards: Make networking panels work for pods created by the mimir-distributed helm chart. #1927
[ENHANCEMENT] Alerts: Add MimirStoreGatewayNoSyncedTenants alert that fires when there is a store-gateway owning no tenants. #1882
[ENHANCEMENT] Rules: Make recording_rules_range_interval configurable for cases where Mimir metrics are scraped less often than every 30 seconds. #2118
[ENHANCEMENT] Added minimum Grafana version to mixin dashboards. #1943
[BUGFIX] Fix container_memory_usage_bytes:sum recording rule. #1865
[BUGFIX] Fix MimirGossipMembersMismatch alerts if Mimir alertmanager is activated. #1870
[BUGFIX] Fix MimirRulerMissedEvaluations to show % of missed alerts as a value between 0 and 100 instead of 0 and 1. #1895
[BUGFIX] Fix MimirCompactorHasNotUploadedBlocks alert false positive when Mimir is deployed in monolithic mode. #1902
[BUGFIX] Fix MimirGossipMembersMismatch to make it less sensitive during rollouts and fire one alert per installation, not per job. #1926
[BUGFIX] Do not trigger MimirAllocatingTooMuchMemory alerts if no container limits are supplied. #1905
[BUGFIX] Dashboards: Remove empty "Chunks per query" panel from Mimir / Queries dashboard. #1928
[BUGFIX] Dashboards: Use Grafana's $__rate_interval for rate queries in dashboards to support scrape intervals of >15s. #2011
[BUGFIX] Alerts: Make each version of MimirCompactorHasNotUploadedBlocks distinct to avoid rule evaluation failures due to duplicate series being generated. #2197
[BUGFIX] Fix MimirGossipMembersMismatch alert when using remote ruler evaluation. #2159

Jsonnet

[CHANGE] Remove use of -querier.query-store-after, -querier.shuffle-sharding-ingesters-lookback-period, -blocks-storage.bucket-store.ignore-blocks-within, and -blocks-storage.tsdb.close-idle-tsdb-timeout CLI flags since the values now match defaults. #1915 #1921
[CHANGE] Change default value for -blocks-storage.bucket-store.chunks-cache.memcached.timeout to 450ms to increase use of cached data. #2035
[CHANGE] The memberlist_ring_enabled configuration now applies to Alertmanager. #2102 #2103 #2107
[CHANGE] Default value for memberlist_ring_enabled is now true. It means that all hash rings use Memberlist as default KV store instead of Consul (previous default). #2161
[CHANGE] Configure -ingester.max-global-metadata-per-user to correspond to 20% of the configured max number of series per tenant. #2250
[CHANGE] Configure -ingester.max-global-metadata-per-metric to be 10. #2250
[CHANGE] Change _config.multi_zone_ingester_max_unavailable to 25. #2251
[FEATURE] Added querier autoscaling support. It requires KEDA installed in the Kubernetes cluster and query-scheduler enabled in the Mimir cluster. The querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2013 #2023
- autoscaling_querier_enabled: true to enable autoscaling.
- autoscaling_querier_min_replicas: minimum number of querier replicas.
- autoscaling_querier_max_replicas: maximum number of querier replicas.
- autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
[FEATURE] Jsonnet: Add support for ruler remote evaluation mode (ruler_remote_evaluation_enabled), which deploys and uses a dedicated query path for rule evaluation. This enables the benefits of the query-frontend for rule evaluation, such as query sharding. #2073
[ENHANCEMENT] Added a compactor service that can be used to route requests directly to the compactor (e.g. admin UI). #2063
[ENHANCEMENT] Added a consul_enabled configuration option to provide the ability to disable consul. It is automatically set to false when memberlist_ring_enabled is true and multikv_migration_enabled (used for migration from Consul to memberlist) is not set. #2093 #2152
[BUGFIX] Querier: Fix disabling shuffle sharding on the read path whilst keeping it enabled on write path. #2164

Mimirtool

[CHANGE] mimirtool rules: --use-legacy-routes now toggles between using /prometheus/config/v1/rules (default) and /api/v1/rules (legacy) endpoints. #2182
[FEATURE] Added bearer token support for when Mimir is behind a gateway authenticating by bearer token. #2146
[BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors (#1840). #1973

Mimir Continuous Test

[ENHANCEMENT] Added the -tests.smoke-test flag to run the mimir-continuous-test suite once and immediately exit. #2047 #2094

Documentation

[ENHANCEMENT] Published Grafana Mimir runbooks as part of documentation. #1970
[ENHANCEMENT] Improved ruler's "remote operational mode" documentation. #1906
[ENHANCEMENT] Recommend fast disks for ingesters and store-gateways in production tips. #1903
[ENHANCEMENT] Explain the runtime override of active series matchers. #1868
[ENHANCEMENT] Clarify "Set rule group" API specification. #1869
[ENHANCEMENT] Published Mimir jsonnet documentation. #2024
[ENHANCEMENT] Documented required scrape interval for using alerting and recording rules from Mimir jsonnet. #2147
[ENHANCEMENT] Runbooks: Mention memberlist as possible source of problems for various alerts. #2158
[ENHANCEMENT] Added step-by-step article about migrating from Consul to Memberlist KV store using jsonnet without downtime. #2166
[ENHANCEMENT] Documented /memberlist admin page. #2166
[ENHANCEMENT] Documented how to configure Grafana Mimir's ruler with Jsonnet. #2127
[ENHANCEMENT] Documented how to configure queriers’ autoscaling with Jsonnet. #2128
[ENHANCEMENT] Updated mixin building instructions in "Installing Grafana Mimir dashboards and alerts" article. #2015 #2163
[ENHANCEMENT] Fix location of "Monitoring Grafana Mimir" article in the documentation hierarchy. #2130
[ENHANCEMENT] Runbook for MimirRequestLatency was expanded with more practical advice. #1967
[BUGFIX] Fixed ruler configuration used in the getting started guide. #2052
[BUGFIX] Fixed Mimir Alertmanager datasource in Grafana used by "Play with Grafana Mimir" tutorial. #2115
[BUGFIX] Fixed typos in "Scaling out Grafana Mimir" article. #2170
[BUGFIX] Added missing ring endpoint exposed by Ingesters. #1918

New Contributors

@pdf made their first contribution in #1865
@secustor made their first contribution in #1870
@zenador made their first contribution in #1930
@pr00se made their first contribution in #1934
@hjet made their first contribution in #1973
@williamzelesny made their first contribution in #2028
@javad-hajiani made their first contribution in #2146
@rojas-diego made their first contribution in #2147
@jhesketh made their first contribution in #2163
@gonzalez made their first contribution in #2112
@Eve832 made their first contribution in #2170

Full Changelog: mimir-2.1.0...mimir-2.2.0-rc.0
