Grafana Labs is excited to announce version 2.2 of Grafana Mimir, the most scalable, most performant open source time series database in the world.
The highlights that follow include the top features, enhancements, and bugfixes in this release. If you are upgrading from Grafana Mimir 2.1, there is upgrade-related information as well.
For the complete list of changes, see the Changelog.
This release contains 214 contributions from 32 authors. Thank you!
Features and enhancements
- Support for ingesting out-of-order samples: Grafana Mimir includes new, experimental support for ingesting out-of-order samples.
  This support is configurable, and it allows you to set, on a per-tenant basis, how far out of order a sample can be for Mimir to accept it.
  This feature still needs additional testing; we do not recommend using it in a production environment.
  For more information, see Configuring out-of-order samples ingestion.
- Improved error messages: The error messages that Mimir reports are more human readable, and the messages include error codes that are easily searchable.
  For error descriptions, see the Grafana Mimir runbooks’ Errors catalog.
- Configurable prefix for object storage: Mimir can now store block data, rules, and alerts in one bucket, with each under its own user-defined prefix, rather than requiring one bucket for each.
  You can configure the storage prefix by using the `-<storage>.storage-prefix` option for the corresponding storage: `ruler-storage`, `alertmanager-storage`, or `blocks-storage`.
- Store-gateway performance optimization: The store-gateway can now pre-populate the file system cache when memory-mapping index-header files.
  This prevents the store-gateway from appearing to be stuck while loading index-headers.
  This feature is experimental and disabled by default; enable it using the flag `-blocks-storage.bucket-store.index-header.map-populate-enabled`.
- Faster ingester startup: Ingesters now replay their WALs (write-ahead logs) about 50% faster, and they also re-join the ring sooner under some conditions.
- Helm Chart improvements: The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.2 release, we're also releasing version 3.0 of the Helm chart. Notable enhancements follow. For the full list of changes, see the Helm chart changelog.
  - The Helm chart now supports OpenShift.
  - The Helm chart can now easily deploy Grafana Agent in order to scrape metrics and logs from all Mimir pods, and ship them to a remote store, which makes it easier to monitor the health of your Mimir installation. For more information, see Collecting metrics and logs from Grafana Mimir.
  - The Helm chart now enables multi-tenancy by default. This makes it easy for you to add tenants as you grow your cluster. You can take advantage of Mimir's per-tenant quality-of-service features, which improve stability and resilience at high scale. To learn more about how multi-tenancy in Mimir works, see Grafana Mimir authorization and authentication. This change is backwards-compatible. To read about how we implemented this, see #2117.
  - We have significantly improved the configuration experience for the Helm chart. Here are a few of the most salient changes:
    - We've added an `extraEnvFrom` capability to all Mimir services to enable you to inject secrets via environment variables.
    - We've made it possible to globally set environment variables and inject secrets across all pods in the chart using `global.extraEnv` and `global.extraEnvFrom`. Note that the memcached and minio pods are not included.
    - We've switched the default storage of the Mimir configuration from a `Secret` to a `ConfigMap`, which makes it easier to quickly see the differences in your Mimir configuration between upgrades. We especially like the Helm diff plugin for this purpose.
    - We've added a `structuredConfig` option, which allows you to overwrite specific key-value pairs in the `mimir.config` template. This saves you from having to maintain the entire `mimir.config` in your own `values.yaml` file.
    - We've added the ability to create global pod annotations. This unlocks the ability to trigger a restart of all services in response to a single event, such as the update of the secret containing Mimir's storage credentials.
  - We've set the chart to disable `-ingester.ring.unregister-on-shutdown` and `-distributor.extend-writes`, for a smoother upgrade experience. Rolling restarts of ingesters are now less likely to cause spikes in resource usage.
  - We've improved the documentation for the Helm chart by adding the article "Getting started with Mimir using the Helm chart".
  - We've added a smoke test for your Mimir cluster to help catch errors immediately after you install or upgrade Mimir via the Helm chart.
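The `structuredConfig` option described in the Helm chart notes overwrites individual keys instead of replacing the whole configuration. The following Python sketch illustrates that idea as a recursive dictionary merge. It is only a model of the concept: the function name `merge_structured_config` and the sample keys and bucket names are hypothetical, and the chart's actual merge semantics (for example, how lists are handled) may differ.

```python
from copy import deepcopy


def merge_structured_config(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` merged in recursively.

    Nested dictionaries are merged key by key; any other value in
    `overrides` replaces the value in `base`.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_structured_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical default configuration (illustrative keys and values only).
base_config = {
    "multitenancy_enabled": True,
    "blocks_storage": {"backend": "s3", "s3": {"bucket_name": "mimir-blocks"}},
}

# Only the key that differs needs to be specified.
overrides = {"blocks_storage": {"s3": {"bucket_name": "my-blocks"}}}

merged = merge_structured_config(base_config, overrides)
```

With this approach, the override stays small and everything not mentioned in it keeps its default value.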
Upgrade considerations
All deprecated API endpoints under `/api/v1/rules*` and `/prometheus/rules*` have now been removed from the ruler component in favor of identical endpoints that use the prefix `/prometheus/config/v1/rules*`.
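If you have scripts or clients that still call the removed ruler endpoints, the change amounts to a prefix swap. The following Python sketch shows one way to rewrite a path; the helper name `migrate_ruler_path` and the example namespace are hypothetical and not part of Mimir or mimirtool.

```python
# Prefixes removed in Mimir 2.2 and the prefix that replaces them.
LEGACY_PREFIXES = ("/api/v1/rules", "/prometheus/rules")
NEW_PREFIX = "/prometheus/config/v1/rules"


def migrate_ruler_path(path: str) -> str:
    """Rewrite a legacy ruler configuration path to the new prefix.

    Paths that do not start with a legacy prefix are returned unchanged.
    """
    for prefix in LEGACY_PREFIXES:
        # Match the prefix itself or the prefix followed by a path segment,
        # so that unrelated paths such as "/api/v1/rulesets" are not touched.
        if path == prefix or path.startswith(prefix + "/"):
            return NEW_PREFIX + path[len(prefix):]
    return path


new_path = migrate_ruler_path("/api/v1/rules/my-namespace")
```

Here `new_path` is `/prometheus/config/v1/rules/my-namespace`.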
In Grafana Mimir 2.2, we have updated default values and some parameters to give you a better out-of-the-box experience:
- Message size limits for gRPC messages that are exchanged between internal Mimir components have increased to 100 MiB from 4 MiB.
  This helps to avoid internal server errors when pushing or querying large data.
- The default value of `-blocks-storage.bucket-store.ignore-blocks-within` changed from `0` to `10h`.
  The default value of `-querier.query-store-after` changed from `0` to `12h`.
  For the most recent data, both changes improve query performance by querying only the ingesters, rather than object storage.
- The option `-querier.shuffle-sharding-ingesters-lookback-period` has been deprecated.
  If you previously changed this option from its default of `0s`, set `-querier.shuffle-sharding-ingesters-enabled` to `true` and specify the lookback period by setting the `-querier.query-ingesters-within` option.
- The `-memberlist.abort-if-join-fails` parameter now defaults to `false`.
  When Mimir is using memberlist as the backend store for its hash ring, and it fails to join the memberlist cluster, Mimir no longer aborts startup by default.
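To make the effect of the new `-querier.query-store-after` default more concrete, the following Python sketch models which components a query's time range would reach. It is a simplified illustration, not the querier's implementation: the function name `components_for_query` is hypothetical, and the `13h` value used for `-querier.query-ingesters-within` is an assumption that does not come from these release notes.

```python
from datetime import datetime, timedelta, timezone

QUERY_STORE_AFTER = timedelta(hours=12)       # new default in Mimir 2.2
QUERY_INGESTERS_WITHIN = timedelta(hours=13)  # assumed value, for illustration


def components_for_query(
    start: datetime,
    end: datetime,
    now: datetime,
    query_store_after: timedelta = QUERY_STORE_AFTER,
    query_ingesters_within: timedelta = QUERY_INGESTERS_WITHIN,
) -> list[str]:
    """Return the components a query over [start, end] would reach."""
    targets = []
    # Ingesters are consulted only if the range touches recent data.
    if end >= now - query_ingesters_within:
        targets.append("ingesters")
    # Long-term storage is consulted only if the range reaches far enough back.
    if start <= now - query_store_after:
        targets.append("store-gateways")
    return targets


now = datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc)
recent = components_for_query(now - timedelta(hours=1), now, now)
last_day = components_for_query(now - timedelta(hours=24), now, now)
```

In this model, a query over the last hour reaches only the ingesters, while a query over the last 24 hours reaches both the ingesters and the store-gateways.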
If you have used a previous version of the Mimir Helm chart, you must address some of the chart's breaking changes before upgrading to Helm chart version 3.0. For detailed information about how to do this, see Upgrade the Grafana Mimir Helm chart from version 2.1 to 3.0.
Bug fixes
- PR 1883: Fixed a bug that caused the query-frontend and querier to crash when they received a user query with a special regular expression label matcher.
- PR 1933: Fixed a bug in the ingester ring page, which showed incorrect status of entries in the ring.
- PR 2090: Ruler in remote rule evaluation mode now applies the timeout correctly. Previously the ruler could get stuck forever, which halted rule evaluation.
- PR 2036: Fixed panic at startup when Mimir is running in monolithic mode and query sharding is enabled.
Changelog
2.2.0
Grafana Mimir
- [CHANGE] Increased default configuration for `-server.grpc-max-recv-msg-size-bytes` and `-server.grpc-max-send-msg-size-bytes` from 4MB to 100MB. #1884
- [CHANGE] Default values have changed for the following settings. This improves query performance for recent data (within 12h) by only reading from ingesters: #1909 #1921
  - `-blocks-storage.bucket-store.ignore-blocks-within` now defaults to `10h` (previously `0`)
  - `-querier.query-store-after` now defaults to `12h` (previously `0`)
- [CHANGE] Alertmanager: removed support for migrating local files from Cortex 1.8 or earlier. Related to original Cortex PR cortexproject/cortex#3910. #2253
- [CHANGE] The following settings are now classified as advanced because the defaults should work for most users and tuning them requires in-depth knowledge of how the read path works: #1929
  - `-querier.query-ingesters-within`
  - `-querier.query-store-after`
- [CHANGE] Config flag category overrides can be set dynamically at runtime. #1934
- [CHANGE] Ingester: deprecated `-ingester.ring.join-after`. Mimir now behaves as if this setting is always set to 0s. This configuration option will be removed in Mimir 2.4.0. #1965
- [CHANGE] Blocks uploaded by ingester no longer contain `__org_id__` label. Compactor now ignores this label and will compact blocks with and without this label together. The `mimirconvert` tool will remove the label from blocks as an "unknown" label. #1972
- [CHANGE] Querier: deprecated `-querier.shuffle-sharding-ingesters-lookback-period`, instead adding `-querier.shuffle-sharding-ingesters-enabled` to enable or disable shuffle sharding on the read path. The value of `-querier.query-ingesters-within` is now used internally for shuffle sharding lookback. #2110
- [CHANGE] Memberlist: `-memberlist.abort-if-join-fails` now defaults to false. Previously it defaulted to true. #2168
- [CHANGE] Ruler: `/api/v1/rules*` and `/prometheus/rules*` configuration endpoints are removed. Use `/prometheus/config/v1/rules*`. #2182
- [CHANGE] Ingester: `-ingester.exemplars-update-period` has been renamed to `-ingester.tsdb-config-update-period`. You can use it to update multiple, per-tenant TSDB configurations. #2187
- [FEATURE] Ingester: (Experimental) Add the ability to ingest out-of-order samples up to an allowed limit. If you enable this feature, it requires additional memory and disk space. This feature also enables a write-behind log, which might lead to longer ingester-start replays. When this feature is disabled, there is no overhead on memory, disk space, or startup times. #2187
  - `-ingester.out-of-order-time-window`, as duration string, allows you to set how far back in time a sample can be. The default is `0s`, where `s` is seconds.
  - `cortex_ingester_tsdb_out_of_order_samples_appended_total` metric tracks the total number of out-of-order samples ingested by the ingester.
  - `cortex_discarded_samples_total` has a new label `reason="sample-too-old"`, when the `-ingester.out-of-order-time-window` flag is greater than zero. The label tracks the number of samples that were discarded for being too old; they were out of order, but beyond the time window allowed. The labels `reason="sample-out-of-order"` and `reason="sample-out-of-bounds"` are not used when out-of-order ingestion is enabled.
- [ENHANCEMENT] Distributor: Added limit to prevent tenants from sending an excessive number of requests: #1843
  - The following CLI flags (and their respective YAML config options) have been added:
    - `-distributor.request-rate-limit`
    - `-distributor.request-burst-limit`
  - The following metric is exposed to tell how many requests have been rejected:
    - `cortex_discarded_requests_total`
- [ENHANCEMENT] Store-gateway: Add the experimental ability to run requests in a dedicated OS thread pool. This feature can be configured using `-store-gateway.thread-pool-size` and is disabled by default. Replaces the ability to run index header operations in a dedicated thread pool. #1660 #1812
- [ENHANCEMENT] Improved error messages to make them easier to understand; each now has a unique, global identifier that you can use to look up more information in the runbooks. #1907 #1919 #1888 #1939 #1984 #2009 #2056 #2066 #2104 #2150 #2234
- [ENHANCEMENT] Memberlist KV: incoming messages are now processed on a per-key goroutine. This may reduce loss of "maintenance" packets in busy memberlist installations, but use more CPU. New `memberlist_client_received_broadcasts_dropped_total` counter tracks the number of dropped per-key messages. #1912
- [ENHANCEMENT] Blocks Storage, Alertmanager, Ruler: add support for a prefix to the bucket store (`*_storage.storage_prefix`). This enables using the same bucket for the three components. #1686 #1951
- [ENHANCEMENT] Upgrade Docker base images to `alpine:3.16.0`. #2028
- [ENHANCEMENT] Store-gateway: Add experimental configuration option for the store-gateway to attempt to pre-populate the file system cache when memory-mapping index-header files. Enabled with `-blocks-storage.bucket-store.index-header.map-populate-enabled=true`. Note this flag only has an effect when running on Linux. #2019 #2054
- [ENHANCEMENT] Chunk Mapper: reduce memory usage of async chunk mapper. #2043
- [ENHANCEMENT] Ingester: reduce sleep time when reading WAL. #2098
- [ENHANCEMENT] Compactor: Run sanity check on blocks storage configuration at startup. #2144
- [ENHANCEMENT] Compactor: Add HTTP API for uploading TSDB blocks. Enabled with `-compactor.block-upload-enabled`. #1694 #2126
- [ENHANCEMENT] Ingester: Enable querying overlapping blocks by default. #2187
- [ENHANCEMENT] Distributor: Auto-forget unhealthy distributors after ten failed ring heartbeats. #2154
- [ENHANCEMENT] Distributor: Add new metric `cortex_distributor_forward_errors_total` for error codes resulting from forwarding requests. #2077
- [ENHANCEMENT] `/ready` endpoint now returns and logs detailed services information. #2055
- [ENHANCEMENT] Memcached client: Reduce number of connections required to fetch cached keys from memcached. #1920
- [ENHANCEMENT] Improved error message returned when `-querier.query-store-after` validation fails. #1914
- [BUGFIX] Fix regexp parsing panic for regexp label matchers with start/end quantifiers. #1883
- [BUGFIX] Ingester: fixed deceiving error log "failed to update cached shipped blocks after shipper initialisation", occurring for each new tenant in the ingester. #1893
- [BUGFIX] Ring: fix bug where instances may appear unhealthy in the hash ring web UI even though they are not. #1933
- [BUGFIX] API: gzip is now enforced when identity encoding is explicitly rejected. #1864
- [BUGFIX] Fix panic at startup when Mimir is running in monolithic mode and query sharding is enabled. #2036
- [BUGFIX] Ruler: report `cortex_ruler_queries_failed_total` metric for any remote query error except 4xx when remote operational mode is enabled. #2053 #2143
- [BUGFIX] Ingester: fix slow rollout when using `-ingester.ring.unregister-on-shutdown=false` with long `-ingester.ring.heartbeat-period`. #2085
- [BUGFIX] Ruler: add timeout for remote rule evaluation queries to prevent rule group evaluations getting stuck indefinitely. The duration is configurable with `-querier.timeout` (default `2m`). #2090 #2222
- [BUGFIX] Limits: Active series custom tracker configuration has been renamed back from `active_series_custom_trackers_config` to `active_series_custom_trackers`. For backwards compatibility, both versions will be supported until Mimir v2.4. When both fields are specified, `active_series_custom_trackers_config` takes precedence over `active_series_custom_trackers`. #2101
- [BUGFIX] Ingester: fixed the order of labels applied when incrementing the `cortex_discarded_metadata_total` metric. #2096
- [BUGFIX] Ingester: fixed bug where retrieving metadata for a metric with multiple metadata entries would return multiple copies of a single metadata entry rather than all available entries. #2096
- [BUGFIX] Distributor: canceled requests are no longer accounted as internal errors. #2157
- [BUGFIX] Memberlist: Fix typo in memberlist admin UI. #2202
- [BUGFIX] Ruler: fixed typo in error message when ruler failed to decode a rule group. #2151
- [BUGFIX] Active series custom tracker configuration is now displayed properly on `/runtime_config` page. #2065
- [BUGFIX] Query-frontend: `vector` and `time` functions were sharded, which made expressions like `vector(1) > 0 and vector(1)` fail. #2355
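The experimental out-of-order ingestion feature listed above (#2187) decides whether to accept a sample based on `-ingester.out-of-order-time-window`. The following Python sketch models that decision and the discard reasons named in the changelog. It is a simplified illustration under stated assumptions, not the ingester's code: the function name `classify_sample` is hypothetical, `newest_ts_ms` stands in for the newest timestamp already accepted, and duplicate timestamps and the `sample-out-of-bounds` case are not modeled.

```python
def classify_sample(sample_ts_ms: int, newest_ts_ms: int, window_ms: int) -> str:
    """Classify a sample relative to the newest timestamp already ingested.

    Returns "in-order", "out-of-order-accepted", or one of the discard
    reasons from the changelog: "sample-out-of-order" (window disabled)
    or "sample-too-old" (window enabled, but the sample is beyond it).
    """
    # Simplification: a sample at or after the newest timestamp is in order.
    if sample_ts_ms >= newest_ts_ms:
        return "in-order"
    # A window of zero means out-of-order ingestion is disabled.
    if window_ms == 0:
        return "sample-out-of-order"
    # Within the window: accepted as an out-of-order sample.
    if sample_ts_ms >= newest_ts_ms - window_ms:
        return "out-of-order-accepted"
    # Out of order and beyond the allowed window.
    return "sample-too-old"


FIVE_MINUTES_MS = 5 * 60 * 1000
newest = 1_000_000

within_window = classify_sample(900_000, newest, FIVE_MINUTES_MS)
beyond_window = classify_sample(600_000, newest, FIVE_MINUTES_MS)
window_disabled = classify_sample(900_000, newest, 0)
```

In this model, the same late sample is rejected with `sample-out-of-order` when the window is `0s`, and accepted once the window is large enough to cover it.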
Mixin
- [CHANGE] Split `mimir_queries` rules group into `mimir_queries` and `mimir_ingester_queries` to keep number of rules per group within the default per-tenant limit. #1885
- [CHANGE] Dashboards: Expose full image tag in "Mimir / Rollout progress" dashboard's "Pod per version" panel. #1932
- [CHANGE] Dashboards: Disabled gateway panels by default, because most users don't have a gateway exposing the metrics expected by Mimir dashboards. You can re-enable them by setting `gateway_enabled: true` in the mixin config and recompiling the mixin by running `make build-mixin`. #1955
- [CHANGE] Alerts: adapt `MimirFrontendQueriesStuck` and `MimirSchedulerQueriesStuck` to consider ruler query path components. #1949
- [CHANGE] Alerts: Change `MimirRulerTooManyFailedQueries` severity to `critical`. #2165
- [ENHANCEMENT] Dashboards: Add config option `datasource_regex` to customise the regular expression used to select valid datasources for Mimir dashboards. #1802
- [ENHANCEMENT] Dashboards: Added "Mimir / Remote ruler reads" and "Mimir / Remote ruler reads resources" dashboards. #1911 #1937
- [ENHANCEMENT] Dashboards: Make networking panels work for pods created by the mimir-distributed helm chart. #1927
- [ENHANCEMENT] Alerts: Add `MimirStoreGatewayNoSyncedTenants` alert that fires when there is a store-gateway owning no tenants. #1882
- [ENHANCEMENT] Rules: Make `recording_rules_range_interval` configurable for cases where Mimir metrics are scraped less often than every 30 seconds. #2118
- [ENHANCEMENT] Added minimum Grafana version to mixin dashboards. #1943
- [BUGFIX] Fix `container_memory_usage_bytes:sum` recording rule. #1865
- [BUGFIX] Fix `MimirGossipMembersMismatch` alerts if Mimir alertmanager is activated. #1870
- [BUGFIX] Fix `MimirRulerMissedEvaluations` to show % of missed alerts as a value between 0 and 100 instead of 0 and 1. #1895
- [BUGFIX] Fix `MimirCompactorHasNotUploadedBlocks` alert false positive when Mimir is deployed in monolithic mode. #1902
- [BUGFIX] Fix `MimirGossipMembersMismatch` to make it less sensitive during rollouts and fire one alert per installation, not per job. #1926
- [BUGFIX] Do not trigger `MimirAllocatingTooMuchMemory` alerts if no container limits are supplied. #1905
- [BUGFIX] Dashboards: Remove empty "Chunks per query" panel from `Mimir / Queries` dashboard. #1928
- [BUGFIX] Dashboards: Use Grafana's `$__rate_interval` for rate queries in dashboards to support scrape intervals of >15s. #2011
- [BUGFIX] Alerts: Make each version of `MimirCompactorHasNotUploadedBlocks` distinct to avoid rule evaluation failures due to duplicate series being generated. #2197
- [BUGFIX] Fix `MimirGossipMembersMismatch` alert when using remote ruler evaluation. #2159
Jsonnet
- [CHANGE] Remove use of `-querier.query-store-after`, `-querier.shuffle-sharding-ingesters-lookback-period`, `-blocks-storage.bucket-store.ignore-blocks-within`, and `-blocks-storage.tsdb.close-idle-tsdb-timeout` CLI flags since the values now match defaults. #1915 #1921
- [CHANGE] Change default value for `-blocks-storage.bucket-store.chunks-cache.memcached.timeout` to `450ms` to increase use of cached data. #2035
- [CHANGE] The `memberlist_ring_enabled` configuration now applies to Alertmanager. #2102 #2103 #2107
- [CHANGE] Default value for `memberlist_ring_enabled` is now true. It means that all hash rings use Memberlist as default KV store instead of Consul (previous default). #2161
- [CHANGE] Configure `-ingester.max-global-metadata-per-user` to correspond to 20% of the configured max number of series per tenant. #2250
- [CHANGE] Configure `-ingester.max-global-metadata-per-metric` to be 10. #2250
- [CHANGE] Change `_config.multi_zone_ingester_max_unavailable` to 25. #2251
- [FEATURE] Added querier autoscaling support. It requires KEDA installed in the Kubernetes cluster and query-scheduler enabled in the Mimir cluster. Querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2013 #2023
  - `autoscaling_querier_enabled`: `true` to enable autoscaling.
  - `autoscaling_querier_min_replicas`: minimum number of querier replicas.
  - `autoscaling_querier_max_replicas`: maximum number of querier replicas.
  - `autoscaling_prometheus_url`: Prometheus base URL from which to scrape Mimir metrics (e.g. `http://prometheus.default:9090/prometheus`).
- [FEATURE] Jsonnet: Add support for ruler remote evaluation mode (`ruler_remote_evaluation_enabled`), which deploys and uses a dedicated query path for rule evaluation. This enables the benefits of the query-frontend for rule evaluation, such as query sharding. #2073
- [ENHANCEMENT] Added `compactor` service that can be used to route requests directly to compactor (e.g. admin UI). #2063
- [ENHANCEMENT] Added a `consul_enabled` configuration option to provide the ability to disable consul. It is automatically set to false when `memberlist_ring_enabled` is true and `multikv_migration_enabled` (used for migration from Consul to memberlist) is not set. #2093 #2152
- [BUGFIX] Querier: Fix disabling shuffle sharding on the read path whilst keeping it enabled on write path. #2164
Mimirtool
- [CHANGE] mimirtool rules: `--use-legacy-routes` now toggles between using `/prometheus/config/v1/rules` (default) and `/api/v1/rules` (legacy) endpoints. #2182
- [FEATURE] Added bearer token support for when Mimir is behind a gateway authenticating by bearer token. #2146
- [BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors (#1840). #1973
- [BUGFIX] Make mimirtool build for Windows work again. #2273
Mimir Continuous Test
- [ENHANCEMENT] Added the `-tests.smoke-test` flag to run the `mimir-continuous-test` suite once and immediately exit. #2047 #2094
Documentation
- [ENHANCEMENT] Published Grafana Mimir runbooks as part of documentation. #1970
- [ENHANCEMENT] Improved ruler's "remote operational mode" documentation. #1906
- [ENHANCEMENT] Recommend fast disks for ingesters and store-gateways in production tips. #1903
- [ENHANCEMENT] Explain the runtime override of active series matchers. #1868
- [ENHANCEMENT] Clarify "Set rule group" API specification. #1869
- [ENHANCEMENT] Published Mimir jsonnet documentation. #2024
- [ENHANCEMENT] Documented required scrape interval for using alerting and recording rules from Mimir jsonnet. #2147
- [ENHANCEMENT] Runbooks: Mention memberlist as possible source of problems for various alerts. #2158
- [ENHANCEMENT] Added step-by-step article about migrating from Consul to Memberlist KV store using jsonnet without downtime. #2166
- [ENHANCEMENT] Documented `/memberlist` admin page. #2166
- [ENHANCEMENT] Documented how to configure Grafana Mimir's ruler with Jsonnet. #2127
- [ENHANCEMENT] Documented how to configure queriers’ autoscaling with Jsonnet. #2128
- [ENHANCEMENT] Updated mixin building instructions in "Installing Grafana Mimir dashboards and alerts" article. #2015 #2163
- [ENHANCEMENT] Fix location of "Monitoring Grafana Mimir" article in the documentation hierarchy. #2130
- [ENHANCEMENT] Runbook for `MimirRequestLatency` was expanded with more practical advice. #1967
- [BUGFIX] Fixed ruler configuration used in the getting started guide. #2052
- [BUGFIX] Fixed Mimir Alertmanager datasource in Grafana used by "Play with Grafana Mimir" tutorial. #2115
- [BUGFIX] Fixed typos in "Scaling out Grafana Mimir" article. #2170
- [BUGFIX] Added missing ring endpoint exposed by Ingesters. #1918
New Contributors
- @pdf made their first contribution in #1865
- @secustor made their first contribution in #1870
- @zenador made their first contribution in #1930
- @pr00se made their first contribution in #1934
- @hjet made their first contribution in #1973
- @williamzelesny made their first contribution in #2028
- @javad-hajiani made their first contribution in #2146
- @rojas-diego made their first contribution in #2147
- @jhesketh made their first contribution in #2163
- @gonzalez made their first contribution in #2112
- @Eve832 made their first contribution in #2170
Full Changelog: mimir-2.1.0...mimir-2.2.0