This release contains 260 PRs from 46 authors. Thank you!
Grafana Mimir version 2.9 release notes
Grafana Labs is excited to announce version 2.9 of Grafana Mimir.
The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.
Features and enhancements
- Reduced store-gateway memory utilization on fetching series from long-term storage. For queries that include broad label matchers (e.g. datacenter="dc1"), Mimir 2.9 will fetch a reduced volume of index data, which leads to a significant reduction in memory allocations in the store-gateway.
- Reduced CPU utilisation for some shuffle sharding scenarios. Mimir queriers will now use significantly less CPU in cases where shuffle sharding is enabled for tenants with a shard size that's large but lower than the total number of ingesters.
- Reduced object storage API calls in compactors and rulers. Mimir 2.9 comes with optimizations that reduce the number of times compactors and rulers need to access rules stored in object storage.
  - This release adds experimental support for a ruler storage cache. This cache should reduce the number of "list objects" API calls issued to the object storage when there are 2+ ruler replicas running in a Mimir cluster. The cache can be configured by setting the -ruler-storage.cache.* CLI flags or their respective YAML config options.
  - We also introduced a new feature to trigger a synchronization of a tenant's rule groups as soon as changes to the rule configuration are made via the API. This synchronization is in addition to the periodic syncing done every -ruler.poll-interval and allows you to increase the polling interval. The new behavior is enabled globally by default but can be disabled with -ruler.sync-rules-on-changes-enabled=false or tuned at a per-tenant level.
- Experimental support for streaming chunks from ingester to querier. This is expected to greatly reduce querier memory consumption when evaluating queries that select a large number of series, because chunks streamed from the ingester can now be read into memory as needed.
Helm chart improvements
The Grafana Mimir and Grafana Enterprise Metrics Helm chart is now released independently. See the Grafana Mimir Helm chart documentation.
Important changes
In Grafana Mimir 2.9 we have removed the following previously deprecated or experimental metrics:
cortex_bucket_store_chunk_pool_requested_bytes_total
cortex_bucket_store_chunk_pool_returned_bytes_total
The following configuration options are deprecated and will be removed in Grafana Mimir 2.11:
- The CLI flag -querier.query-ingesters-within. This configuration is moved to per-tenant overrides.
- The CLI flag -blocks-storage.bucket-store.bucket-index.enabled.
- The CLI flags -blocks-storage.bucket-store.chunk-pool-min-bucket-size-bytes, -blocks-storage.bucket-store.chunk-pool-max-bucket-size-bytes and -blocks-storage.bucket-store.max-chunk-pool-bytes.
- The CLI flags -querier.iterators and -querier.batch-iterators.
The following configuration options that were deprecated in 2.7 are removed:
- The CLI flag -blocks-storage.bucket-store.chunks-cache.subrange-size. A fixed value of 16000 is now always used.
- The CLI flag -blocks-storage.bucket-store.consistency-delay.
- The CLI flag -compactor.consistency-delay.
- The CLI flag -ingester.ring.readiness-check-ring-health.
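When preparing an upgrade, it can help to scan an existing command line for the flags listed above. The following is a minimal sketch rather than an official tool: the helper names `flag_name` and `check_flags` and the sample arguments are hypothetical, and the flag sets are copied from this section only.

```python
# Flags deprecated in 2.9 and scheduled for removal in 2.11 (from the list above).
DEPRECATED_IN_2_9 = {
    "-querier.query-ingesters-within",
    "-blocks-storage.bucket-store.bucket-index.enabled",
    "-blocks-storage.bucket-store.chunk-pool-min-bucket-size-bytes",
    "-blocks-storage.bucket-store.chunk-pool-max-bucket-size-bytes",
    "-blocks-storage.bucket-store.max-chunk-pool-bytes",
    "-querier.iterators",
    # Spelled as in "-querier.batch-iterators=true" in the changelog entry.
    "-querier.batch-iterators",
}

# Flags deprecated in 2.7 and removed in 2.9 (from the list above).
REMOVED_IN_2_9 = {
    "-blocks-storage.bucket-store.chunks-cache.subrange-size",
    "-blocks-storage.bucket-store.consistency-delay",
    "-compactor.consistency-delay",
    "-ingester.ring.readiness-check-ring-health",
}


def flag_name(arg):
    """Normalize '-name=value' or '--name=value' to '-name'."""
    return "-" + arg.lstrip("-").split("=", 1)[0]


def check_flags(args):
    """Return (removed, deprecated) arguments, preserving their order."""
    removed = [a for a in args if flag_name(a) in REMOVED_IN_2_9]
    deprecated = [a for a in args if flag_name(a) in DEPRECATED_IN_2_9]
    return removed, deprecated


# Hypothetical command line used only to exercise the helper.
removed, deprecated = check_flags(
    ["-target=all", "--compactor.consistency-delay=30m", "-querier.iterators=true"]
)
print("removed:", removed)
print("deprecated:", deprecated)
```

Arguments reported as removed need to be dropped before upgrading; deprecated ones are still listed as configuration options in 2.9 but are due for removal in 2.11.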
The following experimental options and features are now stable:
- The CLI flag -query-frontend.query-sharding-max-regexp-size-bytes.
- The CLI flag -query-scheduler.max-used-instances.
- The CLI flags -(alertmanager|blocks|ruler)-storage.storage-prefix.
- The CLI flag -compactor.first-level-compaction-wait-period.
- The CLI flags -usage-stats.enabled and -usage-stats.installation-mode.
- The CLI flag -query-frontend.query-sharding-target-series-per-shard.
The following configuration option defaults were changed:
- The default value for the CLI flag -query-frontend.query-sharding-max-regexp-size-bytes was changed from 0 to 4096. As a result, queries with regex matchers exceeding this limit will not be sharded by default.
- The default value for the CLI flag -compactor.partial-block-deletion-delay was changed from 0s to 1d. As a result, partial blocks resulting from a failed block upload or deletion will be cleaned up automatically.
- The default value for the CLI flag -ruler.poll-interval was changed from 1m to 10m.
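If a deployment depends on the previous behaviour, the old defaults can be set explicitly. The sketch below only assembles the three flags from the values listed above; the helper name `pin_previous_defaults` is hypothetical, and whether pinning an old default is appropriate depends on the deployment.

```python
# (flag, default before 2.9, default in 2.9), taken from the list above.
CHANGED_DEFAULTS = [
    ("-query-frontend.query-sharding-max-regexp-size-bytes", "0", "4096"),
    ("-compactor.partial-block-deletion-delay", "0s", "1d"),
    ("-ruler.poll-interval", "1m", "10m"),
]


def pin_previous_defaults(existing_args):
    """Return flags restoring pre-2.9 defaults, skipping flags that are already set."""
    already_set = {"-" + a.lstrip("-").split("=", 1)[0] for a in existing_args}
    return [
        f"{flag}={old}"
        for flag, old, _new in CHANGED_DEFAULTS
        if flag not in already_set
    ]


# A deployment that already sets its own poll interval keeps that value.
print(pin_previous_defaults(["-ruler.poll-interval=5m"]))
```

Flags that are already set explicitly are left alone, because an explicit value is unaffected by a change of default.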
Bug fixes
- Store-gateway: Detect collisions in the postings cache. PR 4770
- Store-gateway: Fix panic caused by cached LabelValues responses with more than 655360 values. PR 5021
Changelog
2.9.0-rc.1
Grafana Mimir
- [CHANGE] Store-gateway: change expanded postings, postings, and label values index cache key format. These caches will be invalidated when rolling out the new Mimir version. #4770 #4978 #5037
- [CHANGE] Distributor: remove the "forwarding" feature as it isn't necessary anymore. #4876
- [CHANGE] Query-frontend: Change the default value of -query-frontend.query-sharding-max-regexp-size-bytes from 0 to 4096. #4932
- [CHANGE] Querier: -querier.query-ingesters-within has been moved from a global flag to a per-tenant override. #4287
- [CHANGE] Querier: Use -blocks-storage.tsdb.retention-period instead of -querier.query-ingesters-within for calculating the lookback period for shuffle sharded ingesters. Setting -querier.query-ingesters-within=0 no longer disables shuffle sharding on the read path. #4287
- [CHANGE] Block upload: /api/v1/upload/block/{block}/files endpoint now allows file uploads with no Content-Length. #4956
- [CHANGE] Store-gateway: deprecate configuration parameters for chunk pooling, they will be removed in Mimir 2.11. The following options are now also ignored: #4996
  - -blocks-storage.bucket-store.max-chunk-pool-bytes
  - -blocks-storage.bucket-store.chunk-pool-min-bucket-size-bytes
  - -blocks-storage.bucket-store.chunk-pool-max-bucket-size-bytes
- [CHANGE] Store-gateway: remove metrics cortex_bucket_store_chunk_pool_requested_bytes_total and cortex_bucket_store_chunk_pool_returned_bytes_total. #4996
- [CHANGE] Compactor: change default of -compactor.partial-block-deletion-delay to 1d. This will automatically clean up partial blocks that were a result of failed block upload or deletion. #5026
- [CHANGE] Compactor: the deprecated configuration parameter -compactor.consistency-delay has been removed. #5050
- [CHANGE] Store-gateway: the deprecated configuration parameter -blocks-storage.bucket-store.consistency-delay has been removed. #5050
- [CHANGE] The configuration parameter -blocks-storage.bucket-store.bucket-index.enabled has been deprecated and will be removed in Mimir 2.11. Mimir has run with the bucket index enabled by default since version 2.0, and starting from version 2.11 it will not be possible to disable it. #5051
- [CHANGE] The configuration parameters -querier.iterators and -querier.batch-iterators have been deprecated and will be removed in Mimir 2.11. Mimir runs by default with -querier.batch-iterators=true, and starting from version 2.11 it will not be possible to change this. #5114
- [CHANGE] Compactor: change default of -compactor.first-level-compaction-wait-period to 25m. #5128
- [CHANGE] Ruler: changed default of -ruler.poll-interval from 1m to 10m. Starting from this release, the configured rule groups will also be re-synced each time they're modified by calling the ruler configuration API. #5170
- [FEATURE] Query-frontend: add -query-frontend.log-query-request-headers to enable logging of request headers in query logs. #5030
- [ENHANCEMENT] Add per-tenant limit -validation.max-native-histogram-buckets to be able to ignore native histogram samples that have too many buckets. #4765
- [ENHANCEMENT] Store-gateway: reduce memory usage in some LabelValues calls. #4789
- [ENHANCEMENT] Store-gateway: add a stage label to the metric cortex_bucket_store_series_data_touched. This label now applies to data_type="chunks" and data_type="series". The stage label has 2 values: processed (the number of series that were parsed) and returned (the number of series selected from the processed bytes to satisfy the query). #4797 #4830
- [ENHANCEMENT] Distributor: make __meta_tenant_id label available in relabeling rules configured via metric_relabel_configs. #4725
- [ENHANCEMENT] Compactor: added the configurable limit compactor.block-upload-max-block-size-bytes or compactor_block_upload_max_block_size_bytes to limit the byte size of uploaded or validated blocks. #4680
- [ENHANCEMENT] Querier: reduce CPU utilisation when shuffle sharding is enabled with large shard sizes. #4851
- [ENHANCEMENT] Packaging: facilitate configuration management by instructing systemd to start mimir with a configuration file. #4810
- [ENHANCEMENT] Store-gateway: reduce memory allocations when looking up postings from cache. #4861 #4869 #4962 #5047
- [ENHANCEMENT] Store-gateway: retain only necessary bytes when reading series from the bucket. #4926
- [ENHANCEMENT] Ingester, store-gateway: clear the shutdown marker after a successful shutdown to enable reusing their persistent volumes in case the ingester or store-gateway is restarted. #4985
- [ENHANCEMENT] Store-gateway, query-frontend: Reduced memory allocations when looking up cached entries from Memcached. #4862
- [ENHANCEMENT] Alertmanager: Add additional template function queryFromGeneratorURL that returns the URL-decoded query from the GeneratorURL field of an alert. #4301
- [ENHANCEMENT] Ruler: added experimental ruler storage cache support. The cache should reduce the number of "list objects" API calls issued to the object storage when there are 2+ ruler replicas running in a Mimir cluster. The cache can be configured by setting the -ruler-storage.cache.* CLI flags or their respective YAML config options. #4950 #5054
- [ENHANCEMENT] Store-gateway: added HTTP /store-gateway/prepare-shutdown endpoint for gracefully scaling down store-gateways. A gauge cortex_store_gateway_prepare_shutdown_requested has been introduced for tracing this process. #4955
- [ENHANCEMENT] Updated Kuberesolver dependency (github.com/sercand/kuberesolver) from v2.4.0 to v4.0.0 and gRPC dependency (google.golang.org/grpc) from v1.47.0 to v1.53.0. #4922
- [ENHANCEMENT] Introduced new options for logging HTTP request headers: -server.log-request-headers enables logging HTTP request headers, -server.log-request-headers-exclude-list lists headers which should not be logged. #4922
- [ENHANCEMENT] Block upload: /api/v1/upload/block/{block}/files endpoint now disables read and write HTTP timeout, overriding -server.http-read-timeout and -server.http-write-timeout values. This is done to allow large file uploads to succeed. #4956
- [ENHANCEMENT] Alertmanager: Introduce new metrics from upstream. #4918
  - cortex_alertmanager_notifications_failed_total (added reason label)
  - cortex_alertmanager_nflog_maintenance_total
  - cortex_alertmanager_nflog_maintenance_errors_total
  - cortex_alertmanager_silences_maintenance_total
  - cortex_alertmanager_silences_maintenance_errors_total
- [ENHANCEMENT] Add native histogram support for cortex_request_duration_seconds metric family. #4987
- [ENHANCEMENT] Ruler: do not list rule groups in the object storage for disabled tenants. #5004
- [ENHANCEMENT] Query-frontend and querier: add HTTP API endpoint <prometheus-http-prefix>/api/v1/format_query to format a PromQL query. #4373
- [ENHANCEMENT] Query-frontend: Add cortex_query_frontend_regexp_matcher_count and cortex_query_frontend_regexp_matcher_optimized_count metrics to track optimization of regular expression label matchers. #4813
- [ENHANCEMENT] Alertmanager: Add configuration option to enable or disable the deletion of alertmanager state from object storage. This is useful when migrating alertmanager tenants from one cluster to another, because it avoids a condition where the state object is copied but then deleted before the configuration object is copied. #4989
- [ENHANCEMENT] Querier: only use the minimum set of chunks from ingesters when querying, and cancel unnecessary requests to ingesters sooner if we know their results won't be used. #5016
- [ENHANCEMENT] Add -enable-go-runtime-metrics flag to expose all go runtime metrics as Prometheus metrics. #5009
- [ENHANCEMENT] Ruler: trigger a synchronization of a tenant's rule groups as soon as they change the rules configuration via API. This synchronization is in addition to the periodic syncing done every -ruler.poll-interval. The new behavior is enabled by default, but can be disabled with -ruler.sync-rules-on-changes-enabled=false (configurable on a per-tenant basis too). If you disable the new behaviour, then you may want to revert -ruler.poll-interval to 1m. #4975 #5053 #5115 #5170
- [ENHANCEMENT] Distributor: Improve invalid tenant shard size error message. #5024
- [ENHANCEMENT] Store-gateway: record index header loading time separately in cortex_bucket_store_series_request_stage_duration_seconds{stage="load_index_header"}. Now index header loading will be visible in the "Mimir / Queries" dashboard in the "Series request p99/average latency" panels. #5011 #5062
- [ENHANCEMENT] Querier and ingester: add experimental support for streaming chunks from ingesters to queriers while evaluating queries. This can be enabled with -querier.prefer-streaming-chunks=true. #4886 #5078 #5094 #5126
- [ENHANCEMENT] Update Docker base images from alpine:3.17.3 to alpine:3.18.0. #5065
- [ENHANCEMENT] Compactor: reduced the number of "object exists" API calls issued by the compactor to the object storage when syncing block's meta.json files. #5063
- [ENHANCEMENT] Distributor: Push request rate limits (-distributor.request-rate-limit and -distributor.request-burst-size) and their associated YAML configuration are now stable. #5124
- [ENHANCEMENT] Go: updated to 1.20.5. #5185
- [BUGFIX] Metadata API: Mimir will now return an empty object when no metadata is available, matching Prometheus. #4782
- [BUGFIX] Store-gateway: add collision detection on expanded postings and individual postings cache keys. #4770
- [BUGFIX] Ruler: Support the type=alert|record query parameter for the API endpoint <prometheus-http-prefix>/api/v1/rules. #4302
- [BUGFIX] Backend: Check that alertmanager's data-dir doesn't overlap with bucket-sync dir. #4921
- [BUGFIX] Alertmanager: Allow rate-limiting of webex, telegram and discord notifications. #4979
- [BUGFIX] Store-gateway: fix panics when decoding LabelValues responses that contain more than 655360 values. These responses are no longer cached. #5021
- [BUGFIX] Querier: don't leak memory when processing query requests from query-frontends (i.e. when the query-scheduler is disabled). #5199
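As an illustration of the new format_query endpoint listed above, the sketch below builds a request URL without sending it. It assumes the endpoint takes the PromQL expression in a `query` parameter, as the Prometheus endpoint of the same name does, and that `<prometheus-http-prefix>` is `/prometheus`. The host `mimir.example.com` and the helper name `format_query_url` are hypothetical.

```python
from urllib.parse import urlencode


def format_query_url(base_url, promql, prefix="/prometheus"):
    """Build the URL for <prometheus-http-prefix>/api/v1/format_query.

    The 'query' parameter name and the default prefix are assumptions.
    """
    path = prefix.rstrip("/") + "/api/v1/format_query"
    return base_url.rstrip("/") + path + "?" + urlencode({"query": promql})


# Hypothetical host; the expression is URL-encoded by urlencode.
print(format_query_url("http://mimir.example.com", "sum(rate(http_requests_total[5m]))"))
```

In a multi-tenant setup the request would also need whatever authentication and tenant headers the cluster requires, which this sketch leaves out.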
Documentation
- [ENHANCEMENT] Improve MimirIngesterReachingTenantsLimit runbook. #4744 #4752
- [ENHANCEMENT] Add symbol table size exceeds case to MimirCompactorHasNotSuccessfullyRunCompaction runbook. #4945
- [ENHANCEMENT] Clarify which APIs use query sharding. #4948
Mixin
- [CHANGE] Alerts: Remove MimirQuerierHighRefetchRate. #4980
- [CHANGE] Alerts: Remove MimirTenantHasPartialBlocks. This is obsoleted by the changed default of -compactor.partial-block-deletion-delay to 1d, which will auto remediate this alert. #5026
- [ENHANCEMENT] Alertmanager dashboard: display active aggregation groups #4772
- [ENHANCEMENT] Alerts: MimirIngesterTSDBWALCorrupted now only fires when there is more than one corrupted WAL in single-zone deployments and when there are more than two zones affected in multi-zone deployments. #4920
- [ENHANCEMENT] Alerts: added labels to duplicated MimirRolloutStuck and MimirCompactorHasNotUploadedBlocks rules in order to distinguish them. #5023
- [ENHANCEMENT] Dashboards: fix holes in graph for lightly loaded clusters #4915
- [ENHANCEMENT] Dashboards: allow configuring additional services for the Rollout Progress dashboard. #5007
- [ENHANCEMENT] Alerts: do not fire MimirAllocatingTooMuchMemory alert for any matching container outside of namespaces where Mimir is running. #5089
- [BUGFIX] Dashboards: show cancelled requests in a different color to successful requests in throughput panels on dashboards. #5039
- [BUGFIX] Dashboards: fix dashboard panels that showed percentages with axes from 0 to 10000%. #5084
Jsonnet
- [CHANGE] Ruler: changed ruler autoscaling policy, extended scale down period from 60s to 600s. #4786
- [CHANGE] Update to v0.5.0 rollout-operator. #4893
- [CHANGE] Backend: add alertmanager_args to mimir-backend when running in read-write deployment mode. Remove hardcoded filesystem alertmanager storage. This moves alertmanager's data-dir to /data/alertmanager by default. #4907 #4921
- [CHANGE] Remove -pdb suffix from PodDisruptionBudget names. This will create new PodDisruptionBudget resources. Make sure to prune the old resources; otherwise, rollouts will be blocked. #5109
- [CHANGE] Query-frontend: enable query sharding for cardinality estimation via -query-frontend.query-sharding-target-series-per-shard by default if the results cache is enabled. #5128
- [ENHANCEMENT] Ingester: configure -blocks-storage.tsdb.head-compaction-interval=15m to spread TSDB head compaction over a wider time range. #4870
- [ENHANCEMENT] Ingester: configure -blocks-storage.tsdb.wal-replay-concurrency to CPU request minus 1. #4864
- [ENHANCEMENT] Compactor: configure -compactor.first-level-compaction-wait-period to TSDB head compaction interval plus 10 minutes. #4872
- [ENHANCEMENT] Store-gateway: set GOMEMLIMIT to the memory request value. This should reduce the likelihood that the store-gateway runs out of memory, at the cost of a higher CPU utilization due to more frequent garbage collections when the memory utilization gets closer to or above the configured requested memory. #4971
- [ENHANCEMENT] Store-gateway: dynamically set GOMAXPROCS based on the CPU request. This should reduce the likelihood that a high load on the store-gateway slows down the entire Kubernetes node. #5104
- [ENHANCEMENT] Store-gateway: add store_gateway_lazy_loading_enabled configuration option which combines disabled lazy-loading and reducing blocks sync concurrency. Reducing blocks sync concurrency improves startup times with disabled lazy loading on HDDs. #5025
- [ENHANCEMENT] Update rollout-operator image to v0.6.0. #5155
- [BUGFIX] Backend: configure -ruler.alertmanager-url to mimir-backend when running in read-write deployment mode. #4892
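Two of the Jsonnet settings above are derived values, and the arithmetic is simple to check. The Python sketch below mirrors it for illustration only: it is not the Jsonnet code, the function names are hypothetical, and both the lower bound of 1 for the WAL replay concurrency and the use of whole CPU cores are assumptions not stated in these notes.

```python
def wal_replay_concurrency(cpu_request_cores):
    """CPU request minus 1. The floor of 1 is an assumption of this sketch."""
    return max(1, int(cpu_request_cores) - 1)


def first_level_compaction_wait_minutes(head_compaction_interval_minutes=15):
    """TSDB head compaction interval plus 10 minutes."""
    return head_compaction_interval_minutes + 10


print(wal_replay_concurrency(4))              # 3
print(first_level_compaction_wait_minutes())  # 25
```

With the 15m head compaction interval configured above, the wait period works out to 25 minutes, which agrees with the 25m default for -compactor.first-level-compaction-wait-period noted in the Grafana Mimir changelog.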
Mimirtool
- [CHANGE] check rules: will fail on duplicate rules when --strict is provided. #5035
- [FEATURE] sync/diff can now include/exclude namespaces based on a regular expression using --namespaces-regex and --ignore-namespaces-regex. #5100
- [ENHANCEMENT] analyze prometheus: allow specifying -prometheus-http-prefix. #4966
- [ENHANCEMENT] analyze grafana: allow specifying --folder-title to limit dashboards analysis based on their exact folder title. #4973
Tools
- [CHANGE] copyblocks: copying between Azure Blob Storage buckets is now supported in addition to copying between Google Cloud Storage buckets. As a result, the --service flag is now required to be specified (accepted values are gcs or abs). #4756
New Contributors
- @sepich made their first contribution in #4725
- @srclosson made their first contribution in #4829
- @salvacorts made their first contribution in #4833
- @jhalterman made their first contribution in #4840
- @MattiasSegerdahl made their first contribution in #4855
- @willychrisza made their first contribution in #4827
- @blazarecki made their first contribution in #4234
- @alexweav made their first contribution in #4882
- @fionaliao made their first contribution in #4287
- @KristianGrafana made their first contribution in #5034
- @dhanusaputra made their first contribution in #4813
- @alex5517 made their first contribution in #5031
- @theSuess made their first contribution in #5100
Full Changelog: mimir-2.8.0...mimir-2.9.0-rc.1