This release contains 525 PRs from 60 authors, including new contributors Benoit Schipper, Derek Cadzow, Edwin, Itay Kalfon, Ivan Farré Vicente, Jan O. Rundshagen, Jorge Turrado Ferrero, Lukas Monkevicius, Mickaël Canévet, Rafael Sathler, Rajakavitha Kodhandapani, Tim Kotowski, Vladimir Varankin, Zach, Zach Day, Zirko, blut, github-actions[bot], ncharaf, zhehao-grafana. Thank you!
Grafana Mimir version 2.12.0-rc.0 release notes
Grafana Labs is excited to announce version 2.12 of Grafana Mimir.
The highlights that follow include the top features, enhancements, and bug fixes in this release.
For the complete list of changes, refer to the CHANGELOG.
Features and enhancements
- Added support to only count series that are considered active through the Cardinality API endpoint /api/v1/cardinality/label_names by passing the count_method parameter. If set to active, it counts only series that are considered active according to the -ingester.active-series-metrics-idle-timeout flag setting rather than counting all in-memory series.
- The "Store-gateway: bucket tenant blocks" admin page contains a new column "No Compact". If a block's no-compaction marker is set, the column shows the reason and the date the marker was added.
- The estimated number of compaction jobs based on the current bucket-index is now computed by the compactor. The result is tracked by the new cortex_bucket_index_estimated_compaction_jobs metric. If this computation fails, the cortex_bucket_index_estimated_compaction_jobs_errors_total metric is updated instead. The estimated number of compaction jobs is also shown in Top tenants, Tenants, and Compactor dashboards.
- Added mimir-distroless container image built upon a distroless image (gcr.io/distroless/static-debian12). This improvement minimizes attack surfaces and potential CVEs by trimming down the dependencies within the image. After comprehensive testing, the Mimir maintainers plan to shift from the current image to the distroless version.
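As an illustration of the count_method parameter on the Cardinality API, the following minimal sketch builds the request URL. The base URL and the helper name are assumptions made for this example and are not taken from the release notes; adjust them to your deployment.

```python
from urllib.parse import urlencode

# Assumption: the Mimir HTTP API is served under this base URL.
BASE_URL = "http://mimir.example.internal/prometheus"


def label_names_cardinality_url(base_url, count_method=None):
    """Build the label_names cardinality URL (hypothetical helper).

    When count_method is None the parameter is omitted and the server
    default applies; "active" asks for active series only.
    """
    url = base_url.rstrip("/") + "/api/v1/cardinality/label_names"
    if count_method is not None:
        url += "?" + urlencode({"count_method": count_method})
    return url


if __name__ == "__main__":
    print(label_names_cardinality_url(BASE_URL, "active"))
```

The request itself still needs whatever authentication and tenant headers your deployment requires.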
Additionally, the following previously experimental features are now considered stable:
- The number of pre-allocated workers used to forward push requests to the ingesters, configurable via the -distributor.reusable-ingester-push-workers CLI flag on distributors. It now defaults to 2000. Note that this is a performance optimization, and not a limiting feature. If not enough workers are available, new goroutines will be spawned.
- The number of gRPC server workers used to serve the requests, configurable via the -server.grpc.num-workers CLI flag. It now defaults to 100. Note that this is the number of pre-allocated long-lived workers, and not a limiting feature. If not enough workers are available, new goroutines will be spawned.
- The maximum number of concurrent index header loads across all tenants, configurable via the -blocks-storage.bucket-store.index-header.lazy-loading-concurrency CLI flag on store-gateways. It defaults to 4.
- The maximum time to wait for the query-frontend to become ready before rejecting requests, configurable via the -query-frontend.not-running-timeout CLI flag on query-frontends. It now defaults to 2s.
- Spread-minimizing token-related CLI flags: -ingester.ring.token-generation-strategy, -ingester.ring.spread-minimizing-zones and -ingester.ring.spread-minimizing-join-ring-in-order. You can read more about this feature in our blog post.
Important changes
In Grafana Mimir 2.12 the following behavior has changed:
- Store-gateway now persists a sparse version of the index-header to disk on construction and loads sparse index-headers from disk instead of the whole index-header. This improves the speed at which index headers are lazy-loaded from disk by up to 90%. The added disk usage is in the order of 1-2%.
- Alertmanager deprecated the v1 API. All v1 API endpoints now respond with a JSON deprecation notice and a status code of 410. All endpoints have a v2 equivalent. The list of endpoints is:
  - <alertmanager-web.external-url>/api/v1/alerts
  - <alertmanager-web.external-url>/api/v1/receivers
  - <alertmanager-web.external-url>/api/v1/silence/{id}
  - <alertmanager-web.external-url>/api/v1/silences
  - <alertmanager-web.external-url>/api/v1/status
- Exemplar's label traceID has been changed to trace_id to be consistent with the OpenTelemetry standard.
- Errors returned by ingesters now contain only gRPC status codes. Previously they contained both gRPC and HTTP status codes. To guarantee backwards compatibility when migrating from a version prior to 2.11, it's necessary to first migrate to version 2.11, and then to version 2.12. Otherwise, during the migration some ingester errors with HTTP status code 4xx might not be recognized, and the corresponding requests will be repeated.
- Responses with gRPC status codes are now reported as status_code labels in the cortex_request_duration_seconds and cortex_ingester_client_request_duration_seconds metrics.
- Responses with HTTP 4xx status codes are now treated as errors and used in the status_code label of the request duration metric.
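For clients that still call the deprecated Alertmanager v1 endpoints listed above, a path rewrite might look like the following sketch. It assumes that each v2 path mirrors the v1 path with only the version segment changed; the helper name is hypothetical, and request and response bodies may differ between the two API versions, so treat this as a starting point rather than a drop-in migration.

```python
import re

# Matches the five deprecated v1 endpoints named in these release notes.
V1_ENDPOINT = re.compile(r"/api/v1/(alerts|receivers|silences|silence/[^/]+|status)$")


def to_v2_path(path):
    """Rewrite a deprecated Alertmanager v1 path to the assumed v2 path."""
    if not V1_ENDPOINT.search(path):
        raise ValueError(f"not a known Alertmanager v1 endpoint: {path}")
    return V1_ENDPOINT.sub(lambda m: "/api/v2/" + m.group(1), path)


if __name__ == "__main__":
    print(to_v2_path("/alertmanager/api/v1/silence/abc123"))
```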
The default values of the following CLI flags have been changed:
- -blocks-storage.tsdb.head-postings-for-matchers-cache-max-bytes from 10MB to 100MB.
- -blocks-storage.tsdb.block-postings-for-matchers-cache-max-bytes from 10MB to 100MB.
- -blocks-storage.bucket-store.tenant-sync-concurrency from 10 to 1.
- -query-frontend.max-cache-freshness from 1m to 10m.
- -distributor.write-requests-buffer-pooling-enabled from false to true.
- -blocks-storage.bucket-store.block-sync-concurrency from 20 to 4.
- -memberlist.stream-timeout from 10s to 2s.
- -server.report-grpc-codes-in-instrumentation-label-enabled from false to true.
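When reviewing an upgrade, it can help to see which of the changed defaults above a deployment picks up implicitly. The following sketch is a hypothetical checklist helper, not part of Mimir: the table copies the flag names and values from these notes, and the function reports the flags that are not set explicitly.

```python
# (flag, previous default, new default), copied from the list above.
CHANGED_DEFAULTS = [
    ("-blocks-storage.tsdb.head-postings-for-matchers-cache-max-bytes", "10MB", "100MB"),
    ("-blocks-storage.tsdb.block-postings-for-matchers-cache-max-bytes", "10MB", "100MB"),
    ("-blocks-storage.bucket-store.tenant-sync-concurrency", "10", "1"),
    ("-query-frontend.max-cache-freshness", "1m", "10m"),
    ("-distributor.write-requests-buffer-pooling-enabled", "false", "true"),
    ("-blocks-storage.bucket-store.block-sync-concurrency", "20", "4"),
    ("-memberlist.stream-timeout", "10s", "2s"),
    ("-server.report-grpc-codes-in-instrumentation-label-enabled", "false", "true"),
]


def flags_picking_up_new_defaults(explicit_flags):
    """Return (flag, old, new) for every changed default not set explicitly.

    explicit_flags is a mapping or set of flag names the operator already
    passes on the command line (an assumption of this sketch).
    """
    return [entry for entry in CHANGED_DEFAULTS if entry[0] not in explicit_flags]


if __name__ == "__main__":
    for flag, old, new in flags_picking_up_new_defaults({"-memberlist.stream-timeout"}):
        print(f"{flag}: {old} -> {new}")
```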
The following deprecated configuration options are removed in Grafana Mimir 2.12:
- The YAML setting frontend.cache_unaligned_requests.
The following configuration options are deprecated and will be removed in Grafana Mimir 2.14:
- The CLI flag -ingester.limit-inflight-requests-using-grpc-method-limiter. It now defaults to true.
- The CLI flag -ingester.return-only-grpc-errors. It now defaults to true. To guarantee backwards compatibility when migrating from a version prior to 2.11, it's necessary to first migrate to version 2.11, and then to version 2.12. Otherwise, during the migration some ingester errors with HTTP status code 4xx might not be recognized, and the corresponding requests will be repeated.
- The CLI flag -ingester.client.report-grpc-codes-in-instrumentation-label-enabled. It now defaults to true.
- The CLI flag -distributor.limit-inflight-requests-using-grpc-method-limiter. It now defaults to true.
- The CLI flag -distributor.enable-otlp-metadata-storage. It now defaults to true.
- The CLI flag -querier.max-query-into-future.
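The migration rule stated for -ingester.return-only-grpc-errors (go through 2.11 first when coming from an older version) can be expressed as a small planning helper. This is a sketch under the assumption that versions are plain MAJOR.MINOR strings, optionally followed by a patch or pre-release suffix; the function name is hypothetical.

```python
def upgrade_path(current, target="2.12"):
    """Return the ordered list of versions to roll out.

    Versions older than 2.11 get an intermediate stop at 2.11 before moving
    to 2.12 or later, as the release notes require.
    """
    def major_minor(version):
        major, minor = version.split(".")[:2]
        return int(major), int(minor)

    if major_minor(current) < (2, 11) and major_minor(target) >= (2, 12):
        return ["2.11", target]
    return [target]


if __name__ == "__main__":
    print(upgrade_path("2.10.4"))
    print(upgrade_path("2.11.0"))
```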
The following metrics are removed or deprecated:
- cortex_bucket_store_blocks_loaded_by_duration has been removed.
- cortex_distributor_sample_delay_seconds has been deprecated and will be removed in Mimir 2.14.
Experimental features
Grafana Mimir 2.12 includes new features that are considered experimental and disabled by default.
Use them with caution and report any issues you encounter:
- The maximum number of tenant IDs allowed in a federated query can be configured via the -tenant-federation.max-tenants CLI flag on query-frontends. By default, it's 0, meaning that the limit is disabled.
- Sharding of active series queries can be enabled via the -query-frontend.shard-active-series-queries CLI flag on query-frontends.
- Timely head compaction can be enabled via the -blocks-storage.tsdb.timely-head-compaction-enabled CLI flag on ingesters. If enabled, the head compaction happens when the min block range can no longer be appended, without requiring 1.5x the chunk range worth of data in the head.
- Streaming of responses from querier to query-frontend can be enabled via the -querier.response-streaming-enabled CLI flag on queriers. This is currently supported only for responses from the /api/v1/cardinality/active_series endpoint.
- The maximum response size for active series queries, in bytes, can be set via the -querier.active-series-results-max-size-bytes CLI flag on queriers.
- Metric relabeling on a per-tenant basis can be forcefully disabled via the -distributor.metric-relabeling-enabled CLI flag on rulers. Metrics relabeling is enabled by default.
- Query Queue Load Balancing by Query Component. Tenant query queues in the query-scheduler can now be split into subqueues by which query component is expected to be utilized to complete the query: ingesters, store-gateways, both, or uncategorized. Dequeuing queries for a given tenant will rotate through the query component subqueues via simple round-robin. If one of the query components (ingesters or store-gateways) experiences a slowdown, queries that only utilize the other query component can continue to be serviced. Enabling this feature is recommended. Both of the following CLI flags must be set to true for the feature to take effect:
  - -query-frontend.additional-query-queue-dimensions-enabled on the query-frontend.
  - -query-scheduler.additional-query-queue-dimensions-enabled on the query-scheduler.
- Owned series tracking in ingesters can be enabled via the -ingester.track-ingester-owned-series CLI flag. When enabled, ingesters will track the number of in-memory series that still map to the ingester based on the ring state. These counts are more reactive to ring and shard changes than in-memory series, and can be used when enforcing tenant series limits by enabling the -ingester.use-ingester-owned-series-for-limits CLI flag. This feature requires zone-aware replication to be enabled, and the replication factor to be equal to the number of zones.
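The round-robin rotation across query component subqueues described in the list above can be sketched as follows. This is a simplified model for illustration and not the query-scheduler's actual implementation; the class name and the component labels are assumptions.

```python
from collections import deque


class TenantQueue:
    """Per-tenant queue split into one subqueue per query component."""

    # Hypothetical labels for the four categories named in the notes.
    COMPONENTS = ("ingester", "store-gateway", "both", "uncategorized")

    def __init__(self):
        self._subqueues = {name: deque() for name in self.COMPONENTS}
        self._next = 0  # index of the subqueue to try first on the next dequeue

    def enqueue(self, component, query):
        self._subqueues[component].append(query)

    def dequeue(self):
        """Return the next query, rotating through non-empty subqueues."""
        count = len(self.COMPONENTS)
        for offset in range(count):
            index = (self._next + offset) % count
            subqueue = self._subqueues[self.COMPONENTS[index]]
            if subqueue:
                self._next = (index + 1) % count
                return subqueue.popleft()
        return None  # all subqueues are empty


if __name__ == "__main__":
    queue = TenantQueue()
    for name in ("s1", "s2", "s3"):
        queue.enqueue("store-gateway", name)
    queue.enqueue("ingester", "i1")
    print([queue.dequeue() for _ in range(4)])
```

In this model, an ingester-only query enqueued behind several store-gateway queries is still served on the next rotation instead of waiting for the whole backlog.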
Bug fixes
- Distributor: fixed an issue where -distributor.metric-relabeling-enabled could cause distributors to panic.
- Distributor: fixed an issue where -distributor.metric-relabeling-enabled could cause distributors to write unsorted labels and corrupt blocks.
- Ingester: errors encountered while iterating through chunks or samples in response to a query request aren't ignored anymore.
- Compactor: out-of-order blocks aren't allowed to prevent timely compaction anymore.
- Querier: requests to store-gateway when a query gets canceled aren't retried anymore.
- Querier: status code 499 is now returned instead of 500 when a request to remote read endpoint gets canceled.
- Querier: fixed an issue where -querier.max-fetched-series-per-query wasn't applied to the /series endpoint when the series were loaded from ingesters.
- Querier: fixed an issue with the remote-read requests HTTP status code translations. Previously, remote-read had conflicting behaviours: when returning samples all internal errors were translated to HTTP 400, while when returning chunks all internal errors were translated to HTTP 500. With this fix, all validation errors will be translated into HTTP 400 errors, while all other errors will be translated into HTTP 500 errors.
- Query-frontend: the cortex_query_frontend_queries_total metric incorrectly reported op="query" for any request which wasn't a range query. Now the op label value can be one of the following:
  - query: instant query
  - query_range: range query
  - cardinality: cardinality query
  - label_names_and_values: label names / values query
  - active_series: active series query
  - other: any other request
- Ruler: fixed an issue where "failed to remotely evaluate query expression, will retry" messages were logged without context such as the trace ID and didn't appear in trace events.
- Ruler: requests to remote querier when server's response exceeds its configured max payload size aren't retried anymore.
- Ruler: fixed a regression that caused client errors to be tracked in the cortex_ruler_write_requests_failed_total metric.
- Ruler: fixed an issue where a recording rule result could be corrupted due to the use of a bad native histogram pointer.
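The remote-read status code behaviour after the querier fixes above (499 for canceled requests, 400 for validation errors, 500 for everything else) can be summarized in a short sketch. The exception classes are hypothetical stand-ins for this example; Mimir's own error types are different.

```python
class ValidationError(Exception):
    """Hypothetical marker for errors caused by an invalid request."""


class RequestCanceled(Exception):
    """Hypothetical marker for a request canceled by the client."""


def remote_read_status(err):
    """Map an error to the HTTP status code described in the notes."""
    if isinstance(err, RequestCanceled):
        return 499
    if isinstance(err, ValidationError):
        return 400
    return 500


if __name__ == "__main__":
    print(remote_read_status(ValidationError("bad matcher")))
```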
Helm chart improvements
The Grafana Mimir and Grafana Enterprise Metrics Helm charts are released independently.
Refer to the Grafana Mimir Helm chart documentation.
Changelog
2.12.0-rc.0
Grafana Mimir
- [CHANGE] Alertmanager: Deprecates the v1 API. All v1 API endpoints now respond with a JSON deprecation notice and a status code of 410. All endpoints have a v2 equivalent. The list of endpoints is: #7103
  - <alertmanager-web.external-url>/api/v1/alerts
  - <alertmanager-web.external-url>/api/v1/receivers
  - <alertmanager-web.external-url>/api/v1/silence/{id}
  - <alertmanager-web.external-url>/api/v1/silences
  - <alertmanager-web.external-url>/api/v1/status
- [CHANGE] Ingester: Increase default value of -blocks-storage.tsdb.head-postings-for-matchers-cache-max-bytes and -blocks-storage.tsdb.block-postings-for-matchers-cache-max-bytes to 100 MiB (previous default value was 10 MiB). #6764
- [CHANGE] Validate tenant IDs according to documented behavior even when tenant federation is not enabled. Note that this will cause some previously accepted tenant IDs to be rejected such as those longer than 150 bytes or containing | characters. #6959
- [CHANGE] Ruler: don't use backoff retry on remote evaluation in case of 4xx errors. #7004
- [CHANGE] Server: responses with HTTP 4xx status codes are now treated as errors and used in status_code label of request duration metric. #7045
- [CHANGE] Memberlist: change default for -memberlist.stream-timeout from 10s to 2s. #7076
- [CHANGE] Memcached: remove legacy thanos_cache_memcached_* and thanos_memcached_* prefixed metrics. Instead, Memcached and Redis cache clients now emit thanos_cache_* prefixed metrics with a backend label. #7076
- [CHANGE] Ruler: the following metrics, exposed when the ruler is configured to discover Alertmanager instances via service discovery, have been renamed: #7057
  - prometheus_sd_failed_configs renamed to cortex_prometheus_sd_failed_configs
  - prometheus_sd_discovered_targets renamed to cortex_prometheus_sd_discovered_targets
  - prometheus_sd_received_updates_total renamed to cortex_prometheus_sd_received_updates_total
  - prometheus_sd_updates_delayed_total renamed to cortex_prometheus_sd_updates_delayed_total
  - prometheus_sd_updates_total renamed to cortex_prometheus_sd_updates_total
  - prometheus_sd_refresh_failures_total renamed to cortex_prometheus_sd_refresh_failures_total
  - prometheus_sd_refresh_duration_seconds renamed to cortex_prometheus_sd_refresh_duration_seconds
- [CHANGE] Query-frontend: the default value for -query-frontend.not-running-timeout has been changed from 0 (disabled) to 2s. The configuration option has also been moved from "experimental" to "advanced". #7126
- [CHANGE] Store-gateway: to reduce disk contention on HDDs the default value for blocks-storage.bucket-store.tenant-sync-concurrency has been changed from 10 to 1 and the default value for blocks-storage.bucket-store.block-sync-concurrency has been changed from 20 to 4. #7136
- [CHANGE] Store-gateway: Remove deprecated CLI flags -blocks-storage.bucket-store.index-header-lazy-loading-enabled and -blocks-storage.bucket-store.index-header-lazy-loading-idle-timeout and their corresponding YAML settings. Instead, use -blocks-storage.bucket-store.index-header.lazy-loading-enabled and -blocks-storage.bucket-store.index-header.lazy-loading-idle-timeout. #7521
- [CHANGE] Store-gateway: Mark experimental CLI flag -blocks-storage.bucket-store.index-header.lazy-loading-concurrency and its corresponding YAML settings as advanced. #7521
- [CHANGE] Store-gateway: Remove experimental CLI flag -blocks-storage.bucket-store.index-header.sparse-persistence-enabled since this is now the default behavior. #7535
- [CHANGE] All: set -server.report-grpc-codes-in-instrumentation-label-enabled to true by default, which enables reporting gRPC status codes as status_code labels in the cortex_request_duration_seconds metric. #7144
- [CHANGE] Distributor: report gRPC status codes as status_code labels in the cortex_ingester_client_request_duration_seconds metric by default. #7144
- [CHANGE] Distributor: CLI flag -ingester.client.report-grpc-codes-in-instrumentation-label-enabled has been deprecated, and its default value is set to true. #7144
- [CHANGE] Ingester: CLI flag -ingester.return-only-grpc-errors has been deprecated, and its default value is set to true. To ensure backwards compatibility, during a migration from a version prior to 2.11.0 to 2.12 or later, -ingester.return-only-grpc-errors should be set to false. Once all the components are migrated, the flag can be removed. #7151
- [CHANGE] Ingester: the following CLI flags have been moved from "experimental" to "advanced": #7169
  - -ingester.ring.token-generation-strategy
  - -ingester.ring.spread-minimizing-zones
  - -ingester.ring.spread-minimizing-join-ring-in-order
- [CHANGE] Query-frontend: the default value of the CLI flag -query-frontend.max-cache-freshness (and its respective YAML configuration parameter) has been changed from 1m to 10m. #7161
- [CHANGE] Distributor: default the optimization -distributor.write-requests-buffer-pooling-enabled to true. #7165
- [CHANGE] Tracing: Move query information to span attributes instead of span logs. #7046
- [CHANGE] Distributor: the default value of circuit breaker's CLI flag -ingester.client.circuit-breaker.cooldown-period has been changed from 1m to 10s. #7310
- [CHANGE] Store-gateway: remove cortex_bucket_store_blocks_loaded_by_duration. cortex_bucket_store_series_blocks_queried is better suited for detecting when compactors are not able to keep up with the number of blocks to compact. #7309
- [CHANGE] Ingester, Distributor: the support for rejecting push requests received via gRPC before reading them into memory, enabled via -ingester.limit-inflight-requests-using-grpc-method-limiter and -distributor.limit-inflight-requests-using-grpc-method-limiter, is now stable and enabled by default. The configuration options have been deprecated and will be removed in Mimir 2.14. #7360
- [CHANGE] Distributor: Change -distributor.enable-otlp-metadata-storage flag's default to true, and deprecate it. The flag will be removed in Mimir 2.14. #7366
- [CHANGE] Store-gateway: Use a shorter TTL for cached items related to temporary blocks. #7407 #7534
- [CHANGE] Standardise exemplar label as "trace_id". #7475
- [CHANGE] The configuration option -querier.max-query-into-future has been deprecated and will be removed in Mimir 2.14. #7496
- [CHANGE] Distributor: the metric cortex_distributor_sample_delay_seconds has been deprecated and will be removed in Mimir 2.14. #7516
- [CHANGE] Query-frontend: The deprecated YAML setting frontend.cache_unaligned_requests has been moved to limits.cache_unaligned_requests. #7519
- [FEATURE] Introduce -server.log-source-ips-full option to log all IPs from Forwarded, X-Real-IP, X-Forwarded-For headers. #7250
- [FEATURE] Introduce -tenant-federation.max-tenants option to limit the max number of tenants allowed for requests when federation is enabled. #6959
- [FEATURE] Cardinality API: added a new count_method parameter which enables counting active label values. #7085
- [FEATURE] Querier / query-frontend: added -querier.promql-experimental-functions-enabled CLI flag (and respective YAML config option) to enable experimental PromQL functions. The experimental functions introduced are: mad_over_time(), sort_by_label() and sort_by_label_desc(). #7057
- [FEATURE] Alertmanager API: added -alertmanager.grafana-alertmanager-compatibility-enabled CLI flag (and respective YAML config option) to enable experimental API endpoints that support the migration of the Grafana Alertmanager. #7057
- [FEATURE] Alertmanager: Added -alertmanager.utf8-strict-mode-enabled to control support for any UTF-8 character as part of Alertmanager configuration/API matchers and labels. Its default value is set to false. #6898
- [FEATURE] Querier: added histogram_avg() function support to PromQL. #7293
- [FEATURE] Ingester: added -blocks-storage.tsdb.timely-head-compaction flag, which enables more timely head compaction, and defaults to false. #7372
- [FEATURE] Compactor: Added /compactor/tenants and /compactor/tenant/{tenant}/planned_jobs endpoints that provide functionality that was provided by tools/compaction-planner, namely listing of planned compaction jobs based on tenants' bucket index. #7381
- [FEATURE] Add experimental support for streaming response bodies from queriers to frontends via -querier.response-streaming-enabled. This is currently only supported for the /api/v1/cardinality/active_series endpoint. #7173
- [FEATURE] Release: Added mimir distroless docker image. #7371
- [FEATURE] Add support for the new grammar of {"metric_name", "l1"="val"} to promql and some of the exposition formats. #7475 #7541
- [ENHANCEMENT] Distributor: Add a new metric cortex_distributor_otlp_requests_total to track the total number of OTLP requests. #7385
- [ENHANCEMENT] Vault: add lifecycle manager for token used to authenticate to Vault. This ensures the client token is always valid. Includes a gauge (cortex_vault_token_lease_renewal_active) to check whether token renewal is active, and the counters cortex_vault_token_lease_renewal_success_total and cortex_vault_auth_success_total to see the total number of successful lease renewals / authentications. #7337
- [ENHANCEMENT] Store-gateway: add no-compact details column on store-gateway tenants admin UI. #6848
- [ENHANCEMENT] PromQL: ignore small errors for bucketQuantile #6766
- [ENHANCEMENT] Distributor: improve efficiency of some errors #6785
- [ENHANCEMENT] Ruler: exclude vector queries from being tracked in cortex_ruler_queries_zero_fetched_series_total. #6544
- [ENHANCEMENT] Ruler: local storage backend now supports reading a rule group via /config/api/v1/rules/{namespace}/{groupName} configuration API endpoint. #6632
- [ENHANCEMENT] Query-Frontend and Query-Scheduler: split tenant query request queues by query component with query-frontend.additional-query-queue-dimensions-enabled and query-scheduler.additional-query-queue-dimensions-enabled. #6772
- [ENHANCEMENT] Distributor: support disabling metric relabel rules per-tenant via the flag -distributor.metric-relabeling-enabled or associated YAML. #6970
- [ENHANCEMENT] Distributor: -distributor.remote-timeout is now accounted from the first ingester push request being sent. #6972
- [ENHANCEMENT] Storage Provider: -<prefix>.s3.sts-endpoint sets a custom endpoint for AWS Security Token Service (AWS STS) in s3 storage provider. #6172
- [ENHANCEMENT] Querier: add cortex_querier_queries_storage_type_total metric that indicates how many queries have executed for a source, ingesters or store-gateways. Add cortex_querier_query_storegateway_chunks_total metric to count the number of chunks fetched from a store gateway. #7099 #7145
- [ENHANCEMENT] Query-frontend: add experimental support for sharding active series queries via -query-frontend.shard-active-series-queries. #6784
- [ENHANCEMENT] Distributor: set -distributor.reusable-ingester-push-workers=2000 by default and mark feature as advanced. #7128
- [ENHANCEMENT] All: set -server.grpc.num-workers=100 by default and mark feature as advanced. #7131
- [ENHANCEMENT] Distributor: invalid metric name error message gets cleaned up to not include non-ascii strings. #7146
- [ENHANCEMENT] Store-gateway: add source, level, and out_of_order to cortex_bucket_store_series_blocks_queried metric that indicates the number of blocks that were queried from store gateways by block metadata. #7112 #7262 #7267
- [ENHANCEMENT] Compactor: After updating bucket-index, compactor now also computes estimated number of compaction jobs based on current bucket-index, and reports the result in cortex_bucket_index_estimated_compaction_jobs metric. If computation of jobs fails, cortex_bucket_index_estimated_compaction_jobs_errors_total is updated instead. #7299
- [ENHANCEMENT] Mimir: Integrate profiling into tracing instrumentation. #7363
- [ENHANCEMENT] Alertmanager: Adds metric cortex_alertmanager_notifications_suppressed_total that counts the total number of notifications suppressed for being silenced, inhibited, outside of active time intervals or within muted time intervals. #7384
- [ENHANCEMENT] Query-scheduler: added more buckets to cortex_query_scheduler_queue_duration_seconds histogram metric, in order to better track queries staying in the queue for longer than 10s. #7470
- [ENHANCEMENT] A type label is added to prometheus_tsdb_head_out_of_order_samples_appended_total metric. #7475
- [ENHANCEMENT] Distributor: Optimize OTLP endpoint. #7475
- [ENHANCEMENT] API: Use github.com/klauspost/compress for faster gzip and deflate compression of API responses. #7475
- [ENHANCEMENT] Ingester: Limiting on owned series (-ingester.use-ingester-owned-series-for-limits) now prevents discards in cases where a tenant is sharded across all ingesters (or shuffle sharding is disabled) and the ingester count increases. #7411
- [ENHANCEMENT] Block upload: include converted timestamps in the error message if block is from the future. #7538
- [ENHANCEMENT] Query-frontend: Introduce -query-frontend.active-series-write-timeout to allow configuring the server-side write timeout for active series requests. #7553 #7569
- [BUGFIX] Ingester: don't ignore errors encountered while iterating through chunks or samples in response to a query request. #6451
- [BUGFIX] Fix issue where queries can fail or omit OOO samples if OOO head compaction occurs between creating a querier and reading chunks #6766
- [BUGFIX] Fix issue where concatenatingChunkIterator can obscure errors #6766
- [BUGFIX] Fix panic during tsdb Commit #6766
- [BUGFIX] tsdb/head: wlog exemplars after samples #6766
- [BUGFIX] Ruler: fix issue where "failed to remotely evaluate query expression, will retry" messages are logged without context such as the trace ID and do not appear in trace events. #6789
- [BUGFIX] Ruler: do not retry requests to remote querier when server's response exceeds its configured max payload size. #7216
- [BUGFIX] Querier: fix issue where spans in query request traces were not nested correctly. #6893
- [BUGFIX] Fix issue where all incoming HTTP requests have duplicate trace spans. #6920
- [BUGFIX] Querier: do not retry requests to store-gateway when a query gets canceled. #6934
- [BUGFIX] Querier: return 499 status code instead of 500 when a request to remote read endpoint gets canceled. #6934
- [BUGFIX] Querier: fix issue where -querier.max-fetched-series-per-query is not applied to /series endpoint if the series are loaded from ingesters. #7055
- [BUGFIX] Distributor: fix issue where -distributor.metric-relabeling-enabled may cause distributors to panic #7176
- [BUGFIX] Distributor: fix issue where -distributor.metric-relabeling-enabled may cause distributors to write unsorted labels and corrupt blocks #7326
- [BUGFIX] Query-frontend: the cortex_query_frontend_queries_total metric incorrectly reported op="query" for any request which wasn't a range query. Now the op label value can be one of the following: #7207
  - query: instant query
  - query_range: range query
  - cardinality: cardinality query
  - label_names_and_values: label names / values query
  - active_series: active series query
  - other: any other request
- [BUGFIX] Fix performance regression introduced in Mimir 2.11.0 when uploading blocks to AWS S3. #7240
- [BUGFIX] Query-frontend: fix race condition when sharding active series is enabled (see above) and response is compressed with snappy. #7290
- [BUGFIX] Query-frontend: "query stats" log unsuccessful replies from downstream as "failed". #7296
- [BUGFIX] Packaging: remove reload from systemd file as mimir does not take into account SIGHUP. #7345
- [BUGFIX] Compactor: do not allow out-of-order blocks to prevent timely compaction. #7342
- [BUGFIX] Update google.golang.org/grpc to resolve occasional issues with gRPC server closing its side of connection before it was drained by the client. #7380
- [BUGFIX] Query-frontend: abort response streaming for active_series requests when the request context is canceled. #7378
- [BUGFIX] Compactor: improve compaction of sporadic blocks. #7329
- [BUGFIX] Ruler: fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total metric. #7472
- [BUGFIX] promql: Fix Range selectors with an @ modifier are wrongly scoped in range queries. #7475
- [BUGFIX] Fix metadata API using wrong JSON field names. #7475
- [BUGFIX] Ruler: fix native histogram recording rule result corruption. #7552
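To check ahead of the upgrade whether existing tenant IDs are affected by the stricter validation (#6959 above), a pre-flight check could look like the sketch below. It covers only the two rejection reasons named in the changelog entry, a length above 150 bytes and the | character; the documented tenant ID rules contain further restrictions that this hypothetical helper does not implement.

```python
MAX_TENANT_ID_BYTES = 150  # limit quoted in the changelog entry


def tenant_id_problems(tenant_id):
    """Return the reasons (from the changelog entry) an ID would be rejected."""
    problems = []
    if len(tenant_id.encode("utf-8")) > MAX_TENANT_ID_BYTES:
        problems.append("longer than 150 bytes")
    if "|" in tenant_id:
        problems.append("contains the | character")
    return problems


if __name__ == "__main__":
    for candidate in ("team-a", "team-a|team-b", "x" * 151):
        print(candidate[:20], tenant_id_problems(candidate))
```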
Mixin
- [CHANGE] The job label matcher for distributor and gateway has been extended to include any deployment matching distributor.* and cortex-gw.* respectively. This change also allows matching custom and multi-zone distributor and gateway deployments. #6817
- [ENHANCEMENT] Dashboards: Add panels for alertmanager activity of a tenant #6826
- [ENHANCEMENT] Dashboards: Add graphs to "Slow Queries" dashboard. #6880
- [ENHANCEMENT] Dashboards: Update all deprecated "graph" panels to "timeseries" panels. #6864 #7413 #7457
- [ENHANCEMENT] Dashboards: Make most columns in "Slow Queries" sortable. #7000
- [ENHANCEMENT] Dashboards: Render graph panels at full resolution as opposed to at half resolution. #7027
- [ENHANCEMENT] Dashboards: show query-scheduler queue length on "Reads" and "Remote Ruler Reads" dashboards. #7088
- [ENHANCEMENT] Dashboards: Add estimated number of compaction jobs to "Compactor", "Tenants" and "Top tenants" dashboards. #7449 #7481
- [ENHANCEMENT] Recording rules: add native histogram recording rules to cortex_request_duration_seconds. #7528
- [ENHANCEMENT] Dashboards: Add total owned series, and per-ingester in-memory and owned series to "Tenants" dashboard. #7511
- [BUGFIX] Dashboards: drop step parameter from targets as it is not supported. #7157
- [BUGFIX] Recording rules: drop rules for metrics removed in 2.0: cortex_memcache_request_duration_seconds and cortex_cache_request_duration_seconds. #7514
Jsonnet
- [CHANGE] Distributor: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from the default (100) to 1000, to avoid dropping tracing spans. #7259
- [CHANGE] Querier: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from 1000 to 5000, to avoid dropping tracing spans. #6764
- [CHANGE] rollout-operator: remove default CPU limit. #7066
- [CHANGE] Store-gateway: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from the default (100) to 1000, to avoid dropping tracing spans. #7068
- [CHANGE] Query-frontend, ingester, ruler, backend and write instances: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from the default (100), to avoid dropping tracing spans. #7086
- [CHANGE] Ring: relaxed the hash ring heartbeat period and timeout for distributor, ingester, store-gateway and compactor: #6860
  - -distributor.ring.heartbeat-period set to 1m
  - -distributor.ring.heartbeat-timeout set to 4m
  - -ingester.ring.heartbeat-period set to 2m
  - -store-gateway.sharding-ring.heartbeat-period set to 1m
  - -store-gateway.sharding-ring.heartbeat-timeout set to 4m
  - -compactor.ring.heartbeat-period set to 1m
  - -compactor.ring.heartbeat-timeout set to 4m
- [CHANGE] Ruler-querier: the topology spread constraint max skew is now configured through the configuration option ruler_querier_topology_spread_max_skew instead of querier_topology_spread_max_skew. #7204
- [CHANGE] Distributor: -server.grpc.keepalive.max-connection-age lowered from 2m to 60s and configured -shutdown-delay=90s and termination grace period to 100 seconds in order to reduce the chances of failed gRPC write requests when distributors gracefully shutdown. #7361
- [FEATURE] Added support for the following root-level settings to configure the list of matchers to apply to node affinity: #6782 #6829
  - alertmanager_node_affinity_matchers
  - compactor_node_affinity_matchers
  - continuous_test_node_affinity_matchers
  - distributor_node_affinity_matchers
  - ingester_node_affinity_matchers
  - ingester_zone_a_node_affinity_matchers
  - ingester_zone_b_node_affinity_matchers
  - ingester_zone_c_node_affinity_matchers
  - mimir_backend_node_affinity_matchers
  - mimir_backend_zone_a_node_affinity_matchers
  - mimir_backend_zone_b_node_affinity_matchers
  - mimir_backend_zone_c_node_affinity_matchers
  - mimir_read_node_affinity_matchers
  - mimir_write_node_affinity_matchers
  - mimir_write_zone_a_node_affinity_matchers
  - mimir_write_zone_b_node_affinity_matchers
  - mimir_write_zone_c_node_affinity_matchers
  - overrides_exporter_node_affinity_matchers
  - querier_node_affinity_matchers
  - query_frontend_node_affinity_matchers
  - query_scheduler_node_affinity_matchers
  - rollout_operator_node_affinity_matchers
  - ruler_node_affinity_matchers
  - ruler_querier_node_affinity_matchers
  - ruler_query_frontend_node_affinity_matchers
  - ruler_query_scheduler_node_affinity_matchers
  - store_gateway_node_affinity_matchers
  - store_gateway_zone_a_node_affinity_matchers
  - store_gateway_zone_b_node_affinity_matchers
  - store_gateway_zone_c_node_affinity_matchers
- [FEATURE] Ingester: Allow automated zone-by-zone downscaling, that can be enabled via the ingester_automated_downscale_enabled flag. It is disabled by default. #6850
- [ENHANCEMENT] Alerts: Add MimirStoreGatewayTooManyFailedOperations warning alert that triggers when Mimir store-gateways report errors when interacting with the object storage. #6831
- [ENHANCEMENT] Querier HPA: improved scaling metric and scaling policies, in order to scale up and down more gradually. #6971
- [ENHANCEMENT] Rollout-operator: upgraded to v0.13.0. #7469
- [ENHANCEMENT] Rollout-operator: add tracing configuration to rollout-operator container (when tracing is enabled and configured). #7469
- [ENHANCEMENT] Query-frontend: configured -shutdown-delay, -server.grpc.keepalive.max-connection-age and termination grace period to reduce the likelihood of queries hitting terminated query-frontends. #7129
- [ENHANCEMENT] Autoscaling: add support for KEDA's ignoreNullValues option for Prometheus scaler. #7471
- [BUGFIX] Update memcached-exporter to 0.14.1 due to CVE-2023-39325. #6861
Mimirtool
- [FEATURE] Add command migrate-utf8 to migrate Alertmanager configurations for Alertmanager versions 0.27.0 and later. #7383
- [ENHANCEMENT] Add template render command to render a template locally. #7325
- [ENHANCEMENT] Add --extra-headers option to mimirtool rules command to add extra headers to requests for auth. #7141
- [ENHANCEMENT] Analyze Prometheus: set tenant header. #6737
- [ENHANCEMENT] Add argument --output-dir to mimirtool alertmanager get where the config and templates will be written to and can be loaded via mimirtool alertmanager load #6760
- [BUGFIX] Analyze rule-file: .metricsUsed field wasn't populated. #6953
Mimir Continuous Test
- [ENHANCEMENT] Include comparison of all expected and actual values when any float sample does not match. #6756
Query-tee
- [BUGFIX] Fix issue where Host HTTP header was not being correctly changed for the proxy targets. #7386
- [ENHANCEMENT] Allow using the value of X-Scope-OrgID for basic auth username in the forwarded request if URL username is set as __REQUEST_HEADER_X_SCOPE_ORGID__. #7452
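The placeholder substitution described in the query-tee enhancement above can be modeled as follows. This is an illustrative sketch and not query-tee's code: the function name, the backend URL, and the plain dict used for request headers are assumptions.

```python
from urllib.parse import urlsplit

PLACEHOLDER = "__REQUEST_HEADER_X_SCOPE_ORGID__"


def basic_auth_username(backend_url, request_headers):
    """Pick the basic auth username for the forwarded request.

    If the backend URL's username is the placeholder, the value of the
    incoming X-Scope-OrgID header is used instead. request_headers is a
    plain dict here, so the header name must match exactly.
    """
    username = urlsplit(backend_url).username
    if username == PLACEHOLDER:
        return request_headers.get("X-Scope-OrgID")
    return username


if __name__ == "__main__":
    url = "http://__REQUEST_HEADER_X_SCOPE_ORGID__:secret@backend.example:8080"
    print(basic_auth_username(url, {"X-Scope-OrgID": "tenant-1"}))
```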
Documentation
- [CHANGE] No longer mark OTLP distributor endpoint as experimental. #7348
- [ENHANCEMENT] Added runbook for KubePersistentVolumeFillingUp alert. #7297
- [ENHANCEMENT] Add Grafana Cloud recommendations to OTLP documentation. #7375
- [BUGFIX] Fixed typo on single zone->zone aware replication Helm page. #7327
Tools
- [CHANGE] copyblocks: The flags for copyblocks have been changed to align more closely with other tools. #6607
- [CHANGE] undelete-blocks: undelete-blocks-gcs has been removed and replaced with undelete-blocks, which supports recovering deleted blocks in versioned buckets from ABS, GCS, and S3-compatible object storage. #6607
- [FEATURE] copyprefix: Add tool to copy objects between prefixes. Supports ABS, GCS, and S3-compatible object storage. #6607
All changes in this release: mimir-2.11.0...mimir-2.12.0-rc.0