This release contains 333 PRs from 39 authors. Thank you!
Grafana Mimir version 2.3 release notes
Grafana Labs is excited to announce version 2.3 of Grafana Mimir, the most scalable, most performant open source time series database in the world.
The highlights that follow include the top features, enhancements, and bugfixes in this release. If you are upgrading from Grafana Mimir 2.2, there is upgrade-related information as well.
For the complete list of changes, see the Changelog.
Features and enhancements
- Ingest metrics in OpenTelemetry format: This release of Grafana Mimir introduces experimental support for ingesting metrics from the OpenTelemetry Collector's otlphttp exporter. This adds a second ingestion option for users of the OTel Collector; Mimir was already compatible with the prometheusremotewrite exporter. For more information, see Configure OTel Collector.
- Increased instant query performance: Grafana Mimir now supports splitting instant queries by time. This allows it to better parallelize execution of instant queries and therefore return results faster. At present, splitting is only supported for a subset of instant queries, which means not all instant queries will see a speedup. This feature is being released as experimental and is disabled by default. It can be enabled by setting -query-frontend.split-instant-queries-by-interval.
- Tenant federation for metadata queries: Users with tenant federation enabled could previously issue instant queries, range queries, and exemplar queries to multiple tenants at once and receive a single aggregated result. With Grafana Mimir 2.3, we've added tenant federation support to the /api/v1/metadata endpoint as well.
- Simpler object storage configuration: Users can now configure block, alertmanager, and ruler storage all at once with the common YAML config option key (or -common.storage.* CLI flags). By centralizing your object storage configuration in one place, this enhancement makes configuration faster and less error prone. Users can still individually configure storage for each of these components if they desire. For more information, see the Common Configurations.
- DEB and RPM packages for Mimir: Starting with version 2.3, we're publishing deb and rpm files for Grafana Mimir, which will make installing and running it on Debian or Red Hat-based Linux systems much easier. Thank you to community contributor wilfriedroset for your work to implement this!
- Import historic data to Grafana Mimir: Users can now backfill time series data from their existing Prometheus or Cortex installation into Mimir using mimirtool, making it possible to migrate to Grafana Mimir without losing your existing metrics data. This support is still considered experimental and does not work for data stored in Thanos yet. To learn more about this feature, see mimirtool backfill and Configure TSDB block upload.
- New Helm chart minor release: The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.3 release, we’re also releasing version 3.1 of the Mimir Helm chart. Notable enhancements follow. For the full list of changes, see the Helm chart changelog.
  - We've upgraded the MinIO subchart dependency from a deprecated chart to the supported one. This creates a breaking change in how the administrator password is set. However, as the built-in MinIO is not a recommended object store for production use cases, this change did not warrant a new major version of the Mimir Helm chart.
  - The backfill API endpoints for importing historic time series data are now exposed on the Nginx gateway.
  - Nginx now sets the value of the X-Scope-OrgID header equal to the value of Mimir's no_auth_tenant parameter by default. The previous release had set the value of X-Scope-OrgID to anonymous by default, which complicated the process of migrating to Mimir.
  - Memberlist now uses DNS service-discovery by default, which should decrease startup time for large Mimir clusters.
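To picture the instant query splitting described in the features above, the following minimal sketch divides the lookback window of a hypothetical instant query into sub-windows no longer than a chosen interval, which could then be evaluated in parallel and combined. It only illustrates the general idea and is not Mimir's implementation; the function name split_window and the example timestamps are assumptions made for this sketch.

```python
from datetime import datetime, timedelta, timezone


def split_window(end, window, interval):
    """Split the lookback window ending at `end` into consecutive
    sub-windows no longer than `interval`, oldest first.

    Hypothetical helper for illustration; not part of Mimir.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    cursor = end - window
    pieces = []
    while cursor < end:
        # The last piece may be shorter than `interval`.
        piece_end = min(cursor + interval, end)
        pieces.append((cursor, piece_end))
        cursor = piece_end
    return pieces


# Example: a 3h lookback split into 1h pieces (values are illustrative).
end = datetime(2022, 9, 1, 12, 0, tzinfo=timezone.utc)
pieces = split_window(end, timedelta(hours=3), timedelta(hours=1))
for piece_start, piece_end in pieces:
    print(piece_start.isoformat(), "->", piece_end.isoformat())
```

Each sub-window covers a disjoint slice of time, which is what makes independent evaluation possible. As noted above, only a subset of instant queries can be split this way.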
Upgrade considerations
In Grafana Mimir 2.3 we have removed the following previously deprecated configuration options:
- The extend_writes parameter in the distributor YAML configuration and the -distributor.extend-writes CLI flag have been removed.
- The active_series_custom_trackers parameter has been removed from the YAML configuration. It had already been moved to the runtime configuration. See #1188 for details.
With Grafana Mimir 2.3 we have also updated the default value for -distributor.ha-tracker.max-clusters to 100 to provide Denial-of-Service protection. Previously, -distributor.ha-tracker.max-clusters was unlimited by default, which could allow a tenant with HA Dedupe enabled to overload the HA tracker with __cluster__ label values that could cause the HA Dedupe database to fail.
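As a rough illustration of why a per-tenant cap helps, the toy model below tracks distinct cluster values per tenant and refuses new ones once the cap is reached. It is a simplified sketch, not the HA tracker's actual code: the class name ClusterTracker, its methods, and the convention that a cap of 0 means "unlimited" are assumptions made for this example.

```python
class ClusterTracker:
    """Toy model of a per-tenant cap on tracked cluster label values.

    Hypothetical class for illustration; not Mimir's HA tracker.
    """

    def __init__(self, max_clusters=100):
        # Assumption for this sketch: 0 means no limit.
        self.max_clusters = max_clusters
        self._clusters = {}

    def observe(self, tenant, cluster):
        """Return True if the cluster is tracked, False if the cap rejects it."""
        seen = self._clusters.setdefault(tenant, set())
        if cluster in seen:
            return True
        if self.max_clusters > 0 and len(seen) >= self.max_clusters:
            return False
        seen.add(cluster)
        return True

    def tracked(self, tenant):
        """Number of distinct clusters currently tracked for the tenant."""
        return len(self._clusters.get(tenant, ()))


# With a cap of 2, a third distinct cluster is rejected,
# while already-known clusters keep being accepted.
tracker = ClusterTracker(max_clusters=2)
results = [tracker.observe("tenant-a", c) for c in ("c1", "c2", "c3", "c1")]
print(results, tracker.tracked("tenant-a"))
```

Without a cap, the tracked set grows with every new label value a tenant sends, which is the overload scenario described above.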
Bug fixes
- PR 2447: Fix incorrect mapping of HTTP status code 429 to 500 when the request queue is full in the query-frontend. This corrects behavior in the query-frontend where a 429 "Too Many Outstanding Requests" error (a retriable error) from a querier was incorrectly returned as a 500 system error (an unretriable error).
- PR 2505: The Memberlist key-value (KV) store now tries to "fast-join" the cluster to avoid serving an empty KV store. This fix addresses the confusing "empty ring" error response and the error log message "ring doesn't exist in KV store yet" emitted by services when there are other members present in the ring when a service starts. Those using other key-value store options (e.g., consul, etcd) are not impacted by this bug.
- PR 2289: The "List Prometheus rules" API endpoint of the Mimir Ruler component is no longer blocked while rules are being synced. This means users can now list rules while syncing larger rule sets.
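The distinction in PR 2447 matters to clients that decide whether to retry based on the status code. The sketch below shows one way a client might encode it, treating 429 as retriable and 500 as not, as described above. The function names and the exponential backoff schedule are assumptions made for this example and are not taken from Mimir or any particular client library.

```python
# Status codes treated as retriable in this sketch, following the
# description above (429 is retriable, 500 is not).
RETRIABLE_STATUS_CODES = {429}


def should_retry(status_code):
    """Return True if a request that got this status should be retried."""
    return status_code in RETRIABLE_STATUS_CODES


def backoff_delays(attempts, base=0.5, cap=8.0):
    """Delays in seconds for each retry attempt: exponential, capped.

    The schedule is an illustrative assumption, not a recommendation.
    """
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


for status in (429, 500):
    print(status, "retry" if should_retry(status) else "give up")
```

Before the fix, a client written this way would have given up on a full request queue, because the 429 reached it as a 500.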
Changelog since 2.2
2.3.0-rc.0
Grafana Mimir
- [CHANGE] Ingester: Added user label to ingester metric cortex_ingester_tsdb_out_of_order_samples_appended_total. On multitenant clusters this helps us find the rate of appended out-of-order samples for a specific tenant. #2493
- [CHANGE] Compactor: delete source and output blocks from local disk when compaction fails, to reduce the likelihood that subsequent compactions fail because of no space left on disk. #2261
- [CHANGE] Ruler: Remove unused CLI flags -ruler.search-pending-for and -ruler.flush-period (and their respective YAML config options). #2288
- [CHANGE] Successful gRPC requests are no longer logged (only affects internal API calls). #2309
- [CHANGE] Add new -*.consul.cas-retry-delay flags. They have a default value of 1s, while previously there was no delay between retries. #2309
- [CHANGE] Store-gateway: Remove the experimental ability to run requests in a dedicated OS thread pool and associated CLI flag -store-gateway.thread-pool-size. #2423
- [CHANGE] Memberlist: disabled TCP-based ping fallback, because Mimir already uses a custom transport based on TCP. #2456
- [CHANGE] Change default value for -distributor.ha-tracker.max-clusters to 100 to provide DoS protection. #2465
- [CHANGE] Experimental block upload API exposed by compactor has changed: Previous /api/v1/upload/block/{block} endpoint for starting block upload is now /api/v1/upload/block/{block}/start, and previous endpoint /api/v1/upload/block/{block}?uploadComplete=true for finishing block upload is now /api/v1/upload/block/{block}/finish. New API endpoint has been added: /api/v1/upload/block/{block}/check. #2486 #2548
- [CHANGE] Compactor: changed -compactor.max-compaction-time default from 0s (disabled) to 1h. When compacting blocks for a tenant, the compactor will move to compact blocks of another tenant or re-plan blocks to compact at least every 1h. #2514
- [CHANGE] Distributor: removed previously deprecated extend_writes (see #1856) YAML key and -distributor.extend-writes CLI flag from the distributor config. #2551
- [CHANGE] Ingester: removed previously deprecated active_series_custom_trackers (see #1188) YAML key from the ingester config. #2552
- [CHANGE] The tenant ID __mimir_cluster is reserved by Mimir and not allowed to store metrics. #2643
- [CHANGE] Purger: removed the purger component and moved its API endpoints /purger/delete_tenant and /purger/delete_tenant_status to the compactor at /compactor/delete_tenant and /compactor/delete_tenant_status. The new endpoints on the compactor are stable. #2644
- [CHANGE] Memberlist: Change the leave timeout duration (-memberlist.leave-timeout) from 5s to 20s and connection timeout (-memberlist.packet-dial-timeout) from 5s to 2s. This makes the leave timeout 10x the connection timeout, so that we can communicate the leave to at least 1 node if the first 9 we try to contact time out. #2669
- [CHANGE] Alertmanager: return status code 412 Precondition Failed and log info message when alertmanager isn't configured for a tenant. #2635
- [CHANGE] Distributor: if forwarding rules are used to forward samples, exemplars are now removed from the request. #2710
- [CHANGE] Limits: change the default value of max_global_series_per_metric limit to 0 (disabled). Setting this limit by default does not provide much benefit because series are sharded by all labels. #2714
- [FEATURE] Compactor: Adds the ability to delete partial blocks after a configurable delay. This option can be configured per tenant. #2285
  - -compactor.partial-block-deletion-delay, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of 0, the default, disables this feature.
  - The metric cortex_compactor_blocks_marked_for_deletion_total has a new value for the reason label reason="partial", when a block deletion marker is triggered by the partial block deletion delay.
- [FEATURE] Querier: enabled support for queries with negative offsets, which are not cached in the query results cache. #2429
- [FEATURE] EXPERIMENTAL: OpenTelemetry Metrics ingestion path on /otlp/v1/metrics. #695 #2436 #2461
- [FEATURE] Querier: Added support for tenant federation to metric metadata endpoint. #2467
- [FEATURE] Query-frontend: introduced experimental support to split instant queries by time. The instant query splitting can be enabled by setting -query-frontend.split-instant-queries-by-interval. #2469 #2564 #2565 #2570 #2571 #2572 #2573 #2574 #2575 #2576 #2581 #2582 #2601 #2632 #2633 #2634 #2641 #2642 #2766
- [ENHANCEMENT] Distributor: Decreased distributor tests execution time. #2562
- [ENHANCEMENT] Alertmanager: Allow the HTTP proxy_url configuration option in the receiver's configuration. #2317
- [ENHANCEMENT] ring: optimize shuffle-shard computation when lookback is used, and all instances have registered timestamp within the lookback window. In that case we can immediately return the original ring, because we would select all instances anyway. #2309
- [ENHANCEMENT] Memberlist: added experimental memberlist cluster label support via -memberlist.cluster-label and -memberlist.cluster-label-verification-disabled CLI flags (and their respective YAML config options). #2354
- [ENHANCEMENT] Object storage can now be configured for all components using the common YAML config option key (or -common.storage.* CLI flags). #2330 #2347
- [ENHANCEMENT] Go: updated to go 1.18.4. #2400
- [ENHANCEMENT] Store-gateway, listblocks: list of blocks now includes stats from meta.json file: number of series, samples and chunks. #2425
- [ENHANCEMENT] Added more buckets to cortex_ingester_client_request_duration_seconds histogram metric, to correctly track requests taking longer than 1s (up until 16s). #2445
- [ENHANCEMENT] Azure client: Improve memory usage for large object storage downloads. #2408
- [ENHANCEMENT] Distributor: Add -distributor.instance-limits.max-inflight-push-requests-bytes. This limit protects the distributor against multiple large requests that together may cause an OOM, but are only a few, so do not trigger the max-inflight-push-requests limit. #2413
- [ENHANCEMENT] Distributor: Drop exemplars in distributor for tenants where exemplars are disabled. #2504
- [ENHANCEMENT] Runtime Config: Allow operator to specify multiple comma-separated yaml files in -runtime-config.file that will be merged in left to right order. #2583
- [ENHANCEMENT] Query sharding: shard binary operations only if it doesn't lead to non-shardable vector selectors in one of the operands. #2696
- [ENHANCEMENT] Add packaging for both debian based deb file and redhat based rpm file using FPM. #1803
- [BUGFIX] TSDB: Fixed a bug on the experimental out-of-order implementation that led to wrong query results. #2701
- [BUGFIX] Compactor: log the actual error when compaction fails. #2261
- [BUGFIX] Alertmanager: restore state from storage even when running a single replica. #2293
- [BUGFIX] Ruler: do not block "List Prometheus rules" API endpoint while syncing rules. #2289
- [BUGFIX] Ruler: return proper *status.Status error when running in remote operational mode. #2417
- [BUGFIX] Alertmanager: ensure the configured -alertmanager.web.external-url is either a path starting with /, or a full URL including the scheme and hostname. #2381 #2542
- [BUGFIX] Memberlist: fix problem with loss of some packets, typically ring updates when instances were removed from the ring during shutdown. #2418
- [BUGFIX] Ingester: fix misfiring MimirIngesterHasUnshippedBlocks and stale cortex_ingester_oldest_unshipped_block_timestamp_seconds when some block uploads fail. #2435
- [BUGFIX] Query-frontend: fix incorrect mapping of HTTP status code 429 to 500 when request queue is full. #2447
- [BUGFIX] Memberlist: Fix problem with ring being empty right after startup. Memberlist KV store now tries to "fast-join" the cluster to avoid serving empty KV store. #2505
- [BUGFIX] Compactor: Fix bug when using -compactor.partial-block-deletion-delay: compactor didn't correctly check for modification time of all block files. #2559
- [BUGFIX] Query-frontend: fix wrong query sharding results for queries with boolean result like 1 < bool 0. #2558
- [BUGFIX] Fixed error messages related to per-instance limits incorrectly reporting they can be set on a per-tenant basis. #2610
- [BUGFIX] Perform HA-deduplication before forwarding samples according to forwarding rules in the distributor. #2603 #2709
- [BUGFIX] Fix reporting of tracing spans from PromQL engine. #2707
- [BUGFIX] Apply relabel and drop_label rules before forwarding rules in the distributor. #2703
- [BUGFIX] Distributor: Register cortex_discarded_requests_total metric, which previously was not registered and therefore not exported. #2712
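For scripts written against the previous experimental block upload API, the path changes listed in the changelog above can be expressed as a small helper. This is a sketch based only on the paths named in that entry; the function names and the example block ID are hypothetical, and the HTTP methods and request bodies are deliberately left out because the changelog does not describe them.

```python
from urllib.parse import quote

UPLOAD_PREFIX = "/api/v1/upload/block/"


def upload_endpoints(block_id):
    """Return the 2.3 block upload paths for a block ID."""
    base = UPLOAD_PREFIX + quote(block_id, safe="")
    return {
        "start": base + "/start",
        "check": base + "/check",
        "finish": base + "/finish",
    }


def migrate_path(old_path):
    """Translate a pre-2.3 upload path into its 2.3 equivalent.

    Old start path:  /api/v1/upload/block/{block}
    Old finish path: /api/v1/upload/block/{block}?uploadComplete=true
    """
    suffix = "?uploadComplete=true"
    if old_path.endswith(suffix):
        return old_path[: -len(suffix)] + "/finish"
    return old_path + "/start"


# The block ID below is a made-up placeholder.
for name, path in upload_endpoints("01EXAMPLEBLOCKID").items():
    print(name, path)
```

The check endpoint has no pre-2.3 counterpart, so migrate_path only covers the start and finish paths.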
Mixin
- [CHANGE] Dashboards: "Slow Queries" dashboard no longer works with versions older than Grafana 9.0. #2223
- [CHANGE] Alerts: use RSS memory instead of working set memory in the MimirAllocatingTooMuchMemory alert for ingesters. #2480
- [ENHANCEMENT] Dashboards: added missed rule evaluations to the "Evaluations per second" panel in the "Mimir / Ruler" dashboard. #2314
- [ENHANCEMENT] Dashboards: add k8s resource requests to CPU and memory panels. #2346
- [ENHANCEMENT] Dashboards: add RSS memory utilization panel for ingesters, store-gateways and compactors. #2479
- [ENHANCEMENT] Dashboards: allow configuring the graph tooltip. #2647
- [ENHANCEMENT] Alerts: MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck alerts are more reliable now as they consider all the intermediate samples in the minute prior to the evaluation. #2630
- [ENHANCEMENT] Alerts: added RolloutOperatorNotReconciling alert, firing if the optional rollout-operator is not successfully reconciling. #2700
- [BUGFIX] Dashboards: fixed unit of latency panels in the "Mimir / Ruler" dashboard. #2312
- [BUGFIX] Dashboards: fixed "Intervals per query" panel in the "Mimir / Queries" dashboard. #2308
- [BUGFIX] Dashboards: Make the "Slow Queries" dashboard work with Grafana 9.0. #2223
- [BUGFIX] Dashboards: add missing API routes to Ruler dashboard. #2412
Jsonnet
- [CHANGE] query-scheduler is enabled by default. We advise deploying the query-scheduler to improve the scalability of the query-frontend. #2431
- [CHANGE] Replaced anti-affinity rules with pod topology spread constraints for distributor, query-frontend, querier and ruler. #2517
  - The following configuration options have been removed:
    - distributor_allow_multiple_replicas_on_same_node
    - query_frontend_allow_multiple_replicas_on_same_node
    - querier_allow_multiple_replicas_on_same_node
    - ruler_allow_multiple_replicas_on_same_node
  - The following configuration options have been added:
    - distributor_topology_spread_max_skew
    - query_frontend_topology_spread_max_skew
    - querier_topology_spread_max_skew
    - ruler_topology_spread_max_skew
- [CHANGE] Change max_global_series_per_metric to 0 in all plans, and as a default value. #2669
- [FEATURE] Memberlist: added support for experimental memberlist cluster label, through the jsonnet configuration options memberlist_cluster_label and memberlist_cluster_label_verification_disabled. #2349
- [FEATURE] Added ruler-querier autoscaling support. It requires KEDA installed in the Kubernetes cluster. Ruler-querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2545
  - autoscaling_ruler_querier_enabled: true to enable autoscaling.
  - autoscaling_ruler_querier_min_replicas: minimum number of ruler-querier replicas.
  - autoscaling_ruler_querier_max_replicas: maximum number of ruler-querier replicas.
  - autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
- [ENHANCEMENT] Memberlist now uses DNS service-discovery by default. #2549
Mimirtool
- [ENHANCEMENT] Added mimirtool backfill command to upload Prometheus blocks using API available in the compactor. #1822
- [ENHANCEMENT] mimirtool bucket-validation: Verify existing objects can be overwritten by subsequent uploads. #2491
- [ENHANCEMENT] mimirtool config convert: Now supports migrating to the current version of Mimir. #2629
- [BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors by using custom parsing. #2386
Mimir Continuous Test
Documentation
- [ENHANCEMENT] Referenced mimirtool commands in the HTTP API documentation. #2516
- [ENHANCEMENT] Improved DNS service discovery documentation. #2513
Tools
- [ENHANCEMENT] markblocks now processes multiple blocks concurrently. #2677
New Contributors
- @ese made their first contribution in #2196
- @micborens made their first contribution in #2321
- @marctc made their first contribution in #2518
- @ravilushqa made their first contribution in #2562
- @BrandonDalton made their first contribution in #2515
- @nervo made their first contribution in #2569
- @lamida made their first contribution in #2427
- @LeviHarrison made their first contribution in #2644
- @sysedwinistrator made their first contribution in #2087
Full Changelog: mimir-2.2.0...mimir-2.3.0-rc0