This Cortex release features 125 contributions from 37 different authors. It's yet another great milestone we have reached thanks to the amazing support from our community ❤️ Thanks!
Highlights:
- The blocks storage is getting closer to production readiness. In this release we've made several fixes and improvements. In particular, you should be aware of:
  - Some CLI flags and YAML config options have been renamed
  - The store-gateway service is now mandatory when running the blocks storage
  - Introduced support for a live cluster migration from chunks to blocks (and rollback)
  - Introduced support to flush blocks on-demand from ingesters
- The ruler and alertmanager got several improvements, including but not limited to:
  - The ruler now runs in the single binary when Cortex gets started with `-target=all`
  - Introduced new config options to fine-tune the ruler
  - Introduced support to load locally stored rules (e.g. loaded via Kubernetes config map)
  - Multiple alertmanager URLs can now be specified in the ruler; each URL is treated as a separate alertmanager group
  - Alertmanager configuration can be persisted to object storage via API
- Other changes worth noting:
  - Added optional `snappy` compression support to internal gRPC connections
  - Starting from this release we're going to publish `.rpm` and `.deb` packages too
Please refer to the full changelog for the full list of changes and improvements.
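The flag renames mentioned in the highlights follow a mechanical pattern that the changelog spells out under #2937: the `-experimental.tsdb.` prefix becomes `-experimental.blocks-storage.`, a fixed set of settings moves under a `tsdb.` group, and `-store.engine=tsdb` becomes `-store.engine=blocks`. The sketch below shows one way to rewrite an existing argument list accordingly. It is an illustration only, not a tool shipped with Cortex: the function name `migrate_flag` and the constants are hypothetical, it assumes flags are written in the single-dash `-name=value` form, and it only knows about the one removed flag named in #2822.

```python
# Hypothetical migration helper based on the rename rules listed under #2937.
OLD_PREFIX = "-experimental.tsdb."
NEW_PREFIX = "-experimental.blocks-storage."

# Settings the changelog lists as grouped under the `tsdb` property.
TSDB_GROUPED = {
    "dir",
    "block-ranges-period",
    "retention-period",
    "ship-interval",
    "ship-concurrency",
    "max-tsdb-opening-concurrency-on-startup",
    "head-compaction-interval",
    "head-compaction-concurrency",
    "head-compaction-idle-timeout",
    "stripe-size",
    "wal-compression-enabled",
    "flush-blocks-on-shutdown",
}

# Flag removed in this release (#2822); it has no replacement.
REMOVED = {"-experimental.tsdb.store-gateway-enabled"}


def migrate_flag(arg: str) -> str:
    """Return the renamed form of a single `-name=value` argument."""
    name, sep, value = arg.partition("=")
    if name in REMOVED:
        raise ValueError(f"{name} was removed and must be dropped")
    if name == "-store.engine" and value == "tsdb":
        return "-store.engine=blocks"
    if name.startswith(OLD_PREFIX):
        suffix = name[len(OLD_PREFIX):]
        if suffix in TSDB_GROUPED:
            suffix = "tsdb." + suffix
        name = NEW_PREFIX + suffix
    return name + sep + value


if __name__ == "__main__":
    old_args = [
        "-store.engine=tsdb",
        "-experimental.tsdb.dir=/data/tsdb",
        "-experimental.tsdb.bucket-store.max-concurrent=100",
    ]
    for arg in old_args:
        print(migrate_flag(arg))
```

Flags that do not start with the old prefix pass through unchanged, so the helper can be applied to a whole argument list. YAML files need the equivalent manual change (`tsdb` to `blocks_storage` at the root level, with the grouped settings nested under `tsdb`).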
Changelog
- [CHANGE] Replace the metric `cortex_alertmanager_configs` with `cortex_alertmanager_config_invalid` exposed by Alertmanager. #2960
- [CHANGE] Experimental Delete Series: Change target flag for purger from `data-purger` to `purger`. #2777
- [CHANGE] Experimental blocks storage: The max concurrent queries against the long-term storage, configured via `-experimental.blocks-storage.bucket-store.max-concurrent`, is now a limit shared across all tenants and not a per-tenant limit anymore. The default value has changed from `20` to `100` and the following new metrics have been added: #2797
  - `cortex_bucket_stores_gate_queries_concurrent_max`
  - `cortex_bucket_stores_gate_queries_in_flight`
  - `cortex_bucket_stores_gate_duration_seconds`
- [CHANGE] Metric `cortex_ingester_flush_reasons` has been renamed to `cortex_ingester_flushing_enqueued_series_total`, and new metric `cortex_ingester_flushing_dequeued_series_total` with `outcome` label (superset of reason) has been added. #2802 #2818 #2998
- [CHANGE] Experimental Delete Series: Metric `cortex_purger_oldest_pending_delete_request_age_seconds` now tracks the age of delete requests from the end of their cancellation period instead of from their creation time. #2806
- [CHANGE] Experimental blocks storage: the store-gateway service is required in a Cortex cluster running with the experimental blocks storage. Removed the `-experimental.tsdb.store-gateway-enabled` CLI flag and `store_gateway_enabled` YAML config option. The store-gateway is now always enabled when the storage engine is `blocks`. #2822
- [CHANGE] Experimental blocks storage: removed support for `-experimental.blocks-storage.bucket-store.max-sample-count` flag because the implementation was flawed. To limit the number of samples/chunks processed by a single query you can set `-store.query-chunk-limit`, which is now supported by the blocks storage too. #2852
- [CHANGE] Ingester: Chunks flushed via /flush stay in memory until retention period is reached. This affects `cortex_ingester_memory_chunks` metric. #2778
- [CHANGE] Querier: the error message returned when the query time range exceeds `-store.max-query-length` has changed from `invalid query, length > limit (X > Y)` to `the query time range exceeds the limit (query length: X, limit: Y)`. #2826
- [CHANGE] Add `component` label to metrics exposed by chunk, delete and index store clients. #2774
- [CHANGE] Querier: when `-querier.query-ingesters-within` is configured, the time range of the query sent to ingesters is now manipulated to ensure the query start time is not older than 'now - query-ingesters-within'. #2904
- [CHANGE] KV: The `role` label, previously exposed only by the `multi` KV store client, has been added to the metrics of every KV store client. If the KV store client is not `multi`, the value of the `role` label is `primary`. #2837
- [CHANGE] Added the `engine` label to the metrics exposed by the Prometheus query engine, to distinguish between `ruler` and `querier` metrics. #2854
- [CHANGE] Added ruler to the single binary when started with `-target=all` (default). #2854
- [CHANGE] Experimental blocks storage: compact head when opening TSDB. This should only affect ingester startup after it was unable to compact head in previous run. #2870
- [CHANGE] Metric `cortex_overrides_last_reload_successful` has been renamed to `cortex_runtime_config_last_reload_successful`. #2874
- [CHANGE] HipChat support has been removed from the alertmanager (because removed from the Prometheus upstream too). #2902
- [CHANGE] Add constant label `name` to metric `cortex_cache_request_duration_seconds`. #2903
- [CHANGE] Add `user` label to metric `cortex_query_frontend_queue_length`. #2939
- [CHANGE] Experimental blocks storage: cleaned up the config and renamed "TSDB" to "blocks storage". #2937
  - The storage engine setting value has been changed from `tsdb` to `blocks`; this affects `-store.engine` CLI flag and its respective YAML option.
  - The root level YAML config has changed from `tsdb` to `blocks_storage`
  - The prefix of all CLI flags has changed from `-experimental.tsdb.` to `-experimental.blocks-storage.`
  - The following settings have been grouped under `tsdb` property in the YAML config and their CLI flags changed:
    - `-experimental.tsdb.dir` changed to `-experimental.blocks-storage.tsdb.dir`
    - `-experimental.tsdb.block-ranges-period` changed to `-experimental.blocks-storage.tsdb.block-ranges-period`
    - `-experimental.tsdb.retention-period` changed to `-experimental.blocks-storage.tsdb.retention-period`
    - `-experimental.tsdb.ship-interval` changed to `-experimental.blocks-storage.tsdb.ship-interval`
    - `-experimental.tsdb.ship-concurrency` changed to `-experimental.blocks-storage.tsdb.ship-concurrency`
    - `-experimental.tsdb.max-tsdb-opening-concurrency-on-startup` changed to `-experimental.blocks-storage.tsdb.max-tsdb-opening-concurrency-on-startup`
    - `-experimental.tsdb.head-compaction-interval` changed to `-experimental.blocks-storage.tsdb.head-compaction-interval`
    - `-experimental.tsdb.head-compaction-concurrency` changed to `-experimental.blocks-storage.tsdb.head-compaction-concurrency`
    - `-experimental.tsdb.head-compaction-idle-timeout` changed to `-experimental.blocks-storage.tsdb.head-compaction-idle-timeout`
    - `-experimental.tsdb.stripe-size` changed to `-experimental.blocks-storage.tsdb.stripe-size`
    - `-experimental.tsdb.wal-compression-enabled` changed to `-experimental.blocks-storage.tsdb.wal-compression-enabled`
    - `-experimental.tsdb.flush-blocks-on-shutdown` changed to `-experimental.blocks-storage.tsdb.flush-blocks-on-shutdown`
- [CHANGE] Flags `-bigtable.grpc-use-gzip-compression`, `-ingester.client.grpc-use-gzip-compression`, `-querier.frontend-client.grpc-use-gzip-compression` are now deprecated. #2940
- [CHANGE] Limit errors reported by ingester during query-time now return HTTP status code 422. #2941
- [FEATURE] Introduced `ruler.for-outage-tolerance`, the max time to tolerate an outage for restoring the "for" state of an alert. #2783
- [FEATURE] Introduced `ruler.for-grace-period`, the minimum duration between an alert and its restored "for" state. This is maintained only for alerts with a configured "for" time greater than the grace period. #2783
- [FEATURE] Introduced `ruler.resend-delay`, the minimum amount of time to wait before resending an alert to Alertmanager. #2783
- [FEATURE] Ruler: added `local` filesystem support to store rules (read-only). #2854
- [ENHANCEMENT] Upgraded Docker base images to `alpine:3.12`. #2862
- [ENHANCEMENT] Experimental: Querier can now optionally query secondary store. This is specified by using `-querier.second-store-engine` option, with values `chunks` or `blocks`. Standard configuration options for this store are used. Additionally, this querying can be configured to happen only for queries that need data older than `-querier.use-second-store-before-time`. Default value of zero will always query secondary store. #2747
- [ENHANCEMENT] Query-tee: increased the `cortex_querytee_request_duration_seconds` metric buckets granularity. #2799
- [ENHANCEMENT] Query-tee: fail to start if the configured `-backend.preferred` is unknown. #2799
- [ENHANCEMENT] Ruler: Added the following metrics: #2786
  - `cortex_prometheus_notifications_latency_seconds`
  - `cortex_prometheus_notifications_errors_total`
  - `cortex_prometheus_notifications_sent_total`
  - `cortex_prometheus_notifications_dropped_total`
  - `cortex_prometheus_notifications_queue_length`
  - `cortex_prometheus_notifications_queue_capacity`
  - `cortex_prometheus_notifications_alertmanagers_discovered`
- [ENHANCEMENT] The behavior of the `/ready` endpoint was changed for the query frontend to indicate when it is ready to accept queries. This is intended for use by a read path load balancer that would want to wait for the frontend to have attached queriers before including it in the backend. #2733
- [ENHANCEMENT] Experimental Delete Series: Add support for deletion of chunks for remaining stores. #2801
- [ENHANCEMENT] Add `-modules` command line flag to list possible values for `-target`. Also, log warning if given target is internal component. #2752
- [ENHANCEMENT] Added `-ingester.flush-on-shutdown-with-wal-enabled` option to enable chunks flushing even when WAL is enabled. #2780
- [ENHANCEMENT] Query-tee: Support for custom API prefix by using `-server.path-prefix` option. #2814
- [ENHANCEMENT] Query-tee: Forward `X-Scope-OrgId` header to backend, if present in the request. #2815
- [ENHANCEMENT] Experimental blocks storage: Added `-experimental.blocks-storage.tsdb.head-compaction-idle-timeout` option to force compaction of data in memory into a block. #2803
- [ENHANCEMENT] Experimental blocks storage: Added support for flushing blocks via `/flush`, `/shutdown` (previously these only worked for chunks storage) and by using `-experimental.blocks-storage.tsdb.flush-blocks-on-shutdown` option. #2794
- [ENHANCEMENT] Experimental blocks storage: Added support to enforce max query time range length via `-store.max-query-length`. #2826
- [ENHANCEMENT] Experimental blocks storage: Added support to limit the max number of chunks that can be fetched from the long-term storage while executing a query. The limit is enforced both in the querier and store-gateway, and is configurable via `-store.query-chunk-limit`. #2852 #2922
- [ENHANCEMENT] Ingester: Added new metric `cortex_ingester_flush_series_in_progress` that reports number of ongoing flush-series operations. Useful when calling `/flush` handler: if `cortex_ingester_flush_queue_length + cortex_ingester_flush_series_in_progress` is 0, all flushes are finished. #2778
- [ENHANCEMENT] Memberlist members can join cluster via SRV records. #2788
- [ENHANCEMENT] Added configuration options for chunks s3 client. #2831
  - `s3.endpoint`
  - `s3.region`
  - `s3.access-key-id`
  - `s3.secret-access-key`
  - `s3.insecure`
  - `s3.sse-encryption`
  - `s3.http.idle-conn-timeout`
  - `s3.http.response-header-timeout`
  - `s3.http.insecure-skip-verify`
- [ENHANCEMENT] Prometheus upgraded. #2798 #2849 #2867 #2902 #2918
  - Optimized labels regex matchers for patterns containing literals (e.g. `foo.*`, `.*foo`, `.*foo.*`)
- [ENHANCEMENT] Add metric `cortex_ruler_config_update_failures_total` to Ruler to track failures of loading rules files. #2857
- [ENHANCEMENT] Experimental Alertmanager: Alertmanager configuration persisted to object storage using an experimental API that accepts and returns YAML-based Alertmanager configuration. #2768
- [ENHANCEMENT] Ruler: `-ruler.alertmanager-url` now supports multiple URLs. Each URL is treated as a separate Alertmanager group. Support for multiple Alertmanagers in a group can be achieved by using DNS service discovery. #2851
- [ENHANCEMENT] Experimental blocks storage: Cortex Flusher now works with blocks engine. Flusher needs to be provided with blocks-engine configuration; existing Flusher flags are not used (they are only relevant for chunks engine). Note that flush errors are only reported via log. #2877
- [ENHANCEMENT] Flusher: Added `-flusher.exit-after-flush` option (defaults to true) to control whether Cortex should stop completely after Flusher has finished its work. #2877
- [ENHANCEMENT] Added metrics `cortex_config_hash` and `cortex_runtime_config_hash` to expose hash of the currently active config file. #2874
- [ENHANCEMENT] Logger: added JSON logging support, configured via the `-log.format=json` CLI flag or its respective YAML config option. #2386
- [ENHANCEMENT] Added new flags `-bigtable.grpc-compression`, `-ingester.client.grpc-compression`, `-querier.frontend-client.grpc-compression` to configure compression used by gRPC. Valid values are `gzip`, `snappy`, or empty string (no compression, default). #2940
- [ENHANCEMENT] Clarify limitations of the `/api/v1/series`, `/api/v1/labels` and `/api/v1/label/{name}/values` endpoints. #2953
- [ENHANCEMENT] Ingester: added `Dropped` outcome to metric `cortex_ingester_flushing_dequeued_series_total`. #2998
- [BUGFIX] Fixed a bug with `api/v1/query_range` where no responses would return null values for `result` and empty values for `resultType`. #2962
- [BUGFIX] Fixed a bug in the index intersect code causing storage to return more chunks/series than required. #2796
- [BUGFIX] Fixed the number of reported keys in the background cache queue. #2764
- [BUGFIX] Fix race in processing of headers in sharded queries. #2762
- [BUGFIX] Query Frontend: Do not re-split sharded requests around ingester boundaries. #2766
- [BUGFIX] Experimental Delete Series: Fixed a problem with cache generation numbers prefixed to cache keys. #2800
- [BUGFIX] Ingester: Flushing chunks via `/flush` endpoint could previously lead to panic, if chunks were already flushed before and then removed from memory during the flush caused by `/flush` handler. Immediate flush now doesn't cause chunks to be flushed again. Samples received during flush triggered via `/flush` handler are no longer discarded. #2778
- [BUGFIX] Prometheus upgraded. #2849
  - Fixed unknown symbol error during head compaction
- [BUGFIX] Fix panic when using cassandra as store for both index and delete requests. #2774
- [BUGFIX] Experimental Delete Series: Fixed a data race in Purger. #2817
- [BUGFIX] KV: Fixed a bug that triggered a panic due to metrics being registered with the same name but different labels when using a `multi` configured KV client. #2837
- [BUGFIX] Query-frontend: Fix passing HTTP `Host` header if `-frontend.downstream-url` is configured. #2880
- [BUGFIX] Ingester: Improve time-series distribution when `-experimental.distributor.user-subring-size` is enabled. #2887
- [BUGFIX] Set content type to `application/x-protobuf` for remote_read responses. #2915
- [BUGFIX] Fixed ruler and store-gateway instance registration in the ring (when sharding is enabled) when a new instance replaces an abruptly terminated one, and the only difference between the two instances is the address. #2954
- [BUGFIX] Fixed `Missing chunks and index config causing silent failure` Absence of chunks and index from schema config is not validated. #2732
- [BUGFIX] Fix panic caused by KVs from boltdb being used beyond their life. #2971
- [BUGFIX] Experimental blocks storage: `/api/v1/series`, `/api/v1/labels` and `/api/v1/label/{name}/values` only query the TSDB head regardless of the configured `-experimental.blocks-storage.tsdb.retention-period`. #2974
- [BUGFIX] Ingester: Avoid indefinite checkpointing in case of surge in number of series. #2955
- [BUGFIX] Querier: query /series from ingesters regardless of the `-querier.query-ingesters-within` setting. #3035
- [BUGFIX] Ruler: fixed an unintentional breaking change introduced in the ruler's `alertmanager_url` YAML config option, which changed the value from a string to a list of strings. #2989
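
The changelog entry for `cortex_ingester_flush_series_in_progress` (#2778) notes that flushing is complete once `cortex_ingester_flush_queue_length + cortex_ingester_flush_series_in_progress` is 0. A minimal sketch of that check, assuming you already have the ingester's metrics in the Prometheus text exposition format (how you fetch them is left out), might look as follows. The helper names `sample_total` and `flushes_finished` are hypothetical, and the parser is deliberately simplistic: it sums every sample of a metric and does not handle every corner of the exposition format.

```python
def sample_total(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of `metric_name` found in Prometheus text-format output."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{labels} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Unlabelled sample: name value [timestamp]
            name, _, rest = line.partition(" ")
        if name != metric_name:
            continue
        fields = rest.split()
        if fields:
            total += float(fields[0])
    return total


def flushes_finished(metrics_text: str) -> bool:
    """True when both the flush queue and the in-progress flush count are zero."""
    pending = sample_total(metrics_text, "cortex_ingester_flush_queue_length")
    running = sample_total(metrics_text, "cortex_ingester_flush_series_in_progress")
    return pending + running == 0


if __name__ == "__main__":
    example = (
        "# TYPE cortex_ingester_flush_queue_length gauge\n"
        "cortex_ingester_flush_queue_length 0\n"
        "cortex_ingester_flush_series_in_progress 2\n"
    )
    print(flushes_finished(example))
```

A script that calls the `/flush` handler could poll the metrics and wait until `flushes_finished` returns true before proceeding.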