This release has a number of bug-fixes and enhancements, particularly:
- Memberlist KV client is no longer considered experimental. #2725
- 3rd-party index and chunk stores using gRPC client/server plugin mechanism (experimental) #2220
- Using an invalid flag no longer causes printing of all available flags. #2691 (my favourite change!)
Many thanks to all contributors.
Detailed list of changes:
- [CHANGE] Metric `cortex_kv_request_duration_seconds` now includes `name` label to denote which client is being used as well as the `backend` label to denote the KV backend implementation in use. #2648
- [CHANGE] Experimental Ruler: Rule groups persisted to object storage using the experimental API have an updated object key encoding to better handle special characters. Rule groups previously-stored using object storage must be renamed to the new format. #2646
- [CHANGE] Query Frontend now uses Round Robin to choose a tenant queue to service next. #2553
- [CHANGE] `-promql.lookback-delta` is now deprecated and has been replaced by `-querier.lookback-delta` along with `lookback_delta` entry under `querier` in the config file. `-promql.lookback-delta` will be removed in v1.4.0. #2604
- [CHANGE] Experimental TSDB: removed `-experimental.tsdb.bucket-store.binary-index-header-enabled` flag. Now the binary index-header is always enabled.
- [CHANGE] Experimental TSDB: Renamed index-cache metrics to use original metric names from Thanos, as Cortex is not aggregating them in any way: #2627
  - `cortex_<service>_blocks_index_cache_items_evicted_total` => `thanos_store_index_cache_items_evicted_total{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_items_added_total` => `thanos_store_index_cache_items_added_total{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_requests_total` => `thanos_store_index_cache_requests_total{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_items_overflowed_total` => `thanos_store_index_cache_items_overflowed_total{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_hits_total` => `thanos_store_index_cache_hits_total{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_items` => `thanos_store_index_cache_items{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_items_size_bytes` => `thanos_store_index_cache_items_size_bytes{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_total_size_bytes` => `thanos_store_index_cache_total_size_bytes{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_memcached_operations_total` => `thanos_memcached_operations_total{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_memcached_operation_failures_total` => `thanos_memcached_operation_failures_total{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_memcached_operation_duration_seconds` => `thanos_memcached_operation_duration_seconds{name="index-cache"}`
  - `cortex_<service>_blocks_index_cache_memcached_operation_skipped_total` => `thanos_memcached_operation_skipped_total{name="index-cache"}`
- [CHANGE] Experimental TSDB: Renamed metrics in bucket stores: #2627
  - `cortex_<service>_blocks_meta_syncs_total` => `cortex_blocks_meta_syncs_total{component="<service>"}`
  - `cortex_<service>_blocks_meta_sync_failures_total` => `cortex_blocks_meta_sync_failures_total{component="<service>"}`
  - `cortex_<service>_blocks_meta_sync_duration_seconds` => `cortex_blocks_meta_sync_duration_seconds{component="<service>"}`
  - `cortex_<service>_blocks_meta_sync_consistency_delay_seconds` => `cortex_blocks_meta_sync_consistency_delay_seconds{component="<service>"}`
  - `cortex_<service>_blocks_meta_synced` => `cortex_blocks_meta_synced{component="<service>"}`
  - `cortex_<service>_bucket_store_block_loads_total` => `cortex_bucket_store_block_loads_total{component="<service>"}`
  - `cortex_<service>_bucket_store_block_load_failures_total` => `cortex_bucket_store_block_load_failures_total{component="<service>"}`
  - `cortex_<service>_bucket_store_block_drops_total` => `cortex_bucket_store_block_drops_total{component="<service>"}`
  - `cortex_<service>_bucket_store_block_drop_failures_total` => `cortex_bucket_store_block_drop_failures_total{component="<service>"}`
  - `cortex_<service>_bucket_store_blocks_loaded` => `cortex_bucket_store_blocks_loaded{component="<service>"}`
  - `cortex_<service>_bucket_store_series_data_touched` => `cortex_bucket_store_series_data_touched{component="<service>"}`
  - `cortex_<service>_bucket_store_series_data_fetched` => `cortex_bucket_store_series_data_fetched{component="<service>"}`
  - `cortex_<service>_bucket_store_series_data_size_touched_bytes` => `cortex_bucket_store_series_data_size_touched_bytes{component="<service>"}`
  - `cortex_<service>_bucket_store_series_data_size_fetched_bytes` => `cortex_bucket_store_series_data_size_fetched_bytes{component="<service>"}`
  - `cortex_<service>_bucket_store_series_blocks_queried` => `cortex_bucket_store_series_blocks_queried{component="<service>"}`
  - `cortex_<service>_bucket_store_series_get_all_duration_seconds` => `cortex_bucket_store_series_get_all_duration_seconds{component="<service>"}`
  - `cortex_<service>_bucket_store_series_merge_duration_seconds` => `cortex_bucket_store_series_merge_duration_seconds{component="<service>"}`
  - `cortex_<service>_bucket_store_series_refetches_total` => `cortex_bucket_store_series_refetches_total{component="<service>"}`
  - `cortex_<service>_bucket_store_series_result_series` => `cortex_bucket_store_series_result_series{component="<service>"}`
  - `cortex_<service>_bucket_store_cached_postings_compressions_total` => `cortex_bucket_store_cached_postings_compressions_total{component="<service>"}`
  - `cortex_<service>_bucket_store_cached_postings_compression_errors_total` => `cortex_bucket_store_cached_postings_compression_errors_total{component="<service>"}`
  - `cortex_<service>_bucket_store_cached_postings_compression_time_seconds` => `cortex_bucket_store_cached_postings_compression_time_seconds{component="<service>"}`
  - `cortex_<service>_bucket_store_cached_postings_original_size_bytes_total` => `cortex_bucket_store_cached_postings_original_size_bytes_total{component="<service>"}`
  - `cortex_<service>_bucket_store_cached_postings_compressed_size_bytes_total` => `cortex_bucket_store_cached_postings_compressed_size_bytes_total{component="<service>"}`
  - `cortex_<service>_blocks_sync_seconds` => `cortex_bucket_stores_blocks_sync_seconds{component="<service>"}`
  - `cortex_<service>_blocks_last_successful_sync_timestamp_seconds` => `cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="<service>"}`
- [CHANGE] Available command-line flags are printed to stdout, and only when requested via `-help`. Using an invalid flag no longer causes printing of all available flags. #2691
- [CHANGE] Experimental Memberlist ring: randomize gossip node names to avoid conflicts when running multiple clients on the same host, or reusing host names (e.g. pods in a statefulset). Node name randomization can be disabled by using `-memberlist.randomize-node-name=false`. #2715
- [CHANGE] Memberlist KV client is no longer considered experimental. #2725
- [CHANGE] Change target flag for purger from `data-purger` to `purger` and make delete request cancellation duration configurable. #2760
- [CHANGE] Removed `-store.fullsize-chunks` option which was undocumented and unused (it broke ingester hand-overs). #2656
- [CHANGE] Query with no metric name that has previously resulted in HTTP status code 500 now returns status code 422 instead. #2571
- [FEATURE] TLS config options added for GRPC clients in Querier (Query-frontend client & Ingester client), Ruler, Store Gateway, as well as HTTP client in Config store client. #2502
- [FEATURE] The flag `-frontend.max-cache-freshness` is now supported within the limits overrides, to specify per-tenant max cache freshness values. The corresponding YAML config parameter has been changed from `results_cache.max_freshness` to `limits_config.max_cache_freshness`. The legacy YAML config parameter (`results_cache.max_freshness`) will continue to be supported till Cortex release `v1.4.0`. #2609
- [FEATURE] Experimental gRPC Store: Added support to 3rd parties index and chunk stores using gRPC client/server plugin mechanism. #2220
- [ENHANCEMENT] Propagate GOPROXY value when building `build-image`. This helps anyone building the code on a network where the default Go proxy is not accessible (e.g. when behind some corporate VPN). #2741
- [ENHANCEMENT] Querier: Added metric `cortex_querier_request_duration_seconds` for all requests to the querier. #2708
- [ENHANCEMENT] Cortex is now built with Go 1.14. #2480 #2749 #2753
- [ENHANCEMENT] Experimental TSDB: added the following metrics to the ingester: #2580 #2583 #2589 #2654
  - `cortex_ingester_tsdb_appender_add_duration_seconds`
  - `cortex_ingester_tsdb_appender_commit_duration_seconds`
  - `cortex_ingester_tsdb_refcache_purge_duration_seconds`
  - `cortex_ingester_tsdb_compactions_total`
  - `cortex_ingester_tsdb_compaction_duration_seconds`
  - `cortex_ingester_tsdb_wal_fsync_duration_seconds`
  - `cortex_ingester_tsdb_wal_page_flushes_total`
  - `cortex_ingester_tsdb_wal_completed_pages_total`
  - `cortex_ingester_tsdb_wal_truncations_failed_total`
  - `cortex_ingester_tsdb_wal_truncations_total`
  - `cortex_ingester_tsdb_wal_writes_failed_total`
  - `cortex_ingester_tsdb_checkpoint_deletions_failed_total`
  - `cortex_ingester_tsdb_checkpoint_deletions_total`
  - `cortex_ingester_tsdb_checkpoint_creations_failed_total`
  - `cortex_ingester_tsdb_checkpoint_creations_total`
  - `cortex_ingester_tsdb_wal_truncate_duration_seconds`
  - `cortex_ingester_tsdb_head_active_appenders`
  - `cortex_ingester_tsdb_head_series_not_found_total`
  - `cortex_ingester_tsdb_head_chunks`
  - `cortex_ingester_tsdb_mmap_chunk_corruptions_total`
  - `cortex_ingester_tsdb_head_chunks_created_total`
  - `cortex_ingester_tsdb_head_chunks_removed_total`
- [ENHANCEMENT] Experimental TSDB: added metrics useful to alert on critical conditions of the blocks storage: #2573
  - `cortex_compactor_last_successful_run_timestamp_seconds`
  - `cortex_querier_blocks_last_successful_sync_timestamp_seconds` (when store-gateway is disabled)
  - `cortex_querier_blocks_last_successful_scan_timestamp_seconds` (when store-gateway is enabled)
  - `cortex_storegateway_blocks_last_successful_sync_timestamp_seconds`
- [ENHANCEMENT] Experimental TSDB: added the flag `-experimental.tsdb.wal-compression-enabled` to enable TSDB WAL compression. #2585
- [ENHANCEMENT] Experimental TSDB: Querier and store-gateway components can now use a so-called "caching bucket", which can currently cache fetched chunks into a shared memcached server. #2572
- [ENHANCEMENT] Ruler: Automatically remove unhealthy rulers from the ring. #2587
- [ENHANCEMENT] Query-tee: added support for the `/metadata`, `/alerts`, and `/rules` endpoints. #2600
- [ENHANCEMENT] Query-tee: added support for comparing query results between two different backends. The comparison is disabled by default and can be enabled via `-proxy.compare-responses=true`. #2611
- [ENHANCEMENT] Query-tee: improved the query-tee to not wait for all backend responses before sending back the response to the client. The query-tee now sends back to the client the first successful response, while honoring the `-backend.preferred` option. #2702
- [ENHANCEMENT] Thanos and Prometheus upgraded. #2602 #2604 #2634 #2659 #2686 #2756
  - TSDB now holds fewer WAL files after Head Truncation.
  - TSDB now does memory-mapping of Head chunks and reduces memory usage.
- [ENHANCEMENT] Experimental TSDB: decoupled blocks deletion from blocks compaction in the compactor, so that blocks deletion is not blocked by a busy compactor. The following metrics have been added: #2623
  - `cortex_compactor_block_cleanup_started_total`
  - `cortex_compactor_block_cleanup_completed_total`
  - `cortex_compactor_block_cleanup_failed_total`
  - `cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds`
- [ENHANCEMENT] Experimental TSDB: Use shared cache for metadata. This is especially useful when running multiple querier and store-gateway components to reduce number of object store API calls. #2626 #2640
- [ENHANCEMENT] Experimental TSDB: when `-querier.query-store-after` is configured and running the experimental blocks storage, the time range of the query sent to the store is now manipulated to ensure the query end time is not more recent than 'now - query-store-after'. #2642
- [ENHANCEMENT] Experimental TSDB: small performance improvement in concurrent usage of RefCache, used during samples ingestion. #2651
- [ENHANCEMENT] The following endpoints now respond appropriately to an `Accept` header with the value `application/json`: #2673
  - `/distributor/all_user_stats`
  - `/distributor/ha_tracker`
  - `/ingester/ring`
  - `/store-gateway/ring`
  - `/compactor/ring`
  - `/ruler/ring`
  - `/services`
- [ENHANCEMENT] Experimental Cassandra backend: Add `-cassandra.num-connections` to allow increasing the number of TCP connections to each Cassandra server. #2666
- [ENHANCEMENT] Experimental Cassandra backend: Use separate Cassandra clients and connections for reads and writes. #2666
- [ENHANCEMENT] Experimental Cassandra backend: Add `-cassandra.reconnect-interval` to allow specifying the reconnect interval to a Cassandra server that has been marked `DOWN` by the gocql driver. Also change the default value of the reconnect interval from `60s` to `1s`. #2687
- [ENHANCEMENT] Experimental Cassandra backend: Add option `-cassandra.convict-hosts-on-failure=false` to not convict host of being down when a request fails. #2684
- [ENHANCEMENT] Experimental TSDB: Applied a jitter to the period bucket scans in order to better distribute bucket operations over the time and increase the probability of hitting the shared cache (if configured). #2693
- [ENHANCEMENT] Experimental TSDB: Series limit per user and per metric now work in TSDB blocks. #2676
- [ENHANCEMENT] Experimental Memberlist: Added ability to periodically rejoin the memberlist cluster. #2724
- [ENHANCEMENT] Experimental Delete Series: Added the following metrics for monitoring processing of delete requests: #2730
  - `cortex_purger_load_pending_requests_attempts_total`: Number of attempts that were made to load pending requests with status.
  - `cortex_purger_oldest_pending_delete_request_age_seconds`: Age of oldest pending delete request in seconds.
  - `cortex_purger_pending_delete_requests_count`: Count of requests which are in process or are ready to be processed.
- [ENHANCEMENT] Experimental TSDB: Improved compactor to also hard-delete partial blocks with a deletion mark (even if the deletion mark threshold has not been reached). #2751
- [ENHANCEMENT] Experimental TSDB: Introduced a consistency check done by the querier to ensure all expected blocks have been queried via the store-gateway. If a block is missing on a store-gateway, the querier retries fetching series from missing blocks up to 3 times. If the consistency check fails once all retries have been exhausted, the query execution fails. The following metrics have been added: #2593 #2630 #2689 #2695
  - `cortex_querier_blocks_consistency_checks_total`
  - `cortex_querier_blocks_consistency_checks_failed_total`
  - `cortex_querier_storegateway_refetches_per_query`
- [ENHANCEMENT] Delete requests can now be canceled #2555
- [ENHANCEMENT] Table manager can now provision tables for delete store #2546
- [BUGFIX] Ruler: Ensure temporary rule files with special characters are properly mapped and cleaned up. #2506
- [BUGFIX] Fixes #2411, Ensure requests are properly routed to the prometheus api embedded in the query if `-server.path-prefix` is set. #2372
- [BUGFIX] Experimental TSDB: fixed chunk data corruption when querying back series using the experimental blocks storage. #2400
- [BUGFIX] Fixed collection of tracing spans from Thanos components used internally. #2655
- [BUGFIX] Experimental TSDB: fixed memory leak in ingesters. #2586
- [BUGFIX] QueryFrontend: fixed a situation where HTTP error is ignored and an incorrect status code is set. #2590
- [BUGFIX] Ingester: Fix an ingester starting up in the JOINING state and staying there forever. #2565
- [BUGFIX] QueryFrontend: fixed a panic (`integer divide by zero`) in the query-frontend. The query-frontend now requires the `-querier.default-evaluation-interval` config to be set to the same value as in the querier. #2614
- [BUGFIX] Experimental TSDB: when the querier receives a `/series` request with a time range older than the data stored in the ingester, it now ignores the requested time range and returns known series anyway instead of returning an empty response. This aligns the behaviour with the chunks storage. #2617
- [BUGFIX] Cassandra: fixed an edge case leading to an invalid CQL query when querying the index on a Cassandra store. #2639
- [BUGFIX] Ingester: increment series per metric when recovering from WAL or transfer. #2674
- [BUGFIX] Fixed `wrong number of arguments for 'mget' command` Redis error when a query has no chunks to lookup from storage. #2700
- [BUGFIX] Ingester: Automatically remove old tmp checkpoints, fixing a potential disk space leak after an ingester crashes. #2726
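
The blocks-storage metric renames listed under #2627 follow a regular pattern, so dashboards and alerts could be migrated with a small script. The sketch below is one possible way to do that; `rename_metric` is a hypothetical helper and not part of Cortex, and the `service` value (`querier` in the usage note) is an assumption about what replaces `<service>` in a given deployment.

```python
def rename_metric(old_name: str, service: str) -> str:
    """Map a pre-rename blocks-storage metric name to its new selector.

    Hypothetical helper (not part of Cortex). It follows the rename lists
    in this changelog; names outside those lists are returned unchanged.
    """
    prefix = f"cortex_{service}_"
    if not old_name.startswith(prefix):
        return old_name
    rest = old_name[len(prefix):]

    # Index-cache metrics now use the original Thanos names.
    cache_prefix = "blocks_index_cache_"
    if rest.startswith(cache_prefix):
        tail = rest[len(cache_prefix):]
        if tail.startswith("memcached_"):
            return f'thanos_{tail}{{name="index-cache"}}'
        return f'thanos_store_index_cache_{tail}{{name="index-cache"}}'

    # These two metrics gain a "bucket_stores" prefix.
    if rest in ("blocks_sync_seconds",
                "blocks_last_successful_sync_timestamp_seconds"):
        return f'cortex_bucket_stores_{rest}{{component="{service}"}}'

    # Remaining bucket-store metrics move the service into a label.
    if rest.startswith(("blocks_meta_", "bucket_store_")):
        return f'cortex_{rest}{{component="{service}"}}'

    return old_name
```

For example, `rename_metric("cortex_querier_blocks_meta_syncs_total", "querier")` yields `cortex_blocks_meta_syncs_total{component="querier"}`. Queries that already carry a label matcher would need the new label merged into the existing matcher set, which this sketch does not attempt.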
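
The `-querier.query-store-after` entry (#2642) describes capping the end of the time range sent to the store at 'now - query-store-after'. A minimal sketch of that rule follows. It is an illustration only and not the Cortex implementation: the function name is hypothetical, and two behaviours are assumptions not stated in the changelog (a zero duration disables the cap, and a range that starts after the cap is skipped entirely).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def clamp_store_query_range(
    start: datetime,
    end: datetime,
    now: datetime,
    query_store_after: timedelta,
) -> Optional[Tuple[datetime, datetime]]:
    """Return the (start, end) range to send to the store, or None to skip it."""
    # Assumption: a zero (unset) duration means the cap is not applied.
    if query_store_after <= timedelta(0):
        return start, end

    max_end = now - query_store_after

    # Assumption: if the whole range is more recent than the cap,
    # there is nothing to ask the store for.
    if start > max_end:
        return None

    # The query end time must not be more recent than now - query-store-after.
    return start, min(end, max_end)


now = datetime(2020, 6, 1, 12, 0, tzinfo=timezone.utc)
print(clamp_store_query_range(now - timedelta(hours=3), now, now, timedelta(hours=1)))
```

With a one-hour setting, a query covering the last three hours is sent to the store with its end moved back to one hour before `now`.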
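
The Query Frontend change in #2553 says a tenant queue is now chosen round-robin. The sketch below shows the general idea of round-robin selection across per-tenant queues; it is not the query-frontend code, and the class, method names, and the choice to drop a tenant from the rotation once its queue is empty are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class TenantQueues:
    """Illustrative round-robin scheduler over per-tenant FIFO queues."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {}
        self._order: Deque[str] = deque()  # tenants waiting for their turn

    def enqueue(self, tenant: str, request: str) -> None:
        if tenant not in self._queues:
            self._queues[tenant] = deque()
            self._order.append(tenant)  # new tenant joins the end of the rotation
        self._queues[tenant].append(request)

    def next_request(self) -> Optional[Tuple[str, str]]:
        if not self._order:
            return None
        tenant = self._order.popleft()
        queue = self._queues[tenant]
        request = queue.popleft()
        if queue:
            self._order.append(tenant)  # still has work: back of the rotation
        else:
            del self._queues[tenant]  # assumption: idle tenants leave the rotation
        return tenant, request


queues = TenantQueues()
for req in ("q1", "q2", "q3"):
    queues.enqueue("tenant-a", req)
queues.enqueue("tenant-b", "q4")
print(queues.next_request(), queues.next_request())
```

Here `tenant-b` is served second even though `tenant-a` enqueued three requests first, which is the fairness property round-robin selection is meant to provide.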