Cortex 1.8.0 features 122 contributions by 35 authors. Thank you!
Highlights
- Automatic deletion of old blocks with configurable per-tenant retention
- Introduction of new storage options in Ruler and Alertmanager, using the bucket client from Thanos. The previous storage options will be deprecated in the next release.
- New `thanosconvert` tool to migrate Thanos or Prometheus block metadata to Cortex
- Support for `@ <timestamp>` in PromQL (needs to be enabled by flag)
- Configurable per-tenant server-side encryption for S3
- Work on sharding Alertmanager continues (not finished yet)
Changelog
- [CHANGE] Alertmanager: Don't expose cluster information to tenants via the `/alertmanager/api/v1/status` API endpoint when operating with clustering enabled. #3903
- [CHANGE] Ingester: don't update internal "last updated" timestamp of TSDB if tenant only sends invalid samples. This affects how "idle" time is computed. #3727
- [CHANGE] Require explicit flag `-<prefix>.tls-enabled` to enable TLS in gRPC clients. Previously it was enough to specify a TLS flag to enable TLS validation. #3156
- [CHANGE] Query-frontend: removed `-querier.split-queries-by-day` (deprecated in Cortex 0.4.0). Please use `-querier.split-queries-by-interval` instead. #3813
- [CHANGE] Store-gateway: the chunks pool controlled by `-blocks-storage.bucket-store.max-chunk-pool-bytes` is now shared across all tenants. #3830
- [CHANGE] Ingester: return error code 400 instead of 429 when per-user/per-tenant series/metadata limits are reached. #3833
- [CHANGE] Compactor: add `reason` label to `cortex_compactor_blocks_marked_for_deletion_total` metric. Source blocks marked for deletion by compactor are labelled as `compaction`, while blocks passing the retention period are labelled as `retention`. #3879
- [CHANGE] Alertmanager: the `DELETE /api/v1/alerts` endpoint is now idempotent. No error is returned if the alertmanager config doesn't exist. #3888
- [FEATURE] Experimental Ruler Storage: Add a separate set of configuration options to configure the ruler storage backend under the `-ruler-storage.` flag prefix. All blocks storage bucket clients and the config service are currently supported. Clients using this implementation will only be enabled if the existing `-ruler.storage` flags are left unset. #3805 #3864
- [FEATURE] Experimental Alertmanager Storage: Add a separate set of configuration options to configure the alertmanager storage backend under the `-alertmanager-storage.` flag prefix. All blocks storage bucket clients and the config service are currently supported. Clients using this implementation will only be enabled if the existing `-alertmanager.storage` flags are left unset. #3888
- [FEATURE] Added support for S3 server-side encryption using KMS. The S3 server-side encryption config can be overridden on a per-tenant basis for the blocks storage, ruler and alertmanager. `-<prefix>.s3.sse-encryption` is deprecated; please use the following newly added CLI flags instead. #3651 #3810 #3811 #3870 #3886 #3906
  - `-<prefix>.s3.sse.type`
  - `-<prefix>.s3.sse.kms-key-id`
  - `-<prefix>.s3.sse.kms-encryption-context`
- [FEATURE] Querier: Enable `@ <timestamp>` modifier in PromQL using the new `-querier.at-modifier-enabled` flag. #3744
- [FEATURE] Overrides Exporter: Add `overrides-exporter` module for exposing per-tenant resource limit overrides as metrics. It is not included in the `all` target (single-binary mode), and must be explicitly enabled. #3785
- [FEATURE] Experimental thanosconvert: introduce an experimental tool `thanosconvert` to migrate Thanos block metadata to Cortex metadata. #3770
- [FEATURE] Alertmanager: It now shards the `/api/v1/alerts` API using the ring when sharding is enabled. #3671
  - Added `-alertmanager.max-recv-msg-size` (defaults to 16M) to limit the size of HTTP request body handled by the alertmanager.
  - New flags added for communication between alertmanagers:
    - `-alertmanager.max-recv-msg-size`
    - `-alertmanager.alertmanager-client.remote-timeout`
    - `-alertmanager.alertmanager-client.tls-enabled`
    - `-alertmanager.alertmanager-client.tls-cert-path`
    - `-alertmanager.alertmanager-client.tls-key-path`
    - `-alertmanager.alertmanager-client.tls-ca-path`
    - `-alertmanager.alertmanager-client.tls-server-name`
    - `-alertmanager.alertmanager-client.tls-insecure-skip-verify`
- [FEATURE] Compactor: added blocks storage per-tenant retention support. This is configured via `-compactor.retention-period`, and can be overridden on a per-tenant basis. #3879
- [ENHANCEMENT] Queries: Instrument queries that were discarded due to the configured `max_outstanding_requests_per_tenant`. #3894
  - `cortex_query_frontend_discarded_requests_total`
  - `cortex_query_scheduler_discarded_requests_total`
- [ENHANCEMENT] Ruler: Add TLS and explicit basic authentication configuration options for the HTTP client the ruler uses to communicate with the alertmanager. #3752
  - `-ruler.alertmanager-client.basic-auth-username`: Configure the basic authentication username used by the client. Takes precedence over a username configured in the URL.
  - `-ruler.alertmanager-client.basic-auth-password`: Configure the basic authentication password used by the client. Takes precedence over a password configured in the URL.
  - `-ruler.alertmanager-client.tls-ca-path`: File path to the CA file.
  - `-ruler.alertmanager-client.tls-cert-path`: File path to the TLS certificate.
  - `-ruler.alertmanager-client.tls-insecure-skip-verify`: Boolean to disable verifying the certificate.
  - `-ruler.alertmanager-client.tls-key-path`: File path to the TLS key.
  - `-ruler.alertmanager-client.tls-server-name`: Expected name on the TLS certificate.
- [ENHANCEMENT] Ingester: exposed metric `cortex_ingester_oldest_unshipped_block_timestamp_seconds`, tracking the unix timestamp of the oldest TSDB block not shipped to the storage yet. #3705
- [ENHANCEMENT] Prometheus upgraded. #3739 #3806
  - Avoid unnecessary `runtime.GC()` during compactions.
  - Prevent compaction loop in TSDB on data gap.
- [ENHANCEMENT] Query-Frontend now returns server-side performance metrics using the `Server-Timing` header when query stats is enabled. #3685
- [ENHANCEMENT] Runtime Config: Add a `mode` query parameter for the runtime config endpoint. `/runtime_config?mode=diff` now shows the YAML runtime configuration with all values that differ from the defaults. #3700
- [ENHANCEMENT] Distributor: Enable downstream projects to wrap the distributor push function and access the deserialized write requests before/after they are pushed. #3755
- [ENHANCEMENT] Add flag `-<prefix>.tls-server-name` to require a specific server name instead of the hostname on the certificate. #3156
- [ENHANCEMENT] Alertmanager: Remove a tenant's alertmanager instead of pausing it as we determine it is no longer needed. #3722
- [ENHANCEMENT] Blocks storage: added more configuration options to S3 client. #3775
  - `-blocks-storage.s3.tls-handshake-timeout`: Maximum time to wait for a TLS handshake. 0 means no limit.
  - `-blocks-storage.s3.expect-continue-timeout`: The time to wait for a server's first response headers after fully writing the request headers if the request has an Expect header. 0 to send the request body immediately.
  - `-blocks-storage.s3.max-idle-connections`: Maximum number of idle (keep-alive) connections across all hosts. 0 means no limit.
  - `-blocks-storage.s3.max-idle-connections-per-host`: Maximum number of idle (keep-alive) connections to keep per-host. If 0, a built-in default value is used.
  - `-blocks-storage.s3.max-connections-per-host`: Maximum number of connections per host. 0 means no limit.
- [ENHANCEMENT] Ingester: when a tenant's TSDB is closed, the ingester now removes pushed metrics-metadata from memory, and removes the tenant's metadata metrics (`cortex_ingester_memory_metadata`, `cortex_ingester_memory_metadata_created_total`, `cortex_ingester_memory_metadata_removed_total`) and validation metrics (`cortex_discarded_samples_total`, `cortex_discarded_metadata_total`). #3782
- [ENHANCEMENT] Distributor: cleanup metrics for inactive tenants. #3784
- [ENHANCEMENT] Ingester: the ingester now re-emits the following TSDB metrics. #3800
  - `cortex_ingester_tsdb_blocks_loaded`
  - `cortex_ingester_tsdb_reloads_total`
  - `cortex_ingester_tsdb_reloads_failures_total`
  - `cortex_ingester_tsdb_symbol_table_size_bytes`
  - `cortex_ingester_tsdb_storage_blocks_bytes`
  - `cortex_ingester_tsdb_time_retentions_total`
- [ENHANCEMENT] Querier: distribute workload across `-store-gateway.sharding-ring.replication-factor` store-gateway replicas when querying blocks and `-store-gateway.sharding-enabled=true`. #3824
- [ENHANCEMENT] Distributor / HA Tracker: added cleanup of unused elected HA replicas from KV store. Added the following metrics to monitor this process: #3809
  - `cortex_ha_tracker_replicas_cleanup_started_total`
  - `cortex_ha_tracker_replicas_cleanup_marked_for_deletion_total`
  - `cortex_ha_tracker_replicas_cleanup_deleted_total`
  - `cortex_ha_tracker_replicas_cleanup_delete_failed_total`
- [ENHANCEMENT] Ruler now has a new API endpoint `/ruler/delete_tenant_config` that can be used to delete all ruler groups for a tenant. It is intended to be used by administrators who wish to clean up state after a user has been removed. Note that this endpoint is enabled regardless of `-experimental.ruler.enable-api`. #3750 #3899
- [ENHANCEMENT] Query-frontend, query-scheduler: cleanup metrics for inactive tenants. #3826
- [ENHANCEMENT] Blocks storage: added `-blocks-storage.s3.region` support to S3 client configuration. #3811
- [ENHANCEMENT] Distributor: Remove cached subrings for inactive users when using shuffle sharding. #3849
- [ENHANCEMENT] Store-gateway: Reduced memory used to fetch chunks at query time. #3855
- [ENHANCEMENT] Ingester: attempt to prevent idle compaction from happening in concurrent ingesters by introducing a 25% jitter to the configured idle timeout (`-blocks-storage.tsdb.head-compaction-idle-timeout`). #3850
- [ENHANCEMENT] Compactor: cleanup local files for users that are no longer owned by compactor. #3851
- [ENHANCEMENT] Store-gateway: close empty bucket stores, and delete leftover local files for tenants that no longer belong to store-gateway. #3853
- [ENHANCEMENT] Store-gateway: added metrics to track partitioner behaviour. #3877
  - `cortex_bucket_store_partitioner_requested_bytes_total`
  - `cortex_bucket_store_partitioner_requested_ranges_total`
  - `cortex_bucket_store_partitioner_expanded_bytes_total`
  - `cortex_bucket_store_partitioner_expanded_ranges_total`
- [ENHANCEMENT] Store-gateway: added metrics to monitor chunk buffer pool behaviour. #3880
  - `cortex_bucket_store_chunk_pool_requested_bytes_total`
  - `cortex_bucket_store_chunk_pool_returned_bytes_total`
- [ENHANCEMENT] Alertmanager: load alertmanager configurations from object storage concurrently, and only load necessary configurations, speeding up the configuration synchronization process and executing fewer "GET object" operations to the storage when sharding is enabled. #3898
- [ENHANCEMENT] Ingester (blocks storage): Ingester can now stream entire chunks instead of individual samples to the querier. At the moment this feature must be explicitly enabled either by using the `-ingester.stream-chunks-when-using-blocks` flag or the `ingester_stream_chunks_when_using_blocks` (boolean) field in the runtime config file, but these configuration options are temporary and will be removed when the feature is stable. #3889
- [ENHANCEMENT] Alertmanager: New endpoint `/multitenant_alertmanager/delete_tenant_config` to delete configuration for the tenant identified by the `X-Scope-OrgID` header. This is an internal endpoint, available even if the Alertmanager API is not enabled by using `-experimental.alertmanager.enable-api`. #3900
- [BUGFIX] Cortex: Fixed issue where fatal errors and various log messages were not logged. #3778
- [BUGFIX] HA Tracker: don't track an intentionally aborted CAS operation as an error in the `cortex_kv_request_duration_seconds` metric. #3745
- [BUGFIX] Querier / ruler: do not log "error removing stale clients" if the ring is empty. #3761
- [BUGFIX] Store-gateway: fixed a panic caused by a race condition when the index-header lazy loading is enabled. #3775 #3789
- [BUGFIX] Compactor: fixed "could not guess file size" log when uploading blocks deletion marks to the global location. #3807
- [BUGFIX] Prevent panic at start if the http_prefix setting doesn't have a valid value. #3796
- [BUGFIX] Memberlist: fixed panic caused by race condition in `armon/go-metrics` used by memberlist client. #3725
- [BUGFIX] Querier: returning 422 (instead of 500) when query hits `max_chunks_per_query` limit with block storage. #3895
- [BUGFIX] Alertmanager: Ensure that experimental `/api/v1/alerts` endpoints work when `-http.prefix` is empty. #3905
- [BUGFIX] Chunk store: fix panic in inverted index when deleted fingerprint is no longer in the index. #3543