github litmuschaos/litmus 1.10.0
3 years ago

New Features & Enhancements

  • Introduces the alpha-2 version of Litmus Portal with:

    • Ability to configure a custom source of chaos charts (experiment custom resources), a.k.a. “MyHub”, for a project
    • Support for full CRUD operations on chaos (argo) workflows
    • Support for graceful removal of connected cluster targets
    • Optimized workflow for self-cluster connect, i.e., the ability to add the cluster hosting the portal itself as a target
    • Enhanced event handling for chaos workflows
    • Improved resiliency of the portal front-end
  • Adds support for resource filtering and chaos injection on pods managed by Argo Rollout resources, facilitating validation of blue-green & canary deployments

  • Promotes multiarch (amd64, arm64) docker images for all major litmus infra components: chaos-operator, chaos-runner, go-runner, chaos-exporter

  • Introduces a new probe mode, “OnChaos”, which verifies steady state only during the chaos injection period. This is especially useful for “negative-test” scenarios where the result of the steady-state checks depends on the unavailability of certain services.

  • Extends the scope of the cmdProbe by supporting complex criteria against different output types: integer/float (equal to, less than, less than or equal to, greater than, greater than or equal to) and strings (substring, string match)

  • Paves the way for finer application filtering and resource-specific status checks by propagating the application kind to the experiment job.

  • Supports definition of taint tolerations in the chaos-runner & experiment pods via ChaosEngine to enable scheduling of chaos resources on nodes specifically tainted for this purpose.

  • Supports the specification of NodeSelector in chaos-runner pods via ChaosEngine for guaranteed-schedule on dedicated nodes.

  • Includes experiments to induce chaos on platform resources (AWS) as part of the kube-aws experiment suite:

    • Terminates EC2 instances (cluster nodes) using a native litmus chaoslib that leverages the AWS Go SDK
    • Induces disk loss via detachment of EBS volumes/disks attached to the specified instance

  • Introduces an SSH-based node restart experiment to the generic experiment suite (tech preview)

  • Lists use-cases for testing resiliency of Kubernetes system and add-on components (kube-proxy, kiam, calico, etc.) based on pod-delete chaos under the kube-components suite

  • Provides an option to specify blast-radius (NODES_AFFECTED_PERCENTAGE) for node-level resource chaos experiments

  • Allows specification of a comma-separated list of target pods or nodes in cases where a known set of objects needs to be targeted.

  • Adds specification of an optional VOLUME_MOUNT_PATH env variable to the pod-level IO stress experiment, thereby allowing capacity/stress chaos against both ephemeral and persistent storage volumes.

  • Enhances the pod-autoscaler experiment to:

    • Act on statefulsets, apart from deployments
    • Roll back immediately to the initial replica count when the experiment is aborted
    • Use chaos-duration as the upper limit for pod scale
  • Enhances the default pre-chaos criteria on the respective infra-level experiments to check infra components health (nodes, disk) apart from just the applications under test / auxiliary applications

  • Homogenizes the environment variable naming patterns across experiments for pod and node details, and improves probe logs to be more descriptive of the status and errors.

  • Adds more validation capability to the admission controller (presence of application namespace) along with increasing unit-test coverage

  • Improves the experiment e2e suite with tests for all the newly included enhancements, and adds validation (chaos-execution checks) for network & resource chaos experiments

  • Provides a new helm chart for Litmus Portal with the ability to control the mode of portal operation (namespaced vs. cluster scope) amongst other tunables

  • Enhances the litmus documentation with steps for helm based install, references to learning resources (tutorials, arch slides), docs for the newly added experiments & improved contributing guide.

  • Dockerizes the litmus-demo script to ease demo steps

  • SIG-Orchestration also became operational during this release period. Refer to the SIG meeting notes for details.
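
The cmdProbe criteria listed above (numeric comparisons for integer/float output, substring and exact match for strings) can be sketched roughly as follows. This is an illustrative sketch only, not the Litmus implementation: the function name `compare` and the criteria spellings (`==`, `<=`, `contains`, `equal`, and so on) are assumptions chosen for the example, since the release notes describe the criteria only in words.

```python
import operator

# Hypothetical criteria spellings; the release notes describe them only in words.
NUMERIC_CRITERIA = {
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}
STRING_CRITERIA = {
    "equal": lambda actual, expected: actual == expected,     # exact string match
    "contains": lambda actual, expected: expected in actual,  # substring match
}
CASTS = {"int": int, "float": float}


def compare(output: str, kind: str, criteria: str, expected: str) -> bool:
    """Return True when a command's output satisfies the given criteria."""
    actual = output.strip()  # command output usually ends with a newline
    if kind in CASTS:
        cast = CASTS[kind]
        return NUMERIC_CRITERIA[criteria](cast(actual), cast(expected))
    if kind == "string":
        return STRING_CRITERIA[criteria](actual, expected)
    raise ValueError(f"unsupported output type: {kind}")
```

For instance, `compare("3\n", "int", ">=", "3")` passes, while `compare("Running", "string", "equal", "running")` fails because the string match here is case-sensitive.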

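The two targeting options described above (a blast radius given as NODES_AFFECTED_PERCENTAGE, and an explicit comma-separated list of pods or nodes) could interact roughly as in the sketch below. All function names are hypothetical. The sketch assumes the percentage is rounded down with a minimum of one target, and it takes the first N candidates for determinism; the actual experiments may round and pick targets differently (for example, at random).

```python
def parse_targets(raw: str) -> list[str]:
    """Split a comma-separated value into names, ignoring spaces and empty items."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def affected_count(total: int, percentage: int) -> int:
    """Number of targets for a blast radius (assumed: floor rounding, at least one)."""
    if total <= 0:
        return 0
    return max(1, min(total, total * percentage // 100))


def select_targets(candidates: list[str], explicit: str, percentage: int) -> list[str]:
    """Prefer an explicit target list; otherwise take a percentage of the candidates."""
    named = parse_targets(explicit)
    if named:
        # Keep only the named targets that actually exist among the candidates.
        return [name for name in named if name in candidates]
    return candidates[: affected_count(len(candidates), percentage)]
```

With four candidate nodes and no explicit list, a percentage of 50 selects two of them; an explicit value such as `"node-2, node-9"` selects only `node-2` if `node-9` is not a candidate.
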
Major Bug Fixes

  • Prevents attempts to generate call-home metrics when the ANALYTICS environment variable is set to false on the chaos operator deployment. Multiple failed attempts to send the g.analytics events in air-gapped environments were seen to result in additional time taken to launch the experiment jobs (nearly 10-12s)

  • Reduces the time taken between successive events on the chaos-runner and also fixes the behavior of missed events

  • Optimizes the time taken to gauge successful experiment pod schedule and completion via reduced polling intervals

  • Fixes the behavior where the chaos events are overridden when more than one experiment is listed in the ChaosEngine

  • Fixes issues with the CI scripts in the chaos-charts repo that led to repetition/duplication of experiments in the suite/category-wise concatenated experiments.yaml

  • Fixes incorrect schema in probe examples in the documentation
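
The call-home fix above amounts to checking the ANALYTICS environment variable before attempting to send anything. A minimal sketch of that guard is shown below. The function names are hypothetical, and the sketch assumes analytics stays enabled whenever the variable is unset or holds any value other than "false"; the operator itself is written in Go and may treat other values differently.

```python
import os


def analytics_enabled(env=os.environ) -> bool:
    """Treat analytics as on unless ANALYTICS is explicitly set to false (assumed default)."""
    return env.get("ANALYTICS", "true").strip().lower() != "false"


def report_event(send, env=os.environ) -> bool:
    """Call send() only when analytics is enabled; return whether it was called."""
    if not analytics_enabled(env):
        return False  # skip the network call entirely, e.g. in air-gapped clusters
    send()
    return True
```

Skipping the call up front, rather than letting it fail and time out, is what avoids the extra launch delay in air-gapped environments.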

Major Known Issues & Limitations

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode using the default lib can fail even though chaos was injected successfully. (This mode is typically used when users don’t want to mount the runtime’s socket files on their pods.) The failure occurs when the target’s image lacks certain default utils that are used to detect the chaos processes and kill them (i.e., revert chaos) at the end of the chaos duration.

Workaround:

Users can work out the commands needed to find and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND. Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Issue:

Experiments requiring mount of the runtime socket file may fail on MicroK8s or K3s environments with error Falied to load config file: read /etc/crictl.yaml: is a directory.

Workaround/Fix:

This is being investigated

Issue:

On some platforms, such as EKS, the pod-cpu-hog experiment using the pumba chaoslib can intermittently end ungracefully (after successfully injecting chaos for the specified duration) with this error: \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed. In this case, the experiment verdict can show up as Fail because the chaoslib pod enters a failed state, despite the chaos being injected.

Workaround/Fix:

This is being investigated

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.10.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos
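
The CRD check above can also be scripted. The sketch below parses the text printed by `kubectl get crds` and reports which chaos CRDs are missing. The function name is hypothetical, and the three CRD names are an assumption about what the operator manifest installs; adjust the list if your installation differs.

```python
# Assumed CRD names for a Litmus 1.x operator install.
EXPECTED_CRDS = (
    "chaosengines.litmuschaos.io",
    "chaosexperiments.litmuschaos.io",
    "chaosresults.litmuschaos.io",
)


def missing_chaos_crds(kubectl_output: str) -> list[str]:
    """Return the expected CRD names that are absent from `kubectl get crds` output."""
    # The first whitespace-separated column of each non-empty line is the CRD name.
    installed = {
        line.split()[0] for line in kubectl_output.splitlines() if line.strip()
    }
    return [name for name in EXPECTED_CRDS if name not in installed]
```

The input can be captured with, for example, `subprocess.run(["kubectl", "get", "crds"], capture_output=True, text=True, check=True).stdout`; an empty result list means all expected CRDs were found.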

For more details, refer to the Litmus documentation.