New Features & Enhancements
- Introduces the alpha-2 version of Litmus Portal with:
  - Ability to configure a custom chaos charts (experiment custom resources) source, a.k.a. “MyHub”, for a project
  - Support for full CRUD operations on chaos (Argo) workflows
  - Support for graceful removal of connected cluster targets
  - An optimized workflow for self-cluster connect, i.e., the ability to add the cluster hosting the portal itself as a target
  - Enhanced event handling for chaos workflows
  - Improved resiliency of the portal front-end
- Adds support for resource filtering and chaos injection on pods managed by Argo Rollout resources, facilitating validation of blue-green & canary deployments
- Promotes multiarch (amd64, arm64) Docker images for all major Litmus infra components: chaos-operator, chaos-runner, go-runner, chaos-exporter
- Introduces a newer probe mode, “OnChaos”, for verification of steady state only during the chaos injection period. This is specifically useful for “negative-test” scenarios where the results of the steady-state checks are dependent on/tied to the unavailability of certain services.
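As an illustrative sketch, an OnChaos cmdProbe that passes only while a dependent service is unreachable might look like the fragment below (the probe name, target URL, and command are hypothetical; verify the exact probe schema against the probe documentation):

```yaml
# Fragment of a ChaosEngine spec (illustrative names/values)
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: check-frontend-unreachable   # hypothetical probe name
          type: cmdProbe
          mode: OnChaos                      # steady-state check runs only during chaos injection
          cmdProbe/inputs:
            command: "curl -s --max-time 2 http://frontend.default.svc || echo unreachable"
            source: inline
            comparator:
              type: string
              criteria: contains             # pass while the dependent service is down
              value: "unreachable"
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1
```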
- Extends the scope of the cmdProbe by supporting complex criteria against different output types: integer/float (equal to, less than/less than or equal to, greater than/greater than or equal to) and string (substring, string match)
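A hedged sketch of the extended comparator (the probe name, command, and threshold are illustrative; the exact criteria spellings should be verified against the cmdProbe docs):

```yaml
probe:
  - name: check-non-running-pods     # hypothetical probe name
    type: cmdProbe
    mode: Edge
    cmdProbe/inputs:
      # count pods not in the Running phase (illustrative command)
      command: "kubectl get pods -n app --field-selector=status.phase!=Running --no-headers | wc -l"
      source: inline
      comparator:
        type: int        # also: float, string
        criteria: "<="   # numeric criteria: equal, not-equal, <, <=, >, >=
        value: "2"
    runProperties:
      probeTimeout: 10
      interval: 5
      retry: 1
```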
- Paves the way for increased application filtering and resource-specific status checks via propagation of the application kind to the experiment job.
- Supports definition of taint tolerations in the chaos-runner & experiment pods via the ChaosEngine to enable scheduling of chaos resources on nodes specifically tainted for this purpose.
- Supports the specification of a NodeSelector in chaos-runner pods via the ChaosEngine for guaranteed scheduling on dedicated nodes.
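The two ChaosEngine tunables above might be expressed as follows (the taint key/value, node label, and field paths are a sketch based on the ChaosEngine components schema; verify against the docs):

```yaml
spec:
  components:
    runner:
      # schedule the chaos-runner on a dedicated, tainted node (illustrative label/taint)
      nodeSelector:
        kubernetes.io/hostname: chaos-node-1
      tolerations:
        - key: "chaos"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
  experiments:
    - name: pod-delete
      spec:
        components:
          # experiment pods can carry the same toleration
          tolerations:
            - key: "chaos"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
```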
- Includes experiments to induce chaos on platform resources (AWS) as part of the kube-aws experiment suite:
  - Terminates EC2 instances (cluster nodes) using a native Litmus chaoslib that leverages the AWS Go SDK
  - Induces disk loss via detachment of EBS volumes/disks attached to the specified instance
- Introduces an SSH-based node restart experiment to the generic experiment suite (tech preview)
- Lists use-cases for testing the resiliency of Kubernetes system and add-on components (kube-proxy, kiam, calico, etc.) based on pod-delete chaos under the kube-components suite
- Provides an option to specify the blast radius (NODES_AFFECTED_PERCENTAGE) for node-level resource chaos experiments
- Allows specification of a comma-separated list of target pods or nodes in cases where a known set of objects needs to be targeted.
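A sketch combining the two tunables above in a ChaosEngine (the experiment name, percentage, and node names are illustrative):

```yaml
experiments:
  - name: node-cpu-hog
    spec:
      components:
        env:
          # blast radius as a percentage of eligible nodes
          - name: NODES_AFFECTED_PERCENTAGE
            value: "40"
          # or target a known, comma-separated set of nodes instead
          - name: TARGET_NODES
            value: "node-1,node-2"
```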
- Adds an optional VOLUME_MOUNT_PATH env variable to the pod-level IO stress experiment, thereby allowing capacity/stress chaos against both ephemeral and persistent storage volumes.
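For the pod-level IO stress experiment, the new env variable might be tuned like this (the mount path is illustrative):

```yaml
experiments:
  - name: pod-io-stress
    spec:
      components:
        env:
          - name: VOLUME_MOUNT_PATH
            value: "/data"   # mount path (ephemeral or persistent volume) to stress
```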
- Enhances the pod-autoscaler experiment to:
  - Act on StatefulSets, apart from Deployments
  - Support abort of the experiment, resulting in an immediate rollback to the initial replica count
  - Use the chaos duration as the upper limit for the pod scale operation
- Enhances the default pre-chaos checks in the respective infra-level experiments to verify the health of infra components (nodes, disks) apart from just the applications under test / auxiliary applications
- Homogenizes the environment variable naming patterns for pod and node details across experiments, and improves probe logs to be more descriptive of statuses and errors.
- Adds more validation capability to the admission controller (presence of the application namespace), along with increased unit-test coverage
- Improves the experiment e2e suite with tests for all the newly included enhancements, along with added validation (chaos-execution checks) for network & resource chaos experiments
- Provides a new helm chart for Litmus Portal with the ability to control the mode of portal operation (namespaced vs. cluster scope), among other tunables
- Enhances the Litmus documentation with steps for helm-based install, references to learning resources (tutorials, architecture slides), docs for the newly added experiments, and an improved contributing guide.
- Dockerizes the litmus-demo script to ease the demo steps
- The period of this release also saw SIG-Orchestration being operationalized. Refer to the meeting notes here
Major Bug Fixes
- Prevents attempts to generate call-home metrics when the ANALYTICS environment variable is set to false on the chaos-operator deployment. In air-gapped environments, multiple failed attempts to send the g.analytics events were seen to add nearly 10-12s to the launch of the experiment jobs.
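For air-gapped environments, analytics can be disabled via an env setting on the operator. A minimal sketch of the relevant Deployment fragment (the container name may differ per install):

```yaml
# Fragment of the chaos-operator Deployment spec
containers:
  - name: chaos-operator
    env:
      - name: ANALYTICS
        value: "false"
```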
- Reduces the time taken between successive events on the chaos-runner and also fixes the issue of missed events
- Optimizes the time taken to gauge successful experiment pod scheduling and completion via reduced polling intervals
- Fixes the behavior where the chaos events are overridden when more than one experiment is listed in the ChaosEngine
- Fixes issues with the CI scripts in the chaos-charts repo that led to repetition/duplication of experiments in the suite/category-wise concatenated experiments.yaml
- Fixes the incorrect schema in probe examples in the documentation
Major Known Issues & Limitations
Issue:
The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket file on their pods) with the default lib can fail, in spite of chaos being injected successfully, due to the unavailability of certain default utils in the target’s image that are used for detecting the chaos processes and killing them/reverting chaos at the end of the chaos duration.
Workaround:
Users can identify the necessary commands to detect and kill the chaos processes and pass them to the experiment via the CHAOS_KILL_COMMAND env variable. Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
Issue:
Experiments requiring a mount of the runtime socket file may fail on MicroK8s or K3s environments with the error Failed to load config file: read /etc/crictl.yaml: is a directory.
Workaround/Fix:
This is being investigated.
Issue:
The pod-cpu-hog experiment using the pumba chaoslib can end ungracefully (after successfully injecting chaos for the specified duration) with the error \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed, randomly, on some platforms like EKS. In this case, the experiment verdict can show up as Fail due to the chaoslib pod entering a failed state, despite the chaos being injected.
Workaround/Fix:
This is being investigated.
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.10.0.yaml
Verify your installation
- Verify if the chaos operator is running
kubectl get pods -n litmus
- Verify if the chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs