New Features & Enhancements
- Introduces the alpha-2 version of Litmus Portal with:
  - Ability to configure a custom chaos charts (experiment custom resources) source, a.k.a. “MyHub”, for a project
  - Support for full CRUD operations on chaos (Argo) workflows
  - Support for graceful removal of connected cluster targets
  - An optimized workflow for self-cluster connect, i.e., the ability to add the cluster hosting the portal itself as a target
  - Enhanced event handling for chaos workflows
  - Improved resiliency of the portal front-end
- Adds support for resource filtering and chaos injection on pods managed by Argo Rollout resources, facilitating validation of blue-green & canary deployments
- Promotes multiarch (amd64, arm64) Docker images for all major Litmus infra components: chaos-operator, chaos-runner, go-runner, chaos-exporter
- Introduces a newer probe mode, “OnChaos”, for verification of steady state only during the chaos injection period. This is specifically useful for “negative-test” scenarios where the results of the steady-state checks are dependent on/tied to the unavailability of certain services.
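As an illustrative sketch, an OnChaos cmdProbe that passes only while a dependent service is unreachable might look like the fragment below (the probe name, target URL, and command are hypothetical; verify the exact probe schema against the probe documentation):

```yaml
# Fragment of a ChaosEngine spec (illustrative names/values)
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: check-frontend-unreachable   # hypothetical probe name
          type: cmdProbe
          mode: OnChaos                      # steady-state check runs only during chaos injection
          cmdProbe/inputs:
            command: "curl -s --max-time 2 http://frontend.default.svc || echo unreachable"
            source: inline
            comparator:
              type: string
              criteria: contains             # pass while the dependent service is down
              value: "unreachable"
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1
```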
- Extends the scope of the cmdProbe by supporting complex criteria against different output types: integer/float (equal to, less than/less than or equal to, greater than/greater than or equal to) and string (substring, string match)
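A hedged sketch of the extended comparator (the probe name, command, and threshold are illustrative; the exact criteria spellings should be verified against the cmdProbe docs):

```yaml
probe:
  - name: check-non-running-pods     # hypothetical probe name
    type: cmdProbe
    mode: Edge
    cmdProbe/inputs:
      # count pods not in the Running phase (illustrative command)
      command: "kubectl get pods -n app --field-selector=status.phase!=Running --no-headers | wc -l"
      source: inline
      comparator:
        type: int        # also: float, string
        criteria: "<="   # numeric criteria: equal, not-equal, <, <=, >, >=
        value: "2"
    runProperties:
      probeTimeout: 10
      interval: 5
      retry: 1
```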
- Paves the way for increased application filtering and resource-specific status checks via propagation of the application kind to the experiment job.
- Supports definition of taint tolerations in the chaos-runner & experiment pods via the ChaosEngine to enable scheduling of chaos resources on nodes specifically tainted for this purpose.
- Supports the specification of a NodeSelector in chaos-runner pods via the ChaosEngine for guaranteed scheduling on dedicated nodes.
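The two ChaosEngine tunables above might be expressed as follows (the taint key/value, node label, and field paths are a sketch based on the ChaosEngine components schema; verify against the docs):

```yaml
spec:
  components:
    runner:
      # schedule the chaos-runner on a dedicated, tainted node (illustrative label/taint)
      nodeSelector:
        kubernetes.io/hostname: chaos-node-1
      tolerations:
        - key: "chaos"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
  experiments:
    - name: pod-delete
      spec:
        components:
          # experiment pods can carry the same toleration
          tolerations:
            - key: "chaos"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
```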
- Includes experiments to induce chaos on platform resources (AWS) as part of the kube-aws experiment suite:
  - Terminates EC2 instances (cluster nodes) using a native Litmus chaoslib that leverages the AWS Go SDK
  - Induces disk loss via detachment of EBS volumes/disks attached to the specified instance
- Introduces an SSH-based node restart experiment to the generic experiment suite (tech preview)
- Lists use-cases for testing the resiliency of Kubernetes system and add-on components (kube-proxy, kiam, calico, etc.) based on pod-delete chaos under the kube-components suite
- Provides an option to specify the blast radius (NODES_AFFECTED_PERCENTAGE) for node-level resource chaos experiments
- Allows specification of a comma-separated list of target pods or nodes in cases where a known set of objects needs to be targeted.
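A sketch combining the two tunables above in a ChaosEngine (the experiment name, percentage, and node names are illustrative):

```yaml
experiments:
  - name: node-cpu-hog
    spec:
      components:
        env:
          # blast radius as a percentage of eligible nodes
          - name: NODES_AFFECTED_PERCENTAGE
            value: "40"
          # or target a known, comma-separated set of nodes instead
          - name: TARGET_NODES
            value: "node-1,node-2"
```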
- Adds an optional VOLUME_MOUNT_PATH env variable to the pod-level IO stress experiment, thereby allowing capacity/stress chaos against both ephemeral and persistent storage volumes.
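For the pod-level IO stress experiment, the new env variable might be tuned like this (the mount path is illustrative):

```yaml
experiments:
  - name: pod-io-stress
    spec:
      components:
        env:
          - name: VOLUME_MOUNT_PATH
            value: "/data"   # mount path (ephemeral or persistent volume) to stress
```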
- Enhances the pod-autoscaler experiment to:
  - Act on StatefulSets, apart from Deployments
  - Support abort of the experiment, resulting in an immediate rollback to the initial replica count
  - Use the chaos duration as the upper limit for the pod scale operation
- Enhances the default pre-chaos checks in the respective infra-level experiments to verify the health of infra components (nodes, disks) apart from just the applications under test / auxiliary applications
- Homogenizes the environment variable naming patterns for pod and node details across experiments, and improves probe logs to be more descriptive of statuses and errors.
- Adds more validation capability to the admission controller (presence of the application namespace), along with increased unit-test coverage
- Improves the experiment e2e suite with tests for all the newly included enhancements, along with added validation (chaos-execution checks) for network & resource chaos experiments
- Provides a new helm chart for Litmus Portal with the ability to control the mode of portal operation (namespaced vs. cluster scope), among other tunables
- Enhances the Litmus documentation with steps for helm-based install, references to learning resources (tutorials, architecture slides), docs for the newly added experiments, and an improved contributing guide.
- Dockerizes the litmus-demo script to ease the demo steps
- The period of this release also saw SIG-Orchestration being operationalized. Refer to the meeting notes here
Major Bug Fixes
- Prevents attempts to generate call-home metrics when the ANALYTICS environment variable is set to false on the chaos-operator deployment. In air-gapped environments, multiple failed attempts to send the g.analytics events were seen to add nearly 10-12s to the launch of the experiment jobs.
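For air-gapped environments, analytics can be disabled via an env setting on the operator. A minimal sketch of the relevant Deployment fragment (the container name may differ per install):

```yaml
# Fragment of the chaos-operator Deployment spec
containers:
  - name: chaos-operator
    env:
      - name: ANALYTICS
        value: "false"
```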
- Reduces the time taken between successive events on the chaos-runner and also fixes the issue of missed events
- Optimizes the time taken to gauge successful experiment pod scheduling and completion via reduced polling intervals
- Fixes the behavior where the chaos events are overridden when more than one experiment is listed in the ChaosEngine
- Fixes issues with the CI scripts in the chaos-charts repo that led to repetition/duplication of experiments in the suite/category-wise concatenated experiments.yaml
- Fixes the incorrect schema in probe examples in the documentation
Major Known Issues & Limitations
Issue:
The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket file on their pods) with the default lib can fail, in spite of chaos being injected successfully, due to the unavailability of certain default utils in the target’s image that are used for detecting the chaos processes and killing them/reverting chaos at the end of the chaos duration.
Workaround:
Users can identify the necessary commands to detect and kill the chaos processes and pass them to the experiment via the CHAOS_KILL_COMMAND env variable. Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
Issue:
Experiments requiring a mount of the runtime socket file may fail on MicroK8s or K3s environments with the error Failed to load config file: read /etc/crictl.yaml: is a directory.
Workaround/Fix:
This is being investigated.
Issue:
The pod-cpu-hog experiment using the pumba chaoslib can end ungracefully (after successfully injecting chaos for the specified duration) with the error \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed, randomly, on some platforms like EKS. In this case, the experiment verdict can show up as Fail due to the chaoslib pod entering a failed state, despite the chaos being injected.
Workaround/Fix:
This is being investigated.
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.10.0.yaml
Verify your installation
- Verify if the chaos operator is running
kubectl get pods -n litmus
- Verify if the chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs