New Features & Enhancements
-
Moves the Litmus Portal to the beta-0 phase with first-cut API documentation, view-only users, install/operation support in air-gapped environments, non-root/non-privileged containers, etc.
-
Introduces the Prometheus Probe to facilitate metrics based SLO validation during experiment runs
-
Enhances the litmus probes by adding regex support for output comparison, OpenAPI v3 based CRD validation for probe schema, error handling & probe logging improvements
-
Adds the node-restart & node power-off experiments for Kubevirt based Linux VMs
-
Supports adding ENV variable values from ConfigMaps and Secrets in the ChaosEngine. This is especially useful for platform-specific (e.g., AWS) chaos experiments.
-
Allows chaos annotations on multiple application workloads that share the same labels (controls chaos for a set of apps)
-
Supports the definition of resource requests/limits for chaos-runner & helper pods
-
Extends the native litmus chaoslib for network chaos on docker runtime while continuing to support pumba lib. This is expected to help users that do not want additional images (defined by the TC_IMAGE env in the network chaos experiments) pulled during the course of the experiment.
-
Cleans up the failed/orphaned helper pods based on the jobCleanupPolicy specified in ChaosEngine.
-
Propagates the ImagePullPolicy of the experiment resource to the helper pods
-
Refactors the chaos-runner to avoid unnecessary experiment dependency checks (for configmaps, secrets) where applicable and alters the flow to fail faster in case of issues such as missing experiment CRs, etc.
-
Removes dependency on (availability of) crictl.yaml on the Kubernetes nodes for the execution of experiments on containerd/crio runtime (esp. useful for K3s, MicroK8s platforms)
-
Reduces the image sizes for the chaos-operator & chaos-runner pods while significantly reducing vulnerabilities with a new base image
-
Adds non-root experiment (go-runner) images in the tech-preview stage for beta testing.
-
Introduces a recommended PodSecurityPolicy configuration for LitmusChaos experiments for use in restricted environments
-
Improves the experiment bootstrap experience with a simple scaffold CLI/SDK
-
Simplifies the ChaosEngine sample specs on the ChaosHub by removing redundant attributes, renaming the ENVs referring to remote services/hosts in the network chaos experiments, synchronizing runtime & socket-path vars, etc.
-
Adds integration tests as a PR check (triggered on each commit unless skipped via tag) on a containerd based cluster (KIND) for the chaos-operator, chaos-runner, litmus-go & litmus-helm repos
-
Improves the litmus-e2e with dedicated pipelines on AWS cloud for pod level, infra (node) level experiment tests & control plane functionality tests, with schedules set up for nightly builds on the ci tag. This aids in faster and easier on-demand execution.
-
Adds a first-cut visualization of the e2e metrics based on a coverage tracker
-
Includes a helm chart (with an entry/release item on the helmfile) for the litmus-portal
-
Provides an option to execute the Litmus Demo from a container and adds EKS as a test platform.
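As an illustration of the ConfigMap/Secret ENV support listed above, the sketch below writes a ChaosEngine manifest to a file for review. It assumes the experiment env entries accept the standard Kubernetes valueFrom syntax (configMapKeyRef/secretKeyRef); the engine name, service account, ConfigMap, Secret, keys and the CLOUD_API_TOKEN variable are all hypothetical.

```shell
# Write a ChaosEngine manifest to a file so it can be reviewed before applying.
# Assumption: env entries follow the standard Kubernetes EnvVar "valueFrom" syntax.
# All resource names and keys below are hypothetical placeholders.
cat > chaosengine-env.yaml <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos                 # hypothetical
  namespace: default
spec:
  engineState: active
  annotationCheck: 'false'
  appinfo:
    appns: default
    applabel: app=nginx             # hypothetical
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # hypothetical
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Value read from a ConfigMap key
            - name: TOTAL_CHAOS_DURATION
              valueFrom:
                configMapKeyRef:
                  name: chaos-settings      # hypothetical ConfigMap
                  key: duration
            # Value read from a Secret key (e.g. platform credentials)
            - name: CLOUD_API_TOKEN         # hypothetical variable
              valueFrom:
                secretKeyRef:
                  name: cloud-credentials   # hypothetical Secret
                  key: token
EOF
```

Once the referenced ConfigMap and Secret exist in the same namespace, the file could be applied with kubectl apply -f chaosengine-env.yaml.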
Major Bug Fixes
-
Fixes issues in the chaos-runner & experiment logic which led to failed event generation when the experiment is restarted post an abort operation
-
Adds the pods/exec resource to the experiment RBAC to support the source mode of operation of cmdProbe wherein the probe command is executed from within a dedicated pod whose source image has been specified. Without this change, probe execution is unsuccessful.
-
Fixes the behavior where application pods configured with liveness probes enter the CrashLoopBackOff state after network chaos injection on the containerd runtime. The chaos revert was failing because the PID of the app container, which litmus used to inject the netem rules, changed when the container restarted. The fix injects the rules on the corresponding sandbox container instead of the app container itself, which allows the chaos to be reverted successfully. With this, the app pods are expected to recover without manual intervention, depending upon the existing backOff delay/number of restarts during the desired chaos duration.
-
Fixes the developer flow with Okteto based dev container/environments: executing the experiment code from within the litmus-experiment test deployment was seen to fail due to failed probe initialization (whereas the chaosengine is not defined at this stage at all). This has been fixed to ensure the probe initialization occurs only if the experiment is triggered by the chaosengine & probes are defined.
-
Removes “auxiliaryAppInfo” as an attribute in non-infra experiments (w/o cluster-wide rolebinding). Providing this attribute in pod-level experiments caused failed entry/exit application status checks due to lack of permissions.
-
Cleans up the permissions on the chaos operator cluster role to avoid listing of unrelated resources under API groups
-
Fixes version comparison on the ChaosHub server to reflect the latest chaos-charts release on the website
-
Fixes the chaos-exporter deployment crash upon startup with appropriate entrypoint script
-
Propagates the docker socket file path to the pumba helper pod for network chaos experiments instead of the hardcoded /var/run/docker.sock
Major Known Issues & Limitations
Issue
Forced removal of the experiment helper pods (where applicable: notably network chaos experiments), either manually or due to Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or not run at all. This will cause the application under test (AUT) to continue being subjected to chaos unless manually recovered.
-
Workaround
The experiment pod logs indicate when the helper operations have failed. In that case, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (this is applicable only to applications managed by a higher-level controller such as a deployment/statefulset/daemonset, etc.).
-
Fix
This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.
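The workaround above can be sketched as two small shell helpers. This is only an illustration: the function names, the namespace, the pod name and the label selector are hypothetical, and the KUBECTL override exists purely so the commands can be previewed without touching a cluster.

```shell
# KUBECTL may be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Print the experiment pod logs to look for failed helper operations.
# $1 = namespace, $2 = experiment pod name (both supplied by the user)
experiment_logs() {
  "$KUBECTL" logs -n "$1" "$2"
}

# Delete the AUT pods matching a label selector so that their controller
# (deployment/statefulset/daemonset) recreates them with a new network namespace.
# $1 = namespace, $2 = label selector
recover_aut() {
  "$KUBECTL" delete pod -n "$1" -l "$2"
}
```

For example, `recover_aut default app=nginx` would delete the pods labelled app=nginx in the default namespace; standalone pods without a controller are not recreated, so this should not be used on them.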
Issue
The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (which is typically used when the users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail, in spite of chaos being injected successfully, because the target’s image lacks certain default utils that are used to detect the chaos processes and kill them (revert chaos) at the end of the chaos duration.
-
Workaround
- Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND
- Alternatively, they can make use of the pumba chaoslib that uses external containers with SYS_ADMIN docker capability to inject/revert the chaos, while mounting the runtime socket file. Note that this is supported only on docker at this point.
-
Fix
This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.
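The first workaround for the pod-exec issue can be sketched as an env fragment for the ChaosEngine. The kill command shown is an assumption rather than a recommended value: it presumes that the stress process is md5sum and that pkill exists in the target image, so it needs to be adapted to whatever utilities the image actually ships.

```shell
# Write an env fragment intended for the experiment's env list in a ChaosEngine.
# Assumption: the stress process is md5sum and the target image provides pkill;
# replace the value with a command that works in your image.
cat > cpu-hog-kill-env.yaml <<'EOF'
- name: CHAOS_KILL_COMMAND
  value: "pkill -f md5sum"   # hypothetical command
EOF
```

The fragment would be merged under the env section of the pod-cpu-hog (or pod-memory-hog) experiment entry in the ChaosEngine before applying it.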
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.11.0.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs