New Features & Enhancements

Litmus Portal progress will henceforth be tracked in the fortnightly 2.0.0-beta releases as we build towards Litmus 2.0 GA. Check out the release notes for Litmus 2.0.0-Beta0 & Litmus 2.0.0-Beta1.
Enhances the comparator logic for cmdProbe & promProbe to include more operations (OneOf, Range, etc.) against a list of int, float & string values.
Streamlines the application health status check process by considering only the pods of annotated parent workloads (when annotationCheck is true). Also provides support to verify the readiness for a specific container (specified by ChaosExperiment/ChaosEngine env var TARGET_CONTAINER) within a pod.
Simplifies the disk-fill generic chaos experiment to work on applications that don’t have ephemeral storage limits defined explicitly, with env var EPHEMERAL_STORAGE_MEBIBYTES.
Improves the chaoslib & health-checks used in the AWS ec2-terminate experiment to work with managed node groups (e.g., KOPS, EKS)
Extends support for termination signal type SIGKILL (as env var) in the container-kill experiment for Containerd, CRI-O runtimes.
Removes the need to change permissions of the container runtime socket-files on the node (mounted into the experiment helpers and permissions updated via init-containers) when using Litmus LIB for container-kill & network-chaos experiments
Extends OnChaos mode of probe operation to all native Litmus experiments
Improves the K8sProbe schema to be more meaningful, by replacing the “command” field with GVR (group-version-resource) fields in the probe inputs. Also adds K8s CRUD operation as a valid probe input.
Speeds up the abort routine (impact observed at scale) of network chaos experiments by shifting the chaosresult and event generation (kube-api calls) steps from helper pods to experiment pod.
Provides support for defining ResponseTimeouts for API calls in the httpProbe.
Adds flexibility in the definition of applabel in ChaosEngine (recommended kubectl patterns such as =, ==, !=, in, notin, exists) with annotationCheck enabled/disabled.
Introduces a new chaos category for Azure Cloud with the Azure VM instance kill experiment (available in tech preview mode with image: litmuschaos/go-runner:azure)
Adds additional labels (ChaosEngine name) in the Chaos Exporter metrics for improved tracking purposes in monitoring systems.
Adds new e2e tests (for validation of chaos execution in serial/parallel mode when pod_affected_percentage > 0) for PRs on chaos-operator
Increases unit-test coverage (+35%) on the chaos-runner component
Adds the litmus-portal pipeline coverage details & execution results to the litmus-e2e dashboard.
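
To make the extended comparator for cmdProbe & promProbe more concrete, here is a minimal Python sketch of how list-based operations such as OneOf and Range could be evaluated against int, float & string values. It is not the Litmus implementation: the function name `compare`, the criteria spellings `oneOf` and `between`, and the inclusive range bounds are all assumptions made for illustration. Refer to the probe documentation for the exact criteria names the probes accept.

```python
def _cast(kind, raw):
    """Convert a raw value to the comparator's declared type."""
    if kind == "int":
        return int(raw)
    if kind == "float":
        return float(raw)
    if kind == "string":
        return str(raw)
    raise ValueError(f"unsupported comparator type: {kind}")


def compare(kind, criteria, actual, expected):
    """Evaluate one probe-style comparison (illustrative only).

    kind     -- "int", "float" or "string"
    criteria -- hypothetical operation names, not Litmus's exact spelling
    expected -- a single value, or a list for "oneOf" / "between"
    """
    value = _cast(kind, actual)
    if criteria == "oneOf":
        # Pass when the observed value equals any entry in the list.
        return value in [_cast(kind, item) for item in expected]
    if criteria == "between":
        # Assumption: a range is a [low, high] pair with inclusive bounds.
        if kind == "string":
            raise ValueError("range checks need a numeric type")
        low, high = (_cast(kind, item) for item in expected)
        return low <= value <= high
    if criteria == "==":
        return value == _cast(kind, expected)
    if criteria == "!=":
        return value != _cast(kind, expected)
    raise ValueError(f"unsupported criteria: {criteria}")
```

Casting both sides to the declared type before comparing keeps a command output such as "3" comparable with the integer list [1, 2, 3].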

Major Bug Fixes

Removes the force flag (terminationGracePeriodSeconds: 0) from the abort function to allow the experiment & helper pods to successfully execute the chaos revert/rollback and notification (via event) routine. The need is for the chaos rollback to occur instantaneously and in a guaranteed manner rather than immediate removal of the chaos pods/resources.
Ensures guaranteed rollback/revert of chaos process (tc rule) on target pods upon abort when the chaos injection occurs in parallel at scale (100-150 replicas). This is enabled via a change in the ordering of tasks in the abort routine and preventing further execution in the inject routine once SIGTERM is received.
Skips the AUT (Application-Under-Test) status checks in the chaos namespace when the .spec.appInfo is not specified within the ChaosEngine (with portal driven execution, the chaos namespace may contain completed experiment/argo pods that are not in “Running” status).
Handles invalid DESTINATION_HOSTS better in the network chaos experiments. The inability to resolve specified hostnames to valid IPs is logged, with only valid hosts injected with chaos instead of applying total egress chaos on the interface.
Fixes the OpenAPI schema validation error for httpProbe that resulted in a failed evaluation of the probe.
Fixes the result notification error for failure cases in the Kafka Broker Pod Failure experiment and makes the partition leader identification process more resilient.
Includes the missing step to propagate the ImagePullPolicy of experiment images to the helper pod (via ENV vars)
Includes the missing step to propagate the imagePullSecrets to the helper pods (via spec attribute)
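
The DESTINATION_HOSTS fix listed above can be pictured with the following Python sketch: each hostname is resolved, failures are logged, and only the hosts that resolve are kept as chaos targets. This is a simplified illustration, not the code used by the network chaos experiments. The function name `resolve_hosts`, the logger name, and the comma-separated input format are assumptions.

```python
import logging
import socket

log = logging.getLogger("network-chaos-sketch")  # hypothetical logger name


def resolve_hosts(destination_hosts, resolver=socket.gethostbyname):
    """Split a comma-separated host list into resolvable and unresolvable names.

    Returns (resolved, skipped): a dict of hostname -> IP, and a list of
    hostnames that could not be resolved.
    """
    resolved = {}
    skipped = []
    for host in (h.strip() for h in destination_hosts.split(",")):
        if not host:
            continue  # ignore empty entries such as a trailing comma
        try:
            resolved[host] = resolver(host)
        except OSError as err:  # socket.gaierror is a subclass of OSError
            log.warning("could not resolve %s: %s", host, err)
            skipped.append(host)
    return resolved, skipped
```

A caller would then apply chaos only to the IPs in `resolved`, rather than falling back to the whole interface when some entries are invalid. The `resolver` parameter exists so the lookup can be replaced in tests.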

Major Known Issues & Limitations

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail, in spite of chaos being injected successfully, because the target’s image lacks certain default utils that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Workaround:

Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND
Alternatively, they can make use of the pumba chaoslib that uses external containers with SYS_ADMIN docker capability to inject/revert the chaos, while mounting the runtime socket file. Note that this is supported only on docker at this point.

Fix:

This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.2.yaml

Verify your installation

Verify if the chaos operator is running
kubectl get pods -n litmus
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
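
If the pod check needs to run from a script, a small Python wrapper along these lines may help. It is only a sketch: it assumes kubectl is on the PATH and configured for the target cluster, and the function names `pods_not_running` and `check_litmus_namespace` are made up for this example.

```python
import json
import subprocess


def pods_not_running(pod_list_json):
    """Return the names of pods whose phase is not Running.

    Expects the JSON printed by `kubectl get pods -o json`.
    """
    pods = json.loads(pod_list_json).get("items", [])
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod.get("status", {}).get("phase") != "Running"
    ]


def check_litmus_namespace(namespace="litmus"):
    """Run kubectl against the namespace and report pods that are not Running."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return pods_not_running(out)
```

An empty result means no pod was found in a non-Running phase. It does not by itself prove that the operator pod exists, so the plain `kubectl get pods -n litmus` output is still worth a look.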

For more details refer to the documentation at Docs

litmuschaos/litmus 1.13.2 on GitHub