github litmuschaos/litmus 1.9.0

4 years ago

New Features & Enhancements

  • Introduces the alpha-1 version of the Litmus Portal. Adds support for scheduled workflows, chaos workflows on external agents, namespaced mode of operation, and workflow analytics comparison. Also includes additional pre-defined workflows and enhanced UX around user management.

  • Enhances the K8s probe to support full CRUD operations against native/custom resources. This is especially useful during chaos on “control-plane” components where provisioning/de-provisioning abilities can be tested. Also adds more filters to the K8s probe (labelSelectors)

  • Supports ordered execution of probes with the ability to reuse probe (result) artifacts in “downstream” probes, thereby enabling the creation of complex exit checks in standard experiments. The probe artifacts are referenced via standard templates in the ChaosEngine schema.

  • Supports configmaps & secrets definition for the chaos-runner pod. One emerging use case for this feature is cross-cluster chaos, wherein the chaos-runner executes the experiment on a different cluster from the one where the chaos operator/runner (litmus control plane) resides.

  • Allows resource requests/limits specification for chaos resources (chaos-runner, experiment pods) in the ChaosEngine. Aids operations in multi-tenant environments where experiments are executed simultaneously across several namespaces, leading to a large set of chaos pods.

  • Adds support for ImagePullSecrets for chaos resources in the ChaosEngine to enable operations in cases where private image registries are used.

  • Provides golang chaoslib for Kafka chaos with enhancements to dynamically retrieve "current" partition leaders for each iteration of the broker kill.

  • Supports network chaos between desired microservices (specified via service IP or hostname filters) on the containerd & CRI-O runtimes.

  • Introduces two modes of chaos execution, serial and parallel, selected via a SEQUENCE env var, for cases where the experiment blast radius is larger. This allows chaos to be executed sequentially or in parallel on the replicas of the application under test (AUT).

  • Supports abort operation for all node & pod-level chaos experiments (except kubelet/docker service kill), including those running chaos processes in the target container’s network/process namespace. Also handles probe status for abort scenarios.

  • Minimizes the permissions/scope of the clusterroles used in the chaos operator and admin-mode serviceaccount to better comply with standard security constraints.

  • Optimizes the code structure in the litmus-go repo to ensure a single experiment binary is built (which takes individual experiment names as args) instead of building binaries for each experiment, resulting in an experiment image with a much-reduced size footprint.

  • Releases a set of multi-arch (arm64, amd64) images with tag multiarch-1.9.0 for technical preview & feedback (built via docker buildx). These will eventually be folded into the standard release images.

  • Improves build process via docker security checks, linting & formatting checks in missing components/repos.

  • Adds the recommended Kubernetes labels for all chaos resources to enable group-management by external tools.

  • Propagates the labels & identifiers of the chaos experiment pod (defined in the ChaosExperiment CRs) to the ChaosResults to allow segregation/management.

  • Improves error handling & logging (structured logs with logrus) in the chaos-runner & experiments.

  • Improves the scaffolding tool to bootstrap experiment artifacts with the latest schema enhancements (probe support, abort support, etc.).

  • Improves the (validation webhook) admission-controller to verify availability of configmap & secret resources specified for a chaos experiment.

  • Introduces a helmfile for Litmus to package the infra (operator, CRDs) & the experiment helm charts as part of a single (litmus stack) installation.

  • Introduces on-demand e2e tests (triggered via /run-e2e commands) for Pull Requests on the litmus-go repository via GitHub Actions using KIND clusters.

  • Improves the e2e coverage for chaos experiments (pod-io-stress, node-io-stress, pod-autoscaler, abort support, target specification) via new tests in the pipeline based on the new additions/enhancements. The existing tests are improved with increased validation to test the success of the chaos injection procedures.

  • Adds a new GitLab pipeline with an initial set of e2e tests for Litmus Portal functions

  • Enhances the litmus-demo scripts to set up the EKS environment & execute the generic chaos suite (KIND & GKE are the other supported platforms)

  • Introduces documentation standards (and consequent update/refactor) around naming conventions for resource names, attribute names, and contribution guidelines as part of the SIG-Documentation deliberations.

  • Adds new content to litmus-docs: chaos monitoring, chaos CR schema explanations, probe enhancements, troubleshooting FAQ additions, etc.
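
The serial and parallel execution modes described above can be illustrated with a small shell sketch. This is not the litmus-go implementation (which is written in Go); the function names run_chaos and inject_chaos are hypothetical, and inject_chaos only prints a message instead of injecting a fault.

```shell
#!/bin/sh
# Hypothetical stand-in for the real fault injection against one replica.
inject_chaos() {
  echo "chaos injected on $1"
}

# run_chaos <sequence> <target>...
#   serial   : inject on one target at a time, in the order given
#   parallel : inject on all targets at once, then wait for all of them
run_chaos() {
  sequence=$1
  shift
  case "$sequence" in
    serial)
      for target in "$@"; do
        inject_chaos "$target"
      done
      ;;
    parallel)
      for target in "$@"; do
        inject_chaos "$target" &
      done
      # Block until every background injection has finished.
      wait
      ;;
    *)
      echo "unsupported SEQUENCE: $sequence" >&2
      return 1
      ;;
  esac
}
```

Calling run_chaos serial pod-a pod-b touches the replicas one after another, while run_chaos parallel pod-a pod-b starts both injections before waiting, so the output order in parallel mode is not guaranteed.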

Major Bug Fixes

  • Fixes the bug wherein applications configured with liveness probes are stuck in CrashLoopBackOff state upon being subjected to network chaos (docker runtime) with revert chaos being unsuccessful. The network chaoslib now uses the container ID of the Kubernetes pause container associated with the target pod to inject the tc rules in the network namespace instead of the target app containers themselves (as they are prone to restart via liveness probes).

  • Fixes the "Failed to connect to bus: No data available" error on the kubelet-service-kill chaoslib pod.

  • Fixes the regex patterns used in the CRD validation schema to support non-specification of .spec.appinfo in the ChaosEngine (either in case of node-level/infra experiments or for broader, randomized selection of pods in the pod-level experiments)

  • Adds logic to exclude the chaos-resource pods (operator, runner, experiment & helper pods) from the target list in cases where the .spec.appinfo is not specified.

  • Fixes the behavior where the chaos-runner runs forever without terminating the experiment, in cases where the experiment job is not successfully started (ImagePullBackOff, Pending, etc.). The chaos-runner is now configured to use the StatusCheckTimeout defined in the ChaosEngine (defaults to 180s) to terminate the experiment.

  • Fixes the inability to inject network-chaos when the ChaosExperiment CR is created with a different name (other than the default names on the chaoshub). The logic to select the netem params based on the fixed experiment names has been altered with dedicated functions for each variant of network chaos (latency, loss, duplication, corruption).

  • Fixes the improper entrypoint/command for the containerd/crio container-kill & node-io-stress chaoslib (helper) pods.

  • Fixes inability to revert (downscale replicas) the pod-autoscaler chaos in cases where the application namespace and chaos namespace are different (as with admin mode execution).
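
The StatusCheckTimeout fix above amounts to bounding a status poll with a deadline. A minimal shell sketch of that pattern follows; it is an illustration only, not the chaos-runner's code, and wait_until is a hypothetical helper name.

```shell
#!/bin/sh
# wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Re-runs the command every <interval_seconds> until it succeeds.
# Returns 0 on success, or 1 once <timeout_seconds> have elapsed.
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Example (job_is_running is a hypothetical check supplied by the caller):
#   wait_until 180 2 job_is_running || echo "experiment job never started, terminating"
```

With a 180s timeout, a job stuck in ImagePullBackOff or Pending makes the helper return 1, at which point the caller can terminate the experiment instead of polling forever.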

Major Known Issues & Limitations

Issue:

  • The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode using the default lib (a mode typically chosen when users don’t want to mount the runtime’s socket files on their pods) can fail even though chaos was injected successfully. The cause is that the target’s image may lack certain default utils that are used to detect the chaos processes and kill them (i.e., revert chaos) at the end of the chaos duration.

Workaround:

  • Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND. Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Note: Expected to be fixed in a subsequent patch/minor release

Issue:

  • The pod-cpu-hog experiment using the pumba chaoslib can, intermittently and on some platforms like EKS, end ungracefully (after successfully injecting chaos for the specified duration) with this error: \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed. In this case, the experiment verdict can show up as Fail because the chaoslib pod enters a failed state, despite the chaos being injected.

Workaround:

  • This is being investigated

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.9.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos
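
The two checks above can be combined into one function, sketched below. The function name verify_litmus is hypothetical, and the three CRD names (chaosengines, chaosexperiments, chaosresults under the litmuschaos.io group) are an assumption about what the operator manifest installs; adjust them if your cluster lists different names.

```shell
#!/bin/sh
# verify_litmus [namespace] [timeout]
# Succeeds only if all pods in the namespace are Ready and the chaos CRDs exist.
verify_litmus() {
  namespace=${1:-litmus}
  timeout=${2:-120s}
  # Block until every pod in the namespace reports Ready, or the timeout expires.
  kubectl wait --for=condition=Ready pod --all -n "$namespace" --timeout="$timeout" || return 1
  # Assumed CRD names; look each one up by name instead of grepping the full list.
  for crd in chaosengines chaosexperiments chaosresults; do
    kubectl get crd "$crd.litmuschaos.io" >/dev/null || return 1
  done
  echo "litmus is installed in namespace $namespace"
}
```

Run verify_litmus after applying the operator manifest; a non-zero exit status means either the operator pod did not become Ready in time or a CRD is missing.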

For more details refer to the documentation at Docs
