Chaos Engineering for Resilience Testing: A Practical Guide

Chaos Engineering is the discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions. By proactively injecting failures, teams discover weaknesses before they cause outages.

Chaos Engineering Principles

  • Build a Hypothesis: Define the expected steady-state behavior
  • Vary Real-World Events: Simulate realistic failures
  • Run in Production: Test where it matters most
  • Automate: Run experiments continuously
  • Minimize Blast Radius: Start small, expand gradually
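
The principles above can be read as a loop: confirm the steady state, inject a fault, check the steady state again, and always roll back. The Python sketch below is illustrative only. `run_experiment`, `ExperimentResult`, and the three callables are hypothetical names, not part of Litmus or any other chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    ran: bool              # False if the system was unhealthy before injection
    hypothesis_held: bool  # True if the steady state survived the fault


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment around a steady-state hypothesis (hypothetical helper)."""
    # Never inject a fault into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(ran=False, hypothesis_held=False)
    try:
        inject()
        held = steady_state()
    finally:
        # Roll back even if the injection or the check raised an exception.
        rollback()
    return ExperimentResult(ran=True, hypothesis_held=held)
```

In practice `steady_state` might wrap a metrics query (for example, error rate below a threshold), while `inject` and `rollback` would call into the chaos tool of choice.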

Litmus Chaos Installation

# Install Litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus \
  --namespace litmus \
  --create-namespace

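# Install chaos experiments into the namespace used by the ChaosEngines below
# (the URL is quoted so the shell does not treat "?" as a glob character)
kubectl apply -n production -f "https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yaml"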

Pod Delete Experiment

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
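        - name: TOTAL_CHAOS_DURATION
          value: "30"  # seconds
        - name: CHAOS_INTERVAL
          value: "10"  # seconds between successive pod deletions
        - name: FORCE
          value: "false"  # "false" deletes pods gracefully
        - name: PODS_AFFECTED_PERC
          value: "50"  # percentage of matching pods to target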
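
Before running the experiment, it helps to estimate how many pods it will touch. The sketch below is a rough planning aid, not Litmus code. It assumes the percentage is rounded down with a floor of one pod, and that one deletion round runs per CHAOS_INTERVAL within TOTAL_CHAOS_DURATION. Both assumptions should be checked against the Litmus documentation for the version in use. `pod_delete_plan` is a hypothetical name.

```python
from typing import Tuple


def pod_delete_plan(
    replicas: int, affected_perc: int, duration_s: int, interval_s: int
) -> Tuple[int, int]:
    """Estimate (pods deleted per round, number of rounds) for a pod-delete run.

    Assumption: percentage rounds down, with at least one pod targeted
    whenever any replica exists.
    """
    pods_per_round = min(replicas, max(1, replicas * affected_perc // 100))
    rounds = max(1, duration_s // interval_s)
    return pods_per_round, rounds


# Values from the ChaosEngine above, with a hypothetical 4-replica deployment.
print(pod_delete_plan(replicas=4, affected_perc=50, duration_s=30, interval_s=10))
```

Under these assumptions, a 4-replica deployment loses 2 pods in each of 3 rounds, so the remaining replicas must carry the full load while replacements start.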

Network Chaos

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
        - name: NETWORK_INTERFACE
          value: "eth0"
        - name: NETWORK_LATENCY
          value: "200"  # milliseconds
        - name: TOTAL_CHAOS_DURATION
          value: "60"  # seconds
        - name: CONTAINER_RUNTIME
          value: "containerd"
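
A latency experiment is most useful when paired with an explicit hypothesis, such as "requests still complete within the client timeout". The sketch below is a back-of-the-envelope check under simplifying assumptions: calls are sequential and each one pays the full injected delay. The baseline latency, timeout, and call count are hypothetical; only the 200 ms figure comes from the experiment above.

```python
def within_timeout(
    baseline_ms: float, injected_ms: float, timeout_ms: float, calls: int = 1
) -> bool:
    """Return True if `calls` sequential requests fit inside the timeout.

    Assumption: every call pays the full injected latency on top of its baseline.
    """
    total_ms = calls * (baseline_ms + injected_ms)
    return total_ms < timeout_ms


# Hypothetical service: 50 ms baseline per call, 1000 ms end-to-end timeout.
print(within_timeout(baseline_ms=50, injected_ms=200, timeout_ms=1000, calls=3))
print(within_timeout(baseline_ms=50, injected_ms=200, timeout_ms=1000, calls=4))
```

If the estimate already predicts a timeout, the experiment is likely to confirm a known weakness, and the timeout or call pattern may be worth revisiting first.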

AWS Fault Injection Simulator

# Terraform - FIS experiment template
resource "aws_fis_experiment_template" "ec2_terminate" {
  description = "Terminate one randomly selected production EC2 instance"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.high_error_rate.arn
  }

  action {
    name      = "terminate-instances"
    action_id = "aws:ec2:terminate-instances"
    
    target {
      key   = "Instances"
      value = "target-instances"
    }
  }

  target {
    name           = "target-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "COUNT(1)"

    resource_tag {
      key   = "Environment"
      value = "production"
    }
  }
}
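
The stop condition above halts the experiment when the referenced CloudWatch alarm enters the ALARM state. The alarm itself is not defined in this guide. As a rough illustration of how such an alarm decides, the Python sketch below models an "M out of N datapoints" rule. The threshold and sample values are hypothetical, and `should_stop` is not an AWS API.

```python
from typing import Sequence


def should_stop(
    error_rates: Sequence[float], threshold: float, datapoints_to_alarm: int
) -> bool:
    """Return True when enough recent samples breach the error-rate threshold."""
    breaching = sum(1 for rate in error_rates if rate > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical rule: stop if 2 of the last 3 samples exceed a 5% error rate.
print(should_stop([0.01, 0.06, 0.07], threshold=0.05, datapoints_to_alarm=2))
```

Requiring more than one breaching datapoint avoids aborting on a single noisy sample, at the cost of a slower reaction to a real failure.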

Chaos Monkey for Kubernetes

# kube-monkey configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    run_hour = 8
    start_hour = 10
    end_hour = 16
    blacklisted_namespaces = ["kube-system"]
    time_zone = "America/New_York"
    
    [debug]
    enabled = true
---
# Opt-in deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    kube-monkey/enabled: "enabled"
    kube-monkey/identifier: "api"  # Unique app identifier; must also be set on the pod template labels
    kube-monkey/mtbf: "2"  # Mean time between failures (days)
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"
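
To make the scheduling settings concrete: kube-monkey's README describes `mtbf` in days and says that each weekday, at `run_hour`, it flips a biased coin per opted-in app and, if the app is selected, picks a random kill time between `start_hour` and `end_hour`. The Python sketch below simulates that behavior as described. It is a simplified model, not kube-monkey's actual code, and `schedule_kill_hour` is a hypothetical name.

```python
import random
from typing import Optional


def schedule_kill_hour(
    mtbf_days: int, start_hour: int, end_hour: int, rng: random.Random
) -> Optional[float]:
    """Return today's kill time as a fractional hour, or None for no kill."""
    # Biased coin: on average one kill every `mtbf_days` weekdays.
    if rng.random() >= 1.0 / mtbf_days:
        return None
    # Pick a uniformly random time inside the allowed kill window.
    return rng.uniform(start_hour, end_hour)


# Values from the config above: mtbf of 2 days, kill window 10:00 to 16:00.
rng = random.Random(42)
week = [schedule_kill_hour(2, 10, 16, rng) for _ in range(5)]
print(week)
```

With an `mtbf` of 2, roughly half of the simulated weekdays produce a kill, always inside the configured window.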

GameDay Checklist

  • Define success criteria and rollback procedures
  • Notify stakeholders and on-call teams
  • Ensure monitoring and alerting are active
  • Start with a small blast radius
  • Document findings and remediation actions

Conclusion

Chaos Engineering builds confidence in system resilience by proactively discovering weaknesses. Start with simple experiments in non-production environments, then gradually expand to production with safeguards such as stop conditions and a limited blast radius.