Chaos Engineering is the discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions. By proactively injecting failures, teams discover weaknesses before they cause outages.
Chaos Engineering Principles
- Build a Hypothesis: Define the expected steady-state behavior in measurable terms
- Vary Real-World Events: Simulate realistic failures such as pod crashes, network latency, and instance loss
- Run in Production: Test where it matters most
- Automate: Run experiments continuously
- Minimize Blast Radius: Start small, expand gradually
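The principles above can be read as a small control loop: confirm steady state, inject a fault, keep measuring, and roll back as soon as the hypothesis fails. The following is a minimal sketch of that loop, not a real framework. The names `run_experiment`, `measure`, `inject`, and `rollback` are hypothetical placeholders for whatever metrics query and fault tooling a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis_held: bool      # True if the metric stayed within the threshold
    aborted: bool              # True if the run stopped early
    observations: list[float]  # every metric reading, baseline first


def run_experiment(
    measure: Callable[[], float],   # hypothetical: returns a metric such as error rate
    inject: Callable[[], None],     # hypothetical: starts the fault
    rollback: Callable[[], None],   # hypothetical: stops the fault
    threshold: float,               # hypothesis: metric stays at or below this value
    checks: int = 3,                # readings taken while the fault is active
) -> ExperimentResult:
    # Do not start if the system is already outside its steady state.
    baseline = measure()
    if baseline > threshold:
        return ExperimentResult(False, True, [baseline])

    observations = [baseline]
    inject()
    try:
        for _ in range(checks):
            value = measure()
            observations.append(value)
            if value > threshold:
                # Hypothesis violated: stop early to limit the blast radius.
                return ExperimentResult(False, True, observations)
    finally:
        # Roll back whether the hypothesis held, failed, or an error occurred.
        rollback()
    return ExperimentResult(True, False, observations)
```

Passing the measurement and fault functions in as arguments keeps the loop independent of any particular tool, so the same skeleton could wrap a Litmus engine, an AWS FIS template, or a manual action.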
Litmus Chaos Installation
    # Install Litmus
    helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
    helm repo update
    helm install chaos litmuschaos/litmus \
      --namespace litmus \
      --create-namespace

    # Install chaos experiments into the namespace where the engines will run.
    # The URL is quoted so the shell does not interpret the "?" character.
    kubectl apply -n production \
      -f "https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yaml"

Pod Delete Experiment
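Before applying the pod-delete engine below, it can help to estimate its blast radius from the values it sets (`TOTAL_CHAOS_DURATION`, `CHAOS_INTERVAL`, `PODS_AFFECTED_PERC`). This is a rough sketch under two assumptions that are not taken from the Litmus source: the experiment deletes the configured percentage of matching pods once per interval until the duration elapses, and the percentage is rounded down with a floor of one pod. The helper name `estimate_pod_delete` is hypothetical.

```python
def estimate_pod_delete(replicas: int, affected_perc: int,
                        duration_s: int, interval_s: int) -> dict:
    """Estimate pods deleted per round and the number of rounds.

    Assumption: percentage is rounded down, with at least one pod targeted.
    Assumption: one deletion round happens per interval within the duration.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 <= affected_perc <= 100:
        raise ValueError("affected_perc must be between 0 and 100")
    if interval_s < 1 or duration_s < 0:
        raise ValueError("interval must be positive and duration non-negative")

    pods_per_round = max(1, replicas * affected_perc // 100)
    rounds = max(1, duration_s // interval_s)
    return {
        "pods_per_round": pods_per_round,
        "rounds": rounds,
        # Pods still running right after a round, before replacements start.
        "pods_left_during_round": replicas - pods_per_round,
    }


# Values from the engine below, with a hypothetical deployment of 4 replicas.
estimate = estimate_pod_delete(replicas=4, affected_perc=50,
                               duration_s=30, interval_s=10)
```

With four replicas, this estimate gives two pods deleted in each of three rounds. A service that cannot serve traffic on two pods should start with a lower percentage.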
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: pod-delete-chaos
      namespace: production
    spec:
      engineState: active            # set to "stop" to halt the experiment
      appinfo:
        appns: production
        applabel: "app=api"
        appkind: deployment
      chaosServiceAccount: litmus-admin
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION   # seconds
                  value: "30"
                - name: CHAOS_INTERVAL         # seconds between deletions
                  value: "10"
                - name: FORCE                  # "false" = graceful deletion
                  value: "false"
                - name: PODS_AFFECTED_PERC     # percentage of matching pods
                  value: "50"

Network Chaos
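A latency experiment is easier to interpret when the hypothesis is stated as a number beforehand. The sketch below estimates what fraction of requests would exceed a client timeout once the 200 ms set by `NETWORK_LATENCY` in the engine below is added. The baseline samples and the 500 ms timeout are hypothetical, and the model assumes each request pays the added delay exactly once, which understates the impact for requests that need several network round trips.

```python
def timeout_fraction(baseline_ms: list[float], injected_ms: float,
                     timeout_ms: float) -> float:
    """Return the fraction of requests that would exceed the timeout.

    Simplifying assumption: every request is delayed by injected_ms once.
    """
    if not baseline_ms:
        raise ValueError("at least one baseline sample is required")
    slow = sum(1 for sample in baseline_ms if sample + injected_ms > timeout_ms)
    return slow / len(baseline_ms)


# Hypothetical baseline latencies in milliseconds and a 500 ms client timeout.
samples = [120.0, 180.0, 250.0, 310.0, 90.0]
fraction = timeout_fraction(samples, injected_ms=200.0, timeout_ms=500.0)
```

Here one of the five samples (310 ms) crosses the timeout after the delay is added, so a hypothesis such as "fewer than 5% of requests time out" would be expected to fail before the experiment even runs.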
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: network-chaos
      namespace: production
    spec:
      engineState: active            # set to "stop" to halt the experiment
      appinfo:
        appns: production
        applabel: "app=api"
        appkind: deployment
      chaosServiceAccount: litmus-admin   # same service account as the pod-delete engine
      experiments:
        - name: pod-network-latency
          spec:
            components:
              env:
                - name: NETWORK_INTERFACE
                  value: "eth0"
                - name: NETWORK_LATENCY        # milliseconds
                  value: "200"
                - name: TOTAL_CHAOS_DURATION   # seconds
                  value: "60"
                - name: CONTAINER_RUNTIME
                  value: "containerd"

AWS Fault Injection Simulator
    # Terraform - FIS experiment template
    resource "aws_fis_experiment_template" "ec2_terminate" {
      description = "Terminate random EC2 instances"
      role_arn    = aws_iam_role.fis.arn

      # Stop the experiment automatically if the error-rate alarm fires
      stop_condition {
        source = "aws:cloudwatch:alarm"
        value  = aws_cloudwatch_metric_alarm.high_error_rate.arn
      }

      action {
        name      = "terminate-instances"
        action_id = "aws:ec2:terminate-instances"

        target {
          key   = "Instances"
          value = "target-instances"
        }
      }

      # Select one instance from those tagged Environment=production
      target {
        name           = "target-instances"
        resource_type  = "aws:ec2:instance"
        selection_mode = "COUNT(1)"

        resource_tag {
          key   = "Environment"
          value = "production"
        }
      }
    }

Chaos Monkey for Kubernetes
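    # kube-monkey configuration
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: kube-monkey-config
      namespace: kube-system
    data:
      config.toml: |
        [kubemonkey]
        run_hour = 8        # hour at which the day's kill schedule is generated
        start_hour = 10     # earliest hour for pod kills
        end_hour = 16       # latest hour for pod kills
        blacklisted_namespaces = ["kube-system"]
        time_zone = "America/New_York"
        [debug]
        enabled = true
    ---
    # Opt-in deployment (metadata only; the spec is omitted here).
    # kube-monkey also expects the enabled and identifier labels on the pod template.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api
      labels:
        kube-monkey/enabled: "enabled"
        kube-monkey/identifier: "api"   # unique identifier kube-monkey uses to find the app's pods
        kube-monkey/mtbf: "2"           # mean time between failures, in days
        kube-monkey/kill-mode: "fixed"
        kube-monkey/kill-value: "1"     # number of pods to kill per run

GameDay Checklist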
- Define success criteria and rollback procedures
- Notify stakeholders and on-call teams
- Ensure monitoring and alerting are active
- Start with a small blast radius
- Document findings and remediation actions
Conclusion
Chaos Engineering builds confidence in system resilience by proactively discovering weaknesses. Start with simple experiments in non-production environments, then gradually expand to production with proper safeguards.


