SRE: Chaos Engineering, Breaking Things on Purpose
Introduction
In the previous articles we covered SLIs and SLOs, incident management, and observability. You have metrics, alerts, traces, runbooks, and postmortem processes. But how do you know any of it actually works before a real incident hits?
That is where chaos engineering comes in. The idea is simple: intentionally inject failures into your system to verify that your resilience mechanisms, monitoring, alerting, and incident response processes work as expected. It is like a fire drill, but for your infrastructure.
In this article we will cover the principles of chaos engineering, how to set up Litmus and Chaos Mesh in Kubernetes, how to plan and run game days, and how to build a culture where breaking things on purpose is not just accepted but encouraged.
Let’s get into it.
Why break things on purpose?
Complex systems fail in complex ways. You cannot predict every failure mode by reading code or architecture diagrams. The only way to truly understand how your system behaves under failure is to actually make it fail.
Chaos engineering helps you:
- Discover unknown failure modes before they bite you in production at 3am
- Validate your monitoring and alerting: does your SLO alert actually fire when latency spikes?
- Test your runbooks: can the on-call engineer actually follow them under pressure?
- Build confidence: knowing your system can handle a pod crash or network partition makes you sleep better
- Reduce MTTR: practicing incident response makes you faster when real incidents happen
The Netflix engineering team, who pioneered chaos engineering with Chaos Monkey, put it best: “The best way to avoid failure is to fail constantly.”
The chaos engineering process
Chaos engineering is not just randomly killing pods. It is a disciplined process:
- Define steady state: What does “normal” look like? Use your SLIs (from article 1) as the baseline.
- Hypothesize: “If we kill one pod, the remaining pods should handle the load and the SLO should not be violated.”
- Inject failure: Actually kill the pod (or whatever failure you are testing).
- Observe: Watch your metrics, traces, and logs. Did the system behave as expected?
- Learn: If it did not behave as expected, you found a weakness. Fix it before a real failure finds it for you.
Always start small. Kill one pod, not the whole deployment. Add 100ms of latency, not 30 seconds. The goal is controlled experiments, not uncontrolled chaos.
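To make the loop concrete, here is a minimal manual run of those five steps, assuming a `tr-web` Deployment and a Prometheus service reachable in-cluster (both names come from earlier articles in this series; adjust to your setup):

```shell
# 1. Steady state: record the current availability SLI as the baseline
curl -s "http://prometheus:9090/api/v1/query?query=sli:availability:ratio_rate5m"

# 2. Hypothesis: killing one pod should not violate the SLO.
# 3. Inject: delete a single pod, not the whole deployment
POD=$(kubectl get pods -l app=tr-web -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --wait=false

# 4. Observe: the Deployment should replace the pod quickly
kubectl rollout status deployment/tr-web --timeout=120s

# 5. Learn: re-check the SLI and compare against the baseline
curl -s "http://prometheus:9090/api/v1/query?query=sli:availability:ratio_rate5m"
```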
Chaos Mesh: chaos engineering for Kubernetes
Chaos Mesh is a CNCF project that provides a comprehensive set of chaos experiments for Kubernetes. It is easy to install and has a nice web UI for managing experiments.
Install it with Helm:
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock
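Before defining experiments it is worth a quick sanity check that the install worked. A sketch, assuming the `chaos-mesh` namespace from the Helm command above:

```shell
# Controller, a chaos-daemon per node, and the dashboard should all be Running
kubectl get pods -n chaos-mesh

# The dashboard is not exposed by default; port-forward to reach the web UI
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
```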
Now let’s define some experiments. All experiments are Kubernetes custom resources, so they fit perfectly into a GitOps workflow with ArgoCD.
1. Pod failure: kill a random pod
# chaos/pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: tr-web-pod-kill
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  scheduler:
    cron: "@every 2h" # Kill a pod every 2 hours
  duration: "60s"
This kills one random tr-web pod every 2 hours. If your deployment has multiple replicas and a proper readiness probe, users should not notice anything. If they do, you found a problem.
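Applying and observing the experiment looks roughly like this (resource names taken from the manifest above):

```shell
# Apply the experiment and check its status
kubectl apply -f chaos/pod-kill.yaml
kubectl describe podchaos tr-web-pod-kill -n default  # events show which pod was killed

# Watch the deployment recover in another terminal
kubectl get pods -l app=tr-web -w

# Kill switch: deleting the resource stops the experiment immediately
kubectl delete -f chaos/pod-kill.yaml
```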
2. Network latency: add artificial delay
# chaos/network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: tr-web-network-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  direction: to
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: postgresql
    mode: all
  duration: "5m"
This adds 200ms of latency (with 50ms jitter) between your web pods and the database for 5 minutes. This is incredibly useful for testing timeout configurations and retry logic.
3. Network partition: isolate a service
# chaos/network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: tr-web-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: postgresql
    mode: all
  duration: "2m"
This completely cuts network traffic between your web pods and the database. Does your app crash? Does it show a friendly error page? Does it recover when the network comes back? These are important questions.
4. CPU stress: simulate resource contention
# chaos/cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: tr-web-cpu-stress
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "5m"
This runs two stress workers at 80% CPU load in one pod. With proper resource limits and an HPA, your cluster should handle this gracefully.
5. DNS failure: break name resolution
# chaos/dns-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: tr-web-dns-failure
  namespace: default
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  patterns:
    - "api.github.com"
  duration: "5m"
This makes DNS resolution fail for api.github.com from your web pods. Remember how we fixed the GitHub sponsors API issue with a dedicated Hackney pool? This experiment verifies that the fix actually works: database connections should not be affected even when GitHub is unreachable.
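You can verify the injection from inside a web pod while the experiment runs. A sketch, assuming the pod image ships a resolver utility (swap in `getent hosts` if `nslookup` is absent):

```shell
# Should fail while the DNSChaos experiment is active
kubectl exec -it deploy/tr-web -- nslookup api.github.com

# In-cluster names should still resolve normally
kubectl exec -it deploy/tr-web -- nslookup postgresql
```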
Litmus: experiment workflows
Litmus is another CNCF chaos engineering project that focuses on experiment workflows. While Chaos Mesh is great for individual experiments, Litmus excels at orchestrating multi-step chaos scenarios.
Install Litmus:
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm
helm repo update
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace
A Litmus workflow lets you chain multiple chaos experiments together with validation steps:
# litmus/workflow-resilience-test.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: tr-web-resilience-test
  namespace: litmus
spec:
  entrypoint: resilience-test
  templates:
    - name: resilience-test
      steps:
        # Step 1: Verify steady state
        - - name: verify-baseline
            template: check-slo
        # Step 2: Kill a pod
        - - name: pod-kill
            template: pod-kill-experiment
        # Step 3: Verify SLO is still met
        - - name: verify-after-pod-kill
            template: check-slo
        # Step 4: Add network latency
        - - name: network-delay
            template: network-delay-experiment
        # Step 5: Verify latency SLO
        - - name: verify-after-delay
            template: check-latency-slo
        # Step 6: Clean up and final check
        - - name: final-verification
            template: check-slo
    - name: check-slo
      container:
        # The script needs curl, jq, and bc, which plain curl images do not ship
        image: alpine:3.19
        command:
          - /bin/sh
          - -c
          - |
            apk add --no-cache curl jq bc >/dev/null
            # Query Prometheus for the current SLI
            AVAILABILITY=$(curl -s "http://prometheus:9090/api/v1/query?query=sli:availability:ratio_rate5m" \
              | jq -r '.data.result[0].value[1]')
            echo "Current availability SLI: $AVAILABILITY"
            if [ "$(echo "$AVAILABILITY < 0.999" | bc -l)" -eq 1 ]; then
              echo "FAIL: Availability below SLO target"
              exit 1
            fi
            echo "PASS: Availability within SLO target"
    - name: check-latency-slo
      container:
        image: alpine:3.19
        command:
          - /bin/sh
          - -c
          - |
            apk add --no-cache curl jq bc >/dev/null
            LATENCY=$(curl -s "http://prometheus:9090/api/v1/query?query=sli:latency:ratio_rate5m" \
              | jq -r '.data.result[0].value[1]')
            echo "Current latency SLI: $LATENCY"
            # During chaos, we allow a slightly relaxed SLO
            if [ "$(echo "$LATENCY < 0.95" | bc -l)" -eq 1 ]; then
              echo "FAIL: Latency severely degraded during chaos"
              exit 1
            fi
            echo "PASS: Latency within acceptable range during chaos"
    - name: pod-kill-experiment
      container:
        image: litmuschaos/litmus-checker:latest
        # ... pod kill configuration
    - name: network-delay-experiment
      container:
        image: litmuschaos/litmus-checker:latest
        # ... network delay configuration
This workflow verifies that your service stays within SLO targets even while being subjected to chaos. If any verification step fails, you know you have a resilience gap to fix.
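The heart of each verification step is just a floating-point comparison of an SLI against a target. Here is a standalone sketch of that check you can run locally, using awk (available in any POSIX environment) instead of bc, so the container image needs nothing extra:

```shell
#!/bin/sh
# check_slo SLI TARGET -> exit 0 when the SLI meets or beats the target.
# awk handles the floating-point comparison, since plain sh only does integers.
check_slo() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s >= t) }'
}

check_slo "0.9995" "0.999" && echo "PASS" || echo "FAIL"   # PASS
check_slo "0.9312" "0.999" && echo "PASS" || echo "FAIL"   # FAIL
```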
Game days: structured chaos
A game day is a scheduled event where the team intentionally injects failures and practices incident response. It is like a fire drill, but everyone knows it is happening (mostly).
Here is how to plan and run a game day:
Before the game day (1 week ahead)
- Choose a date and time (during business hours, never on a Friday)
- Define the scenarios you want to test (2-3 per game day, no more)
- Notify stakeholders that things might break
- Assign roles: facilitator, chaos operator, observers
- Prepare the experiments (have the YAML files ready)
- Review runbooks for the scenarios you will test
Game day checklist template:
# game-days/2026-02-25-checklist.md
# Game Day: February 25, 2026
## Pre-game
- [ ] All participants confirmed
- [ ] Stakeholders notified
- [ ] Monitoring dashboards open
- [ ] Runbooks accessible
- [ ] Rollback procedures ready
- [ ] Communication channel created (#gameday-2026-02-25)
## Scenario 1: Pod failure recovery
- **Hypothesis**: Killing 1 of 3 tr-web pods should not cause any user-visible errors
- **Experiment**: `chaos/pod-kill.yaml`
- **Success criteria**: Availability SLI stays above 99.9%
- **Duration**: 10 minutes
- **Results**: [ PASS / FAIL ]
- **Notes**: ___
## Scenario 2: Database latency spike
- **Hypothesis**: 200ms extra latency to DB should trigger the latency SLO alert but not the availability alert
- **Experiment**: `chaos/network-delay.yaml`
- **Success criteria**: Latency alert fires within 5 minutes, app remains functional
- **Duration**: 15 minutes
- **Results**: [ PASS / FAIL ]
- **Notes**: ___
## Scenario 3: External dependency failure
- **Hypothesis**: GitHub API being unreachable should not affect blog page load times
- **Experiment**: `chaos/dns-failure.yaml`
- **Success criteria**: Blog pages load normally, only sponsor section is empty
- **Duration**: 10 minutes
- **Results**: [ PASS / FAIL ]
- **Notes**: ___
## Post-game
- [ ] All experiments cleaned up
- [ ] Systems back to steady state
- [ ] Game day retro completed
- [ ] Action items created as GitHub issues
- [ ] Results shared with the team
During the game day
- The facilitator keeps time and coordinates
- The chaos operator applies experiments
- Observers watch dashboards and logs (using the observability stack from article 3)
- The on-call engineer responds as if it were a real incident
- Everyone takes notes
After the game day
Run a retro (just like a postmortem but for the exercise). What worked? What did not? What surprised you? Create action items for anything that needs fixing.
Steady state validation with automated chaos
Once you are comfortable with game days, you can start running automated chaos experiments in production. This is the advanced level of chaos engineering.
The key is to tie chaos experiments to your SLO monitoring. If an experiment causes an SLO violation, it stops automatically:
# chaos/continuous-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: continuous-pod-kill
  namespace: default
spec:
  schedule: "0 */4 * * *" # Every 4 hours
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        app: tr-web
    duration: "30s"
Combine this with an Alertmanager silence that suppresses the chaos-related page alert, but still tracks the SLO impact:
# Only silence the page alert, not the SLO recording
# This way you can see the SLO impact without getting paged
amtool silence add --alertmanager.url=http://alertmanager:9093 \
  --author="chaos-bot" \
  --comment="Scheduled chaos experiment" \
  --duration="5m" \
  alertname="TrWebPodKilled"
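If a real incident starts while automated chaos is scheduled, you want to suspend the schedule without deleting it (and its history). Chaos Mesh supports pausing via an annotation; a sketch using the `continuous-pod-kill` Schedule from above:

```shell
# Pause the schedule during a real incident
kubectl annotate schedule continuous-pod-kill -n default \
  experiment.chaos-mesh.org/pause=true

# Resume once things are stable again (trailing dash removes the annotation)
kubectl annotate schedule continuous-pod-kill -n default \
  experiment.chaos-mesh.org/pause-
```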
Chaos engineering for Elixir/BEAM applications
The BEAM VM has some unique characteristics that affect chaos engineering:
Supervision trees handle many failures automatically. When you kill an Elixir process, the supervisor restarts it. This is great for resilience but means you need to test harder failures (like network partitions or resource exhaustion) to find real issues.
Hot code reloading can mask deployment issues. If your app uses hot code reloading in production, you should also test cold restarts.
Distribution (Erlang clustering) is sensitive to network issues. If your nodes are clustered (like our app with RELEASE_DISTRIBUTION=name), test what happens when nodes lose connectivity:
# chaos/cluster-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: beam-cluster-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
    fieldSelectors:
      metadata.name: tr-web-0
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: tr-web
      fieldSelectors:
        metadata.name: tr-web-1
    mode: all
  duration: "5m"
This partitions two nodes of your Erlang cluster. Does your app handle the netsplit gracefully? Does it recover when connectivity returns? These are important questions for clustered BEAM applications.
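You can watch the netsplit happen from inside the cluster itself. A sketch, assuming an Elixir release named `tr` (adjust the release and pod names to your app):

```shell
# Ask tr-web-0 who it is clustered with, via the release's rpc command
kubectl exec tr-web-0 -- bin/tr rpc "IO.inspect(Node.list())"

# During the partition, tr-web-1's node should disappear from the list;
# run the same command after the experiment ends to confirm it rejoins.
```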
What to test first
If you are just starting with chaos engineering, here is a prioritized list:
- Single pod failure: Can your service handle losing one instance? (This is the minimum)
- Dependency timeout: What happens when an external service responds slowly?
- DNS failure: Can your app handle name resolution failures gracefully?
- Resource exhaustion: What happens when you hit CPU or memory limits?
- Network partition: Can your service handle being cut off from a dependency?
- Disk pressure: What happens when disk space runs low?
- Clock skew: What happens when time drifts between nodes?
Start with #1 and work your way down. Each experiment should be repeated regularly, not just once.
Safety guardrails
Chaos engineering can go wrong if you are not careful. Here are non-negotiable safety rules:
- Always have a kill switch. Every experiment must be stoppable immediately.
- Start in staging. Never run a new experiment in production for the first time.
- Blast radius control. Affect one pod, not all pods. One service, not all services.
- Time-bounded. Every experiment has a duration. No open-ended chaos.
- Monitor continuously. If SLOs are violated beyond acceptable thresholds, abort.
- Business hours only (for manual experiments). Do not do game days on Fridays at 5pm.
- Communicate. Everyone who needs to know should know that chaos is happening.
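The kill switch deserves to be concrete. With Chaos Mesh, every active experiment is a custom resource in the cluster, so an emergency stop is one command (namespace and types shown match the experiments in this article):

```shell
# Emergency stop: remove every active chaos resource in the namespace
kubectl delete podchaos,networkchaos,stresschaos,dnschaos --all -n default
```

Keep this command in the runbook for every game day, not in someone's shell history.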
Putting it all together
Here is the chaos engineering maturity model:
- Level 0 - No chaos: You hope things work. (Most teams start here)
- Level 1 - Manual game days: Quarterly game days with pre-planned scenarios
- Level 2 - Automated chaos in staging: Regular chaos experiments run automatically in staging
- Level 3 - Automated chaos in production: Continuous chaos in production with SLO-based guardrails
- Level 4 - Chaos as CI: Chaos experiments run as part of your deployment pipeline
You do not need to reach Level 4 to get value. Even Level 1 (quarterly game days) will dramatically improve your team’s confidence and incident response speed.
Closing notes
Chaos engineering is not about breaking things for fun. It is about building confidence that your systems can handle the failures that will inevitably occur. Every experiment that passes tells you “this failure mode is handled.” Every experiment that fails tells you “fix this before a real failure finds it.”
The tools we covered (Chaos Mesh, Litmus, game day checklists) are all free and work well in Kubernetes. Start with a simple pod-kill experiment in staging and build from there. The hardest part is not the tooling; it is getting organizational buy-in to intentionally break things. But once you show the team the first bug you found through chaos, they will be convinced.
This wraps up our four-part SRE series. We went from measuring reliability (SLIs/SLOs) to responding to failures (incident management) to seeing what is happening (observability) to proactively finding weaknesses (chaos engineering). Together, these practices give you a solid foundation for running reliable systems.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here