Kubernetes Chaos Engineering with kubeqa: Breaking Clusters to Build Resilience
Learn how to use kubeqa's chaos engineering module to inject controlled failures into Kubernetes clusters - pod kills, network partitions, CPU stress, and node drains - with steady-state validation and blast-radius controls.
Your Kubernetes cluster works fine on a Tuesday morning. Pods are healthy, traffic flows, dashboards are green. But what happens when a node disappears? When a network partition isolates your database from your API layer? When a rogue deployment consumes all available CPU on a critical node?
If you have not tested for those scenarios, you do not know the answer. And the first time you find out should not be during a production incident at 2 a.m.
Kubernetes chaos engineering is the practice of injecting controlled failures into your cluster to discover weaknesses before they become outages. kubeqa integrates chaos testing directly alongside health scanning, compliance auditing, and deployment gates - so resilience testing is not an afterthought bolted onto your workflow. It is part of the same quality pipeline.
Why chaos engineering matters for Kubernetes
Kubernetes is designed for self-healing. Pods restart, deployments roll back, horizontal autoscalers add capacity. But those mechanisms only work when everything is configured correctly. In practice, most clusters have hidden fragility:
- Single-replica deployments that cause downtime when a pod restarts
- Missing pod disruption budgets that let node drains kill too many pods at once
- Network policies that fail open during partial failures
- Resource limits that are set too tight for spike scenarios
- Health probes that pass in steady state but fail under load
Traditional monitoring catches these problems after they cause an incident. Chaos engineering catches them before.
The core principle is simple: define your steady state, inject a failure, observe the impact, and verify recovery. The cluster either handles the failure gracefully or it does not. Either way, you learn something.
kubeqa chaos: how it works
kubeqa’s chaos module follows a structured workflow for every experiment:
- Steady-state validation - verify that the target workload is healthy before injecting failure
- Failure injection - apply the specified disruption with blast-radius controls
- Observation - monitor the impact on the workload and dependent services
- Recovery verification - confirm that the system returns to steady state within the expected window
- Scoring - assign a resilience score based on recovery time, error rate, and degradation
Every experiment runs with configurable safety controls. You define the blast radius, the maximum duration, and the rollback criteria. If the experiment exceeds your safety thresholds, kubeqa automatically rolls back the injection.
$ kubeqa chaos run pod-failure \
--namespace production \
--deployment api-gateway \
--count 1 \
--duration 60s \
    --abort-on 'error-rate>50%'
[1/5] Validating steady state...
✓ api-gateway: 3/3 pods ready, 0 restarts, p99 latency 42ms
[2/5] Injecting failure: killing 1 pod...
✓ Pod api-gateway-7b9f4 terminated
[3/5] Observing impact (60s window)...
- Error rate: 0.3% (within threshold)
- p99 latency: 89ms (degraded but acceptable)
- Dependent services: no cascading failures
[4/5] Verifying recovery...
✓ Replacement pod scheduled in 1.2s
✓ Pod ready in 6.8s
✓ Steady state restored in 8.0s
[5/5] Resilience score: 4/5
- Recovery time: 8.0s (good, <15s target)
- Error rate during failure: 0.3% (excellent)
- No cascading impact detected
Five chaos experiments every cluster should run
1. Pod failure recovery
The most basic experiment: kill a pod and measure how fast Kubernetes replaces it. This tests your replica count, readiness probes, and deployment configuration.
$ kubeqa chaos run pod-failure --namespace production --deployment payments --count 1
What you are looking for: recovery time under 15 seconds, no user-facing errors during the transition, and no cascading failures to dependent services.
Common findings: deployments running with replicas: 1 where the team assumed Kubernetes would handle it. A single-replica deployment means every pod restart is a service outage. kubeqa flags single-replica deployments in the health scan too, but chaos testing proves the impact.
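The fix is small and declarative. A sketch of a multi-replica deployment with a readiness probe, reusing the payments service from the example (the image name and health endpoint are illustrative, not real artifacts):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  namespace: production
spec:
  replicas: 3                  # at least 2, so a single pod restart is never an outage
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.4.2   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:      # gate traffic on actual readiness, not just "running"
            httpGet:
              path: /healthz   # hypothetical health endpoint
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
```

With this in place, the pod-failure experiment should show requests shifting to the two surviving replicas while the replacement pod comes up.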
2. Network partition
Simulate a network split between two services. This reveals whether your services handle connection timeouts gracefully or cascade into a broader failure.
$ kubeqa chaos run network-partition \
--source payments \
--target database \
--namespace production \
--duration 30s
What you are looking for: the source service should degrade gracefully with circuit breakers, retry logic, or cached responses rather than crashing or returning 500 errors to users.
Common findings: services that lack circuit breaker patterns and simply retry indefinitely, eventually exhausting connection pools and crashing. Most teams discover their timeout configurations are either missing or set far too high.
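If you run a service mesh such as Istio, you can add a circuit breaker at the mesh layer without code changes. A sketch of a DestinationRule for the database service from the example above (the limits and thresholds are illustrative starting points, not recommendations):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: database-circuit-breaker
  namespace: production
spec:
  host: database                     # service name from the example above
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap the pool so runaway retries cannot exhaust it
        connectTimeout: 2s           # fail fast instead of hanging on a dead peer
    outlierDetection:
      consecutive5xxErrors: 5        # eject an endpoint after repeated errors
      interval: 10s
      baseEjectionTime: 30s
```

Re-running the network-partition experiment after adding a policy like this is a direct way to verify the circuit breaker actually trips.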
3. CPU stress
Inject CPU pressure on a specific pod to simulate resource contention. This tests your resource limits, horizontal pod autoscaler (HPA) configuration, and priority classes.
$ kubeqa chaos run cpu-stress \
--namespace production \
--deployment inference-api \
--cores 2 \
--duration 120s
What you are looking for: the HPA should detect increased CPU utilization and scale up additional pods. Existing pods should continue serving requests at higher latency rather than being OOMKilled or evicted.
Common findings: teams that set CPU limits equal to CPU requests, leaving zero headroom for spikes. Or HPAs configured with a scale-up stabilization window (behavior.scaleUp.stabilizationWindowSeconds) of 300 seconds, meaning the autoscaler waits five minutes before adding capacity in response to load.
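One way to address the slow-reaction finding, sketched as an autoscaling/v2 HPA for the inference-api deployment from the example (the replica counts and utilization target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # scale before pods are fully saturated
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # react to load spikes immediately
```

The headroom finding is fixed in the pod spec itself: set CPU requests below limits (for example, requests of 500m against a limit of 1 CPU) so a pod can absorb a spike while the HPA catches up.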
4. Node drain
Simulate a node going down for maintenance. This tests your pod disruption budgets (PDBs), pod anti-affinity rules, and scheduling constraints.
$ kubeqa chaos run node-drain \
--node worker-3 \
--grace-period 30s
What you are looking for: pods should reschedule to other nodes without violating availability requirements. Pod disruption budgets should prevent draining more pods than your service can afford to lose.
Common findings: no PDBs defined at all, meaning a node drain can simultaneously evict every replica of a service. Or anti-affinity rules that are set to preferredDuringSchedulingIgnoredDuringExecution when they should be required, allowing all replicas to land on the same node.
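Both findings have small, declarative fixes. A sketch, reusing the payments labels from the earlier examples (set minAvailable to what your service can actually afford to keep serving):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
  namespace: production
spec:
  minAvailable: 2            # a drain may never take the service below two ready pods
  selector:
    matchLabels:
      app: payments
---
# Fragment for the Deployment's pod template: require replicas on separate nodes,
# so one node drain cannot touch more than one replica.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payments
        topologyKey: kubernetes.io/hostname
```

Note that required anti-affinity is a trade-off: with it, a three-replica service needs three schedulable nodes, so pair it with enough cluster capacity.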
5. DNS failure
Inject DNS resolution failures to test how services behave when they cannot resolve downstream dependencies.
$ kubeqa chaos run dns-failure \
--namespace production \
--target-service external-api.partner.com \
--duration 60s
What you are looking for: the service should fall back to cached responses, serve degraded functionality, or return a meaningful error - not hang indefinitely waiting for a DNS response.
Common findings: applications that treat DNS resolution as infallible. When CoreDNS experiences a brief disruption or an external DNS target becomes unreachable, the application hangs until its own health probes fail and Kubernetes restarts the pod - which then hangs again.
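You can at least cap how long the resolver itself blocks with dnsConfig options in the pod spec; these map to the resolv.conf timeout and attempts options, and the application still needs its own request timeouts on top. A sketch of the relevant fragment as it would appear in a Deployment's pod template (container details are illustrative):

```yaml
# Fragment of a pod template spec
spec:
  dnsConfig:
    options:
      - name: timeout     # seconds per resolution attempt (resolv.conf option)
        value: "1"
      - name: attempts    # attempts before resolution fails
        value: "2"
  containers:
    - name: payments      # hypothetical container from the earlier examples
      image: registry.example.com/payments:1.4.2
```

With this, a DNS outage surfaces as a fast, handleable error after roughly two seconds instead of a multi-second hang per lookup.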
Building a chaos engineering practice
Running one-off chaos experiments is useful for finding bugs. Building a continuous chaos practice is what actually makes your cluster resilient over time.
Start with game days
Schedule a monthly “game day” where the team runs chaos experiments together and discusses the findings. This builds organizational muscle for incident response and creates a culture where breaking things in controlled environments is expected.
# Run a full chaos suite and generate a report
$ kubeqa chaos suite run --config chaos-gameday.yaml --report markdown
Running 8 experiments across 4 namespaces...
✓ pod-failure/api-gateway: 4/5 resilience
✓ pod-failure/payments: 5/5 resilience
✗ network-partition/db: 2/5 resilience (no circuit breaker)
✓ cpu-stress/inference: 4/5 resilience
✗ node-drain/worker-3: 1/5 resilience (no PDB)
...
Overall cluster resilience: 3.2/5
Report saved: chaos-gameday-2026-03-01.md
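The suite config schema is not shown in this post. As a rough sketch only, a chaos-gameday.yaml mirroring the CLI flags used in the examples above might look like this - the field names here are assumptions, not documented kubeqa syntax, so check the kubeqa documentation for the real schema:

```yaml
# chaos-gameday.yaml - illustrative sketch; field names mirror the CLI
# flags shown earlier and are NOT documented kubeqa syntax.
experiments:
  - type: pod-failure
    namespace: production
    deployment: api-gateway
    count: 1
    duration: 60s
    abort-on: error-rate>50%
  - type: network-partition
    namespace: production
    source: payments
    target: database
    duration: 30s
  - type: node-drain
    node: worker-3
    grace-period: 30s
```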
Integrate into CI/CD
Once your cluster can survive manual experiments, automate them. kubeqa’s chaos module integrates with your CI/CD pipeline to run resilience checks on every deployment to staging.
# .github/workflows/chaos.yaml (steps fragment)
      - name: Deploy to staging
        run: kubectl apply -f manifests/ --namespace staging
      - name: Run chaos experiments
        uses: nomadx-ae/kubeqa-action@v1
        with:
          command: chaos suite run
          config: .kubeqa/chaos-ci.yaml
          fail-on: resilience-score<3
This creates a resilience gate - deployments that reduce the cluster’s resilience score below your threshold get blocked before they reach production.
Track resilience over time
The most valuable metric in chaos engineering is not a single experiment result - it is the trend. kubeqa tracks your resilience score over time, so you can see whether your cluster is getting more resilient or less.
$ kubeqa chaos history --last 30d
Date         Score   Experiments   Failures
2026-02-01   2.8/5   12            5
2026-02-15   3.2/5   15            4
2026-03-01   3.8/5   18            2
Trend: ↑ improving (+1.0 over 30 days)
From chaos to confidence
Chaos engineering is not about breaking things for the sake of it. It is about building justified confidence that your Kubernetes cluster can handle the failures that will inevitably occur in production.
kubeqa brings chaos engineering into the same workflow as health scanning, compliance auditing, and deployment gates. You do not need a separate tool, a separate team, or a separate process. You run kubeqa chaos run, observe the results, fix the weaknesses, and track your progress.
The clusters that survive real incidents are the ones that practiced for them.
Ready to test your cluster’s resilience? Install kubeqa with brew install nomadx-ae/tap/kubeqa and run your first chaos experiment in under five minutes. Star the project on GitHub and join the kubeqa Discord to share your findings with the community.