What Chaos Engineers Actually Build
Chaos engineering work ranges from designing experiments and injecting faults to building the platforms that make resilience testing repeatable.
Experiment Design
Systematically testing resilience:
- Hypothesis formation — What should happen during failure?
- Blast radius control — Limiting experiment impact
- Steady state definition — Normal system behavior metrics
- Variable injection — Controlled failure introduction
- Result analysis — Did the system behave as expected?
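The five steps above can be sketched as a small experiment runner. This is a minimal illustration rather than the API of any real chaos platform: the `Experiment` class, the `run` function, the 5% blast radius cap, and the simulated two-replica system are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A chaos experiment. All field names here are hypothetical."""
    hypothesis: str
    steady_state: Callable[[], bool]  # True while metrics look normal
    inject: Callable[[], None]        # introduces the failure
    rollback: Callable[[], None]      # removes the failure
    blast_radius: float               # fraction of hosts or traffic affected


def run(exp: Experiment, max_blast_radius: float = 0.05) -> str:
    # Blast radius control: refuse experiments that are too broad.
    if exp.blast_radius > max_blast_radius:
        return "rejected: blast radius too large"
    # Steady state must hold before any failure is injected.
    if not exp.steady_state():
        return "aborted: system not in steady state"
    exp.inject()
    try:
        # Result analysis: did steady state hold during the failure?
        held = exp.steady_state()
    finally:
        exp.rollback()  # always remove the fault, even if the check raises
    return "hypothesis held" if held else "hypothesis refuted"


# Simulated system (an assumption for this sketch): the service counts as
# healthy while at least one of two replicas is up.
replicas = {"a": True, "b": True}

exp = Experiment(
    hypothesis="Losing one replica does not affect availability",
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
    blast_radius=0.01,
)
result = run(exp)
print(result)
```

In a real program the steady-state check would query production metrics instead of an in-memory dictionary, and the check would typically be sampled over a time window rather than once.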
Fault Injection
Breaking things on purpose:
- Server failures — Instance termination
- Network issues — Latency, packet loss, partitions
- Resource exhaustion — CPU, memory, disk
- Dependency failures — Database, cache, API outages
- Clock skew — Time-related failures
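As a rough illustration of dependency failures and latency, the sketch below injects faults at the application level with a decorator. Tools such as those listed later usually work at the infrastructure level instead, so treat this as a simplified stand-in. The names `inject_faults`, `InjectedFault`, `fetch_profile`, and `fetch_profile_with_fallback` are hypothetical.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def inject_faults(error_rate=0.0, latency_s=0.0, rng=None):
    """Wrap a dependency call with optional latency and random errors."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # simulated network latency
            if rng.random() < error_rate:
                raise InjectedFault(func.__name__)  # simulated outage
            return func(*args, **kwargs)
        return wrapper
    return decorator


# error_rate=1.0 makes every call fail, which simulates a full outage.
@inject_faults(error_rate=1.0)
def fetch_profile(user_id):
    return {"id": user_id}


def fetch_profile_with_fallback(user_id):
    # The behavior under test: degrade gracefully when the dependency fails.
    try:
        return fetch_profile(user_id)
    except InjectedFault:
        return {"id": user_id, "degraded": True}


print(fetch_profile_with_fallback(7))
```

The point of the experiment is the fallback path, not the fault itself: the injection only exists to confirm that callers degrade gracefully instead of failing outright.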
Platform Development
Tools for chaos:
- Chaos platforms — Experiment orchestration
- Failure libraries — Reusable failure injections
- Automated experiments — Continuous chaos
- Integration — CI/CD chaos testing
- Reporting — Experiment results and trends
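One way to picture the CI/CD integration and reporting pieces is a gate that summarizes experiment outcomes and fails the pipeline when any experiment did not pass. The sketch below assumes a hypothetical result format (a list of dictionaries with `name` and `outcome` keys) and a hypothetical `summarize` helper; real platforms define their own result schemas.

```python
from collections import Counter


def summarize(results):
    """Summarize experiment results and decide whether the gate passes.

    `results` is a list of dicts with 'name' and 'outcome' keys
    (a hypothetical format for this sketch).
    """
    counts = Counter(r["outcome"] for r in results)
    failed = [r["name"] for r in results if r["outcome"] != "passed"]
    return {"counts": dict(counts), "failed": failed, "gate_ok": not failed}


# Example results, as an automated experiment run might produce them.
results = [
    {"name": "instance-termination", "outcome": "passed"},
    {"name": "cache-outage", "outcome": "failed"},
    {"name": "network-latency", "outcome": "passed"},
]

report = summarize(results)
# A CI job would exit with this code so a failed experiment blocks the pipeline.
exit_code = 0 if report["gate_ok"] else 1
print(report, exit_code)
```

Storing each `report` over time is what makes trend reporting possible, for example tracking how often a given experiment regresses.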
Chaos Engineering Tools
Platforms
| Tool | Use Case |
|---|---|
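| Gremlin | Commercial enterprise chaos platform |
| Chaos Monkey | Netflix's original tool; randomly terminates instances |
| LitmusChaos | Kubernetes-native chaos experiments |
| Chaos Mesh | Cloud-native chaos experiments on Kubernetes |
| AWS FIS | Managed fault injection for AWS resources |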
Observability
- Monitoring: Datadog, Prometheus
- Tracing: Jaeger, Zipkin
- Logging: ELK, Splunk
- Alerting: PagerDuty, Opsgenie
Skills by Experience Level
Junior Chaos Engineer (0-2 years)
Capabilities:
- Run existing chaos experiments
- Analyze experiment results
- Monitor during experiments
- Document findings
- Support incident response
Learning areas:
- Experiment design
- Failure mode analysis
- Platform development
- System architecture
Mid-Level Chaos Engineer (2-5 years)
Capabilities:
- Design chaos experiments
- Build fault injection tools
- Analyze system weaknesses
- Improve resilience
- Conduct gamedays
- Mentor juniors
Growing toward:
- Architecture influence
- Chaos strategy
- Technical leadership
Senior Chaos Engineer (5+ years)
Capabilities:
- Architect chaos programs
- Lead resilience strategy
- Design complex experiments
- Influence system design
- Drive reliability culture
- Mentor teams
Focus areas across the career progression: curiosity and fundamentals, independence and ownership, architecture and leadership, strategy and org impact.
Interview Focus Areas
Technical Fundamentals
- "What is chaos engineering and how is it different from testing?"
- "How do you control blast radius during experiments?"
- "What makes a good chaos experiment hypothesis?"
- "How do you know when it's safe to run a chaos experiment?"
System Design
- "Design a chaos engineering program for a microservices platform"
- "How would you test database failover?"
- "Design an automated chaos testing pipeline"
Experience
- "Tell me about a chaos experiment that found a real problem"
- "How do you convince teams to adopt chaos engineering?"
- "How do you handle an experiment that causes unexpected impact?"
Common Hiring Mistakes
Hiring Pure Testers
Chaos engineering requires a deep understanding of distributed systems. Testers without systems experience usually struggle to design meaningful experiments. Look for an infrastructure or SRE background.
Ignoring Safety Focus
Chaos engineering done poorly can cause outages. Engineers need to understand blast radius control, gradual rollout, and when to stop an experiment. Evaluate candidates for a safety mindset.
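A strong candidate should be able to describe something like the following: expand an experiment in stages and abort as soon as a safety threshold is crossed. This is a minimal sketch with a simulated system. The function `ramp_experiment`, the stage fractions, and the 2% abort threshold are hypothetical choices for illustration.

```python
def ramp_experiment(stages, apply_fault, error_rate, abort_threshold, rollback):
    """Increase blast radius stage by stage; stop when errors exceed the threshold."""
    completed = []
    for fraction in stages:
        apply_fault(fraction)
        if error_rate() > abort_threshold:
            rollback()  # stop condition hit: remove the fault immediately
            return {"aborted_at": fraction, "completed": completed}
        completed.append(fraction)
    rollback()  # experiment finished: always clean up
    return {"aborted_at": None, "completed": completed}


# Simulated system (an assumption for this sketch): errors appear once
# more than 10% of hosts are affected.
state = {"fraction": 0.0}


def apply_fault(fraction):
    state["fraction"] = fraction


def rollback():
    state["fraction"] = 0.0


def error_rate():
    return 0.0 if state["fraction"] <= 0.10 else 0.08


outcome = ramp_experiment(
    stages=[0.01, 0.05, 0.25, 0.50],
    apply_fault=apply_fault,
    error_rate=error_rate,
    abort_threshold=0.02,
    rollback=rollback,
)
print(outcome)
```

Here the experiment completes the 1% and 5% stages, aborts at 25%, and never reaches 50%, which is the behavior a safety-minded engineer would expect to design for.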
Underestimating Culture Work
Technical skills aren't enough. Chaos engineers must convince teams to participate, document findings, and drive remediation. Communication matters.
Missing Incident Experience
Understanding how incidents happen helps design better experiments. Look for on-call or incident response experience.
Where to Find Chaos Engineers
High-Signal Sources
Chaos engineers typically come from SRE teams at companies with mature reliability practices. Netflix, Amazon, Google, and Microsoft alumni who have worked on reliability have direct exposure. Also look at engineers from chaos engineering platform companies such as Gremlin, and at contributors to open-source projects such as LitmusChaos.
Conference and Community
Chaos Conf (hosted by Gremlin) is specifically for chaos engineering practitioners. SRECon attracts reliability engineers who may have chaos experience. KubeCon has chaos engineering content for Kubernetes environments. O'Reilly Velocity conferences have covered chaos engineering extensively.
Company Backgrounds That Translate
- Cloud pioneers: Netflix, Amazon, Google, Microsoft (pioneered large-scale chaos and resilience testing practices)
- Financial services: Banks with resilience testing requirements
- Chaos platforms: Gremlin, Steadybit, Harness Chaos Engineering
- Cloud providers: AWS, GCP, Azure (fault injection service teams)
- High-availability companies: Stripe, Datadog, PagerDuty (reliability focus)
- Large SaaS: Salesforce, Twilio (enterprise reliability requirements)
Community Involvement
Chaos engineering has a strong community. Look for speakers at Chaos Conf, contributors to Chaos Monkey, LitmusChaos, or Chaos Mesh, and authors of chaos engineering content on engineering blogs.
Recruiter's Cheat Sheet
Resume Green Flags
- SRE or reliability background
- Distributed systems experience
- Chaos tool experience
- Incident response history
- Gameday facilitation
Resume Yellow Flags
- No reliability experience
- Only manual testing background
- No evidence of failure mode analysis
- No distributed systems knowledge
Technical Terms to Know
| Term | What It Means |
|---|---|
| Chaos Monkey | Netflix's original chaos tool |
| Blast radius | Impact scope of experiment |
| Gameday | Planned failure exercise |
| Steady state | Normal system behavior |
| Fault injection | Deliberately causing failures |
| Resilience | Ability to handle failures |