What salary do chaos engineers expect?

US market, 2026: junior $100K-$140K, mid-level $140K-$175K, senior $160K-$210K. Chaos engineering combines reliability engineering with specialized experimentation skills. Financial services and high-scale tech companies pay at the top of these ranges.

Chaos Engineer vs SRE—what's the difference?

Chaos engineers focus specifically on proactive failure testing and resilience. SREs have broader scope: reliability, automation, incident response, capacity planning. Chaos engineering is often part of SRE but can be a dedicated specialty.

Do we need dedicated chaos engineers?

Depends on scale and criticality. Small teams might have SREs run chaos experiments. Large organizations or high-stakes systems (financial, healthcare) benefit from dedicated chaos engineers who can build programs and tools.

Is chaos engineering risky?

Done properly, no—it prevents larger incidents. The key is controlled experiments: good observability, blast radius control, kill switches, and starting small. Poorly done chaos engineering can cause outages, which is why expertise matters.
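To make "controlled experiment" concrete, here is a minimal Python sketch of the safety pattern described above. It is an illustration only, not how any particular chaos platform works: `run_experiment`, `pick_targets`, and the `inject`, `rollback`, and `steady_state_ok` callbacks are hypothetical names invented for this example. It limits the blast radius to a fraction of hosts, refuses to start if the system is already unhealthy, and acts as a kill switch by rolling back as soon as a steady-state check fails.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    targets: List[str]   # hosts that received the fault
    aborted: bool        # True if a steady-state check failed
    checks_run: int      # checks performed after fault injection


def pick_targets(hosts: List[str], blast_radius: float, rng: random.Random) -> List[str]:
    """Blast radius control: touch only a fraction of hosts (at least one)."""
    count = max(1, int(len(hosts) * blast_radius))
    return rng.sample(hosts, count)


def run_experiment(
    hosts: List[str],
    blast_radius: float,
    inject: Callable[[str], None],       # hypothetical: starts the fault on a host
    rollback: Callable[[str], None],     # hypothetical: removes the fault from a host
    steady_state_ok: Callable[[], bool], # hypothetical: observability check
    max_checks: int = 5,
    seed: int = 0,
) -> ExperimentResult:
    # Never start an experiment on a system that is already unhealthy.
    if not steady_state_ok():
        return ExperimentResult(targets=[], aborted=True, checks_run=0)

    targets = pick_targets(hosts, blast_radius, random.Random(seed))
    injected: List[str] = []
    aborted = False
    checks_run = 0
    try:
        for host in targets:
            inject(host)
            injected.append(host)
        for _ in range(max_checks):
            checks_run += 1
            if not steady_state_ok():
                aborted = True  # kill switch: stop checking and roll back
                break
    finally:
        # Always undo the fault, even if injection or a check raised.
        for host in injected:
            rollback(host)
    return ExperimentResult(targets=injected, aborted=aborted, checks_run=checks_run)
```

Starting small here means a low `blast_radius` (for example `0.1`) in a non-production environment, raised only after earlier runs finish without an abort. A real program would also poll metrics on a timer and page a human on abort, which this sketch leaves out.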

Hiring Chaos Engineers: The Complete Guide

| Tool | Use Case |
| --- | --- |
| Gremlin | Enterprise chaos platform |
| Chaos Monkey | Netflix original |
| LitmusChaos | Kubernetes native |
| Chaos Mesh | Cloud-native chaos |
| AWS FIS | AWS fault injection |

| Term | What It Means |
| --- | --- |
| Chaos Monkey | Netflix's original chaos tool |
| Blast radius | Impact scope of experiment |
| Gameday | Planned failure exercise |
| Steady state | Normal system behavior |
| Fault injection | Deliberately causing failures |
| Resilience | Ability to handle failures |

Chaos Engineers

A developer-approved template. Customize the [PLACEHOLDERS].


# Chaos Engineer

**Location:** San Francisco, CA (Hybrid) · **Employment Type:** Full-time · **Level:** Mid-Senior

## About [Company]

[Company] is a fintech platform processing $50B+ in annual transactions for 100,000+ businesses. Our systems must be reliable—every minute of downtime costs money, erodes trust, and impacts our customers' operations.

We've built a mature reliability program, but we know we can do better. Chaos engineering is how we proactively find weaknesses before they become outages.

**Why join [Company]?**

- Build resilience for systems that really matter—$50B+ annual transactions
- Proactive reliability work, not just fighting fires
- Mature engineering culture that invests in reliability
- Join a 5-person reliability team within a 200-person engineering org
- Series D company, profitable, with strong enterprise customer base
- Competitive compensation with meaningful equity

## The Role

We're looking for a Chaos Engineer to help us proactively find and fix system weaknesses. You'll design chaos experiments, build fault injection infrastructure, conduct gamedays, and drive reliability improvements across our distributed systems.

This isn't just running chaos tools—it's building a reliability culture. You'll work with teams across engineering to understand their systems, design meaningful experiments, and help them build more resilient services.

The ideal candidate has deep distributed systems knowledge, understands failure modes, and can communicate effectively about reliability. You're excited about breaking things safely so we can fix them before they break in production.

## What You'll Work On

### First 30 Days
- Onboard to our architecture, reliability program, and incident history
- Review our most critical systems and existing resilience mechanisms
- Shadow existing chaos experiments and gamedays

### First 90 Days
- Design and execute chaos experiments for critical systems
- Contribute to our fault injection infrastructure
- Conduct your first gameday exercise
- Build relationships with teams across engineering

### First Year
- Lead chaos engineering practice across the company
- Build automated chaos testing in CI/CD
- Develop chaos engineering best practices and training
- Measurably improve system resilience metrics

## Objectives of This Role

- Identify and remediate system weaknesses before they cause outages
- Reduce mean-time-to-recovery through improved resilience
- Build chaos testing into development workflows
- Conduct regular gamedays for critical systems
- Improve reliability culture and practices across engineering

## Responsibilities

**Chaos Engineering (50%)**
- Design and execute chaos experiments for distributed systems
- Identify meaningful failure scenarios based on system architecture
- Run experiments safely with proper blast radius control
- Analyze results and drive remediation of weaknesses
- Build automated chaos testing for CI/CD integration

**Gamedays & Exercises (25%)**
- Plan and facilitate gameday exercises
- Design realistic failure scenarios for practice
- Lead incident response exercises
- Document learnings and drive improvements
- Train teams on chaos engineering practices

**Infrastructure & Tooling (25%)**
- Build and maintain fault injection infrastructure
- Develop experiment orchestration and analysis tools
- Improve observability for chaos experiments
- Create dashboards for resilience metrics
- Integrate chaos tools with existing infrastructure

## Required Skills and Qualifications

**Technical Skills**
- 5+ years of software engineering experience
- 2+ years in SRE, reliability engineering, or chaos engineering
- Strong distributed systems understanding (failure modes, consistency, availability)
- Experience with observability tools (metrics, logs, traces)
- Incident response and post-mortem experience
- Proficiency in Python, Go, or similar languages

**Chaos Knowledge**
- Understanding of chaos engineering principles and blast radius control
- Experience with fault injection techniques
- Familiarity with chaos platforms (Gremlin, LitmusChaos, or similar)

**Soft Skills**
- Excellent communication for cross-team collaboration
- Ability to explain risks and tradeoffs to non-reliability engineers
- Judgment for safe experiment design
- Leadership for driving reliability improvements

## Preferred Skills and Qualifications

- Financial services or regulated industry experience
- Chaos platform experience (Gremlin, LitmusChaos, Chaos Mesh)
- Gameday facilitation and training experience
- Background in formal incident management
- Experience building reliability programs from scratch
- Public speaking or writing about reliability

## Tech Stack

**Infrastructure:** Kubernetes (EKS), AWS services, Terraform

**Observability:** Datadog, PagerDuty, custom dashboards

**Chaos Tools:** Gremlin, custom fault injection, AWS Fault Injection Service (FIS)

**Languages:** Python for tooling, Go for infrastructure

**CI/CD:** GitHub Actions, ArgoCD

## Compensation and Benefits

**Salary:** $150,000 - $200,000 (based on experience)

**Equity:** 0.04% - 0.12% (4-year vest, 1-year cliff)

**Benefits:**

*Health & Wellness*
- Medical, dental, and vision insurance (100% covered)
- $100/month wellness stipend
- Mental health support through Lyra

*Time Off*
- Unlimited PTO with 15-day minimum encouraged
- 11 paid company holidays
- 16 weeks paid parental leave

*Professional Development*
- $2,500 annual learning budget
- Conference attendance (SRECon, Chaos Conf)
- Internal reliability and chaos training

*Financial*
- 401(k) with 4% company match
- Equity refresh grants annually

*Workspace*
- $1,500 home office setup allowance
- Hybrid work (2-3 days in SF office)
- MacBook Pro and equipment of choice

## Engineering Culture

**How we work:**
- Proactive reliability—find problems before they find us
- Blameless post-mortems focused on system improvements
- Collaboration across teams on reliability improvements
- Regular gamedays and failure exercises
- Reliability metrics that matter to the business

**What we value:**
- System resilience over heroic incident response
- Learning from failures safely
- Building reliability culture, not just running tools
- Clear communication about risks and tradeoffs
- Work-life balance (on-call is rotational and well-supported)

## Interview Process

Our interview process typically takes 2 weeks.

- **Step 1: Recruiter Screen** (30 min)
  Background discussion and role overview.

- **Step 2: Technical Screen** (60 min)
  Distributed systems and reliability fundamentals.

- **Step 3: Chaos Deep Dive** (90 min)
  Technical discussion about chaos engineering, failure modes, and your experience.

- **Step 4: System Design** (60 min)
  Design a resilience improvement for a distributed system.

- **Step 5: Incident Discussion** (45 min)
  Walk through an incident you've handled and what you learned.

- **Step 6: Team Interviews** (2 x 30 min)
  Meet potential teammates and stakeholders.

---

*[Company] is an equal opportunity employer.*

JD Tips

- Clear business context (fintech, high stakes)
- Culture building mentioned
- Communication skills emphasized
- Don't treat chaos as just breaking things
- Don't ignore organizational aspects

Chaos Engineers

daily.dev Hiring Academy • Recruiter's Cheat Sheet

Market Pulse

Senior Range: $160K-$210K
Current Demand: High
Avg Time to Hire: 4-6 weeks
Market Trend: +25%

Critical Skills (Must Haves)

Distributed systems
Observability
Experiment design
Fault injection
Incident response

Nice-to-Have (Bonus)

Chaos platforms
SRE background
Culture building

Quick Context

Chaos engineers build systems that test resilience by deliberately introducing failures, in settings that range from Netflix's Chaos Monkey to enterprise reliability programs. Senior chaos engineers command $160K-$210K in the US. Look for candidates with distributed systems experience, an understanding of failure modes, and the ability to design controlled experiments that improve reliability.

Common Mistakes

✗ Over-indexing on years vs. ability
✗ Testing trivia, not problem-solving
✗ Slow response times

Interview Tips

✓ Keep screens under 30 min
✓ Share structure upfront
✓ Allow 10+ min for Q&A
✓ Respond within 48 hours

Red Flags

"Just run Chaos Monkey." No support for fixing found issues. Blaming chaos engineers for experiment-caused incidents. Treating chaos as a checkbox, not a practice. Chaos engineers need real organizati...
