# Site Reliability Engineer
**Location:** Seattle, WA (Hybrid) · **Employment Type:** Full-time · **Level:** Senior
## About [Company]
[Company] is a fintech platform that processes $2B+ in transactions annually for enterprise clients. Our systems power real-time payments, fraud detection, and financial reporting for over 500 businesses across North America.
We're a team of 180 people who believe reliability is a feature, not an afterthought. When banks and enterprises depend on your platform for their critical financial operations, every second of downtime has real consequences.
**Why join [Company]?**
- Work on systems where reliability directly impacts millions of dollars in transactions
- Join a 6-person SRE team supporting 60 engineers who take production seriously
- Series C funded ($85M) with strong unit economics and path to profitability
- Competitive compensation with meaningful equity and on-call pay
## The Role
We're looking for a Site Reliability Engineer who approaches operations as a software engineering problem. This role is grounded in Google's SRE principles: you'll spend more time writing code to prevent incidents than responding to pages.
At [Company], SRE is an engineering discipline. Our SREs are software engineers who happen to focus on reliability—not sysadmins who learned to script. You'll write Go and Python daily, define SLOs that balance velocity with stability, and build automation that eliminates toil across our engineering organization.
**Our SRE philosophy:**
- 50% cap on operational work—the rest is engineering projects that improve reliability
- Error budgets as contracts—when services burn their budget, feature work pauses
- SLOs drive everything—we measure what matters to users, not vanity metrics
- Toil is the enemy—if a human does it more than twice, we automate it
- Shared production ownership—dev teams own their services; SRE provides expertise and tooling
## Objectives of This Role
- Improve our platform availability from 99.9% to 99.95% through systematic reliability engineering
- Reduce mean time to recovery (MTTR) by 40% through better automation and runbooks
- Eliminate at least 20 hours/month of toil across the engineering organization
- Establish SLO-based alerting that reduces alert noise by 50% (see the burn-rate sketch after this list)
- Mentor product engineers on reliability practices and production readiness
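
To make the SLO-based alerting objective concrete, here is a minimal, hypothetical sketch in Go of the multi-window burn-rate idea: page only when the error budget is being consumed quickly over both a long and a short window, so brief blips and slow, harmless burns stay quiet. The package, type, and function names and the window parameters are illustrative assumptions, not taken from our stack.

```go
package slo

import "time"

// Window pairs a lookback duration with a burn-rate threshold.
// A burn rate of 1.0 means the service is consuming its error budget
// exactly as fast as the SLO allows; higher values mean faster burn.
type Window struct {
	Duration time.Duration
	BurnRate float64
}

// ErrorRatioFunc returns the observed error ratio over a window.
// In a real system this would query the metrics backend; here it is a stub.
type ErrorRatioFunc func(window time.Duration) float64

// ShouldPage fires only when both windows are burning budget too fast:
// the long window shows the problem is sustained, the short window
// shows it is still happening right now.
func ShouldPage(slo float64, long, short Window, ratio ErrorRatioFunc) bool {
	budget := 1 - slo // e.g. 0.0005 for a 99.95% availability SLO
	longBurning := ratio(long.Duration) >= long.BurnRate*budget
	shortBurning := ratio(short.Duration) >= short.BurnRate*budget
	return longBurning && shortBurning
}
```

As a sense check under these assumptions: with a 30-day budget, a 1-hour window at a 14.4x burn rate corresponds to roughly 2% of the monthly budget consumed in a single hour, which is the kind of threshold that pages on real problems and ignores noise.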
## Responsibilities
- Design, implement, and maintain Service Level Objectives (SLOs) and error budgets across our platform
- Write code (primarily Go and Python) to automate operational tasks and eliminate toil
- Build and improve our observability stack—metrics, logging, distributed tracing, and alerting
- Participate in on-call rotation and lead incident response for production issues
- Conduct blameless postmortems and drive systemic improvements from incident learnings
- Perform production readiness reviews for new services and major changes
- Build self-service tooling that enables product teams to deploy and operate independently
- Implement reliability patterns: circuit breakers, retries with backoff, graceful degradation (a minimal retry sketch follows this list)
- Design and run chaos engineering experiments to validate system resilience
- Collaborate with product engineering teams on architecture decisions that impact reliability
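
As a small illustration of the reliability patterns named above, here is a hedged sketch of a retry helper with exponential backoff and full jitter in Go. It is not our implementation; the package name, signature, and limits are invented for the example.

```go
package retry

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// Do calls fn up to maxAttempts times, sleeping between failed attempts
// with exponential backoff plus full jitter. Jitter spreads retries out so
// an already-struggling dependency isn't hit by synchronized retry storms.
// base must be a positive duration.
func Do(ctx context.Context, maxAttempts int, base time.Duration, fn func(context.Context) error) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		lastErr = fn(ctx)
		if lastErr == nil {
			return nil
		}
		if attempt == maxAttempts-1 {
			break // no point sleeping after the final attempt
		}
		// Full jitter: sleep a random duration in [0, base * 2^attempt).
		backoff := base << attempt
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-ctx.Done():
			return errors.Join(ctx.Err(), lastErr)
		case <-time.After(sleep):
		}
	}
	return lastErr
}
```

A caller might wrap a flaky dependency as `retry.Do(ctx, 5, 100*time.Millisecond, callPaymentProcessor)` (the callee name is hypothetical); combined with a circuit breaker, this keeps a slow third-party processor from cascading into the rest of the platform.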
## Required Skills and Qualifications
**Software Engineering (This Comes First)**
- 5+ years of software engineering experience writing production code
- Strong proficiency in Go and/or Python—you write code daily, not occasionally
- Solid CS fundamentals: data structures, algorithms, concurrency, distributed systems
- Experience with software engineering best practices: code review, testing, CI/CD
- You've built systems that other engineers maintain, debug, and extend
**Reliability Engineering**
- Experience defining and implementing SLIs, SLOs, and error budgets
- Understanding of distributed systems challenges: consistency, failure modes, CAP theorem
- Familiarity with reliability patterns: circuit breakers, bulkheads, graceful degradation
- Experience with incident response, postmortem practices, and systemic improvement
**Infrastructure**
- Production experience with AWS (EC2, EKS, RDS, Lambda, CloudWatch)
- Hands-on experience with Terraform for infrastructure as code
- Container orchestration with Kubernetes in production environments
- Experience with observability stacks (Prometheus, Grafana, or similar)
## Preferred Skills and Qualifications
- Experience implementing SLO frameworks across an engineering organization
- Chaos engineering experience (Chaos Monkey, Litmus, Gremlin)
- Service mesh experience (Istio, Linkerd, or Envoy)
- Background in fintech, payments, or high-compliance environments
- Contributions to open-source infrastructure projects
- Public speaking or writing about SRE practices
- Experience at similar scale (thousands of RPS, sub-100ms latency requirements)
## Tech Stack
**Infrastructure**
- AWS (EKS, EC2, RDS Aurora, ElastiCache, SQS, Lambda)
- Kubernetes (EKS) with Helm charts
- Terraform for all infrastructure as code
- ArgoCD for GitOps deployments
- HashiCorp Vault for secrets management
**Observability**
- Prometheus + Thanos for metrics
- Grafana for dashboards and visualization
- Jaeger for distributed tracing
- Elasticsearch + Kibana for logging
- Custom SLO dashboards built on Grafana
**Incident Management**
- PagerDuty for on-call management
- Slack for incident communication
- Linear for incident tracking and action items
- Notion for postmortem documentation
**Languages**
- Go (primary for infrastructure tooling)
- Python (automation and scripting)
- TypeScript (internal tools and dashboards)
## Reliability Metrics
We believe in transparency. Here's how our systems actually perform:
**Current SLOs**
- Tier 1 (Payment Processing): 99.95% availability, <200ms p99 latency, 21.9 min/month error budget
- Tier 2 (Dashboard/Reporting): 99.9% availability, <500ms p99 latency, 43.8 min/month error budget
- Tier 3 (Internal Tools): 99.5% availability, <1s p99 latency, 3.6 hr/month error budget
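For reference, these error budgets follow directly from the availability targets: assuming an average month of about 30.44 days (roughly 43,834 minutes), a 99.95% target allows (1 − 0.9995) × 43,834 ≈ 21.9 minutes of downtime per month, and the other tiers follow the same arithmetic.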
**Actual Performance (Trailing 12 Months)**
- Tier 1 Availability: 99.93% (slightly under target; closing this gap is our top priority)
- MTTR (p50): 18 minutes (target: 15 minutes)
- MTTD (mean time to detect, p50): 3 minutes (target: 5 minutes)
- Change Failure Rate: 4.2% (target: <5%)
- Deployment Frequency: 25+ deploys/day
**Scale**
- 15,000 requests per second at peak
- 45 microservices in production
- 8TB of data processed daily
- Multi-region (us-east-1 primary, us-west-2 DR)
- 1:10 SRE to engineer ratio
## On-Call Expectations
We believe in complete transparency about on-call. Here's exactly what to expect:
**Rotation Structure**
- 1 week on-call, 5 weeks off (6-person rotation)
- Primary/secondary coverage model
- 15-minute response SLA for SEV1, 30-minute for SEV2
- Clear escalation path to engineering managers and executives
**Incident Volume (Trailing 6 Months Average)**
- 8-12 pages per on-call week
- 2-3 after-hours pages per week
- 1-2 SEV1 incidents per quarter
- Average incident duration: 35 minutes
**Common Incident Types**
- Database connection pool exhaustion (being addressed with connection pooling improvements)
- Third-party payment processor timeouts (implementing circuit breakers)
- Kubernetes pod scheduling delays during deployments (improving rollout strategy)
**How We Keep On-Call Sustainable**
- Every page that doesn't require human intervention becomes an automation project
- Mandatory postmortems with tracked action items—we fix root causes, not symptoms
- Comp time: take a day off after any week with 3+ after-hours incidents
- We maintain 1:10 SRE:engineer ratio to keep on-call burden manageable
**On-Call Compensation**
- $500/week base on-call pay
- $150 per after-hours incident response
- Comp time for disruptive weeks
## Compensation and Benefits
**Salary:** $180,000 - $220,000 (based on experience and location)
**On-Call Compensation:** ~$3,500-5,000/year additional (varies by incident volume)
**Equity:** 0.08% - 0.15% (4-year vest, 1-year cliff)
**Benefits:**
- Medical, dental, and vision insurance (premiums covered 100% for employees, 75% for dependents)
- Unlimited PTO, with a minimum of 15 days per year encouraged
- $3,000 annual learning budget (conferences, courses, certifications)
- $2,000 home office setup allowance
- 401(k) with 4% company match
- 16 weeks paid parental leave
- Flexible hybrid work (2-3 days in office, your choice)
## Interview Process
Our process assesses SRE-specific skills, not generic algorithm puzzles, and typically takes 2-3 weeks end to end.
- **Step 1: Recruiter Screen** (30 min) - We'll discuss your background, interests, and on-call expectations upfront.
- **Step 2: Technical Screen** (60 min) - SRE experience, reliability thinking, and past incidents you've handled.
- **Step 3: System Design for Reliability** (60 min) - Design a reliable system at scale with SLOs and failure modes.
- **Step 4: Coding & Automation** (60 min) - Write code to solve a real operations problem. Go or Python, not LeetCode.
- **Step 5: Incident Simulation** (45 min) - Walk through a realistic incident scenario with debugging and communication.
- **Step 6: Team & Values** (45 min) - Collaboration style, blameless culture fit, and reliability philosophy.
We don't do LeetCode-style puzzles, whiteboard coding, or gotcha questions about obscure Linux trivia.
## Equal Opportunity
[Company] is an equal opportunity employer. We believe diverse teams build more reliable systems—different perspectives help us anticipate failure modes and design for a wider range of users.
We do not discriminate based on race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic.
If you need accommodations during the interview process, let us know.
---
*Ready to build systems where reliability is a feature, not an afterthought? Apply now and help us define what it means to be truly reliable at scale.*