What SREs Actually Do
Site Reliability Engineering encompasses several core responsibilities that blend software engineering with operations expertise.
A Day in the Life
Reliability Engineering (30-40%)
- SLI/SLO/SLA management - Defining service level indicators (what to measure), objectives (targets), and agreements (commitments)
- Error budget management - Tracking reliability spend and making trade-offs between velocity and stability
- Incident response - On-call rotation, incident command, and leading production investigations
- Post-mortems - Blameless analysis of incidents to prevent recurrence
- Reliability improvements - Adding circuit breakers, retries, graceful degradation
Infrastructure & Platform (25-35%)
- Infrastructure as code - Terraform, CloudFormation, Kubernetes manifests, Pulumi
- Observability - Building and maintaining monitoring, logging, tracing systems (Prometheus, Grafana, Datadog, OpenTelemetry)
- CI/CD pipelines - Deployment automation, canary releases, rollback mechanisms
- Capacity planning - Forecasting growth, scaling strategies, cost optimization
Automation & Tooling (20-30%)
- Toil elimination - Automating repetitive operational tasks through code
- Self-service platforms - Building tools that let developers deploy and operate independently
- Runbooks and automation - Turning manual procedures into automated responses
- Developer experience - Making it easier for engineers to build reliable systems
Software Engineering (10-20%)
- Production code contributions - Some SREs contribute directly to product codebases
- Infrastructure code - Writing maintainable, tested infrastructure automation
- Code reviews - Reviewing both infrastructure and application changes for reliability concerns
- Architecture guidance - Influencing system design for scalability and fault tolerance
Understanding Google's SRE Model
Google pioneered Site Reliability Engineering in 2003, and their approach has become the industry standard. Understanding these concepts is essential for evaluating SRE candidates.
Error Budgets: The Core Innovation
The error budget is the defining concept of SRE. Here's how it works:
If your SLO is 99.9% availability, your error budget is 0.1% - that's 43.8 minutes of downtime per month you can "spend."
This transforms the reliability conversation. Instead of ops teams demanding "more stability" while product teams push for "more features," both teams share a quantitative budget. When you're within budget, ship features. When you're burning budget too fast, pause and fix reliability.
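The budget arithmetic itself is simple; a short sketch (using 43,800 minutes, i.e. 730 hours, as an average month):

```python
def error_budget_minutes(slo, window_minutes=43_800):
    """Downtime a service may 'spend' per window while still meeting its SLO.

    slo is a fraction (e.g. 0.999 for 99.9%); the default window of
    43,800 minutes approximates one month (730 hours).
    """
    return (1 - slo) * window_minutes

# 99.9% availability leaves roughly 43.8 minutes of downtime per month
print(round(error_budget_minutes(0.999), 1))
```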
Error budgets create alignment because:
- Development teams can ship faster when reliability is good
- SRE teams have data to justify reliability work
- Leadership can see trade-offs quantitatively
Service Level Objectives (SLOs)
SLOs define "good enough" reliability. This is counterintuitive—why not aim for 100%? Because:
- 100% is impossible - Beyond certain nines, you're fighting physics
- 100% is wasteful - Going from 99.9% to 99.99% costs 10x but users barely notice
- 100% slows you down - Over-engineering reliability prevents shipping features
SLOs force explicit decisions: "Our users need 99.95% availability for payments, but 99% is fine for reporting." This clarity helps prioritize engineering effort.
SLIs: Measuring What Matters
Service Level Indicators (SLIs) are the metrics that feed into SLOs. Good SLIs measure user experience, not server metrics:
- Availability: Successful requests / total requests
- Latency: Percentage of requests faster than threshold (e.g., p99 < 200ms)
- Quality: Percentage of responses that are correct/complete
- Freshness: How recent is the data users see?
Bad SLIs (don't measure these for SLOs):
- CPU usage, memory utilization (server metrics, not user experience)
- Uptime (binary—doesn't capture partial degradation)
- Incident count (measures process, not user impact)
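The good SLIs above reduce to simple ratios over user-facing requests. A minimal sketch, with function names invented for illustration:

```python
def availability_sli(successful_requests, total_requests):
    """Availability as the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means no observed failures
    return successful_requests / total_requests

def latency_sli(latencies_ms, threshold_ms=200):
    """Fraction of requests faster than a threshold (e.g. a p99 < 200ms target)."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)
```

Both are computed from request outcomes, not host metrics, which is what makes them suitable SLIs: they move when users are actually affected.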
SRE vs DevOps: Know the Difference
This distinction matters for hiring. DevOps and SRE overlap but have different emphases.
| Aspect | DevOps Engineer | Site Reliability Engineer |
|---|---|---|
| Primary Focus | CI/CD, deployment automation | Reliability, error budgets, SLOs |
| Background | Often ops → coding | Often software engineering → ops |
| Toil Attitude | Automate deployments | Automate everything (50% cap on toil) |
| Success Metric | Deployment frequency | Error budget remaining |
| Incident Response | Deploys fixes | Leads incident command, drives root cause |
| Code Output | Pipeline code, scripts | Production code + infrastructure code |
When to hire DevOps vs SRE:
- DevOps: You need better deployment pipelines, CI/CD automation, developer tooling
- SRE: You need reliability engineering—SLOs, error budgets, production resilience
Many companies use the titles interchangeably, but the skill sets differ. SREs typically have stronger software engineering backgrounds and think systematically about reliability as a feature.
Career Progression
- Curiosity & fundamentals
- Independence & ownership
- Architecture & leadership
- Strategy & org impact
SRE Archetypes: Know What You Need
Software-Focused SRE
- Strong software engineering background, often ex-backend engineers
- Writes production code and infrastructure code fluently
- Common at companies where SREs embed in product teams
- Risk: May lack deep operations or incident response experience
Infrastructure-Focused SRE
- Deep operations and infrastructure expertise
- Focuses on platform, tooling, and automation
- Common at companies with centralized SRE teams
- Risk: May lack software engineering depth for complex automation
Incident Response Specialist
- Expert at debugging production issues under pressure
- Strong on-call leadership and incident command skills
- Common at companies with complex, distributed systems
- Risk: May focus on reactive work over proactive improvements
Platform Builder
- Focuses on building self-service reliability tooling
- Enables product teams to operate independently
- Common at larger companies with mature SRE practices
- Risk: May lose touch with production firefighting
Be explicit about which archetype you need. A platform builder won't thrive in a role that's 80% on-call firefighting.
Where to Find SREs
Software Engineers Interested in Production
Backend engineers who've been on-call, debugged production incidents, or built observability tooling often make excellent SREs. They understand how applications work, which makes them better at making applications reliable.
Why they work: Software engineering foundation, understand developer pain
Watch out for: May lack deep infrastructure or networking knowledge
DevOps Engineers Ready to Level Up
Experienced DevOps engineers who want more software engineering responsibility and are frustrated by purely operational work.
Why they work: Infrastructure expertise, production experience
Watch out for: May struggle with software engineering rigor
Infrastructure Engineers at Scale
Engineers from companies running at significant scale (Netflix, Google, Facebook alumni) who've seen sophisticated reliability engineering firsthand.
Why they work: Exposure to world-class practices, proven at scale
Watch out for: May have unrealistic expectations about tooling maturity
Open Source Contributors
Contributors to projects like Kubernetes, Prometheus, Envoy, or observability tools demonstrate relevant skills publicly.
Why they work: Proven expertise, self-directed, community engaged
Watch out for: May prefer open source to company-specific work
Common Hiring Mistakes
1. Treating SREs as "On-Call Robots"
SREs should spend most time on proactive reliability improvements, not just responding to incidents. Google's guideline: no more than 50% time on toil. If your SREs are constantly firefighting, you need more SREs or better systems.
2. Ignoring Software Engineering Skills
SREs need to write code—infrastructure code, automation, and often production code. Don't hire sysadmins who can't code. The "SR" in SRE stands for Site Reliability, but the "E" stands for Engineer.
3. Not Testing Reliability Thinking
"Tell me about monitoring" tests tool knowledge. "How would you decide if this system needs 99.9% vs 99.99% reliability?" tests SRE thinking. Focus on error budgets, SLOs, and trade-offs—not just tool familiarity.
4. Expecting Only Operations Experience
The best SREs often come from software engineering backgrounds. They understand how applications work, which makes them better at making applications reliable. Don't filter out strong software engineers because they lack "ops experience."
5. Unclear On-Call Expectations
Be transparent about on-call burden. How often will they be paged? What's the rotation? Is there secondary coverage? SRE candidates will ask—and vague answers signal a dysfunctional team.
Red Flags in SRE Candidates
- Only talks about incidents - Great SREs focus on preventing incidents, not just responding
- Can't write code - SRE is an engineering discipline; coding is non-negotiable
- No SLI/SLO experience - Missing core SRE concepts signals ops background without SRE training
- Blames developers for incidents - Good SREs partner with developers, not blame them
- Hasn't automated anything - Toil elimination is fundamental to SRE
- Doesn't ask about on-call - Shows lack of understanding of SRE reality
- Only knows one cloud - Modern SREs need cloud and infrastructure breadth
- Can't explain error budgets - Core SRE concept; absence suggests rebranded ops
Interview Focus Areas
Reliability Engineering
- How they define and measure reliability (SLIs, SLOs)
- Error budget management and trade-off decisions
- Post-mortem practices and systemic improvement
- Making systems more resilient (circuit breakers, retries, graceful degradation)
Infrastructure & Operations
- Infrastructure as code practices (Terraform, Kubernetes)
- Observability design (metrics, logs, traces, alerts)
- Capacity planning and scaling strategies
- Cloud platform expertise (AWS, GCP, Azure)
Software Engineering
- Can they write production-quality code?
- Code review practices and standards
- System design for reliability
- Distributed systems understanding
Problem-Solving & Debugging
- Debugging complex production issues
- Systematic troubleshooting approaches
- Balancing speed vs. thoroughness during incidents
- Learning from incidents and preventing recurrence
Developer Expectations
| Aspect | ✓ What They Expect | ✗ What Breaks Trust |
|---|---|---|
| On-Call Balance | Reasonable rotation (1 week on, 4+ weeks off) with secondary coverage and clear escalation | Constant on-call, no rotation, or being solely responsible for production 24/7 |
| Toil Management | Less than 50% time on operational toil; dedicated time for automation and improvement projects | Endless firefighting with no investment in reducing operational burden |
| Error Budget Authority | Real error budgets with authority to pause feature work when budget is exhausted | SLOs exist on paper but product teams ignore reliability concerns |
| Engineering Identity | Treated as software engineers who focus on reliability, not as an ops/support tier | SRE as "the team that gets paged" without engineering projects or career growth |
| Blameless Culture | Post-mortems focus on systemic improvements, not individual blame | Finger-pointing after incidents; engineers punished for production issues |