What SREs Actually Do
Site Reliability Engineering encompasses several core responsibilities that blend software engineering with operations expertise.
A Day in the Life
Reliability Engineering (30-40%)
- SLI/SLO/SLA management - Defining service level indicators (what to measure), objectives (targets), and agreements (commitments)
- Error budget management - Tracking reliability spend and making trade-offs between velocity and stability
- Incident response - On-call rotation, incident command, and leading production investigations
- Post-mortems - Blameless analysis of incidents to prevent recurrence
- Reliability improvements - Adding circuit breakers, retries, graceful degradation
Infrastructure & Platform (25-35%)
- Infrastructure as code - Terraform, CloudFormation, Kubernetes manifests, Pulumi
- Observability - Building and maintaining monitoring, logging, tracing systems (Prometheus, Grafana, Datadog, OpenTelemetry)
- CI/CD pipelines - Deployment automation, canary releases, rollback mechanisms
- Capacity planning - Forecasting growth, scaling strategies, cost optimization
Automation & Tooling (20-30%)
- Toil elimination - Automating repetitive operational tasks through code
- Self-service platforms - Building tools that let developers deploy and operate independently
- Runbooks and automation - Turning manual procedures into automated responses
- Developer experience - Making it easier for engineers to build reliable systems
Software Engineering (10-20%)
- Production code contributions - Some SREs contribute directly to product codebases
- Infrastructure code - Writing maintainable, tested infrastructure automation
- Code reviews - Reviewing both infrastructure and application changes for reliability concerns
- Architecture guidance - Influencing system design for scalability and fault tolerance
Understanding Google's SRE Model
Google pioneered Site Reliability Engineering in 2003, and their approach has become the industry standard. Understanding these concepts is essential for evaluating SRE candidates.
Error Budgets: The Core Innovation
The error budget is the defining concept of SRE. Here's how it works:
If your SLO is 99.9% availability, your error budget is 0.1% - that's 43.8 minutes of downtime per month you can "spend."
This transforms the reliability conversation. Instead of ops teams demanding "more stability" while product teams push for "more features," both teams share a quantitative budget. When you're within budget, ship features. When you're burning budget too fast, pause and fix reliability.
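The budget arithmetic itself is simple; a short sketch (using 43,800 minutes, i.e. 730 hours, as an average month):

```python
def error_budget_minutes(slo, window_minutes=43_800):
    """Downtime a service may 'spend' per window while still meeting its SLO.

    slo is a fraction (e.g. 0.999 for 99.9%); the default window of
    43,800 minutes approximates one month (730 hours).
    """
    return (1 - slo) * window_minutes

# 99.9% availability leaves roughly 43.8 minutes of downtime per month
print(round(error_budget_minutes(0.999), 1))
```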
Error budgets create alignment because:
- Development teams can ship faster when reliability is good
- SRE teams have data to justify reliability work
- Leadership can see trade-offs quantitatively
Service Level Objectives (SLOs)
SLOs define "good enough" reliability. This is counterintuitive—why not aim for 100%? Because:
- 100% is impossible - Beyond certain nines, you're fighting physics
- 100% is wasteful - Going from 99.9% to 99.99% costs 10x but users barely notice
- 100% slows you down - Over-engineering reliability prevents shipping features
SLOs force explicit decisions: "Our users need 99.95% availability for payments, but 99% is fine for reporting." This clarity helps prioritize engineering effort.
SLIs: Measuring What Matters
Service Level Indicators (SLIs) are the metrics that feed into SLOs. Good SLIs measure user experience, not server metrics:
- Availability: Successful requests / total requests
- Latency: Percentage of requests faster than threshold (e.g., p99 < 200ms)
- Quality: Percentage of responses that are correct/complete
- Freshness: How recent is the data users see?
Bad SLIs (don't measure these for SLOs):
- CPU usage, memory utilization (server metrics, not user experience)
- Uptime (binary—doesn't capture partial degradation)
- Incident count (measures process, not user impact)
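The good SLIs above reduce to simple ratios over user-facing requests. A minimal sketch, with function names invented for illustration:

```python
def availability_sli(successful_requests, total_requests):
    """Availability as the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means no observed failures
    return successful_requests / total_requests

def latency_sli(latencies_ms, threshold_ms=200):
    """Fraction of requests faster than a threshold (e.g. a p99 < 200ms target)."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)
```

Both are computed from request outcomes, not host metrics, which is what makes them suitable SLIs: they move when users are actually affected.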
SRE vs DevOps: Know the Difference
This distinction matters for hiring. DevOps and SRE overlap but have different emphases.
| Aspect | DevOps Engineer | Site Reliability Engineer |
|---|---|---|
| Primary Focus | CI/CD, deployment automation | Reliability, error budgets, SLOs |
| Background | Often ops → coding | Often software engineering → ops |
| Toil Attitude | Automate deployments | Automate everything (50% cap on toil) |
| Success Metric | Deployment frequency | Error budget remaining |
| Incident Response | Deploys fixes | Leads incident command, drives root cause |
| Code Output | Pipeline code, scripts | Production code + infrastructure code |
When to hire DevOps vs SRE:
- DevOps: You need better deployment pipelines, CI/CD automation, developer tooling
- SRE: You need reliability engineering—SLOs, error budgets, production resilience
Many companies use the titles interchangeably, but the skill sets differ. SREs typically have stronger software engineering backgrounds and think systematically about reliability as a feature.
Career Progression
- Curiosity & fundamentals
- Independence & ownership
- Architecture & leadership
- Strategy & org impact
SRE Archetypes: Know What You Need
Software-Focused SRE
- Strong software engineering background, often ex-backend engineers
- Writes production code and infrastructure code fluently
- Common at companies where SREs embed in product teams
- Risk: May lack deep operations or incident response experience
Infrastructure-Focused SRE
- Deep operations and infrastructure expertise
- Focuses on platform, tooling, and automation
- Common at companies with centralized SRE teams
- Risk: May lack software engineering depth for complex automation
Incident Response Specialist
- Expert at debugging production issues under pressure
- Strong on-call leadership and incident command skills
- Common at companies with complex, distributed systems
- Risk: May focus on reactive work over proactive improvements
Platform Builder
- Focuses on building self-service reliability tooling
- Enables product teams to operate independently
- Common at larger companies with mature SRE practices
- Risk: May lose touch with production firefighting
Be explicit about which archetype you need. A platform builder won't thrive in a role that's 80% on-call firefighting.
Where to Find SREs
Software Engineers Interested in Production
Backend engineers who've been on-call, debugged production incidents, or built observability tooling often make excellent SREs. They understand how applications work, which makes them better at making applications reliable.
Why they work: Software engineering foundation, understand developer pain
Watch out for: May lack deep infrastructure or networking knowledge
DevOps Engineers Ready to Level Up
Experienced DevOps engineers who want more software engineering responsibility and are frustrated by purely operational work.
Why they work: Infrastructure expertise, production experience
Watch out for: May struggle with software engineering rigor
Infrastructure Engineers at Scale
Engineers from companies running at significant scale (Netflix, Google, Facebook alumni) who've seen sophisticated reliability engineering firsthand.
Why they work: Exposure to world-class practices, proven at scale
Watch out for: May have unrealistic expectations about tooling maturity
Open Source Contributors
Contributors to projects like Kubernetes, Prometheus, Envoy, or observability tools demonstrate relevant skills publicly.
Why they work: Proven expertise, self-directed, community engaged
Watch out for: May prefer open source to company-specific work
Common Hiring Mistakes
1. Treating SREs as "On-Call Robots"
SREs should spend most time on proactive reliability improvements, not just responding to incidents. Google's guideline: no more than 50% time on toil. If your SREs are constantly firefighting, you need more SREs or better systems.
2. Ignoring Software Engineering Skills
SREs need to write code—infrastructure code, automation, and often production code. Don't hire sysadmins who can't code. The "SR" in SRE stands for Site Reliability, but the "E" stands for Engineer.
3. Not Testing Reliability Thinking
"Tell me about monitoring" tests tool knowledge. "How would you decide if this system needs 99.9% vs 99.99% reliability?" tests SRE thinking. Focus on error budgets, SLOs, and trade-offs—not just tool familiarity.
4. Expecting Only Operations Experience
The best SREs often come from software engineering backgrounds. They understand how applications work, which makes them better at making applications reliable. Don't filter out strong software engineers because they lack "ops experience."
5. Unclear On-Call Expectations
Be transparent about on-call burden. How often will they be paged? What's the rotation? Is there secondary coverage? SRE candidates will ask—and vague answers signal a dysfunctional team.
Red Flags in SRE Candidates
- Only talks about incidents - Great SREs focus on preventing incidents, not just responding
- Can't write code - SRE is an engineering discipline; coding is non-negotiable
- No SLI/SLO experience - Missing core SRE concepts signals ops background without SRE training
- Blames developers for incidents - Good SREs partner with developers, not blame them
- Hasn't automated anything - Toil elimination is fundamental to SRE
- Doesn't ask about on-call - Shows lack of understanding of SRE reality
- Only knows one cloud - Modern SREs need cloud and infrastructure breadth
- Can't explain error budgets - Core SRE concept; absence suggests rebranded ops
Interview Focus Areas
Reliability Engineering
- How they define and measure reliability (SLIs, SLOs)
- Error budget management and trade-off decisions
- Post-mortem practices and systemic improvement
- Making systems more resilient (circuit breakers, retries, graceful degradation)
Infrastructure & Operations
- Infrastructure as code practices (Terraform, Kubernetes)
- Observability design (metrics, logs, traces, alerts)
- Capacity planning and scaling strategies
- Cloud platform expertise (AWS, GCP, Azure)
Software Engineering
- Can they write production-quality code?
- Code review practices and standards
- System design for reliability
- Distributed systems understanding
Problem-Solving & Debugging
- Debugging complex production issues
- Systematic troubleshooting approaches
- Balancing speed vs. thoroughness during incidents
- Learning from incidents and preventing recurrence
Developer Expectations
| Aspect | ✓ What They Expect | ✗ What Breaks Trust |
|---|---|---|
| On-Call Balance | Reasonable rotation (1 week on, 4+ weeks off) with secondary coverage and clear escalation | Constant on-call, no rotation, or being solely responsible for production 24/7 |
| Toil Management | Less than 50% time on operational toil; dedicated time for automation and improvement projects | Endless firefighting with no investment in reducing operational burden |
| Error Budget Authority | Real error budgets with authority to pause feature work when budget is exhausted | SLOs exist on paper but product teams ignore reliability concerns |
| Engineering Identity | Treated as software engineers who focus on reliability, not as an ops/support tier | SRE as "the team that gets paged" without engineering projects or career growth |
| Blameless Culture | Post-mortems focus on systemic improvements, not individual blame | Finger-pointing after incidents; engineers punished for production issues |