What Observability Engineers Actually Do
Observability Engineering encompasses several core responsibilities that blend deep infrastructure knowledge with a focus on developer experience.
A Day in the Life
Instrumentation & Data Collection (30-40%)
- Instrumentation standards - Defining how services emit metrics, logs, and traces consistently
- OpenTelemetry adoption - Rolling out OTel SDKs, collectors, and semantic conventions
- Auto-instrumentation - Making instrumentation effortless for application teams
- Custom instrumentation - Building specialized instrumentation for business-critical flows
- Data quality - Ensuring telemetry is complete, accurate, and actionable
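To make "custom instrumentation" concrete: a minimal, stdlib-only sketch of wrapping a business-critical flow so it emits consistent telemetry. This is an illustration of the idea, not any vendor's SDK; the `TELEMETRY` sink, the `checkout.charge_card` operation name, and `charge_card` itself are hypothetical (a real deployment would use an OpenTelemetry SDK and export to a collector).

```python
import functools
import time

# In a real system these events would go to a collector or backend;
# a list stands in for that sink here (hypothetical).
TELEMETRY = []


def instrumented(operation: str):
    """Decorator emitting one structured event per call: a toy
    stand-in for SDK instrumentation, giving every flow the same
    attribute names, a duration, and a success/error status."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                TELEMETRY.append({
                    "operation": operation,
                    "duration_ms": (time.monotonic() - start) * 1000,
                    "status": status,
                })
        return inner
    return wrap


@instrumented("checkout.charge_card")
def charge_card(amount_cents: int) -> bool:
    # Hypothetical business-critical flow.
    return amount_cents > 0
```

The point of the decorator is the consistency the list above calls for: application teams add one line and get uniform attribute names, rather than each team inventing its own event shape.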
Pipeline & Infrastructure (25-35%)
- Telemetry pipelines - Building systems to collect, process, route, and store observability data at scale
- Data backends - Operating metrics stores (Prometheus, Thanos, Mimir), log aggregators (Elasticsearch, Loki), and trace backends (Jaeger, Tempo)
- Cost optimization - Managing observability costs through sampling, aggregation, and retention policies
- High availability - Ensuring observability systems remain available during the incidents they're meant to debug
Alerting & Incident Enablement (20-30%)
- Alert quality - Reducing alert fatigue through SLO-based alerting and actionable alerts
- Runbook integration - Connecting alerts to debugging workflows and automated remediation
- On-call tooling - Building systems that help responders diagnose issues faster
- Correlation engines - Connecting metrics, logs, and traces for unified debugging
Developer Experience (15-25%)
- Dashboards and visualization - Creating Grafana dashboards, exploratory tools, and service health views
- Self-service tooling - Enabling developers to create dashboards, alerts, and queries without observability expertise
- Documentation and training - Teaching teams how to use observability effectively
- Debug workflows - Building guided paths from alert to root cause
Observability vs. Monitoring: Understanding the Difference
This distinction matters for hiring. Many candidates use the terms interchangeably, but they represent different philosophies.
Monitoring: Predefined Questions
Traditional monitoring answers questions you knew to ask:
- Is the server up? (availability)
- Is disk usage above 80%? (threshold)
- Are we getting 5xx errors? (known failure mode)
Monitoring requires you to anticipate failures and create dashboards/alerts ahead of time. When novel problems occur, you're blind.
Observability: Arbitrary Questions
Observability enables questions you didn't know to ask:
- Why are requests from iPhone users in Germany slow?
- What changed between this deployment and last that caused latency to spike?
- Which customer's traffic triggered this database hotspot?
Observability requires high-cardinality data that lets you slice and dice by arbitrary dimensions (user_id, request_id, feature_flag, etc.).
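The "slice and dice by arbitrary dimensions" idea can be sketched with wide events: one record per request carrying every dimension, queryable by any combination after the fact. The event fields and data below are hypothetical, and the aggregation is deliberately naive.

```python
# Wide, high-cardinality events: one record per request, every
# dimension attached, none pre-aggregated away. (Hypothetical data.)
EVENTS = [
    {"route": "/search", "device": "iphone",  "country": "DE", "duration_ms": 900},
    {"route": "/search", "device": "iphone",  "country": "US", "duration_ms": 120},
    {"route": "/search", "device": "android", "country": "DE", "duration_ms": 110},
    {"route": "/home",   "device": "iphone",  "country": "DE", "duration_ms": 880},
]


def slice_p50(events, **dimensions):
    """Upper-median latency for events matching arbitrary
    key=value filters -- the filters are chosen at query time,
    not when the dashboard was built."""
    matched = sorted(
        e["duration_ms"] for e in events
        if all(e.get(k) == v for k, v in dimensions.items())
    )
    return matched[len(matched) // 2] if matched else None
```

With this shape, "why are iPhone users in Germany slow?" is just `slice_p50(EVENTS, device="iphone", country="DE")`, even though nobody anticipated that question. A metrics system that pre-aggregated by route alone could not answer it.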
Why It Matters for Hiring
Candidates who only talk about dashboards and thresholds have a monitoring mindset. Strong Observability Engineers think about:
- Cardinality - Can we query by any dimension without pre-aggregation?
- Correlation - Can we follow a request across services?
- Exploration - Can engineers debug without knowing what to look for?
The Three Pillars of Observability
Every Observability Engineer must understand the three pillars and their trade-offs.
Metrics
What they are: Numeric measurements aggregated over time (request count, latency percentiles, error rate)
Strengths:
- Cheap to store and query (pre-aggregated)
- Great for alerting and trends
- Low cardinality queries are fast
Weaknesses:
- High cardinality is expensive
- Can't trace individual requests
- Lose detail through aggregation
Key concepts to assess:
- Prometheus data model (labels, time series)
- Histogram vs. summary trade-offs
- Cardinality explosion risks
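The histogram trade-off above can be made concrete with a toy Prometheus-style bucketed histogram (an illustration of the data model, not the client library): observations land in fixed buckets, so quantile estimates come back only as precise as the bucket boundaries.

```python
import bisect


class Histogram:
    """Toy Prometheus-style histogram: fixed upper-bound buckets
    (like `le` labels) plus a +Inf overflow slot."""

    def __init__(self, buckets):
        self.buckets = sorted(buckets)            # upper bounds
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # bisect_left puts value == bound into that bound's bucket,
        # matching Prometheus's "less than or equal" semantics.
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += 1
        self.sum += value

    def quantile(self, q):
        """Approximate quantile from bucket counts, roughly what
        PromQL's histogram_quantile does: walk buckets until the
        cumulative count covers the target rank."""
        target = q * self.total
        seen = 0
        for bound, count in zip(self.buckets, self.counts):
            seen += count
            if seen >= target:
                return bound
        return float("inf")
```

Ten observations where nine are fast (0.05s) and one is very slow (2s) illustrate the "lose detail through aggregation" weakness: with buckets at 0.1, 0.5, and 1.0 the p99 can only be reported as "beyond 1.0", because the slow request fell into the +Inf bucket.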
Logs
What they are: Timestamped records of discrete events (request completed, error occurred, user action)
Strengths:
- Rich context and detail
- Can search for specific events
- Good for debugging individual requests
Weaknesses:
- Expensive to store at scale
- Searching is slower than metrics
- Unstructured logs are hard to analyze
Key concepts to assess:
- Structured logging patterns
- Log levels and when to use each
- Sampling strategies at scale
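The "structured logging patterns" bullet is worth grounding: emitting one JSON object per event lets log backends index fields instead of grepping free text. A minimal sketch using only Python's standard `logging` module follows; the `ctx` convention and field names are assumptions for illustration, not a standard.

```python
import json
import logging


class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object so backends can
    index and filter on fields rather than parsing free text."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context attached via `extra={"ctx": ...}`
        # (a hypothetical convention for this sketch).
        payload.update(getattr(record, "ctx", {}))
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call like `logger.info("payment failed", extra={"ctx": {"user_id": "u_123"}})` then produces a record where `user_id` is a queryable field, which is exactly what makes logs usable for the high-cardinality questions described earlier.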
Traces
What they are: Records of requests flowing through distributed systems, showing timing and relationships
Strengths:
- Show the full request path
- Identify where time is spent
- Enable service dependency mapping
Weaknesses:
- Complex to implement correctly
- High storage costs for full traces
- Require context propagation everywhere
Key concepts to assess:
- Distributed tracing concepts (spans, trace context)
- Sampling strategies (head-based, tail-based)
- OpenTelemetry trace model
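Head-based sampling is worth sketching because its key property is non-obvious: the keep/drop decision must be a pure function of the trace ID, so every service touching the request reaches the same verdict with no coordination. OpenTelemetry's ratio-based samplers work on this principle; the sketch below is a simplified stand-in, not the spec's algorithm.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based sampling: hash the trace ID into
    [0, 1) and keep the trace if it falls under the sample rate.
    Because the decision depends only on the ID, all services in
    the request path agree without talking to each other."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

The trade-off against tail-based sampling follows directly: head-based decides before the trace's outcome is known, so it is cheap but samples errors and slow requests at the same rate as everything else; tail-based buffers whole traces and decides afterward, keeping the interesting ones at much higher storage and operational cost.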
Career Progression
- Curiosity & fundamentals
- Independence & ownership
- Architecture & leadership
- Strategy & org impact
When to Hire Dedicated Observability Engineers
Observability work happens in every engineering organization, but dedicated roles are context-dependent.
Strong Signal for Dedicated Role
- Scale - Processing billions of metrics/logs daily requires specialization
- Multi-team organizations - 50+ engineers need consistent observability standards
- Observability as platform - You're building internal observability tooling
- Vendor complexity - Managing Datadog/New Relic/custom stack requires expertise
- Cost management - Observability spend exceeds $100K/month
When SRE/Platform Covers It
- Smaller organizations - Under 30 engineers, SREs handle observability as part of reliability work
- Simple architectures - Monoliths or few microservices need less specialized tooling
- Vendor-managed solutions - Fully managed Datadog/New Relic reduces operational burden
- Mature practices - When observability is already well-established, maintenance is part-time
The Hybrid Reality
Most companies don't have dedicated "Observability Engineers"—the work is distributed across:
- SREs - Own alerting, incident tooling, and reliability metrics
- Platform Engineers - Own telemetry pipelines and developer tooling
- Application teams - Own service-specific instrumentation
When hiring, be clear: Are you building a dedicated observability team, or adding observability expertise to an existing SRE/Platform function?
Where to Find Observability Engineers
Backend Engineers at Observability Vendors
Engineers from Datadog, New Relic, Honeycomb, Grafana Labs, or Splunk have deep observability expertise. They've built the tools others use.
Why they work: Deep expertise, understand scale challenges
Watch out for: May have narrow tool expertise; compensation expectations may be high
SREs with Observability Focus
Site Reliability Engineers who've specialized in monitoring, alerting, and debugging tooling often make excellent Observability Engineers. They understand the production context.
Why they work: Production experience, understand on-call pain points
Watch out for: May focus on ops over developer experience
Platform Engineers Building Internal Tools
Engineers who've built internal observability platforms understand both the technical challenges and the developer experience requirements.
Why they work: Full-stack observability experience, user empathy
Watch out for: May lack depth in specific areas (tracing, metrics stores)
Open Source Contributors
Contributors to OpenTelemetry, Prometheus, Jaeger, Grafana, or similar projects demonstrate expertise publicly. These communities are excellent talent pools.
Why they work: Proven expertise, community engagement
Watch out for: May prefer open source work to company-specific challenges
Common Hiring Mistakes
1. Confusing Monitoring with Observability
Candidates who only talk about Nagios, uptime checks, and threshold alerts have a monitoring mindset. True Observability Engineers think about high-cardinality data, distributed tracing, and enabling exploration.
2. Over-Indexing on Tool Knowledge
Tools change rapidly. A candidate who knows Datadog deeply but lacks fundamentals will struggle when you migrate to Grafana. Focus on concepts: what makes good instrumentation? How do you design telemetry pipelines? What makes alerts actionable?
3. Ignoring Developer Experience
Observability is only valuable if developers use it. Candidates who focus solely on infrastructure without considering how application teams will instrument, query, and debug miss half the role.
4. Unclear Scope
Is this role about building telemetry pipelines, improving alert quality, or enabling self-service? Be specific. "Own observability" is too vague—define the actual problems you need solved.
5. Expecting Instant Results
Observability improvements take time. Migrating to OpenTelemetry, reducing alert noise by 50%, or building reliable trace correlation takes quarters, not weeks. Set realistic expectations.
Red Flags in Observability Candidates
- Only knows one tool - Can't discuss trade-offs or alternatives to their preferred stack
- No developer empathy - Builds tools for themselves, not for application teams
- Dashboard-centric thinking - Only talks about visualization, not data quality or instrumentation
- Can't explain the three pillars - Missing foundational knowledge
- No cost awareness - Doesn't understand observability economics at scale
- Alert-happy - Believes more alerts mean better observability
- Ignores correlation - Treats metrics, logs, and traces as separate problems
Interview Focus Areas
Observability Fundamentals
- Understanding of metrics, logs, and traces trade-offs
- OpenTelemetry knowledge and semantic conventions
- Instrumentation patterns for different languages/frameworks
System Design for Observability
- Telemetry pipeline architecture at scale
- Sampling strategies and their trade-offs
- Cost management and data retention decisions
Alerting Philosophy
- SLO-based alerting vs. threshold alerting
- Reducing alert fatigue and improving signal-to-noise
- On-call experience and incident debugging
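The SLO-based vs. threshold distinction in the list above comes down to one number: the burn rate, i.e. how fast the error budget is being consumed relative to the SLO window. A minimal sketch of the arithmetic follows; the 14.4 fast-burn threshold is the commonly cited example from the Google SRE Workbook, and the default values here are illustrative, not prescriptive.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Budget consumption speed: 1.0 means the error budget lasts
    exactly the SLO window; 14.4 on a 30-day window means the
    whole budget is gone in roughly two days."""
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(error_rate: float, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only when the burn is fast enough to threaten the SLO,
    rather than alerting on a raw error-rate threshold."""
    return burn_rate(error_rate, slo_target) >= threshold
```

This is why burn-rate alerting reduces noise: a 0.5% error rate against a 99.9% SLO is a burn rate of 5, worrying but not page-worthy, while a fixed "errors > 0.1%" threshold would have fired long ago. Production setups typically evaluate this over multiple windows (e.g. 5m and 1h together) to balance speed and noise.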
Developer Experience
- Self-service tooling approaches
- Documentation and training strategies
- Balancing flexibility with consistency
Developer Expectations
| Aspect | ✓ What They Expect | ✗ What Breaks Trust |
|---|---|---|
| Tool Investment | Modern observability stack with an OpenTelemetry adoption path and reliable tooling | Legacy monitoring tools, no investment in improvements, or vendor lock-in with no migration plan |
| Alert Quality | SLO-based alerting with low noise, actionable alerts, and continuous improvement | Alert fatigue accepted as normal, no effort to reduce noise, or hundreds of ignored alerts |
| Engineering Time | Majority of time on engineering projects: building systems, improving tooling, reducing toil | Constant firefighting, manual data exports, or being the "person who runs queries" for everyone |
| Organizational Support | Authority to set instrumentation standards and drive adoption across teams | No enforcement ability, being ignored by application teams, or observability as an afterthought |
| Learning & Growth | Exposure to scale challenges, new technologies (OTel, eBPF), and industry best practices | Maintaining legacy systems with no modernization, or solving the same problems repeatedly |