
Hiring OpenTelemetry Engineers: The Complete Guide

Market Snapshot

  • Senior Salary (US): $170k–$220k
  • Hiring Difficulty: Very Hard
  • Avg. Time to Hire: 5–7 weeks

Site Reliability Engineer (SRE)

Definition

A Site Reliability Engineer (SRE) applies software engineering to operations problems: defining SLOs and error budgets, automating away manual toil, and building the tooling and monitoring that keep large-scale systems reliable. The role demands deep technical expertise and close collaboration with development teams, but its output is measured in availability and operational efficiency rather than shipped features.

SREs are a natural fit for OpenTelemetry-heavy roles: they are often the engineers who operate Collectors, define sampling policies, and drive instrumentation standards across teams. When hiring for observability work, weight SRE experience that includes hands-on telemetry ownership, not just on-call rotation.

Shopify (E-Commerce)

Commerce Platform Observability

Instrumenting thousands of services to handle Black Friday traffic peaks. OTel-based tracing enables debugging of complex order flows across payment, inventory, and fulfillment services with sub-second resolution.

Tags: High-Scale Tracing · Tail Sampling · Custom Instrumentation · Event Correlation
eBay (E-Commerce)

Global Marketplace Telemetry Migration

Migrated from proprietary instrumentation to OpenTelemetry across their global marketplace. Reduced vendor lock-in while improving trace coverage from 40% to 95% of requests.

Tags: Migration Strategy · Collector Gateway · Cross-Region · Vendor Neutral
Skyscanner (Travel)

Travel Search Distributed Tracing

End-to-end tracing from mobile app search queries through dozens of microservices including pricing, availability, and booking. Context propagation across multiple languages (Kotlin, Python, Go).

Tags: Multi-Language · Mobile-to-Backend · Context Propagation · Business Attributes
Zalando (E-Commerce)

Fashion E-Commerce Observability Platform

Built internal observability platform on OpenTelemetry serving 2,000+ engineers. Self-service instrumentation with standardized span schemas and automated sampling policies.

Tags: Platform Engineering · Self-Service · Schema Design · Sampling Strategy

What OpenTelemetry Expertise Actually Means

Before assessing OpenTelemetry skills, understand the different levels of expertise and what they mean for your hiring needs.

Level 1: Instrumented Application User (Most Engineers)

Every engineer working with modern observability can:

  • Read distributed traces in Jaeger/Zipkin/vendor UIs
  • Navigate span hierarchies to debug slow requests
  • Understand basic context propagation concepts
  • Add simple spans to code using auto-instrumentation

This is table stakes. Don't test for it in interviews—assume any decent backend engineer develops this within weeks of working with observability.

Level 2: Instrumentation Implementer (Backend/Platform)

Engineers with hands-on OTel experience can:

  • Add custom spans, attributes, and events to critical code paths
  • Configure OTel SDK initialization and exporters
  • Set up the OTel Collector with processors and pipelines
  • Implement proper context propagation across async boundaries
  • Create meaningful span names and attribute schemas

This is your target for backend engineers who'll instrument services. It develops naturally with 6-12 months of production observability work.

Level 3: Observability Platform Owner (Specialized)

A smaller subset of engineers can:

  • Deploy and operate OTel Collectors at scale
  • Design telemetry pipelines with sampling, filtering, and routing
  • Build custom instrumentation libraries for internal frameworks
  • Optimize cost by managing cardinality and sampling strategies
  • Integrate OTel with CI/CD for deployment correlation

This is rare and valuable for platform teams building internal observability infrastructure. It requires dedicated focus, not just incidental OTel usage.


The Three Pillars of OpenTelemetry

Understanding how OTel handles each signal helps you assess candidate depth.

Traces (Distributed Tracing)

The most mature OTel signal. Traces capture request flow across service boundaries, with spans representing individual operations. Each span includes:

  • Span name: What operation occurred (HTTP GET, database query, queue processing)
  • Attributes: Key-value pairs describing the operation (http.method, db.system, user.id)
  • Events: Timestamped annotations within a span (exception thrown, cache miss)
  • Status: Success, error, or unset

Interview signal: Candidates who discuss trace context propagation, span attribute conventions, and sampling strategies understand distributed systems debugging deeply.
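The span fields above can be sketched with a small stdlib-only model. This is a toy illustration of the concepts, not the real OTel SDK; the `Span` class and its methods are invented here for clarity:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    # Toy model of an OTel span: name, attributes, events, and status.
    name: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "UNSET"  # one of UNSET, OK, ERROR

    def add_event(self, name, **attrs):
        # Events are timestamped annotations recorded within the span.
        self.events.append({"name": name, "time": time.time(), "attributes": attrs})

# A span for a database query, with semantic-convention-style attribute names.
span = Span("SELECT orders", attributes={"db.system": "postgresql", "http.method": "GET"})
span.add_event("cache_miss", key="order:42")
span.status = "OK"
print(span.name, span.status, len(span.events))
```

A candidate at Level 2 or above should be able to explain why each of these fields exists and which ones they would populate for a given operation.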

Metrics (Measurements and Aggregations)

OTel Metrics provide standardized instruments for measuring application behavior:

  • Counters: Monotonically increasing values (requests served, errors occurred)
  • Gauges: Point-in-time values (current queue depth, memory usage)
  • Histograms: Distribution of values (request latency percentiles)

Interview signal: Understanding when to use counters vs. histograms, and awareness of cardinality concerns, indicates operational maturity.
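The counter-versus-histogram distinction can be shown with a minimal stdlib sketch (toy classes, not the real SDK instruments): a counter only accumulates, so it can answer "how many", while a histogram keeps the distribution, so it can answer "how slow at p95":

```python
# Toy versions of two OTel metric instruments (illustrative only).
class Counter:
    # Monotonically increasing: only non-negative increments are allowed.
    def __init__(self):
        self.value = 0
    def add(self, amount):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Histogram:
    # Records individual measurements so percentiles can be derived later.
    def __init__(self):
        self.samples = []
    def record(self, value):
        self.samples.append(value)
    def percentile(self, p):
        # Nearest-rank percentile over the recorded samples.
        s = sorted(self.samples)
        k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
        return s[k]

requests = Counter()
latency_ms = Histogram()
for ms in [12, 15, 11, 240, 14, 13, 16, 12, 11, 300]:
    requests.add(1)
    latency_ms.record(ms)

print(requests.value)               # 10
print(latency_ms.percentile(90))    # the slow tail a mean would hide
```

Note what the histogram preserves: the average of these latencies looks fine, but the p90 exposes the 240 ms outlier, which is exactly why latency belongs in a histogram, not a gauge.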

Logs (Structured Logging)

The newest OTel signal, still gaining adoption. OTel Logs aim to correlate log entries with traces:

  • Automatic injection of trace/span IDs into log records
  • Structured logging with semantic conventions
  • Unified collection through the OTel Collector

Interview signal: Candidates who understand log-trace correlation have modern observability perspectives beyond basic logging.
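Log-trace correlation reduces to one idea: stamp the active trace ID onto every log record. A minimal sketch using only Python's stdlib `logging` module (the real OTel logging instrumentation does this automatically; the hard-coded trace ID here stands in for the active span's context):

```python
import io
import logging

# Would come from the active span's context in a real OTel setup.
CURRENT_TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

class TraceContextFilter(logging.Filter):
    # Stamps the current trace ID onto every log record so log lines
    # can later be joined with traces in the backend.
    def filter(self, record):
        record.trace_id = CURRENT_TRACE_ID
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())

logger.info("payment authorized")
line = buf.getvalue().strip()
print(line)
```

Once every log line carries a `trace_id`, "show me all logs for this slow request" becomes a single query instead of a grep across services.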


The OpenTelemetry Collector

The Collector is central to production OTel deployments. It receives, processes, and exports telemetry data.

Architecture Patterns

Agent Pattern: Collector runs as a sidecar or daemon, receiving data from local applications and forwarding to backends.

Gateway Pattern: Centralized Collector clusters receive data from many applications, enabling unified processing, sampling, and routing.

Combined Pattern: Agents forward to gateways, enabling both local buffering and centralized processing.

Key Collector Capabilities

  • Receivers: ingest data in various formats (e.g., OTLP, Jaeger, Prometheus, Zipkin)
  • Processors: transform, filter, and batch data (e.g., tail sampling, attribute modification, batching)
  • Exporters: send data to backends (e.g., Jaeger, Zipkin, Datadog, Honeycomb, OTLP)
  • Connectors: route data between pipelines (e.g., span-to-metrics conversion)

Interview question: "How would you deploy OTel Collectors for a 500-service platform?"
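A strong answer usually includes a concrete pipeline. As a rough illustration, a gateway-mode Collector config wires receivers, processors, and exporters together like this (endpoints and thresholds are placeholders; `tail_sampling` ships in the Collector contrib distribution):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp:
    endpoint: backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
```

Candidates who can talk through a config like this, and explain why tail sampling must run before batching, have operated Collectors rather than just read about them.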


When OpenTelemetry Expertise Matters (And When It Doesn't)

High Value: Observability Platform Teams

If you're hiring someone to:

  • Build and maintain your company's telemetry pipeline
  • Design instrumentation standards for development teams
  • Migrate from proprietary instrumentation to OTel
  • Optimize observability costs through sampling and filtering

Then deep OTel experience matters. Look for candidates who've operated Collectors at scale, designed span schemas, and built instrumentation libraries.

Medium Value: Backend/Microservices Engineers

For engineers who will:

  • Instrument services they build and maintain
  • Debug production issues using distributed tracing
  • Add custom spans for business-critical code paths

Instrumentation skills matter, but specific OTel API knowledge is secondary. Prioritize candidates who understand what to instrument and can explain their tracing philosophy.

Low Value: Most Application Developers

For engineers who will:

  • Read traces to debug requests
  • Rely on auto-instrumentation
  • Focus on business logic over infrastructure

OTel familiarity is nice to have but shouldn't drive hiring decisions. Auto-instrumentation handles most needs, and reading traces takes minutes to learn.


Real-World OpenTelemetry Usage Patterns

Pattern 1: Service Mesh Integration

Challenge: Existing service mesh (Istio, Linkerd) provides some observability, but lacks application-level context like user IDs, feature flags, or business metrics.

Solution: Combine mesh-level telemetry with application instrumentation. OTel propagates trace context through mesh proxies while application code adds business-relevant attributes.

Interview question: "How would you correlate service mesh metrics with application-level traces?"

Pattern 2: Sampling at Scale

Challenge: Full trace collection at 10,000 requests per second generates terabytes of data daily—expensive to store and process.

Solution: Head-based sampling (decide at trace start) for baseline coverage, tail-based sampling (decide after trace completes) to capture errors and slow requests. The OTel Collector supports sophisticated sampling strategies.

Interview question: "You have 50,000 RPS across your platform. How do you approach trace sampling?"
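The core trick behind head-based sampling is worth understanding: derive the keep/drop decision from the trace ID itself, so every service in the trace reaches the same decision without coordination. A stdlib-only sketch of the idea behind OTel's `TraceIdRatioBased` sampler (the real SDK's hashing differs in detail):

```python
import random

def should_sample(trace_id: str, ratio: float) -> bool:
    # Treat the low 8 bytes of the 128-bit trace ID as a uniform value
    # and keep the trace if it falls below the ratio threshold.
    return int(trace_id[-16:], 16) < ratio * 2**64

random.seed(7)
trace_ids = [f"{random.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(kept)  # roughly 10% of 10,000

# Deterministic: every service computing this sees the same answer.
assert should_sample(trace_ids[0], 0.10) == should_sample(trace_ids[0], 0.10)
```

Head-based sampling like this gives cheap baseline coverage; the Collector's tail-based sampling then layers on "always keep errors and slow traces", which requires buffering whole traces before deciding.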

Pattern 3: Migration from Proprietary SDKs

Challenge: You're using Datadog APM or New Relic agents but want to reduce vendor lock-in without losing observability during migration.

Solution: OTel's vendor-neutral design allows gradual migration. Start by deploying the Collector to receive data from existing agents, then incrementally replace proprietary instrumentation with OTel SDKs.

Interview question: "Walk me through migrating from Datadog APM to OpenTelemetry without losing visibility."

Pattern 4: Cross-Language Context Propagation

Challenge: A request flows from JavaScript frontend to Go API to Python ML service to Java payment processor. How do you maintain trace continuity?

Solution: OTel's W3C Trace Context and Baggage propagation standards work across all languages. Configure each service's OTel SDK to inject and extract context from HTTP headers.

Interview question: "How do you ensure trace context propagates across services in different languages?"
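The mechanics reduce to a shared header format. A stdlib-only sketch of injecting and extracting the W3C `traceparent` header, whose layout is `version-traceid-spanid-flags` per the W3C Trace Context spec (helper names here are invented for illustration):

```python
import random

def inject(trace_id: str, span_id: str, sampled: bool) -> dict:
    # Attach trace context to outgoing request headers.
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

def extract(headers: dict):
    # Recover trace context from incoming request headers.
    version, trace_id, span_id, flags = headers["traceparent"].split("-")
    return trace_id, span_id, flags == "01"

# A Go API calling a Python service produces/consumes the very same header,
# which is why continuity survives language boundaries.
trace_id = f"{random.getrandbits(128):032x}"
parent_span = f"{random.getrandbits(64):016x}"
headers = inject(trace_id, parent_span, sampled=True)
got_trace, got_span, got_sampled = extract(headers)
print(headers["traceparent"])
```

In practice each language's OTel SDK does this via its configured propagator; the point of the sketch is that the wire format, not the SDK, is what makes cross-language traces work.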


Recruiter's Cheat Sheet: Spotting Real Expertise


Conversation Starters That Reveal Skill Level

  • "How do you decide what to instrument?"
    Surface-level: "I add spans to slow endpoints"
    Deep understanding: "I instrument at service boundaries, async operations, and business-critical paths. I focus on actionable data—what will help me debug at 3 AM?"
  • "How do you handle high cardinality?"
    Surface-level: "I don't add too many attributes"
    Deep understanding: "I distinguish between indexed attributes for querying versus span events for context. I use bounded cardinality for service/endpoint names, sampling for high-cardinality debugging data"
  • "What's your sampling strategy?"
    Surface-level: "We sample 10% of traces"
    Deep understanding: "We use head-based sampling for baseline coverage plus tail-based sampling in the Collector to capture 100% of errors and high-latency traces"

Resume Signals That Matter

Look for:

  • Scale context ("Instrumented 200+ services with OTel, processing 1M spans/sec")
  • Collector experience ("Deployed OTel Collector gateway handling cross-region traffic")
  • Migration work ("Led migration from X-Ray to OpenTelemetry across platform")
  • Sampling design ("Implemented tail-based sampling reducing storage costs 60%")
  • Schema ownership ("Defined span naming conventions and attribute schemas")

🚫 Be skeptical of:

  • "OpenTelemetry expert" without scale or complexity context
  • Only consuming traces, never instrumenting code
  • No mention of Collector operations or sampling
  • Listing OTel alongside 20 tools without depth on any
  • Can't explain the difference between traces, metrics, and logs

GitHub/Portfolio Indicators

  • OTel instrumentation libraries or contrib modules
  • Collector configuration with custom processors
  • Span attribute schema documentation
  • Blog posts about instrumentation patterns or sampling strategies

Common Hiring Mistakes

1. Requiring "3+ Years OpenTelemetry Experience"

OpenTelemetry only became production-ready around 2021-2022. Requiring years of experience excludes candidates with deep observability expertise from OpenTracing, Jaeger, or Zipkin—whose skills transfer directly. OTel is learnable in weeks; distributed systems thinking takes years.

2. Conflating OpenTelemetry with Observability Backends

OTel is an instrumentation standard, not an observability platform. Candidates might have deep OTel experience but use Jaeger, or extensive Datadog experience without OTel. Understanding where OTel fits in the observability stack helps you ask the right questions.

3. Testing OTel API Knowledge in Interviews

Asking "How do you create a span in Go?" wastes interview time. APIs differ by language and are in documentation. Instead, ask about instrumentation strategy: "What would you instrument in a payment processing service, and why?"

4. Ignoring the Collector

Engineers who've only added spans to application code have limited OTel experience. Production deployments require Collector expertise: pipelines, processors, sampling, batching, reliability. Ask about operational experience, not just SDK usage.

5. Treating OTel and Vendor Experience as Mutually Exclusive

The best OTel engineers often have deep experience with Datadog, New Relic, or Honeycomb. Vendor expertise provides context for why vendor-neutral matters. Don't dismiss candidates because they learned observability through commercial tools.


OpenTelemetry in the Broader Observability Landscape

Understanding where OTel fits helps you assess candidates accurately.

OTel vs. Proprietary APM (Datadog, New Relic, Dynatrace)

Proprietary APMs provide integrated instrumentation, storage, and visualization—turnkey observability with vendor lock-in. OTel provides vendor-neutral instrumentation that exports to any backend.

Implication: OTel expertise indicates comfort with building from components and valuing portability. Proprietary APM expertise indicates working within integrated platforms. Different skills, both valuable.

OTel vs. OpenTracing/OpenCensus

OpenTelemetry supersedes both projects. OpenTracing focused on distributed tracing; OpenCensus added metrics. OTel unifies both and adds logs.

Implication: Candidates with OpenTracing or OpenCensus experience have directly transferable skills. Don't treat these as separate requirements.

The Vendor Landscape

All major observability vendors support OTel: Datadog, New Relic, Honeycomb, Lightstep, Grafana, Splunk. OTLP (OpenTelemetry Protocol) is becoming the standard export format.

Implication: OTel skills are increasingly portable. Engineers can switch between vendors or run multi-vendor setups without re-instrumenting applications.

The Future: OTel Everywhere

OTel is becoming the default instrumentation layer. Cloud providers (AWS, GCP, Azure) are adding native OTel support. Frameworks and libraries are shipping with built-in OTel instrumentation. Candidates who understand OTel deeply are positioned for the observability future.

Frequently Asked Questions


Do candidates need prior OpenTelemetry experience, or does general observability experience transfer?

For most roles, general observability experience is more valuable than OTel specifically. OpenTelemetry concepts—distributed tracing, context propagation, instrumentation patterns—transfer directly from OpenTracing, Jaeger, Zipkin, or vendor APMs like Datadog. An engineer with strong distributed tracing experience from any background can become productive with OTel in 2-3 weeks. Exception: if you're hiring someone to own OTel Collector infrastructure at scale, specific Collector experience saves significant ramp-up time.
