What Prometheus Engineers Actually Build
Before writing your job description, understand what Prometheus expertise means in practice. Here are real examples from industry leaders:
Infrastructure Observability
GitLab runs one of the largest Prometheus deployments, monitoring their entire SaaS infrastructure. Their observability engineers handle:
- Metrics collection from thousands of services and nodes
- PromQL dashboards for capacity planning and performance analysis
- Alerting pipelines that reduce MTTR (Mean Time to Recovery)
- Federation and remote storage for long-term data retention
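Federation, for instance, works by having a global Prometheus scrape another server's `/federate` endpoint with `match[]` selectors. A minimal sketch, assuming illustrative job and target names (not GitLab's actual configuration):

```yaml
# prometheus.yml on the global server — pulls selected series from a shard.
scrape_configs:
  - job_name: 'federate'
    honor_labels: true            # keep the shard's original labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'          # only federate the series you need
        - '{__name__=~"job:.*"}'  # plus pre-aggregated recording rules
    static_configs:
      - targets: ['prometheus-shard-1:9090']
```

Restricting `match[]` to pre-aggregated series is what keeps federation from collapsing under the full metric volume of each shard.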
DigitalOcean uses Prometheus to monitor their cloud platform. Their engineers build:
- Custom exporters for proprietary hypervisor metrics
- Service-level indicators (SLIs) and SLOs for customer-facing APIs
- Automated alerting that pages on-call engineers with context
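An SLI like this is typically precomputed with a recording rule so that alerts and dashboards query it cheaply. A sketch, with hypothetical metric and rule names:

```yaml
groups:
  - name: api-slis
    rules:
      # Share of requests that did not return a 5xx over 5 minutes
      - record: job:http_requests:success_ratio_5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```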
Cloud-Native Applications
Cloudflare monitors their edge network with Prometheus. Engineers work on:
- High-cardinality metrics from 200+ data centers
- Sub-second latency dashboards for DDoS detection
- Integration with Grafana for real-time visualization
Spotify uses Prometheus in their Kubernetes clusters to track:
- Microservice health and inter-service latencies
- Resource utilization for cost optimization
- Custom application metrics for playlist recommendations
Metrics vs. Logs vs. Traces: Why It Matters
A common recruiter confusion: conflating different observability tools. Understanding the distinction helps you assess candidates accurately.
Metrics (Prometheus Territory)
What they are: Numeric measurements over time—request counts, latencies, CPU usage, error rates.
Why they matter: Metrics answer "what's happening?" They're cheap to store, fast to query, and ideal for alerting. Prometheus excels here.
Example: "Our API served 1.2M requests in the last hour with 99.2% success rate and p95 latency of 45ms."
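Each figure in that sentence maps to a PromQL query. Hedged sketches, assuming a counter `http_requests_total` and a histogram `http_request_duration_seconds`:

```promql
# Requests in the last hour
sum(increase(http_requests_total[1h]))

# Success rate: non-5xx share of traffic
sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))

# p95 latency from histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```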
Logs (ELK Stack, Loki)
What they are: Detailed text records of events—error messages, transaction details, debug output.
Why they matter: Logs answer "why did it happen?" They're essential for debugging but expensive at scale.
Example: "Error at 14:32:01 - User 12345 failed authentication: invalid token format."
Traces (Jaeger, Zipkin)
What they are: Request paths through distributed systems—showing how a single request flows across services.
Why they matter: Traces answer "where did time go?" Critical for debugging microservice bottlenecks.
Example: "Request X spent 200ms in auth-service, 50ms in user-service, 400ms in database."
The Hiring Implication
An engineer who only knows Prometheus doesn't understand the full observability picture. The best candidates can explain when to use each pillar and how they complement each other. This is where you separate "I've used Prometheus" from "I've built observability systems."
PromQL: The Language That Separates Beginners from Experts
PromQL (Prometheus Query Language) is where you identify real expertise. Its deceptively simple syntax hides significant complexity.
Basic Queries (Any Engineer)
http_requests_total{job="api", status="500"}
Returns the current counter value for requests with status 500 on the api job. Anyone can write this after a tutorial.
Intermediate Queries (Working Knowledge)
rate(http_requests_total{status="500"}[5m]) / rate(http_requests_total[5m]) * 100
Calculates error percentage over 5 minutes. Requires understanding rate(), time windows, and vector math.
Advanced Queries (Real Expertise)
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
Computes p99 latency from histogram buckets, grouped by service. This requires understanding histogram internals, aggregation, and the le label convention.
What to Ask in Interviews
Instead of "Do you know PromQL?", try: "Walk me through how you'd create an alert for when error rate exceeds 5% of traffic, but only when traffic is above 100 requests per minute." This reveals understanding of:
- Rate calculations over time windows
- Combining multiple metrics in expressions
- Avoiding noisy alerts during low-traffic periods
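One plausible answer, sketched as an alerting rule (metric names are assumed, not prescribed):

```yaml
groups:
  - name: error-rate
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
          and
          sum(rate(http_requests_total[5m])) * 60 > 100
        for: 10m
```

The `and` clause suppresses the alert below 100 requests/minute (rate() is per-second, hence the * 60), and `for: 10m` requires the condition to persist before paging anyone.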
Recruiter's Cheat Sheet: Spotting Great Candidates
Conversation Starters That Reveal Skill Level
| Question | Junior Answer | Senior Answer |
|---|---|---|
| "How do you handle high-cardinality metrics?" | "What's cardinality?" | "We limit label values, use recording rules, and sometimes pre-aggregate at the application level" |
| "Tell me about an alerting system you built" | "I set up alerts in Grafana" | "I designed SLO-based alerts with multi-window burn rates and escalation policies" |
| "What's your approach to metric naming?" | Generic or no opinion | References conventions like OpenMetrics, explains tradeoffs of different approaches |
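The recording-rule answer from the table can be made concrete: pre-aggregating away a high-cardinality label means dashboards never touch the raw series. A sketch with hypothetical names:

```yaml
groups:
  - name: cardinality-control
    rules:
      # Collapse the per-instance dimension before dashboards query it
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, status)
```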
Resume Signals That Matter
✅ Look for:
- Specific scale indicators ("monitored 500+ services", "handled 10M samples/second")
- Alerting design experience (not just consuming alerts)
- Mentions of Thanos, Cortex, or VictoriaMetrics (indicates scale challenges)
- Federation or remote storage implementation
- SLO/SLI experience with specific targets
🚫 Be skeptical of:
- "Prometheus expert" without scale context
- Only Grafana dashboard experience (consumer, not builder)
- No mention of alerting or on-call experience
- Listing every observability tool without depth
GitHub Portfolio Indicators
- Custom exporters they've written
- Recording rules and alerting configurations
- Helm charts or operators for Prometheus deployment
- Contributions to Prometheus ecosystem projects
The Modern Prometheus Engineer (2024-2026)
Prometheus has evolved from a standalone tool to the center of cloud-native observability. Modern expertise looks different than it did five years ago.
Kubernetes-Native Monitoring
The Prometheus Operator and ServiceMonitor CRDs are now standard. Engineers should understand:
- Automatic service discovery in Kubernetes
- PodMonitor and ServiceMonitor resources
- Prometheus Operator configuration patterns
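A ServiceMonitor tells the Prometheus Operator which Services to scrape via label selection. A minimal sketch, with illustrative names and namespace:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api          # scrape Services carrying this label
  endpoints:
    - port: metrics     # named port on the Service
      interval: 30s
```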
Long-Term Storage Solutions
Prometheus wasn't designed for long-term storage. Production deployments typically add one of:
- Thanos: Multi-cluster queries and object storage backend
- Cortex: Horizontally scalable Prometheus-as-a-service
- VictoriaMetrics: High-performance alternative storage
OpenTelemetry Integration
The observability landscape is converging on OpenTelemetry. Strong candidates understand how Prometheus fits into the OTel ecosystem and can discuss migration paths.
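One common integration point is the OpenTelemetry Collector exposing application metrics on an endpoint for Prometheus to scrape. A sketch of a Collector pipeline (the port is illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus scrapes this port
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```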
Common Hiring Mistakes
1. Treating Prometheus Like a Generic Skill
Monitoring tools ARE learnable, but depth matters. An engineer who's designed alerting systems understands nuances that take years to develop: cardinality explosions, alert fatigue, meaningful vs. noisy alerts. Don't dismiss Prometheus experience as "just another tool."
2. Conflating Grafana with Prometheus
Grafana is a visualization layer. Prometheus is the data backend. Many engineers have made beautiful Grafana dashboards without understanding how metrics collection, storage, or alerting works. Ask about the full pipeline, not just dashboards.
3. Ignoring Operational Experience
The difference between "I set up Prometheus once" and "I ran Prometheus in production for three years" is massive. Production experience means handling federation, storage scaling, upgrade migrations, and incident debugging. Prioritize operational depth.
4. Requiring Specific Years of Experience
Prometheus graduated from CNCF in 2018, but the project started in 2012. Someone with three years of deep Kubernetes experience might understand cloud-native monitoring better than someone who used Prometheus for five years in a static environment.