Global Ride Infrastructure Monitoring
Visualizing metrics from thousands of microservices across global infrastructure. Real-time dashboards monitoring millions of rides daily with sub-minute alerting for service degradation.
Marketplace Observability Platform
Enterprise Grafana deployment enabling self-service dashboards for hundreds of engineering teams. Dashboard provisioning from code with automated governance and access control.
Financial Trading Systems Monitoring
Real-time visualization of trading system performance and market data feeds. Low-latency dashboards for identifying anomalies in high-frequency data streams.
Wikipedia Infrastructure Observability
Open-source observability stack monitoring Wikipedia's global infrastructure. Public Grafana dashboards showcasing real-world large-scale deployment patterns.
What Grafana Expertise Actually Means
Before assessing Grafana skills, understand the different levels of expertise and what they mean for your hiring needs.
Level 1: Dashboard Consumer (Most Engineers)
Every engineer who's worked with observability can:
- Navigate existing Grafana dashboards
- Read metrics and identify anomalies
- Set up basic alerts from existing panels
- Use time range selectors and filters
This is table stakes. Don't test for it in interviews—assume any decent engineer can do this within a day of joining.
Level 2: Dashboard Creator (DevOps/SRE)
Engineers with hands-on observability experience can:
- Build new dashboards from scratch
- Write PromQL/InfluxQL queries for panels
- Configure variables for dynamic filtering
- Set up notification channels and alert rules
- Organize dashboards with folders and tags
This is your target for most DevOps, SRE, and platform roles. It develops naturally with 6-12 months of production observability work.
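To make the query skill at this level concrete, here is the kind of PromQL a Level 2 engineer should be able to write for a panel without help. The metric and label names are illustrative, not from any specific deployment:

```promql
# Error-rate panel: fraction of 5xx responses over the last 5 minutes,
# broken out by service.
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))
```

A candidate at this level should be able to explain why `rate()` over a range is used instead of the raw counter, and why the two sums must group by the same labels.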
Level 3: Grafana Platform Owner (Specialized)
A smaller subset of engineers can:
- Deploy and manage Grafana at enterprise scale
- Develop custom plugins and data sources
- Configure SSO, RBAC, and team provisioning
- Optimize performance for high-cardinality dashboards
- Integrate Grafana with CI/CD and GitOps workflows
This is rare and valuable for platform teams building internal observability platforms. It requires dedicated focus, not just incidental Grafana usage.
The Grafana Stack (LGTM)
Grafana Labs has expanded beyond dashboards into a full observability stack. Understanding this ecosystem helps you assess candidate depth.
Grafana (Visualization)
The core product: dashboards, panels, alerting, and annotations. This is what most people mean when they say "Grafana experience."
Loki (Logs)
A horizontally-scalable log aggregation system designed to be cost-effective and easy to operate. It indexes metadata (labels) rather than full text, making it cheaper than Elasticsearch for many use cases.
Interview signal: Candidates who mention Loki alongside Grafana understand modern observability stacks beyond metrics.
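Loki's label-first design shows up directly in its query language. A hedged sketch of a LogQL query (app and field names are illustrative): the label matchers use the index to narrow the stream set cheaply, and only then does the line filter scan log content.

```logql
# Indexed label matchers select streams; |= and the json parser
# then filter content within only those streams.
{app="checkout", env="prod"} |= "timeout" | json | status >= 500
```

Candidates who can articulate this indexing trade-off (and its consequence: keep label cardinality low) understand Loki beyond the marketing page.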
Tempo (Traces)
A high-volume distributed tracing backend that integrates with Grafana for trace visualization. It keeps costs down by requiring only object storage and indexing nothing beyond the trace ID.
Interview signal: Trace expertise indicates understanding of distributed systems debugging, not just metric monitoring.
Mimir (Metrics)
Grafana's horizontally-scalable Prometheus-compatible metrics backend, replacing Cortex as their recommended long-term storage solution.
Interview signal: Mimir experience suggests large-scale metrics infrastructure work, likely at companies with significant observability maturity.
Pyroscope (Profiling)
Continuous profiling for application performance analysis. Recently acquired by Grafana Labs and integrated into their stack.
Interview signal: Profiling knowledge indicates performance engineering depth beyond basic monitoring.
When Grafana Expertise Matters (And When It Doesn't)
High Value: Observability Platform Teams
If you're hiring someone to:
- Build and maintain your company's observability infrastructure
- Design monitoring standards and best practices
- Create self-service tooling for development teams
Then Grafana platform experience matters. Look for candidates who've owned Grafana deployments, managed dashboard sprawl, and built monitoring that engineers actually use.
Medium Value: SRE and DevOps Roles
For engineers who will:
- Create dashboards for services they support
- Set up alerting and on-call integrations
- Debug production issues using observability tools
Dashboard creation skills are important but learnable. Prioritize candidates who understand what to monitor and can explain their alerting philosophy—the Grafana mechanics are secondary.
Low Value: Application Developers
For backend or frontend engineers who will:
- Read dashboards occasionally
- Instrument their code with metrics
- Respond to alerts about their services
Grafana familiarity is nice to have but shouldn't drive the hiring decision. Any competent developer learns to read dashboards in their first week.
Real-World Grafana Usage Patterns
Pattern 1: Multi-Team Dashboard Organization
Challenge: 50 engineering teams each creating dashboards leads to chaos—hundreds of unorganized dashboards, naming collisions, abandoned panels nobody maintains.
Solution: Folder hierarchies by team/service, naming conventions, dashboard tagging, and provisioning from code. Some companies use Grafonnet or Terraform to manage dashboards as code.
Interview question: "How would you organize dashboards for a 200-person engineering organization?"
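A minimal sketch of what "dashboards as code" looks like with Grafonnet. Import paths and helper names vary by Grafonnet version, so treat this as illustrative rather than copy-paste ready:

```jsonnet
// Illustrative Grafonnet dashboard definition; reviewed via PR like any code.
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

g.dashboard.new('Checkout Service')
+ g.dashboard.withUid('checkout-service')
+ g.dashboard.withTags(['team-payments', 'provisioned'])
+ g.dashboard.withPanels([
  g.panel.timeSeries.new('Request rate')
  + g.panel.timeSeries.queryOptions.withTargets([
    g.query.prometheus.new('prometheus', 'sum(rate(http_requests_total{service="checkout"}[5m]))'),
  ]),
])
```

The point candidates should make: once dashboards are generated from code, naming conventions, tags, and folder placement are enforced mechanically instead of by policy documents.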
Pattern 2: Alerting That Doesn't Suck
Challenge: Default alerting leads to noise—too many false positives, alerts without context, notification fatigue.
Solution: Alert rules with proper thresholds, pending periods to avoid flapping, notification policies that route to the right teams, links to runbooks in alert descriptions.
Interview question: "Walk me through how you'd design alerting for a payment processing service."
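The ingredients above map directly onto Grafana's file-based alert provisioning. A hedged sketch (the schema varies by Grafana version, and the query definition under `data:` is omitted for brevity):

```yaml
# Illustrative provisioned alert rule; note the pending period,
# routing label, and runbook link from the solution above.
apiVersion: 1
groups:
  - orgId: 1
    name: payments-alerts
    folder: Payments
    interval: 1m
    rules:
      - uid: payments-error-budget
        title: Payments error budget burn
        condition: burn_rate        # refers to a query defined under data: (omitted here)
        for: 5m                     # pending period: avoids flapping on brief spikes
        labels:
          team: payments            # notification policies route on this label
        annotations:
          runbook_url: https://runbooks.example.com/payments-errors
          summary: "Error budget burning faster than sustainable for the monthly SLO"
```

Strong candidates will volunteer that alerts like this belong in version control alongside the service they protect.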
Pattern 3: Variable-Driven Dashboards
Challenge: Creating separate dashboards for each service/environment doesn't scale—you end up with hundreds of near-identical dashboards.
Solution: Template variables that let users filter by service, environment, region, etc. A single dashboard template serves multiple use cases.
Interview question: "You have 50 microservices. How do you approach dashboard creation?"
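As a sketch of what a variable-driven panel query looks like (metric names are illustrative), `$service` and `$env` are dashboard variables, typically populated with `label_values()` queries against the data source:

```promql
# One templated panel serves every service/environment combination.
histogram_quantile(0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{service="$service", env="$env"}[5m])
  )
)
```

Candidates who have done this at scale will also mention chained variables (environment narrows the service list) and `All` values with regex matchers.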
Pattern 4: High-Cardinality Visualization
Challenge: Visualizing metrics with thousands of unique values (user IDs, container IDs, request IDs) overwhelms Grafana and produces unreadable charts.
Solution: Aggregation at the query level, Top-N queries, careful label selection, pre-aggregated recording rules.
Interview question: "A developer wants a dashboard showing latency per user. How do you respond?"
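A good answer usually reaches for aggregation plus Top-N rather than plotting every series. A hedged sketch, assuming latency is recorded as a Prometheus summary/histogram pair and aggregating to a bounded label (the `user_tier` label is hypothetical, standing in for whatever low-cardinality dimension replaces per-user breakdown):

```promql
# Show only the ten worst aggregate groups instead of one series per user.
topk(10,
  sum by (user_tier) (rate(request_latency_seconds_sum[5m]))
    /
  sum by (user_tier) (rate(request_latency_seconds_count[5m]))
)
```

The deeper follow-up: if the Top-N query itself is expensive, precompute the aggregation with a Prometheus recording rule and point the panel at the recorded series.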
Recruiter's Cheat Sheet: Spotting Real Expertise
Conversation Starters That Reveal Skill Level
| Question | Surface-Level Answer | Deep Understanding |
|---|---|---|
| "How do you manage dashboard sprawl?" | "We organize by team" | "We provision dashboards from code using Grafonnet, with PR review for changes and automated cleanup of unused dashboards" |
| "How do you approach alerting?" | "I set thresholds based on past incidents" | "I design alerts around SLOs—error budget burn rates, not arbitrary thresholds. Alerts link to runbooks and include context for the on-call engineer" |
| "What makes a good dashboard?" | "One that shows all the metrics" | "One that answers specific questions. I design for the operator: what do they need to see at 3 AM to understand if there's a problem?" |
Resume Signals That Matter
✅ Look for:
- Scale context ("Managed Grafana for 500+ dashboards across 40 teams")
- Platform ownership ("Designed self-service observability platform")
- Alerting design experience ("Reduced alert noise by 70% through SLO-based alerting")
- Dashboard-as-code experience (Grafonnet, Terraform provider, Jsonnet)
- LGTM stack familiarity (Loki, Tempo, Mimir)
🚫 Be skeptical of:
- "Grafana expert" without context of what they built
- Only viewing dashboards, never creating them
- No mention of alerting or operational context
- Listing Grafana alongside 20 other tools without depth
- Inability to explain what data sources they connected
GitHub/Portfolio Indicators
- Grafonnet or Jsonnet dashboard definitions
- Custom Grafana plugins or data sources
- Terraform configurations for Grafana provisioning
- Documentation of monitoring strategies
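For calibration, Terraform-based provisioning in a portfolio typically looks something like this sketch using the Grafana Terraform provider (resource arguments are illustrative; the exact schema depends on the provider version):

```hcl
# Illustrative Grafana provisioning: a folder plus a dashboard
# whose JSON definition lives in version control.
resource "grafana_folder" "payments" {
  title = "Payments"
}

resource "grafana_dashboard" "checkout" {
  folder      = grafana_folder.payments.id
  config_json = file("${path.module}/dashboards/checkout.json")
}
```

Seeing resources like these in a candidate's repositories is stronger evidence of platform ownership than any resume keyword.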
Common Hiring Mistakes
1. Testing Grafana UI Knowledge in Technical Interviews
Asking "How do you create a panel in Grafana?" wastes interview time. Anyone can learn the UI in an afternoon. Instead, ask about monitoring strategy: "What would you include in a dashboard for a checkout service, and why?"
2. Conflating Grafana with Prometheus
Grafana is a visualization layer; Prometheus is a metrics backend. Many "Grafana experts" have limited understanding of how their data is collected, stored, and queried. If you need someone to design monitoring systems, assess the full stack—not just the dashboard layer.
3. Requiring Grafana Experience for Application Developer Roles
Adding "Grafana experience required" to backend engineer job descriptions is noise. Application developers need to understand what to instrument, not how to build dashboards. Remove it from requirements and focus on observability concepts instead.
4. Ignoring the Human Side of Dashboards
The best dashboards are designed for human operators, not metric completeness. Engineers who talk about "dashboards for debugging at 3 AM" or "reducing cognitive load during incidents" understand what visualization is actually for.
5. Treating All Grafana Experience Equally
An engineer who's managed Grafana Enterprise for 2,000 users has different skills than someone who created dashboards for their team project. Ask about scope, responsibility, and impact—not just duration.
Grafana in the Broader Observability Landscape
Understanding where Grafana fits helps you assess candidates more accurately.
Grafana vs. Datadog/New Relic
Datadog and New Relic are integrated observability platforms—metrics, logs, traces, and APM in one commercial product. Grafana is an open-source visualization layer that connects to various backends.
Implication: Grafana expertise indicates comfort with building observability stacks from components. Datadog expertise indicates working within an integrated platform. Different skills, both valuable.
Grafana vs. Kibana
Kibana visualizes Elasticsearch data, primarily for log analysis. Grafana is data-source agnostic and optimized for time-series metrics. They're increasingly overlapping as both expand their capabilities.
Implication: Candidates often have experience with both. Don't treat them as mutually exclusive requirements.
The OpenTelemetry Future
The observability ecosystem is converging on OpenTelemetry for instrumentation. Grafana positions itself as the visualization layer for OTel data. Candidates who understand this trajectory are thinking about observability strategically, not just tactically.