Overview
Data pipeline development means building automated systems that extract data from sources, transform it for analysis, and load it into destinations where it drives business decisions. Modern data pipelines power dashboards, ML models, and operational systems.
The data engineering landscape has matured significantly. The "modern data stack" centers on cloud warehouses (Snowflake, BigQuery), transformation tools (dbt), and orchestrators (Airflow, Dagster, Prefect). This shift moved complexity from ETL developers writing custom code to declarative frameworks that enable smaller teams to manage larger data volumes.
When hiring, evaluate SQL depth, data modeling intuition, and systems thinking over specific tool certifications. The best data engineers understand that pipelines are products—they need reliability engineering, not just scripting skills. Tools change; fundamentals don't.
What Success Looks Like
Building data pipelines successfully means creating infrastructure where data flows reliably, stakeholders trust the numbers, and engineers aren't firefighting broken jobs daily. Here's what distinguishes good data infrastructure from problematic setups:
Reliable Data Flow
- Pipelines run on schedule without manual intervention
- Data arrives when expected (and alerts fire when it doesn't)
- Failures are self-documenting with clear error messages and lineage
- Recovery is automated through idempotent jobs and backfill capabilities
- Schema changes don't break downstream consumers silently
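Idempotency and backfill are the two properties above that most often confuse newcomers. A minimal sketch of the idea, using sqlite3 and a hypothetical `events` table (names and schema are illustrative, not a prescribed design): the job deletes and rewrites its own partition in one transaction, so a rerun or a historical backfill produces exactly the same result as a single clean run.

```python
import sqlite3
from datetime import date

def load_daily_events(conn: sqlite3.Connection, run_date: date, rows: list[tuple]) -> None:
    """Idempotent daily load: delete-then-insert the day's partition so
    reruns and backfills never duplicate data."""
    ds = run_date.isoformat()
    with conn:  # one transaction: the partition is never left half-written
        conn.execute("DELETE FROM events WHERE event_date = ?", (ds,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id, amount) VALUES (?, ?, ?)",
            [(ds, user_id, amount) for user_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id TEXT, amount REAL)")
load_daily_events(conn, date(2024, 1, 1), [("u1", 9.99), ("u2", 5.00)])
# Rerunning the same day (e.g. after a failure, or as a backfill) is safe:
load_daily_events(conn, date(2024, 1, 1), [("u1", 9.99), ("u2", 5.00)])
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2, not 4
```

Non-idempotent jobs (append-only inserts) are the usual reason recovery requires manual cleanup instead of a simple rerun.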
Trusted Data
- Stakeholders use dashboards confidently without "let me verify that number first"
- Data quality tests catch issues before they reach production reports
- Lineage is traceable from source to final metric
- Documentation exists explaining what each table contains and how it's calculated
- Discrepancies between systems are understood and documented
Sustainable Team Operations
- On-call is manageable (not every morning spent debugging failed jobs)
- New pipelines are straightforward to build using established patterns
- Knowledge isn't siloed in one person's head
- Technical debt is actively managed, not ignored until crisis
Roles You'll Need
Data infrastructure requires different skills than typical software engineering. Here's who you need and when:
Data Engineer
Focus: Building and maintaining pipelines, infrastructure, and reliability
Key skills: Python, SQL, orchestration (Airflow/Dagster), cloud data warehouses
When to hire: First data hire after your initial analyst outgrows spreadsheets
Salary range: $120-165K mid, $165-210K senior
Data engineers are the core of your pipeline team. They build the infrastructure that moves data from sources to destinations, ensure reliability, and enable downstream consumers. At early stages, one strong data engineer handles everything from ingestion to transformation. As you scale, they specialize into pipeline engineers (ingestion focus) and analytics engineers (transformation focus).
Analytics Engineer
Focus: Transforming raw data into clean, business-ready models using dbt
Key skills: Advanced SQL, dbt, data modeling, stakeholder communication
When to hire: After you have reliable data flowing but need better organization
Salary range: $110-145K mid, $145-185K senior
Analytics engineers bridge the gap between raw data and business metrics. They own the transformation layer—turning event logs into user journeys, transactions into revenue metrics, and raw tables into dimensional models analysts can query. This role emerged with dbt's popularity and is distinct from traditional data engineering. Analytics engineers need excellent SQL and business context; they're not primarily infrastructure-focused.
Data Platform Engineer
Focus: Building internal tools and platforms for data producers and consumers
Key skills: Software engineering, infrastructure, developer experience
When to hire: When your data team reaches 5+ and needs better tooling
Salary range: $140-180K mid, $180-230K senior
Platform engineers build the internal developer experience for data work—catalog systems, data discovery tools, access management, and self-service infrastructure. This is a senior role for teams at scale where the data infrastructure itself becomes a product with internal customers.
ML Engineer (for Feature Pipelines)
Focus: Building pipelines that feed ML models with real-time and batch features
Key skills: Python, feature stores, ML frameworks, streaming systems
When to hire: When ML models need production features beyond batch data
Salary range: $150-190K mid, $190-250K senior
If your data pipelines feed ML models, you'll eventually need engineers who understand both data engineering and ML requirements. Feature pipelines have different latency, freshness, and consistency needs than analytics pipelines.
Tech Stack Decisions
Batch vs. Streaming: The Most Important Choice
Start with batch processing. Most companies don't need real-time pipelines, and streaming adds significant operational complexity. Batch pipelines are easier to debug, easier to backfill, and easier to hire for.
When to consider streaming:
- User-facing latency requirements (fraud detection, personalization, live dashboards)
- Event-driven architectures where you need to react to events immediately
- Data volumes too large for batch windows to complete
Streaming warning signs to probe in interviews:
- "We need real-time analytics" (usually means "hourly would be fine")
- "Our competitors use Kafka" (not a technical requirement)
- "Management wants a real-time dashboard" (often hourly refreshes suffice)
The Modern Data Stack
| Layer | Recommended Tools | Why |
|---|---|---|
| Ingestion | Fivetran, Airbyte, Stitch | Managed connectors reduce maintenance burden |
| Warehouse | Snowflake, BigQuery, Redshift | Scalable, SQL-native, managed |
| Transformation | dbt | Industry standard, testable, version-controlled |
| Orchestration | Airflow, Dagster, Prefect | Dependency management, monitoring, alerting |
| Quality | dbt tests, Great Expectations, Monte Carlo | Catch issues before stakeholders do |
For Streaming (when actually needed)
| Layer | Recommended Tools | Why |
|---|---|---|
| Message Queue | Kafka, AWS Kinesis, Google Pub/Sub | Battle-tested, scalable |
| Processing | Flink, Spark Streaming, Kafka Streams | Depends on latency requirements |
| State Store | Redis, DynamoDB | Real-time feature serving |
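The core concept behind all of the processing tools above is windowed aggregation: grouping an unbounded stream of events into bounded time windows. A toy sketch in plain Python (tumbling one-minute windows over hypothetical `(timestamp, amount)` events; real systems add late-event handling, state checkpointing, and exactly-once delivery):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling one-minute windows (illustrative choice)

def window_start(ts: int) -> int:
    """Map an event timestamp to the start of its window."""
    return ts - (ts % WINDOW_SECONDS)

# In a real system these would stream in from Kafka/Kinesis/PubSub.
events = [(0, 5.0), (30, 3.0), (61, 2.0), (75, 4.0)]

totals: dict[int, float] = defaultdict(float)
for ts, amount in events:
    totals[window_start(ts)] += amount

print(dict(totals))  # {0: 8.0, 60: 6.0}
```

If this is all a use case needs, an hourly batch job computes the same aggregates with far less machinery, which is why the warning signs above matter.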
Hiring Sequence
Phase 1: First Data Engineer (0 → 1)
Your first data engineer should be a generalist who can:
- Set up a data warehouse and initial pipelines
- Work directly with stakeholders to understand data needs
- Make pragmatic decisions without over-engineering
- Build foundations others can extend
Interview focus: "Tell me about a time you built data infrastructure from scratch. What tradeoffs did you make?"
What to look for:
- Strong SQL and Python fundamentals
- Experience with at least one modern orchestrator
- Product mindset (understands why data matters to the business)
- Self-directed with minimal supervision
Red flag: Candidates who want to implement Kafka, Spark, and a lakehouse from day one. Start simple.
Phase 2: Growing the Team (2-4 engineers)
Once your first engineer has established patterns, add specialists:
Second hire: Another generalist or analytics engineer
- Handles dbt transformations and data modeling
- Partners with analysts to understand needs
- Frees up the first engineer for infrastructure work
Third hire: Based on your bottleneck
- More ingestion complexity → Pipeline engineer
- More modeling needs → Analytics engineer
- Reliability issues → Data platform engineer
Phase 3: Scale (5+ engineers)
At this stage, formalize roles:
- Pipeline team (ingestion, orchestration, reliability)
- Analytics engineering team (dbt, modeling, data quality)
- Platform team (tooling, self-service, governance)
You'll need technical leadership to coordinate across teams and establish standards.
Common Pitfalls
1. Hiring Before You Know What You Need
The mistake: "Let's hire a data team" without clear requirements
The result: Expensive hires sitting idle or building the wrong things
Better approach: Start with an analyst or data-savvy engineer who can identify the actual needs. Then hire data engineers to solve specific problems.
2. Over-Engineering from Day One
The mistake: Building a data lakehouse with Spark, Kafka, and custom orchestration
The result: Months of infrastructure work before delivering business value
Better approach: Use managed services aggressively. Fivetran/Airbyte for ingestion, a cloud warehouse, dbt for transformations. Add complexity only when managed services can't meet requirements.
3. Requiring Specific Tool Experience
The mistake: "Must have 3+ years of Airflow experience"
The result: Excellent candidates who used Prefect or Dagster get filtered out
Better approach: Test fundamentals—SQL, data modeling, systems thinking. Someone who understands orchestration concepts learns Airflow in weeks.
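The orchestration concepts in question are mostly dependency management: a pipeline is a DAG of tasks, and tasks run only after their upstreams succeed. A minimal illustration using only the standard library's `graphlib` (task names are hypothetical; real orchestrators add scheduling, retries, and alerting on top of this idea):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "stg_orders": {"extract_orders"},
    "stg_users": {"extract_users"},
    "fct_revenue": {"stg_orders", "stg_users"},
}

def run(task: str) -> None:
    print(f"running {task}")  # stand-in for the real work

# Topological order guarantees every task runs after all of its upstreams.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
```

A candidate who can reason about this structure (fan-in, fan-out, what a failed node blocks) will pick up any specific orchestrator's syntax quickly.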
4. Ignoring Data Quality Until Crisis
The mistake: Building pipelines without tests or monitoring
The result: Stakeholder trust erodes when numbers don't match
Better approach: Data quality is a first-class concern from day one. Build tests alongside pipelines, not after.
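"Tests alongside pipelines" can be as simple as a few assertions that run after each load and fail the job before bad data reaches a dashboard. A hedged sketch (a hypothetical `orders` table; in practice dbt tests or Great Expectations express the same checks declaratively):

```python
import sqlite3

def check_quality(conn: sqlite3.Connection) -> list[str]:
    """Post-load checks: return a list of human-readable failures."""
    failures = []
    null_ids = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL").fetchone()[0]
    if null_ids:
        failures.append(f"{null_ids} orders with NULL customer_id")
    dupes = conn.execute(
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)").fetchone()[0]
    if dupes:
        failures.append(f"{dupes} duplicate order_ids")
    return failures

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("o1", "c1", 10.0), ("o1", "c1", 10.0), ("o2", None, 5.0)])
failures = check_quality(conn)
for failure in failures:
    print("QUALITY FAILURE:", failure)
```

The specific checks matter less than the habit: every new table ships with its not-null, uniqueness, and freshness checks from day one.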
5. Treating Data Engineering as Backend Engineering
The mistake: Hiring backend engineers expecting them to build pipelines
The result: Well-architected code that gets data semantics wrong
Better approach: Data engineering requires specific skills—SQL depth, modeling intuition, understanding of idempotency and late-arriving data. These aren't automatic from backend experience.
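Late-arriving data is a good litmus test for this gap. One common pattern, sketched here under assumed names and a hypothetical three-day lateness SLA: instead of aggregating only "today," each run recomputes a trailing window, so an event that arrives a day or two late still lands in the correct day's totals.

```python
import sqlite3
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # assumed SLA: events may arrive up to 3 days late

def rebuild_daily_revenue(conn: sqlite3.Connection, run_date: date) -> None:
    """Recompute the trailing window of the aggregate on every run."""
    start = (run_date - timedelta(days=LOOKBACK_DAYS)).isoformat()
    with conn:
        conn.execute("DELETE FROM daily_revenue WHERE day >= ?", (start,))
        conn.execute(
            """INSERT INTO daily_revenue (day, revenue)
               SELECT event_date, SUM(amount) FROM events
               WHERE event_date >= ? GROUP BY event_date""",
            (start,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, amount REAL)")
conn.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
conn.execute("INSERT INTO events VALUES ('2024-01-05', 10.0)")
rebuild_daily_revenue(conn, date(2024, 1, 6))
# A late event for Jan 5 arrives a day later; the window re-run picks it up.
conn.execute("INSERT INTO events VALUES ('2024-01-05', 7.0)")
rebuild_daily_revenue(conn, date(2024, 1, 7))
revenue = conn.execute(
    "SELECT revenue FROM daily_revenue WHERE day = '2024-01-05'").fetchone()[0]
print(revenue)  # 17.0
```

An engineer without data-specific experience will often aggregate each day exactly once and silently drop the late 7.0.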
6. Underestimating Maintenance Burden
The mistake: Planning for pipeline building, not pipeline operating
The result: Engineers spend 80% of their time firefighting, not building
Better approach: Budget engineering time for maintenance: schema changes, backfills, debugging, optimization. A realistic ratio is 50% building, 50% maintaining for mature teams.
Interview Strategy
Technical Assessment
Test SQL depth beyond basic queries:
- Window functions, CTEs, complex joins
- Query optimization and explain plans
- Data modeling scenarios (slowly changing dimensions, handling late data)
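A concrete flavor of the "beyond basic queries" bar: the deduplicate-to-latest-record pattern, which combines a CTE with a partitioned window function and shows up constantly in real pipelines. A self-contained version using sqlite3 and a hypothetical `profiles` table:

```python
import sqlite3

# Classic exercise: keep only each user's most recent record using
# ROW_NUMBER() over a per-user partition.
QUERY = """
WITH ranked AS (
    SELECT user_id, plan, updated_at,
           ROW_NUMBER() OVER (
               PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM profiles
)
SELECT user_id, plan FROM ranked WHERE rn = 1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id TEXT, plan TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO profiles VALUES (?, ?, ?)", [
    ("u1", "free", "2024-01-01"),
    ("u1", "pro",  "2024-03-01"),  # u1 upgraded later
    ("u2", "free", "2024-02-01"),
])
latest = sorted(conn.execute(QUERY).fetchall())
print(latest)  # [('u1', 'pro'), ('u2', 'free')]
```

A strong candidate writes this fluently and can explain why `ROW_NUMBER()` (not `MAX()` plus a join) is the robust choice when ties or extra columns are involved.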
Give a practical take-home or live coding exercise:
- "Here's messy data, transform it into this schema"
- "Debug this failing pipeline and explain what went wrong"
- "Design a pipeline architecture for this use case"
Questions to Ask
For generalist data engineers:
- "Walk me through a pipeline you built end-to-end. What would you do differently?"
- "How do you handle late-arriving data?"
- "Tell me about a data quality issue you discovered and fixed."
For analytics engineers:
- "How do you structure dbt models? Walk me through your marts and staging conventions."
- "How do you document your transformations for business stakeholders?"
- "Tell me about a time a stakeholder questioned your numbers. How did you handle it?"
For senior hires:
- "How would you set up data infrastructure for a company at our stage?"
- "What's your approach to balancing new feature development with reliability work?"
- "How do you think about build vs. buy decisions for data tools?"
Building Your Data Culture
Great data infrastructure isn't just pipelines—it's culture. Hire for these traits:
- Ownership mentality — Engineers who treat pipelines as products, not just jobs to run
- Stakeholder empathy — Understanding that data serves business decisions, not technical elegance
- Reliability focus — Valuing uptime and data quality over clever solutions
- Documentation habits — Writing things down so knowledge isn't siloed
The best data teams feel responsible for the decisions made from their data. That ownership separates excellent data engineering from adequate data engineering.