Overview
Data pipeline development means building automated systems that extract data from sources, transform it for analysis, and load it into destinations where it drives business decisions. Modern data pipelines power dashboards, ML models, and operational systems.
The data engineering landscape has matured significantly. The "modern data stack" centers on cloud warehouses (Snowflake, BigQuery), transformation tools (dbt), and orchestrators (Airflow, Dagster, Prefect). This shift moved complexity from ETL developers writing custom code to declarative frameworks that enable smaller teams to manage larger data volumes.
When hiring, evaluate SQL depth, data modeling intuition, and systems thinking over specific tool certifications. The best data engineers understand that pipelines are products—they need reliability engineering, not just scripting skills. Tools change; fundamentals don't.
What Success Looks Like
Building data pipelines successfully means creating infrastructure where data flows reliably, stakeholders trust the numbers, and engineers aren't firefighting broken jobs daily. Here's what distinguishes good data infrastructure from problematic setups:
Reliable Data Flow
- Pipelines run on schedule without manual intervention
- Data arrives when expected (and alerts fire when it doesn't)
- Failures are self-documenting with clear error messages and lineage
- Recovery is automated through idempotent jobs and backfill capabilities
- Schema changes don't break downstream consumers silently
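Idempotency and backfill are the two properties above that most often confuse newcomers. A minimal sketch of the idea, using sqlite3 and a hypothetical `events` table (names and schema are illustrative, not a prescribed design): the job deletes and rewrites its own partition in one transaction, so a rerun or a historical backfill produces exactly the same result as a single clean run.

```python
import sqlite3
from datetime import date

def load_daily_events(conn: sqlite3.Connection, run_date: date, rows: list[tuple]) -> None:
    """Idempotent daily load: delete-then-insert the day's partition so
    reruns and backfills never duplicate data."""
    ds = run_date.isoformat()
    with conn:  # one transaction: the partition is never left half-written
        conn.execute("DELETE FROM events WHERE event_date = ?", (ds,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id, amount) VALUES (?, ?, ?)",
            [(ds, user_id, amount) for user_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id TEXT, amount REAL)")
load_daily_events(conn, date(2024, 1, 1), [("u1", 9.99), ("u2", 5.00)])
# Rerunning the same day (e.g. after a failure, or as a backfill) is safe:
load_daily_events(conn, date(2024, 1, 1), [("u1", 9.99), ("u2", 5.00)])
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2, not 4
```

Non-idempotent jobs (append-only inserts) are the usual reason recovery requires manual cleanup instead of a simple rerun.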
Trusted Data
- Stakeholders use dashboards confidently without "let me verify that number first"
- Data quality tests catch issues before they reach production reports
- Lineage is traceable from source to final metric
- Documentation exists explaining what each table contains and how it's calculated
- Discrepancies between systems are understood and documented
Sustainable Team Operations
- On-call is manageable (not every morning spent debugging failed jobs)
- New pipelines are straightforward to build using established patterns
- Knowledge isn't siloed in one person's head
- Technical debt is actively managed, not ignored until crisis
Roles You'll Need
Data infrastructure requires different skills than typical software engineering. Here's who you need and when:
Data Engineer
Focus: Building and maintaining pipelines, infrastructure, and reliability
Key skills: Python, SQL, orchestration (Airflow/Dagster), cloud data warehouses
When to hire: First data hire after your initial analyst outgrows spreadsheets
Salary range: $120-165K mid, $165-210K senior
Data engineers are the core of your pipeline team. They build the infrastructure that moves data from sources to destinations, ensure reliability, and enable downstream consumers. At early stages, one strong data engineer handles everything from ingestion to transformation. As you scale, they specialize into pipeline engineers (ingestion focus) and analytics engineers (transformation focus).
Analytics Engineer
Focus: Transforming raw data into clean, business-ready models using dbt
Key skills: Advanced SQL, dbt, data modeling, stakeholder communication
When to hire: After you have reliable data flowing but need better organization
Salary range: $110-145K mid, $145-185K senior
Analytics engineers bridge the gap between raw data and business metrics. They own the transformation layer—turning event logs into user journeys, transactions into revenue metrics, and raw tables into dimensional models analysts can query. This role emerged with dbt's popularity and is distinct from traditional data engineering. Analytics engineers need excellent SQL and business context; they're not primarily infrastructure-focused.
Data Platform Engineer
Focus: Building internal tools and platforms for data producers and consumers
Key skills: Software engineering, infrastructure, developer experience
When to hire: When your data team reaches 5+ and needs better tooling
Salary range: $140-180K mid, $180-230K senior
Platform engineers build the internal developer experience for data work—catalog systems, data discovery tools, access management, and self-service infrastructure. This is a senior role for teams at scale where the data infrastructure itself becomes a product with internal customers.
ML Engineer (for Feature Pipelines)
Focus: Building pipelines that feed ML models with real-time and batch features
Key skills: Python, feature stores, ML frameworks, streaming systems
When to hire: When ML models need production features beyond batch data
Salary range: $150-190K mid, $190-250K senior
If your data pipelines feed ML models, you'll eventually need engineers who understand both data engineering and ML requirements. Feature pipelines have different latency, freshness, and consistency needs than analytics pipelines.
Tech Stack Decisions
Batch vs. Streaming: The Most Important Choice
Start with batch processing. Most companies don't need real-time pipelines, and streaming adds significant operational complexity. Batch pipelines are easier to debug, easier to backfill, and easier to hire for.
When to consider streaming:
- User-facing latency requirements (fraud detection, personalization, live dashboards)
- Event-driven architectures where you need to react to events immediately
- Data volumes too large for batch windows to complete
Streaming warning signs to probe in interviews:
- "We need real-time analytics" (usually means "hourly would be fine")
- "Our competitors use Kafka" (not a technical requirement)
- "Management wants a real-time dashboard" (often hourly refreshes suffice)
The Modern Data Stack
| Layer | Recommended Tools | Why |
|---|---|---|
| Ingestion | Fivetran, Airbyte, Stitch | Managed connectors reduce maintenance burden |
| Warehouse | Snowflake, BigQuery, Redshift | Scalable, SQL-native, managed |
| Transformation | dbt | Industry standard, testable, version-controlled |
| Orchestration | Airflow, Dagster, Prefect | Dependency management, monitoring, alerting |
| Quality | dbt tests, Great Expectations, Monte Carlo | Catch issues before stakeholders do |
For Streaming (when actually needed)
| Layer | Recommended Tools | Why |
|---|---|---|
| Message Queue | Kafka, AWS Kinesis, Google Pub/Sub | Battle-tested, scalable |
| Processing | Flink, Spark Streaming, Kafka Streams | Depends on latency requirements |
| State Store | Redis, DynamoDB | Real-time feature serving |
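The core concept behind all of the processing tools above is windowed aggregation: grouping an unbounded stream of events into bounded time windows. A toy sketch in plain Python (tumbling one-minute windows over hypothetical `(timestamp, amount)` events; real systems add late-event handling, state checkpointing, and exactly-once delivery):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling one-minute windows (illustrative choice)

def window_start(ts: int) -> int:
    """Map an event timestamp to the start of its window."""
    return ts - (ts % WINDOW_SECONDS)

# In a real system these would stream in from Kafka/Kinesis/PubSub.
events = [(0, 5.0), (30, 3.0), (61, 2.0), (75, 4.0)]

totals: dict[int, float] = defaultdict(float)
for ts, amount in events:
    totals[window_start(ts)] += amount

print(dict(totals))  # {0: 8.0, 60: 6.0}
```

If this is all a use case needs, an hourly batch job computes the same aggregates with far less machinery, which is why the warning signs above matter.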
Hiring Sequence
Phase 1: First Data Engineer (0 → 1)
Your first data engineer should be a generalist who can:
- Set up a data warehouse and initial pipelines
- Work directly with stakeholders to understand data needs
- Make pragmatic decisions without over-engineering
- Build foundations others can extend
Interview focus: "Tell me about a time you built data infrastructure from scratch. What tradeoffs did you make?"
What to look for:
- Strong SQL and Python fundamentals
- Experience with at least one modern orchestrator
- Product mindset (understands why data matters to the business)
- Self-directed with minimal supervision
Red flag: Candidates who want to implement Kafka, Spark, and a lakehouse from day one. Start simple.
Phase 2: Growing the Team (2-4 engineers)
Once your first engineer has established patterns, add specialists:
Second hire: Another generalist or analytics engineer
- Handles dbt transformations and data modeling
- Partners with analysts to understand needs
- Frees up the first engineer for infrastructure work
Third hire: Based on your bottleneck
- More ingestion complexity → Pipeline engineer
- More modeling needs → Analytics engineer
- Reliability issues → Data platform engineer
Phase 3: Scale (5+ engineers)
At this stage, formalize roles:
- Pipeline team (ingestion, orchestration, reliability)
- Analytics engineering team (dbt, modeling, data quality)
- Platform team (tooling, self-service, governance)
You'll need technical leadership to coordinate across teams and establish standards.
Common Pitfalls
1. Hiring Before You Know What You Need
The mistake: "Let's hire a data team" without clear requirements
The result: Expensive hires sitting idle or building the wrong things
Better approach: Start with an analyst or data-savvy engineer who can identify the actual needs. Then hire data engineers to solve specific problems.
2. Over-Engineering from Day One
The mistake: Building a data lakehouse with Spark, Kafka, and custom orchestration
The result: Months of infrastructure work before delivering business value
Better approach: Use managed services aggressively. Fivetran/Airbyte for ingestion, a cloud warehouse, dbt for transformations. Add complexity only when managed services can't meet requirements.
3. Requiring Specific Tool Experience
The mistake: "Must have 3+ years of Airflow experience"
The result: Excellent candidates who used Prefect or Dagster get filtered out
Better approach: Test fundamentals—SQL, data modeling, systems thinking. Someone who understands orchestration concepts learns Airflow in weeks.
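The orchestration concepts in question are mostly dependency management: a pipeline is a DAG of tasks, and tasks run only after their upstreams succeed. A minimal illustration using only the standard library's `graphlib` (task names are hypothetical; real orchestrators add scheduling, retries, and alerting on top of this idea):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "stg_orders": {"extract_orders"},
    "stg_users": {"extract_users"},
    "fct_revenue": {"stg_orders", "stg_users"},
}

def run(task: str) -> None:
    print(f"running {task}")  # stand-in for the real work

# Topological order guarantees every task runs after all of its upstreams.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
```

A candidate who can reason about this structure (fan-in, fan-out, what a failed node blocks) will pick up any specific orchestrator's syntax quickly.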
4. Ignoring Data Quality Until Crisis
The mistake: Building pipelines without tests or monitoring
The result: Stakeholder trust erodes when numbers don't match
Better approach: Data quality is a first-class concern from day one. Build tests alongside pipelines, not after.
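"Tests alongside pipelines" can be as simple as a few assertions that run after each load and fail the job before bad data reaches a dashboard. A hedged sketch (a hypothetical `orders` table; in practice dbt tests or Great Expectations express the same checks declaratively):

```python
import sqlite3

def check_quality(conn: sqlite3.Connection) -> list[str]:
    """Post-load checks: return a list of human-readable failures."""
    failures = []
    null_ids = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL").fetchone()[0]
    if null_ids:
        failures.append(f"{null_ids} orders with NULL customer_id")
    dupes = conn.execute(
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)").fetchone()[0]
    if dupes:
        failures.append(f"{dupes} duplicate order_ids")
    return failures

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("o1", "c1", 10.0), ("o1", "c1", 10.0), ("o2", None, 5.0)])
failures = check_quality(conn)
for failure in failures:
    print("QUALITY FAILURE:", failure)
```

The specific checks matter less than the habit: every new table ships with its not-null, uniqueness, and freshness checks from day one.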
5. Treating Data Engineering as Backend Engineering
The mistake: Hiring backend engineers expecting them to build pipelines
The result: Well-architected code that gets data semantics wrong
Better approach: Data engineering requires specific skills—SQL depth, modeling intuition, understanding of idempotency and late-arriving data. These aren't automatic from backend experience.
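Late-arriving data is a good litmus test for this gap. One common pattern, sketched here under assumed names and a hypothetical three-day lateness SLA: instead of aggregating only "today," each run recomputes a trailing window, so an event that arrives a day or two late still lands in the correct day's totals.

```python
import sqlite3
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # assumed SLA: events may arrive up to 3 days late

def rebuild_daily_revenue(conn: sqlite3.Connection, run_date: date) -> None:
    """Recompute the trailing window of the aggregate on every run."""
    start = (run_date - timedelta(days=LOOKBACK_DAYS)).isoformat()
    with conn:
        conn.execute("DELETE FROM daily_revenue WHERE day >= ?", (start,))
        conn.execute(
            """INSERT INTO daily_revenue (day, revenue)
               SELECT event_date, SUM(amount) FROM events
               WHERE event_date >= ? GROUP BY event_date""",
            (start,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, amount REAL)")
conn.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
conn.execute("INSERT INTO events VALUES ('2024-01-05', 10.0)")
rebuild_daily_revenue(conn, date(2024, 1, 6))
# A late event for Jan 5 arrives a day later; the window re-run picks it up.
conn.execute("INSERT INTO events VALUES ('2024-01-05', 7.0)")
rebuild_daily_revenue(conn, date(2024, 1, 7))
revenue = conn.execute(
    "SELECT revenue FROM daily_revenue WHERE day = '2024-01-05'").fetchone()[0]
print(revenue)  # 17.0
```

An engineer without data-specific experience will often aggregate each day exactly once and silently drop the late 7.0.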
6. Underestimating Maintenance Burden
The mistake: Planning for pipeline building, not pipeline operating
The result: Engineers spend 80% of their time firefighting, not building
Better approach: Budget engineering time for maintenance: schema changes, backfills, debugging, optimization. A realistic ratio is 50% building, 50% maintaining for mature teams.
Interview Strategy
Technical Assessment
Test SQL depth beyond basic queries:
- Window functions, CTEs, complex joins
- Query optimization and explain plans
- Data modeling scenarios (slowly changing dimensions, handling late data)
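A concrete flavor of the "beyond basic queries" bar: the deduplicate-to-latest-record pattern, which combines a CTE with a partitioned window function and shows up constantly in real pipelines. A self-contained version using sqlite3 and a hypothetical `profiles` table:

```python
import sqlite3

# Classic exercise: keep only each user's most recent record using
# ROW_NUMBER() over a per-user partition.
QUERY = """
WITH ranked AS (
    SELECT user_id, plan, updated_at,
           ROW_NUMBER() OVER (
               PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM profiles
)
SELECT user_id, plan FROM ranked WHERE rn = 1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id TEXT, plan TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO profiles VALUES (?, ?, ?)", [
    ("u1", "free", "2024-01-01"),
    ("u1", "pro",  "2024-03-01"),  # u1 upgraded later
    ("u2", "free", "2024-02-01"),
])
latest = sorted(conn.execute(QUERY).fetchall())
print(latest)  # [('u1', 'pro'), ('u2', 'free')]
```

A strong candidate writes this fluently and can explain why `ROW_NUMBER()` (not `MAX()` plus a join) is the robust choice when ties or extra columns are involved.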
Give a practical take-home or live coding exercise:
- "Here's messy data, transform it into this schema"
- "Debug this failing pipeline and explain what went wrong"
- "Design a pipeline architecture for this use case"
Questions to Ask
For generalist data engineers:
- "Walk me through a pipeline you built end-to-end. What would you do differently?"
- "How do you handle late-arriving data?"
- "Tell me about a data quality issue you discovered and fixed."
For analytics engineers:
- "How do you structure dbt models? Walk me through your marts and staging conventions."
- "How do you document your transformations for business stakeholders?"
- "Tell me about a time a stakeholder questioned your numbers. How did you handle it?"
For senior hires:
- "How would you set up data infrastructure for a company at our stage?"
- "What's your approach to balancing new feature development with reliability work?"
- "How do you think about build vs. buy decisions for data tools?"
Building Your Data Culture
Great data infrastructure isn't just pipelines—it's culture. Hire for these traits:
- Ownership mentality — Engineers who treat pipelines as products, not just jobs to run
- Stakeholder empathy — Understanding that data serves business decisions, not technical elegance
- Reliability focus — Valuing uptime and data quality over clever solutions
- Documentation habits — Writing things down so knowledge isn't siloed
The best data teams feel responsible for the decisions made from their data. That ownership separates excellent data engineering from adequate data engineering.