Data Pipelines: What Production Actually Teaches You

I’ve built data pipelines for 8 years. Every production deployment humbled me. Here’s what the tutorials don’t teach.

Lesson 1: Data Is Never Clean

Every data source lies to you.

“These fields are always populated.” They’re not.

“Dates are in ISO format.” Some are. Some are US format. One is a Unix timestamp. Another is “last Tuesday.”

“IDs are unique.” Until they’re not.

# What you expect
{"id": 123, "name": "John", "date": "2026-02-05"}

# What you get at 3 AM
{"id": None, "name": "", "date": "02/05/26", "extra_field": "surprise"}

Build for the data you’ll get, not the data you’re told you’ll get.
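A minimal defensive-normalization sketch, using only the standard library and assuming the record shape above (the field names and accepted date formats are assumptions, not a spec):

from datetime import datetime, timezone

def normalize_record(raw):
    # Reject records without an id rather than inventing one.
    if raw.get("id") is None:
        return None

    # Empty strings become explicit nulls.
    name = raw.get("name") or None

    return {"id": int(raw["id"]), "name": name, "date": parse_date(raw.get("date"))}

def parse_date(value):
    # Accept the formats we've actually seen: ISO, US short dates, Unix timestamps.
    if value in (None, ""):
        return None
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc).date().isoformat()
    for fmt in ("%Y-%m-%d", "%m/%d/%y", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # Unparseable: let a downstream quality check flag it

Unknown fields like "extra_field" are simply ignored here; whether to log or keep them is a separate decision.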

Lesson 2: Idempotency Is Everything

Your pipeline will fail. It will be rerun. If it can’t handle reruns, you’ll have duplicates.

# Bad: Append always (a rerun inserts the same rows twice)
def process_batch(data):
    insert_into_table(data)

# Good: Upsert or merge (a rerun updates existing rows instead)
def process_batch(data):
    merge_into_table(
        data,
        match_on=["id"],
        when_matched="update",
        when_not_matched="insert"
    )

Every pipeline should be safe to rerun. No exceptions.
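In practice the merge is usually a database-side upsert. A sketch using SQLite's ON CONFLICT clause (the events table and its columns are illustrative; most warehouses expose the same idea as MERGE):

import sqlite3

def process_batch(conn, rows):
    # Upsert keyed on id: rerunning the same batch updates rows instead of duplicating them.
    # Assumes id is the table's primary key (or carries a unique constraint).
    conn.executemany(
        """
        INSERT INTO events (id, name, event_date)
        VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            name = excluded.name,
            event_date = excluded.event_date
        """,
        rows,
    )
    conn.commit()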

Lesson 3: Monitoring Beats Prevention

You can’t prevent all failures. But you can detect them fast.

What to monitor:

  • Row counts (sudden drops or spikes)
  • Schema changes (new columns, type changes)
  • Freshness (when did data last arrive?)
  • Null rates (unexpected nulls in required fields)
  • Value distributions (is 90% of a column suddenly the same value?)

A minimal version of the first two checks, assuming a PySpark DataFrame (the thresholds are illustrative):

from pyspark.sql.functions import col

def check_data_quality(df, expected_min_rows=1000, max_null_rate=0.01):
    alerts = []

    total = df.count()
    if total < expected_min_rows:
        alerts.append("Row count below threshold")

    # Guard against an empty frame before computing the null rate
    if total > 0:
        null_rate = df.filter(col("key_field").isNull()).count() / total
        if null_rate > max_null_rate:
            alerts.append(f"Null rate {null_rate:.2%} exceeds {max_null_rate:.0%}")

    return alerts
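Freshness deserves the same treatment. A sketch, assuming each row carries a loaded_at timestamp stored as naive UTC (the column name and the six-hour threshold are assumptions):

from datetime import datetime, timedelta
from pyspark.sql.functions import col, max as spark_max

def check_freshness(df, max_lag_hours=6):
    # Alert if the newest loaded_at is older than the allowed lag.
    latest = df.select(spark_max(col("loaded_at"))).first()[0]
    if latest is None:
        return ["No data has ever arrived"]
    lag = datetime.utcnow() - latest
    if lag > timedelta(hours=max_lag_hours):
        return [f"Data is {lag} old, exceeds {max_lag_hours}h threshold"]
    return []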

Lesson 4: Start with Microsoft Fabric Data Factory

For Azure shops, Fabric’s Data Factory handles 80% of pipeline needs without custom code.

Where it shines:

  • Copy activities between sources
  • Simple transformations
  • Scheduling and orchestration
  • Built-in monitoring

Where you need code:

  • Complex business logic
  • Multi-step ML pipelines
  • Custom data quality checks

Don’t write code when configuration will do.

Lesson 5: Late Data Is Normal

Data arrives late. Sometimes hours late. Sometimes days late.

Design for this (a sketch follows the list):

  • Use event time, not processing time
  • Build late-arrival windows
  • Re-process when late data arrives
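A minimal batch-side sketch of a late-arrival window, assuming each record carries an event_time field (the field name and the three-day window are assumptions):

from datetime import datetime, timedelta, timezone

LATE_WINDOW = timedelta(days=3)  # how far back late data is still accepted

def reprocess_range(run_time):
    # Re-process everything whose event time falls in the late window,
    # not just "yesterday": late rows land in already-processed partitions.
    return run_time - LATE_WINDOW, run_time

def partition_key(record):
    # Partition by event time, not processing time, so a late record
    # always lands in the partition where it belongs.
    return record["event_time"].astimezone(timezone.utc).strftime("%Y-%m-%d")

Paired with the idempotent merge from Lesson 2, re-running that window is safe.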

Lesson 6: Schema Evolution Will Happen

Sources change schemas without telling you. New fields appear. Types change. Fields disappear.

Handle it gracefully or wake up at 3 AM.
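A lightweight drift check helps, assuming you maintain an expected column-to-type mapping somewhere (EXPECTED_SCHEMA here is illustrative, not a real contract):

EXPECTED_SCHEMA = {"id": "bigint", "name": "string", "event_date": "date"}

def detect_schema_drift(actual_schema):
    # Compare what the source sent against what we expect,
    # and report drift instead of failing silently overnight.
    alerts = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in actual_schema:
            alerts.append(f"Missing column: {column}")
        elif actual_schema[column] != expected_type:
            alerts.append(f"Type change on {column}: expected {expected_type}, got {actual_schema[column]}")
    for column in actual_schema.keys() - EXPECTED_SCHEMA.keys():
        alerts.append(f"New column: {column}")
    return alerts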

The Reality

Production data pipelines are 20% building and 80% maintaining.

The exciting part is building them. The valuable part is keeping them running.

Design for maintenance from day one. Your future self will thank you.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.