
Data Engineering Trends That Defined 2021

Data engineering has evolved dramatically in 2021. The role has expanded beyond ETL pipelines to encompass data quality, governance, and platform engineering. Let’s explore the trends that shaped the field this year.

The Rise of the Modern Data Stack

The modern data stack became mainstream, characterized by:

  • Cloud-native data warehouses
  • ELT over ETL
  • Declarative transformations
  • Automated data quality

-- dbt model for incremental processing - a 2021 staple
-- models/orders_daily.sql
{{
    config(
        materialized='incremental',
        unique_key='order_date',
        partition_by={'field': 'order_date', 'data_type': 'date'}
    )
}}

SELECT
    DATE(order_timestamp) as order_date,
    COUNT(*) as order_count,
    SUM(order_amount) as total_revenue,
    AVG(order_amount) as avg_order_value
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE order_timestamp > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY DATE(order_timestamp)
"""

Delta Lake and Lakehouse Architecture

The lakehouse pattern gained serious traction, combining data lake flexibility with warehouse reliability:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakehouse") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# MERGE operation - ACID transactions on your data lake
delta_table = DeltaTable.forPath(spark, "/mnt/delta/customers")

# updates_df is a DataFrame of incoming customer records, built upstream
delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    set={
        "email": "source.email",
        "last_updated": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "customer_id": "source.customer_id",
        "email": "source.email",
        "created_at": "current_timestamp()",
        "last_updated": "current_timestamp()"
    }
).execute()

Data Contracts Emerge

Data contracts became a hot topic for managing producer-consumer relationships:

# data_contract.yaml
apiVersion: datacontract/v1
kind: DataContract
metadata:
  name: customer-orders
  version: 1.2.0
  owner: data-platform-team
spec:
  schema:
    type: object
    properties:
      order_id:
        type: string
        format: uuid
        description: Unique order identifier
      customer_id:
        type: string
        required: true
      order_date:
        type: string
        format: date-time
      items:
        type: array
        items:
          type: object
          properties:
            sku: { type: string }
            quantity: { type: integer, minimum: 1 }
  quality:
    completeness:
      order_id: 100%
      customer_id: 100%
    freshness: 1 hour
  sla:
    availability: 99.9%
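
Contracts only matter if someone checks them. As a rough illustration of producer-side enforcement (not tied to any specific framework), the contract above could be loaded and its required fields verified before an event is published; the file name and payload below are hypothetical:

# Illustrative producer-side check against the contract above (no specific tooling implied)
import yaml

with open("data_contract.yaml") as f:
    contract = yaml.safe_load(f)

properties = contract["spec"]["schema"]["properties"]
required_fields = [name for name, spec in properties.items() if spec.get("required")]

# Hypothetical outgoing event
event = {
    "order_id": "2f6c0c8e-5f3a-4b2e-9a41-0d2f6f1a9b7c",
    "customer_id": "C-1042",
    "order_date": "2021-12-01T10:15:00Z",
    "items": [{"sku": "SKU-1", "quantity": 2}],
}

missing = [field for field in required_fields if event.get(field) in (None, "")]
if missing:
    raise ValueError(f"contract violation: missing required fields {missing}")
print("event conforms to the customer-orders contract")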

Streaming Becomes Standard

Real-time data processing moved from specialized to expected:

import json
from datetime import datetime

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# storage_conn_str and eventhub_conn_str are assumed to be loaded from configuration
checkpoint_store = BlobCheckpointStore.from_connection_string(
    storage_conn_str,
    container_name="checkpoints"
)

def process_events(partition_context, events):
    for event in events:
        data = json.loads(event.body_as_str())

        # Real-time transformation
        enriched_data = {
            **data,
            "processed_at": datetime.utcnow().isoformat(),
            "partition_id": partition_context.partition_id
        }

        # Write to downstream systems (write_to_delta_lake is a helper defined elsewhere)
        write_to_delta_lake(enriched_data)

    partition_context.update_checkpoint()

client = EventHubConsumerClient.from_connection_string(
    conn_str=eventhub_conn_str,
    consumer_group="$Default",
    eventhub_name="events",
    checkpoint_store=checkpoint_store
)

with client:
    client.receive(on_event=process_events, starting_position="-1")

Infrastructure as Code for Data

Data infrastructure became code-first:

# Terraform for Azure Synapse workspace
resource "azurerm_synapse_workspace" "synapse" {
  name                                 = "synapse-analytics-prod"
  resource_group_name                  = azurerm_resource_group.data.name
  location                             = azurerm_resource_group.data.location
  storage_data_lake_gen2_filesystem_id = azurerm_storage_data_lake_gen2_filesystem.datalake.id
  sql_administrator_login              = "sqladminuser"
  sql_administrator_login_password     = var.sql_admin_password

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

resource "azurerm_synapse_spark_pool" "spark" {
  name                 = "sparkpool"
  synapse_workspace_id = azurerm_synapse_workspace.synapse.id
  node_size_family     = "MemoryOptimized"
  node_size            = "Medium"
  node_count           = 3

  auto_pause {
    delay_in_minutes = 15
  }

  library_requirement {
    content  = file("requirements.txt")
    filename = "requirements.txt"
  }
}

Key Takeaways from 2021

  1. Data Quality is Non-Negotiable: Tools like Great Expectations and dbt tests became standard (see the sketch after this list)
  2. Governance Built-In: Not an afterthought but a core requirement
  3. Self-Service with Guardrails: Enable teams while maintaining standards
  4. Cost Awareness: FinOps for data platforms became essential
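
To make the first point concrete, here is a minimal sketch of the kind of check that became routine, using the pandas-backed Great Expectations API as it stood in 2021; orders_df is a made-up stand-in for a real batch of orders:

# Minimal data quality check with Great Expectations (pandas-backed API)
import great_expectations as ge
import pandas as pd

# Hypothetical batch of orders
orders_df = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "order_amount": [120.0, 45.5, 89.9],
})

ge_orders = ge.from_pandas(orders_df)

# Completeness and sanity checks
ge_orders.expect_column_values_to_not_be_null("order_id")
ge_orders.expect_column_values_to_be_unique("order_id")
ge_orders.expect_column_values_to_be_between("order_amount", min_value=0)

results = ge_orders.validate()
print(results.success)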

What’s Coming in 2022

  • Data Mesh adoption accelerating
  • More sophisticated data observability
  • Increased focus on data products
  • ML feature stores going mainstream

Data engineering in 2021 matured from a support function to a strategic capability. The tools improved, patterns solidified, and the role gained the recognition it deserves.

Michael John Pena

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.