Data Engineering Trends That Defined 2021
This post looks back at the data engineering trends that defined 2021, with a production-minded example for each.
The Rise of the Modern Data Stack
The modern data stack became mainstream, characterized by:
- Cloud-native data warehouses
- ELT over ETL
- Declarative transformations
- Automated data quality
-- dbt model for incremental processing: a 2021 staple
-- models/orders_daily.sql
{{
    config(
        materialized='incremental',
        unique_key='order_date',
        partition_by={'field': 'order_date', 'data_type': 'date'}
    )
}}

SELECT
    DATE(order_timestamp) AS order_date,
    COUNT(*) AS order_count,
    SUM(order_amount) AS total_revenue,
    AVG(order_amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
-- Reprocess the latest loaded day in full so late-arriving orders are counted;
-- unique_key='order_date' replaces that day's existing row.
WHERE DATE(order_timestamp) >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY DATE(order_timestamp)
Delta Lake and Lakehouse Architecture
The lakehouse pattern gained serious traction, combining data lake flexibility with warehouse reliability:
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta Lake's setup for open-source Spark sets both the SQL extension and the catalog
spark = (
    SparkSession.builder
    .appName("DeltaLakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# updates_df is assumed to be an existing DataFrame of changed customers
# with at least customer_id and email columns.

# MERGE operation - ACID transactions on your data lake
delta_table = DeltaTable.forPath(spark, "/mnt/delta/customers")

delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    set={
        "email": "source.email",
        "last_updated": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "customer_id": "source.customer_id",
        "email": "source.email",
        "created_at": "current_timestamp()",
        "last_updated": "current_timestamp()"
    }
).execute()
Data Contracts Emerge
Data contracts became a hot topic for managing producer-consumer relationships:
# data_contract.yaml
apiVersion: datacontract/v1
kind: DataContract
metadata:
  name: customer-orders
  version: 1.2.0
  owner: data-platform-team
spec:
  schema:
    type: object
    properties:
      order_id:
        type: string
        format: uuid
        description: Unique order identifier
      customer_id:
        type: string
        required: true
      order_date:
        type: string
        format: date-time
      items:
        type: array
        items:
          type: object
          properties:
            sku: { type: string }
            quantity: { type: integer, minimum: 1 }
  quality:
    completeness:
      order_id: 100%
      customer_id: 100%
    freshness: 1 hour
  sla:
    availability: 99.9%
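A contract only helps if something checks it. As a rough sketch of how the quality rules above could be enforced on a batch, the function below tests completeness and freshness using only the standard library. The function name, the record layout, and the choice to measure freshness against the newest order_date are assumptions made for illustration; the contract format itself does not define how these checks run.

```python
from datetime import datetime, timedelta, timezone


def check_contract(records, required_fields, max_age, now=None):
    """Return a list of human-readable violations (empty if the batch passes).

    records: list of dicts; "order_date" is an ISO 8601 string with a UTC offset
    required_fields: fields that must be present and non-empty in every record
    max_age: timedelta; the newest record must be no older than this
    """
    now = now or datetime.now(timezone.utc)
    violations = []

    # Completeness: 100% of records must carry each required field.
    for field in required_fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        if missing:
            violations.append(f"{field}: {missing} of {len(records)} records missing")

    # Freshness (assumed definition): newest order_date must be within max_age of now.
    timestamps = [
        datetime.fromisoformat(r["order_date"]) for r in records if r.get("order_date")
    ]
    if not timestamps or now - max(timestamps) > max_age:
        violations.append(f"freshness: newest record older than {max_age}")

    return violations
```

A producer could run this before publishing and fail the pipeline on any violation, for example `check_contract(batch, ["order_id", "customer_id"], timedelta(hours=1))`.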
Streaming Becomes Standard
Real-time data processing moved from specialized to expected:
import json
from datetime import datetime, timezone

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# storage_conn_str, eventhub_conn_str, and write_to_delta_lake() are assumed
# to be defined elsewhere in the application.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    storage_conn_str,
    container_name="checkpoints"
)

def process_events(partition_context, events):
    for event in events:
        data = json.loads(event.body_as_str())
        # Real-time transformation
        enriched_data = {
            **data,
            "processed_at": datetime.now(timezone.utc).isoformat(),
            "partition_id": partition_context.partition_id
        }
        # Write to downstream systems
        write_to_delta_lake(enriched_data)
    # Checkpoint once per batch, after every event has been written
    partition_context.update_checkpoint()

client = EventHubConsumerClient.from_connection_string(
    conn_str=eventhub_conn_str,
    consumer_group="$Default",
    eventhub_name="events",
    checkpoint_store=checkpoint_store
)

with client:
    # receive_batch hands the callback a list of events;
    # starting_position="-1" reads each partition from the beginning
    client.receive_batch(on_event_batch=process_events, starting_position="-1")
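The transformation step in the handler above is easier to trust when it can be tested without a live Event Hub. One way to do that, sketched here with hypothetical function names and an injectable clock, is to pull the enrichment into pure functions that take raw JSON strings and return plain dicts. Setting aside malformed events rather than failing the batch is a design assumption, not something the SDK does for you.

```python
import json
from datetime import datetime, timezone


def enrich_event(body, partition_id, now=None):
    """Parse a JSON object from `body` and attach processing metadata."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("event body must be a JSON object")
    processed_at = now or datetime.now(timezone.utc)
    return {
        **data,
        "processed_at": processed_at.isoformat(),
        "partition_id": partition_id,
    }


def enrich_batch(bodies, partition_id, now=None):
    """Enrich every body in a batch; return (enriched, rejected) lists."""
    enriched, rejected = [], []
    for body in bodies:
        try:
            enriched.append(enrich_event(body, partition_id, now))
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            rejected.append(body)
    return enriched, rejected
```

The event handler then shrinks to reading bodies, calling `enrich_batch`, writing the enriched rows downstream, and checkpointing, while the logic that tends to break on malformed input is covered by ordinary unit tests.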
Infrastructure as Code for Data
Data infrastructure became code-first:
# Terraform for Azure Synapse workspace
resource "azurerm_synapse_workspace" "synapse" {
  name                                 = "synapse-analytics-prod"
  resource_group_name                  = azurerm_resource_group.data.name
  location                             = azurerm_resource_group.data.location
  storage_data_lake_gen2_filesystem_id = azurerm_storage_data_lake_gen2_filesystem.datalake.id
  sql_administrator_login              = "sqladminuser"
  sql_administrator_login_password     = var.sql_admin_password

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

resource "azurerm_synapse_spark_pool" "spark" {
  name                 = "sparkpool"
  synapse_workspace_id = azurerm_synapse_workspace.synapse.id
  node_size_family     = "MemoryOptimized"
  node_size            = "Medium"
  node_count           = 3

  auto_pause {
    delay_in_minutes = 15
  }

  library_requirement {
    content  = file("requirements.txt")
    filename = "requirements.txt"
  }
}
Key Takeaways from 2021
- Data Quality is Non-Negotiable: Tools like Great Expectations and dbt tests became standard
- Governance Built-In: Not an afterthought but a core requirement
- Self-Service with Guardrails: Enable teams while maintaining standards
- Cost Awareness: FinOps for data platforms became essential
What’s Coming in 2022
- Data Mesh adoption accelerating
- More sophisticated data observability
- Increased focus on data products
- ML feature stores going mainstream
Data engineering in 2021 matured from a support function to a strategic capability. The tools improved, patterns solidified, and the role gained the recognition it deserves.