Data Engineering Trends That Defined 2021
Data engineering has evolved dramatically in 2021. The role has expanded beyond ETL pipelines to encompass data quality, governance, and platform engineering. Let’s explore the trends that shaped the field this year.
The Rise of the Modern Data Stack
The modern data stack became mainstream, characterized by:
- Cloud-native data warehouses
- ELT over ETL
- Declarative transformations
- Automated data quality
-- dbt model for incremental processing - a 2021 staple
-- models/orders_daily.sql
{{
    config(
        materialized='incremental',
        unique_key='order_date',
        partition_by={'field': 'order_date', 'data_type': 'date'}
    )
}}

SELECT
    DATE(order_timestamp) AS order_date,
    COUNT(*) AS order_count,
    SUM(order_amount) AS total_revenue,
    AVG(order_amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
-- Reprocess from the latest loaded date; the unique_key merge keeps partitions deduplicated
WHERE DATE(order_timestamp) >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY DATE(order_timestamp)
Delta Lake and Lakehouse Architecture
The lakehouse pattern gained serious traction, combining data lake flexibility with warehouse reliability:
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakehouse") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# MERGE operation - ACID transactions on your data lake
# updates_df is a DataFrame of incoming customer changes (e.g. from a staging load)
delta_table = DeltaTable.forPath(spark, "/mnt/delta/customers")

delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    set={
        "email": "source.email",
        "last_updated": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "customer_id": "source.customer_id",
        "email": "source.email",
        "created_at": "current_timestamp()",
        "last_updated": "current_timestamp()"
    }
).execute()
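Because every write goes through the Delta transaction log, the same table also gets time travel and an audit trail, which is a big part of the "warehouse reliability" claim. A quick sketch, reusing the spark session, delta_table, and path from the block above:

# Read the customers table as it existed at an earlier version
previous = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/mnt/delta/customers")

# Inspect recent commits: which operation ran, and when
delta_table.history(10).select("version", "timestamp", "operation").show()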
Data Contracts Emerge
Data contracts became a hot topic for managing producer-consumer relationships:
# data_contract.yaml
apiVersion: datacontract/v1
kind: DataContract
metadata:
  name: customer-orders
  version: 1.2.0
  owner: data-platform-team
spec:
  schema:
    type: object
    required:
      - customer_id
    properties:
      order_id:
        type: string
        format: uuid
        description: Unique order identifier
      customer_id:
        type: string
      order_date:
        type: string
        format: date-time
      items:
        type: array
        items:
          type: object
          properties:
            sku: { type: string }
            quantity: { type: integer, minimum: 1 }
  quality:
    completeness:
      order_id: 100%
      customer_id: 100%
    freshness: 1 hour
  sla:
    availability: 99.9%
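A nice side effect of keeping the schema section JSON-Schema-shaped is that it can be enforced automatically on either side of the contract. A rough consumer-side sketch using PyYAML and the jsonschema library (the sample payload is made up):

import yaml
from jsonschema import validate, ValidationError

# Load the contract and pull out the schema it promises
with open("data_contract.yaml") as f:
    contract = yaml.safe_load(f)
schema = contract["spec"]["schema"]

sample_order = {
    "order_id": "9f0c2d2e-6a1b-4f3a-9c1d-0b7e5a8d4c21",  # made-up example values
    "customer_id": "cust-42",
    "order_date": "2021-12-01T10:15:00Z",
    "items": [{"sku": "SKU-001", "quantity": 2}],
}

try:
    validate(instance=sample_order, schema=schema)
    print("payload conforms to the contract")
except ValidationError as err:
    print(f"contract violation: {err.message}")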
Streaming Becomes Standard
Real-time data processing moved from specialized to expected:
from datetime import datetime
import json

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# storage_conn_str and eventhub_conn_str are assumed to come from configuration
checkpoint_store = BlobCheckpointStore.from_connection_string(
    storage_conn_str,
    container_name="checkpoints"
)

def process_events(partition_context, events):
    for event in events:
        data = json.loads(event.body_as_str())
        # Real-time transformation
        enriched_data = {
            **data,
            "processed_at": datetime.utcnow().isoformat(),
            "partition_id": partition_context.partition_id
        }
        # Write to downstream systems
        write_to_delta_lake(enriched_data)
    partition_context.update_checkpoint()

client = EventHubConsumerClient.from_connection_string(
    conn_str=eventhub_conn_str,
    consumer_group="$Default",
    eventhub_name="events",
    checkpoint_store=checkpoint_store
)

with client:
    # receive_batch hands the callback a list of events, matching process_events above
    client.receive_batch(on_event_batch=process_events, starting_position="-1")
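On the producer side, the same SDK keeps things symmetric. A minimal sketch, reusing the eventhub_conn_str and hub name assumed above:

import json

from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str=eventhub_conn_str,
    eventhub_name="events"
)

with producer:
    # Batch events before sending to stay within the hub's size limits
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"order_id": "o-123", "order_amount": 42.0})))
    producer.send_batch(batch)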
Infrastructure as Code for Data
Data infrastructure became code-first:
# Terraform for Azure Synapse workspace
resource "azurerm_synapse_workspace" "synapse" {
  name                                 = "synapse-analytics-prod"
  resource_group_name                  = azurerm_resource_group.data.name
  location                             = azurerm_resource_group.data.location
  storage_data_lake_gen2_filesystem_id = azurerm_storage_data_lake_gen2_filesystem.datalake.id
  sql_administrator_login              = "sqladminuser"
  sql_administrator_login_password     = var.sql_admin_password

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

resource "azurerm_synapse_spark_pool" "spark" {
  name                 = "sparkpool"
  synapse_workspace_id = azurerm_synapse_workspace.synapse.id
  node_size_family     = "MemoryOptimized"
  node_size            = "Medium"
  node_count           = 3

  auto_pause {
    delay_in_minutes = 15
  }

  library_requirement {
    content  = file("requirements.txt")
    filename = "requirements.txt"
  }
}
Key Takeaways from 2021
- Data Quality is Non-Negotiable: Tools like Great Expectations and dbt tests became standard
- Governance Built-In: Not an afterthought but a core requirement
- Self-Service with Guardrails: Enable teams while maintaining standards
- Cost Awareness: FinOps for data platforms became essential
What’s Coming in 2022
- Data Mesh adoption accelerating
- More sophisticated data observability
- Increased focus on data products
- ML feature stores going mainstream
Data engineering in 2021 matured from a support function to a strategic capability. The tools improved, patterns solidified, and the role gained the recognition it deserves.