
Getting Started with Delta Lake on Azure Databricks

Delta Lake brings reliability to data lakes. Since being open-sourced in 2019, it has matured significantly and is now a core component of modern data architectures on Azure Databricks.

What is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Key capabilities include:

  • ACID Transactions: Serializable isolation levels
  • Schema Enforcement: Prevents bad data from corrupting your tables (see the sketch after this list)
  • Time Travel: Query previous versions of your data
  • Unified Batch and Streaming: Same table for both workloads
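
Of these, schema enforcement is the easiest to see in action. Here is a minimal sketch (the /delta/demo_enforcement path and column names are made up for illustration): write a small table, then try to append a DataFrame with an extra column and watch Delta reject it.

from pyspark.sql.utils import AnalysisException

# Hypothetical example table with two columns: id and name
spark.createDataFrame([(1, "Alice")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save("/delta/demo_enforcement")

# This append carries an unexpected extra column, so Delta rejects the write
mismatched_df = spark.createDataFrame([(2, "Bob", "oops")], ["id", "name", "unexpected_col"])
try:
    mismatched_df.write.format("delta").mode("append").save("/delta/demo_enforcement")
except AnalysisException as err:
    print(f"Write rejected by schema enforcement: {err}")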

Creating Delta Tables

# From DataFrame
df.write.format("delta").save("/delta/events")

# Managed table
df.write.format("delta").saveAsTable("events")

# With partitioning
df.write \
    .format("delta") \
    .partitionBy("year", "month") \
    .save("/delta/events")

Reading Delta Tables

# Read latest version
df = spark.read.format("delta").load("/delta/events")

# SQL syntax
df = spark.sql("SELECT * FROM delta.`/delta/events`")

# Registered table
df = spark.sql("SELECT * FROM events")

MERGE (Upsert) Operations

One of Delta Lake’s most powerful features is the MERGE command for upserts:

from delta.tables import DeltaTable

# Load target table
target = DeltaTable.forPath(spark, "/delta/customers")

# Merge source updates
target.alias("t").merge(
    updates_df.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedUpdate(set={
    "name": "s.name",
    "email": "s.email",
    "updated_at": "current_timestamp()"
}).whenNotMatchedInsert(values={
    "customer_id": "s.customer_id",
    "name": "s.name",
    "email": "s.email",
    "created_at": "current_timestamp()",
    "updated_at": "current_timestamp()"
}).execute()

Time Travel

Query historical versions of your data:

# Read specific version
df = spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")

# Read at timestamp
df = spark.read.format("delta").option("timestampAsOf", "2020-11-16 10:00:00").load("/delta/events")

# View history
deltaTable = DeltaTable.forPath(spark, "/delta/events")
deltaTable.history().show()

-- SQL syntax
SELECT * FROM events VERSION AS OF 5;
SELECT * FROM events TIMESTAMP AS OF '2020-11-16';
DESCRIBE HISTORY events;

Schema Evolution

# Enable automatic schema merging for this write
df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/delta/events")

-- Explicit schema changes
ALTER TABLE events ADD COLUMNS (
    new_column STRING AFTER existing_column
);

Optimize Performance

Compaction

Small files hurt query performance. Use OPTIMIZE to compact them:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/delta/events")
deltaTable.optimize().executeCompaction()

-- SQL syntax
OPTIMIZE events;

Z-Ordering

Co-locate related data for faster queries:

OPTIMIZE events ZORDER BY (date, customer_id);
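
Recent Delta Lake releases also expose Z-ordering through the Python API. A minimal sketch, assuming your runtime includes DeltaTable.optimize() (Delta Lake 2.0+ or a current Databricks runtime):

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "events")

# Compact small files and cluster rows on the Z-order columns in a single pass
deltaTable.optimize().executeZOrderBy("date", "customer_id")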

Vacuum Old Files

Remove files no longer referenced by the Delta table:

# Delete files older than 7 days (default retention)
deltaTable.vacuum()

# Specify retention period in hours
deltaTable.vacuum(168)  # 7 days

-- SQL syntax
VACUUM events RETAIN 168 HOURS;

Streaming with Delta Lake

Delta Lake works seamlessly with Spark Structured Streaming:

# Stream reads
stream_df = spark.readStream \
    .format("delta") \
    .load("/delta/events")

# Stream writes
stream_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/events") \
    .start("/delta/processed_events")

Best Practices

  • Partitions: Aim for 1GB+ of data per partition; avoid over-partitioning
  • File size: Target ~1GB files using OPTIMIZE
  • Z-Order: Choose columns that appear frequently in WHERE clauses
  • Vacuum: Run weekly; keep 7+ days of history
  • Statistics: Delta collects them automatically; no manual action needed
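
To see how a table measures up against these targets, one option (a sketch, assuming the registered events table from earlier) is to check the file count and total size reported by DESCRIBE DETAIL:

# File count, total size, and partition columns for the registered "events" table
detail = spark.sql("DESCRIBE DETAIL events")
detail.select("numFiles", "sizeInBytes", "partitionColumns").show(truncate=False)

# A rough average file size helps decide whether OPTIMIZE is worth running
row = detail.select("numFiles", "sizeInBytes").first()
if row.numFiles > 0:
    print(f"Average file size: {row.sizeInBytes / row.numFiles / 1024 ** 2:.1f} MB")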

Integration with Azure

Delta Lake integrates well with the Azure ecosystem:

  • Azure Data Lake Storage Gen2: Primary storage layer (see the sketch after this list)
  • Azure Data Factory: Orchestrate Delta Lake pipelines
  • Power BI: Direct query support via Databricks connector
  • Azure Synapse (preview): Query Delta tables from Synapse
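
For ADLS Gen2 specifically, here is a minimal sketch of pointing Delta at an abfss:// path. The storage account, container, and key are placeholders, and in production a service principal or credential passthrough is preferable to a raw account key:

# Placeholder values: replace with your own storage account, container, and key
storage_account = "<storage-account>"
container = "<container>"

# Authenticate the ABFS driver with a storage account key
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    "<storage-account-key>"
)

# Delta paths on ADLS Gen2 use the abfss:// URI scheme
path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/delta/events"
df.write.format("delta").mode("append").save(path)
spark.read.format("delta").load(path).show()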

Getting Started

# Create your first Delta table
data = [
    (1, "Alice", "2020-01-01"),
    (2, "Bob", "2020-01-02"),
    (3, "Charlie", "2020-01-03")
]

df = spark.createDataFrame(data, ["id", "name", "date"])

df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("/delta/my_first_table")

# Query it
spark.read.format("delta").load("/delta/my_first_table").show()

Delta Lake brings data warehouse reliability to your data lake, enabling the modern lakehouse architecture.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.