
Delta Lake Optimization: VACUUM, OPTIMIZE, and Z-ORDER Explained

Delta Lake provides powerful table maintenance operations that keep query performance high and storage costs under control. Understanding VACUUM, OPTIMIZE, and Z-ORDER is essential for production data engineering.

The VACUUM Operation

VACUUM removes data files no longer referenced by the Delta log. Without regular vacuuming, your storage costs grow unbounded as old file versions accumulate. Note that vacuuming also limits time travel: versions older than the retention window can no longer be queried.

from delta.tables import DeltaTable

# Get a reference to your Delta table
delta_table = DeltaTable.forPath(spark, "/lakehouse/Tables/sales_data")

# Remove files older than 7 days (168 hours, the default retention)
delta_table.vacuum(retentionHours=168)

# For aggressive cleanup (use with caution: this breaks time travel to
# older versions and can fail reads from concurrent long-running queries).
# It requires disabling the safety check first.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table.vacuum(retentionHours=0)

# Re-enable the safety check afterwards
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
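The retention window is plain wall-clock arithmetic against each file's removal timestamp in the Delta log. A minimal pure-Python sketch of the cutoff computation only (an illustration, not Delta's actual implementation):

```python
from datetime import datetime, timedelta, timezone

def vacuum_cutoff(retention_hours, now=None):
    """Return the timestamp before which unreferenced files become deletable."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=retention_hours)

# With the 7-day default, only files removed from the log more than
# 168 hours ago are eligible for physical deletion
now = datetime(2025, 8, 8, tzinfo=timezone.utc)
print(vacuum_cutoff(168, now))  # 2025-08-01 00:00:00+00:00
```

This is why `retentionHours=0` is dangerous: the cutoff becomes "right now", so a file a concurrent query is still reading can be deleted out from under it.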

The OPTIMIZE Operation

OPTIMIZE compacts small files into larger ones, improving query performance by reducing file scan overhead.

# Basic optimize - compacts small files across the whole table
delta_table.optimize().executeCompaction()

# Optimize with a predicate - only compact specific partitions
# (where() accepts filters on partition columns only)
delta_table.optimize() \
    .where("date >= '2025-08-01'") \
    .executeCompaction()

# Using SQL
spark.sql("""
    OPTIMIZE sales_data
    WHERE region = 'APAC'
""")

Z-ORDER for Query Performance

Z-ORDER clusters data by specified columns, dramatically improving query performance when filtering on those columns.

# Optimize with Z-ORDER
delta_table.optimize() \
    .where("date >= '2025-08-01'") \
    .executeZOrderBy("customer_id", "product_category")

# SQL equivalent
spark.sql("""
    OPTIMIZE sales_data
    ZORDER BY (customer_id, product_category)
""")

Best Practices

Schedule OPTIMIZE jobs during off-peak hours, as they can be resource-intensive. Choose Z-ORDER columns based on your most common filter patterns. Run VACUUM after OPTIMIZE so that, once the retention window passes, the superseded small files are physically deleted. Aim to keep file sizes between 256MB and 1GB for optimal query performance.
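One way to act on the 256MB-1GB guidance is to check file sizes before paying for an OPTIMIZE run. A minimal sketch of that decision logic (pure Python; in practice the `file_sizes` input would come from the table's file metadata, e.g. via DESCRIBE DETAIL, and the 50% threshold is an assumption to tune):

```python
TARGET_MIN = 256 * 1024 * 1024  # 256 MB lower bound from the guidance above

def needs_compaction(file_sizes, small_file_ratio=0.5):
    """Recommend OPTIMIZE when more than small_file_ratio of files are undersized."""
    if not file_sizes:
        return False
    small = sum(1 for size in file_sizes if size < TARGET_MIN)
    return small / len(file_sizes) > small_file_ratio

mb = 1024 * 1024
print(needs_compaction([8 * mb, 16 * mb, 700 * mb]))  # True: 2 of 3 files are small
print(needs_compaction([400 * mb, 900 * mb]))         # False: already well-sized
```

Gating the job this way keeps scheduled maintenance from burning compute on tables that are already well compacted.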

Regular maintenance is the difference between a sluggish data lake and a performant Lakehouse.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.