
Azure Synapse Spark Pools (Preview): Big Data Processing

Spark on Synapse is the same managed Spark offering most cloud providers have now, but it is built from the parts of Azure I already use: the same workspace as the SQL pools and pipelines, the same security model, the same ADLS Gen2 storage. For teams that want notebooks and Spark without standing up Databricks separately, the Synapse path keeps everything under one bill and one IAM model.

Note: Synapse is in public preview. Features and APIs may change before GA.

Creating a Spark Pool

resource "azurerm_synapse_spark_pool" "main" {
  name                 = "sparkpool"
  synapse_workspace_id = azurerm_synapse_workspace.main.id
  node_size_family     = "MemoryOptimized"
  node_size            = "Medium"

  auto_scale {
    min_node_count = 3
    max_node_count = 10
  }

  auto_pause {
    delay_in_minutes = 15
  }

  library_requirement {
    content  = file("requirements.txt")
    filename = "requirements.txt"
  }
}

Notebook Development

# Read from Data Lake
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/*.csv")

# Transform
from pyspark.sql.functions import col, year, month, sum as spark_sum

monthly_sales = df \
    .withColumn("year", year(col("sale_date"))) \
    .withColumn("month", month(col("sale_date"))) \
    .groupBy("year", "month", "region") \
    .agg(spark_sum("amount").alias("total_sales"))

# Write as Delta
monthly_sales.write \
    .format("delta") \
    .mode("overwrite") \
    .save("abfss://curated@mydatalake.dfs.core.windows.net/monthly_sales/")
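Both paths in the notebook follow the ADLS Gen2 URI pattern `abfss://<container>@<account>.dfs.core.windows.net/<path>`. A minimal sketch of a helper that builds them in one place, so the account name is not repeated across cells. The function name `abfss_uri` and the constants are illustrative, not part of any Synapse API.

```python
def abfss_uri(container, account, path=""):
    """Build an ADLS Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<path>.

    Hypothetical helper for illustration; it only formats a string.
    """
    if not container or not account:
        raise ValueError("container and account are required")
    # Strip a leading slash so callers can pass "sales/" or "/sales/" interchangeably.
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# The two locations used in the notebook above.
RAW_SALES = abfss_uri("raw", "mydatalake", "sales/*.csv")
CURATED_MONTHLY = abfss_uri("curated", "mydatalake", "monthly_sales/")
```

In a notebook these constants would be passed straight to `spark.read...csv(RAW_SALES)` and `.save(CURATED_MONTHLY)`.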

Shared Metadata

Tables created in Spark are visible in SQL:

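# In Spark notebook (the "curated" database must exist before saveAsTable)
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
monthly_sales.write.saveAsTable("curated.monthly_sales")

-- In SQL pool or serverless: the Spark database shows up as a SQL database,
-- and its tables sit under the dbo schema
SELECT * FROM curated.dbo.monthly_sales WHERE year = 2020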

Optimizations

# Partition for query performance
df.write \
    .partitionBy("year", "month") \
    .format("delta") \
    .save("/path/to/table")

# Broadcast small tables
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Cache frequently used DataFrames
# (cache() is lazy: the data is materialized on the first action that reads it)
df.cache()
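To see why `partitionBy("year", "month")` helps queries, it is useful to picture the directory layout it produces. The sketch below is plain Python, not Spark: it groups a few made-up rows into Hive-style `column=value` directories, which is the layout Spark writes by default. The rows and function names are assumptions for illustration only.

```python
from collections import defaultdict


def partition_dir(row, partition_cols):
    """Return the Hive-style relative directory for a row, e.g. 'year=2020/month=1'."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)


def group_by_partition(rows, partition_cols):
    """Group rows by the directory they would land in when partitioned."""
    layout = defaultdict(list)
    for row in rows:
        layout[partition_dir(row, partition_cols)].append(row)
    return dict(layout)


# Hypothetical sample rows, not real sales data.
rows = [
    {"year": 2020, "month": 1, "region": "APAC", "amount": 100.0},
    {"year": 2020, "month": 1, "region": "EMEA", "amount": 50.0},
    {"year": 2020, "month": 2, "region": "APAC", "amount": 75.0},
]

layout = group_by_partition(rows, ["year", "month"])
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

A query that filters on `year` and `month` only needs to read the matching directories and can skip the rest, which is where the speedup comes from.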

Cost Control

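  • Auto-pause: The pool pauses after the configured idle delay (delay_in_minutes, 15 in the Terraform above)
  • Auto-scale: The pool shrinks toward min_node_count when load is low and grows up to max_node_count when it is not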
  • Spot instances: Coming soon for cost savings
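A rough way to reason about what auto-pause and auto-scale save is to count node-hours. The sketch below compares a pool pinned at 10 nodes all day with one that scales between 3 and 10 nodes and is paused the rest of the time. The price and the usage pattern are made-up numbers, and the model ignores the idle delay before the pool actually pauses.

```python
def node_hours(segments):
    """Sum node-hours over (hours, node_count) segments. Paused time is simply left out."""
    return sum(hours * nodes for hours, nodes in segments)


# Hypothetical rate; look up the real price for your node size and region.
PRICE_PER_NODE_HOUR = 1.00

# Pool pinned at max_node_count for a full day.
always_on = [(24, 10)]

# Assumed day: 2 busy hours at 10 nodes, 6 quiet hours at 3 nodes, paused for the other 16.
scaled = [(2, 10), (6, 3)]

always_on_cost = node_hours(always_on) * PRICE_PER_NODE_HOUR
scaled_cost = node_hours(scaled) * PRICE_PER_NODE_HOUR
savings = 1 - scaled_cost / always_on_cost

print(f"always on: {node_hours(always_on)} node-hours")
print(f"scaled + paused: {node_hours(scaled)} node-hours")
print(f"savings: {savings:.0%}")
```

With these assumed numbers the always-on pool uses 240 node-hours and the scaled pool 38. The real ratio depends entirely on how bursty the workload is.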

Synapse Spark brings enterprise Spark without the ops burden.

Michael John Peña


Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.