October 1, 2020

Azure Synapse Spark Pools (Preview): Big Data Processing

Azure Synapse Analytics is currently in preview, with GA expected later this year. Its Spark pools feature brings managed Apache Spark into the unified analytics platform: you define a pool and write code, and Synapse handles provisioning, scaling, and pausing the nodes.

Note: Synapse is in public preview. Features and APIs may change before GA.

Creating a Spark Pool

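# Assumes azurerm_synapse_workspace.main is defined elsewhere in the configuration
resource "azurerm_synapse_spark_pool" "main" {
  name                 = "sparkpool"
  synapse_workspace_id = azurerm_synapse_workspace.main.id
  node_size_family     = "MemoryOptimized"
  node_size            = "Medium"

  # Node count floats between these bounds based on load
  auto_scale {
    min_node_count = 3
    max_node_count = 10
  }

  # Pause the pool after 15 idle minutes
  auto_pause {
    delay_in_minutes = 15
  }

  # Python packages installed on the pool's nodes
  library_requirement {
    content  = file("requirements.txt")
    filename = "requirements.txt"
  }
}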

Notebook Development

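from pyspark.sql.functions import col, year, month, sum as spark_sum

# Read from Data Lake (inferSchema makes an extra pass over the files to detect column types)
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/*.csv")

# Transform: derive year and month from sale_date, then total the amount per region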

monthly_sales = df \
    .withColumn("year", year(col("sale_date"))) \
    .withColumn("month", month(col("sale_date"))) \
    .groupBy("year", "month", "region") \
    .agg(spark_sum("amount").alias("total_sales"))

# Write as Delta
monthly_sales.write \
    .format("delta") \
    .mode("overwrite") \
    .save("abfss://curated@mydatalake.dfs.core.windows.net/monthly_sales/")
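
To make the transform step concrete, here is a minimal plain-Python sketch of what the groupBy and sum above compute. It runs without Spark. The sample rows and the monthly_totals helper are hypothetical, invented for illustration; only the column names (sale_date, region, amount) come from the notebook code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows standing in for the sales CSV (not real data).
rows = [
    {"sale_date": date(2020, 9, 14), "region": "EMEA", "amount": 120.0},
    {"sale_date": date(2020, 9, 30), "region": "EMEA", "amount": 80.0},
    {"sale_date": date(2020, 9, 2), "region": "APAC", "amount": 50.0},
    {"sale_date": date(2020, 10, 1), "region": "EMEA", "amount": 200.0},
]


def monthly_totals(rows):
    """Sum amount per (year, month, region), mirroring groupBy(...).agg(sum(...))."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["sale_date"].year, row["sale_date"].month, row["region"])
        totals[key] += row["amount"]
    return dict(totals)


totals = monthly_totals(rows)
for (year, month, region), total in sorted(totals.items()):
    print(year, month, region, total)
```

Each distinct (year, month, region) combination becomes one output row, which is why the curated table is much smaller than the raw sales data.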

Shared Metadata

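Parquet-backed tables created in Spark are visible in SQL through the shared metadata store, with no separate import step:

# In Spark notebook (the curated database must exist before saveAsTable)
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
monthly_sales.write.saveAsTable("curated.monthly_sales")

-- In serverless SQL: the Spark database surfaces as a SQL database, with its tables under the dbo schema
SELECT * FROM curated.dbo.monthly_sales WHERE year = 2020;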

Optimizations

# Partition for query performance (year and month must already be columns of df)
df.write \
    .partitionBy("year", "month") \
    .format("delta") \
    .save("/path/to/table")

# Broadcast small tables
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Cache frequently used DataFrames (lazy: the cache is filled by the first action)
df.cache()
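
Why partitioning helps is easier to see from the directory layout. The sketch below is plain Python, not Spark: it builds the Hive-style directory names that partitionBy("year", "month") produces and shows how a filter on year narrows the set of directories to read. The partition_path and prune helpers are hypothetical names for illustration, and real pruning happens inside the Spark planner rather than through string matching.

```python
def partition_path(base, year, month):
    """Build the Hive-style directory for one (year, month) partition."""
    return f"{base.rstrip('/')}/year={year}/month={month}"


def prune(paths, year):
    """Keep only the directories a filter on year would need to read."""
    return [p for p in paths if f"/year={year}/" in p]


base = "/path/to/table"
paths = [partition_path(base, y, m) for y, m in [(2019, 12), (2020, 1), (2020, 2)]]

# A query filtered to 2020 can skip the 2019 directory entirely.
print(prune(paths, 2020))
```

The benefit depends on queries actually filtering on the partition columns; partitioning on a high-cardinality column tends to produce many small files instead.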

Cost Control

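Auto-pause: The pool pauses after the configured idle delay (delay_in_minutes in the Terraform above), so idle time stops consuming compute
Auto-scale: The node count shrinks toward min_node_count during low usage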
Spot instances: Coming soon for cost savings
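
As a rough way to reason about these settings, the sketch below estimates node-minutes for a day of work under a deliberately simplified model. It assumes the pool idles at the minimum node count for the full auto-pause delay after each session, then pauses. The billable_node_minutes function and the session figures are hypothetical, and no pricing is included; actual billing may differ.

```python
def billable_node_minutes(sessions, idle_delay_minutes, min_nodes):
    """Estimate node-minutes under a simplified auto-pause model.

    sessions: list of (active_minutes, average_node_count) tuples.
    Assumption: after each session the pool idles at min_nodes for the
    full idle delay, then pauses and consumes nothing.
    """
    total = 0
    for active_minutes, nodes in sessions:
        total += active_minutes * nodes          # time spent running jobs
        total += idle_delay_minutes * min_nodes  # idle tail before the pause
    return total


# Hypothetical day: a 60-minute session on 3 nodes and a 30-minute burst on 10.
sessions = [(60, 3), (30, 10)]
with_pause = billable_node_minutes(sessions, idle_delay_minutes=15, min_nodes=3)

# Baseline for comparison: a pool pinned at 3 nodes for 24 hours.
always_on = 24 * 60 * 3

print(with_pause / 60, "node-hours with auto-pause")
print(always_on / 60, "node-hours always on")
```

Shortening the idle delay trims the tail after each session, at the price of more frequent cold starts.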

Synapse Spark brings enterprise Spark without the ops burden.