Fabric Notebooks: Features and Productivity Tips
Fabric notebooks are more than just a Spark interface: they come with powerful features that boost data engineering productivity. Today we’ll explore these features in depth.
Notebook Basics
# Fabric notebooks support multiple languages
# Default: PySpark
# Switch language per cell:
# %%sql    - Spark SQL
# %%spark  - Scala
# %%sparkr - R (SparkR)
# Markdown cells are created with the cell type toggle rather than a magic
Multi-Language Support
PySpark (Default)
# PySpark cell (default)
from pyspark.sql import functions as F

df = spark.read.table("sales")
result = df.groupBy("category").agg(F.sum("amount").alias("total"))
result.show()
SQL Magic
%%sql
-- SQL cell
SELECT
category,
SUM(amount) as total,
COUNT(*) as count
FROM sales
GROUP BY category
ORDER BY total DESC
Scala
%%spark
// Scala cell
import org.apache.spark.sql.functions.sum

val df = spark.read.table("sales")
val result = df.groupBy("category")
  .agg(sum("amount").alias("total"))
result.show()
R
%%sparkr
# R cell
library(SparkR)
df <- sql("SELECT * FROM sales")
summary <- summarize(groupBy(df, "category"),
total = sum(df$amount))
head(summary)
Lakehouse Integration
# When attached to a Lakehouse, you get direct access:
# Access Tables
df = spark.read.table("customers")
# Access Files
df = spark.read.csv("Files/raw/data.csv", header=True)
# Write to Tables
df.write.format("delta").saveAsTable("processed_customers")
# Write to Files
df.write.parquet("Files/output/data.parquet")
# The Lakehouse appears in the left panel for easy browsing
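If the data you need lives in a lakehouse that is not attached to the notebook, you can usually reach it through its full OneLake path instead. A minimal sketch, with a hypothetical workspace (MyWorkspace) and lakehouse (MyLakehouse):
# The abfss URI below follows the OneLake path format; to be safe, copy the
# exact ABFS path from the file's properties in the lakehouse explorer
path = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/raw/data.csv"
df_other = spark.read.csv(path, header=True)
df_other.show(5)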
Built-in Visualizations
# Fabric notebooks include built-in visualizations
# After displaying a DataFrame, click the chart icon
df = spark.sql("""
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as total
FROM sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month
""")
# Display with visualization options
display(df)
# Or use displayHTML for custom visualizations
displayHTML("""
<div style="font-size: 24px; color: green;">
Custom HTML content here
</div>
""")
Parameterization
# Create parameters for notebook orchestration
# Mark this cell as a parameter cell (cell "..." menu > Toggle parameter cell)
# Define default parameter values
start_date = "2023-01-01"
end_date = "2023-12-31"
input_table = "sales"
output_table = "sales_processed"
# Use parameters in your code
df = spark.sql(f"""
SELECT * FROM {input_table}
WHERE order_date BETWEEN '{start_date}' AND '{end_date}'
""")
df.write.format("delta").mode("overwrite").saveAsTable(output_table)
# When calling from a pipeline, parameters are passed:
# Notebook activity > Base parameters > Add parameters
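Because parameter values are injected from outside the notebook, it pays to validate them before they reach your queries. A minimal sketch using the parameters above (parse_iso_date is a hypothetical helper, not part of Fabric):
from datetime import datetime

def parse_iso_date(value: str):
    """Fail fast if a parameter is not a valid YYYY-MM-DD string."""
    return datetime.strptime(value, "%Y-%m-%d").date()

start = parse_iso_date(start_date)
end = parse_iso_date(end_date)
assert start <= end, "start_date must not be after end_date"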
Notebook Scheduling
# Schedule notebooks directly from the UI
# 1. Click "Schedule" in the toolbar
# 2. Configure schedule:
# (Illustrative only; these options are configured in the schedule pane, not in code)
schedule_config = {
    "enabled": True,
    "frequency": "Daily",        # or Hourly, Weekly, etc.
    "time": "06:00",
    "timezone": "UTC",
    "start_date": "2023-07-01",
    "end_date": None,            # None for no end
    "retry": {
        "count": 3,
        "interval_minutes": 5
    }
}
# 3. Monitor runs in the "Run history" tab
Running Child Notebooks
# Call another notebook; it runs in the calling notebook's Spark session
result = mssparkutils.notebook.run(
    "child_notebook",                           # notebook name
    300,                                        # timeout in seconds
    {"param1": "value1", "param2": "value2"}    # parameters
)
print(f"Child notebook returned: {result}")
# In the child notebook, the passed values override the defaults
# declared in its parameters cell (param1, param2)
# Return a value from the child notebook
mssparkutils.notebook.exit("Success: processed 1000 rows")
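A common orchestration pattern is to chain several child notebooks from one parent and collect their exit values. A minimal sketch (the notebook names are hypothetical):
# Run child notebooks in sequence and collect their exit values
results = {}
for name in ["load_bronze", "build_silver", "build_gold"]:
    try:
        results[name] = mssparkutils.notebook.run(name, 600)
    except Exception as e:
        results[name] = f"FAILED: {e}"
        break  # stop the chain on the first failure
print(results)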
Notebook Utilities
# MSSparkUtils - Fabric's utility library
# File system operations
mssparkutils.fs.ls("Files/")
mssparkutils.fs.cp("source", "destination")
mssparkutils.fs.rm("path", True)   # second argument: delete recursively
# Credentials (audience keys such as "storage" or "keyvault")
token = mssparkutils.credentials.getToken("storage")
# Environment info
mssparkutils.env.getWorkspaceName()
mssparkutils.env.getUserName()
# Notebook utilities (notebooks are referenced by name)
mssparkutils.notebook.run("other_notebook", 300)
mssparkutils.notebook.exit("value")
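The file system helpers combine nicely for housekeeping tasks, such as sweeping processed files into an archive folder. A minimal sketch, assuming a default lakehouse is attached (the folder names are hypothetical):
# Move every CSV from Files/raw/ into Files/archive/
for f in mssparkutils.fs.ls("Files/raw/"):
    if f.name.endswith(".csv"):
        # Third argument asks mv to create the destination folder if it is missing
        mssparkutils.fs.mv(f.path, f"Files/archive/{f.name}", True)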
Session Management
# Session configuration
# Fabric uses starter pools for fast startup
# View session info
print(f"Spark Version: {spark.version}")
print(f"Application ID: {spark.sparkContext.applicationId}")
# Session variables persist across cells
my_global_var = "This persists across cells"
# Stop session explicitly (if needed)
# spark.stop() # Use carefully!
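You can also inspect and tweak session-level Spark settings at runtime; here is a small example with the shuffle partition count (the value 200 is arbitrary):
# Read and adjust a setting for the current session only
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "200")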
Collaboration Features
# 1. Comments
# Add comments to cells using the comment icon
# Team members can reply and resolve
# 2. Version history
# View all changes: File > Version history
# Restore previous versions if needed
# 3. Export options
# Export as: .ipynb, .py, HTML
# 4. Git integration
# Connect workspace to Git repo
# Sync notebooks with source control
Performance Optimization
# 1. Use display() for large DataFrames
# It automatically limits output
display(large_df) # Shows 1000 rows max
# 2. Avoid collecting large DataFrames to driver
# Bad:
# data = df.collect() # Brings all data to driver
# Good:
# df.limit(100).collect() # Collect small samples
# 3. Cache intermediate results
df = spark.read.table("large_table")
df.cache()
df.count() # Triggers caching
# Use cached DataFrame multiple times
summary1 = df.groupBy("col1").count()
summary2 = df.groupBy("col2").count()
df.unpersist() # Release cache when done
# 4. Monitor execution
# Click on the Spark jobs link to see execution details
Error Handling Patterns
# Wrap cells in try-except for production notebooks
from pyspark.sql.utils import AnalysisException
try:
    # Main processing logic
    df = spark.read.table("input_table")
    processed = df.transform(my_transform_function)   # transform defined elsewhere
    processed.write.saveAsTable("output_table")
    mssparkutils.notebook.exit("SUCCESS")
except AnalysisException as e:
    print(f"Data error: {e}")
    mssparkutils.notebook.exit(f"FAILED: {str(e)}")
except Exception as e:
    print(f"Unexpected error: {e}")
    raise  # Re-raise for the pipeline to catch
Notebook Best Practices
# 1. Structure your notebook
"""
# Notebook Template
## 1. Configuration (tagged as "parameters")
## 2. Imports and Setup
## 3. Data Loading
## 4. Transformations
## 5. Validation
## 6. Output
## 7. Cleanup
"""
# 2. Document with markdown cells
"""
## Data Loading
This section loads data from the bronze layer
and applies initial filtering.
**Input**: bronze_sales table
**Output**: Filtered DataFrame with valid records
"""
# 3. Modular code with functions
from pyspark.sql.functions import current_timestamp

def clean_data(df):
    """Remove nulls and duplicates."""
    return df.dropna().dropDuplicates()

def enrich_data(df):
    """Add computed columns."""
    return df.withColumn("processed_at", current_timestamp())
# 4. Use assertions for data quality
assert df.count() > 0, "Input DataFrame is empty"
assert "required_column" in df.columns, "Missing required column"
Tomorrow we’ll explore data wrangling techniques in Fabric notebooks.