
Fabric Notebooks: Features and Productivity Tips

Fabric notebooks are more than just a Spark interface: they come with a set of features aimed squarely at data engineering productivity. Today we'll explore those features in depth.

Notebook Basics

# Fabric notebooks support multiple languages
# Default: PySpark

# Switch language per cell with a magic command:
# %%pyspark - PySpark
# %%sql     - Spark SQL
# %%spark   - Scala
# %%sparkr  - R (SparkR)
# (Markdown content goes in dedicated Markdown cells rather than a magic)

Multi-Language Support

PySpark (Default)

# PySpark cell (default)
from pyspark.sql import functions as F

df = spark.read.table("sales")
result = df.groupBy("category").agg(F.sum("amount").alias("total"))
result.show()

SQL Magic

%%sql
-- SQL cell
SELECT
    category,
    SUM(amount) as total,
    COUNT(*) as count
FROM sales
GROUP BY category
ORDER BY total DESC

Scala

%%spark
// Scala cell
val df = spark.read.table("sales")
val result = df.groupBy("category")
    .agg(sum("amount").alias("total"))
result.show()

R

%%sparkr
# R cell
library(SparkR)

df <- sql("SELECT * FROM sales")
summary <- summarize(groupBy(df, "category"),
                     total = sum(df$amount))
head(summary)
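
Data doesn't automatically flow between language cells, but a temporary view bridges them: register the DataFrame in one language and query it by name in another. A minimal sketch (the view name category_totals is just an example):

# PySpark cell: register a temp view so other language cells can see the data
from pyspark.sql import functions as F

spark.read.table("sales") \
    .groupBy("category") \
    .agg(F.sum("amount").alias("total")) \
    .createOrReplaceTempView("category_totals")

%%sql
-- SQL cell: query the view registered above
SELECT * FROM category_totals ORDER BY total DESC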

Lakehouse Integration

# When attached to a Lakehouse, you get direct access:

# Access Tables
df = spark.read.table("customers")

# Access Files
df = spark.read.csv("Files/raw/data.csv", header=True)

# Write to Tables
df.write.format("delta").saveAsTable("processed_customers")

# Write to Files
df.write.parquet("Files/output/data.parquet")

# The Lakehouse appears in the left panel for easy browsing
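
Putting the two together, a common pattern is to land raw files under Files/ and promote them to managed Delta tables. A rough sketch, assuming a hypothetical customers.csv with a customer_id column:

# Read raw CSV landed in the Files area
raw = spark.read.csv("Files/raw/customers.csv", header=True, inferSchema=True)

# Light cleanup before promoting to a managed table
clean = raw.dropDuplicates(["customer_id"]).na.drop(subset=["customer_id"])

# Write the result as a Delta table in the attached Lakehouse
clean.write.format("delta").mode("overwrite").saveAsTable("silver_customers")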

Built-in Visualizations

# Fabric notebooks include built-in visualizations
# After displaying a DataFrame, click the chart icon

df = spark.sql("""
    SELECT
        DATE_TRUNC('month', order_date) as month,
        SUM(amount) as total
    FROM sales
    GROUP BY DATE_TRUNC('month', order_date)
    ORDER BY month
""")

# Display with visualization options
display(df)

# Or use displayHTML for custom visualizations
displayHTML("""
<div style="font-size: 24px; color: green;">
    Custom HTML content here
</div>
""")

Parameterization

# Create parameters for notebook orchestration
# Mark the cell as a parameter cell ("Toggle parameter cell" in the cell's menu)

# Define default parameter values
start_date = "2023-01-01"
end_date = "2023-12-31"
input_table = "sales"
output_table = "sales_processed"

# Use parameters in your code
df = spark.sql(f"""
    SELECT * FROM {input_table}
    WHERE order_date BETWEEN '{start_date}' AND '{end_date}'
""")

df.write.format("delta").mode("overwrite").saveAsTable(output_table)

# When calling from a pipeline, parameters are passed:
# Notebook activity > Base parameters > Add parameters
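
It's worth validating parameters at the top of the notebook so a bad value passed from a pipeline fails fast instead of silently producing an empty output. One way to do it:

# Fail fast if a pipeline passes malformed parameter values
from datetime import datetime

start = datetime.strptime(start_date, "%Y-%m-%d")
end = datetime.strptime(end_date, "%Y-%m-%d")

assert start <= end, "start_date must not be after end_date"
assert input_table and output_table, "table parameters must not be empty"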

Notebook Scheduling

# Schedule notebooks directly from the UI
# 1. Click "Schedule" in the toolbar
# 2. Configure the schedule (the dict below just mirrors the options in the UI):

schedule_config = {
    "enabled": True,
    "frequency": "Daily",  # or Hourly, Weekly, etc.
    "time": "06:00",
    "timezone": "UTC",
    "start_date": "2023-07-01",
    "end_date": None,  # None for no end
    "retry": {
        "count": 3,
        "interval_minutes": 5
    }
}

# 3. Monitor runs in the "Run history" tab

Running Child Notebooks

# Call another notebook with MSSparkUtils
result = mssparkutils.notebook.run(
    "child_notebook",    # name of the notebook to run
    300,                 # timeout in seconds
    {"param1": "value1", "param2": "value2"}
)

print(f"Child notebook returned: {result}")

# In the child notebook, the passed arguments override the defaults
# defined in its parameters cell, so use them as regular variables:
# print(param1, param2)

# Return a value from the child notebook
mssparkutils.notebook.exit("Success: processed 1000 rows")
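
A handy orchestration pattern is fanning the same child notebook out over a list of inputs and collecting the exit values. A sketch, assuming a hypothetical child notebook named process_table that accepts an input_table parameter:

# Run the same child notebook once per table and collect the exit values
tables = ["sales", "customers", "orders"]
results = {}

for table in tables:
    results[table] = mssparkutils.notebook.run(
        "process_table",            # hypothetical child notebook
        300,                        # timeout in seconds
        {"input_table": table}
    )

print(results)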

Notebook Utilities

# MSSparkUtils - Fabric's utility library

# File system operations
mssparkutils.fs.ls("Files/")
mssparkutils.fs.cp("source", "destination")
mssparkutils.fs.rm("path", recurse=True)

# Credentials
token = mssparkutils.credentials.getToken("https://storage.azure.com/")

# Environment info
mssparkutils.env.getWorkspaceName()
mssparkutils.env.getLakehouseId()

# Notebook utilities
mssparkutils.notebook.run("./other_notebook", 300)
mssparkutils.notebook.exit("value")
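
These helpers compose into small housekeeping routines, for example archiving landed files once they have been processed. A sketch (folder names are examples, and it assumes the FileInfo objects returned by ls expose path, name, and isFile):

# Move processed files from the landing folder to an archive folder
mssparkutils.fs.mkdirs("Files/archive/")

for item in mssparkutils.fs.ls("Files/raw/"):
    if item.isFile:
        mssparkutils.fs.cp(item.path, f"Files/archive/{item.name}")
        mssparkutils.fs.rm(item.path)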

Session Management

# Session configuration
# Fabric uses starter pools for fast startup

# View session info
print(f"Spark Version: {spark.version}")
print(f"Application ID: {spark.sparkContext.applicationId}")

# Session variables persist across cells
my_global_var = "This persists across cells"

# Stop session explicitly (if needed)
# spark.stop()  # Use carefully!
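
Session-level Spark settings can also be adjusted at runtime without restarting the session, which helps when a single heavy step needs different shuffle behaviour. A small sketch:

# Tune shuffle partitions for one heavy aggregation, then restore the original value
default_partitions = spark.conf.get("spark.sql.shuffle.partitions")

spark.conf.set("spark.sql.shuffle.partitions", "400")
# ... run the heavy aggregation here ...

spark.conf.set("spark.sql.shuffle.partitions", default_partitions)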

Collaboration Features

# 1. Comments
# Add comments to cells using the comment icon
# Team members can reply and resolve

# 2. Version history
# View all changes: File > Version history
# Restore previous versions if needed

# 3. Export options
# Export as: .ipynb, .py, HTML

# 4. Git integration
# Connect workspace to Git repo
# Sync notebooks with source control

Performance Optimization

# 1. Use display() for large DataFrames
# It automatically limits output

display(large_df)  # Shows 1000 rows max

# 2. Avoid collecting large DataFrames to driver
# Bad:
# data = df.collect()  # Brings all data to driver

# Good:
# df.limit(100).collect()  # Collect small samples

# 3. Cache intermediate results
df = spark.read.table("large_table")
df.cache()
df.count()  # Triggers caching

# Use cached DataFrame multiple times
summary1 = df.groupBy("col1").count()
summary2 = df.groupBy("col2").count()

df.unpersist()  # Release cache when done

# 4. Monitor execution
# Click on the Spark jobs link to see execution details
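
You can also inspect a query plan straight from the notebook before triggering a large job (the formatted mode requires Spark 3+, which Fabric runtimes use):

# Inspect the physical plan before running a large aggregation
df = spark.read.table("large_table")
plan_df = df.groupBy("col1").count()

plan_df.explain()                  # physical plan only
plan_df.explain(mode="formatted")  # more readable, detailed plan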

Error Handling Patterns

# Wrap cells in try-except for production notebooks
from pyspark.sql.utils import AnalysisException

try:
    # Main processing logic
    df = spark.read.table("input_table")
    processed = df.transform(my_transform_function)
    processed.write.saveAsTable("output_table")

    dbutils.notebook.exit("SUCCESS")

except AnalysisException as e:
    print(f"Data error: {e}")
    dbutils.notebook.exit(f"FAILED: {str(e)}")

except Exception as e:
    print(f"Unexpected error: {e}")
    raise  # Re-raise for pipeline to catch
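
For flaky upstream sources, a simple retry wrapper around the read can save an otherwise failed run. A generic sketch (the retry count and wait time are arbitrary):

import time

def read_with_retry(table_name, attempts=3, wait_seconds=30):
    """Retry a table read a few times before failing the notebook."""
    for attempt in range(1, attempts + 1):
        try:
            return spark.read.table(table_name)
        except Exception as e:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed: {e}; retrying in {wait_seconds}s")
            time.sleep(wait_seconds)

df = read_with_retry("input_table")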

Notebook Best Practices

# 1. Structure your notebook
"""
# Notebook Template

## 1. Configuration (tagged as "parameters")
## 2. Imports and Setup
## 3. Data Loading
## 4. Transformations
## 5. Validation
## 6. Output
## 7. Cleanup
"""

# 2. Document with markdown cells
"""
## Data Loading

This section loads data from the bronze layer
and applies initial filtering.

**Input**: bronze_sales table
**Output**: Filtered DataFrame with valid records
"""

# 3. Modular code with functions
from pyspark.sql import functions as F

def clean_data(df):
    """Remove nulls and duplicates"""
    return df.dropna().dropDuplicates()

def enrich_data(df):
    """Add computed columns"""
    return df.withColumn("processed_at", F.current_timestamp())

# 4. Use assertions for data quality
assert df.count() > 0, "Input DataFrame is empty"
assert "required_column" in df.columns, "Missing required column"

Tomorrow we’ll explore data wrangling techniques in Fabric notebooks.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.