5 min read
Creating Your First Fabric Lakehouse: Step-by-Step Guide
The Lakehouse is the core data storage artifact in Microsoft Fabric. Today we’ll create a Lakehouse from scratch and understand its structure, capabilities, and best practices.
What is a Fabric Lakehouse?
A Lakehouse combines the best of data lakes and data warehouses:
lakehouse_characteristics = {
    "storage": {
        "format": "Delta Lake (Parquet + transaction log)",
        "location": "OneLake",
        "structure": "Files (unstructured) + Tables (structured)"
    },
    "compute": {
        "spark": "For data engineering",
        "sql": "Via SQL endpoint (auto-generated)"
    },
    "integration": {
        "power_bi": "Direct Lake mode",
        "dataflows": "Direct output target",
        "pipelines": "Native activities"
    }
}
Creating a Lakehouse
Step 1: Navigate to Your Workspace
1. Go to app.fabric.microsoft.com
2. Select your Fabric-enabled workspace
3. Click "+ New"
4. Select "Lakehouse"
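If you prefer to script this instead of clicking through the portal, the Fabric REST API also exposes a lakehouses endpoint. Here is a minimal sketch with the requests library; the workspace GUID and Azure AD token are placeholders you would supply yourself, and the request shape reflects my reading of the public Fabric REST API rather than anything specific to this article:
import requests
# Placeholders: supply your own workspace GUID and Azure AD access token
workspace_id = "<workspace-guid>"
token = "<access-token>"
# Create a Lakehouse item in the workspace (assumed Fabric REST API endpoint)
resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/lakehouses",
    headers={"Authorization": f"Bearer {token}"},
    json={"displayName": "sales_lakehouse", "description": "Sales data for analytics team"},
)
resp.raise_for_status()
print(resp.json())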
Step 2: Configure the Lakehouse
# Lakehouse settings
lakehouse_config = {
    "name": "sales_lakehouse",  # Use lowercase and underscores
    "description": "Sales data for analytics team"
}
# Naming conventions I recommend:
# - {domain}_lakehouse (e.g., sales_lakehouse)
# - {domain}_{layer}_lh (e.g., sales_bronze_lh)
# - Keep names short but descriptive
Step 3: Understand the Structure
Once created, your Lakehouse has two main sections:
sales_lakehouse/
├── Files/              # Unstructured storage
│   ├── raw/            # Landing zone for raw files
│   ├── processed/      # Processed files
│   └── archive/        # Historical files
│
└── Tables/             # Managed Delta tables
    ├── customers/      # Delta table
    ├── orders/         # Delta table
    └── products/       # Delta table
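Both sections live in OneLake, so a notebook can address the same data either with the short relative path or with the full OneLake ABFS path, which is handy when reading from another workspace. A quick sketch; my_workspace is a placeholder and the folder follows the structure above:
# Relative path works in a notebook attached to this Lakehouse
df_attached = spark.read.option("header", "true").csv("Files/raw/sales/")
# Full OneLake ABFS path works from any Fabric notebook
# Pattern: abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/...
onelake_path = (
    "abfss://my_workspace@onelake.dfs.fabric.microsoft.com/"
    "sales_lakehouse.Lakehouse/Files/raw/sales/"
)
df_remote = spark.read.option("header", "true").csv(onelake_path)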
Loading Data into Your Lakehouse
Method 1: Direct Upload
# For small files (< 100MB)
# 1. Click "..." on Files folder
# 2. Select "Upload" > "Upload files"
# 3. Select files from your computer
# Good for:
# - Quick prototyping
# - Sample data
# - Configuration files
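Once a file is uploaded, you can read it straight back from the Files section in a notebook. A tiny sketch, assuming you uploaded a CSV named sample.csv into Files/raw/:
# Read an uploaded CSV from the Files section (file name is an assumption)
sample_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sample.csv")
sample_df.show(5)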
Method 2: Notebook Ingestion
# Create a notebook attached to your Lakehouse
# This gives you full Spark capabilities

# Read from an external source
import requests

# Example: load from an API (assumes the endpoint returns a JSON array of records)
url = "https://api.example.com/data"
response = requests.get(url)
response.raise_for_status()
data = response.json()

# Convert the list of records to a Spark DataFrame
df = spark.createDataFrame(data)

# Save to Files (as raw JSON)
df.write.mode("overwrite").json("Files/raw/api_data")

# Or save directly to Tables (as Delta)
df.write.format("delta").mode("overwrite").saveAsTable("api_data")
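It is often worth landing the untouched API payload in Files before any transformation, so you can replay ingestion later without calling the API again. A small sketch using mssparkutils, which is built into Fabric notebooks; the target path is an assumption:
from notebookutils import mssparkutils
# Keep a raw copy of the payload for auditing/reprocessing (overwrite=True)
mssparkutils.fs.put("Files/raw/api_data/payload.json", response.text, True)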
Method 3: Data Pipeline
# Use Copy Activity for bulk data movement
# 1. Create a new Data Pipeline
# 2. Add Copy Data activity
# 3. Configure source (e.g., Azure SQL)
# 4. Configure destination (your Lakehouse)
# Pipeline configuration example:
copy_activity_config = {
    "source": {
        "type": "AzureSqlDatabase",
        "query": "SELECT * FROM Sales.Orders WHERE ModifiedDate > @{pipeline().parameters.watermark}"
    },
    "sink": {
        "type": "Lakehouse",
        "tableOption": "Tables/orders",
        "format": "Delta"
    }
}
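After the pipeline runs, a quick notebook query confirms the rows landed in the Delta table (assuming the sink table is named orders, as configured above):
# Sanity-check the pipeline output
spark.sql("SELECT COUNT(*) AS row_count, MAX(order_date) AS latest_order FROM orders").show()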
Working with the Files Section
The Files section is for unstructured or semi-structured data:
# mssparkutils is Fabric's equivalent of dbutils and is built into Fabric notebooks
from notebookutils import mssparkutils

# Create folder structure
mssparkutils.fs.mkdirs("Files/raw/sales/2023/07")
mssparkutils.fs.mkdirs("Files/raw/inventory")
mssparkutils.fs.mkdirs("Files/processed")
mssparkutils.fs.mkdirs("Files/archive")

# Copy a local file from the notebook's driver node into the Lakehouse
mssparkutils.fs.cp("file:/tmp/data.csv", "Files/raw/sales/data.csv")

# List files
files = mssparkutils.fs.ls("Files/raw/sales/")
for f in files:
    print(f"{f.name}: {f.size} bytes")
Working with the Tables Section
Tables are managed Delta Lake tables:
# Method 1: Create table from DataFrame
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("segment", StringType(), True)
])

data = [
    (1, "Alice Smith", "alice@example.com", "Enterprise"),
    (2, "Bob Jones", "bob@example.com", "SMB"),
    (3, "Carol White", "carol@example.com", "Enterprise")
]
df = spark.createDataFrame(data, schema)
df.write.format("delta").mode("overwrite").saveAsTable("customers")
# Method 2: Create table with SQL
spark.sql("""
CREATE TABLE IF NOT EXISTS orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount DOUBLE
)
USING DELTA
""")
# Method 3: Load from Files to Tables
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales/*.csv")
raw_df.write.format("delta").mode("overwrite").saveAsTable("sales_raw")
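The overwrite mode above is fine for an initial load; for subsequent incremental loads you would typically append new files to the same Delta table. A minimal sketch, with the incoming folder path as an assumption:
# Append newly arrived files to the existing Delta table
new_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales/2023/07/*.csv")
new_df.write.format("delta").mode("append").saveAsTable("sales_raw")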
The SQL Endpoint
Every Lakehouse automatically gets a SQL endpoint:
-- Switch to the SQL analytics endpoint view of the Lakehouse
-- Your Delta tables appear there as database objects

-- Query your Delta tables
SELECT
    c.name,
    c.segment,
    COUNT(o.order_id) AS order_count,
    SUM(o.total_amount) AS total_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name, c.segment
ORDER BY total_spend DESC;

-- Create views
CREATE VIEW vw_customer_summary AS
SELECT
    customer_id,
    name,
    email,
    segment
FROM customers
WHERE segment = 'Enterprise';
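The endpoint also speaks TDS, so external tools can query it like a SQL Server database. A minimal pyodbc sketch; the server name is a placeholder you copy from the Lakehouse's SQL connection string, and it assumes ODBC Driver 18 with Azure AD interactive sign-in:
import pyodbc
# Placeholder connection details; copy the real server name from the SQL endpoint settings
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-sql-endpoint-server>;"
    "Database=sales_lakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)
for row in conn.execute("SELECT TOP 5 name, segment FROM customers"):
    print(row.name, row.segment)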
Lakehouse Best Practices
# 1. Organize Files with purpose
recommended_structure = """
Files/
├── landing/ # Raw data arrives here
├── raw/ # Preserved raw data
├── staging/ # Intermediate processing
├── processed/ # Final file outputs
├── archive/ # Historical data
└── reference/ # Reference/lookup files
"""
# 2. Use consistent table naming
table_naming = {
    "raw_*":  "Raw ingested data",
    "stg_*":  "Staging/intermediate tables",
    "dim_*":  "Dimension tables",
    "fact_*": "Fact tables",
    "rpt_*":  "Reporting/aggregated tables"
}
# 3. Partition large tables
df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .saveAsTable("sales_partitioned")
# 4. Optimize table storage
spark.sql("OPTIMIZE sales_partitioned")
spark.sql("VACUUM sales_partitioned RETAIN 168 HOURS")
Connecting Power BI
# Power BI connects via Direct Lake mode for best performance
# 1. In Power BI Desktop or the Service
# 2. Get Data > Fabric Lakehouse
# 3. Select your workspace and Lakehouse
# 4. Choose tables to load

# Direct Lake benefits:
# - No data import (reads directly from the Delta files in OneLake)
# - Near real-time refresh
# - Reduced storage costs (no duplicated copy of the data)
Monitoring Your Lakehouse
# Check table details
spark.sql("DESCRIBE DETAIL customers").show()
# View table history
spark.sql("DESCRIBE HISTORY customers").show()
# Check storage usage
files_info = mssparkutils.fs.ls("Tables/customers/")
total_size = sum(f.size for f in files_info if not f.name.startswith("_"))
print(f"Table size: {total_size / 1024 / 1024:.2f} MB")
Tomorrow we’ll dive deeper into Delta tables in Fabric and explore their unique capabilities.