
Creating Your First Fabric Lakehouse: Step-by-Step Guide

The Lakehouse is the core data storage artifact in Microsoft Fabric. Today we’ll create a Lakehouse from scratch, look at its structure and capabilities, and cover a few best practices.

What is a Fabric Lakehouse?

A Lakehouse combines the best of data lakes and data warehouses:

lakehouse_characteristics = {
    "storage": {
        "format": "Delta Lake (Parquet + transaction log)",
        "location": "OneLake",
        "structure": "Files (unstructured) + Tables (structured)"
    },
    "compute": {
        "spark": "For data engineering",
        "sql": "Via SQL endpoint (auto-generated)"
    },
    "integration": {
        "power_bi": "DirectLake mode",
        "dataflows": "Direct output target",
        "pipelines": "Native activities"
    }
}

Creating a Lakehouse

Step 1: Navigate to Your Workspace

1. Go to app.fabric.microsoft.com
2. Select your Fabric-enabled workspace
3. Click "+ New"
4. Select "Lakehouse"

Step 2: Configure the Lakehouse

# Lakehouse settings
lakehouse_config = {
    "name": "sales_lakehouse",  # Use lowercase, underscores
    "description": "Sales data for analytics team"
}

# Naming conventions I recommend:
# - {domain}_lakehouse (e.g., sales_lakehouse)
# - {domain}_{layer}_lh (e.g., sales_bronze_lh)
# - Keep names short but descriptive
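
If you'd rather script steps 1 and 2 than click through the portal, the same settings can be sent to the Fabric REST API. Here's a minimal sketch, assuming the workspaces/{id}/lakehouses endpoint and a Microsoft Entra bearer token you've already acquired (the workspace ID and token below are placeholders):

import requests

workspace_id = "<your-workspace-guid>"   # placeholder
token = "<bearer-token>"                 # e.g. acquired via azure-identity

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/lakehouses",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "displayName": "sales_lakehouse",
        "description": "Sales data for analytics team"
    }
)
response.raise_for_status()
print(response.json())  # the new Lakehouse item, including its id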

Step 3: Understand the Structure

Once created, your Lakehouse has two main sections:

sales_lakehouse/
├── Files/          # Unstructured storage
│   ├── raw/        # Landing zone for raw files
│   ├── processed/  # Processed files
│   └── archive/    # Historical files
└── Tables/         # Managed Delta tables
    ├── customers/  # Delta table
    ├── orders/     # Delta table
    └── products/   # Delta table
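
Both sections are just paths in OneLake. From a notebook with this Lakehouse attached as the default, relative paths work; from anywhere else you can use the full OneLake (ABFS) path. A quick sketch (the workspace, Lakehouse, and file names are placeholders):

# With the Lakehouse attached as the default, relative paths resolve to it
df = spark.read.json("Files/raw/sales/orders.json")

# From any workspace, the same file has a full OneLake (ABFS) path
onelake_path = (
    "abfss://my_workspace@onelake.dfs.fabric.microsoft.com/"
    "sales_lakehouse.Lakehouse/Files/raw/sales/orders.json"
)
df = spark.read.json(onelake_path)

# Tables are registered Delta tables, so they are also queryable by name
customers_df = spark.table("customers")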

Loading Data into Your Lakehouse

Method 1: Direct Upload

# For small files (< 100MB)
# 1. Click "..." on Files folder
# 2. Select "Upload" > "Upload files"
# 3. Select files from your computer

# Good for:
# - Quick prototyping
# - Sample data
# - Configuration files
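
If you need to automate uploads instead of using the UI, OneLake exposes an ADLS Gen2-compatible endpoint, so the standard Azure Storage SDK works against it. A sketch assuming the azure-identity and azure-storage-file-datalake packages, with placeholder workspace and Lakehouse names:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake speaks the ADLS Gen2 API; the "file system" is the workspace
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential()
)
fs = service.get_file_system_client(file_system="my_workspace")

# Paths start at the Lakehouse item, then Files/
file_client = fs.get_file_client("sales_lakehouse.Lakehouse/Files/raw/sales/data.csv")
with open("data.csv", "rb") as f:
    file_client.upload_data(f, overwrite=True)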

Method 2: Notebook Ingestion

# Create a notebook attached to your Lakehouse
# This gives you full Spark capabilities

# Read from external source
import requests

# Example: Load from an API (assumes the endpoint returns a JSON array of records)
url = "https://api.example.com/data"
response = requests.get(url)
data = response.json()

# Convert the list of records to a Spark DataFrame
df = spark.createDataFrame(data)

# Save to Files (as raw JSON)
df.write.mode("overwrite").json("Files/raw/api_data")

# Or save directly to Tables (as Delta)
df.write.format("delta").mode("overwrite").saveAsTable("api_data")

Method 3: Data Pipeline

# Use Copy Activity for bulk data movement
# 1. Create a new Data Pipeline
# 2. Add Copy Data activity
# 3. Configure source (e.g., Azure SQL)
# 4. Configure destination (your Lakehouse)

# Pipeline configuration example:
copy_activity_config = {
    "source": {
        "type": "AzureSqlDatabase",
        "query": "SELECT * FROM Sales.Orders WHERE ModifiedDate > @{pipeline().parameters.watermark}"
    },
    "sink": {
        "type": "Lakehouse",
        "tableOption": "Tables/orders",
        "format": "Delta"
    }
}

Working with the Files Section

The Files section is for unstructured or semi-structured data:

# Create folder structure (Fabric notebooks ship mssparkutils rather than dbutils)
mssparkutils.fs.mkdirs("Files/raw/sales/2023/07")
mssparkutils.fs.mkdirs("Files/raw/inventory")
mssparkutils.fs.mkdirs("Files/processed")
mssparkutils.fs.mkdirs("Files/archive")

# Copy a local file from the driver node into the Lakehouse
mssparkutils.fs.cp("file:/tmp/data.csv", "Files/raw/sales/data.csv")

# List files
files = mssparkutils.fs.ls("Files/raw/sales/")
for f in files:
    print(f"{f.name}: {f.size} bytes")

Working with the Tables Section

Tables are managed Delta Lake tables:

# Method 1: Create table from DataFrame
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("segment", StringType(), True)
])

data = [
    (1, "Alice Smith", "alice@example.com", "Enterprise"),
    (2, "Bob Jones", "bob@example.com", "SMB"),
    (3, "Carol White", "carol@example.com", "Enterprise")
]

df = spark.createDataFrame(data, schema)
df.write.format("delta").mode("overwrite").saveAsTable("customers")

# Method 2: Create table with SQL
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id INT,
        customer_id INT,
        order_date DATE,
        total_amount DOUBLE
    )
    USING DELTA
""")

# Method 3: Load from Files to Tables
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales/*.csv")

raw_df.write.format("delta").mode("overwrite").saveAsTable("sales_raw")

The SQL Endpoint

Every Lakehouse automatically gets a SQL endpoint:

-- Access via the SQL endpoint view
-- Tables appear as database objects

-- Query your Delta tables
SELECT
    c.name,
    c.segment,
    COUNT(o.order_id) as order_count,
    SUM(o.total_amount) as total_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name, c.segment
ORDER BY total_spend DESC;

-- Create views
CREATE VIEW vw_customer_summary AS
SELECT
    customer_id,
    name,
    email,
    segment
FROM customers
WHERE segment = 'Enterprise';
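
The SQL endpoint is a standard TDS endpoint, so you can also query it from outside Fabric. A sketch using pyodbc, assuming ODBC Driver 18 and the SQL connection string copied from the Lakehouse's SQL analytics endpoint settings (the server value below is a placeholder):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<copy-from-sql-endpoint-settings>;"   # placeholder
    "Database=sales_lakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

for row in conn.execute("SELECT TOP 5 name, segment FROM customers"):
    print(row.name, row.segment)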

Lakehouse Best Practices

# 1. Organize Files with purpose
recommended_structure = """
Files/
├── landing/       # Raw data arrives here
├── raw/           # Preserved raw data
├── staging/       # Intermediate processing
├── processed/     # Final file outputs
├── archive/       # Historical data
└── reference/     # Reference/lookup files
"""

# 2. Use consistent table naming
table_naming = {
    "raw_*": "Raw ingested data",
    "stg_*": "Staging/intermediate tables",
    "dim_*": "Dimension tables",
    "fact_*": "Fact tables",
    "rpt_*": "Reporting/aggregated tables"
}

# 3. Partition large tables
df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .saveAsTable("sales_partitioned")

# 4. Optimize table storage
spark.sql("OPTIMIZE sales_partitioned")
spark.sql("VACUUM sales_partitioned RETAIN 168 HOURS")

Connecting Power BI

# Power BI connects via DirectLake for best performance
# 1. In Power BI Desktop or Service
# 2. Get Data > Fabric Lakehouse
# 3. Select your workspace and Lakehouse
# 4. Choose tables to load

# DirectLake benefits:
# - No data import (reads directly from Delta)
# - Near real-time refresh
# - Reduced storage costs

Monitoring Your Lakehouse

# Check table details
spark.sql("DESCRIBE DETAIL customers").show()

# View table history
spark.sql("DESCRIBE HISTORY customers").show()

# Check storage usage (file-level view of the table's folder)
files_info = mssparkutils.fs.ls("Tables/customers/")
total_size = sum(f.size for f in files_info if not f.name.startswith("_"))
print(f"Table size: {total_size / 1024 / 1024:.2f} MB")

Tomorrow we’ll dive deeper into Delta tables in Fabric and explore their unique capabilities.


Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.