
Azure Databricks Unity Catalog: Unified Data Governance at Scale

Unity Catalog brings unified governance to Azure Databricks, providing a single place to manage data access, discover assets, and track lineage across all your workspaces.

Why Unity Catalog Matters

Before Unity Catalog, Databricks governance was workspace-centric:

  • Each workspace had its own metastore
  • Sharing data between workspaces required manual copying
  • Access control was fragmented
  • No centralized audit trail

Unity Catalog solves these problems with a unified approach.

Core Concepts

Three-Level Namespace

Unity Catalog introduces a hierarchical namespace:

catalog.schema.table

Example: production.sales.transactions

Each level has a distinct role (a short usage sketch follows the list):

  • Catalogs: Logical groupings (production, development, sandbox)
  • Schemas: Database-like containers within catalogs
  • Tables/Views/Functions: The actual data objects
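
As a quick illustration, here is how the namespace is used from a notebook. This is a minimal sketch; it assumes a Unity Catalog-enabled cluster and the production.sales.transactions table created later in this post.

# Minimal sketch: addressing objects with the three-level namespace
# Assumes production.sales.transactions exists (created later in this post)

# Fully qualified names work from any workspace attached to the metastore
df = spark.table("production.sales.transactions")

# Or set a default catalog and schema for the session
spark.sql("USE CATALOG production")
spark.sql("USE SCHEMA sales")
df = spark.table("transactions")  # resolves to production.sales.transactions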

Metastore

The metastore is the top-level container that holds all metadata:

# Creating a Unity Catalog metastore (admin operation)
# Typically done via UI or Terraform

# Terraform example
resource "databricks_metastore" "this" {
  name          = "primary"
  storage_root  = "abfss://metastore@storageaccount.dfs.core.windows.net/"
  owner         = "data-platform-team"
  region        = "eastus"

  # Delta Sharing
  delta_sharing_scope                               = "INTERNAL_AND_EXTERNAL"
  delta_sharing_recipient_token_lifetime_in_seconds = 3600
}

Setting Up Unity Catalog

Enable Unity Catalog for a Workspace

# Assign workspace to metastore
resource "databricks_metastore_assignment" "this" {
  metastore_id = databricks_metastore.this.id
  workspace_id = var.workspace_id
}

# Create a catalog
resource "databricks_catalog" "production" {
  metastore_id = databricks_metastore.this.id
  name         = "production"
  comment      = "Production data catalog"
}

# Create a schema
resource "databricks_schema" "sales" {
  catalog_name = databricks_catalog.production.name
  name         = "sales"
  comment      = "Sales domain data"
}

Creating Tables

-- Create a managed table
CREATE TABLE production.sales.transactions (
    transaction_id STRING,
    customer_id STRING,
    product_id STRING,
    amount DECIMAL(10, 2),
    transaction_date DATE,
    region STRING
)
USING DELTA
COMMENT 'Sales transactions data'
TBLPROPERTIES ('quality' = 'gold');

-- Create a table over existing Delta data (schema inferred from the files)
CREATE TABLE production.sales.customers
USING DELTA
LOCATION 'abfss://data@storage.dfs.core.windows.net/customers'
COMMENT 'Customer master data';

-- Create external table
CREATE TABLE production.sales.external_orders (
    order_id STRING,
    order_date DATE,
    total DECIMAL(10, 2)
)
USING DELTA
LOCATION 'abfss://external@storage.dfs.core.windows.net/orders';

Access Control

Unity Catalog provides fine-grained access control:

-- Grant catalog access
GRANT USE CATALOG ON CATALOG production TO `data-analysts@company.com`;

-- Grant schema access
GRANT USE SCHEMA ON SCHEMA production.sales TO `data-analysts@company.com`;

-- Grant table-level read access
GRANT SELECT ON TABLE production.sales.transactions TO `data-analysts@company.com`;

-- Grant write access
GRANT SELECT, MODIFY ON TABLE production.sales.transactions TO `data-engineers@company.com`;

-- Grant all privileges on schema
GRANT ALL PRIVILEGES ON SCHEMA production.sales TO `data-platform@company.com`;

-- Create a row filter for sensitive data
CREATE FUNCTION production.sales.region_filter(region STRING)
RETURNS BOOLEAN
RETURN IF(
    is_account_group_member('global-access'),
    TRUE,
    region = current_user_region()  -- assumes a user-defined helper that maps the caller to a region
);

ALTER TABLE production.sales.transactions
SET ROW FILTER production.sales.region_filter ON (region);
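
Grants can be verified straight from a notebook with SHOW GRANTS. A quick sketch using the same objects and groups as above:

# Quick sketch: verify the grants applied above
spark.sql("SHOW GRANTS ON TABLE production.sales.transactions").show(truncate=False)
spark.sql("SHOW GRANTS `data-analysts@company.com` ON SCHEMA production.sales").show(truncate=False)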

Column-Level Security

Protect sensitive columns:

-- Create a column mask function
CREATE FUNCTION production.sales.mask_email(email STRING)
RETURNS STRING
RETURN IF(
    is_account_group_member('pii-access'),
    email,
    CONCAT(LEFT(email, 2), '***@', SPLIT(email, '@')[1])
);

-- Apply mask to column
ALTER TABLE production.sales.customers
ALTER COLUMN email SET MASK production.sales.mask_email;

-- Query results will show masked data for non-privileged users
SELECT customer_id, email FROM production.sales.customers;
-- Returns: C001, jo***@company.com (for users without pii-access)

Data Lineage

Track data dependencies:

# Lineage is automatically captured for all Unity Catalog operations

# Query lineage via REST API
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"  # placeholder; prefer a secret scope in practice

def get_table_lineage(catalog, schema, table):
    url = f"{workspace_url}/api/2.0/lineage-tracking/table-lineage"

    headers = {"Authorization": f"Bearer {token}"}

    params = {
        "table_name": f"{catalog}.{schema}.{table}",
        "include_entity_lineage": True
    }

    response = requests.get(url, headers=headers, params=params)
    return response.json()

# Get upstream dependencies
lineage = get_table_lineage("production", "sales", "daily_summary")
print("Upstream tables:", lineage.get("upstream_tables", []))
print("Upstream notebooks:", lineage.get("upstream_notebooks", []))

Managed vs External Tables

Understanding the difference:

-- Managed table: Unity Catalog manages both metadata and data
CREATE TABLE production.sales.managed_orders (
    order_id STRING,
    amount DECIMAL(10, 2)
) USING DELTA;

-- When dropped, data is deleted
DROP TABLE production.sales.managed_orders;

-- External table: Unity Catalog manages metadata, you manage data
CREATE TABLE production.sales.external_orders
USING DELTA
LOCATION 'abfss://data@storage.dfs.core.windows.net/orders';

-- When dropped, only metadata is removed, data remains
DROP TABLE production.sales.external_orders;
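
If you are ever unsure which kind a table is, DESCRIBE TABLE EXTENDED reports it in its Type row. A quick check, run before dropping the tables above:

# Quick check: the "Type" row shows MANAGED or EXTERNAL
spark.sql("DESCRIBE TABLE EXTENDED production.sales.external_orders") \
    .filter("col_name = 'Type'") \
    .show()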

Storage Credentials

Securely access external storage:

# Create storage credential
resource "databricks_storage_credential" "external" {
  name = "external-storage-cred"

  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.this.id
  }

  comment = "Managed identity for external storage access"
}

# Create external location
resource "databricks_external_location" "data_lake" {
  name            = "data-lake"
  url             = "abfss://data@storageaccount.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.external.name
  comment         = "Primary data lake location"
}

-- Grant access to external location
GRANT READ FILES ON EXTERNAL LOCATION `data-lake` TO `data-engineers@company.com`;
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION `data-lake` TO `data-engineers@company.com`;
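
With the external location registered and the grants in place, engineers can browse the path and register external tables against it. A sketch; "raw_events" is a hypothetical folder under the data-lake location defined above.

# Sketch: use the external location after the grants above
# List files at the registered path (requires READ FILES on the external location)
display(dbutils.fs.ls("abfss://data@storageaccount.dfs.core.windows.net/"))

# Register an external table over a folder at that location
# (requires CREATE EXTERNAL TABLE on the external location; "raw_events" is hypothetical)
spark.sql("""
    CREATE TABLE IF NOT EXISTS production.sales.raw_events
    USING DELTA
    LOCATION 'abfss://data@storageaccount.dfs.core.windows.net/raw_events'
""")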

Audit Logging

Track all access:

# Query audit logs from Azure Monitor
query = """
DatabricksAudit
| where Category == "unityCatalog"
| where ActionName in ("getTable", "createTable", "deleteTable", "grantPermission")
| project
    TimeGenerated,
    ActionName,
    Identity,
    RequestParams,
    Response
| order by TimeGenerated desc
"""

# Or from system tables
spark.sql("""
    SELECT
        event_time,
        action_name,
        user_identity.email as user,
        request_params.full_name_arg as object_name
    FROM system.access.audit
    WHERE action_name LIKE '%Table%'
    ORDER BY event_time DESC
    LIMIT 100
""")

Best Practices

Catalog Organization

# Recommended structure
production/
├── sales/
│   ├── transactions
│   ├── customers
│   └── products
├── marketing/
│   ├── campaigns
│   └── leads
└── finance/
    ├── invoices
    └── payments

development/
├── sales/
│   └── (mirrors production structure)
└── experiments/
    └── (ad-hoc analysis)

sandbox/
└── (per-user schemas for exploration)
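
One way to bootstrap this layout is a small loop over the desired catalogs and schemas. A minimal sketch; it assumes you hold CREATE CATALOG on the metastore and CREATE SCHEMA on each catalog.

# Minimal sketch: create the recommended catalog/schema layout idempotently
layout = {
    "production": ["sales", "marketing", "finance"],
    "development": ["sales", "experiments"],
    "sandbox": [],
}

for catalog, schemas in layout.items():
    spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
    for schema in schemas:
        spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")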

Naming Conventions

-- Use consistent naming
-- Tables: snake_case, plural for collections
CREATE TABLE production.sales.order_line_items ...;

-- Views: prefix with v_ or suffix with _view
CREATE VIEW production.sales.v_daily_sales ...;

-- Functions: verb_noun format
CREATE FUNCTION production.sales.calculate_discount ...;

Migration Strategy

# Migrate existing Hive tables to Unity Catalog
from datetime import date

# 1. List existing tables in the legacy Hive metastore
#    (assumes the current catalog is hive_metastore)
legacy_tables = spark.catalog.listTables("legacy_database")

# 2. Create Unity Catalog equivalents
for table in legacy_tables:
    source = f"hive_metastore.legacy_database.{table.name}"
    target = f"production.sales.{table.name}"

    # For managed tables, use CTAS
    spark.sql(f"""
        CREATE TABLE {target}
        USING DELTA
        AS SELECT * FROM {source}
    """)

    # Preserve properties (TBLPROPERTIES values must be string literals)
    spark.sql(f"""
        ALTER TABLE {target}
        SET TBLPROPERTIES (
            'migrated_from' = '{source}',
            'migration_date' = '{date.today().isoformat()}'
        )
    """)

Integration with Azure Services

Azure Purview Integration

# Unity Catalog metadata can be surfaced in Microsoft Purview (formerly Azure Purview)
# for an enterprise-wide data catalog; the scan is registered and scheduled
# from the Purview side rather than inside Unity Catalog.

# Illustrative settings only (not an actual API payload)
purview_config = {
    "purview_account": "mycompany-purview",
    "collection": "databricks-assets",
    "sync_frequency": "hourly"
}

Azure AD Integration

# Azure AD (Entra ID) groups synced to the Databricks account (for example via
# SCIM provisioning or automatic identity management) can be granted access by name
spark.sql("""
    GRANT SELECT ON TABLE production.sales.transactions
    TO `Sales-Team@mycompany.onmicrosoft.com`
""")

Conclusion

Unity Catalog represents a significant step forward for data governance in Azure Databricks. By providing unified access control, data lineage, and audit capabilities across all workspaces, it enables organizations to implement true data mesh architectures while maintaining centralized governance.

Key benefits:

  • Single source of truth for data access policies
  • Cross-workspace data sharing without copying
  • Automatic lineage tracking
  • Fine-grained security down to row and column level
  • Enterprise-ready audit trails


Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.