Azure Databricks Workspace Management and Data Governance Best Practices
As organizations scale their Azure Databricks deployments, workspace management and data governance become critical. In this post, I’ll explore best practices for organizing workspaces, managing access control, and implementing data governance patterns that prepare your environment for enterprise-scale data operations.
Workspace Organization
A well-organized Databricks workspace is essential for team productivity and security.
Folder Structure Best Practices
/Repos/                    # Git-integrated notebooks and code
    /team-a/
    /team-b/
    /shared/
/Users/                    # Personal user folders
    /user@company.com/
/Shared/                   # Shared resources
    /Libraries/            # Common libraries and utilities
    /Data/                 # Shared datasets and references
    /Templates/            # Notebook templates
/Production/               # Production workflows (restricted access)
    /ETL/
    /ML/
    /Reports/
Creating Workspace Structure
%sh
# Create the folder structure with the Databricks CLI
# (the CLI must be installed and configured on the cluster, or run these commands from a local terminal)
databricks workspace mkdirs /Shared/Libraries
databricks workspace mkdirs /Shared/Data
databricks workspace mkdirs /Shared/Templates
databricks workspace mkdirs /Production/ETL
databricks workspace mkdirs /Production/ML
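If you want to provision the same structure from automation (for example a CI/CD pipeline), the folders can also be created through the Workspace API. The sketch below is a minimal example under my own assumptions: the workspace URL is a placeholder and the token comes from a hypothetical secret.

import requests

# Placeholders: substitute your own workspace URL and token
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = dbutils.secrets.get(scope="my-secrets", key="databricks-pat")  # hypothetical secret

folders = [
    "/Shared/Libraries", "/Shared/Data", "/Shared/Templates",
    "/Production/ETL", "/Production/ML", "/Production/Reports",
]

for path in folders:
    # POST /api/2.0/workspace/mkdirs creates the folder and any missing parents
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/workspace/mkdirs",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"path": path},
    )
    response.raise_for_status()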
Access Control with Table ACLs
Enable table access control for fine-grained permissions:
-- Enable table access control on a cluster
-- Set spark.databricks.acl.dfAclsEnabled to true in cluster config
-- Grant database access
GRANT USAGE ON DATABASE analytics TO `data_analysts@company.com`;
-- Grant table read access
GRANT SELECT ON TABLE analytics.sales TO `data_analysts@company.com`;
-- Grant table write access
GRANT SELECT, MODIFY ON TABLE analytics.sales_staging TO `data_engineers@company.com`;
-- Grant all privileges on database
GRANT ALL PRIVILEGES ON DATABASE analytics TO `data_platform_admins@company.com`;
-- View grants
SHOW GRANTS ON TABLE analytics.sales;
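On a cluster with table access control enabled, grants can also be applied in bulk from a notebook, which is handy when a new group needs read access to everything in a database. A minimal sketch, reusing the analytics database and analyst group from above:

# Grant SELECT on every table in the analytics database to the analyst group
tables = [row.tableName for row in spark.sql("SHOW TABLES IN analytics").collect()]

for table_name in tables:
    spark.sql(
        f"GRANT SELECT ON TABLE analytics.{table_name} "
        "TO `data_analysts@company.com`"
    )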
Hive Metastore Organization
Structure your databases for clarity and access control:
-- Create databases for different zones
CREATE DATABASE IF NOT EXISTS raw
COMMENT 'Raw data ingested from source systems'
LOCATION 'abfss://raw@storage.dfs.core.windows.net/';
CREATE DATABASE IF NOT EXISTS curated
COMMENT 'Cleaned and standardized data'
LOCATION 'abfss://curated@storage.dfs.core.windows.net/';
CREATE DATABASE IF NOT EXISTS analytics
COMMENT 'Analytics-ready datasets and aggregations'
LOCATION 'abfss://analytics@storage.dfs.core.windows.net/';
CREATE DATABASE IF NOT EXISTS sandbox
COMMENT 'Sandbox for exploration and development'
LOCATION 'abfss://sandbox@storage.dfs.core.windows.net/';
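Because each database declares a LOCATION, managed tables created in it are stored under the matching container without any per-table path configuration. A quick sketch of promoting data from raw to curated under that assumption (the table and column names are illustrative):

# Deduplicate raw events and register the result as a managed table in curated;
# its files land under abfss://curated@storage.dfs.core.windows.net/ automatically
cleaned_df = spark.table("raw.events").dropDuplicates(["event_id"])

(cleaned_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("curated.events"))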
Cluster Policies
Define cluster policies to control costs and standardize configurations:
{
  "cluster_type": {
    "type": "fixed",
    "value": "all-purpose"
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120,
    "defaultValue": 60
  },
  "spark_version": {
    "type": "allowlist",
    "values": [
      "7.3.x-scala2.12",
      "8.1.x-scala2.12",
      "8.2.x-scala2.12"
    ],
    "defaultValue": "8.2.x-scala2.12"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      "Standard_DS3_v2",
      "Standard_DS4_v2",
      "Standard_DS5_v2"
    ]
  },
  "num_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 10,
    "defaultValue": 2
  },
  "custom_tags.team": {
    "type": "fixed",
    "value": "data-engineering"
  },
  "custom_tags.environment": {
    "type": "allowlist",
    "values": ["dev", "test", "prod"]
  }
}
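Policies can be created through the admin UI, but managing them as code keeps dev, test, and prod workspaces consistent. Here's a rough sketch using the Cluster Policies API; the policy file name, workspace URL, and token secret are assumptions of mine:

import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = dbutils.secrets.get(scope="my-secrets", key="databricks-pat")   # hypothetical secret

# The policy JSON shown above, saved next to this script (hypothetical file name)
with open("data_engineering_policy.json") as f:
    policy_definition = f.read()

# POST /api/2.0/policies/clusters/create registers the policy and returns its ID
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "data-engineering-policy", "definition": policy_definition},
)
response.raise_for_status()
policy_id = response.json()["policy_id"]
print(f"Created policy {policy_id}")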
Secret Management
Use Azure Key Vault for secure secret management:
%sh
# Create a secret scope backed by Azure Key Vault via the Databricks CLI
# (the resource ID and DNS name identify your Key Vault)
databricks secrets create-scope --scope my-secrets \
--scope-backend-type AZURE_KEYVAULT \
--resource-id /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.KeyVault/vaults/{vault-name} \
--dns-name https://{vault-name}.vault.azure.net/
# Access secrets in notebooks
storage_key = dbutils.secrets.get(scope="my-secrets", key="storage-account-key")
sql_password = dbutils.secrets.get(scope="my-secrets", key="sql-password")
# Use in Spark configuration
spark.conf.set(
"fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
dbutils.secrets.get(scope="my-secrets", key="storage-account-key")
)
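Secrets fetched this way are redacted in notebook output and can be passed straight into connection options. As a hedged example, here's how the sql_password above could feed a JDBC read from Azure SQL; the server, database, user, and table names are placeholders:

# Hypothetical Azure SQL connection details
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

orders_df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", sql_password)  # retrieved from the secret scope above
    .load())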
Data Lineage Tracking
Implement custom lineage tracking for your pipelines:
import json
import uuid
from datetime import datetime

class LineageTracker:
    def __init__(self, spark, lineage_table="governance.data_lineage"):
        self.spark = spark
        self.lineage_table = lineage_table

    def track_transformation(self, source_tables, target_table,
                             transformation_name, notebook_path):
        """Record data lineage for a transformation."""
        lineage_record = {
            "transformation_id": str(uuid.uuid4()),
            "transformation_name": transformation_name,
            "source_tables": json.dumps(source_tables),
            "target_table": target_table,
            "notebook_path": notebook_path,
            "executed_by": self.spark.sql("SELECT current_user()").first()[0],
            "executed_at": datetime.utcnow().isoformat(),
            "cluster_id": self.spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
        }
        lineage_df = self.spark.createDataFrame([lineage_record])
        lineage_df.write.format("delta").mode("append").saveAsTable(self.lineage_table)
        return lineage_record["transformation_id"]
# Usage
tracker = LineageTracker(spark)
# Before your transformation
transformation_id = tracker.track_transformation(
source_tables=["raw.events", "raw.users"],
target_table="curated.user_events",
transformation_name="Join events with users",
notebook_path=dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
)
# Your transformation logic here...
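Once records accumulate, the lineage table answers audit questions such as "what feeds this table, and who ran it?". A small sketch querying the governance.data_lineage table defined above:

# Most recent transformations that wrote to curated.user_events
upstream = spark.sql("""
    SELECT transformation_name, source_tables, executed_by, executed_at
    FROM governance.data_lineage
    WHERE target_table = 'curated.user_events'
    ORDER BY executed_at DESC
    LIMIT 10
""")
display(upstream)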
Data Quality Framework
Implement data quality checks:
from pyspark.sql.functions import col

class DataQualityChecker:
    def __init__(self, spark):
        self.spark = spark
        self.checks = []

    def check_not_null(self, df, columns):
        """Check that specified columns have no null values."""
        for col_name in columns:
            null_count = df.filter(col(col_name).isNull()).count()
            self.checks.append({
                "check": f"not_null_{col_name}",
                "passed": null_count == 0,
                "details": f"Found {null_count} null values"
            })
        return self

    def check_unique(self, df, columns):
        """Check that specified columns form a unique key."""
        total_count = df.count()
        distinct_count = df.select(columns).distinct().count()
        self.checks.append({
            "check": f"unique_{','.join(columns)}",
            "passed": total_count == distinct_count,
            "details": f"Total: {total_count}, Distinct: {distinct_count}"
        })
        return self

    def check_range(self, df, column, min_val, max_val):
        """Check that values are within the expected range."""
        out_of_range = df.filter(
            (col(column) < min_val) | (col(column) > max_val)
        ).count()
        self.checks.append({
            "check": f"range_{column}",
            "passed": out_of_range == 0,
            "details": f"Found {out_of_range} out of range values"
        })
        return self

    def check_referential_integrity(self, df, column, reference_df, reference_column):
        """Check referential integrity between tables."""
        orphan_count = df.join(
            reference_df,
            df[column] == reference_df[reference_column],
            "left_anti"
        ).count()
        self.checks.append({
            "check": f"referential_{column}",
            "passed": orphan_count == 0,
            "details": f"Found {orphan_count} orphan records"
        })
        return self

    def get_results(self):
        """Return all check results."""
        return self.checks

    def assert_all_passed(self):
        """Raise an exception if any check failed."""
        failed = [c for c in self.checks if not c["passed"]]
        if failed:
            raise Exception(f"Data quality checks failed: {failed}")
        print("All data quality checks passed!")
# Usage
df = spark.table("curated.transactions")
checker = DataQualityChecker(spark)
checker \
.check_not_null(df, ["transaction_id", "customer_id", "amount"]) \
.check_unique(df, ["transaction_id"]) \
.check_range(df, "amount", 0, 1000000) \
.assert_all_passed()
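For auditability, you may want to persist the results before calling assert_all_passed(), so that failed runs still leave a record. A minimal sketch, assuming a governance.data_quality_results Delta table exists for this purpose:

from datetime import datetime

# Record every check outcome for the run (table name is an assumption)
results_df = spark.createDataFrame([
    {
        "table_name": "curated.transactions",
        "check": c["check"],
        "passed": c["passed"],
        "details": c["details"],
        "checked_at": datetime.utcnow().isoformat(),
    }
    for c in checker.get_results()
])

results_df.write.format("delta").mode("append").saveAsTable("governance.data_quality_results")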
Job Monitoring and Alerting
Set up monitoring for your data pipelines:
import requests
from datetime import datetime
def send_alert(webhook_url, title, message, severity="warning"):
    """Send an alert to a Teams/Slack webhook."""
    payload = {
        "title": title,
        "text": message,
        "severity": severity,
        "timestamp": datetime.utcnow().isoformat()
    }
    response = requests.post(webhook_url, json=payload)
    return response.status_code == 200

def monitor_job_health(job_name, expected_rows_min, actual_rows):
    """Monitor job output and alert if anomalies are detected."""
    webhook_url = dbutils.secrets.get(scope="alerts", key="teams-webhook")
    if actual_rows < expected_rows_min:
        send_alert(
            webhook_url,
            f"Low Row Count Alert: {job_name}",
            f"Expected minimum {expected_rows_min} rows, got {actual_rows}",
            "critical"
        )
        return False
    return True
# Usage in your pipeline
rows_processed = df.count()
monitor_job_health("daily_sales_etl", expected_rows_min=1000, actual_rows=rows_processed)
Environment Separation
Manage multiple environments (dev, test, prod):
# Configuration management
class EnvironmentConfig:
    def __init__(self, env_name):
        self.env = env_name
        self.configs = {
            "dev": {
                "storage_account": "devstorageaccount",
                "database_prefix": "dev_",
                "cluster_size": "small"
            },
            "test": {
                "storage_account": "teststorageaccount",
                "database_prefix": "test_",
                "cluster_size": "medium"
            },
            "prod": {
                "storage_account": "prodstorageaccount",
                "database_prefix": "",
                "cluster_size": "large"
            }
        }

    def get(self, key):
        return self.configs[self.env][key]

    def get_database_name(self, base_name):
        return f"{self.configs[self.env]['database_prefix']}{base_name}"
# Usage
env = EnvironmentConfig(dbutils.widgets.get("environment"))
database = env.get_database_name("analytics")
storage = env.get("storage_account")
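One detail to remember: the environment widget has to exist before dbutils.widgets.get is called, and defining it as a dropdown documents the allowed values. A short usage sketch (the container and table names are illustrative):

# Define the widget once per notebook; jobs can override it via job parameters
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"])

# Build environment-specific resources from the config
raw_path = f"abfss://raw@{storage}.dfs.core.windows.net/events/"
events_df = spark.read.format("delta").load(raw_path)
events_df.write.mode("overwrite").saveAsTable(f"{database}.daily_events")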
Best Practices Summary
- Organize workspaces - Use consistent folder structures across teams
- Implement RBAC - Use table ACLs and cluster policies for access control
- Secure secrets - Always use Azure Key Vault or Databricks secrets
- Track lineage - Implement custom lineage for compliance and debugging
- Monitor quality - Build data quality checks into every pipeline
- Separate environments - Use configuration-driven environment management
Conclusion
Effective workspace management and data governance are essential for scaling Azure Databricks in the enterprise. By implementing these patterns, you create a foundation for secure, reliable, and auditable data operations.