Azure Data Lake Storage Gen2 - Building Modern Data Lakes
Azure Data Lake Storage Gen2 (ADLS Gen2) combines the power of a Hadoop-compatible file system with the scalability of Azure Blob Storage. It’s the foundation for modern data lakes, supporting analytics workloads from Azure Synapse, Databricks, and HDInsight. Let me walk you through the fundamentals and best practices.
What is ADLS Gen2?
ADLS Gen2 adds hierarchical namespace (HNS) to Azure Blob Storage, enabling:
- True directory operations - Atomic renames and deletes (see the sketch after this list)
- POSIX-style ACLs - Fine-grained access control
- Hadoop compatibility - ABFS driver for big data tools
- Blob storage features - Tiering, lifecycle management, encryption
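As a quick illustration of what the hierarchical namespace buys you in practice, here is a minimal sketch using the azure-storage-file-datalake Python SDK (the same package used later in this post); the account name and paths are illustrative placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential()
)
fs = service.get_file_system_client("raw")

# Atomic directory rename - a single metadata operation with HNS,
# not a per-blob copy-and-delete as on a flat namespace
staging = fs.get_directory_client("sales/_staging/2021-04-13")
staging.rename_directory(new_name="raw/sales/2021/04/13")

# POSIX-style ACL applied to a directory
sales_dir = fs.get_directory_client("sales")
sales_dir.set_access_control(acl="user::rwx,group::r-x,other::---")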
Creating a Data Lake
Using Azure CLI
# Create storage account with hierarchical namespace
az storage account create \
  --name mydatalake \
  --resource-group myResourceGroup \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true \
  --allow-blob-public-access false
# Create file systems (containers)
az storage fs create \
  --name raw \
  --account-name mydatalake \
  --auth-mode login

az storage fs create \
  --name curated \
  --account-name mydatalake \
  --auth-mode login
Using Terraform
resource "azurerm_storage_account" "datalake" {
name = "mydatalake"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
account_tier = "Standard"
account_replication_type = "LRS"
account_kind = "StorageV2"
is_hns_enabled = true
blob_properties {
versioning_enabled = true
delete_retention_policy {
days = 7
}
}
network_rules {
default_action = "Deny"
bypass = ["AzureServices"]
virtual_network_subnet_ids = [
azurerm_subnet.data.id
]
}
tags = {
Environment = "Production"
Purpose = "DataLake"
}
}
resource "azurerm_storage_data_lake_gen2_filesystem" "raw" {
name = "raw"
storage_account_id = azurerm_storage_account.datalake.id
}
resource "azurerm_storage_data_lake_gen2_filesystem" "curated" {
name = "curated"
storage_account_id = azurerm_storage_account.datalake.id
}
Data Lake Architecture Patterns
Medallion Architecture (Bronze/Silver/Gold)
raw/ # Bronze - Raw ingested data
├── sales/
│ ├── 2021/04/13/
│ │ ├── orders_001.json
│ │ └── orders_002.json
│ └── _checkpoints/
├── inventory/
└── customers/
curated/ # Silver - Cleansed and validated
├── sales/
│ ├── orders/
│ │ └── year=2021/month=04/
│ └── order_items/
├── inventory/
└── customers/
analytics/ # Gold - Business-level aggregates
├── sales_summary/
├── customer_360/
└── inventory_metrics/
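To make the layers concrete, here is a simplified sketch of a bronze-to-silver promotion in PySpark. It assumes a SparkSession already configured for the account (see the PySpark section below) and hypothetical order_id/order_date columns in the raw JSON; treat it as an illustration of the flow, not a production job.

from pyspark.sql import functions as F

# Bronze: raw JSON exactly as ingested
raw_orders = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/sales/2021/04/13/")

# Silver: deduplicate, type the dates, and add partition columns
cleaned = (
    raw_orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("year", F.year("order_date"))
    .withColumn("month", F.month("order_date"))
)

(cleaned.write
    .mode("append")
    .partitionBy("year", "month")
    .parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/orders/"))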
Zone-Based Architecture
landing/ # Temporary landing zone
├── batch/
└── streaming/
raw/ # Immutable raw data
├── internal/
│ ├── erp/
│ └── crm/
└── external/
├── vendors/
└── market_data/
enriched/ # Processed and enriched
├── master_data/
└── reference_data/
curated/ # Business-ready datasets
├── sales/
├── finance/
└── operations/
sandbox/ # Exploration and development
├── data_science/
└── experiments/
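Whichever layout you choose, it helps to encode the convention in one place. Here is a tiny, hypothetical helper that builds ABFS URIs for the zone-based layout above; the account and zone names are assumptions matching the sample tree.

from datetime import date

ACCOUNT = "mydatalake"

def zone_path(zone: str, source: str, dataset: str, run_date: date) -> str:
    """Build an ABFS URI following the zone/source/dataset/date convention."""
    return (
        f"abfss://{zone}@{ACCOUNT}.dfs.core.windows.net/"
        f"{source}/{dataset}/{run_date:%Y/%m/%d}"
    )

print(zone_path("raw", "internal/erp", "orders", date(2021, 4, 13)))
# abfss://raw@mydatalake.dfs.core.windows.net/internal/erp/orders/2021/04/13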
Working with ADLS Gen2
Python with azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import DefaultAzureCredential
import pandas as pd
from io import BytesIO

# Initialize client
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=credential
)

# Get file system client
file_system_client = service_client.get_file_system_client("raw")

# Create directory
directory_client = file_system_client.create_directory("sales/2021/04/13")

# Upload file
file_client = directory_client.create_file("orders.json")
with open("local_orders.json", "rb") as f:
    file_client.upload_data(f.read(), overwrite=True)

# Upload in chunks with append (for large files; chunk1/chunk2 are bytes objects)
file_client = directory_client.create_file("large_file.csv")
file_client.append_data(data=chunk1, offset=0, length=len(chunk1))
file_client.append_data(data=chunk2, offset=len(chunk1), length=len(chunk2))
file_client.flush_data(len(chunk1) + len(chunk2))

# Read file into DataFrame (note: the parquet file lives in the curated file system)
curated_fs = service_client.get_file_system_client("curated")
file_client = curated_fs.get_file_client("sales/orders.parquet")
download = file_client.download_file()
df = pd.read_parquet(BytesIO(download.readall()))

# List directory contents
paths = file_system_client.get_paths(path="sales/2021")
for path in paths:
    print(f"{path.name} - {'Directory' if path.is_directory else 'File'} - {path.last_modified}")
PySpark Integration
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# client_id, client_secret, and tenant_id come from your service principal
spark = SparkSession.builder \
    .appName("DataLakeDemo") \
    .config("spark.hadoop.fs.azure.account.auth.type.mydatalake.dfs.core.windows.net", "OAuth") \
    .config("spark.hadoop.fs.azure.account.oauth.provider.type.mydatalake.dfs.core.windows.net",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") \
    .config("spark.hadoop.fs.azure.account.oauth2.client.id.mydatalake.dfs.core.windows.net", client_id) \
    .config("spark.hadoop.fs.azure.account.oauth2.client.secret.mydatalake.dfs.core.windows.net", client_secret) \
    .config("spark.hadoop.fs.azure.account.oauth2.client.endpoint.mydatalake.dfs.core.windows.net",
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token") \
    .getOrCreate()

# Read from Data Lake (year/month come from the partitioned folder layout)
df = spark.read.parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/orders/")

# Process data
result = df.groupBy("customer_id", "year", "month") \
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("order_id").alias("order_count")
    )

# Write back to Data Lake
result.write \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("abfss://analytics@mydatalake.dfs.core.windows.net/customer_summary/")
Azure Data Factory Integration
{
    "name": "ADLS_LinkedService",
    "type": "Microsoft.DataFactory/factories/linkedservices",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://mydatalake.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "account-key"
            }
        }
    }
}
Or with Managed Identity (grant the factory's managed identity an RBAC role such as Storage Blob Data Contributor on the account):
{
    "name": "ADLS_MI_LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://mydatalake.dfs.core.windows.net"
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        }
    }
}
Storage Tiers and Lifecycle Management
Configure Lifecycle Policy
{
    "rules": [
        {
            "name": "archiveOldData",
            "enabled": true,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["raw/"]
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {
                            "daysAfterModificationGreaterThan": 30
                        },
                        "tierToArchive": {
                            "daysAfterModificationGreaterThan": 90
                        },
                        "delete": {
                            "daysAfterModificationGreaterThan": 365
                        }
                    }
                }
            }
        },
        {
            "name": "deleteOldSnapshots",
            "enabled": true,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"]
                },
                "actions": {
                    "snapshot": {
                        "delete": {
                            "daysAfterCreationGreaterThan": 90
                        }
                    }
                }
            }
        }
    ]
}
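To attach this policy to the account programmatically, one option is the azure-mgmt-storage management SDK; the sketch below is a hedged example (the subscription ID and resource group are placeholders, only the first rule is repeated for brevity, and lifecycle policies always live under the name "default"). The Azure portal or CLI work equally well.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"  # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Same structure as the JSON above (first rule only, for brevity)
policy = {
    "policy": {
        "rules": [
            {
                "name": "archiveOldData",
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": 30},
                            "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                            "delete": {"daysAfterModificationGreaterThan": 365}
                        }
                    }
                }
            }
        ]
    }
}

client.management_policies.create_or_update(
    "myResourceGroup", "mydatalake", "default", policy
)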
Programmatic Tier Management
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

# Listing uses the Data Lake (dfs) endpoint; tier changes go through the Blob API,
# since the Data Lake file client doesn't expose a set-tier operation
credential = DefaultAzureCredential()
blob_service = BlobServiceClient("https://mydatalake.blob.core.windows.net", credential=credential)
datalake_service = DataLakeServiceClient("https://mydatalake.dfs.core.windows.net", credential=credential)
file_system = datalake_service.get_file_system_client("raw")

def set_blob_tier(container, path, tier):
    """Set the access tier for a file (Hot, Cool, or Archive)."""
    # Note: tier changes are asynchronous; rehydrating from Archive can take hours
    blob_service.get_blob_client(container, path).set_standard_blob_tier(tier)

# Archive files not modified in the last 90 days
for path in file_system.get_paths(path="sales/2020"):
    if not path.is_directory:
        age = datetime.utcnow() - path.last_modified.replace(tzinfo=None)
        if age > timedelta(days=90):
            set_blob_tier("raw", path.name, "Archive")
            print(f"Archived: {path.name}")
Security Best Practices
Network Security
# Enable service endpoint
az network vnet subnet update \
  --name data-subnet \
  --vnet-name my-vnet \
  --resource-group myResourceGroup \
  --service-endpoints Microsoft.Storage

# Configure firewall rules
az storage account network-rule add \
  --account-name mydatalake \
  --resource-group myResourceGroup \
  --vnet-name my-vnet \
  --subnet data-subnet

# Enable private endpoint
az network private-endpoint create \
  --name mydatalake-pe \
  --resource-group myResourceGroup \
  --vnet-name my-vnet \
  --subnet private-endpoints \
  --private-connection-resource-id $(az storage account show -n mydatalake -g myResourceGroup --query id -o tsv) \
  --group-id dfs \
  --connection-name mydatalake-connection
Encryption
# Enable infrastructure encryption (can only be set when the account is created)
az storage account create \
  --name mydatalake \
  --resource-group myResourceGroup \
  --require-infrastructure-encryption true

# Use customer-managed keys (requires a managed identity with access to the key vault)
az storage account update \
  --name mydatalake \
  --resource-group myResourceGroup \
  --encryption-key-source Microsoft.Keyvault \
  --encryption-key-vault https://mykeyvault.vault.azure.net \
  --encryption-key-name storage-encryption-key
Monitoring and Diagnostics
Enable Diagnostic Settings
# StorageRead/Write/Delete logs are emitted per service, so target the blob service sub-resource
az monitor diagnostic-settings create \
  --name datalake-diagnostics \
  --resource "$(az storage account show -n mydatalake -g myResourceGroup --query id -o tsv)/blobServices/default" \
  --logs '[{"category": "StorageRead", "enabled": true}, {"category": "StorageWrite", "enabled": true}, {"category": "StorageDelete", "enabled": true}]' \
  --metrics '[{"category": "Transaction", "enabled": true}]' \
  --workspace $(az monitor log-analytics workspace show -n myworkspace -g myResourceGroup --query id -o tsv)
Query Logs
// Storage operations by path
StorageBlobLogs
| where TimeGenerated > ago(24h)
| where Uri contains "raw/sales"
| summarize count() by OperationName, bin(TimeGenerated, 1h)
| render timechart
// Failed operations
StorageBlobLogs
| where TimeGenerated > ago(24h)
| where StatusCode >= 400
| summarize count() by StatusCode, StatusText, CallerIpAddress
| order by count_ desc
// Data egress by user
StorageBlobLogs
| where TimeGenerated > ago(7d)
| where OperationName == "GetBlob"
| summarize TotalBytes = sum(ResponseBodySize) by UserAgentHeader
| order by TotalBytes desc
Best Practices Summary
- Enable hierarchical namespace - Essential for data lake workloads
- Design folder structure carefully - Supports both analytics and governance
- Use partitioning - Organize by date or key attributes
- Implement lifecycle management - Optimize costs with tiering
- Use managed identity - Avoid key-based authentication
- Enable soft delete - Protect against accidental deletion
- Monitor access patterns - Understand usage and optimize
Conclusion
ADLS Gen2 provides the foundation for modern data architectures, combining the scale of object storage with the semantics needed for analytics workloads. By understanding the architecture patterns, access methods, and security options, you can build data lakes that serve your organization’s analytics needs while maintaining proper governance and cost efficiency.