Azure Data Lake Storage Gen2: Foundation of Modern Data Platforms
ADLS Gen2 combines the scalability of Blob Storage with the hierarchical namespace needed for big data analytics.
Why Gen2?
Gen1 vs Gen2:
- Gen1: Hadoop-first, a standalone service with its own APIs and pricing
- Gen2: Blob Storage plus a hierarchical namespace, one unified platform
Gen2 is the standard for new data lakes on Azure.
Enabling Hierarchical Namespace
resource "azurerm_storage_account" "datalake" {
  name                     = "mydatalake"
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true # This enables Gen2 features
}
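The hierarchical namespace is normally chosen at account creation. Azure does offer a one-way upgrade for existing flat-namespace accounts, but once enabled it cannot be turned off.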
Organizing Your Lake
datalake/
├── raw/ # Ingested data, immutable
│ ├── sales/
│ │ ├── 2020/09/30/
│ │ └── 2020/10/01/
│ └── customers/
├── curated/ # Cleansed, transformed
│ ├── dimensions/
│ └── facts/
├── sandbox/ # Exploration, temporary
└── archive/ # Historical, rarely accessed
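The zones can be bootstrapped in code. Here is a minimal sketch with the azure-storage-file-datalake and azure-identity packages, assuming DefaultAzureCredential can authenticate against the account:
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Create the container (file system) and the top-level zones;
# date partitions under raw/ are created later by the ingestion jobs
fs = service.create_file_system("datalake")
for zone in ["raw/sales", "raw/customers", "curated/dimensions",
             "curated/facts", "sandbox", "archive"]:
    fs.create_directory(zone)  # real directories thanks to the hierarchical namespace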
Access Control
POSIX-style ACLs work alongside Azure RBAC. RBAC is evaluated first; ACLs only come into play when no RBAC role assignment already grants access:
# Set the ACL on a directory (named entries take Azure AD object IDs, not names)
az storage fs access set \
  --acl "user::rwx,group::r-x,other::---,group:<data-engineers-object-id>:rwx" \
  --path "curated/facts" \
  --file-system "datalake" \
  --account-name "mydatalake" \
  --auth-mode login
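The same grant can be applied from Python, including a default ACL entry so that new children inherit it. A sketch using azure-storage-file-datalake, with the same placeholder object ID:
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("datalake").get_directory_client("curated/facts")

# Access ACL for the directory itself, plus a default entry inherited by new children
acl = ("user::rwx,group::r-x,other::---,"
       "group:<data-engineers-object-id>:rwx,"
       "default:group:<data-engineers-object-id>:rwx")
directory.set_access_control(acl=acl)
# For existing children, see update_access_control_recursive on the same client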
Spark Integration
# Direct access via abfss:// ("datalake" is the container, aka file system)
df = spark.read.parquet("abfss://datalake@mydatalake.dfs.core.windows.net/curated/facts/sales/")

# With a service principal; the client secret and token endpoint are required too
spark.conf.set("fs.azure.account.auth.type.mydatalake.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.mydatalake.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.mydatalake.dfs.core.windows.net", "<app-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.mydatalake.dfs.core.windows.net", "<client-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.mydatalake.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
Best Practices
- Enable soft delete so accidentally deleted data can be recovered
- Use lifecycle management policies to tier or expire cold data (see the sketch after this list)
- Implement a folder structure that matches your data domains
- Apply ACLs at the directory level, not on individual files; default ACLs let new children inherit them
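Lifecycle policies can live in code as well. Below is a minimal sketch using the azure-mgmt-storage SDK; the subscription ID, resource group, rule name, and day thresholds are placeholders to adapt:
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    DateAfterModification, ManagementPolicy, ManagementPolicyAction,
    ManagementPolicyBaseBlob, ManagementPolicyDefinition, ManagementPolicyFilter,
    ManagementPolicyRule, ManagementPolicySchema,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Age out blobs under datalake/raw: cool at 30 days, archive at 90, delete at 365
rule = ManagementPolicyRule(
    enabled=True,
    name="age-out-raw",
    type="Lifecycle",
    definition=ManagementPolicyDefinition(
        filters=ManagementPolicyFilter(blob_types=["blockBlob"], prefix_match=["datalake/raw"]),
        actions=ManagementPolicyAction(
            base_blob=ManagementPolicyBaseBlob(
                tier_to_cool=DateAfterModification(days_after_modification_greater_than=30),
                tier_to_archive=DateAfterModification(days_after_modification_greater_than=90),
                delete=DateAfterModification(days_after_modification_greater_than=365),
            )
        ),
    ),
)

client.management_policies.create_or_update(
    "<resource-group>", "mydatalake", "default",  # the policy name must be "default"
    ManagementPolicy(policy=ManagementPolicySchema(rules=[rule])),
)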
ADLS Gen2 is the foundation. Build your data platform on solid ground.