2 min read
Azure Synapse Serverless SQL (Preview): Query Your Data Lake
Azure Synapse Analytics is currently in preview. One of its most exciting features is Serverless SQL (also called SQL on-demand), which lets you query your data lake files with T-SQL. No provisioning, pay per query.
Note: Synapse is in public preview with GA expected in late 2020. Features may change.
Basic Query
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
FORMAT = 'PARQUET'
) AS sales
CSV Files
SELECT *
FROM OPENROWSET(
BULK 'https://mydatalake.dfs.core.windows.net/raw/customers/*.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
) WITH (
customer_id INT,
name VARCHAR(100),
email VARCHAR(256),
created_date DATE
) AS customers
Creating Views
CREATE VIEW curated.vw_monthly_sales
AS
SELECT
year(sale_date) AS year,
month(sale_date) AS month,
region,
SUM(amount) AS total_sales
FROM OPENROWSET(
BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/**',
FORMAT = 'PARQUET'
) AS sales
GROUP BY year(sale_date), month(sale_date), region
External Tables
CREATE EXTERNAL DATA SOURCE DataLake
WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/curated/');
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
CREATE EXTERNAL TABLE curated.sales (
sale_id BIGINT,
customer_id INT,
amount DECIMAL(18,2),
sale_date DATE
)
WITH (
LOCATION = 'sales/',
DATA_SOURCE = DataLake,
FILE_FORMAT = ParquetFormat
);
Cost Management
Pricing: ~$5 per TB scanned
Optimize by:
- Using Parquet (columnar, compressed)
- Partitioning by date or other filter columns
- Selecting only needed columns
- Filtering early in queries
-- Good: Selective query
SELECT sale_date, amount
FROM curated.sales
WHERE sale_date >= '2020-01-01'
-- Bad: Full table scan
SELECT * FROM curated.sales
Serverless SQL democratizes data lake access for SQL users.