High-Performance Data Loading with COPY Command in Synapse
The COPY command is the recommended method for loading data into Azure Synapse dedicated SQL pools. It offers superior performance, simpler syntax, and better error handling than alternatives such as PolyBase external tables and bcp.
Why Use COPY?
The COPY command provides:
- Faster loading - Optimized for Synapse architecture
- Simpler syntax - No external objects required
- Better error handling - Detailed error messages and options
- Identity column support - Auto-generate identity values
- Flexible authentication - Multiple credential options
Basic COPY Syntax
-- Simple COPY from Parquet files
COPY INTO [dbo].[FactSales]
FROM 'https://storageaccount.blob.core.windows.net/container/sales/*.parquet'
WITH (
FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Managed Identity')
);
Authentication Options
Managed Identity
COPY INTO [dbo].[FactSales]
FROM 'https://storageaccount.dfs.core.windows.net/container/data/'
WITH (
FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Managed Identity')
);
Shared Access Signature
COPY INTO [dbo].[FactSales]
FROM 'https://storageaccount.blob.core.windows.net/container/data/'
WITH (
FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Shared Access Signature',
SECRET = 'sv=2020-08-04&ss=b&srt=sco&sp=rl...')
);
Storage Account Key
COPY INTO [dbo].[FactSales]
FROM 'https://storageaccount.blob.core.windows.net/container/data/'
WITH (
FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Storage Account Key',
SECRET = 'your-storage-account-key')
);
File Format Options
Parquet Files
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/*.parquet'
WITH (
FILE_TYPE = 'PARQUET'
);
CSV Files
COPY INTO [dbo].[FactSales]
(
[SalesOrderId],
[OrderDate],
[CustomerId],
[Amount]
)
FROM 'https://storage.blob.core.windows.net/container/*.csv'
WITH (
FILE_TYPE = 'CSV',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2,
ENCODING = 'UTF8',
FIELDQUOTE = '"'
);
Compressed CSV
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/*.csv.gz'
WITH (
FILE_TYPE = 'CSV',
COMPRESSION = 'Gzip',
FIELDTERMINATOR = ','
);
Column Mapping
Explicit Column Mapping
COPY INTO [dbo].[FactSales]
(
[SalesOrderId] 1, -- First column in file
[OrderDate] 2, -- Second column
[Amount] 4 -- Fourth column (skip third)
)
FROM 'https://storage.blob.core.windows.net/container/*.csv'
WITH (
FILE_TYPE = 'CSV',
FIELDTERMINATOR = ','
);
Column Mapping with Default Values
COPY INTO [dbo].[FactSales]
(
[SalesOrderId] 1,
[OrderDate] 2,
[Amount] 3,
[LoadDate] DEFAULT GETDATE(),
[BatchId] DEFAULT 'Batch-001'
)
FROM 'https://storage.blob.core.windows.net/container/*.csv'
WITH (
FILE_TYPE = 'CSV'
);
Error Handling
Reject Options
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/*.csv'
WITH (
FILE_TYPE = 'CSV',
MAXERRORS = 100, -- Allow up to 100 errors
ERRORFILE = 'https://storage.blob.core.windows.net/errors/',
ERRORFILE_CREDENTIAL = (IDENTITY = 'Managed Identity')
);
Viewing Rejected Rows
-- After a COPY with ERRORFILE, inspect the rejected rows (for example, from a serverless SQL pool)
SELECT *
FROM OPENROWSET(
BULK 'https://storage.blob.core.windows.net/errors/FactSales_rejected.csv',
FORMAT = 'CSV'
) AS errors;
Identity Column Handling
Auto-Generate Identity Values
-- Table with identity column
CREATE TABLE [dbo].[FactSales]
(
[SalesKey] BIGINT IDENTITY(1,1) NOT NULL,
[SalesOrderId] BIGINT NOT NULL,
[OrderDate] DATE NOT NULL,
[Amount] DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH([SalesOrderId]));
-- COPY generates identity values automatically
COPY INTO [dbo].[FactSales]
(
[SalesOrderId],
[OrderDate],
[Amount]
)
FROM 'https://storage.blob.core.windows.net/container/*.parquet'
WITH (
FILE_TYPE = 'PARQUET',
AUTO_CREATE_TABLE = 'OFF'
);
Advanced Scenarios
Date Format Handling
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/*.csv'
WITH (
FILE_TYPE = 'CSV',
DATEFORMAT = 'ymd' -- YYYY-MM-DD format
);
Loading Multiple File Patterns
-- Load from multiple paths
COPY INTO [dbo].[FactSales]
FROM
'https://storage.blob.core.windows.net/container/2021/10/*.parquet',
'https://storage.blob.core.windows.net/container/2021/11/*.parquet'
WITH (
FILE_TYPE = 'PARQUET'
);
Wildcard Patterns
-- Use wildcards for flexible file selection
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/sales/year=2021/month=*/day=*/*.parquet'
WITH (
FILE_TYPE = 'PARQUET'
);
Performance Optimization
File Sizing Guidelines
Optimal file sizes for COPY:
- Minimum: 256 MB per file
- Optimal: 256 MB to 1 GB per file
- Avoid: Many small files or few huge files
Parallel Loading
-- COPY automatically parallelizes across files
-- Structure data as multiple files for best performance
-- Good: /sales/part-00000.parquet through part-00059.parquet
-- This maps to 60 distributions
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/sales/*.parquet'
WITH (
FILE_TYPE = 'PARQUET'
);
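To confirm that a load spread evenly across all 60 distributions, check the per-distribution row counts for the table afterwards. A minimal check, using the same example table as above:
-- Rows and space consumed per distribution; large differences indicate skew
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');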
Compression Recommendations
-- Parquet with Snappy compression (best overall)
-- Already compressed, no COMPRESSION option needed
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/*.parquet'
WITH (
FILE_TYPE = 'PARQUET'
);
-- CSV with Gzip (good compression ratio)
COPY INTO [dbo].[FactSales]
FROM 'https://storage.blob.core.windows.net/container/*.csv.gz'
WITH (
FILE_TYPE = 'CSV',
COMPRESSION = 'Gzip'
);
Monitoring COPY Operations
Check COPY Progress
SELECT
request_id,
status,
submit_time,
total_elapsed_time,
resource_class,
command
FROM sys.dm_pdw_exec_requests
WHERE command LIKE '%COPY%'
ORDER BY submit_time DESC;
View Detailed Steps
SELECT
step_index,
operation_type,
location_type,
status,
row_count,
total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'
ORDER BY step_index;
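For long-running loads, you can also watch the data movement workers behind a specific request to see per-distribution progress. A minimal sketch; the request_id value is illustrative:
-- Per-distribution progress of the data movement behind a COPY request
SELECT
pdw_node_id,
distribution_id,
type,
status,
rows_processed,
bytes_processed,
total_elapsed_time
FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QID1234'
ORDER BY distribution_id;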
Comparison: COPY vs PolyBase vs BCP
| Feature | COPY | PolyBase | BCP |
|---|---|---|---|
| Performance | Fastest | Fast | Slower |
| Setup | Simple | Complex | Simple |
| Error Handling | Excellent | Good | Basic |
| Query Data in Place | No | Yes | No |
| Identity Columns | Yes | No | Yes |
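To make the setup comparison concrete, here is roughly what the same Parquet load looks like with PolyBase external tables. This is a sketch only: the data source, file format, and external schema names are illustrative, and the database scoped credential referenced here is assumed to already exist.
-- External data source pointing at the storage account (credential assumed to exist)
CREATE EXTERNAL DATA SOURCE SalesStorage
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://container@storageaccount.dfs.core.windows.net',
CREDENTIAL = StorageCredential
);
-- File format for the Parquet source files
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
-- External table over the files (assumes an [ext] schema exists)
CREATE EXTERNAL TABLE [ext].[FactSales]
(
[SalesOrderId] BIGINT,
[OrderDate] DATE,
[Amount] DECIMAL(18,2)
)
WITH (
LOCATION = '/sales/',
DATA_SOURCE = SalesStorage,
FILE_FORMAT = ParquetFormat
);
-- Finally, load into the fact table
INSERT INTO [dbo].[FactSales] ([SalesOrderId], [OrderDate], [Amount])
SELECT [SalesOrderId], [OrderDate], [Amount]
FROM [ext].[FactSales];
With COPY, all of this collapses into a single statement with inline credentials, which is the setup difference the table summarizes.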
ETL Pattern with COPY
-- Common ETL pattern using COPY
BEGIN TRANSACTION;
-- Truncate staging
TRUNCATE TABLE [staging].[Sales];
-- Load new data
COPY INTO [staging].[Sales]
FROM 'https://storage.blob.core.windows.net/incoming/sales_*.parquet'
WITH (
FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Managed Identity')
);
-- Merge into fact table
MERGE [dbo].[FactSales] AS target
USING [staging].[Sales] AS source
ON target.SalesOrderId = source.SalesOrderId
WHEN MATCHED THEN
UPDATE SET Amount = source.Amount, ModifiedDate = GETDATE()
WHEN NOT MATCHED THEN
INSERT (SalesOrderId, OrderDate, Amount, LoadDate)
VALUES (source.SalesOrderId, source.OrderDate, source.Amount, GETDATE());
COMMIT;
Best Practices
- Use Parquet format - Best performance and compression
- Size files appropriately - 256MB to 1GB per file
- Use Managed Identity - Simplifies credential management
- Handle errors - Configure MAXERRORS and ERRORFILE
- Monitor progress - Track long-running COPY operations
- Match file count to distributions - 60 files for optimal parallelism
Conclusion
The COPY command is the preferred method for loading data into Synapse dedicated SQL pools. Its combination of performance, simplicity, and robust error handling makes it the go-to choice for ETL workloads.
Tomorrow, we’ll explore result set caching for accelerating repeated queries.