ADF Self-Hosted Integration Runtime - Connecting On-Premises Data
The Self-Hosted Integration Runtime (SHIR) is your bridge between Azure Data Factory and on-premises or private network data sources. It enables secure data movement and transformation without exposing your internal network to the internet. Let me share how to set up, configure, and optimize SHIR for production workloads.
Understanding Integration Runtime Types
Azure Data Factory supports three IR types:
- Azure IR - Fully managed, for cloud-to-cloud scenarios
- Self-Hosted IR - Customer managed, for on-premises or private networks
- Azure-SSIS IR - For running SSIS packages in Azure
When to Use Self-Hosted IR
Use SHIR when connecting to:
- On-premises SQL Server, Oracle, SAP
- Private network resources (VNet, VPN)
- File systems behind firewalls
- Legacy systems without public endpoints
- Data sources requiring Windows authentication
Setting Up Self-Hosted IR
Step 1: Create IR in Azure
# Using PowerShell
$integrationRuntime = Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName "myResourceGroup" `
-DataFactoryName "myDataFactory" `
-Name "OnPremisesIR" `
-Type SelfHosted `
-Description "Integration runtime for on-premises data"
# Get authentication keys
$authKeys = Get-AzDataFactoryV2IntegrationRuntimeKey `
-ResourceGroupName "myResourceGroup" `
-DataFactoryName "myDataFactory" `
-Name "OnPremisesIR"
Write-Host "Auth Key 1: $($authKeys.AuthKey1)"
Step 2: Install on Windows Server
# Download and install silently
$downloadUrl = "https://go.microsoft.com/fwlink/?linkid=839822"
$installerPath = "$env:TEMP\IntegrationRuntime.msi"
Invoke-WebRequest -Uri $downloadUrl -OutFile $installerPath
Start-Process msiexec.exe -Wait -ArgumentList "/i $installerPath /quiet"
# Register with authentication key
$irPath = "C:\Program Files\Microsoft Integration Runtime\5.0\Shared"
& "$irPath\dmgcmd.exe" -Key "IR@xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx@myDataFactory@myResourceGroup@xxxxxxxxxxxxxxxxxxxxxxxx"
Step 3: Verify Registration
# Check IR status
& "$irPath\dmgcmd.exe" -Status
# Expected output:
# Integration Runtime (Self-hosted) is connected.
# Name: OnPremisesIR
# Version: 5.x.xxxx.x
# Status: Online
High Availability Configuration
Multi-Node Setup
# On additional nodes, use the same auth key
& "C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe" `
-Key "IR@xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx@myDataFactory@myResourceGroup@xxxxxxxxxxxxxxxxxxxxxxxx"
Node Management
# List nodes
Get-AzDataFactoryV2IntegrationRuntimeNode `
-ResourceGroupName "myResourceGroup" `
-DataFactoryName "myDataFactory" `
-IntegrationRuntimeName "OnPremisesIR"
# Remove a node
Remove-AzDataFactoryV2IntegrationRuntimeNode `
-ResourceGroupName "myResourceGroup" `
-DataFactoryName "myDataFactory" `
-IntegrationRuntimeName "OnPremisesIR" `
-NodeName "Node_1"
Load Balancing Configuration
{
"name": "OnPremisesIR",
"type": "SelfHosted",
"typeProperties": {
"linkedInfo": {
"resourceId": "/subscriptions/.../integrationRuntimes/SharedIR"
}
}
}
Network Configuration
Firewall Rules
Outbound ports required:
- 443 (HTTPS) - Communication with ADF service
- 1433 - Azure SQL (if using)
- 10000-20000 - Data movement (configurable)
Endpoints to allow:
- *.servicebus.windows.net
- *.frontend.clouddatahub.net
- *.core.windows.net
- login.microsoftonline.com
Proxy Configuration
# Configure proxy settings
$irConfig = @{
ProxyUrl = "http://proxy.company.com:8080"
ProxyUsername = "domain\proxyuser"
ProxyPassword = "proxypassword"
BypassProxyForLocalAddress = $true
}
& "$irPath\dmgcmd.exe" -ConfigureProxy @irConfig
Express Route Configuration
# Terraform example for private link
resource "azurerm_data_factory_integration_runtime_self_hosted" "main" {
name = "OnPremisesIR"
data_factory_id = azurerm_data_factory.main.id
}
resource "azurerm_private_endpoint" "adf" {
name = "adf-private-endpoint"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
subnet_id = azurerm_subnet.private.id
private_service_connection {
name = "adf-connection"
private_connection_resource_id = azurerm_data_factory.main.id
subresource_names = ["dataFactory"]
is_manual_connection = false
}
}
Connecting to Data Sources
SQL Server (Windows Auth)
{
"name": "OnPremSqlServer",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Server=sqlserver.company.local;Database=SalesDB;Integrated Security=True;",
"userName": "domain\\serviceaccount",
"password": {
"type": "SecureString",
"value": "password"
}
},
"connectVia": {
"referenceName": "OnPremisesIR",
"type": "IntegrationRuntimeReference"
}
}
}
Oracle Database
{
"name": "OnPremOracle",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": "Host=oracleserver.company.local;Port=1521;ServiceName=ORCL;",
"authenticationType": "Basic",
"username": "oracleuser",
"password": {
"type": "SecureString",
"value": "password"
}
},
"connectVia": {
"referenceName": "OnPremisesIR",
"type": "IntegrationRuntimeReference"
}
}
}
File System
{
"name": "OnPremFileSystem",
"properties": {
"type": "FileServer",
"typeProperties": {
"host": "\\\\fileserver\\share",
"userId": "domain\\fileuser",
"password": {
"type": "SecureString",
"value": "password"
}
},
"connectVia": {
"referenceName": "OnPremisesIR",
"type": "IntegrationRuntimeReference"
}
}
}
SAP HANA
{
"name": "SapHana",
"properties": {
"type": "SapHana",
"typeProperties": {
"server": "saphana.company.local:30015",
"authenticationType": "Basic",
"userName": "SAPUSER",
"password": {
"type": "SecureString",
"value": "password"
}
},
"connectVia": {
"referenceName": "OnPremisesIR",
"type": "IntegrationRuntimeReference"
}
}
}
Performance Optimization
Node Sizing Guidelines
Small workload (< 1 TB/day):
CPU: 4 cores
Memory: 8 GB
Network: 100 Mbps
Medium workload (1-10 TB/day):
CPU: 8 cores
Memory: 16 GB
Network: 1 Gbps
Large workload (> 10 TB/day):
CPU: 16+ cores
Memory: 32+ GB
Network: 10 Gbps
Multiple nodes recommended
Concurrent Job Configuration
# Set concurrent jobs limit
& "$irPath\dmgcmd.exe" -SetConcurrentJobsLimit 20
# Monitor concurrent jobs
& "$irPath\dmgcmd.exe" -GetConcurrentJobsLimit
Data Movement Optimization
{
"name": "CopyFromOnPrem",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlServerSource",
"sqlReaderQuery": "SELECT * FROM Sales WHERE ModifiedDate > @lastWatermark"
},
"sink": {
"type": "AzureBlobFSSource"
},
"parallelCopies": 32,
"dataIntegrationUnits": 256,
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "StagingBlob",
"type": "LinkedServiceReference"
},
"path": "staging"
}
}
}
Monitoring and Troubleshooting
Integration Runtime Monitor
# Get detailed status
$irStatus = Get-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName "myResourceGroup" `
-DataFactoryName "myDataFactory" `
-Name "OnPremisesIR" `
-Status
$irStatus | Select-Object Name, State, Version, @{
N='Nodes'; E={$_.Nodes.Count}
}, @{
N='ConcurrentJobs'; E={$_.Nodes[0].ConcurrentJobsRunning}
}
Event Logs
# Check Windows Event Logs
Get-EventLog -LogName Application -Source "Integration Runtime" -Newest 50 |
Where-Object { $_.EntryType -eq "Error" } |
Select-Object TimeGenerated, Message
# Check IR logs directly
Get-ChildItem "C:\ProgramData\Microsoft\DataTransfer\Gateway\Trace" |
Sort-Object LastWriteTime -Descending |
Select-Object -First 10
Common Issues and Solutions
# Issue: Connection timeout
# Solution: Check network connectivity
Test-NetConnection -ComputerName "sqlserver.company.local" -Port 1433
# Issue: Authentication failure
# Solution: Verify credentials
$cred = Get-Credential
$conn = New-Object System.Data.SqlClient.SqlConnection
$conn.ConnectionString = "Server=sqlserver.company.local;Database=master;Integrated Security=False;User ID=$($cred.UserName);Password=$($cred.GetNetworkCredential().Password)"
$conn.Open()
$conn.Close()
# Issue: IR offline
# Solution: Restart service
Restart-Service "DIAHostService"
Azure Monitor Integration
{
"type": "Microsoft.Insights/diagnosticSettings",
"properties": {
"logs": [
{
"category": "IntegrationRuntimeLogs",
"enabled": true
},
{
"category": "IntegrationRuntimeAvailabilityLogs",
"enabled": true
}
],
"metrics": [
{
"category": "AllMetrics",
"enabled": true
}
],
"workspaceId": "/subscriptions/.../workspaces/myworkspace"
}
}
Sharing Integration Runtime
Create Shared IR
# Primary factory - create IR
$sharedIR = Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName "centralRG" `
-DataFactoryName "centralFactory" `
-Name "SharedIR" `
-Type SelfHosted `
-Description "Shared IR for multiple factories"
# Get resource ID
$sharedIRId = $sharedIR.Id
Link to Other Factories
# Linked factory - reference shared IR
Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName "projectRG" `
-DataFactoryName "projectFactory" `
-Name "LinkedSharedIR" `
-Type SelfHosted `
-SharedIntegrationRuntimeResourceId $sharedIRId
Best Practices
- Deploy multiple nodes for high availability
- Use dedicated service accounts with minimum permissions
- Monitor resource utilization and scale as needed
- Keep IR updated to latest version
- Secure network communication with ExpressRoute or VPN
- Configure proper timeout values
- Enable diagnostic logging for troubleshooting
- Document firewall rules and network requirements
Conclusion
The Self-Hosted Integration Runtime is essential for hybrid data integration scenarios. By properly configuring high availability, network settings, and performance optimization, you can build reliable data pipelines that bridge your on-premises and cloud environments. The key is planning your deployment architecture and monitoring for issues proactively.