Git Integration in Databricks for Version Control
Databricks Git integration enables seamless version control for notebooks and files. Let’s explore how to set up and use Git integration effectively for collaborative development.
Supported Git Providers
Databricks supports:
- Azure DevOps
- GitHub
- GitLab
- Bitbucket
- AWS CodeCommit
Setting Up Git Credentials
Personal Access Token (GitHub)
- Generate a token in GitHub with the `repo` scope
- Add it to Databricks:
User Settings > Git Integration > Git provider: GitHub
Enter your GitHub username and the token
Azure DevOps
User Settings > Git Integration > Git provider: Azure DevOps Services
Enter Azure DevOps username and PAT
Via API
import requests

def set_git_credentials(workspace_url, token, git_provider, git_username, git_token):
    url = f"{workspace_url}/api/2.0/git-credentials"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    payload = {
        "git_provider": git_provider,
        "git_username": git_username,
        "personal_access_token": git_token
    }
    response = requests.post(url, headers=headers, json=payload)
    return response.json()

# Usage
set_git_credentials(
    "https://adb-123.0.azuredatabricks.net",
    DATABRICKS_TOKEN,
    "azureDevOpsServices",
    "user@company.com",
    AZURE_DEVOPS_PAT
)
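Before creating a credential, you can check whether one already exists. A minimal sketch, assuming the Git Credentials API also exposes a GET endpoint that returns a credentials list:

# A sketch: list existing Git credentials (assumes GET /api/2.0/git-credentials
# returns a "credentials" array)
def list_git_credentials(workspace_url, token):
    url = f"{workspace_url}/api/2.0/git-credentials"
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(url, headers=headers)
    return response.json().get("credentials", [])

existing = list_git_credentials("https://adb-123.0.azuredatabricks.net", DATABRICKS_TOKEN)
print(f"{len(existing)} credential(s) configured")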
Creating a Repo in Databricks
Via UI
- Navigate to Repos
- Click “Add Repo”
- Enter repository URL
- Select branch
Via API
def create_repo(workspace_url, token, repo_url, provider, path):
    url = f"{workspace_url}/api/2.0/repos"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": repo_url,
        "provider": provider,
        "path": path
    }
    response = requests.post(url, headers=headers, json=payload)
    return response.json()

# Create repo
repo = create_repo(
    WORKSPACE_URL,
    TOKEN,
    "https://github.com/company/data-platform",
    "gitHub",
    "/Repos/user@company.com/data-platform"
)
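The response includes the repo's numeric ID, which the branch-management calls below use:

# Keep the repo ID for later branch updates and pulls
repo_id = repo["id"]
print(f"Created repo {repo_id} at {repo['path']}")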
Branch Management
Switching Branches
def update_repo_branch(workspace_url, token, repo_id, branch):
    url = f"{workspace_url}/api/2.0/repos/{repo_id}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    payload = {"branch": branch}
    response = requests.patch(url, headers=headers, json=payload)
    return response.json()

# Switch to feature branch
update_repo_branch(WORKSPACE_URL, TOKEN, repo_id, "feature/new-etl")
Pull Latest Changes
def pull_repo(workspace_url, token, repo_id, branch):
    url = f"{workspace_url}/api/2.0/repos/{repo_id}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    # The Repos API pulls by updating the repo to a branch;
    # passing the current branch checks out its latest commit
    payload = {"branch": branch}
    response = requests.patch(url, headers=headers, json=payload)
    return response.json()
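For example, refreshing a checkout to the latest commit on main before a scheduled run:

# Pull the latest main into the checkout created earlier
pull_repo(WORKSPACE_URL, TOKEN, repo_id, "main")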
Project Structure
Recommended Repository Structure
data-platform/
├── .gitignore
├── README.md
├── notebooks/
│   ├── etl/
│   │   ├── extract.py
│   │   ├── transform.py
│   │   └── load.py
│   ├── analytics/
│   │   └── reports.py
│   └── ml/
│       └── training.py
├── src/
│   ├── __init__.py
│   ├── utils/
│   │   ├── __init__.py
│   │   └── helpers.py
│   └── models/
│       ├── __init__.py
│       └── data_models.py
├── tests/
│   ├── __init__.py
│   └── test_utils.py
├── config/
│   ├── dev.json
│   ├── staging.json
│   └── prod.json
└── jobs/
    ├── daily_etl.json
    └── weekly_reports.json
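The config/ folder holds environment-specific settings. A minimal sketch of how a job or notebook might load them, assuming workspace files can be read directly with open() and reusing the example repo path from earlier:

import json

# Hypothetical helper: load settings for one environment from config/
def load_config(env, repo_root="/Workspace/Repos/user@company.com/data-platform"):
    with open(f"{repo_root}/config/{env}.json") as f:
        return json.load(f)

config = load_config("dev")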
.gitignore for Databricks
# Databricks
.databricks/
*.egg-info/
dist/
build/
# Python
__pycache__/
*.py[cod]
*$py.class
.Python
*.so
# Jupyter/Databricks notebooks
.ipynb_checkpoints/
# IDE
.idea/
.vscode/
*.swp
# Environment
.env
.venv/
# Logs
*.log
logs/
# OS
.DS_Store
Thumbs.db
Working with Python Packages
Importing Local Modules
# In notebook, add src to path
import sys
sys.path.append("/Workspace/Repos/user@company.com/data-platform/src")
# Now import your modules
from utils.helpers import process_data
from models.data_models import SalesRecord
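Hard-coding a user-specific path breaks as soon as the production or staging checkout runs the same notebook. One option, assuming the notebook's working directory is its folder inside the repo (true on recent runtimes), is to locate the repo root programmatically:

import os
import sys

# Walk up from the notebook's directory until we find the folder containing src/
repo_root = os.getcwd()
while repo_root != "/" and not os.path.isdir(os.path.join(repo_root, "src")):
    repo_root = os.path.dirname(repo_root)

sys.path.append(os.path.join(repo_root, "src"))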
Using Relative Imports
# src/etl/extract.py
from ..utils.helpers import clean_data
from ..models.data_models import RawRecord
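Keeping logic in src/ means it can be unit-tested outside Databricks, which is what the CI pipelines below rely on. A sketch of such a test; the expected behavior of clean_data is illustrative:

# tests/test_utils.py
# Assumes src/ is on the Python path, e.g. run pytest from the repo root with PYTHONPATH=src
from utils.helpers import clean_data

def test_clean_data_strips_whitespace():
    # Illustrative expectation for the clean_data helper
    assert clean_data("  hello  ") == "hello"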
CI/CD Integration
GitHub Actions Workflow
# .github/workflows/databricks-ci.yml
name: Databricks CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: |
          pip install pytest pyspark
          pip install -r requirements.txt
      - name: Run tests
        run: pytest tests/

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install Databricks CLI
        run: pip install databricks-cli
      - name: Configure Databricks
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          echo "[DEFAULT]" > ~/.databrickscfg
          echo "host = $DATABRICKS_HOST" >> ~/.databrickscfg
          echo "token = $DATABRICKS_TOKEN" >> ~/.databrickscfg
      - name: Update Repos
        run: |
          # Get repo ID and update
          REPO_ID=$(databricks repos list --output JSON | jq -r '.repos[] | select(.path=="/Repos/Production/data-platform") | .id')
          databricks repos update --repo-id $REPO_ID --branch main
Azure DevOps Pipeline
# azure-pipelines.yml
trigger:
  - main
  - develop

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: Test
    jobs:
      - job: RunTests
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.8'
          - script: |
              pip install pytest pyspark
              pip install -r requirements.txt
            displayName: 'Install dependencies'
          - script: pytest tests/ --junitxml=test-results.xml
            displayName: 'Run tests'
          - task: PublishTestResults@2
            inputs:
              testResultsFormat: 'JUnit'
              testResultsFiles: 'test-results.xml'

  - stage: Deploy
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - job: DeployToDatabricks
        steps:
          - script: pip install databricks-cli
            displayName: 'Install Databricks CLI'
          - script: |
              echo "[DEFAULT]" > ~/.databrickscfg
              echo "host = $(DATABRICKS_HOST)" >> ~/.databrickscfg
              echo "token = $(DATABRICKS_TOKEN)" >> ~/.databrickscfg
            displayName: 'Configure Databricks CLI'
          - script: |
              python scripts/deploy.py --env production
            displayName: 'Deploy to Databricks'
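The pipeline calls scripts/deploy.py, which isn't shown above. Here is a minimal sketch of what it could look like, mirroring the sync script in the next section; the environment-to-branch mapping, the --env argument, and the use of DATABRICKS_HOST/DATABRICKS_TOKEN environment variables are assumptions for illustration:

# scripts/deploy.py -- a minimal sketch, not a definitive implementation
import argparse
import os
import requests

# Illustrative mapping of deployment environments to repo paths and branches
ENVIRONMENTS = {
    "production": {"path": "/Repos/Production/data-platform", "branch": "main"},
    "staging": {"path": "/Repos/Staging/data-platform", "branch": "develop"},
}

def get_repo_id(workspace_url, token, path):
    # List repos under the path prefix and match the exact path
    url = f"{workspace_url}/api/2.0/repos"
    headers = {"Authorization": f"Bearer {token}"}
    repos = requests.get(url, headers=headers, params={"path_prefix": path}).json().get("repos", [])
    return next((r["id"] for r in repos if r["path"] == path), None)

def update_repo_branch(workspace_url, token, repo_id, branch):
    # Updating the repo to a branch checks out the latest commit on it
    url = f"{workspace_url}/api/2.0/repos/{repo_id}"
    headers = {"Authorization": f"Bearer {token}"}
    return requests.patch(url, headers=headers, json={"branch": branch}).json()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", choices=list(ENVIRONMENTS), required=True)
    args = parser.parse_args()

    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]
    target = ENVIRONMENTS[args.env]

    repo_id = get_repo_id(host, token, target["path"])
    update_repo_branch(host, token, repo_id, target["branch"])
    print(f"Updated {target['path']} to {target['branch']}")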
Multi-Environment Strategy
Environment-Specific Branches
main -> Production
develop -> Staging
feature/* -> Development
Repo Structure Per Environment
/Repos/
├── Production/
│   └── data-platform/ (main branch)
├── Staging/
│   └── data-platform/ (develop branch)
└── user@company.com/
    └── data-platform/ (feature branches)
Automated Sync Script
# scripts/sync_repos.py
import requests

# update_repo_branch and create_repo are the helpers defined earlier in this post

def get_repo_id(workspace_url, token, path):
    """Return the ID of the repo checked out at `path`, or None if it doesn't exist."""
    url = f"{workspace_url}/api/2.0/repos"
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(url, headers=headers, params={"path_prefix": path})
    repos = response.json().get("repos", [])
    return next((r["id"] for r in repos if r["path"] == path), None)

def sync_environment_repos(workspace_url, token, repo_configs):
    for config in repo_configs:
        repo_id = get_repo_id(workspace_url, token, config["path"])
        if repo_id:
            # Update existing repo
            update_repo_branch(workspace_url, token, repo_id, config["branch"])
            print(f"Updated {config['path']} to {config['branch']}")
        else:
            # Create new repo
            create_repo(
                workspace_url, token,
                config["url"],
                config["provider"],
                config["path"]
            )
            print(f"Created {config['path']}")

# Configuration
repos = [
    {
        "url": "https://github.com/company/data-platform",
        "provider": "gitHub",
        "path": "/Repos/Production/data-platform",
        "branch": "main"
    },
    {
        "url": "https://github.com/company/data-platform",
        "provider": "gitHub",
        "path": "/Repos/Staging/data-platform",
        "branch": "develop"
    }
]

sync_environment_repos(WORKSPACE_URL, TOKEN, repos)
Best Practices
- Use feature branches - Develop in isolated branches
- PR reviews - Require reviews before merging
- Automated testing - Run tests on PRs
- Environment separation - Use separate repo checkouts and branches for prod, staging, and dev
- Consistent structure - Follow standard project layout
- Document notebooks - Include README files
- Version dependencies - Use requirements.txt
Conclusion
Git integration transforms Databricks into a proper development environment with version control, collaboration, and CI/CD capabilities. By following these practices, teams can maintain code quality and streamline deployments.
Tomorrow, we’ll wrap up the series with Repos in Databricks for managing production deployments.