
Microsoft Build 2025 Preview: What to Expect for Data and AI

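This preview collects my predictions for data and AI at Microsoft Build 2025. Where a prediction touches code, I show the approach that works today next to what may change. Anything marked speculative is a guess, not an announcement.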

Expected Themes

1. AI-First Development

Build 2025 will likely emphasize AI as a core part of every developer workflow:

Expected Announcements:
├── GitHub Copilot Workspace GA
├── Copilot for Azure Portal enhancements
├── AI-assisted debugging in VS Code
├── Natural language infrastructure deployment
└── Copilot for data engineering

2. Agent Platform Maturity

Azure AI Agent Service is expected to mature, particularly around multi-agent orchestration and persistent memory:

# Speculative: Enhanced Agent SDK at Build 2025
# Using existing patterns from Azure AI and Semantic Kernel

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

# Connect to an Azure AI Foundry project (the endpoint is a placeholder)
credential = DefaultAzureCredential()
client = AIProjectClient(
    credential=credential,
    endpoint="https://your-project.api.azureml.ms"
)

# Create an agent; persistent memory is the speculative part and is not configured here
agent = client.agents.create_agent(
    model="gpt-4o",
    name="data-analyst",
    instructions="You are a data analyst agent.",
    tools=[
        {"type": "code_interpreter"},
        {"type": "file_search"}
    ]
)

# Enhanced orchestration with Semantic Kernel
kernel = sk.Kernel()
kernel.add_service(AzureChatCompletion(
    deployment_name="gpt-4o",
    endpoint="https://your-resource.openai.azure.com/"
))

# Future: Native MCP server support expected

3. Fabric Evolution

Microsoft Fabric is expected to receive significant updates:

Predicted Fabric Updates:
├── Real-Time Intelligence GA enhancements
├── Copilot for all Fabric experiences
├── Cross-cloud data sharing
├── Enhanced governance (Purview integration)
├── Fabric for startups (free tier?)
└── Native vector search in OneLake

4. Model Innovation

New models and capabilities are expected:

# Speculative: New model capabilities
from openai import AzureOpenAI

# Credentials are read from the AZURE_OPENAI_API_KEY environment variable
client = AzureOpenAI(
    api_version="2024-12-01-preview",
    azure_endpoint="https://your-resource.openai.azure.com/"
)

# Phi-4: Next generation small model
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Analyze this data..."}]
)
# Future: On-device deployment options expected

# GPT-4.5 or GPT-5 preview?
response = client.chat.completions.create(
    model="gpt-5-preview",  # Speculative
    messages=[...],
    # Enhanced reasoning capabilities expected
)

# Multimodal improvements
response = client.chat.completions.create(
    model="gpt-4o-next",  # Speculative
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "..."}},  # Native video
            {"type": "text", "text": "Analyze this meeting recording"}
        ]
    }]
)

Data Platform Predictions

1. Unified Data + AI Platform

Current State:
├── Azure AI Foundry (AI development)
├── Microsoft Fabric (Data platform)
├── Power Platform (Low-code)
└── Dynamics 365 (Business apps)

Predicted Convergence:
└── Single unified platform with:
    ├── Seamless data flow
    ├── Integrated AI capabilities
    ├── Unified governance
    └── Single billing/management

2. Vector Search Native in Fabric

# Speculative: Native vector operations in Fabric

# In Fabric Warehouse (T-SQL)
"""
CREATE TABLE documents_with_vectors (
    id INT PRIMARY KEY,
    content VARCHAR(MAX),
    embedding VECTOR(1536)  -- Native vector type (speculative)
);

-- Native vector search (speculative; argument order follows Azure SQL's VECTOR_DISTANCE)
SELECT TOP (10) id, content
FROM documents_with_vectors
ORDER BY VECTOR_DISTANCE('cosine', embedding, @query_vector);
"""

# Current approach: Use Spark with vector libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read documents
df = spark.read.table("lakehouse.documents")

# Generate embeddings using Azure OpenAI
from openai import AzureOpenAI

def get_embedding(text: str) -> list:
    # Build the client inside the function: the UDF runs on Spark executors,
    # and a client object created on the driver cannot be reliably pickled.
    client = AzureOpenAI(
        api_version="2024-02-15-preview",
        azure_endpoint="https://your-resource.openai.azure.com/"
    )
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    return response.data[0].embedding

# Apply to dataframe (using UDF)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

embed_udf = udf(get_embedding, ArrayType(DoubleType()))
df_vectors = df.withColumn("embedding", embed_udf(F.col("content")))

# Save with embeddings
df_vectors.write.format("delta").saveAsTable("lakehouse.documents_with_embeddings")
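
Once embeddings are stored, a vector search is a nearest-neighbour ranking. The sketch below shows roughly what a distance-ordered query computes, using cosine distance in plain Python. The function names `cosine_distance` and `top_k` and the toy three-dimensional vectors are illustrative assumptions, not part of any Fabric API.

```python
import math

def cosine_distance(a: list, b: list) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # Treat zero vectors as unrelated
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query: list, rows: list, k: int = 10) -> list:
    """rows is a list of (doc_id, embedding); return the k nearest doc ids."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy data (hypothetical): real embeddings have hundreds or thousands of dimensions
docs = [
    (1, [1.0, 0.0, 0.0]),
    (2, [0.9, 0.1, 0.0]),
    (3, [0.0, 1.0, 0.0]),
]
nearest = top_k([1.0, 0.0, 0.0], docs, k=2)
```

This brute-force scan compares the query against every row, which is fine for small tables. A native index would presumably avoid the full scan, but that is a guess about an unannounced feature.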

3. Real-Time AI Pipelines

# Current approach: Streaming AI with Spark and Azure OpenAI
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import json

spark = SparkSession.builder.getOrCreate()

# Read from Event Hubs
# (eventhub_config and schema are assumed to be defined elsewhere)
stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**eventhub_config) \
    .load()

# Parse events
parsed = stream_df.select(
    F.from_json(F.col("body").cast("string"), schema).alias("data")
).select("data.*")

# AI enrichment function
def classify_risk(transaction_json: str) -> str:
    from openai import AzureOpenAI
    client = AzureOpenAI(
        api_version="2024-02-15-preview",
        azure_endpoint="https://your-resource.openai.azure.com/"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Classify risk level (low/medium/high) for this transaction: {transaction_json}"
        }],
        max_tokens=10
    )
    return response.choices[0].message.content

# Register UDF
classify_udf = F.udf(classify_risk, StringType())

# Apply AI classification
enriched = parsed.withColumn(
    "risk_level",
    classify_udf(F.to_json(F.struct("*")))
)

# Write results
query = enriched.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/ai_enrichment") \
    .toTable("enriched_transactions")
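
The model replies in free text, so a downstream filter on `risk_level` can break when the answer comes back as "High." or "Risk level: medium". One option is to normalize the reply before writing it. This is a minimal sketch; the helper name `normalize_risk_label` and the `"unknown"` fallback are my assumptions, not part of any SDK.

```python
import re
from typing import Optional

VALID_LEVELS = ("low", "medium", "high")

def normalize_risk_label(raw: Optional[str]) -> str:
    """Map a free-text model reply to low/medium/high, else 'unknown'."""
    if not raw:
        return "unknown"
    # Match whole words so that "highlight" does not count as "high"
    found = {word for word in re.findall(r"[a-z]+", raw.lower()) if word in VALID_LEVELS}
    # A reply naming no level, or more than one, is ambiguous and not trusted
    return found.pop() if len(found) == 1 else "unknown"

label = normalize_risk_label("High.")
```

Calling this on the last line of `classify_risk` would keep the Delta column to a fixed set of values, which makes alerting rules easier to write.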

Developer Experience Predictions

1. Natural Language Development

# Speculative: Natural language code generation
# Current approach using Azure OpenAI

from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2024-02-15-preview",
    azure_endpoint="https://your-resource.openai.azure.com/"
)

def generate_pipeline_code(description: str) -> str:
    """Generate data pipeline code from natural language description."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a data engineering assistant. Generate Python/PySpark
                code for data pipelines based on user descriptions. Use best practices and
                include error handling."""
            },
            {
                "role": "user",
                "content": description
            }
        ]
    )
    return response.choices[0].message.content

# Generate entire pipeline from description
pipeline_code = generate_pipeline_code("""
    Create a data pipeline that:
    1. Reads from Salesforce
    2. Joins with customer master in Fabric
    3. Enriches with AI classification
    4. Writes to gold layer
    5. Refreshes Power BI
""")

print(pipeline_code)
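
Generated code usually arrives wrapped in a Markdown fence, and it may not even parse. Before saving or reviewing it, a cheap first check is to pull out the code and run it through Python's parser. This is a sketch under assumptions: `extract_code` and `is_valid_python` are hypothetical helper names, and a passing syntax check says nothing about whether the pipeline is correct or safe to run.

```python
import ast
import re

FENCE = "`" * 3  # Markdown code-fence marker

def extract_code(reply: str) -> str:
    """Return the first fenced code block in reply, or the reply itself."""
    pattern = re.escape(FENCE) + r"[a-zA-Z]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    return match.group(1) if match else reply

def is_valid_python(source: str) -> bool:
    """True if source parses as Python; the code is never executed."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Simulated model reply (hypothetical content)
reply = "Here is the pipeline:\n" + FENCE + "python\ndf = spark.read.table('sales')\n" + FENCE
code = extract_code(reply)
parses = is_valid_python(code)
```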

2. AI-Assisted Debugging

# Current approach: AI-assisted error analysis
from openai import AzureOpenAI
import json
import traceback

client = AzureOpenAI(
    api_version="2024-02-15-preview",
    azure_endpoint="https://your-resource.openai.azure.com/"
)

def analyze_error(error: Exception, code_context: str = "") -> dict:
    """Use AI to analyze an error and suggest fixes."""
    error_info = {
        "type": type(error).__name__,
        "message": str(error),
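        # Format the traceback of the error that was passed in, so this also
        # works when the function is called outside an except block
        "traceback": "".join(traceback.format_exception(type(error), error, error.__traceback__))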
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a debugging assistant. Analyze the error and respond with a
                JSON object containing exactly these keys:
                - root_cause: root cause analysis
                - suggested_fix: suggested fix
                - similar_issues: similar issues that might be related"""
            },
            {
                "role": "user",
                "content": f"Error: {json.dumps(error_info)}\n\nCode context:\n{code_context}"
            }
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

# When an error occurs
try:
    result = my_pipeline.run()
except Exception as e:
    # AI analyzes the error
    analysis = analyze_error(e, code_context="...")

    print(f"Root cause: {analysis['root_cause']}")
    print(f"Suggested fix: {analysis['suggested_fix']}")
    print(f"Similar issues: {analysis['similar_issues']}")
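
Even with JSON mode, the reply is not guaranteed to contain the `root_cause`, `suggested_fix`, and `similar_issues` keys that the print statements index into. A defensive parser avoids a `KeyError` inside your error handler. This is a minimal sketch; `parse_analysis` and the `"not provided"` placeholder are hypothetical.

```python
import json

EXPECTED_KEYS = ("root_cause", "suggested_fix", "similar_issues")

def parse_analysis(raw: str) -> dict:
    """Parse the model's JSON reply, filling missing keys with a placeholder."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}  # A JSON array or scalar is not a usable analysis
    return {key: data.get(key, "not provided") for key in EXPECTED_KEYS}

# Simulated partial reply (hypothetical content)
analysis = parse_analysis('{"root_cause": "Null column in join key"}')
```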

Enterprise Features

1. Enhanced Security

# Speculative: AI-aware security policies
# Current approach: Use Azure Policy and Purview

security_policy:
  ai_governance:
    - rule: "no_pii_in_prompts"
      action: "redact"
    - rule: "audit_all_ai_calls"
      destination: "azure_monitor"
    - rule: "model_access_by_role"
      config:
        gpt-4o: ["data_scientists", "ml_engineers"]
        gpt-4o-mini: ["all_developers"]
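
Until policies like this exist natively, a `no_pii_in_prompts` rule with a `redact` action can be approximated in application code. The sketch below masks email addresses and phone-like digit runs before a prompt is sent. The patterns are deliberately simple and the name `redact_pii` is hypothetical; a production system would more likely call a dedicated PII detection service.

```python
import re

# Deliberately simple patterns, for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact_pii(prompt: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    prompt = EMAIL_RE.sub("[EMAIL]", prompt)
    return PHONE_RE.sub("[PHONE]", prompt)

safe = redact_pii("Contact jane@example.com or +61 400 123 456 about order 42")
```

Short numbers such as the order id are left alone because the phone pattern requires at least nine digits or separators. Long numeric ids would be masked too, which is one reason these patterns are only a starting point.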

2. Cost Management

# Current approach: Track AI costs with custom logging
from openai import AzureOpenAI
from azure.monitor.opentelemetry import configure_azure_monitor
import logging

# Reads the connection string from APPLICATIONINSIGHTS_CONNECTION_STRING
configure_azure_monitor()
logger = logging.getLogger(__name__)

class CostTrackingClient:
    """Wrapper to track AI API costs."""

    # Pricing per 1M tokens (example rates)
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "text-embedding-3-large": {"input": 0.13, "output": 0}
    }

    def __init__(self, monthly_limit: float = 10000):
        self.client = AzureOpenAI(
            api_version="2024-02-15-preview",
            azure_endpoint="https://your-resource.openai.azure.com/"
        )
        self.monthly_limit = monthly_limit
        self.monthly_spend = 0

    def chat_completion(self, model: str, messages: list, **kwargs):
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )

        # Calculate cost
        usage = response.usage
        pricing = self.PRICING.get(model, {"input": 0, "output": 0})
        cost = (usage.prompt_tokens * pricing["input"] +
                usage.completion_tokens * pricing["output"]) / 1_000_000

        self.monthly_spend += cost

        # Log for monitoring
        logger.info(f"AI API call: model={model}, cost=${cost:.4f}, total=${self.monthly_spend:.2f}")

        # Alert if approaching limit
        if self.monthly_spend > self.monthly_limit * 0.8:
            logger.warning(f"AI spend at {100*self.monthly_spend/self.monthly_limit:.1f}% of monthly limit")

        return response

# Usage
cost_client = CostTrackingClient(monthly_limit=10000)
response = cost_client.chat_completion("gpt-4o-mini", messages=[...])
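
To make the cost formula concrete, here is the same arithmetic as a standalone function with one worked call. The rates are the example figures from the wrapper above, not current prices, and `estimate_cost` is a hypothetical helper.

```python
# Example rates in USD per 1M tokens, copied from the wrapper above
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of one call; raise on unknown models."""
    if model not in PRICING:
        raise KeyError(f"No pricing configured for model: {model}")
    rates = PRICING[model]
    return (prompt_tokens * rates["input"]
            + completion_tokens * rates["output"]) / 1_000_000

# 1,000 prompt tokens and 500 completion tokens on gpt-4o:
# (1000 * 2.50 + 500 * 10.00) / 1,000,000 = 0.0075 USD
cost = estimate_cost("gpt-4o", 1_000, 500)
```

Unlike the wrapper, which prices unknown models at zero, this version raises an error. That is a design choice: a missing price entry then shows up as a failure instead of as silently untracked spend.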

What to Watch For

  1. Keynote announcements: Major platform changes
  2. Model announcements: New versions, capabilities
  3. Pricing changes: Often announced at Build
  4. Preview releases: Early access to new features
  5. Partner integrations: Ecosystem expansions

Preparing for Build

  1. Review current architecture: Know what you have
  2. Identify gaps: What problems need solving?
  3. Budget planning: New features may mean new costs
  4. Skills assessment: Will your team need training?
  5. Watch sessions: Plan which talks to attend

Build 2025 promises to be significant for data and AI professionals. Stay tuned for the actual announcements and be ready to experiment with new capabilities.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.