
Evaluating RAG Systems: Metrics and Automated Testing with Azure AI

Building RAG systems is straightforward; ensuring they work correctly is hard. Azure AI provides comprehensive evaluation tools that measure retrieval quality, generation accuracy, and safety. Here’s how to implement automated RAG evaluation in your CI/CD pipeline.

Core Evaluation Metrics

The Azure AI Evaluation SDK (the azure-ai-evaluation package) provides AI-assisted evaluators for the key metrics:

import os

from azure.ai.evaluation import (
    RelevanceEvaluator,
    GroundednessEvaluator,
    CoherenceEvaluator,
    RetrievalEvaluator
)

# AI-assisted evaluators need an Azure OpenAI model configuration
# pointing at the judge model deployment.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o"
}

# Initialize evaluators (each is a synchronous callable that returns a score dict)
relevance = RelevanceEvaluator(model_config=model_config)
groundedness = GroundednessEvaluator(model_config=model_config)
coherence = CoherenceEvaluator(model_config=model_config)
retrieval = RetrievalEvaluator(model_config=model_config)

# Kept async so it can be awaited alongside the async RAG pipeline below;
# the evaluator calls themselves run synchronously.
async def evaluate_rag_response(
    query: str,
    retrieved_docs: list[str],
    generated_response: str,
    ground_truth: str | None = None
) -> dict:

    context = "\n\n".join(retrieved_docs)

    results = {
        "relevance": relevance(
            query=query,
            response=generated_response
        ),
        "groundedness": groundedness(
            context=context,
            response=generated_response
        ),
        "coherence": coherence(
            query=query,
            response=generated_response
        ),
        "retrieval_quality": retrieval(
            query=query,
            context=context
        )
    }

    if ground_truth:
        from azure.ai.evaluation import SimilarityEvaluator
        similarity = SimilarityEvaluator(model_config=model_config)
        results["similarity"] = similarity(
            query=query,
            response=generated_response,
            ground_truth=ground_truth
        )

    return results
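
Each evaluator returns a dictionary with a 1–5 score (recent SDK versions also include a short reasoning string). A quick usage sketch; the query, document, and response below are made up for illustration:

import asyncio

async def main():
    scores = await evaluate_rag_response(
        query="What is the refund policy?",
        retrieved_docs=["Refunds are available within 30 days of purchase."],
        generated_response="You can request a refund within 30 days of purchase."
    )
    # Example shape (exact keys vary by SDK version):
    # {"relevance": {"relevance": 5.0, ...}, "groundedness": {"groundedness": 5.0, ...}, ...}
    print(scores)

asyncio.run(main())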

Building Test Suites

Create comprehensive test datasets that cover edge cases:

test_cases = [
    {
        "query": "What is the refund policy?",
        "expected_topics": ["30-day window", "original payment method"],
        "min_relevance": 4.0,
        "min_groundedness": 4.0
    },
    # Add more test cases...
]

async def run_evaluation_suite(rag_pipeline, test_cases):
    results = []
    for case in test_cases:
        response = await rag_pipeline.query(case["query"])
        metrics = await evaluate_rag_response(
            query=case["query"],
            retrieved_docs=response.sources,
            generated_response=response.answer
        )
        results.append({"case": case, "metrics": metrics})
    return results
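
For larger datasets, the SDK also ships a batch evaluate() helper that runs a set of evaluators over a JSONL file and writes an aggregate report. A rough sketch, assuming a test_cases.jsonl file whose fields are named query, response, and context so they map directly onto the evaluator inputs:

from azure.ai.evaluation import evaluate

batch_results = evaluate(
    data="test_cases.jsonl",  # one JSON object per line: query, response, context
    evaluators={
        "relevance": relevance,
        "groundedness": groundedness,
        "retrieval": retrieval
    },
    output_path="evaluation_results.json"
)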

CI/CD Integration

Fail builds when quality metrics drop below thresholds, catching regressions before they reach production.
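A sketch of such a pipeline gate, built on the functions above; the score keys and the build_rag_pipeline() factory are assumptions, and the thresholds come from the per-case min_relevance and min_groundedness values defined earlier:

import asyncio
import sys

async def quality_gate(rag_pipeline) -> int:
    results = await run_evaluation_suite(rag_pipeline, test_cases)
    failures = []
    for item in results:
        case, metrics = item["case"], item["metrics"]
        # Each evaluator result is a dict such as {"relevance": 4.0, ...}
        relevance_score = metrics["relevance"].get("relevance", 0)
        groundedness_score = metrics["groundedness"].get("groundedness", 0)
        if relevance_score < case.get("min_relevance", 3.0):
            failures.append(f"{case['query']}: relevance {relevance_score}")
        if groundedness_score < case.get("min_groundedness", 3.0):
            failures.append(f"{case['query']}: groundedness {groundedness_score}")
    for failure in failures:
        print(f"FAILED {failure}")
    return 1 if failures else 0

# In the pipeline step (build_rag_pipeline is your own factory):
# sys.exit(asyncio.run(quality_gate(build_rag_pipeline())))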

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.