Evaluating RAG Systems: Metrics and Automated Testing with Azure AI
Building RAG systems is straightforward; ensuring they work correctly is hard. Azure AI provides comprehensive evaluation tools that measure retrieval quality, generation accuracy, and safety. Here’s how to implement automated RAG evaluation in your CI/CD pipeline.
Core Evaluation Metrics
The Azure AI Evaluation SDK calculates the key metrics automatically:
import os

from azure.ai.evaluation import (
    RelevanceEvaluator,
    GroundednessEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    RetrievalEvaluator,
)
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

# Optional: Azure AI Foundry project client (not required by the standalone evaluators below)
project = AIProjectClient(
    endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

# Azure OpenAI connection used by the LLM-judged evaluators (adjust env var names to your setup)
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

# Initialize evaluators
relevance = RelevanceEvaluator(model_config=model_config)
groundedness = GroundednessEvaluator(model_config=model_config)
coherence = CoherenceEvaluator(model_config=model_config)
retrieval = RetrievalEvaluator(model_config=model_config)

def evaluate_rag_response(
    query: str,
    retrieved_docs: list[str],
    generated_response: str,
    ground_truth: str | None = None,
) -> dict:
    # The built-in evaluators are synchronous callables; each returns a dict of scores
    context = "\n\n".join(retrieved_docs)
    results = {
        "relevance": relevance(query=query, response=generated_response),
        "groundedness": groundedness(context=context, response=generated_response),
        "coherence": coherence(query=query, response=generated_response),
        "retrieval_quality": retrieval(query=query, context=context),
    }
    if ground_truth:
        from azure.ai.evaluation import SimilarityEvaluator
        similarity = SimilarityEvaluator(model_config=model_config)
        results["similarity"] = similarity(
            query=query,
            response=generated_response,
            ground_truth=ground_truth,
        )
    return results
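Called in isolation, the helper looks like this; the query, documents, and answer below are made-up placeholders, and each returned entry is the corresponding evaluator's score dictionary:

# Hypothetical single-shot check with hard-coded inputs
sample_docs = [
    "Refunds are available within 30 days of purchase.",
    "Refunds are issued to the original payment method.",
]
sample_answer = (
    "You can request a refund within 30 days, and it is returned "
    "to your original payment method."
)
scores = evaluate_rag_response(
    query="What is the refund policy?",
    retrieved_docs=sample_docs,
    generated_response=sample_answer,
)
print(scores["groundedness"])  # the groundedness evaluator's score dict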
Building Test Suites
Create comprehensive test datasets that cover edge cases:
test_cases = [
    {
        "query": "What is the refund policy?",
        "expected_topics": ["30-day window", "original payment method"],
        "min_relevance": 4.0,
        "min_groundedness": 4.0,
    },
    # Add more test cases...
]
async def run_evaluation_suite(rag_pipeline, test_cases):
    results = []
    for case in test_cases:
        # The pipeline itself is async; the evaluators above are called synchronously
        response = await rag_pipeline.query(case["query"])
        metrics = evaluate_rag_response(
            query=case["query"],
            retrieved_docs=response.sources,
            generated_response=response.answer,
        )
        results.append({"case": case, "metrics": metrics})
    return results
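The per-case thresholds only matter if something enforces them. A minimal checker, assuming each evaluator reports its score under a key named after the metric (for example "relevance"), might look like this sketch:

def check_thresholds(results: list[dict]) -> list[str]:
    """Return a human-readable failure message for every case that misses a threshold."""
    failures = []
    for item in results:
        case, metrics = item["case"], item["metrics"]
        # Assumption: score keys mirror the metric names ("relevance", "groundedness")
        relevance_score = metrics["relevance"].get("relevance", 0.0)
        groundedness_score = metrics["groundedness"].get("groundedness", 0.0)
        if relevance_score < case.get("min_relevance", 0.0):
            failures.append(
                f"{case['query']}: relevance {relevance_score} < {case['min_relevance']}"
            )
        if groundedness_score < case.get("min_groundedness", 0.0):
            failures.append(
                f"{case['query']}: groundedness {groundedness_score} < {case['min_groundedness']}"
            )
    return failures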
CI/CD Integration
Fail builds when quality metrics drop below thresholds, catching regressions before they reach production.
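One way to wire this in is a small gate script that runs the suite and exits non-zero on any failure; RAGPipeline is a stand-in for your own pipeline class:

import asyncio
import sys

async def main():
    pipeline = RAGPipeline()  # hypothetical: replace with your own RAG pipeline
    results = await run_evaluation_suite(pipeline, test_cases)
    failures = check_thresholds(results)
    if failures:
        print("RAG evaluation failed:")
        for failure in failures:
            print(f"  - {failure}")
        sys.exit(1)  # non-zero exit code fails the CI job
    print(f"All {len(results)} test cases passed their thresholds.")

if __name__ == "__main__":
    asyncio.run(main())

Run it as a dedicated pipeline step (for example against a staging deployment) and gate the release on its exit code.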