E-009: Evaluation & Quality

Status: Planned
Owner: @bilal
Priority: P2 (Quality & Confidence)

Objective

Catch regressions in AI conversation quality and improve emergency detection accuracy.

Tasks

Golden Test Suite (ADR-017 Phase 4)

| Task | ID | Description | Status |
|------|----|-------------|--------|
| Build golden test set | EVAL-001 | 30-50 tenant message scenarios with expected tool calls | Planned |
| Test runner | EVAL-002 | Stub LLM provider, assert tool selection + arguments | Planned |
| CI integration | EVAL-003 | Run alongside unit tests in GitHub Actions | Planned |
| Retrieval metrics | EVAL-004 | precision@5, monthly report | Planned |
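
As a rough illustration of EVAL-002 and EVAL-004, the sketch below shows one way a runner could replay golden scenarios against a stubbed provider and compare tool name and arguments, plus a plain precision@k helper. Every name here (`ToolCall`, `GoldenCase`, `StubLLMProvider`, `run_golden_suite`, `precision_at_k`) is hypothetical and not taken from the codebase or ADR-017; the real provider interface may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class GoldenCase:
    message: str        # tenant message fed to the assistant
    expected: ToolCall  # tool call the assistant is expected to produce


class StubLLMProvider:
    """Hypothetical stand-in for the real provider: replays canned tool calls."""

    def __init__(self, canned: dict[str, ToolCall]):
        self._canned = canned

    def complete(self, message: str) -> ToolCall:
        # Unknown messages fall back to a sentinel tool so mismatches are visible.
        return self._canned.get(message, ToolCall("no_tool"))


def run_golden_suite(provider: StubLLMProvider, cases: list[GoldenCase]) -> list[str]:
    """Return one readable failure per mismatched case (empty list means pass)."""
    failures = []
    for case in cases:
        actual = provider.complete(case.message)
        if actual.name != case.expected.name:
            failures.append(
                f"{case.message!r}: tool {actual.name} != {case.expected.name}"
            )
        elif actual.arguments != case.expected.arguments:
            failures.append(
                f"{case.message!r}: arguments {actual.arguments} "
                f"!= {case.expected.arguments}"
            )
    return failures


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k
```

In CI (EVAL-003), a unit test could simply assert that `run_golden_suite(...)` returns an empty list, so each failing scenario shows up by message in the test output. Dividing by `k` rather than by the number of results returned is the conventional definition and penalises short result lists; whether that suits the monthly report is a choice for the team.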

Emergency Detection (ADR-017 Phase 5)

| Task | ID | Description | Status |
|------|----|-------------|--------|
| LLM validation | EMRG-001 | Secondary check: “is this an active emergency?” | Planned |
| Configurable keywords | EMRG-002 | Per-organisation emergency keyword lists | Planned |
| Country-specific contacts | EMRG-003 | Configurable emergency numbers (currently UK-hardcoded) | Planned |
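
The three emergency tasks could fit together roughly as sketched below. This assumes the LLM check runs only after a keyword match and acts as a filter on false positives; the document does not specify that ordering. `OrgEmergencyConfig`, `detect_emergency`, the `llm_confirms` callback, and the default keyword set are all hypothetical names and values used for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative defaults only; real lists would come from organisation settings.
DEFAULT_KEYWORDS = frozenset({"fire", "gas leak", "flood"})

# Primary emergency numbers keyed by ISO 3166-1 alpha-2 country code.
EMERGENCY_NUMBERS = {"GB": "999", "US": "911", "DE": "112"}


@dataclass(frozen=True)
class OrgEmergencyConfig:
    country_code: str = "GB"
    keywords: frozenset[str] = DEFAULT_KEYWORDS


def keyword_match(message: str, config: OrgEmergencyConfig) -> bool:
    """Case-insensitive substring match against the organisation's keywords."""
    text = message.lower()
    return any(keyword in text for keyword in config.keywords)


def detect_emergency(
    message: str,
    config: OrgEmergencyConfig,
    llm_confirms: Callable[[str], bool],
) -> Optional[str]:
    """Return the emergency number to show, or None if no emergency is detected.

    `llm_confirms` stands in for the secondary "is this an active emergency?"
    check. An unconfigured country code raises KeyError so that a
    misconfiguration is not silently treated as "no emergency".
    """
    if not keyword_match(message, config):
        return None
    if not llm_confirms(message):
        return None
    return EMERGENCY_NUMBERS[config.country_code]
```

Substring matching is deliberately crude here: "fire" also matches "fireplace", which is the kind of false positive the secondary check is meant to filter out. Gating the LLM call behind the keyword match also means messages with no keyword never reach it, so recall still depends on the keyword lists.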

Dependencies