E-009: Evaluation & Quality

Status: Planned
Owner: @bilal
Priority: P2 (Quality & Confidence)

Objective

Catch regressions in AI conversation quality and improve emergency detection accuracy.

Tasks

Golden Test Suite (ADR-017 Phase 4)

Task	ID	Description	Status
Build golden test set	EVAL-001	30-50 tenant message scenarios with expected tool calls	Planned
Test runner	EVAL-002	Stub the LLM provider; assert tool selection and arguments	Planned
CI integration	EVAL-003	Run alongside unit tests in GitHub Actions	Planned
Retrieval metrics	EVAL-004	Track precision@5 for document retrieval; publish a monthly report	Planned
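
As a rough illustration of EVAL-002 and EVAL-004, the sketch below shows one way a golden case could be checked against a stubbed provider, and how precision@5 is computed. Every name here (ToolCall, GoldenCase, StubProvider, select_tool, run_golden_suite, precision_at_k) is hypothetical and not taken from the codebase or ADR-017; the real provider interface may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """A tool invocation: tool name plus its arguments (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class GoldenCase:
    """One golden scenario: a tenant message and the tool call we expect."""
    case_id: str
    tenant_message: str
    expected: ToolCall


class StubProvider:
    """Stand-in for the LLM provider that replays canned tool calls.

    Assumption: the real provider exposes something like select_tool().
    """

    def __init__(self, canned: dict[str, ToolCall]) -> None:
        self._canned = canned

    def select_tool(self, message: str) -> ToolCall:
        return self._canned[message]


def run_golden_suite(provider: StubProvider, cases: list[GoldenCase]) -> list[str]:
    """Return one failure description per case whose tool or arguments differ."""
    failures: list[str] = []
    for case in cases:
        actual = provider.select_tool(case.tenant_message)
        if actual.name != case.expected.name:
            failures.append(
                f"{case.case_id}: tool {actual.name!r} != {case.expected.name!r}"
            )
        elif actual.arguments != case.expected.arguments:
            failures.append(f"{case.case_id}: arguments differ")
    return failures


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    Divides by k (not by the number returned), so short result lists
    are penalised; k must be positive.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k
```

In CI (EVAL-003), a test could simply assert that run_golden_suite returns an empty list, so any change in tool selection or arguments fails the build with a readable message.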

Emergency Detection (ADR-017 Phase 5)

Task	ID	Description	Status
LLM validation	EMRG-001	Secondary check: “is this an active emergency?”	Planned
Configurable keywords	EMRG-002	Per-organisation emergency keyword lists	Planned
Country-specific contacts	EMRG-003	Configurable emergency numbers (currently UK-hardcoded)	Planned
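
The three emergency tasks could fit together roughly as sketched below: a per-organisation keyword pre-filter (EMRG-002), a secondary validation step (EMRG-001), and a country-keyed contact lookup (EMRG-003). This is a minimal sketch under assumptions: the names OrgEmergencyConfig, EmergencyResult, and detect_emergency are hypothetical, the keyword defaults are illustrative, the validator is a plain callable standing in for the LLM check, and running the keyword match before the LLM check is one possible ordering, not something ADR-017 is known to prescribe.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative defaults only; real lists would come from organisation settings.
DEFAULT_KEYWORDS = frozenset({"fire", "gas leak", "flood"})

# Emergency numbers keyed by ISO 3166-1 alpha-2 country code (small sample).
EMERGENCY_NUMBERS = {"GB": "999", "US": "911", "DE": "112"}


@dataclass
class OrgEmergencyConfig:
    """Hypothetical per-organisation settings."""
    country_code: str
    keywords: frozenset[str] = DEFAULT_KEYWORDS


@dataclass
class EmergencyResult:
    is_emergency: bool
    contact_number: Optional[str] = None  # None if no number is configured


def detect_emergency(
    message: str,
    config: OrgEmergencyConfig,
    validator: Callable[[str], bool],
) -> EmergencyResult:
    """Keyword pre-filter followed by a secondary 'active emergency?' check.

    `validator` stands in for the LLM call; it is only invoked when a
    keyword matches, which keeps the secondary check off the common path.
    """
    text = message.lower()
    if not any(keyword in text for keyword in config.keywords):
        return EmergencyResult(is_emergency=False)
    if not validator(message):
        return EmergencyResult(is_emergency=False)
    return EmergencyResult(
        is_emergency=True,
        contact_number=EMERGENCY_NUMBERS.get(config.country_code),
    )
```

Returning a missing contact number as None, instead of falling back to a default, avoids repeating the current UK-hardcoded behaviour for organisations in other countries.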

Dependencies

Document AI pipeline (E-005 Document AI & RAG Context) — for retrieval quality testing
Channel integrations (E-008 Channel Integrations) — for real-world scenarios