E-009: Evaluation & Quality
Status: Planned | Owner: @bilal | Priority: P2 (Quality & Confidence)
Objective
Catch regressions in AI conversation quality and improve emergency detection accuracy.
Tasks
Golden Test Suite (ADR-017 Phase 4)
| Task | ID | Description | Status |
|---|---|---|---|
| Build golden test set | EVAL-001 | 30-50 tenant message scenarios with expected tool calls | Planned |
| Test runner | EVAL-002 | Stub LLM provider, assert tool selection + arguments | Planned |
| CI integration | EVAL-003 | Run alongside unit tests in GitHub Actions | Planned |
| Retrieval metrics | EVAL-004 | Track retrieval precision@5 and publish a monthly report | Planned |
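
As a rough illustration of EVAL-002 and EVAL-004, the sketch below shows one way a golden test runner and a precision@5 helper might look. Everything named here (`ToolCall`, `GoldenCase`, `StubLLMProvider`, `run_golden_suite`, `precision_at_k`, and the `create_maintenance_ticket` tool) is a hypothetical placeholder, not an existing interface in the codebase. The stub simply returns canned tool calls, so the suite stays deterministic and needs no network access in CI.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool selection plus its arguments (hypothetical shape)."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class GoldenCase:
    """One golden scenario: a tenant message and the tool call we expect."""
    message: str
    expected: ToolCall


class StubLLMProvider:
    """Assumed stand-in for the real LLM provider: returns canned tool calls."""

    def __init__(self, canned: dict[str, ToolCall]):
        self._canned = canned

    def select_tool(self, message: str) -> ToolCall:
        # Unknown messages fall back to a sentinel "no_tool" call.
        return self._canned.get(message, ToolCall("no_tool"))


def run_golden_suite(provider, cases: list[GoldenCase]) -> list[str]:
    """Return one failure description per mismatching case (empty means pass)."""
    failures = []
    for case in cases:
        actual = provider.select_tool(case.message)
        if actual.name != case.expected.name:
            failures.append(
                f"{case.message!r}: expected tool {case.expected.name!r}, "
                f"got {actual.name!r}"
            )
        elif actual.arguments != case.expected.arguments:
            failures.append(
                f"{case.message!r}: expected arguments {case.expected.arguments!r}, "
                f"got {actual.arguments!r}"
            )
    return failures


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k
```

In a CI job (EVAL-003), a test could call `run_golden_suite` and fail the build when the returned list is non-empty. The monthly report could average `precision_at_k` over a labelled query set.
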
Emergency Detection (ADR-017 Phase 5)
| Task | ID | Description | Status |
|---|---|---|---|
| LLM validation | EMRG-001 | Secondary check: “is this an active emergency?” | Planned |
| Configurable keywords | EMRG-002 | Per-organisation emergency keyword lists | Planned |
| Country-specific contacts | EMRG-003 | Configurable emergency numbers (currently UK-hardcoded) | Planned |
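
The three emergency tasks could fit together roughly as follows. This is a sketch under stated assumptions: `OrgEmergencyConfig`, `EmergencyResult`, `detect_emergency`, the default keyword list, and the `validator` callable (standing in for the secondary LLM check in EMRG-001) are all hypothetical names, and the number table is a small illustrative subset rather than a complete mapping.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative defaults only; real lists would come from per-organisation config.
DEFAULT_KEYWORDS = frozenset({"fire", "gas leak", "flood"})

# Small illustrative subset of country emergency numbers (ISO 3166-1 alpha-2 keys).
EMERGENCY_NUMBERS = {"GB": "999", "US": "911", "DE": "112"}


@dataclass(frozen=True)
class OrgEmergencyConfig:
    """Hypothetical per-organisation settings (EMRG-002, EMRG-003)."""
    country_code: str = "GB"
    keywords: frozenset = DEFAULT_KEYWORDS


@dataclass(frozen=True)
class EmergencyResult:
    is_emergency: bool
    contact_number: Optional[str] = None  # None if the country is not configured


def keyword_match(message: str, config: OrgEmergencyConfig) -> bool:
    """Cheap first pass: case-insensitive substring match on configured keywords."""
    text = message.lower()
    return any(keyword in text for keyword in config.keywords)


def detect_emergency(
    message: str,
    config: OrgEmergencyConfig,
    validator: Callable[[str], bool],
) -> EmergencyResult:
    """Keyword pre-filter followed by a secondary validation step (EMRG-001).

    `validator` is assumed to wrap an LLM call that answers
    "is this an active emergency?"; here it is any callable returning a bool.
    """
    if not keyword_match(message, config):
        return EmergencyResult(is_emergency=False)
    if not validator(message):
        # Keyword hit without an active emergency, e.g. "fireplace".
        return EmergencyResult(is_emergency=False)
    return EmergencyResult(
        is_emergency=True,
        contact_number=EMERGENCY_NUMBERS.get(config.country_code),
    )
```

Substring matching is deliberately crude: "fire" also matches "fireplace", which is the kind of false positive the secondary check is meant to filter out. Running the keyword pass first also means the validator is called only for messages that already look suspicious.
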
Dependencies
- Document AI pipeline (E-005 Document AI & RAG Context) — for retrieval quality testing
- Channel integrations (E-008 Channel Integrations) — for real-world scenarios