ADR-017: RAG Pipeline v2 — LLM-Driven Orchestration & Retrieval Improvements
Status: In Review | Owners: @bilal, @deen | Date: 2026-02-15 | Supersedes: ADR-015 Identity-First Conversations (identity design preserved, orchestration redesigned)
Context
What triggered this review
Manual testing revealed fundamental flaws in keyword-based conversation orchestration:
- “My boiler is not working” skipped detail gathering → `hasRichDetail()` falsely counted “boiler” as a location
- “gimme a min I’ll take pics” created the issue immediately → the system couldn’t recognise the intent to send photos before filing
- “I have a leak in the bathroom” treated as sufficient → never asked whether the leak was from a pipe, tap, or ceiling
Root cause: Keyword matching for critical flow decisions. The LLM is used for intent classification and response generation, but all flow control uses hardcoded word lists.
Full Pipeline Audit Findings
- Conversation orchestration: implicit state machine via message prefixes, fragile `hasRichDetail()`, no conversation expiry, no idempotency
- Intent classification: quick classification bypasses the LLM too aggressively and is hardcoded to a single provider
- Retrieval: Vector-only search, no deduplication, no diversity, no metadata enrichment
- Chunking: Word-count based (not semantic), no metadata, small chunks silently dropped
- Emergency detection: Substring matching with false positives, UK-hardcoded
What ADR-015 got right (preserved)
- Identity-first flow: UNIDENTIFIED → IDENTIFIED → CONFIRMED → ACTIVE
- WhatsApp media ingestion pipeline
- Contextual RAG with property knowledge
- Auto-confirm for high-confidence single-property matches
Decision
1. LLM Tool Use for Conversation Orchestration
Replace keyword state machine with Claude tool use. Define tools the LLM can invoke:
| Tool | When to Use |
|---|---|
| `ask_for_details` | Tenant mentioned a problem but specifics are needed |
| `ask_for_photo` | Only after clear details (location + description) |
| `create_issue` | Clear description + location + photo provided/declined |
| `respond` | Greetings, acknowledgments, “I’ll send later” |
| `escalate` | Tenant is frustrated, the issue is too complex, or they request a human |
System prompt provides guardrails (tone, gathering rules, emergency bypass, property context from RAG).
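A minimal sketch of the orchestration call, assuming the Anthropic TypeScript SDK; the model name, tool schema, and helper name are illustrative, not final:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Sketch: ask the LLM to pick the next conversational action via tool use.
async function pickNextAction(systemPrompt: string, tenantMessage: string) {
  const response = await client.messages.create({
    model: 'claude-3-5-haiku-latest', // Haiku per the mitigations below
    max_tokens: 1024,
    temperature: 0,                   // determinism mitigation, see Consequences
    system: systemPrompt,             // tone, gathering rules, emergency bypass, RAG context
    tools: [
      {
        name: 'ask_for_details',
        description: 'Tenant mentioned a problem but specifics are needed',
        input_schema: {
          type: 'object',
          properties: { question: { type: 'string' } },
          required: ['question'],
        },
      },
      // ask_for_photo, create_issue, respond, escalate are declared the same way
    ],
    messages: [{ role: 'user', content: tenantMessage }],
  });
  // The first tool_use block is the LLM's flow decision
  return response.content.find((block) => block.type === 'tool_use');
}
```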
What this eliminates: `hasRichDetail()`, `buildDetailRequest()`, message prefix state tracking, `handleDetailResponse/PhotoResponse`, `isRepairRequest()`, `generateConfirmation()`
What stays: emergency keyword detection (speed-critical), identity flow, `createIssueFromConversation()`, `notifyLandlord()`, the duplicate guard
2. Conversation State in Database
```sql
ALTER TABLE conversations ADD COLUMN gathering_state text DEFAULT 'idle'
  CHECK (gathering_state IN ('idle', 'gathering_details', 'awaiting_photo', 'issue_created'));
ALTER TABLE conversations ADD COLUMN identity_status text DEFAULT 'unidentified'
  CHECK (identity_status IN ('unidentified', 'identified', 'confirmed', 'active'));
```

Two orthogonal state dimensions: identity and gathering. Each is updated when the LLM invokes the corresponding tool.
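A sketch of how a tool invocation maps onto `gathering_state`, assuming a node-postgres client; the helper name and mapping constant are hypothetical:

```typescript
import { Pool } from 'pg';

const db = new Pool(); // connection config comes from the environment

// Derived from the tool table in section 1; respond/escalate change nothing.
const TOOL_TO_STATE: Record<string, string> = {
  ask_for_details: 'gathering_details',
  ask_for_photo: 'awaiting_photo',
  create_issue: 'issue_created',
};

async function recordToolInvocation(conversationId: string, tool: string): Promise<void> {
  const next = TOOL_TO_STATE[tool];
  if (!next) return; // respond/escalate leave gathering_state unchanged
  await db.query(
    'UPDATE conversations SET gathering_state = $1 WHERE id = $2',
    [next, conversationId],
  );
}
```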
3. Improved Retrieval
- Hybrid search: pgvector + pg_trgm (RRF fusion, 60/40 vector/keyword weight)
- Metadata-enriched chunks: document title + section heading on each chunk
- Better chunking: 300-500 tokens with 50-100 token overlap, paragraph-level with heading context
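A minimal sketch of the fusion step, assuming the pgvector and pg_trgm queries each return an ordered list of chunk IDs; the weights match the 60/40 split above and `k = 60` is the conventional RRF constant:

```typescript
// Weighted Reciprocal Rank Fusion over two ranked ID lists (hypothetical helper).
function rrfFuse(vectorIds: string[], keywordIds: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  const add = (ids: string[], weight: number) =>
    ids.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based in RRF
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  add(vectorIds, 0.6);  // pgvector similarity results
  add(keywordIds, 0.4); // pg_trgm keyword results
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Keying by chunk ID also gives the deduplication the audit called for: a chunk returned by both retrievers is scored once, with both contributions summed.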
4. Emergency Detection: Keep Keyword + LLM Validation
Keyword detection triggers an immediate safety response. LLM validation then runs asynchronously to reduce false positives (“I read about a fire on the news” → not an emergency). Emergency keywords become configurable per org/country.
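A sketch of the keyword-gate-plus-async-validation shape; the dependency names are hypothetical:

```typescript
// Fast keyword gate; the LLM check never delays the safety response.
type EmergencyDeps = {
  keywords: string[];                                  // configurable per org/country
  sendSafetyResponse: () => Promise<void>;             // immediate, keyword-triggered
  validateWithLLM: (text: string) => Promise<boolean>;
};

async function handlePossibleEmergency(deps: EmergencyDeps, text: string): Promise<boolean> {
  const lower = text.toLowerCase();
  if (!deps.keywords.some((k) => lower.includes(k))) return false;
  await deps.sendSafetyResponse();                     // speed-critical path, no LLM wait
  void deps.validateWithLLM(text).then((isEmergency) => {
    if (!isEmergency) {
      // e.g. "I read about a fire on the news": log to tune the keyword list
      console.info('emergency false positive', { text });
    }
  });
  return true;
}
```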
5. Evaluation Framework
- Tier 1 (per-commit): 30-50 golden test scenarios with expected tool calls (see the sketch after this list)
- Tier 2 (weekly): Retrieval quality — precision@5, context relevance
- Tier 3 (monthly): 50 real conversation samples, LLM-as-judge scoring
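A hypothetical shape for a Tier 1 golden scenario; the field names are illustrative, and the expected tools come straight from the failures in the Context section:

```typescript
interface GoldenScenario {
  name: string;
  messages: string[];       // inbound tenant messages, in order
  expectedTools: string[];  // tool names the LLM must invoke, in order
}

const scenarios: GoldenScenario[] = [
  {
    name: 'vague report gathers details first',
    messages: ['My boiler is not working'],
    expectedTools: ['ask_for_details'], // must not jump straight to create_issue
  },
  {
    name: 'promised photo is not an issue yet',
    messages: ["gimme a min I'll take pics"],
    expectedTools: ['respond'],
  },
];
```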
Implementation Phases
- LLM Tool Use Orchestration — Rewrite process.ts, add gathering_state, remove keyword logic
- Identity-First Integration — Add identity_status, gate tools by identity state
- Retrieval Improvements — pg_trgm, RRF hybrid search, metadata enrichment
- Evaluation Framework — Golden test set in CI
- Emergency Detection — LLM validation, configurable keywords
Migration Strategy
New `processInboundMessageV2()` behind feature flag `USE_LLM_ORCHESTRATION=true`. Both paths share the same issue creation, notification, and retrieval modules. Remove v1 after validation.
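A sketch of the flag-gated dispatch; the module path, env-var read, and v1 function name are assumptions:

```typescript
// Hypothetical module path; the v2 function name follows this ADR.
import { processInboundMessage, processInboundMessageV2 } from './process';

type InboundMessage = { from: string; body: string }; // placeholder shape

export async function routeInbound(msg: InboundMessage): Promise<void> {
  if (process.env.USE_LLM_ORCHESTRATION === 'true') {
    return processInboundMessageV2(msg); // new tool-use orchestration
  }
  return processInboundMessage(msg);     // v1 keyword path, removed after validation
}
```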
Consequences
Positive
- Natural conversations (LLM asks contextual follow-ups)
- False issue creation eliminated
- Category/urgency are context-aware
- Per-org customisation via system prompt
- Evaluation catches regressions
- Gathering state persists in DB
Negative
- Every gathering turn calls LLM (~$0.003-0.01 per turn)
- Latency +500-1000ms per turn vs keyword matching (~5ms)
- LLM non-determinism
- Testing harder (mocking tool responses)
Mitigations
- Skip the LLM for obvious greetings (“hi”, “thanks”); see the sketch after this list
- Use Haiku for tool calls (faster, cheaper)
- Temperature 0 + golden tests catch drift
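A sketch of the greeting short-circuit; the pattern list is illustrative and would live in config:

```typescript
// Obvious acknowledgments never reach the tool-use loop (hypothetical helper).
const TRIVIAL = /^(hi|hello|hey|thanks|thank you|ok|okay)[.!?]*$/i;

function shouldSkipLLM(text: string): boolean {
  return TRIVIAL.test(text.trim());
}
```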