ADR-017: RAG Pipeline v2 — LLM-Driven Orchestration & Retrieval Improvements
Status: In Review Owner: @bilal @deen Date: 2026-02-15 Supersedes: ADR-015 Identity-First Conversations (identity design preserved, orchestration redesigned)
Context
What triggered this review
Manual testing revealed fundamental flaws in keyword-based conversation orchestration:
- “My boiler is not working” skipped detail gathering → hasRichDetail() falsely counted “boiler” as location
- “gimme a min I’ll take pics” created the issue immediately → system couldn’t distinguish intent
- “I have a leak in the bathroom” treated as sufficient → didn’t ask pipe vs tap vs ceiling
Root cause: Keyword matching for critical flow decisions. The LLM is used for intent classification and response generation, but all flow control uses hardcoded word lists.
Full Pipeline Audit Findings
- Conversation orchestration: Implicit state machine via message prefixes, fragile hasRichDetail(), no conversation expiry, no idempotency
- Intent classification: Quick classification bypasses the LLM too aggressively, hardcoded to a single provider
- Retrieval: Vector-only search, no deduplication, no diversity, no metadata enrichment
- Chunking: Word-count based (not semantic), no metadata, small chunks silently dropped
- Emergency detection: Substring matching with false positives, UK-hardcoded
What ADR-015 got right (preserved)
- Identity-first flow: UNIDENTIFIED → IDENTIFIED → CONFIRMED → ACTIVE
- WhatsApp media ingestion pipeline
- Contextual RAG with property knowledge
- Auto-confirm for high-confidence single-property matches
Decision
1. LLM Tool Use for Conversation Orchestration
Replace keyword state machine with Claude tool use. Define tools the LLM can invoke:
| Tool | When to Use |
|---|---|
| ask_for_details | Tenant mentioned a problem but specifics needed |
| ask_for_photo | Only after clear details (location + description) |
| create_issue | Clear description + location + photo provided/declined |
| respond | Greetings, acknowledgments, “I’ll send later” |
| escalate | Tenant frustrated, too complex, requests human |
System prompt provides guardrails (tone, gathering rules, emergency bypass, property context from RAG).
What this eliminates: hasRichDetail(), buildDetailRequest(), message prefix state tracking, handleDetailResponse/PhotoResponse, isRepairRequest(), generateConfirmation()
What stays: Emergency keyword detection (speed-critical), identity flow, createIssueFromConversation(), notifyLandlord(), duplicate guard
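The tool table above can be sketched as JSON-schema tool definitions plus a dispatcher that routes the model's chosen tool to a handler. This is a minimal illustration assuming the Anthropic Messages API tool shape (name / description / input_schema); the handler wiring and stub bodies are hypothetical, not the real process.ts code.

```typescript
type ToolName =
  | "ask_for_details"
  | "ask_for_photo"
  | "create_issue"
  | "respond"
  | "escalate";

// Tool definitions in the shape the Anthropic Messages API accepts.
// Only two are spelled out; the rest follow the same pattern.
const tools = [
  {
    name: "ask_for_details",
    description: "Ask the tenant for missing specifics (location, symptom).",
    input_schema: {
      type: "object",
      properties: { question: { type: "string" } },
      required: ["question"],
    },
  },
  {
    name: "create_issue",
    description:
      "Create a maintenance issue once description, location and photo are settled.",
    input_schema: {
      type: "object",
      properties: {
        summary: { type: "string" },
        category: { type: "string" },
        urgency: { type: "string", enum: ["low", "medium", "high", "emergency"] },
      },
      required: ["summary", "category", "urgency"],
    },
  },
  // ...ask_for_photo, respond, escalate defined the same way
];

// Route the tool_use block from the model response to the matching handler.
function dispatchToolUse(
  name: ToolName,
  input: Record<string, unknown>,
  handlers: Record<ToolName, (input: Record<string, unknown>) => string>,
): string {
  const handler = handlers[name];
  if (!handler) throw new Error(`Unknown tool: ${name}`);
  return handler(input);
}
```

The orchestrator passes `tools` with every call; the model's tool choice becomes the flow decision, replacing the keyword checks listed above.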
2. Conversation State in Database
ALTER TABLE conversations ADD COLUMN gathering_state text DEFAULT 'idle'
CHECK (gathering_state IN ('idle', 'gathering_details', 'awaiting_photo', 'issue_created'));
ALTER TABLE conversations ADD COLUMN identity_status text DEFAULT 'unidentified'
CHECK (identity_status IN ('unidentified', 'identified', 'confirmed', 'active'));

Two orthogonal state dimensions: identity and gathering. Each is updated when the LLM invokes the corresponding tool.
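The mapping from tool invocation to gathering_state can be sketched as a small transition table (names match the schema above; the function name is illustrative):

```typescript
type GatheringState =
  | "idle"
  | "gathering_details"
  | "awaiting_photo"
  | "issue_created";

type ToolName =
  | "ask_for_details"
  | "ask_for_photo"
  | "create_issue"
  | "respond"
  | "escalate";

// respond and escalate are absent: they leave gathering_state unchanged.
const TRANSITIONS: Partial<Record<ToolName, GatheringState>> = {
  ask_for_details: "gathering_details",
  ask_for_photo: "awaiting_photo",
  create_issue: "issue_created",
};

function nextGatheringState(
  current: GatheringState,
  tool: ToolName,
): GatheringState {
  return TRANSITIONS[tool] ?? current;
}
```

Keeping this mapping in code (and the state in the database) means a restarted worker resumes the conversation exactly where the last tool call left it.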
3. Improved Retrieval
- Hybrid search: pgvector + pg_trgm (RRF fusion, 60/40 vector/keyword weight)
- Metadata-enriched chunks: document title + section heading on each chunk
- Better chunking: 300-500 tokens with 50-100 token overlap, paragraph-level with heading context
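The RRF fusion step can be sketched as follows: each document scores w / (k + rank) per list it appears in, with the 60/40 vector/keyword weights from this ADR and the conventional smoothing constant k = 60. The function name and id-list inputs are illustrative.

```typescript
const RRF_K = 60; // conventional RRF smoothing constant

// Weighted Reciprocal Rank Fusion over two ranked chunk-id lists
// (best result first in each list).
function rrfFuse(
  vectorRanked: string[],
  keywordRanked: string[],
  wVec = 0.6,
  wKw = 0.4,
): string[] {
  const scores = new Map<string, number>();
  const add = (ids: string[], w: number) =>
    ids.forEach((id, rank) =>
      scores.set(id, (scores.get(id) ?? 0) + w / (RRF_K + rank + 1)),
    );
  add(vectorRanked, wVec);
  add(keywordRanked, wKw);
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A chunk ranked well by both retrievers outranks one that tops only a single list, which is the property that makes RRF robust to either retriever misfiring.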
4. Emergency Detection: Keep Keyword + LLM Validation
Keyword detection triggers immediate safety response. LLM validation runs async to reduce false positives (“I read about a fire on the news” → not emergency). Emergency keywords made configurable per org/country.
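A minimal sketch of the fast keyword path, assuming a per-org keyword list of plain words/phrases (no regex metacharacters). Word-boundary matching already removes substring false positives like “firewall” matching “fire”; semantic false positives (“I read about a fire on the news”) still pass this stage and are left to the async LLM validation, which is not shown.

```typescript
// Fast synchronous check run before the LLM orchestrator.
// `keywords` comes from per-org/country configuration.
function matchesEmergencyKeyword(message: string, keywords: string[]): boolean {
  const text = message.toLowerCase();
  // \b word boundaries: "fire" matches "a FIRE broke out" but not "firewall".
  return keywords.some((kw) =>
    new RegExp(`\\b${kw.toLowerCase()}\\b`).test(text),
  );
}
```

On a match the safety response goes out immediately; the LLM verdict only downgrades follow-up handling, never delays the first reply.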
5. Evaluation Framework
- Tier 1 (per-commit): 30-50 golden test scenarios with expected tool calls
- Tier 2 (weekly): Retrieval quality — precision@5, context relevance
- Tier 3 (monthly): 50 real conversation samples, LLM-as-judge scoring
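A Tier 1 golden scenario can be sketched as a transcript plus the tool the orchestrator is expected to invoke; the interface and scenario wording here are illustrative, drawn from the failures in the Context section.

```typescript
interface GoldenScenario {
  name: string;
  messages: string[]; // tenant turns so far
  expectedTool: string; // tool the orchestrator should invoke next
}

const scenarios: GoldenScenario[] = [
  {
    name: "vague problem triggers detail gathering",
    messages: ["My boiler is not working"],
    expectedTool: "ask_for_details",
  },
  {
    name: "deferred photo does not create an issue",
    messages: ["There's a leak under the sink", "gimme a min I'll take pics"],
    expectedTool: "respond",
  },
];

// In CI each scenario is run through the orchestrator and the invoked
// tool compared against the expectation.
function checkScenario(s: GoldenScenario, invokedTool: string): boolean {
  return invokedTool === s.expectedTool;
}
```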
Implementation Phases
- LLM Tool Use Orchestration — Rewrite process.ts, add gathering_state, remove keyword logic
- Identity-First Integration — Add identity_status, gate tools by identity state
- Retrieval Improvements — pg_trgm, RRF hybrid search, metadata enrichment
- Evaluation Framework — Golden test set in CI
- Emergency Detection — LLM validation, configurable keywords
Migration Strategy
New processInboundMessageV2() behind feature flag USE_LLM_ORCHESTRATION=true. Both paths share the same issue-creation, notification, and retrieval modules. Remove v1 after validation.
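The flag-gated dispatch can be sketched as below; USE_LLM_ORCHESTRATION and the v1/v2 processor names come from this ADR, the rest is illustrative.

```typescript
type Processor = (msg: string) => Promise<void>;

// Route each inbound message to v1 or v2 based on the feature flag.
// Both processors share downstream modules, so flipping the flag back
// is safe mid-rollout.
function selectProcessor(
  env: Record<string, string | undefined>,
  processInboundMessage: Processor,
  processInboundMessageV2: Processor,
): Processor {
  return env.USE_LLM_ORCHESTRATION === "true"
    ? processInboundMessageV2
    : processInboundMessage;
}
```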
Consequences
Positive
- Natural conversations (LLM asks contextual follow-ups)
- False issue creation eliminated
- Category/urgency are context-aware
- Per-org customisation via system prompt
- Evaluation catches regressions
- Gathering state persists in DB
Negative
- Every gathering turn calls the LLM (~$0.003-0.01 for a Sonnet first turn or issue-creation call)
- Latency: ~5-6s first message (Sonnet), ~2-3s gathering turns (Haiku) — mitigated by parallel fetching and conditional RAG
- LLM non-determinism
- Testing harder (mocking tool responses)
Mitigations (implemented)
- Smart intent classification: LLM fallback only fires when the message might be a greeting or is truly unclear. Issue/question intents skip the ~900ms Haiku intent call — the orchestrator classifies via tool selection.
- Conditional RAG retrieval: Skipped on first message (IDLE state, ≤2 messages) where the LLM only needs to ask follow-up questions. Saves ~400-1100ms per first message.
- Model routing: Haiku for gathering turns (GATHERING_DETAILS, AWAITING_PHOTO when create_issue is not available); Sonnet for the first turn and issue-creation decisions. Cuts gathering turns from ~5s to ~1-2s.
- Parallel data fetching: Pre-LLM queries (hasIssue, org, property, RAG, history) run concurrently via Promise.all, eliminating a redundant conversation history fetch.
- Temperature 0 + golden tests catch drift
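The parallel pre-LLM fetch can be sketched as follows; the fetcher names are hypothetical stand-ins for the real queries (hasIssue, org, property, RAG, history), stubbed here so the shape is self-contained.

```typescript
// Stub fetchers standing in for the real DB/RAG queries.
const lookupOpenIssue = async (_id: string) => false;
const lookupOrg = async (_id: string) => ({ name: "demo-org" });
const lookupProperty = async (_id: string) => ({ address: "demo-address" });
const retrieveChunks = async (_id: string) => ["chunk-1"];
const loadHistory = async (_id: string) => ["hello"];

// All five queries are independent, so they run concurrently;
// total wait is the slowest query, not the sum.
async function fetchContext(conversationId: string) {
  const [hasOpenIssue, org, property, ragChunks, history] = await Promise.all([
    lookupOpenIssue(conversationId),
    lookupOrg(conversationId),
    lookupProperty(conversationId),
    retrieveChunks(conversationId),
    loadHistory(conversationId),
  ]);
  return { hasOpenIssue, org, property, ragChunks, history };
}
```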