ADR-017: RAG Pipeline v2 — LLM-Driven Orchestration & Retrieval Improvements

Status: In Review
Owner: @bilal @deen
Date: 2026-02-15
Supersedes: ADR-015 Identity-First Conversations (identity design preserved, orchestration redesigned)

Context

What triggered this review

Manual testing revealed fundamental flaws in keyword-based conversation orchestration:

  1. “My boiler is not working” skipped detail gathering → hasRichDetail() falsely counted “boiler” as a location
  2. “gimme a min I’ll take pics” created the issue immediately → the system couldn’t distinguish deferral from a new report
  3. “I have a leak in the bathroom” was treated as sufficient → the bot never asked pipe vs tap vs ceiling

Root cause: keyword matching drives critical flow decisions. The LLM is already used for intent classification and response generation, yet all flow control relies on hardcoded word lists.

Full Pipeline Audit Findings

  • Conversation orchestration: Implicit state machine via message prefixes, fragile hasRichDetail(), no conversation expiry, no idempotency
  • Intent classification: Quick classification bypasses the LLM too aggressively; hardcoded to a single provider
  • Retrieval: Vector-only search, no deduplication, no diversity, no metadata enrichment
  • Chunking: Word-count based (not semantic), no metadata, small chunks silently dropped
  • Emergency detection: Substring matching with false positives, UK-hardcoded

What ADR-015 got right (preserved)

  • Identity-first flow: UNIDENTIFIED → IDENTIFIED → CONFIRMED → ACTIVE
  • WhatsApp media ingestion pipeline
  • Contextual RAG with property knowledge
  • Auto-confirm for high-confidence single-property matches

Decision

1. LLM Tool Use for Conversation Orchestration

Replace the keyword state machine with Claude tool use. Define the tools the LLM can invoke:

Tool               When to Use
ask_for_details    Tenant mentioned a problem but specifics are needed
ask_for_photo      Only after clear details (location + description)
create_issue       Clear description + location + photo provided or declined
respond            Greetings, acknowledgments, “I’ll send later”
escalate           Tenant frustrated, issue too complex, or tenant requests a human

System prompt provides guardrails (tone, gathering rules, emergency bypass, property context from RAG).
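
A minimal sketch of the orchestration call, assuming the Anthropic Messages API tool-use shape. systemPrompt, conversationHistory, the model alias, and the two example tool schemas are illustrative, not final:

import Anthropic from '@anthropic-ai/sdk';

declare const systemPrompt: string;                          // guardrails + RAG property context
declare const conversationHistory: Anthropic.MessageParam[]; // prior turns for this conversation

const client = new Anthropic();

// Illustrative subset of the five tools; real input schemas would be richer.
const tools: Anthropic.Tool[] = [
  {
    name: 'ask_for_details',
    description: 'Ask a follow-up question when a reported problem lacks location or specifics.',
    input_schema: {
      type: 'object',
      properties: { question: { type: 'string' } },
      required: ['question'],
    },
  },
  {
    name: 'create_issue',
    description: 'Create an issue once description, location, and photo (or explicit decline) are present.',
    input_schema: {
      type: 'object',
      properties: {
        summary: { type: 'string' },
        location: { type: 'string' },
        urgency: { type: 'string', enum: ['low', 'medium', 'high', 'emergency'] },
      },
      required: ['summary', 'location', 'urgency'],
    },
  },
];

const response = await client.messages.create({
  model: 'claude-3-5-haiku-latest', // per the mitigation below: Haiku for tool calls
  max_tokens: 1024,
  system: systemPrompt,
  tools,
  messages: conversationHistory,
});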

What this eliminates: hasRichDetail(), buildDetailRequest(), message prefix state tracking, handleDetailResponse/PhotoResponse, isRepairRequest(), generateConfirmation()

What stays: Emergency keyword detection (speed-critical), identity flow, createIssueFromConversation(), notifyLandlord(), duplicate guard

2. Conversation State in Database

ALTER TABLE conversations ADD COLUMN gathering_state text DEFAULT 'idle'
  CHECK (gathering_state IN ('idle', 'gathering_details', 'awaiting_photo', 'issue_created'));
 
ALTER TABLE conversations ADD COLUMN identity_status text DEFAULT 'unidentified'
  CHECK (identity_status IN ('unidentified', 'identified', 'confirmed', 'active'));

Identity and gathering are two orthogonal state dimensions; each is updated when the LLM invokes the corresponding tool.
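
A sketch of the gathering-state transitions, assuming a node-postgres-style client; toolToState and applyToolTransition are hypothetical names:

import { Pool } from 'pg';

declare const db: Pool; // assumed shared Postgres pool

// respond and escalate intentionally have no entry: they leave gathering_state unchanged.
const toolToState: Record<string, string> = {
  ask_for_details: 'gathering_details',
  ask_for_photo: 'awaiting_photo',
  create_issue: 'issue_created',
};

async function applyToolTransition(conversationId: string, toolName: string): Promise<void> {
  const next = toolToState[toolName];
  if (!next) return;
  // The CHECK constraint above rejects any state outside the allowed set.
  await db.query('UPDATE conversations SET gathering_state = $1 WHERE id = $2', [
    next,
    conversationId,
  ]);
}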

3. Improved Retrieval

  • Hybrid search: pgvector + pg_trgm with RRF fusion (60/40 vector/keyword weight; fusion sketched after this list)
  • Metadata-enriched chunks: document title + section heading on each chunk
  • Better chunking: 300-500 tokens with 50-100 token overlap, paragraph-level with heading context
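
A sketch of the weighted reciprocal rank fusion over the two ranked result lists; k = 60 is the conventional RRF constant, and the weights follow the 60/40 split above:

interface Ranked {
  chunkId: string;
}

function rrfFuse(vectorHits: Ranked[], keywordHits: Ranked[], k = 60): string[] {
  const scores = new Map<string, number>();
  const accumulate = (hits: Ranked[], weight: number) =>
    hits.forEach((hit, i) => {
      const rank = i + 1; // RRF uses 1-based ranks
      scores.set(hit.chunkId, (scores.get(hit.chunkId) ?? 0) + weight / (k + rank));
    });
  accumulate(vectorHits, 0.6);  // pgvector similarity results
  accumulate(keywordHits, 0.4); // pg_trgm keyword results
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([chunkId]) => chunkId);
}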

4. Emergency Detection: Keep Keyword + LLM Validation

Keyword detection triggers an immediate safety response. LLM validation runs async to reduce false positives (“I read about a fire on the news” → not an emergency). Emergency keywords become configurable per org/country.
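
A sketch of the keyword-first, validate-second flow; every helper here (getEmergencyKeywords, sendSafetyResponse, validateEmergencyWithLLM, flagFalsePositive) is a hypothetical name:

declare function getEmergencyKeywords(orgId: string): Promise<string[]>; // per-org/country config
declare function sendSafetyResponse(orgId: string, message: string): Promise<void>;
declare function validateEmergencyWithLLM(message: string): Promise<boolean>;
declare function flagFalsePositive(orgId: string, message: string): void;

async function handlePossibleEmergency(orgId: string, message: string): Promise<boolean> {
  const keywords = await getEmergencyKeywords(orgId);
  const lower = message.toLowerCase();
  if (!keywords.some((kw) => lower.includes(kw))) return false;

  // Speed-critical path: respond immediately, never wait on the LLM.
  await sendSafetyResponse(orgId, message);

  // Async validation catches false positives like “I read about a fire on the news”.
  void validateEmergencyWithLLM(message).then((isReal) => {
    if (!isReal) flagFalsePositive(orgId, message);
  });
  return true;
}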

5. Evaluation Framework

  • Tier 1 (per-commit): 30-50 golden test scenarios with expected tool calls (scenario shape sketched after this list)
  • Tier 2 (weekly): Retrieval quality — precision@5, context relevance
  • Tier 3 (monthly): 50 real conversation samples, LLM-as-judge scoring
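
A sketch of a Tier 1 golden scenario, built from the failure cases in the Context section; the GoldenScenario shape is illustrative:

interface GoldenScenario {
  name: string;
  transcript: { role: 'user' | 'assistant'; text: string }[];
  expectedTool: 'ask_for_details' | 'ask_for_photo' | 'create_issue' | 'respond' | 'escalate';
}

const scenarios: GoldenScenario[] = [
  {
    name: 'vague boiler report gathers details instead of creating an issue',
    transcript: [{ role: 'user', text: 'My boiler is not working' }],
    expectedTool: 'ask_for_details',
  },
  {
    name: 'deferred photo does not create an issue',
    transcript: [{ role: 'user', text: 'gimme a min I’ll take pics' }],
    expectedTool: 'respond',
  },
];

// In CI: replay each transcript through the orchestrator at temperature 0
// and assert the first tool invoked matches expectedTool.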

Implementation Phases

  1. LLM Tool Use Orchestration — Rewrite process.ts, add gathering_state, remove keyword logic
  2. Identity-First Integration — Add identity_status, gate tools by identity state
  3. Retrieval Improvements — pg_trgm, RRF hybrid search, metadata enrichment
  4. Evaluation Framework — Golden test set in CI
  5. Emergency Detection — LLM validation, configurable keywords

Migration Strategy

A new processInboundMessageV2() sits behind the feature flag USE_LLM_ORCHESTRATION=true. Both paths share the same issue creation, notification, and retrieval modules. Remove v1 after validation.
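
A sketch of the flag-gated dispatch; the InboundMessage shape and handleInbound name are illustrative:

interface InboundMessage {
  from: string;
  body: string;
}

declare function processInboundMessage(msg: InboundMessage): Promise<void>;   // v1 keyword path
declare function processInboundMessageV2(msg: InboundMessage): Promise<void>; // v2 LLM tool-use path

export async function handleInbound(msg: InboundMessage): Promise<void> {
  if (process.env.USE_LLM_ORCHESTRATION === 'true') {
    await processInboundMessageV2(msg);
  } else {
    await processInboundMessage(msg);
  }
}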

Consequences

Positive

  • Natural conversations (LLM asks contextual follow-ups)
  • False issue creation from deferrals and vague reports eliminated
  • Category/urgency are context-aware
  • Per-org customisation via system prompt
  • Evaluation catches regressions
  • Gathering state persists in DB

Negative

  • Every gathering turn calls LLM (~$0.003-0.01 per turn)
  • Latency +500-1000ms per turn vs keyword matching (~5ms)
  • LLM non-determinism
  • Testing harder (mocking tool responses)

Mitigations

  • Skip LLM for obvious greetings (“hi”, “thanks”; short-circuit sketched after this list)
  • Use Haiku for tool calls (faster, cheaper)
  • Temperature 0 + golden tests catch drift
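
A sketch of the greeting short-circuit; the list is illustrative, and anything not an exact trivial match falls through to the LLM:

const TRIVIAL = new Set(['hi', 'hello', 'hey', 'thanks', 'thank you', 'ok', 'cheers']);

function isTrivialGreeting(message: string): boolean {
  // Normalise case and strip trailing punctuation before the exact-match check.
  return TRIVIAL.has(message.trim().toLowerCase().replace(/[!.?]+$/u, ''));
}

// Only exact trivial matches skip the LLM; everything else goes through tool use,
// so the cost saving never risks swallowing a real repair report.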