ADR-017: RAG Pipeline v2 — LLM-Driven Orchestration & Retrieval Improvements

Status: In Review
Owner: @bilal @deen
Date: 2026-02-15
Supersedes: ADR-015 Identity-First Conversations (identity design preserved, orchestration redesigned)

Context

What triggered this review

Manual testing revealed fundamental flaws in keyword-based conversation orchestration:

  1. “My boiler is not working” skipped detail gathering → hasRichDetail() falsely counted “boiler” as a location
  2. “gimme a min I’ll take pics” created the issue immediately → the system couldn’t distinguish deferral from a new report
  3. “I have a leak in the bathroom” was treated as sufficient → the bot never asked pipe vs tap vs ceiling

Root cause: keyword matching drives critical flow decisions. The LLM is already used for intent classification and response generation, yet all flow control relies on hardcoded word lists.

Full Pipeline Audit Findings

  • Conversation orchestration: Implicit state machine via message prefixes, fragile hasRichDetail(), no conversation expiry, no idempotency
  • Intent classification: Quick classification bypasses the LLM too aggressively; hardcoded to a single provider
  • Retrieval: Vector-only search, no deduplication, no diversity, no metadata enrichment
  • Chunking: Word-count based (not semantic), no metadata, small chunks silently dropped
  • Emergency detection: Substring matching with false positives, UK-hardcoded

What ADR-015 got right (preserved)

  • Identity-first flow: UNIDENTIFIED → IDENTIFIED → CONFIRMED → ACTIVE
  • WhatsApp media ingestion pipeline
  • Contextual RAG with property knowledge
  • Auto-confirm for high-confidence single-property matches

Decision

1. LLM Tool Use for Conversation Orchestration

Replace the keyword state machine with Claude tool use. Define the tools the LLM can invoke:

Tool               When to Use
ask_for_details    Tenant mentioned a problem but specifics are needed
ask_for_photo      Only after clear details (location + description)
create_issue       Clear description + location + photo provided or declined
respond            Greetings, acknowledgments, “I’ll send later”
escalate           Tenant frustrated, issue too complex, or tenant requests a human

System prompt provides guardrails (tone, gathering rules, emergency bypass, property context from RAG).
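
A minimal sketch of the orchestration call, assuming the Anthropic Messages API tool-use shape. systemPrompt, conversationHistory, the model alias, and the two example tool schemas are illustrative, not final:

import Anthropic from '@anthropic-ai/sdk';

declare const systemPrompt: string;                          // guardrails + RAG property context
declare const conversationHistory: Anthropic.MessageParam[]; // prior turns for this conversation

const client = new Anthropic();

// Illustrative subset of the five tools; real input schemas would be richer.
const tools: Anthropic.Tool[] = [
  {
    name: 'ask_for_details',
    description: 'Ask a follow-up question when a reported problem lacks location or specifics.',
    input_schema: {
      type: 'object',
      properties: { question: { type: 'string' } },
      required: ['question'],
    },
  },
  {
    name: 'create_issue',
    description: 'Create an issue once description, location, and photo (or explicit decline) are present.',
    input_schema: {
      type: 'object',
      properties: {
        summary: { type: 'string' },
        location: { type: 'string' },
        urgency: { type: 'string', enum: ['low', 'medium', 'high', 'emergency'] },
      },
      required: ['summary', 'location', 'urgency'],
    },
  },
];

const response = await client.messages.create({
  model: 'claude-3-5-haiku-latest', // per the mitigation below: Haiku for tool calls
  max_tokens: 1024,
  system: systemPrompt,
  tools,
  messages: conversationHistory,
});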

What this eliminates: hasRichDetail(), buildDetailRequest(), message prefix state tracking, handleDetailResponse/PhotoResponse, isRepairRequest(), generateConfirmation()

What stays: Emergency keyword detection (speed-critical), identity flow, createIssueFromConversation(), notifyLandlord(), duplicate guard

2. Conversation State in Database

ALTER TABLE conversations ADD COLUMN gathering_state text DEFAULT 'idle'
  CHECK (gathering_state IN ('idle', 'gathering_details', 'awaiting_photo', 'issue_created'));
 
ALTER TABLE conversations ADD COLUMN identity_status text DEFAULT 'unidentified'
  CHECK (identity_status IN ('unidentified', 'identified', 'confirmed', 'active'));

Identity and gathering are two orthogonal state dimensions; each is updated when the LLM invokes the corresponding tool.
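
A sketch of the gathering-state transitions, assuming a node-postgres-style client; toolToState and applyToolTransition are hypothetical names:

import { Pool } from 'pg';

declare const db: Pool; // assumed shared Postgres pool

// respond and escalate intentionally have no entry: they leave gathering_state unchanged.
const toolToState: Record<string, string> = {
  ask_for_details: 'gathering_details',
  ask_for_photo: 'awaiting_photo',
  create_issue: 'issue_created',
};

async function applyToolTransition(conversationId: string, toolName: string): Promise<void> {
  const next = toolToState[toolName];
  if (!next) return;
  // The CHECK constraint above rejects any state outside the allowed set.
  await db.query('UPDATE conversations SET gathering_state = $1 WHERE id = $2', [
    next,
    conversationId,
  ]);
}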

3. Improved Retrieval

  • Hybrid search: pgvector + pg_trgm with RRF fusion (60/40 vector/keyword weight; fusion sketched after this list)
  • Metadata-enriched chunks: document title + section heading on each chunk
  • Better chunking: 300-500 tokens with 50-100 token overlap, paragraph-level with heading context
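
A sketch of the weighted reciprocal rank fusion over the two ranked result lists; k = 60 is the conventional RRF constant, and the weights follow the 60/40 split above:

interface Ranked {
  chunkId: string;
}

function rrfFuse(vectorHits: Ranked[], keywordHits: Ranked[], k = 60): string[] {
  const scores = new Map<string, number>();
  const accumulate = (hits: Ranked[], weight: number) =>
    hits.forEach((hit, i) => {
      const rank = i + 1; // RRF uses 1-based ranks
      scores.set(hit.chunkId, (scores.get(hit.chunkId) ?? 0) + weight / (k + rank));
    });
  accumulate(vectorHits, 0.6);  // pgvector similarity results
  accumulate(keywordHits, 0.4); // pg_trgm keyword results
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([chunkId]) => chunkId);
}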

4. Emergency Detection: Keep Keyword + LLM Validation

Keyword detection triggers an immediate safety response. LLM validation runs async to reduce false positives (“I read about a fire on the news” → not an emergency). Emergency keywords become configurable per org/country.
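
A sketch of the keyword-first, validate-second flow; every helper here (getEmergencyKeywords, sendSafetyResponse, validateEmergencyWithLLM, flagFalsePositive) is a hypothetical name:

declare function getEmergencyKeywords(orgId: string): Promise<string[]>; // per-org/country config
declare function sendSafetyResponse(orgId: string, message: string): Promise<void>;
declare function validateEmergencyWithLLM(message: string): Promise<boolean>;
declare function flagFalsePositive(orgId: string, message: string): void;

async function handlePossibleEmergency(orgId: string, message: string): Promise<boolean> {
  const keywords = await getEmergencyKeywords(orgId);
  const lower = message.toLowerCase();
  if (!keywords.some((kw) => lower.includes(kw))) return false;

  // Speed-critical path: respond immediately, never wait on the LLM.
  await sendSafetyResponse(orgId, message);

  // Async validation catches false positives like “I read about a fire on the news”.
  void validateEmergencyWithLLM(message).then((isReal) => {
    if (!isReal) flagFalsePositive(orgId, message);
  });
  return true;
}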

5. Evaluation Framework

  • Tier 1 (per-commit): 30-50 golden test scenarios with expected tool calls (scenario shape sketched after this list)
  • Tier 2 (weekly): Retrieval quality — precision@5, context relevance
  • Tier 3 (monthly): 50 real conversation samples, LLM-as-judge scoring
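
A sketch of a Tier 1 golden scenario, built from the failure cases in the Context section; the GoldenScenario shape is illustrative:

interface GoldenScenario {
  name: string;
  transcript: { role: 'user' | 'assistant'; text: string }[];
  expectedTool: 'ask_for_details' | 'ask_for_photo' | 'create_issue' | 'respond' | 'escalate';
}

const scenarios: GoldenScenario[] = [
  {
    name: 'vague boiler report gathers details instead of creating an issue',
    transcript: [{ role: 'user', text: 'My boiler is not working' }],
    expectedTool: 'ask_for_details',
  },
  {
    name: 'deferred photo does not create an issue',
    transcript: [{ role: 'user', text: 'gimme a min I’ll take pics' }],
    expectedTool: 'respond',
  },
];

// In CI: replay each transcript through the orchestrator at temperature 0
// and assert the first tool invoked matches expectedTool.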

Implementation Phases

  1. LLM Tool Use Orchestration — Rewrite process.ts, add gathering_state, remove keyword logic
  2. Identity-First Integration — Add identity_status, gate tools by identity state
  3. Retrieval Improvements — pg_trgm, RRF hybrid search, metadata enrichment
  4. Evaluation Framework — Golden test set in CI
  5. Emergency Detection — LLM validation, configurable keywords

Migration Strategy

A new processInboundMessageV2() sits behind the feature flag USE_LLM_ORCHESTRATION=true. Both paths share the same issue creation, notification, and retrieval modules. Remove v1 after validation.
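
A sketch of the flag-gated dispatch; the InboundMessage shape and handleInbound name are illustrative:

interface InboundMessage {
  from: string;
  body: string;
}

declare function processInboundMessage(msg: InboundMessage): Promise<void>;   // v1 keyword path
declare function processInboundMessageV2(msg: InboundMessage): Promise<void>; // v2 LLM tool-use path

export async function handleInbound(msg: InboundMessage): Promise<void> {
  if (process.env.USE_LLM_ORCHESTRATION === 'true') {
    await processInboundMessageV2(msg);
  } else {
    await processInboundMessage(msg);
  }
}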

Consequences

Positive

  • Natural conversations (LLM asks contextual follow-ups)
  • False issue creation from deferrals and vague reports eliminated
  • Category/urgency are context-aware
  • Per-org customisation via system prompt
  • Evaluation catches regressions
  • Gathering state persists in DB

Negative

  • Every gathering turn calls LLM (~$0.003-0.01 per turn)
  • Latency +500-1000ms per turn vs keyword matching (~5ms)
  • LLM non-determinism
  • Testing harder (mocking tool responses)

Mitigations

  • Skip LLM for obvious greetings (“hi”, “thanks”; short-circuit sketched after this list)
  • Use Haiku for tool calls (faster, cheaper)
  • Temperature 0 + golden tests catch drift
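
A sketch of the greeting short-circuit; the list is illustrative, and anything not an exact trivial match falls through to the LLM:

const TRIVIAL = new Set(['hi', 'hello', 'hey', 'thanks', 'thank you', 'ok', 'cheers']);

function isTrivialGreeting(message: string): boolean {
  // Normalise case and strip trailing punctuation before the exact-match check.
  return TRIVIAL.has(message.trim().toLowerCase().replace(/[!.?]+$/u, ''));
}

// Only exact trivial matches skip the LLM; everything else goes through tool use,
// so the cost saving never risks swallowing a real repair report.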