ADR-021: Observability & Error Handling

Status: Planning
Owner: @bilal @deen
Date: 2026-02-15

Why This Needs an ADR

Envo handles tenant emergencies (gas leaks, flooding, no heating). If the system fails silently — a webhook doesn’t process, an LLM call times out, an event doesn’t publish — tenants don’t get help and landlords don’t know.

Currently:

Sentry DSN is listed as “optional” in the CI/CD doc
No structured logging strategy
ADR-013 has correlation IDs but no unified tracing
No LLM cost/latency monitoring
No alerting for critical failures
console.error is the primary error handling pattern

What Needs Monitoring

Tier 1: Safety-Critical (must alert immediately)

Signal	Why	Alert
Webhook processing failures (Twilio, VAPI)	Tenant messages lost	Slack + SMS to on-call
Emergency detection failures	Gas leak not flagged	Slack + SMS
Issue creation failures	Tenant reported but nothing logged	Slack
LLM provider complete outage	No AI responses at all	Slack
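The Tier 1 table can be read as a routing map from signal to alert channels. The sketch below is one way to encode it so that routing is checked by the compiler. The signal names and the `routeTier1Alert` helper are hypothetical; only the channel assignments come from the table.

```typescript
// Hypothetical event names for the four Tier 1 signals in the table above.
type Tier1Signal =
  | "webhook.process_failed"
  | "emergency.detection_failed"
  | "issue.create_failed"
  | "llm.provider_outage";

type AlertChannel = "slack" | "sms";

// Channel assignments mirror the "Alert" column of the Tier 1 table.
const TIER1_ROUTES: Record<Tier1Signal, readonly AlertChannel[]> = {
  "webhook.process_failed": ["slack", "sms"],
  "emergency.detection_failed": ["slack", "sms"],
  "issue.create_failed": ["slack"],
  "llm.provider_outage": ["slack"],
};

interface Tier1Alert {
  channel: AlertChannel;
  signal: Tier1Signal;
  message: string;
}

// Expands one signal into one alert per configured channel.
// Actually delivering to Slack or SMS is out of scope for this sketch.
function routeTier1Alert(signal: Tier1Signal, message: string): Tier1Alert[] {
  return TIER1_ROUTES[signal].map((channel) => ({ channel, signal, message }));
}
```

Because `TIER1_ROUTES` is typed as a `Record` over the signal union, adding a new Tier 1 signal without deciding its channels fails type checking.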

Tier 2: Operational (alert within hours)

Signal	Why	Alert
Event publish failures > 10%	n8n/notifications degraded	Slack
LLM latency p95 > 5s	Tenant experience degraded	Slack
Error rate > 5% on any endpoint	Something broken	Slack
Cron job missed scheduled run	Cleanup/alerts not happening	Slack
Database connection pool exhaustion	App will fail soon	Slack
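Three of the Tier 2 rows are threshold checks over a time window. A minimal sketch of how they could be evaluated follows, assuming the counters are collected elsewhere; `WindowStats`, `evaluateTier2` and the breach names are hypothetical, and the nearest-rank percentile is one of several common definitions.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Counters for one endpoint over one evaluation window (hypothetical shape).
interface WindowStats {
  llmLatenciesMs: number[];
  requests: number;
  errors: number;
  eventsAttempted: number;
  eventsFailed: number;
}

// Returns the names of the Tier 2 thresholds breached in this window.
function evaluateTier2(stats: WindowStats): string[] {
  const breaches: string[] = [];
  if (percentile(stats.llmLatenciesMs, 95) > 5000) {
    breaches.push("llm.latency_p95");
  }
  if (stats.requests > 0 && stats.errors / stats.requests > 0.05) {
    breaches.push("endpoint.error_rate");
  }
  if (stats.eventsAttempted > 0 && stats.eventsFailed / stats.eventsAttempted > 0.1) {
    breaches.push("event.publish_failures");
  }
  return breaches;
}
```

The guards on `requests` and `eventsAttempted` keep an idle window from dividing by zero and raising a false alert.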

Tier 3: Business Intelligence (daily/weekly review)

Signal	Why
LLM cost per day/week	Budget tracking
LLM token usage by model	Optimisation
Conversation volume by channel	Growth tracking
Issue creation rate	Product health
RAG retrieval quality metrics	AI quality
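The cost and token rows can be derived from per-call records rather than measured separately. The sketch below aggregates cost by day and purpose. The record shape, the `costByDayAndPurpose` name and the rate table are assumptions; rates are passed in by the caller because this ADR does not state any prices.

```typescript
// One record per LLM call (hypothetical shape).
interface LLMCallRecord {
  day: string; // e.g. "2026-02-15"
  model: string;
  purpose: string;
  inputTokens: number;
  outputTokens: number;
}

// Price per million tokens, supplied by the caller. No real prices are assumed here.
interface TokenRate {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Sums cost per "day|purpose" key. Calls to models without a known rate are skipped
// rather than priced by guesswork.
function costByDayAndPurpose(
  records: LLMCallRecord[],
  rates: Record<string, TokenRate>,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    const rate = rates[r.model];
    if (!rate) continue;
    const cost =
      (r.inputTokens * rate.inputPerMTok + r.outputTokens * rate.outputPerMTok) / 1_000_000;
    const key = `${r.day}|${r.purpose}`;
    totals.set(key, (totals.get(key) ?? 0) + cost);
  }
  return totals;
}
```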

Options for Error Tracking

Option	Cost	Effort	Notes
Sentry (free tier)	Free (5K events/mo)	Low — SDK install	Best DX, source maps, breadcrumbs
Vercel Analytics + Logs	Included in Pro	Zero	Basic, no alerting
Axiom (via Vercel integration)	Free tier	Low	Good for logs, integrates with Vercel
BetterStack (Logtail)	Free tier	Low	Structured logging + uptime monitoring

Recommendation: use Sentry for error tracking and Vercel logs for general observability, then add structured logging later.

Structured Logging

Replace console.log/error with structured JSON logs:

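// Before: free-text message with no fields to search or alert on
console.error('Failed to process message', error)

// After: stable event name plus structured fields
logger.error('webhook.process_failed', {
  channel: 'whatsapp',
  conversationId,
  // A caught value is `unknown` in TypeScript, so narrow it before reading .message
  error: error instanceof Error ? error.message : String(error),
  correlationId,  // From ADR-013
})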

Minimum fields per log entry:

timestamp, level, event (dot-notation like webhook.process_failed)
correlationId (links related operations)
organisationId (for multi-tenant debugging)
userId or tenantId (who triggered it)
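A minimal sketch of a logger that enforces these minimum fields follows. It is not an existing Envo module: `createLogger`, `LogContext` and the injectable `sink` and `now` parameters are assumptions made for illustration and testing.

```typescript
type LogLevel = "info" | "warn" | "error";
type Fields = Record<string, unknown>;

// Context bound once per request, so individual call sites cannot forget it.
interface LogContext {
  correlationId: string;
  organisationId: string;
  userId?: string;
  tenantId?: string;
}

function createLogger(
  context: LogContext,
  sink: (line: string) => void = (line) => console.log(line),
  now: () => Date = () => new Date(),
) {
  const write = (level: LogLevel, event: string, fields: Fields = {}): void => {
    // Spread order matters: the required fields come last so that
    // call-site fields with the same key cannot overwrite them.
    sink(
      JSON.stringify({ ...fields, ...context, timestamp: now().toISOString(), level, event }),
    );
  };
  return {
    info: (event: string, fields?: Fields) => write("info", event, fields),
    warn: (event: string, fields?: Fields) => write("warn", event, fields),
    error: (event: string, fields?: Fields) => write("error", event, fields),
  };
}

// Usage: one JSON line per entry, written to stdout by default.
const logger = createLogger({
  correlationId: "corr-123",
  organisationId: "org-1",
  tenantId: "tenant-9",
});
logger.error("webhook.process_failed", { channel: "whatsapp", error: "timeout" });
```

Binding the context at creation time is a design choice: the webhook handler builds one logger per request, and every downstream log line carries the same IDs.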

LLM Monitoring

LLM calls are the most expensive and least predictable part of the stack, so every call should go through a monitoring wrapper:

// Wrapper around every LLM call
const result = await withLLMMonitoring({
  provider: 'anthropic',
  model: 'claude-haiku-4-5-20251001',
  purpose: 'intent_classification',  // or 'tool_use', 'rag_response', 'rerank'
  organisationId,
}, async () => {
  return anthropic.messages.create(...)
})

// Logs: provider, model, purpose, input_tokens, output_tokens, latency_ms, success, error

This enables:

Cost breakdown by purpose (how much does reranking cost vs responses?)
Latency tracking per model
Failure rate per provider (trigger fallback alerts)
Per-org cost tracking (for BYOAK billing; see ADR-014 White-Label & BYOAK)
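One possible shape for the `withLLMMonitoring` wrapper is sketched below. It assumes the wrapped call resolves to an object with a `usage` block containing `input_tokens` and `output_tokens`, as Anthropic Messages API responses do; other providers would need an adapter. The `log` parameter and the `LLMLogEntry` type are hypothetical.

```typescript
interface LLMCallMeta {
  provider: string;
  model: string;
  purpose: string;
  organisationId: string;
}

// Minimal view of a response that reports token usage.
interface HasUsage {
  usage?: { input_tokens: number; output_tokens: number };
}

type LLMLogEntry = LLMCallMeta & {
  input_tokens: number | null;
  output_tokens: number | null;
  latency_ms: number;
  success: boolean;
  error: string | null;
};

async function withLLMMonitoring<T extends HasUsage>(
  meta: LLMCallMeta,
  call: () => Promise<T>,
  log: (entry: LLMLogEntry) => void = (e) => console.log(JSON.stringify(e)),
): Promise<T> {
  const startedAt = Date.now();
  try {
    const result = await call();
    log({
      ...meta,
      input_tokens: result.usage?.input_tokens ?? null,
      output_tokens: result.usage?.output_tokens ?? null,
      latency_ms: Date.now() - startedAt,
      success: true,
      error: null,
    });
    return result;
  } catch (err) {
    log({
      ...meta,
      input_tokens: null,
      output_tokens: null,
      latency_ms: Date.now() - startedAt,
      success: false,
      error: err instanceof Error ? err.message : String(err),
    });
    // Rethrow so the caller's fallback and error handling still run.
    throw err;
  }
}
```

The wrapper logs on both paths and rethrows on failure, so monitoring never hides an outage from the code that decides whether to fall back to another provider.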

Request Tracing

ADR-013 already puts a correlation_id on domain events. Extend this to full request tracing, so that one ID follows a request from the inbound webhook to the landlord notification:

Twilio webhook → correlationId generated
  → processInboundMessage (same correlationId)
    → identifyTenant (same correlationId)
    → LLM tool use call (same correlationId)
    → createIssue (same correlationId)
    → emit domain event (same correlationId)
    → notifyLandlord (same correlationId)

When something fails, search by correlationId to see the full chain.
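The chain above can be sketched by generating the ID once at the webhook boundary and passing it explicitly to each step. The step names come from the chain; their bodies here are stubs, and `TraceContext`, `TraceLog` and `handleTwilioWebhook` are hypothetical.

```typescript
interface TraceContext {
  readonly correlationId: string;
}

type TraceLog = (event: string, ctx: TraceContext, fields?: Record<string, unknown>) => void;

// Stub: a real implementation would look the sender up in the database.
async function identifyTenant(ctx: TraceContext, log: TraceLog): Promise<string> {
  const tenantId = "tenant-placeholder";
  log("tenant.identified", ctx, { tenantId });
  return tenantId;
}

// Stub: a real implementation would insert the issue and emit a domain event.
async function createIssue(ctx: TraceContext, tenantId: string, log: TraceLog): Promise<void> {
  log("issue.created", ctx, { tenantId });
}

// The ID is created exactly once, at the edge, and handed to every step.
async function handleTwilioWebhook(newId: () => string, log: TraceLog): Promise<string> {
  const ctx: TraceContext = { correlationId: newId() };
  log("webhook.received", ctx);
  const tenantId = await identifyTenant(ctx, log);
  await createIssue(ctx, tenantId, log);
  return ctx.correlationId;
}
```

Explicit passing keeps the sketch dependency-free. In a Node.js runtime, `AsyncLocalStorage` from `node:async_hooks` is an alternative that avoids threading `ctx` through every signature.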

Minimum Viable Observability

Before the first deployment, the minimum is:

Sentry SDK: install it, configure source maps, and catch unhandled errors
Webhook health check: log every webhook receipt and alert if none is received within 1 hour
LLM call logging: a structured log entry for every LLM call (provider, tokens, latency, success)
Health endpoint: /api/health checks the database and that at least one LLM provider is available
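The health endpoint and the webhook silence check can both be written as small pure functions that a route handler or cron job calls. This is a sketch under assumptions: `HealthDeps`, `checkHealth` and `webhooksAreSilent` are hypothetical names, and the probes are injected so the logic does not depend on any particular database or LLM client.

```typescript
interface HealthDeps {
  pingDatabase: () => Promise<void>;
  // One cheap probe per LLM provider, keyed by provider name.
  llmProviders: Record<string, () => Promise<void>>;
}

interface HealthReport {
  status: 200 | 503;
  db: boolean;
  llm: Record<string, boolean>;
}

// Converts a throwing probe into a boolean.
async function succeeds(probe: () => Promise<void>): Promise<boolean> {
  try {
    await probe();
    return true;
  } catch {
    return false;
  }
}

// Healthy means: the database responds AND at least one LLM provider responds.
async function checkHealth(deps: HealthDeps): Promise<HealthReport> {
  const db = await succeeds(deps.pingDatabase);
  const names = Object.keys(deps.llmProviders);
  const results = await Promise.all(names.map((n) => succeeds(deps.llmProviders[n])));
  const llm: Record<string, boolean> = {};
  names.forEach((n, i) => {
    llm[n] = results[i];
  });
  return { status: db && results.some(Boolean) ? 200 : 503, db, llm };
}

// True when no webhook has been received for longer than maxGapMs (default: 1 hour).
function webhooksAreSilent(lastReceivedAt: Date, now: Date, maxGapMs = 60 * 60 * 1000): boolean {
  return now.getTime() - lastReceivedAt.getTime() > maxGapMs;
}
```

A fixed one-hour silence threshold may raise false alerts during quiet periods such as overnight, so the gap is left as a parameter.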

Add structured logging, request tracing, and business metrics as the product scales.

Related

ADR-013 Event-Driven Architecture (correlation IDs)
ADR-017 RAG Pipeline v2 (evaluation framework)
ADR-020 Background Jobs & Scheduled Tasks (job monitoring)
Infrastructure
Costs & Usage
