ADR-021: Observability & Error Handling

Status: Planning
Owner: @bilal @deen
Date: 2026-02-15

Why This Needs an ADR

Envo handles tenant emergencies (gas leaks, flooding, no heating). If the system fails silently — a webhook doesn’t process, an LLM call times out, an event doesn’t publish — tenants don’t get help and landlords don’t know.

Currently:

  • Sentry DSN is listed as “optional” in the CI/CD doc
  • No structured logging strategy
  • ADR-013 has correlation IDs but no unified tracing
  • No LLM cost/latency monitoring
  • No alerting for critical failures
  • console.error is the primary error handling pattern

What Needs Monitoring

Tier 1: Safety-Critical (must alert immediately)

Signal | Why | Alert
Webhook processing failures (Twilio, VAPI) | Tenant messages lost | Slack + SMS to on-call
Emergency detection failures | Gas leak not flagged | Slack + SMS
Issue creation failures | Tenant reported but nothing logged | Slack
LLM provider complete outage | No AI responses at all | Slack

Tier 2: Operational (alert within hours)

Signal | Why | Alert
Event publish failures > 10% | n8n/notifications degraded | Slack
LLM latency p95 > 5s | Tenant experience degraded | Slack
Error rate > 5% on any endpoint | Something broken | Slack
Cron job missed scheduled run | Cleanup/alerts not happening | Slack
Database connection pool exhaustion | App will fail soon | Slack
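
The three numeric thresholds in the Tier 2 table could be checked over a rolling window along these lines. This is a minimal sketch, not a decided design: `WindowStats`, `percentile`, `evaluateTier2`, and the endpoint paths are hypothetical names, and nearest-rank is only one of several reasonable p95 definitions.

```typescript
// Hypothetical rolling-window counters; not an existing Envo type.
interface EndpointStats {
  total: number;
  errors: number;
}

interface WindowStats {
  eventPublishAttempts: number;
  eventPublishFailures: number;
  llmLatenciesMs: number[];
  endpoints: Record<string, EndpointStats>;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Returns the names of the Tier 2 signals that breached their threshold.
function evaluateTier2(stats: WindowStats): string[] {
  const breaches: string[] = [];

  // Event publish failures > 10%
  if (
    stats.eventPublishAttempts > 0 &&
    stats.eventPublishFailures / stats.eventPublishAttempts > 0.1
  ) {
    breaches.push('event_publish_failure_rate');
  }

  // LLM latency p95 > 5s
  if (percentile(stats.llmLatenciesMs, 95) > 5000) {
    breaches.push('llm_latency_p95');
  }

  // Error rate > 5% on any endpoint
  for (const [path, s] of Object.entries(stats.endpoints)) {
    if (s.total > 0 && s.errors / s.total > 0.05) {
      breaches.push(`error_rate:${path}`);
    }
  }

  return breaches;
}
```

Each returned name would map to one Slack alert. Window length and minimum sample size are left open here; small windows will make the percentage thresholds noisy.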

Tier 3: Business Intelligence (daily/weekly review)

Signal | Why
LLM cost per day/week | Budget tracking
LLM token usage by model | Optimisation
Conversation volume by channel | Growth tracking
Issue creation rate | Product health
RAG retrieval quality metrics | AI quality
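
LLM cost per day can be derived from logged token counts, assuming every LLM call is logged with its model and token usage (see LLM Monitoring). The sketch below is illustrative only: `LLMUsageRecord`, `ModelPrice`, and `costPerDay` are hypothetical names, and prices are passed in by the caller because this ADR does not specify them.

```typescript
// Hypothetical shape of one logged LLM call.
interface LLMUsageRecord {
  timestamp: string; // ISO 8601, UTC
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Price per million tokens, supplied by the caller; no real prices are assumed here.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Sums cost per UTC day (YYYY-MM-DD). Throws on an unpriced model so that
// a missing price cannot silently undercount spend.
function costPerDay(
  records: LLMUsageRecord[],
  prices: Record<string, ModelPrice>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const r of records) {
    const price = prices[r.model];
    if (!price) throw new Error(`No price configured for model ${r.model}`);
    const day = r.timestamp.slice(0, 10); // date part of the ISO timestamp
    const cost =
      (r.inputTokens / 1e6) * price.inputPerMTok +
      (r.outputTokens / 1e6) * price.outputPerMTok;
    totals[day] = (totals[day] ?? 0) + cost;
  }
  return totals;
}
```

Grouping by `model` or by a `purpose` field instead of by day follows the same pattern and would cover the token-usage-by-model signal.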

Options for Error Tracking

Option | Cost | Effort | Notes
Sentry (free tier) | Free (5K events/mo) | Low (SDK install) | Best DX, source maps, breadcrumbs
Vercel Analytics + Logs | Included in Pro | Zero | Basic, no alerting
Axiom (via Vercel integration) | Free tier | Low | Good for logs, integrates with Vercel
BetterStack (Logtail) | Free tier | Low | Structured logging + uptime monitoring

Recommendation: Sentry for errors + Vercel logs for general observability. Add structured logging later.

Structured Logging

Replace console.log/error with structured JSON logs:

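// Before
console.error('Failed to process message', error)

// After
logger.error('webhook.process_failed', {
  channel: 'whatsapp',
  conversationId,
  // A caught value is not guaranteed to be an Error instance
  error: error instanceof Error ? error.message : String(error),
  correlationId,  // From ADR-013
})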

Minimum fields per log entry:

  • timestamp, level, event (dot-notation like webhook.process_failed)
  • correlationId (links related operations)
  • organisationId (for multi-tenant debugging)
  • userId or tenantId (who triggered it)
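
A logger that enforces these minimum fields could look roughly like the sketch below. `LogContext` and `buildLogEntry` are hypothetical names, and writing one JSON object per line to stdout/stderr is an assumption about how the hosting platform collects logs, not a decision made in this ADR.

```typescript
type Level = 'debug' | 'info' | 'warn' | 'error';

// Context fields from the list above. correlationId is required; the other
// IDs are optional because not every call site knows them.
interface LogContext {
  correlationId: string;
  organisationId?: string;
  userId?: string;
  tenantId?: string;
  [key: string]: unknown; // event-specific extras such as channel or conversationId
}

// Builds one log entry. Kept separate from output so it can be unit-tested.
// Reserved fields are written last so the context cannot overwrite them.
function buildLogEntry(
  level: Level,
  event: string,
  context: LogContext,
  now: Date = new Date(),
) {
  return { ...context, timestamp: now.toISOString(), level, event };
}

// Hypothetical logger: one JSON object per line.
const logger = {
  info: (event: string, context: LogContext): void =>
    console.log(JSON.stringify(buildLogEntry('info', event, context))),
  warn: (event: string, context: LogContext): void =>
    console.warn(JSON.stringify(buildLogEntry('warn', event, context))),
  error: (event: string, context: LogContext): void =>
    console.error(JSON.stringify(buildLogEntry('error', event, context))),
};
```

Making `correlationId` a required property means a call site that forgets it fails at compile time rather than producing an untraceable log line.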

LLM Monitoring

LLM calls are the most expensive and least predictable part of the stack. Every call should go through a wrapper that records what was called, for what purpose, and how it performed:

// Wrapper around every LLM call
const result = await withLLMMonitoring({
  provider: 'anthropic',
  model: 'claude-haiku-4-5-20251001',
  purpose: 'intent_classification',  // or 'tool_use', 'rag_response', 'rerank'
  organisationId,
}, async () => {
  return anthropic.messages.create(...)
})
 
// Logs: provider, model, purpose, input_tokens, output_tokens, latency_ms, success, error

This enables:

  • Cost breakdown by purpose (how much does reranking cost vs responses?)
  • Latency tracking per model
  • Failure rate per provider (trigger fallback alerts)
  • Per-org cost tracking (for BYOAK billing; see ADR-014, White-Label & BYOAK)
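
One possible shape for `withLLMMonitoring` is sketched below. The ADR names the wrapper but does not define it, so the types, the optional third `sink` parameter, and the `llm.call` event name are assumptions. Token counts are read from a `usage` object with `input_tokens` and `output_tokens`, which is the shape the Anthropic Messages API returns; other providers would need a small adapter.

```typescript
// Hypothetical metadata passed by the caller, matching the example call above.
interface LLMCallMeta {
  provider: string;
  model: string;
  purpose: string;
  organisationId: string;
}

// Minimal view of a provider response that carries token usage.
interface WithUsage {
  usage?: { input_tokens: number; output_tokens: number };
}

interface LLMCallRecord extends LLMCallMeta {
  input_tokens: number | null;
  output_tokens: number | null;
  latency_ms: number;
  success: boolean;
  error: string | null;
}

type Sink = (record: LLMCallRecord) => void;

// Default sink: one JSON line per call. Swap in the structured logger if available.
const logToConsole: Sink = (record) =>
  console.log(JSON.stringify({ event: 'llm.call', ...record }));

// Times the call, records the outcome, and re-throws on failure so the
// caller's own error handling and fallback logic still run.
async function withLLMMonitoring<T extends WithUsage>(
  meta: LLMCallMeta,
  call: () => Promise<T>,
  sink: Sink = logToConsole,
): Promise<T> {
  const started = Date.now();
  try {
    const result = await call();
    sink({
      ...meta,
      input_tokens: result.usage?.input_tokens ?? null,
      output_tokens: result.usage?.output_tokens ?? null,
      latency_ms: Date.now() - started,
      success: true,
      error: null,
    });
    return result;
  } catch (err) {
    sink({
      ...meta,
      input_tokens: null,
      output_tokens: null,
      latency_ms: Date.now() - started,
      success: false,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}
```

The wrapper deliberately does not swallow errors: monitoring should observe failures, not change control flow.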

Request Tracing

ADR-013 already puts correlation_id on domain events. Extend this to full request tracing, so that every step triggered by one inbound request carries the same ID:

Twilio webhook → correlationId generated
  → processInboundMessage (same correlationId)
    → identifyTenant (same correlationId)
    → LLM tool use call (same correlationId)
    → createIssue (same correlationId)
    → emit domain event (same correlationId)
    → notifyLandlord (same correlationId)

When something fails, search by correlationId to see the full chain.
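
One way to carry the same correlationId through the chain without threading it through every function signature is Node's `AsyncLocalStorage`. This is a sketch under assumptions: it presumes the handlers run on a Node.js runtime, `withCorrelation` and `currentCorrelationId` are hypothetical helpers, and the step functions are stand-ins for the real ones named in the chain above. Passing the ID explicitly as a parameter is an equally valid alternative.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

interface RequestContext {
  correlationId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Runs `handler` inside a context that holds one correlationId.
// Reuses an incoming ID if the caller supplies one, otherwise generates a UUID.
function withCorrelation<T>(
  handler: () => Promise<T>,
  incomingId?: string,
): Promise<T> {
  const correlationId = incomingId ?? randomUUID();
  return requestContext.run({ correlationId }, handler);
}

// Readable from any function called (directly or via await) inside withCorrelation.
function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId;
}

// Stand-ins for steps in the chain; each returns the ID it observed.
async function identifyTenant(): Promise<string | undefined> {
  return currentCorrelationId();
}

async function createIssue(): Promise<string | undefined> {
  await new Promise<void>((resolve) => setTimeout(resolve, 1)); // simulate I/O
  return currentCorrelationId();
}

async function processInboundMessage(): Promise<Array<string | undefined>> {
  return [await identifyTenant(), await createIssue()];
}
```

A webhook handler would wrap its body in `withCorrelation(...)`, and the logger would read `currentCorrelationId()` when building each entry. Work handed off to a queue or another service leaves this context, so the ID still has to be written into the event payload, as ADR-013 already does for domain events.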

Minimum Viable Observability

For pre-deployment:

  1. Sentry SDK — Install, configure source maps, catch unhandled errors
  2. Webhook health check — Log every webhook receipt, alert if none received in 1 hour
  3. LLM call logging — Structured log of every LLM call (provider, tokens, latency, success)
  4. Health endpoint: /api/health checks DB + at least one LLM provider available

Add structured logging, request tracing, and business metrics as the product scales.
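
The health rule in item 4 (DB reachable and at least one LLM provider available) can be expressed independently of any web framework. In this sketch `Probe`, `HealthReport`, and `checkHealth` are hypothetical names, and what each probe actually does (for example a trivial DB query) is left to the caller.

```typescript
// A probe resolves if the dependency is reachable and rejects (or throws) otherwise.
type Probe = () => Promise<void>;

interface HealthReport {
  healthy: boolean;
  httpStatus: 200 | 503;
  checks: Record<string, 'ok' | 'fail'>;
}

// Converts a probe outcome to a boolean; a throwing probe counts as a failure.
async function passes(probe: Probe): Promise<boolean> {
  try {
    await probe();
    return true;
  } catch {
    return false;
  }
}

// Healthy means: the DB probe passes AND at least one LLM provider probe passes.
async function checkHealth(
  db: Probe,
  llmProviders: Record<string, Probe>,
): Promise<HealthReport> {
  const entries = Object.entries(llmProviders);
  // Run all probes in parallel so one slow dependency does not delay the others.
  const [dbOk, llmOk] = await Promise.all([
    passes(db),
    Promise.all(entries.map(([, probe]) => passes(probe))),
  ]);

  const checks: Record<string, 'ok' | 'fail'> = { db: dbOk ? 'ok' : 'fail' };
  entries.forEach(([name], i) => {
    checks[`llm:${name}`] = llmOk[i] ? 'ok' : 'fail';
  });

  const healthy = dbOk && llmOk.some(Boolean);
  return { healthy, httpStatus: healthy ? 200 : 503, checks };
}
```

The /api/health route would call `checkHealth` and respond with the report body and `httpStatus`. Probes should carry their own timeouts; none are added here.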