ADR-021: Observability & Error Handling
Status: Planning
Owner: @bilal @deen
Date: 2026-02-15
Why This Needs an ADR
Envo handles tenant emergencies (gas leaks, flooding, no heating). If the system fails silently — a webhook doesn’t process, an LLM call times out, an event doesn’t publish — tenants don’t get help and landlords don’t know.
Currently:
- Sentry DSN is listed as “optional” in the CI/CD doc
- No structured logging strategy
- ADR-013 has correlation IDs but no unified tracing
- No LLM cost/latency monitoring
- No alerting for critical failures
- console.error is the primary error handling pattern
What Needs Monitoring
Tier 1: Safety-Critical (must alert immediately)
| Signal | Why | Alert |
|---|---|---|
| Webhook processing failures (Twilio, VAPI) | Tenant messages lost | Slack + SMS to on-call |
| Emergency detection failures | Gas leak not flagged | Slack + SMS |
| Issue creation failures | Tenant reported but nothing logged | Slack |
| LLM provider complete outage | No AI responses at all | Slack |
Tier 2: Operational (alert within hours)
| Signal | Why | Alert |
|---|---|---|
| Event publish failures > 10% | n8n/notifications degraded | Slack |
| LLM latency p95 > 5s | Tenant experience degraded | Slack |
| Error rate > 5% on any endpoint | Something broken | Slack |
| Cron job missed scheduled run | Cleanup/alerts not happening | Slack |
| Database connection pool exhaustion | App will fail soon | Slack |
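The three percentage and latency thresholds in the table above could be evaluated along these lines. This is a minimal sketch, not the team's implementation: the `WindowStats` shape, the signal names, and the choice of a nearest-rank percentile are all assumptions made for illustration, and each stats object is assumed to cover one endpoint over one evaluation window.

```typescript
// Hypothetical shape for metrics collected over one evaluation window.
interface WindowStats {
  llmLatenciesMs: number[];
  requests: number;
  requestErrors: number;
  publishAttempts: number;
  publishFailures: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Safe ratio: an empty window counts as 0% rather than NaN.
function rate(part: number, total: number): number {
  return total === 0 ? 0 : part / total;
}

// Returns the (assumed) names of the Tier 2 signals that breached their threshold.
function tier2Breaches(stats: WindowStats): string[] {
  const breaches: string[] = [];
  if (rate(stats.publishFailures, stats.publishAttempts) > 0.1) {
    breaches.push("event_publish_failures"); // Event publish failures > 10%
  }
  if (percentile(stats.llmLatenciesMs, 95) > 5000) {
    breaches.push("llm_latency_p95"); // LLM latency p95 > 5s
  }
  if (rate(stats.requestErrors, stats.requests) > 0.05) {
    breaches.push("endpoint_error_rate"); // Error rate > 5%
  }
  return breaches;
}
```

The comparisons are strict, so a window sitting exactly on a threshold (for example, exactly 5% errors) does not alert.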
Tier 3: Business Intelligence (daily/weekly review)
| Signal | Why |
|---|---|
| LLM cost per day/week | Budget tracking |
| LLM token usage by model | Optimisation |
| Conversation volume by channel | Growth tracking |
| Issue creation rate | Product health |
| RAG retrieval quality metrics | AI quality |
Options for Error Tracking
| Option | Cost | Effort | Notes |
|---|---|---|---|
| Sentry (free tier) | Free (5K events/mo) | Low — SDK install | Best DX, source maps, breadcrumbs |
| Vercel Analytics + Logs | Included in Pro | Zero | Basic, no alerting |
| Axiom (via Vercel integration) | Free tier | Low | Good for logs, integrates with Vercel |
| BetterStack (Logtail) | Free tier | Low | Structured logging + uptime monitoring |
Recommendation: Sentry for errors + Vercel logs for general observability. Add structured logging later.
Structured Logging
Replace console.log/error with structured JSON logs:
// Before
console.error('Failed to process message', error)
// After
logger.error('webhook.process_failed', {
  channel: 'whatsapp',
  conversationId,
  error: error.message,
  correlationId, // From ADR-013
})
Minimum fields per log entry:
- timestamp, level, event (dot-notation like webhook.process_failed)
- correlationId (links related operations)
- organisationId (for multi-tenant debugging)
- userId or tenantId (who triggered it)
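One way to guarantee the minimum fields is to bind them once per request and have the logger stamp every entry. The sketch below is illustrative only: `createLogger`, `buildLogEntry`, the `LogContext` shape, and the injectable `sink` parameter are hypothetical names, not an existing library API. Unlike the snippet above, it takes correlationId from the bound context rather than from each call.

```typescript
type LogLevel = "info" | "warn" | "error";

// Fields that every entry must carry (per the minimum-fields list).
interface LogContext {
  correlationId: string;
  organisationId: string;
  userId?: string;
  tenantId?: string;
}

// Builds one structured entry. Context and core fields are spread last,
// so per-call fields cannot overwrite them.
function buildLogEntry(
  level: LogLevel,
  event: string,
  context: LogContext,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): Record<string, unknown> {
  return { ...fields, ...context, timestamp: now.toISOString(), level, event };
}

// Returns a logger bound to one request's context. `sink` defaults to
// stdout as one JSON object per line; it is injectable for testing.
function createLogger(
  context: LogContext,
  sink: (line: string) => void = (line) => console.log(line),
) {
  const at = (level: LogLevel) =>
    (event: string, fields: Record<string, unknown> = {}): void =>
      sink(JSON.stringify(buildLogEntry(level, event, context, fields)));
  return { info: at("info"), warn: at("warn"), error: at("error") };
}

// Usage: bind the context once at the webhook boundary, then log events.
const exampleLogger = createLogger({
  correlationId: "corr-example",
  organisationId: "org-example",
  tenantId: "tenant-example",
});
exampleLogger.error("webhook.process_failed", { channel: "whatsapp", error: "timeout" });
```

Binding the context up front means a call site cannot forget correlationId or organisationId, which is the failure mode the minimum-fields list is meant to prevent.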
LLM Monitoring
LLM calls are the most expensive and least predictable part of the stack. Each call should go through a wrapper that logs its cost, latency, and outcome:
// Wrapper around every LLM call
const result = await withLLMMonitoring({
  provider: 'anthropic',
  model: 'claude-haiku-4-5-20251001',
  purpose: 'intent_classification', // or 'tool_use', 'rag_response', 'rerank'
  organisationId,
}, async () => {
  return anthropic.messages.create(...)
})
// Logs: provider, model, purpose, input_tokens, output_tokens, latency_ms, success, error
This enables:
- Cost breakdown by purpose (how much does reranking cost vs responses?)
- Latency tracking per model
- Failure rate per provider (trigger fallback alerts)
- Per-org cost tracking (for BYOAK billing, ADR-014 White-Label & BYOAK)
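A possible shape for the withLLMMonitoring wrapper is sketched below. It assumes the wrapped call resolves to an object with an optional `usage: { input_tokens, output_tokens }` field, as Anthropic Messages API responses do. The third `emit` parameter and the `llm.call` event name are additions for illustration and are not part of the call shown above.

```typescript
interface LLMCallMeta {
  provider: string;
  model: string;
  purpose: string;
  organisationId: string;
}

interface LLMCallLog extends LLMCallMeta {
  input_tokens: number | null;
  output_tokens: number | null;
  latency_ms: number;
  success: boolean;
  error: string | null;
}

// Times the call, records token usage when the response reports it,
// and emits exactly one log entry whether the call succeeds or throws.
async function withLLMMonitoring<
  T extends { usage?: { input_tokens: number; output_tokens: number } },
>(
  meta: LLMCallMeta,
  call: () => Promise<T>,
  // Assumption: entries go to stdout as JSON unless a sink is injected.
  emit: (entry: LLMCallLog) => void = (entry) =>
    console.log(JSON.stringify({ event: "llm.call", ...entry })),
): Promise<T> {
  const startedAt = Date.now();
  try {
    const result = await call();
    emit({
      ...meta,
      input_tokens: result.usage?.input_tokens ?? null,
      output_tokens: result.usage?.output_tokens ?? null,
      latency_ms: Date.now() - startedAt,
      success: true,
      error: null,
    });
    return result;
  } catch (err) {
    emit({
      ...meta,
      input_tokens: null,
      output_tokens: null,
      latency_ms: Date.now() - startedAt,
      success: false,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err; // Rethrow so callers and any fallback logic still see the failure.
  }
}
```

The wrapper rethrows after logging, so monitoring never swallows a provider failure. Cost is left out on purpose: it can be derived downstream from model and token counts without hard-coding prices here.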
Request Tracing
ADR-013 already has correlation_id on domain events. Extend to full request tracing:
Twilio webhook → correlationId generated
  → processInboundMessage (same correlationId)
  → identifyTenant (same correlationId)
  → LLM tool use call (same correlationId)
  → createIssue (same correlationId)
  → emit domain event (same correlationId)
  → notifyLandlord (same correlationId)
When something fails, search by correlationId to see the full chain.
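One way to carry the same correlationId through the chain without adding a parameter to every function is Node's AsyncLocalStorage. The sketch below assumes the handlers run on the Node.js runtime. `runWithCorrelation`, `currentCorrelationId`, and the stub steps are hypothetical stand-ins for the real functions named in the chain.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

const correlationStore = new AsyncLocalStorage<{ correlationId: string }>();

// Runs `fn` with a correlationId visible to everything it awaits.
// An inbound ID (e.g. from an upstream event) is reused if provided.
function runWithCorrelation<T>(fn: () => Promise<T>, inboundId?: string): Promise<T> {
  return correlationStore.run({ correlationId: inboundId ?? randomUUID() }, fn);
}

// Reads the ID for the current async call chain; "unknown" outside one.
function currentCorrelationId(): string {
  return correlationStore.getStore()?.correlationId ?? "unknown";
}

// Stub steps: real versions would attach the ID to logs and domain events.
async function identifyTenant(): Promise<string> {
  return currentCorrelationId();
}

async function createIssue(): Promise<string> {
  return currentCorrelationId();
}

// Stand-in for the webhook entry point: the ID is generated once here
// and every downstream step observes the same value.
async function handleTwilioWebhook(): Promise<string[]> {
  return runWithCorrelation(async () => {
    const atEntry = currentCorrelationId();
    const atIdentify = await identifyTenant();
    const atCreate = await createIssue();
    return [atEntry, atIdentify, atCreate];
  });
}
```

Explicitly passing correlationId as an argument, as the logging snippet earlier in this ADR does, works equally well and is easier to follow in small codebases. The store-based approach mainly helps once the call chain gets deep.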
Minimum Viable Observability
For pre-deployment:
- Sentry SDK — Install, configure source maps, catch unhandled errors
- Webhook health check — Log every webhook receipt, alert if none received in 1 hour
- LLM call logging — Structured log of every LLM call (provider, tokens, latency, success)
- Health endpoint — /api/health checks DB + at least one LLM provider available
Add structured logging, request tracing, and business metrics as the product scales.
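The health endpoint logic from the list above could look roughly like this. It is a sketch under stated assumptions: the `Check` type, the injected `db` and `llmProviders` probes, the 2-second default timeout, and the `ok`/`degraded` response shape are all hypothetical. The real handler would wrap `checkHealth` in whatever route framework serves /api/health.

```typescript
// A probe resolves if the dependency is reachable and rejects otherwise.
type Check = () => Promise<void>;

interface HealthReport {
  status: "ok" | "degraded";
  httpStatus: 200 | 503;
  db: boolean;
  llmProviders: Record<string, boolean>;
}

// Runs one probe with a timeout; never rejects, so one hung
// dependency cannot hang the health endpoint itself.
async function passes(check: Check, timeoutMs: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("health check timed out")), timeoutMs);
  });
  try {
    await Promise.race([check(), timeout]);
    return true;
  } catch {
    return false;
  } finally {
    clearTimeout(timer);
  }
}

// Healthy means: the DB responds AND at least one LLM provider responds.
async function checkHealth(
  deps: { db: Check; llmProviders: Record<string, Check> },
  timeoutMs = 2000,
): Promise<HealthReport> {
  // Start all probes before awaiting any, so they run concurrently.
  const dbProbe = passes(deps.db, timeoutMs);
  const llmProbes = Object.entries(deps.llmProviders).map(
    async ([name, check]) => [name, await passes(check, timeoutMs)] as const,
  );

  const db = await dbProbe;
  const llmProviders: Record<string, boolean> = {};
  for (const [name, ok] of await Promise.all(llmProbes)) llmProviders[name] = ok;

  const healthy = db && Object.values(llmProviders).some(Boolean);
  return {
    status: healthy ? "ok" : "degraded",
    httpStatus: healthy ? 200 : 503,
    db,
    llmProviders,
  };
}
```

Returning per-provider results alongside the overall status lets an uptime monitor alert on the 503 while the response body shows which dependency is down.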
Related
- ADR-013 Event-Driven Architecture (correlation IDs)
- ADR-017 RAG Pipeline v2 (evaluation framework)
- ADR-020 Background Jobs & Scheduled Tasks (job monitoring)
- Infrastructure
- Costs & Usage