Provider-Agnostic LLM Architecture
Status: Draft
Created: 2026-04-15
Depends on: None (can start immediately)
Feature flags: None (infrastructure change, transparent to end users)
Why
The LLM layer has two problems:
- Vendor lock-in. The tenant engine (safety-critical, tenant-facing) and CSV column mapping both use the Anthropic SDK directly — no fallback, no redundancy. If Anthropic has an outage or rate-limits us, tenant conversations stop working entirely.
- Cost. Claude Sonnet handles every RAG generation, intent classification, and conversation summary. These are high-volume, low-complexity calls that don't need a frontier model. Open-source models via Fireworks or Cloudflare Workers AI cost 5-20x less.
Current call sites and their abstraction status:
| Call Site | File | Provider | Abstracted? | Tool-Use? |
|---|---|---|---|---|
| RAG response generation | lib/rag/generate.ts | Multi-provider fallback | Yes | No |
| Intent classification | lib/rag/intent.ts | Multi-provider fallback | Yes | No |
| Conversation summaries | lib/rag/summary.ts | Multi-provider fallback | Yes | No |
| Tenant engine orchestrator | lib/tenant-engine/process.ts | Direct Anthropic SDK | No | Yes (5 tools) |
| CSV column mapping | app/api/import/map-columns/route.ts | Direct Anthropic SDK | No | No |
| Embeddings | lib/rag/embed.ts | Direct OpenAI | No | No |
The tenant engine is the hardest to migrate — it uses Anthropic’s native tool-use format with 5 tools (ask_for_details, ask_for_photo, create_issue, respond, escalate) and model routing between Haiku and Sonnet based on conversation state.
What Ships
- Vercel AI SDK (`ai`) as the application-layer abstraction — unified tool definitions via Zod schemas, provider-agnostic `generateText`/`streamText`, automatic format translation between Anthropic/OpenAI tool-use formats
- Cloudflare AI Gateway as the proxy layer — request caching, rate limiting, analytics, and fallback chains. Free. Already in our Cloudflare account.
- New providers: Fireworks AI + Cloudflare Workers AI — fast, cheap inference for non-critical paths
- Provider-agnostic tenant engine — tool-use definitions work across providers, with per-provider fallback
- Fallback chains with exponential backoff — every LLM call site has a ranked provider list with automatic failover
- Cost routing — critical paths (tenant engine) use frontier models; high-volume paths (RAG, summaries) use cheap open-source models
Architecture
Three-Layer Stack
```text
┌──────────────────────────────────────────────────────────┐
│ Application Layer                                        │
│                                                          │
│ Vercel AI SDK (@ai-sdk)                                  │
│ ├── Unified tool definitions (Zod → any provider format) │
│ ├── generateText / streamText (provider-agnostic)        │
│ ├── Provider registry (createProviderRegistry)           │
│ └── Middleware hooks (logging, metrics, guardrails)      │
└────────────────────────┬─────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────┐
│ Gateway Layer                                            │
│                                                          │
│ Cloudflare AI Gateway (FREE)                             │
│ ├── Response caching (repeated RAG queries)              │
│ ├── Rate limiting (cost protection)                      │
│ ├── Request/token/cost analytics                         │
│ ├── Fallback chains (Universal Endpoint)                 │
│ └── cf-aig-step header (which provider handled it)       │
│                                                          │
│ URL: gateway.ai.cloudflare.com/v1/{account}/{gw}/{prov}  │
└────────────────────────┬─────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────┐
│ Provider Layer                                           │
│                                                          │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐  │
│ │ Anthropic   │ │ Fireworks AI │ │ CF Workers AI      │  │
│ │ Claude      │ │ Llama 3.1    │ │ Qwen3 / Mistral    │  │
│ │ Sonnet/Haiku│ │ 70B / 8B     │ │ Native, near-zero  │  │
│ └─────────────┘ └──────────────┘ └────────────────────┘  │
│                                                          │
│ ┌─────────────┐ ┌──────────────┐                         │
│ │ OpenAI      │ │ Stub         │                         │
│ │ GPT-4o      │ │ Local dev    │                         │
│ └─────────────┘ └──────────────┘                         │
└──────────────────────────────────────────────────────────┘
```
Why These Choices
Vercel AI SDK over OpenRouter SDK:
OpenRouter is a unified API (300+ models, one key, OpenAI-compatible) — attractive but adds a middleman. The 5.5% platform fee compounds at scale. More importantly, it doesn’t solve the tool-use portability problem: you still need to handle Anthropic vs OpenAI tool-use format differences in your application code.
The Vercel AI SDK is a client-side library (free, MIT) that normalises tool-use definitions across all providers. Define a tool once with Zod, call any model. It handles the Anthropic input_schema ↔ OpenAI function.parameters translation automatically. This is the actual problem we need to solve.
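To make that translation concrete, here is a minimal sketch of the mapping the SDK performs under the hood: the same JSON Schema travels as Anthropic's `input_schema` or OpenAI's `function.parameters`. The type and helper names here are illustrative, not AI SDK API.

```typescript
// Illustrative sketch of Anthropic ↔ OpenAI tool-format translation.
// Helper names are hypothetical; the AI SDK does this internally.
type JsonSchema = Record<string, unknown>

interface AnthropicTool {
  name: string
  description?: string
  input_schema: JsonSchema
}

interface OpenAITool {
  type: 'function'
  function: { name: string; description?: string; parameters: JsonSchema }
}

function anthropicToOpenAI(tool: AnthropicTool): OpenAITool {
  return {
    type: 'function',
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.input_schema,
    },
  }
}

function openAIToAnthropic(tool: OpenAITool): AnthropicTool {
  return {
    name: tool.function.name,
    description: tool.function.description,
    input_schema: tool.function.parameters,
  }
}
```

The point is that the schema payload is identical in both formats; only the envelope differs, which is why a Zod definition can target either provider.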
Cloudflare AI Gateway over Portkey/LiteLLM:
- Portkey: $49/month production tier, another vendor. AI Gateway is free, same account.
- LiteLLM: Python server — can’t run in Workers, requires separate infrastructure.
- AI Gateway: Free, already in our Cloudflare account, native to Workers. Gives us caching, rate limiting, analytics, and fallback chains without any additional infra.
Fireworks AI over Cerebras:
- Cerebras: extremely fast (custom silicon) but limited model selection, less mature tool-use.
- Fireworks: fast (speculative decoding), broad model selection (Llama 3.1/3.3, Mixtral, firefunction), mature tool-use, OpenAI-compatible API.
Workers AI as free fallback:
We’re already on Cloudflare. Workers AI runs on their GPUs, callable via native binding (zero network hop). Free tier: 10,000 Neurons/day. Qwen3 30B and Mistral Small 3.1 24B both support function calling. Perfect as the last-resort fallback before stub.
Provider Routing by Use Case
Tenant Engine (tool-use, safety-critical)
Primary: Anthropic Claude Sonnet (best tool-use reliability)
Secondary: Fireworks Llama 3.1 70B (good tool-use, fast, 10x cheaper)
Tertiary: Workers AI Qwen3 30B (free, native, decent tool-use)
Last: Stub (dev only)
Model routing within the tenant engine stays:
- Haiku equivalent for gathering turns (when `canCreateIssue === false`)
- Sonnet equivalent for orchestration turns (when tools include `create_issue`)
When falling back to Fireworks/Workers AI, both tiers use the same model (no Haiku/Sonnet split for open-source — the cost difference isn’t significant enough to justify the complexity).
RAG Response Generation (high-volume, non-critical)
Primary: Fireworks Llama 3.1 70B (fast, cheap — $0.20/M input tokens)
Secondary: Workers AI Qwen3 30B (free tier)
Tertiary: Anthropic Claude Haiku (reliable but more expensive)
Last: Stub
Intent Classification (high-volume, latency-sensitive)
Primary: Workers AI Mistral Small 3.1 24B (free, native, fast)
Secondary: Fireworks Llama 3.1 8B (very fast, very cheap)
Tertiary: Anthropic Claude Haiku (fallback)
Last: Quick-classify (regex, always available)
Note: the existing quick-classify (regex/keyword) already handles 85%+ of intents at 0ms. The LLM path only fires when confidence < 0.85. This is already cost-efficient.
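A pure-logic sketch of that two-stage flow, assuming hypothetical names (`quickClassify`, the pattern list, and the stub signature are illustrative, not the real `lib/rag/intent.ts` code):

```typescript
// Two-stage intent classification: regex quick-classify first,
// LLM chain only when confidence falls below the 0.85 threshold.
interface IntentResult { intent: string; confidence: number }

// Hypothetical patterns for illustration
const QUICK_PATTERNS: Array<{ intent: string; re: RegExp }> = [
  { intent: 'report_issue', re: /\b(leak|broken|not working|repair)\b/i },
  { intent: 'greeting', re: /^(hi|hello|hey)\b/i },
]

function quickClassify(message: string): IntentResult {
  for (const { intent, re } of QUICK_PATTERNS) {
    if (re.test(message)) return { intent, confidence: 0.95 }
  }
  return { intent: 'unknown', confidence: 0 }
}

async function classifyIntent(
  message: string,
  llmClassify: (m: string) => Promise<IntentResult>
): Promise<IntentResult> {
  const quick = quickClassify(message)
  // Hot path stays at 0ms; the LLM fires only on low confidence.
  return quick.confidence >= 0.85 ? quick : llmClassify(message)
}
```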
Conversation Summaries (batch, non-critical)
Primary: Fireworks Llama 3.1 8B (summaries don't need big models)
Secondary: Workers AI Qwen3 30B (free)
Tertiary: Anthropic Claude Haiku
Last: Stub
CSV Column Mapping (low-volume, batch)
Primary: Fireworks Llama 3.1 70B (smart enough, cheap)
Secondary: Anthropic Claude Sonnet (reliable fallback)
Last: Return empty mappings (user maps manually)
Embeddings (unchanged)
Primary: OpenAI text-embedding-3-small ($0.02/M tokens — already cheap)
Fallback: Stub embeddings (dev only)
No reason to change. Neither Fireworks nor Workers AI offers a compelling embedding alternative at this price point.
Estimated Cost Impact
Current (all Anthropic)
| Use case | Model | Estimated monthly volume | Cost/M input | Cost/M output | Monthly est. |
|---|---|---|---|---|---|
| Tenant engine | Sonnet | ~10,000 calls | $3.00 | $15.00 | ~$60-150 |
| RAG generation | Sonnet | ~5,000 calls | $3.00 | $15.00 | ~$30-75 |
| Intent (LLM path) | Haiku | ~2,000 calls | $0.25 | $1.25 | ~$2-5 |
| Summaries | Sonnet | ~3,000 calls | $3.00 | $15.00 | ~$15-45 |
| Column mapping | Sonnet | ~100 calls | $3.00 | $15.00 | ~$1-2 |
| Total | | | | | ~$108-277 |
After migration
| Use case | Model | Cost/M input | Cost/M output | Monthly est. |
|---|---|---|---|---|
| Tenant engine | Sonnet (primary) | $3.00 | $15.00 | ~$50-120 (reduced via Haiku routing) |
| RAG generation | Fireworks Llama 3.1 70B | $0.20 | $0.20 | ~$2-4 |
| Intent (LLM path) | Workers AI Mistral 3.1 | Free (10k/day) | Free | ~$0 |
| Summaries | Fireworks Llama 3.1 8B | $0.05 | $0.08 | ~$0.50 |
| Column mapping | Fireworks Llama 3.1 70B | $0.20 | $0.20 | ~$0.10 |
| Total | | | | ~$53-125 |
Savings: ~50-55%, primarily from moving RAG/summaries off Sonnet. The tenant engine stays mostly on Anthropic (highest reliability for safety-critical tool-use) but gains redundancy.
As we scale to 3,000+ properties, the savings compound — RAG and summary volume grows linearly with property count.
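The table's per-row estimates back out from a simple per-call cost model. The token counts below are illustrative assumptions (roughly 2K input / 500 output per RAG call), not measured figures:

```typescript
// Monthly cost = calls × (inputTokens × inPrice + outputTokens × outPrice) / 1M
function monthlyCostUSD(
  calls: number,
  inTokensPerCall: number,
  outTokensPerCall: number,
  inPricePerM: number,
  outPricePerM: number
): number {
  const perCall =
    (inTokensPerCall * inPricePerM + outTokensPerCall * outPricePerM) / 1_000_000
  return calls * perCall
}

// RAG on Fireworks Llama 3.1 70B: ~5,000 calls at ~2K in / 500 out tokens ≈ $2.50
const ragFireworks = monthlyCostUSD(5_000, 2_000, 500, 0.2, 0.2)
// Same assumed traffic on Sonnet for comparison ≈ $67.50
const ragSonnet = monthlyCostUSD(5_000, 2_000, 500, 3.0, 15.0)
```

Under those assumptions the Fireworks figure lands inside the table's ~$2-4 range and the Sonnet figure inside ~$30-75, so the ranges are internally consistent.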
Implementation Detail
1. New Dependencies
```bash
pnpm add ai @ai-sdk/anthropic @ai-sdk/openai
```

The `@ai-sdk/openai` provider works with any OpenAI-compatible API (Fireworks, Workers AI) via `createOpenAI({ baseURL })`.
2. Provider Registry
New file: lib/ai/providers.ts
```ts
// createOpenAI comes from @ai-sdk/openai, not the core ai package
import { createAnthropic } from '@ai-sdk/anthropic'
import { createOpenAI } from '@ai-sdk/openai'

export function createProviders(env: Record<string, string | undefined>) {
  // AI Gateway base URL (all providers proxied through Cloudflare)
  const gatewayBase = `https://gateway.ai.cloudflare.com/v1/${env.CF_AI_GATEWAY_ACCOUNT_ID}/${env.CF_AI_GATEWAY_ID}`

  const anthropic = env.ANTHROPIC_API_KEY
    ? createAnthropic({
        apiKey: env.ANTHROPIC_API_KEY,
        baseURL: `${gatewayBase}/anthropic/v1`,
      })
    : null

  const fireworks = env.FIREWORKS_API_KEY
    ? createOpenAI({
        apiKey: env.FIREWORKS_API_KEY,
        baseURL: `${gatewayBase}/fireworks-ai/v1`,
        name: 'fireworks',
      })
    : null

  const workersAi = createOpenAI({
    apiKey: 'workers-ai', // Dummy — auth via CF service binding
    baseURL: `${gatewayBase}/workers-ai/v1`,
    name: 'workers-ai',
  })

  const openai = env.OPENAI_API_KEY
    ? createOpenAI({
        apiKey: env.OPENAI_API_KEY,
        baseURL: `${gatewayBase}/openai/v1`,
      })
    : null

  return { anthropic, fireworks, workersAi, openai }
}
```

3. Fallback Utility
New file: lib/ai/fallback.ts
```ts
import { generateText, type LanguageModel, type GenerateTextResult } from 'ai'

interface FallbackOptions {
  models: Array<{ model: LanguageModel; name: string }>
  maxRetries?: number      // Per-model retries (default: 2)
  initialDelayMs?: number  // Backoff start (default: 1000)
  maxDelayMs?: number      // Backoff cap (default: 5000)
}

interface FallbackResult<T> {
  result: T
  provider: string
  attemptIndex: number
  totalAttempts: number
}

export async function generateWithFallback(
  options: FallbackOptions & Omit<Parameters<typeof generateText>[0], 'model'>
): Promise<FallbackResult<GenerateTextResult<Record<string, never>>>> {
  const {
    models,
    maxRetries = 2,
    initialDelayMs = 1000,
    maxDelayMs = 5000,
    ...generateOptions
  } = options

  const errors: Array<{ provider: string; error: unknown }> = []

  for (let i = 0; i < models.length; i++) {
    const { model, name } = models[i]
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        const result = await generateText({ ...generateOptions, model })
        return {
          result,
          provider: name,
          attemptIndex: i,
          // errors already holds every failed attempt, so total = failures + this success
          totalAttempts: errors.length + 1,
        }
      } catch (error) {
        console.warn(`[ai-fallback] ${name} attempt ${attempt + 1} failed:`, error)
        errors.push({ provider: name, error })
        if (attempt < maxRetries) {
          const delay = Math.min(initialDelayMs * 2 ** attempt, maxDelayMs)
          await new Promise(r => setTimeout(r, delay))
        }
      }
    }
  }

  throw new AggregateError(
    errors.map(e => e.error),
    `All ${models.length} providers failed: ${errors.map(e => e.provider).join(', ')}`
  )
}
```

4. Tenant Engine Migration (tool-use)
Modified: lib/tenant-engine/process.ts
Before (Anthropic-specific):
```ts
const CONVERSATION_TOOLS: Anthropic.Tool[] = [
  {
    name: 'create_issue',
    input_schema: {
      type: 'object',
      properties: {
        description: { type: 'string' },
        category: { type: 'string', enum: ['PLUMBING', ...] },
        urgency: { type: 'string', enum: ['LOW', 'MEDIUM', 'HIGH'] },
        confirmation_message: { type: 'string' },
      },
      required: ['description', 'category', 'urgency', 'confirmation_message'],
    },
  },
  // ... 4 more tools
]

const response = await anthropic.messages.create({
  model,
  tools: availableTools,
  messages,
})

const toolUseBlock = response.content.find(b => b.type === 'tool_use')
switch (toolUseBlock.name) { ... }
```

After (provider-agnostic via AI SDK):
```ts
import { generateText, tool } from 'ai'
import { z } from 'zod'

const CONVERSATION_TOOLS = {
  create_issue: tool({
    description: 'Create maintenance issue and notify landlord',
    parameters: z.object({
      description: z.string(),
      category: z.enum([
        'PLUMBING', 'ELECTRICAL', 'HEATING', 'STRUCTURAL', 'APPLIANCES',
        'CLEANING', 'PEST_CONTROL', 'LOCKS_SECURITY', 'WINDOWS_DOORS',
        'GARDEN_EXTERIOR', 'FIRE_SAFETY', 'DAMP_MOULD', 'NOISE_NUISANCE', 'OTHER',
      ]),
      urgency: z.enum(['LOW', 'MEDIUM', 'HIGH']),
      confirmation_message: z.string(),
    }),
  }),
  ask_for_details: tool({
    description: 'Ask tenant for more specifics about their issue',
    parameters: z.object({
      message: z.string(),
      missing_info: z.array(z.string()),
    }),
  }),
  ask_for_photo: tool({
    description: 'Request photo/video evidence after gathering details',
    parameters: z.object({
      message: z.string(),
    }),
  }),
  respond: tool({
    description: 'Send conversational response to tenant',
    parameters: z.object({
      message: z.string(),
    }),
  }),
  escalate: tool({
    description: 'Escalate conversation to human staff',
    parameters: z.object({
      message: z.string(),
      reason: z.string(),
    }),
  }),
}

// Small local helper (or lodash omit): copy without the listed keys
function omit<T extends object>(obj: T, keys: Array<keyof T>) {
  return Object.fromEntries(
    Object.entries(obj).filter(([k]) => !keys.includes(k as keyof T))
  )
}

// Tool-use with fallback
const { result, provider } = await generateWithFallback({
  models: getToolUseProviderChain(canCreateIssue),
  system: systemPrompt,
  messages,
  tools: canCreateIssue
    ? CONVERSATION_TOOLS
    : omit(CONVERSATION_TOOLS, ['create_issue']),
  temperature: 0,
  maxTokens: 1024,
})

// AI SDK normalises tool calls across providers
const toolCall = result.toolCalls[0]
if (toolCall) {
  switch (toolCall.toolName) {
    case 'create_issue':
      return await handleToolCreateIssue(toolCall.args, ...)
    case 'ask_for_details':
      return await handleToolAskForDetails(toolCall.args, ...)
    // ...
  }
}
```

5. Provider Chain Helpers
New file: lib/ai/chains.ts
```ts
import type { LanguageModel } from 'ai'
import { createProviders } from './providers'

// Tenant engine: reliability > cost
export function getToolUseProviderChain(
  needsStrongModel: boolean
): Array<{ model: LanguageModel; name: string }> {
  const { anthropic, fireworks, workersAi } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
  if (anthropic) {
    chain.push({
      model: anthropic(needsStrongModel ? 'claude-sonnet-4-20250514' : 'claude-haiku-4-5-20251001'),
      name: needsStrongModel ? 'anthropic-sonnet' : 'anthropic-haiku',
    })
  }
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-70b-instruct'),
      name: 'fireworks-llama-70b',
    })
  }
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
  return chain
}

// RAG/summaries: cost > reliability
export function getCheapProviderChain(): Array<{ model: LanguageModel; name: string }> {
  const { fireworks, workersAi, anthropic } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-70b-instruct'),
      name: 'fireworks-llama-70b',
    })
  }
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
  if (anthropic) {
    chain.push({
      model: anthropic('claude-haiku-4-5-20251001'),
      name: 'anthropic-haiku',
    })
  }
  return chain
}

// Summaries: cheapest possible
export function getSummaryProviderChain(): Array<{ model: LanguageModel; name: string }> {
  const { fireworks, workersAi, anthropic } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-8b-instruct'),
      name: 'fireworks-llama-8b',
    })
  }
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
  if (anthropic) {
    chain.push({
      model: anthropic('claude-haiku-4-5-20251001'),
      name: 'anthropic-haiku',
    })
  }
  return chain
}
```

6. Cloudflare AI Gateway Setup
Dashboard configuration (one-time):
- Cloudflare Dashboard → AI → AI Gateway → Create Gateway
- Name: `envo-llm`
- Enable: Caching (TTL: 300s), Rate Limiting (100 req/min), Logging
Gateway URL pattern:
```text
https://gateway.ai.cloudflare.com/v1/{account_id}/envo-llm/{provider}

# Anthropic:   .../envo-llm/anthropic/v1/messages
# OpenAI:      .../envo-llm/openai/v1/chat/completions
# Fireworks:   .../envo-llm/fireworks-ai/v1/chat/completions
# Workers AI:  .../envo-llm/workers-ai/v1/chat/completions
```
Universal Endpoint (gateway-level fallback):
```jsonc
// POST https://gateway.ai.cloudflare.com/v1/{account_id}/envo-llm
// Body: array of provider configs — gateway tries them in order
[
  {
    "provider": "fireworks-ai",
    "endpoint": "v1/chat/completions",
    "headers": { "Authorization": "Bearer ${FIREWORKS_API_KEY}" },
    "query": { "model": "accounts/fireworks/models/llama-v3p1-70b-instruct", ... }
  },
  {
    "provider": "workers-ai",
    "endpoint": "v1/chat/completions",
    "query": { "model": "@cf/qwen/qwen3-30b", ... }
  },
  {
    "provider": "anthropic",
    "endpoint": "v1/messages",
    "headers": { "x-api-key": "${ANTHROPIC_API_KEY}" },
    "query": { "model": "claude-haiku-4-5-20251001", ... }
  }
]
```

The `cf-aig-step` response header tells you which provider handled the request (0 = primary, 1 = first fallback, and so on) — useful for monitoring.
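A small sketch of how that header could feed monitoring: map the step index back to a provider name, given the chain order configured at the gateway. The helper name and chain constant are hypothetical, not Cloudflare API:

```typescript
// Map the cf-aig-step response header to the provider that served the
// request. Chain order mirrors the Universal Endpoint config above.
const UNIVERSAL_ENDPOINT_CHAIN = ['fireworks-ai', 'workers-ai', 'anthropic'] as const

function providerForStep(
  headers: Headers,
  chain: readonly string[] = UNIVERSAL_ENDPOINT_CHAIN
): string {
  // Treat a missing header as step 0 (primary handled it)
  const step = Number(headers.get('cf-aig-step') ?? '0')
  return chain[step] ?? 'unknown'
}
```

Tagging each request log with this value gives per-provider failover counts for free.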
7. Environment Variables
New variables in .env:
```bash
# Cloudflare AI Gateway
CF_AI_GATEWAY_ACCOUNT_ID=        # Cloudflare account ID
CF_AI_GATEWAY_ID=envo-llm        # Gateway name

# Fireworks AI
FIREWORKS_API_KEY=               # From fireworks.ai dashboard

# Workers AI (no key needed — uses CF service binding)
# Just needs the AI Gateway URL

# Routing config
LLM_TENANT_ENGINE_PROVIDER=anthropic  # Primary for tool-use
LLM_RAG_PROVIDER=fireworks            # Primary for RAG
LLM_SUMMARY_PROVIDER=fireworks        # Primary for summaries
```

Migration Plan
Phase 1: Foundation (no behaviour change)
- Install `ai`, `@ai-sdk/anthropic`, `@ai-sdk/openai`
- Create `lib/ai/providers.ts` — provider registry
- Create `lib/ai/fallback.ts` — `generateWithFallback` utility
- Create `lib/ai/chains.ts` — provider chain definitions
- Set up Cloudflare AI Gateway in dashboard (`envo-llm`)
- Add new env vars to `.env.example`
- Wire AI Gateway URLs as `baseURL` for existing Anthropic + OpenAI clients
- Verify all existing calls still work through the gateway (no behaviour change)
Phase 2: Migrate non-critical paths
- Migrate `lib/rag/generate.ts` to use AI SDK + `getCheapProviderChain()`
- Migrate `lib/rag/intent.ts` LLM path to use AI SDK
- Migrate `lib/rag/summary.ts` to use AI SDK + `getSummaryProviderChain()`
- Migrate `app/api/import/map-columns/route.ts` to use AI SDK
- Remove old provider files (`lib/rag/providers/claude.ts`, `openai.ts`, `kimi.ts`, `glm.ts`)
- Keep `lib/rag/providers/stub.ts` as dev fallback
- Verify RAG quality with Fireworks Llama 3.1 70B (compare sample outputs)
Phase 3: Migrate tenant engine (critical path)
- Rewrite tool definitions from `Anthropic.Tool[]` to Zod + AI SDK `tool()`
- Replace `anthropic.messages.create()` with `generateWithFallback()` + `getToolUseProviderChain()`
- Update tool response handling to use AI SDK's `result.toolCalls[0]`
- Preserve model routing logic (strong model for orchestration, light model for gathering)
- Test: run 50+ sample conversations through each provider and compare tool selection accuracy
- Test: verify emergency fast-path still bypasses LLM (keyword-only, no regression)
- Test: verify identity gating still blocks the `create_issue` tool for unconfirmed tenants
- Deploy behind feature flag — route 10% of traffic to Fireworks, monitor tool-use accuracy
- Gradually increase to 100% once confident
Phase 4: Cleanup + monitoring
- Remove `@anthropic-ai/sdk` direct imports (keep Anthropic as an AI SDK sub-provider)
- Remove the old `lib/rag/providers/` directory (replaced by `lib/ai/`)
- Update `lib/rag/config.ts` to reference the new provider chain names
- Add AI Gateway analytics to the ops dashboard (cost per provider, latency, error rate)
- Set up cost alerting (Cloudflare notifications if daily spend exceeds threshold)
- Document provider chain configuration in `.envo/learnings.md`
Files to Create
| File | Purpose |
|---|---|
| `lib/ai/providers.ts` | Provider registry — Anthropic, Fireworks, Workers AI, OpenAI |
| `lib/ai/fallback.ts` | `generateWithFallback()` with exponential backoff |
| `lib/ai/chains.ts` | Use-case-specific provider chains |
| `lib/ai/index.ts` | Public exports |
Files to Modify
| File | Change |
|---|---|
| `lib/tenant-engine/process.ts` | Replace direct Anthropic SDK with AI SDK + fallback |
| `app/api/import/map-columns/route.ts` | Replace direct Anthropic SDK with AI SDK + fallback |
| `lib/rag/generate.ts` | Replace `completeWithFallback` with AI SDK |
| `lib/rag/intent.ts` | Replace LLM path with AI SDK |
| `lib/rag/summary.ts` | Replace `completeWithFallback` with AI SDK |
| `lib/rag/config.ts` | Update model references |
| `package.json` | Add `ai`, `@ai-sdk/anthropic`, `@ai-sdk/openai` |
| `.env.example` | Add Fireworks + AI Gateway env vars |
Files to Delete
| File | Reason |
|---|---|
| `lib/rag/providers/claude.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/openai.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/kimi.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/glm.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/index.ts` | Replaced by `lib/ai/fallback.ts` |
| `lib/rag/providers/types.ts` | Replaced by AI SDK types |
Keep: lib/rag/providers/stub.ts (adapt to AI SDK interface for dev fallback)
Open-Source Model Tool-Use Quality
Models ranked by tool-use reliability for our use case (multi-tool selection with structured output):
| Tier | Model | Size | Tool-Use Quality | Notes |
|---|---|---|---|---|
| S | Claude Sonnet | — | Excellent | Best-in-class. Current baseline. |
| A | Llama 3.1 70B | 70B | Good | Most battle-tested OSS model for tool-use. Available on Fireworks. |
| A | Qwen3 30B | 30B | Good | Strong function calling. Available on Workers AI natively. |
| A | Mistral Small 3.1 | 24B | Good | Parallel function support. Available on Workers AI. |
| B | Llama 3.3 70B | 70B | Good | Improved instruction following over 3.1. |
| B | Llama 4 Scout | 17B (MoE) | Decent | Multimodal + tool-use. New, less battle-tested. |
| C | Llama 3.1 8B | 8B | Adequate | Fine for simple single-tool calls (summaries). Unreliable for multi-tool. |
| C | Hermes 2 Pro | 7B | Adequate | Fine-tuned for function calling. Small but specialised. |
For the tenant engine, only Tier A+ models should be in the fallback chain. For RAG/summaries, Tier B-C is fine.
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Open-source tool-use picks wrong tool | Tenant gets wrong response, issue miscategorised | Gradual rollout (10% → 50% → 100%). Compare tool selection accuracy per provider. Keep Anthropic as primary for tool-use. |
| AI Gateway adds latency | Slower tenant responses | Gateway is on Cloudflare’s edge — expect <5ms overhead. Monitor via cf-aig-step timing. |
| Fireworks outage | RAG/summaries fail | Workers AI as free fallback (native, no external dependency). Anthropic as last resort. |
| Workers AI model quality degrades | Bad fallback responses | Workers AI is last resort, not primary. Monitor quality via logging. |
| AI SDK doesn’t support a provider feature | Can’t use provider-specific capabilities | AI SDK has provider-specific extensions. Worst case: drop to raw HTTP for that one call. |
| Tool definitions diverge between Zod and current Anthropic format | Behavioural regression | Side-by-side testing: run same prompts through old and new paths, compare tool selections. |
| Vercel AI SDK too heavy for Workers bundle | 10MB bundle limit exceeded | ai core is ~50KB. Provider packages are small. Should be fine. Verify during Phase 1. |
| Cache poisoning via AI Gateway | Stale/wrong responses served | Disable caching for tenant engine (tool-use). Only cache RAG queries (idempotent). |
What’s NOT in This Plan
- Embeddings migration — OpenAI `text-embedding-3-small` stays. Already cheap ($0.02/M tokens).
- Streaming responses — Current architecture doesn't stream to tenants (SMS/WhatsApp are request-response). Not needed yet.
- OpenRouter — Adds a middleman + 5.5% fee. Direct provider access via AI Gateway is cheaper and gives us more control. Reconsider if we need 50+ models.
- Self-hosted models — Not worth the ops burden at current scale. Revisit if we hit 10,000+ properties.
- Fine-tuning — Open-source models work well enough out of the box for our use cases. Revisit if tool-use accuracy on Llama is consistently <90%.
Verification
- Phase 1: All existing calls work through AI Gateway — no behaviour change, just proxy
- Phase 2: RAG responses from Fireworks are comparable quality to Claude (manual review of 20 sample outputs)
- Phase 3: Tenant engine tool selection accuracy on Fireworks Llama 3.1 70B is >95% match vs Claude on a test set of 50 conversations
- Phase 3: Emergency detection still works (keyword fast-path, no LLM regression)
- Phase 3: Identity gating blocks `create_issue` for unconfirmed tenants (no regression)
- Phase 4: AI Gateway analytics show cost reduction of 40-55%
- Phase 4: P99 latency for tenant engine stays under 3 seconds
- Phase 4: Zero-downtime during provider failover (test by temporarily disabling primary provider API key)