Provider-Agnostic LLM Architecture
Status: Draft
Created: 2026-04-15
Depends on: None (can start immediately)
Feature flags: None (infrastructure change, transparent to end users)
Why
The LLM layer has two problems:
- Vendor lock-in. The tenant engine (safety-critical, tenant-facing) and CSV column mapping both use the Anthropic SDK directly — no fallback, no redundancy. If Anthropic has an outage or rate-limits us, tenant conversations stop working entirely.
- Cost. Claude Sonnet handles every RAG generation, intent classification, and conversation summary. These are high-volume, low-complexity calls that don't need a frontier model. Open-source models via Fireworks or Cloudflare Workers AI cost 5-20x less.
Current call sites and their abstraction status:
| Call Site | File | Provider | Abstracted? | Tool-Use? |
|---|---|---|---|---|
| RAG response generation | lib/rag/generate.ts | Multi-provider fallback | Yes | No |
| Intent classification | lib/rag/intent.ts | Multi-provider fallback | Yes | No |
| Conversation summaries | lib/rag/summary.ts | Multi-provider fallback | Yes | No |
| Tenant engine orchestrator | lib/tenant-engine/process.ts | Direct Anthropic SDK | No | Yes (5 tools) |
| CSV column mapping | app/api/import/map-columns/route.ts | Direct Anthropic SDK | No | No |
| Embeddings | lib/rag/embed.ts | Direct OpenAI | No | No |
The tenant engine is the hardest to migrate — it uses Anthropic’s native tool-use format with 5 tools (ask_for_details, ask_for_photo, create_issue, respond, escalate) and model routing between Haiku and Sonnet based on conversation state.
What Ships
- Vercel AI SDK (`ai`) as the application-layer abstraction — unified tool definitions via Zod schemas, provider-agnostic `generateText`/`streamText`, automatic format translation between Anthropic/OpenAI tool-use formats
- Cloudflare AI Gateway as the proxy layer — request caching, rate limiting, analytics, and fallback chains. Free. Already in our Cloudflare account.
- New providers: Fireworks AI + Cloudflare Workers AI — fast, cheap inference for non-critical paths
- Provider-agnostic tenant engine — tool-use definitions work across providers, with per-provider fallback
- Fallback chains with exponential backoff — every LLM call site has a ranked provider list with automatic failover
- Cost routing — critical paths (tenant engine) use frontier models; high-volume paths (RAG, summaries) use cheap open-source models
Architecture
Three-Layer Stack
```text
┌──────────────────────────────────────────────────────────┐
│ Application Layer                                        │
│                                                          │
│ Vercel AI SDK (@ai-sdk)                                  │
│ ├── Unified tool definitions (Zod → any provider format) │
│ ├── generateText / streamText (provider-agnostic)        │
│ ├── Provider registry (createProviderRegistry)           │
│ └── Middleware hooks (logging, metrics, guardrails)      │
└────────────────────────┬─────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────┐
│ Gateway Layer                                            │
│                                                          │
│ Cloudflare AI Gateway (FREE)                             │
│ ├── Response caching (repeated RAG queries)              │
│ ├── Rate limiting (cost protection)                      │
│ ├── Request/token/cost analytics                         │
│ ├── Fallback chains (Universal Endpoint)                 │
│ └── cf-aig-step header (which provider handled it)       │
│                                                          │
│ URL: gateway.ai.cloudflare.com/v1/{account}/{gw}/{prov}  │
└────────────────────────┬─────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────┐
│ Provider Layer                                           │
│                                                          │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐  │
│ │ Anthropic   │ │ Fireworks AI │ │ CF Workers AI      │  │
│ │ Claude      │ │ Llama 3.1    │ │ Qwen3 / Mistral    │  │
│ │ Sonnet/Haiku│ │ 70B / 8B     │ │ Native, near-zero  │  │
│ └─────────────┘ └──────────────┘ └────────────────────┘  │
│                                                          │
│ ┌─────────────┐ ┌──────────────┐                         │
│ │ OpenAI      │ │ Stub         │                         │
│ │ GPT-4o      │ │ Local dev    │                         │
│ └─────────────┘ └──────────────┘                         │
└──────────────────────────────────────────────────────────┘
```
Why These Choices
Vercel AI SDK over OpenRouter SDK:
OpenRouter is a unified API (300+ models, one key, OpenAI-compatible) — attractive but adds a middleman. The 5.5% platform fee compounds at scale. More importantly, it doesn’t solve the tool-use portability problem: you still need to handle Anthropic vs OpenAI tool-use format differences in your application code.
The Vercel AI SDK is a client-side library (free, MIT) that normalises tool-use definitions across all providers. Define a tool once with Zod, call any model. It handles the Anthropic input_schema ↔ OpenAI function.parameters translation automatically. This is the actual problem we need to solve.
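To make that translation concrete, here is a minimal sketch of the mapping the SDK performs under the hood: the same JSON Schema travels as Anthropic's `input_schema` or OpenAI's `function.parameters`. The type and helper names here are illustrative, not AI SDK API.

```typescript
// Illustrative sketch of Anthropic ↔ OpenAI tool-format translation.
// Helper names are hypothetical; the AI SDK does this internally.
type JsonSchema = Record<string, unknown>

interface AnthropicTool {
  name: string
  description?: string
  input_schema: JsonSchema
}

interface OpenAITool {
  type: 'function'
  function: { name: string; description?: string; parameters: JsonSchema }
}

function anthropicToOpenAI(tool: AnthropicTool): OpenAITool {
  return {
    type: 'function',
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.input_schema,
    },
  }
}

function openAIToAnthropic(tool: OpenAITool): AnthropicTool {
  return {
    name: tool.function.name,
    description: tool.function.description,
    input_schema: tool.function.parameters,
  }
}
```

The point is that the schema payload is identical in both formats; only the envelope differs, which is why a Zod definition can target either provider.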
Cloudflare AI Gateway over Portkey/LiteLLM:
- Portkey: $49/month production tier, another vendor. AI Gateway is free, same account.
- LiteLLM: Python server — can’t run in Workers, requires separate infrastructure.
- AI Gateway: Free, already in our Cloudflare account, native to Workers. Gives us caching, rate limiting, analytics, and fallback chains without any additional infra.
Fireworks AI over Cerebras:
- Cerebras: extremely fast (custom silicon) but limited model selection, less mature tool-use.
- Fireworks: fast (speculative decoding), broad model selection (Llama 3.1/3.3, Mixtral, firefunction), mature tool-use, OpenAI-compatible API.
Workers AI as free fallback:
We’re already on Cloudflare. Workers AI runs on their GPUs, callable via native binding (zero network hop). Free tier: 10,000 Neurons/day. Qwen3 30B and Mistral Small 3.1 24B both support function calling. Perfect as the last-resort fallback before stub.
Provider Routing by Use Case
Tenant Engine (tool-use, safety-critical)
Primary: Anthropic Claude Sonnet (best tool-use reliability)
Secondary: Fireworks Llama 3.1 70B (good tool-use, fast, 10x cheaper)
Tertiary: Workers AI Qwen3 30B (free, native, decent tool-use)
Last: Stub (dev only)
Model routing within the tenant engine stays:
- Haiku equivalent for gathering turns (when `canCreateIssue === false`)
- Sonnet equivalent for orchestration turns (when tools include `create_issue`)
When falling back to Fireworks/Workers AI, both tiers use the same model (no Haiku/Sonnet split for open-source — the cost difference isn’t significant enough to justify the complexity).
RAG Response Generation (high-volume, non-critical)
Primary: Fireworks Llama 3.1 70B (fast, cheap — $0.20/M input tokens)
Secondary: Workers AI Qwen3 30B (free tier)
Tertiary: Anthropic Claude Haiku (reliable but more expensive)
Last: Stub
Intent Classification (high-volume, latency-sensitive)
Primary: Workers AI Mistral Small 3.1 24B (free, native, fast)
Secondary: Fireworks Llama 3.1 8B (very fast, very cheap)
Tertiary: Anthropic Claude Haiku (fallback)
Last: Quick-classify (regex, always available)
Note: the existing quick-classify (regex/keyword) already handles 85%+ of intents at 0ms. The LLM path only fires when confidence < 0.85. This is already cost-efficient.
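A pure-logic sketch of that two-stage flow, assuming hypothetical names (`quickClassify`, the pattern list, and the stub signature are illustrative, not the real `lib/rag/intent.ts` code):

```typescript
// Two-stage intent classification: regex quick-classify first,
// LLM chain only when confidence falls below the 0.85 threshold.
interface IntentResult { intent: string; confidence: number }

// Hypothetical patterns for illustration
const QUICK_PATTERNS: Array<{ intent: string; re: RegExp }> = [
  { intent: 'report_issue', re: /\b(leak|broken|not working|repair)\b/i },
  { intent: 'greeting', re: /^(hi|hello|hey)\b/i },
]

function quickClassify(message: string): IntentResult {
  for (const { intent, re } of QUICK_PATTERNS) {
    if (re.test(message)) return { intent, confidence: 0.95 }
  }
  return { intent: 'unknown', confidence: 0 }
}

async function classifyIntent(
  message: string,
  llmClassify: (m: string) => Promise<IntentResult>
): Promise<IntentResult> {
  const quick = quickClassify(message)
  // Hot path stays at 0ms; the LLM fires only on low confidence.
  return quick.confidence >= 0.85 ? quick : llmClassify(message)
}
```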
Conversation Summaries (batch, non-critical)
Primary: Fireworks Llama 3.1 8B (summaries don't need big models)
Secondary: Workers AI Qwen3 30B (free)
Tertiary: Anthropic Claude Haiku
Last: Stub
CSV Column Mapping (low-volume, batch)
Primary: Fireworks Llama 3.1 70B (smart enough, cheap)
Secondary: Anthropic Claude Sonnet (reliable fallback)
Last: Return empty mappings (user maps manually)
Embeddings (unchanged)
Primary: OpenAI text-embedding-3-small ($0.02/M tokens — already cheap)
Fallback: Stub embeddings (dev only)
No reason to change. Neither Fireworks nor Workers AI offers a compelling embedding alternative at this price point.
Estimated Cost Impact
Current (all Anthropic)
| Use case | Model | Estimated monthly volume | Cost/M input | Cost/M output | Monthly est. |
|---|---|---|---|---|---|
| Tenant engine | Sonnet | ~10,000 calls | $3.00 | $15.00 | ~$60-150 |
| RAG generation | Sonnet | ~5,000 calls | $3.00 | $15.00 | ~$30-75 |
| Intent (LLM path) | Haiku | ~2,000 calls | $0.25 | $1.25 | ~$2-5 |
| Summaries | Sonnet | ~3,000 calls | $3.00 | $15.00 | ~$15-45 |
| Column mapping | Sonnet | ~100 calls | $3.00 | $15.00 | ~$1-2 |
| Total | | | | | ~$108-277 |
After migration
| Use case | Model | Cost/M input | Cost/M output | Monthly est. |
|---|---|---|---|---|
| Tenant engine | Sonnet (primary) | $3.00 | $15.00 | ~$50-120 (reduced via Haiku routing) |
| RAG generation | Fireworks Llama 3.1 70B | $0.20 | $0.20 | ~$2-4 |
| Intent (LLM path) | Workers AI Mistral 3.1 | Free (10k/day) | Free | ~$0 |
| Summaries | Fireworks Llama 3.1 8B | $0.05 | $0.08 | ~$0.50 |
| Column mapping | Fireworks Llama 3.1 70B | $0.20 | $0.20 | ~$0.10 |
| Total | | | | ~$53-125 |
Savings: ~50-55%, primarily from moving RAG/summaries off Sonnet. The tenant engine stays mostly on Anthropic (highest reliability for safety-critical tool-use) but gains redundancy.
As we scale to 3,000+ properties, the savings compound — RAG and summary volume grows linearly with property count.
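The table's per-row estimates back out from a simple per-call cost model. The token counts below are illustrative assumptions (roughly 2K input / 500 output per RAG call), not measured figures:

```typescript
// Monthly cost = calls × (inputTokens × inPrice + outputTokens × outPrice) / 1M
function monthlyCostUSD(
  calls: number,
  inTokensPerCall: number,
  outTokensPerCall: number,
  inPricePerM: number,
  outPricePerM: number
): number {
  const perCall =
    (inTokensPerCall * inPricePerM + outTokensPerCall * outPricePerM) / 1_000_000
  return calls * perCall
}

// RAG on Fireworks Llama 3.1 70B: ~5,000 calls at ~2K in / 500 out tokens ≈ $2.50
const ragFireworks = monthlyCostUSD(5_000, 2_000, 500, 0.2, 0.2)
// Same assumed traffic on Sonnet for comparison ≈ $67.50
const ragSonnet = monthlyCostUSD(5_000, 2_000, 500, 3.0, 15.0)
```

Under those assumptions the Fireworks figure lands inside the table's ~$2-4 range and the Sonnet figure inside ~$30-75, so the ranges are internally consistent.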
Implementation Detail
1. New Dependencies
```bash
pnpm add ai @ai-sdk/anthropic @ai-sdk/openai
```

The `@ai-sdk/openai` provider works with any OpenAI-compatible API (Fireworks, Workers AI) via `createOpenAI({ baseURL })`.
2. Provider Registry
New file: lib/ai/providers.ts
```ts
// createOpenAI comes from @ai-sdk/openai, not the core ai package
import { createAnthropic } from '@ai-sdk/anthropic'
import { createOpenAI } from '@ai-sdk/openai'

export function createProviders(env: Record<string, string | undefined>) {
  // AI Gateway base URL (all providers proxied through Cloudflare)
  const gatewayBase = `https://gateway.ai.cloudflare.com/v1/${env.CF_AI_GATEWAY_ACCOUNT_ID}/${env.CF_AI_GATEWAY_ID}`

  const anthropic = env.ANTHROPIC_API_KEY
    ? createAnthropic({
        apiKey: env.ANTHROPIC_API_KEY,
        baseURL: `${gatewayBase}/anthropic/v1`,
      })
    : null

  const fireworks = env.FIREWORKS_API_KEY
    ? createOpenAI({
        apiKey: env.FIREWORKS_API_KEY,
        baseURL: `${gatewayBase}/fireworks-ai/v1`,
        name: 'fireworks',
      })
    : null

  const workersAi = createOpenAI({
    apiKey: 'workers-ai', // Dummy — auth via CF service binding
    baseURL: `${gatewayBase}/workers-ai/v1`,
    name: 'workers-ai',
  })

  const openai = env.OPENAI_API_KEY
    ? createOpenAI({
        apiKey: env.OPENAI_API_KEY,
        baseURL: `${gatewayBase}/openai/v1`,
      })
    : null

  return { anthropic, fireworks, workersAi, openai }
}
```

3. Fallback Utility
New file: lib/ai/fallback.ts
```ts
import { generateText, type LanguageModel, type GenerateTextResult } from 'ai'

interface FallbackOptions {
  models: Array<{ model: LanguageModel; name: string }>
  maxRetries?: number      // Per-model retries (default: 2)
  initialDelayMs?: number  // Backoff start (default: 1000)
  maxDelayMs?: number      // Backoff cap (default: 5000)
}

interface FallbackResult<T> {
  result: T
  provider: string
  attemptIndex: number
  totalAttempts: number
}

export async function generateWithFallback(
  options: FallbackOptions & Omit<Parameters<typeof generateText>[0], 'model'>
): Promise<FallbackResult<GenerateTextResult<Record<string, never>>>> {
  const {
    models,
    maxRetries = 2,
    initialDelayMs = 1000,
    maxDelayMs = 5000,
    ...generateOptions
  } = options

  const errors: Array<{ provider: string; error: unknown }> = []

  for (let i = 0; i < models.length; i++) {
    const { model, name } = models[i]
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        const result = await generateText({ ...generateOptions, model })
        return {
          result,
          provider: name,
          attemptIndex: i,
          // errors already holds every failed attempt, so total = failures + this success
          totalAttempts: errors.length + 1,
        }
      } catch (error) {
        console.warn(`[ai-fallback] ${name} attempt ${attempt + 1} failed:`, error)
        errors.push({ provider: name, error })
        if (attempt < maxRetries) {
          const delay = Math.min(initialDelayMs * 2 ** attempt, maxDelayMs)
          await new Promise(r => setTimeout(r, delay))
        }
      }
    }
  }

  throw new AggregateError(
    errors.map(e => e.error),
    `All ${models.length} providers failed: ${errors.map(e => e.provider).join(', ')}`
  )
}
```

4. Tenant Engine Migration (tool-use)
Modified: lib/tenant-engine/process.ts
Before (Anthropic-specific):
```ts
const CONVERSATION_TOOLS: Anthropic.Tool[] = [
  {
    name: 'create_issue',
    input_schema: {
      type: 'object',
      properties: {
        description: { type: 'string' },
        category: { type: 'string', enum: ['PLUMBING', ...] },
        urgency: { type: 'string', enum: ['LOW', 'MEDIUM', 'HIGH'] },
        confirmation_message: { type: 'string' },
      },
      required: ['description', 'category', 'urgency', 'confirmation_message'],
    },
  },
  // ... 4 more tools
]

const response = await anthropic.messages.create({
  model,
  tools: availableTools,
  messages,
})

const toolUseBlock = response.content.find(b => b.type === 'tool_use')
switch (toolUseBlock.name) { ... }
```

After (provider-agnostic via AI SDK):
```ts
import { generateText, tool } from 'ai'
import { z } from 'zod'

const CONVERSATION_TOOLS = {
  create_issue: tool({
    description: 'Create maintenance issue and notify landlord',
    parameters: z.object({
      description: z.string(),
      category: z.enum([
        'PLUMBING', 'ELECTRICAL', 'HEATING', 'STRUCTURAL', 'APPLIANCES',
        'CLEANING', 'PEST_CONTROL', 'LOCKS_SECURITY', 'WINDOWS_DOORS',
        'GARDEN_EXTERIOR', 'FIRE_SAFETY', 'DAMP_MOULD', 'NOISE_NUISANCE', 'OTHER',
      ]),
      urgency: z.enum(['LOW', 'MEDIUM', 'HIGH']),
      confirmation_message: z.string(),
    }),
  }),
  ask_for_details: tool({
    description: 'Ask tenant for more specifics about their issue',
    parameters: z.object({
      message: z.string(),
      missing_info: z.array(z.string()),
    }),
  }),
  ask_for_photo: tool({
    description: 'Request photo/video evidence after gathering details',
    parameters: z.object({
      message: z.string(),
    }),
  }),
  respond: tool({
    description: 'Send conversational response to tenant',
    parameters: z.object({
      message: z.string(),
    }),
  }),
  escalate: tool({
    description: 'Escalate conversation to human staff',
    parameters: z.object({
      message: z.string(),
      reason: z.string(),
    }),
  }),
}

// Small local helper (or lodash omit): copy without the listed keys
function omit<T extends object>(obj: T, keys: Array<keyof T>) {
  return Object.fromEntries(
    Object.entries(obj).filter(([k]) => !keys.includes(k as keyof T))
  )
}

// Tool-use with fallback
const { result, provider } = await generateWithFallback({
  models: getToolUseProviderChain(canCreateIssue),
  system: systemPrompt,
  messages,
  tools: canCreateIssue
    ? CONVERSATION_TOOLS
    : omit(CONVERSATION_TOOLS, ['create_issue']),
  temperature: 0,
  maxTokens: 1024,
})

// AI SDK normalises tool calls across providers
const toolCall = result.toolCalls[0]
if (toolCall) {
  switch (toolCall.toolName) {
    case 'create_issue':
      return await handleToolCreateIssue(toolCall.args, ...)
    case 'ask_for_details':
      return await handleToolAskForDetails(toolCall.args, ...)
    // ...
  }
}
```

5. Provider Chain Helpers
New file: lib/ai/chains.ts
```ts
import type { LanguageModel } from 'ai'
import { createProviders } from './providers'

// Tenant engine: reliability > cost
export function getToolUseProviderChain(
  needsStrongModel: boolean
): Array<{ model: LanguageModel; name: string }> {
  const { anthropic, fireworks, workersAi } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
  if (anthropic) {
    chain.push({
      model: anthropic(needsStrongModel ? 'claude-sonnet-4-20250514' : 'claude-haiku-4-5-20251001'),
      name: needsStrongModel ? 'anthropic-sonnet' : 'anthropic-haiku',
    })
  }
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-70b-instruct'),
      name: 'fireworks-llama-70b',
    })
  }
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
  return chain
}

// RAG/summaries: cost > reliability
export function getCheapProviderChain(): Array<{ model: LanguageModel; name: string }> {
  const { fireworks, workersAi, anthropic } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-70b-instruct'),
      name: 'fireworks-llama-70b',
    })
  }
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
  if (anthropic) {
    chain.push({
      model: anthropic('claude-haiku-4-5-20251001'),
      name: 'anthropic-haiku',
    })
  }
  return chain
}

// Summaries: cheapest possible
export function getSummaryProviderChain(): Array<{ model: LanguageModel; name: string }> {
  const { fireworks, workersAi, anthropic } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-8b-instruct'),
      name: 'fireworks-llama-8b',
    })
  }
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
  if (anthropic) {
    chain.push({
      model: anthropic('claude-haiku-4-5-20251001'),
      name: 'anthropic-haiku',
    })
  }
  return chain
}
```

6. Cloudflare AI Gateway Setup
Dashboard configuration (one-time):
- Cloudflare Dashboard → AI → AI Gateway → Create Gateway
- Name: `envo-llm`
- Enable: Caching (TTL: 300s), Rate Limiting (100 req/min), Logging
Gateway URL pattern:
```text
https://gateway.ai.cloudflare.com/v1/{account_id}/envo-llm/{provider}

# Anthropic:   .../envo-llm/anthropic/v1/messages
# OpenAI:      .../envo-llm/openai/v1/chat/completions
# Fireworks:   .../envo-llm/fireworks-ai/v1/chat/completions
# Workers AI:  .../envo-llm/workers-ai/v1/chat/completions
```
Universal Endpoint (gateway-level fallback):
```jsonc
// POST https://gateway.ai.cloudflare.com/v1/{account_id}/envo-llm
// Body: array of provider configs — gateway tries them in order
[
  {
    "provider": "fireworks-ai",
    "endpoint": "v1/chat/completions",
    "headers": { "Authorization": "Bearer ${FIREWORKS_API_KEY}" },
    "query": { "model": "accounts/fireworks/models/llama-v3p1-70b-instruct", ... }
  },
  {
    "provider": "workers-ai",
    "endpoint": "v1/chat/completions",
    "query": { "model": "@cf/qwen/qwen3-30b", ... }
  },
  {
    "provider": "anthropic",
    "endpoint": "v1/messages",
    "headers": { "x-api-key": "${ANTHROPIC_API_KEY}" },
    "query": { "model": "claude-haiku-4-5-20251001", ... }
  }
]
```

The `cf-aig-step` response header tells you which provider handled the request (0 = primary, 1 = first fallback, and so on) — useful for monitoring.
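A small sketch of how that header could feed monitoring: map the step index back to a provider name, given the chain order configured at the gateway. The helper name and chain constant are hypothetical, not Cloudflare API:

```typescript
// Map the cf-aig-step response header to the provider that served the
// request. Chain order mirrors the Universal Endpoint config above.
const UNIVERSAL_ENDPOINT_CHAIN = ['fireworks-ai', 'workers-ai', 'anthropic'] as const

function providerForStep(
  headers: Headers,
  chain: readonly string[] = UNIVERSAL_ENDPOINT_CHAIN
): string {
  // Treat a missing header as step 0 (primary handled it)
  const step = Number(headers.get('cf-aig-step') ?? '0')
  return chain[step] ?? 'unknown'
}
```

Tagging each request log with this value gives per-provider failover counts for free.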
7. Environment Variables
New variables in .env:
```bash
# Cloudflare AI Gateway
CF_AI_GATEWAY_ACCOUNT_ID=        # Cloudflare account ID
CF_AI_GATEWAY_ID=envo-llm        # Gateway name

# Fireworks AI
FIREWORKS_API_KEY=               # From fireworks.ai dashboard

# Workers AI (no key needed — uses CF service binding)
# Just needs the AI Gateway URL

# Routing config
LLM_TENANT_ENGINE_PROVIDER=anthropic  # Primary for tool-use
LLM_RAG_PROVIDER=fireworks            # Primary for RAG
LLM_SUMMARY_PROVIDER=fireworks        # Primary for summaries
```

Migration Plan
Phase 1: Foundation (no behaviour change)
- Install `ai`, `@ai-sdk/anthropic`, `@ai-sdk/openai`
- Create `lib/ai/providers.ts` — provider registry
- Create `lib/ai/fallback.ts` — `generateWithFallback` utility
- Create `lib/ai/chains.ts` — provider chain definitions
- Set up Cloudflare AI Gateway in dashboard (`envo-llm`)
- Add new env vars to `.env.example`
- Wire AI Gateway URLs as `baseURL` for existing Anthropic + OpenAI clients
- Verify all existing calls still work through the gateway (no behaviour change)
Phase 2: Migrate non-critical paths
- Migrate `lib/rag/generate.ts` to use AI SDK + `getCheapProviderChain()`
- Migrate `lib/rag/intent.ts` LLM path to use AI SDK
- Migrate `lib/rag/summary.ts` to use AI SDK + `getSummaryProviderChain()`
- Migrate `app/api/import/map-columns/route.ts` to use AI SDK
- Remove old provider files (`lib/rag/providers/claude.ts`, `openai.ts`, `kimi.ts`, `glm.ts`)
- Keep `lib/rag/providers/stub.ts` as dev fallback
- Verify RAG quality with Fireworks Llama 3.1 70B (compare sample outputs)
Phase 3: Migrate tenant engine (critical path)
- Rewrite tool definitions from `Anthropic.Tool[]` to Zod + AI SDK `tool()`
- Replace `anthropic.messages.create()` with `generateWithFallback()` + `getToolUseProviderChain()`
- Update tool response handling to use AI SDK's `result.toolCalls[0]`
- Preserve model routing logic (strong model for orchestration, light model for gathering)
- Test: run 50+ sample conversations through each provider and compare tool selection accuracy
- Test: verify emergency fast-path still bypasses LLM (keyword-only, no regression)
- Test: verify identity gating still blocks the `create_issue` tool for unconfirmed tenants
- Deploy behind feature flag — route 10% of traffic to Fireworks, monitor tool-use accuracy
- Gradually increase to 100% once confident
Phase 4: Cleanup + monitoring
- Remove `@anthropic-ai/sdk` direct imports (keep Anthropic as an AI SDK sub-provider)
- Remove the old `lib/rag/providers/` directory (replaced by `lib/ai/`)
- Update `lib/rag/config.ts` to reference the new provider chain names
- Add AI Gateway analytics to the ops dashboard (cost per provider, latency, error rate)
- Set up cost alerting (Cloudflare notifications if daily spend exceeds threshold)
- Document provider chain configuration in `.envo/learnings.md`
Files to Create
| File | Purpose |
|---|---|
| `lib/ai/providers.ts` | Provider registry — Anthropic, Fireworks, Workers AI, OpenAI |
| `lib/ai/fallback.ts` | `generateWithFallback()` with exponential backoff |
| `lib/ai/chains.ts` | Use-case-specific provider chains |
| `lib/ai/index.ts` | Public exports |
Files to Modify
| File | Change |
|---|---|
| `lib/tenant-engine/process.ts` | Replace direct Anthropic SDK with AI SDK + fallback |
| `app/api/import/map-columns/route.ts` | Replace direct Anthropic SDK with AI SDK + fallback |
| `lib/rag/generate.ts` | Replace `completeWithFallback` with AI SDK |
| `lib/rag/intent.ts` | Replace LLM path with AI SDK |
| `lib/rag/summary.ts` | Replace `completeWithFallback` with AI SDK |
| `lib/rag/config.ts` | Update model references |
| `package.json` | Add `ai`, `@ai-sdk/anthropic`, `@ai-sdk/openai` |
| `.env.example` | Add Fireworks + AI Gateway env vars |
Files to Delete
| File | Reason |
|---|---|
| `lib/rag/providers/claude.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/openai.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/kimi.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/glm.ts` | Replaced by `lib/ai/providers.ts` |
| `lib/rag/providers/index.ts` | Replaced by `lib/ai/fallback.ts` |
| `lib/rag/providers/types.ts` | Replaced by AI SDK types |
Keep: lib/rag/providers/stub.ts (adapt to AI SDK interface for dev fallback)
Open-Source Model Tool-Use Quality
Models ranked by tool-use reliability for our use case (multi-tool selection with structured output):
| Tier | Model | Size | Tool-Use Quality | Notes |
|---|---|---|---|---|
| S | Claude Sonnet | — | Excellent | Best-in-class. Current baseline. |
| A | Llama 3.1 70B | 70B | Good | Most battle-tested OSS model for tool-use. Available on Fireworks. |
| A | Qwen3 30B | 30B | Good | Strong function calling. Available on Workers AI natively. |
| A | Mistral Small 3.1 | 24B | Good | Parallel function support. Available on Workers AI. |
| B | Llama 3.3 70B | 70B | Good | Improved instruction following over 3.1. |
| B | Llama 4 Scout | 17B (MoE) | Decent | Multimodal + tool-use. New, less battle-tested. |
| C | Llama 3.1 8B | 8B | Adequate | Fine for simple single-tool calls (summaries). Unreliable for multi-tool. |
| C | Hermes 2 Pro | 7B | Adequate | Fine-tuned for function calling. Small but specialised. |
For the tenant engine, only Tier A+ models should be in the fallback chain. For RAG/summaries, Tier B-C is fine.
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Open-source tool-use picks wrong tool | Tenant gets wrong response, issue miscategorised | Gradual rollout (10% → 50% → 100%). Compare tool selection accuracy per provider. Keep Anthropic as primary for tool-use. |
| AI Gateway adds latency | Slower tenant responses | Gateway is on Cloudflare’s edge — expect <5ms overhead. Monitor via cf-aig-step timing. |
| Fireworks outage | RAG/summaries fail | Workers AI as free fallback (native, no external dependency). Anthropic as last resort. |
| Workers AI model quality degrades | Bad fallback responses | Workers AI is last resort, not primary. Monitor quality via logging. |
| AI SDK doesn’t support a provider feature | Can’t use provider-specific capabilities | AI SDK has provider-specific extensions. Worst case: drop to raw HTTP for that one call. |
| Tool definitions diverge between Zod and current Anthropic format | Behavioural regression | Side-by-side testing: run same prompts through old and new paths, compare tool selections. |
| Vercel AI SDK too heavy for Workers bundle | 10MB bundle limit exceeded | ai core is ~50KB. Provider packages are small. Should be fine. Verify during Phase 1. |
| Cache poisoning via AI Gateway | Stale/wrong responses served | Disable caching for tenant engine (tool-use). Only cache RAG queries (idempotent). |
What’s NOT in This Plan
- Embeddings migration — OpenAI `text-embedding-3-small` stays. Already cheap ($0.02/M tokens).
- Streaming responses — Current architecture doesn't stream to tenants (SMS/WhatsApp are request-response). Not needed yet.
- OpenRouter — Adds a middleman + 5.5% fee. Direct provider access via AI Gateway is cheaper and gives us more control. Reconsider if we need 50+ models.
- Self-hosted models — Not worth the ops burden at current scale. Revisit if we hit 10,000+ properties.
- Fine-tuning — Open-source models work well enough out of the box for our use cases. Revisit if tool-use accuracy on Llama is consistently <90%.
Verification
- Phase 1: All existing calls work through AI Gateway — no behaviour change, just proxy
- Phase 2: RAG responses from Fireworks are comparable quality to Claude (manual review of 20 sample outputs)
- Phase 3: Tenant engine tool selection accuracy on Fireworks Llama 3.1 70B is >95% match vs Claude on a test set of 50 conversations
- Phase 3: Emergency detection still works (keyword fast-path, no LLM regression)
- Phase 3: Identity gating blocks `create_issue` for unconfirmed tenants (no regression)
- Phase 4: AI Gateway analytics show cost reduction of 40-55%
- Phase 4: P99 latency for tenant engine stays under 3 seconds
- Phase 4: Zero-downtime during provider failover (test by temporarily disabling primary provider API key)