Provider-Agnostic LLM Architecture

Status: Draft
Created: 2026-04-15
Depends on: None (can start immediately)
Feature flags: None (infrastructure change, transparent to end users)


Why

The LLM layer has two problems:

  1. Vendor lock-in. The tenant engine (safety-critical, tenant-facing) and CSV column mapping both use the Anthropic SDK directly — no fallback, no redundancy. If Anthropic has an outage or rate-limits us, tenant conversations stop working entirely.

  2. Cost. Claude Sonnet handles every RAG generation, intent classification, and conversation summary. These are high-volume, low-complexity calls that don’t need a frontier model. Open-source models via Fireworks or Cloudflare Workers AI cost 5-20x less.

Current call sites and their abstraction status:

| Call Site | File | Provider | Abstracted? | Tool-Use? |
|---|---|---|---|---|
| RAG response generation | lib/rag/generate.ts | Multi-provider fallback | Yes | No |
| Intent classification | lib/rag/intent.ts | Multi-provider fallback | Yes | No |
| Conversation summaries | lib/rag/summary.ts | Multi-provider fallback | Yes | No |
| Tenant engine orchestrator | lib/tenant-engine/process.ts | Direct Anthropic SDK | No | Yes (5 tools) |
| CSV column mapping | app/api/import/map-columns/route.ts | Direct Anthropic SDK | No | No |
| Embeddings | lib/rag/embed.ts | Direct OpenAI | No | No |

The tenant engine is the hardest to migrate — it uses Anthropic’s native tool-use format with 5 tools (ask_for_details, ask_for_photo, create_issue, respond, escalate) and model routing between Haiku and Sonnet based on conversation state.


What Ships

  1. Vercel AI SDK (ai) as the application-layer abstraction — unified tool definitions via Zod schemas, provider-agnostic generateText/streamText, automatic format translation between Anthropic/OpenAI tool-use formats
  2. Cloudflare AI Gateway as the proxy layer — request caching, rate limiting, analytics, and fallback chains. Free. Already in our Cloudflare account.
  3. New providers: Fireworks AI + Cloudflare Workers AI — fast, cheap inference for non-critical paths
  4. Provider-agnostic tenant engine — tool-use definitions work across providers, with per-provider fallback
  5. Fallback chains with exponential backoff — every LLM call site has a ranked provider list with automatic failover
  6. Cost routing — critical paths (tenant engine) use frontier models; high-volume paths (RAG, summaries) use cheap open-source models

Architecture

Three-Layer Stack

┌──────────────────────────────────────────────────────────┐
│  Application Layer                                        │
│                                                           │
│  Vercel AI SDK (@ai-sdk)                                  │
│  ├── Unified tool definitions (Zod → any provider format) │
│  ├── generateText / streamText (provider-agnostic)        │
│  ├── Provider registry (createProviderRegistry)           │
│  └── Middleware hooks (logging, metrics, guardrails)       │
└────────────────────────┬─────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────┐
│  Gateway Layer                                            │
│                                                           │
│  Cloudflare AI Gateway (FREE)                             │
│  ├── Response caching (repeated RAG queries)              │
│  ├── Rate limiting (cost protection)                      │
│  ├── Request/token/cost analytics                         │
│  ├── Fallback chains (Universal Endpoint)                 │
│  └── cf-aig-step header (which provider handled it)       │
│                                                           │
│  URL: gateway.ai.cloudflare.com/v1/{account}/{gw}/{prov} │
└────────────────────────┬─────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────┐
│  Provider Layer                                           │
│                                                           │
│  ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │
│  │ Anthropic    │ │ Fireworks AI │ │ CF Workers AI      │ │
│  │ Claude       │ │ Llama 3.1    │ │ Qwen3 / Mistral    │ │
│  │ Sonnet/Haiku │ │ 70B / 8B     │ │ Native, near-zero  │ │
│  └─────────────┘ └──────────────┘ └────────────────────┘ │
│                                                           │
│  ┌─────────────┐ ┌──────────────┐                        │
│  │ OpenAI      │ │ Stub         │                        │
│  │ GPT-4o      │ │ Local dev    │                        │
│  └─────────────┘ └──────────────┘                        │
└──────────────────────────────────────────────────────────┘

Why These Choices

Vercel AI SDK over OpenRouter SDK:

OpenRouter is a unified API (300+ models, one key, OpenAI-compatible) — attractive but adds a middleman. The 5.5% platform fee compounds at scale. More importantly, it doesn’t solve the tool-use portability problem: you still need to handle Anthropic vs OpenAI tool-use format differences in your application code.

The Vercel AI SDK is a client-side library (free, MIT) that normalises tool-use definitions across all providers. Define a tool once with Zod, call any model. It handles the Anthropic input_schema ↔ OpenAI function.parameters translation automatically. This is the actual problem we need to solve.
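To make that translation concrete, here is a sketch of the two wire formats a single tool definition must target. This is not the SDK's internals — the real mapping lives inside the AI SDK — just an illustration of the shape difference it absorbs; all names here are invented for the example.

```typescript
// Illustrative sketch of the format translation the AI SDK performs.
// One JSON Schema definition, rendered into each provider's tool shape.

type JsonSchema = Record<string, unknown>

interface ToolDef {
  name: string
  description: string
  schema: JsonSchema
}

// Anthropic Messages API shape: { name, description, input_schema }
function toAnthropicTool(t: ToolDef) {
  return { name: t.name, description: t.description, input_schema: t.schema }
}

// OpenAI Chat Completions shape: { type: 'function', function: { name, description, parameters } }
function toOpenAiTool(t: ToolDef) {
  return {
    type: 'function' as const,
    function: { name: t.name, description: t.description, parameters: t.schema },
  }
}

const createIssue: ToolDef = {
  name: 'create_issue',
  description: 'Create maintenance issue and notify landlord',
  schema: {
    type: 'object',
    properties: { description: { type: 'string' } },
    required: ['description'],
  },
}
```

With the AI SDK, the Zod definition plays the role of `ToolDef` and both renderings happen automatically per provider.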

Cloudflare AI Gateway over Portkey/LiteLLM:

  • Portkey: $49/month production tier, another vendor. AI Gateway is free, same account.
  • LiteLLM: Python server — can’t run in Workers, requires separate infrastructure.
  • AI Gateway: Free, already in our Cloudflare account, native to Workers. Gives us caching, rate limiting, analytics, and fallback chains without any additional infra.

Fireworks AI over Cerebras:

  • Cerebras: extremely fast (custom silicon) but limited model selection, less mature tool-use.
  • Fireworks: fast (speculative decoding), broad model selection (Llama 3.1/3.3, Mixtral, firefunction), mature tool-use, OpenAI-compatible API.

Workers AI as free fallback:

We’re already on Cloudflare. Workers AI runs on their GPUs, callable via native binding (zero network hop). Free tier: 10,000 Neurons/day. Qwen3 30B and Mistral Small 3.1 24B both support function calling. Perfect as the last-resort fallback before stub.


Provider Routing by Use Case

Tenant Engine (tool-use, safety-critical)

Primary:    Anthropic Claude Sonnet     (best tool-use reliability)
Secondary:  Fireworks Llama 3.1 70B    (good tool-use, fast, 10x cheaper)
Tertiary:   Workers AI Qwen3 30B       (free, native, decent tool-use)
Last:       Stub                        (dev only)

Model routing within the tenant engine stays:

  • Haiku equivalent for gathering turns (when canCreateIssue === false)
  • Sonnet equivalent for orchestration turns (when tools include create_issue)

When falling back to Fireworks/Workers AI, both tiers use the same model (no Haiku/Sonnet split for open-source — the cost difference isn’t significant enough to justify the complexity).
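That routing rule can be sketched as a pure function — model IDs are the ones used elsewhere in this plan; the function name is illustrative, not existing code:

```typescript
// Sketch of the per-turn model choice described above. Anthropic keeps the
// Haiku/Sonnet split; open-source fallbacks use one model for both tiers.
type Provider = 'anthropic' | 'fireworks' | 'workers-ai'

function pickModel(provider: Provider, canCreateIssue: boolean): string {
  if (provider === 'anthropic') {
    // Orchestration turns (create_issue available) get the strong model.
    return canCreateIssue ? 'claude-sonnet-4-20250514' : 'claude-haiku-4-5-20251001'
  }
  // No Haiku/Sonnet split for open-source providers — same model either way.
  return provider === 'fireworks'
    ? 'accounts/fireworks/models/llama-v3p1-70b-instruct'
    : '@cf/qwen/qwen3-30b'
}
```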

RAG Response Generation (high-volume, non-critical)

Primary:    Fireworks Llama 3.1 70B    (fast, cheap — $0.20/M input tokens)
Secondary:  Workers AI Qwen3 30B       (free tier)
Tertiary:   Anthropic Claude Haiku     (reliable but more expensive)
Last:       Stub

Intent Classification (high-volume, latency-sensitive)

Primary:    Workers AI Mistral Small 3.1 24B  (free, native, fast)
Secondary:  Fireworks Llama 3.1 8B             (very fast, very cheap)
Tertiary:   Anthropic Claude Haiku             (fallback)
Last:       Quick-classify (regex, always available)

Note: the existing quick-classify (regex/keyword) already handles 85%+ of intents at 0ms. The LLM path only fires when confidence < 0.85. This is already cost-efficient.
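A minimal sketch of that gate — the patterns below are invented for illustration and the real rule set lives in lib/rag/intent.ts:

```typescript
// Sketch of the quick-classify gate: regex/keyword rules answer first at 0ms;
// the LLM path fires only when rule confidence falls below 0.85.
interface IntentGuess { intent: string; confidence: number }

// Illustrative rules only — not the production pattern set.
const RULES: Array<{ pattern: RegExp; intent: string; confidence: number }> = [
  { pattern: /\b(leak|dripping|burst pipe)\b/i, intent: 'maintenance', confidence: 0.95 },
  { pattern: /\b(rent|payment|arrears)\b/i, intent: 'billing', confidence: 0.9 },
]

function quickClassify(message: string): IntentGuess {
  for (const rule of RULES) {
    if (rule.pattern.test(message)) return { intent: rule.intent, confidence: rule.confidence }
  }
  return { intent: 'unknown', confidence: 0 }
}

async function classify(
  message: string,
  llmClassify: (m: string) => Promise<string>
): Promise<string> {
  const quick = quickClassify(message)
  if (quick.confidence >= 0.85) return quick.intent  // 0ms path, ~85% of traffic
  return llmClassify(message)                        // LLM path only below threshold
}
```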

Conversation Summaries (batch, non-critical)

Primary:    Fireworks Llama 3.1 8B     (summaries don't need big models)
Secondary:  Workers AI Qwen3 30B       (free)
Tertiary:   Anthropic Claude Haiku
Last:       Stub

CSV Column Mapping (low-volume, batch)

Primary:    Fireworks Llama 3.1 70B    (smart enough, cheap)
Secondary:  Anthropic Claude Sonnet    (reliable fallback)
Last:       Return empty mappings (user maps manually)

Embeddings (unchanged)

Primary:    OpenAI text-embedding-3-small  ($0.02/M tokens — already cheap)
Fallback:   Stub embeddings (dev only)

No reason to change. Neither Fireworks nor Workers AI offers a compelling embedding alternative at this price point.


Estimated Cost Impact

Current (all Anthropic)

| Use case | Model | Estimated monthly volume | Cost/M input | Cost/M output | Monthly est. |
|---|---|---|---|---|---|
| Tenant engine | Sonnet | ~10,000 calls | $3.00 | $15.00 | ~$60-150 |
| RAG generation | Sonnet | ~5,000 calls | $3.00 | $15.00 | ~$30-75 |
| Intent (LLM path) | Haiku | ~2,000 calls | $0.25 | $1.25 | ~$2-5 |
| Summaries | Sonnet | ~3,000 calls | $3.00 | $15.00 | ~$15-45 |
| Column mapping | Sonnet | ~100 calls | $3.00 | $15.00 | ~$1-2 |
| Total | | | | | ~$108-277 |

After migration

| Use case | Model | Cost/M input | Cost/M output | Monthly est. |
|---|---|---|---|---|
| Tenant engine | Sonnet (primary) | $3.00 | $15.00 | ~$50-120 (reduced via Haiku routing) |
| RAG generation | Fireworks Llama 3.1 70B | $0.20 | $0.20 | ~$2-4 |
| Intent (LLM path) | Workers AI Mistral 3.1 | Free (10k/day) | Free | ~$0 |
| Summaries | Fireworks Llama 3.1 8B | $0.05 | $0.08 | ~$0.50 |
| Column mapping | Fireworks Llama 3.1 70B | $0.20 | $0.20 | ~$0.10 |
| Total | | | | ~$53-125 |

Savings: ~50-55%, primarily from moving RAG/summaries off Sonnet. The tenant engine stays mostly on Anthropic (highest reliability for safety-critical tool-use) but gains redundancy.

As we scale to 3,000+ properties, the savings compound — RAG and summary volume grows linearly with property count.


Implementation Detail

1. New Dependencies

pnpm add ai @ai-sdk/anthropic @ai-sdk/openai

The @ai-sdk/openai provider works with any OpenAI-compatible API (Fireworks, Workers AI) via createOpenAI({ baseURL }).

2. Provider Registry

New file: lib/ai/providers.ts

import { createOpenAI } from '@ai-sdk/openai'
import { createAnthropic } from '@ai-sdk/anthropic'
 
export function createProviders(env: Record<string, string | undefined>) {
  // AI Gateway base URL (proxied through Cloudflare) — IDs from the env vars in §7
  const AI_GATEWAY_BASE = `https://gateway.ai.cloudflare.com/v1/${env.CF_AI_GATEWAY_ACCOUNT_ID}/${env.CF_AI_GATEWAY_ID}`
 
  const anthropic = env.ANTHROPIC_API_KEY
    ? createAnthropic({
        apiKey: env.ANTHROPIC_API_KEY,
        baseURL: `${AI_GATEWAY_BASE}/anthropic/v1`,
      })
    : null
 
  const fireworks = env.FIREWORKS_API_KEY
    ? createOpenAI({
        apiKey: env.FIREWORKS_API_KEY,
        baseURL: `${AI_GATEWAY_BASE}/fireworks-ai/v1`,
        name: 'fireworks',
      })
    : null
 
  const workersAi = createOpenAI({
    apiKey: 'workers-ai',  // Dummy — auth via CF service binding
    baseURL: `${AI_GATEWAY_BASE}/workers-ai/v1`,
    name: 'workers-ai',
  })
 
  const openai = env.OPENAI_API_KEY
    ? createOpenAI({
        apiKey: env.OPENAI_API_KEY,
        baseURL: `${AI_GATEWAY_BASE}/openai/v1`,
      })
    : null
 
  return { anthropic, fireworks, workersAi, openai }
}

3. Fallback Utility

New file: lib/ai/fallback.ts

import { generateText, type LanguageModel, type GenerateTextResult } from 'ai'
 
interface FallbackOptions {
  models: Array<{ model: LanguageModel; name: string }>
  maxRetries?: number       // Per-model retries (default: 2)
  initialDelayMs?: number   // Backoff start (default: 1000)
  maxDelayMs?: number       // Backoff cap (default: 5000)
}
 
interface FallbackResult<T> {
  result: T
  provider: string
  attemptIndex: number
  totalAttempts: number
}
 
export async function generateWithFallback(
  options: FallbackOptions & Omit<Parameters<typeof generateText>[0], 'model'>
): Promise<FallbackResult<GenerateTextResult<Record<string, never>>>> {
  const {
    models,
    maxRetries = 2,
    initialDelayMs = 1000,
    maxDelayMs = 5000,
    ...generateOptions
  } = options
 
  const errors: Array<{ provider: string; error: unknown }> = []
 
  for (let i = 0; i < models.length; i++) {
    const { model, name } = models[i]
 
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        const result = await generateText({ ...generateOptions, model })
        return {
          result,
          provider: name,
          attemptIndex: i,
          totalAttempts: errors.length + attempt + 1,
        }
      } catch (error) {
        console.warn(`[ai-fallback] ${name} attempt ${attempt + 1} failed:`, error)
        errors.push({ provider: name, error })
 
        if (attempt < maxRetries) {
          const delay = Math.min(initialDelayMs * 2 ** attempt, maxDelayMs)
          await new Promise(r => setTimeout(r, delay))
        }
      }
    }
  }
 
  throw new AggregateError(
    errors.map(e => e.error),
    `All ${models.length} providers failed: ${errors.map(e => e.provider).join(', ')}`
  )
}

4. Tenant Engine Migration (tool-use)

Modified: lib/tenant-engine/process.ts

Before (Anthropic-specific):

const CONVERSATION_TOOLS: Anthropic.Tool[] = [
  {
    name: 'create_issue',
    input_schema: {
      type: 'object',
      properties: {
        description: { type: 'string' },
        category: { type: 'string', enum: ['PLUMBING', ...] },
        urgency: { type: 'string', enum: ['LOW', 'MEDIUM', 'HIGH'] },
        confirmation_message: { type: 'string' },
      },
      required: ['description', 'category', 'urgency', 'confirmation_message'],
    },
  },
  // ... 4 more tools
]
 
const response = await anthropic.messages.create({
  model,
  tools: availableTools,
  messages,
})
 
const toolUseBlock = response.content.find(b => b.type === 'tool_use')
switch (toolUseBlock.name) { ... }

After (provider-agnostic via AI SDK):

import { generateText, tool } from 'ai'
import { z } from 'zod'
 
const CONVERSATION_TOOLS = {
  create_issue: tool({
    description: 'Create maintenance issue and notify landlord',
    parameters: z.object({
      description: z.string(),
      category: z.enum([
        'PLUMBING', 'ELECTRICAL', 'HEATING', 'STRUCTURAL', 'APPLIANCES',
        'CLEANING', 'PEST_CONTROL', 'LOCKS_SECURITY', 'WINDOWS_DOORS',
        'GARDEN_EXTERIOR', 'FIRE_SAFETY', 'DAMP_MOULD', 'NOISE_NUISANCE', 'OTHER',
      ]),
      urgency: z.enum(['LOW', 'MEDIUM', 'HIGH']),
      confirmation_message: z.string(),
    }),
  }),
 
  ask_for_details: tool({
    description: 'Ask tenant for more specifics about their issue',
    parameters: z.object({
      message: z.string(),
      missing_info: z.array(z.string()),
    }),
  }),
 
  ask_for_photo: tool({
    description: 'Request photo/video evidence after gathering details',
    parameters: z.object({
      message: z.string(),
    }),
  }),
 
  respond: tool({
    description: 'Send conversational response to tenant',
    parameters: z.object({
      message: z.string(),
    }),
  }),
 
  escalate: tool({
    description: 'Escalate conversation to human staff',
    parameters: z.object({
      message: z.string(),
      reason: z.string(),
    }),
  }),
}
 
// Tool-use with fallback
const { result, provider } = await generateWithFallback({
  models: getToolUseProviderChain(canCreateIssue),
  system: systemPrompt,
  messages,
  tools: canCreateIssue
    ? CONVERSATION_TOOLS
    : omit(CONVERSATION_TOOLS, ['create_issue']),
  temperature: 0,
  maxTokens: 1024,
})
 
// AI SDK normalises tool calls across providers
const toolCall = result.toolCalls[0]
if (toolCall) {
  switch (toolCall.toolName) {
    case 'create_issue':
      return await handleToolCreateIssue(toolCall.args, ...)
    case 'ask_for_details':
      return await handleToolAskForDetails(toolCall.args, ...)
    // ...
  }
}
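The snippet above uses an `omit` helper that is not defined in the file. A minimal typed implementation — an assumption; substitute the project's own utility (e.g. from lodash) if one exists:

```typescript
// Minimal omit helper assumed by the snippet above: returns a shallow copy of
// `obj` without the listed keys (used to withhold create_issue on gathering turns).
function omit<T extends Record<string, unknown>, K extends keyof T>(
  obj: T,
  keys: K[]
): Omit<T, K> {
  const drop = new Set<keyof T>(keys)
  return Object.fromEntries(
    Object.entries(obj).filter(([key]) => !drop.has(key as keyof T))
  ) as Omit<T, K>
}
```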

5. Provider Chain Helpers

New file: lib/ai/chains.ts

import type { LanguageModel } from 'ai'
import { createProviders } from './providers'
 
// Tenant engine: reliability > cost
export function getToolUseProviderChain(
  needsStrongModel: boolean
): Array<{ model: LanguageModel; name: string }> {
  const { anthropic, fireworks, workersAi } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
 
  if (anthropic) {
    chain.push({
      model: anthropic(needsStrongModel ? 'claude-sonnet-4-20250514' : 'claude-haiku-4-5-20251001'),
      name: needsStrongModel ? 'anthropic-sonnet' : 'anthropic-haiku',
    })
  }
 
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-70b-instruct'),
      name: 'fireworks-llama-70b',
    })
  }
 
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
 
  return chain
}
 
// RAG/summaries: cost > reliability
export function getCheapProviderChain(): Array<{ model: LanguageModel; name: string }> {
  const { fireworks, workersAi, anthropic } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
 
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-70b-instruct'),
      name: 'fireworks-llama-70b',
    })
  }
 
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
 
  if (anthropic) {
    chain.push({
      model: anthropic('claude-haiku-4-5-20251001'),
      name: 'anthropic-haiku',
    })
  }
 
  return chain
}
 
// Summaries: cheapest possible
export function getSummaryProviderChain(): Array<{ model: LanguageModel; name: string }> {
  const { fireworks, workersAi, anthropic } = createProviders(process.env)
  const chain: Array<{ model: LanguageModel; name: string }> = []
 
  if (fireworks) {
    chain.push({
      model: fireworks('accounts/fireworks/models/llama-v3p1-8b-instruct'),
      name: 'fireworks-llama-8b',
    })
  }
 
  chain.push({
    model: workersAi('@cf/qwen/qwen3-30b'),
    name: 'workers-ai-qwen3',
  })
 
  if (anthropic) {
    chain.push({
      model: anthropic('claude-haiku-4-5-20251001'),
      name: 'anthropic-haiku',
    })
  }
 
  return chain
}

6. Cloudflare AI Gateway Setup

Dashboard configuration (one-time):

  1. Cloudflare Dashboard → AI → AI Gateway → Create Gateway
  2. Name: envo-llm
  3. Enable: Caching (TTL: 300s), Rate Limiting (100 req/min), Logging

Gateway URL pattern:

https://gateway.ai.cloudflare.com/v1/{account_id}/envo-llm/{provider}

# Anthropic:   .../envo-llm/anthropic/v1/messages
# OpenAI:      .../envo-llm/openai/v1/chat/completions
# Fireworks:   .../envo-llm/fireworks-ai/v1/chat/completions
# Workers AI:  .../envo-llm/workers-ai/v1/chat/completions
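The pattern above can be captured in a small helper — the function name and example IDs are illustrative, not existing project code:

```typescript
// Sketch: build a Cloudflare AI Gateway URL from its parts.
function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path: string
): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}/${path}`
}

// e.g. the Anthropic Messages endpoint through the gateway:
const anthropicUrl = gatewayUrl('abc123', 'envo-llm', 'anthropic', 'v1/messages')
```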

Universal Endpoint (gateway-level fallback):

// POST https://gateway.ai.cloudflare.com/v1/{account_id}/envo-llm
// Body: array of provider configs — gateway tries in order
[
  {
    "provider": "fireworks-ai",
    "endpoint": "v1/chat/completions",
    "headers": { "Authorization": "Bearer ${FIREWORKS_API_KEY}" },
    "query": { "model": "accounts/fireworks/models/llama-v3p1-70b-instruct", ... }
  },
  {
    "provider": "workers-ai",
    "endpoint": "v1/chat/completions",
    "query": { "model": "@cf/qwen/qwen3-30b", ... }
  },
  {
    "provider": "anthropic",
    "endpoint": "v1/messages",
    "headers": { "x-api-key": "${ANTHROPIC_API_KEY}" },
    "query": { "model": "claude-haiku-4-5-20251001", ... }
  }
]

The cf-aig-step response header tells you which provider handled the request (0 = primary, 1 = first fallback, etc.) — useful for monitoring.
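A sketch of how that header could feed monitoring — the helper name is invented, and the chain array must match the provider order in the Universal Endpoint request body:

```typescript
// Sketch: map a cf-aig-step header value back to a provider name, using the
// same order as the Universal Endpoint fallback array.
function providerForStep(step: string | null, chain: string[]): string {
  if (step === null) return chain[0] ?? 'unknown'  // header absent → assume primary
  const index = Number.parseInt(step, 10)
  return chain[index] ?? 'unknown'
}

// Order must mirror the request body above.
const universalChain = ['fireworks-ai', 'workers-ai', 'anthropic']
```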

7. Environment Variables

New variables in .env:

# Cloudflare AI Gateway
CF_AI_GATEWAY_ACCOUNT_ID=       # Cloudflare account ID
CF_AI_GATEWAY_ID=envo-llm       # Gateway name
 
# Fireworks AI
FIREWORKS_API_KEY=               # From fireworks.ai dashboard
 
# Workers AI (no key needed — uses CF service binding)
# Just needs the AI Gateway URL
 
# Routing config
LLM_TENANT_ENGINE_PROVIDER=anthropic   # Primary for tool-use
LLM_RAG_PROVIDER=fireworks             # Primary for RAG
LLM_SUMMARY_PROVIDER=fireworks         # Primary for summaries
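
One way to honour these routing variables is to promote the named provider to the head of an existing chain. This is a sketch of one possible wiring, not a prescribed design; `withPrimary` and `ChainEntry` are invented names:

```typescript
// Sketch: move the provider named in an LLM_*_PROVIDER variable to the front
// of a fallback chain, preserving the relative order of the rest.
interface ChainEntry { name: string }

function withPrimary<T extends ChainEntry>(chain: T[], primary: string | undefined): T[] {
  if (!primary) return chain
  const index = chain.findIndex(entry => entry.name.startsWith(primary))
  if (index <= 0) return chain  // not found, or already first
  return [chain[index], ...chain.slice(0, index), ...chain.slice(index + 1)]
}
```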

Migration Plan

Phase 1: Foundation (no behaviour change)

  • Install ai, @ai-sdk/anthropic, @ai-sdk/openai
  • Create lib/ai/providers.ts — provider registry
  • Create lib/ai/fallback.ts — generateWithFallback utility
  • Create lib/ai/chains.ts — provider chain definitions
  • Set up Cloudflare AI Gateway in dashboard (envo-llm)
  • Add new env vars to .env.example
  • Wire AI Gateway URLs as baseURL for existing Anthropic + OpenAI clients
  • Verify all existing calls still work through the gateway (no behaviour change)

Phase 2: Migrate non-critical paths

  • Migrate lib/rag/generate.ts to use AI SDK + getCheapProviderChain()
  • Migrate lib/rag/intent.ts LLM path to use AI SDK
  • Migrate lib/rag/summary.ts to use AI SDK + getSummaryProviderChain()
  • Migrate app/api/import/map-columns/route.ts to use AI SDK
  • Remove old provider files (lib/rag/providers/claude.ts, openai.ts, kimi.ts, glm.ts)
  • Keep lib/rag/providers/stub.ts as dev fallback
  • Verify RAG quality with Fireworks Llama 3.1 70B (compare sample outputs)

Phase 3: Migrate tenant engine (critical path)

  • Rewrite tool definitions from Anthropic.Tool[] to Zod + AI SDK tool()
  • Replace anthropic.messages.create() with generateWithFallback() + getToolUseProviderChain()
  • Update tool response handling to use AI SDK’s result.toolCalls[0]
  • Preserve model routing logic (strong model for orchestration, light model for gathering)
  • Test: run 50+ sample conversations through each provider and compare tool selection accuracy
  • Test: verify emergency fast-path still bypasses LLM (keyword-only, no regression)
  • Test: verify identity gating still blocks create_issue tool for unconfirmed tenants
  • Deploy behind feature flag — route 10% traffic to Fireworks, monitor tool-use accuracy
  • Gradually increase to 100% once confident

Phase 4: Cleanup + monitoring

  • Remove @anthropic-ai/sdk direct imports (keep as AI SDK sub-provider)
  • Remove old lib/rag/providers/ directory (replaced by lib/ai/)
  • Update lib/rag/config.ts to reference new provider chain names
  • Add AI Gateway analytics to ops dashboard (cost per provider, latency, error rate)
  • Set up cost alerting (Cloudflare notifications if daily spend exceeds threshold)
  • Document provider chain configuration in .envo/learnings.md

Files to Create

| File | Purpose |
|---|---|
| lib/ai/providers.ts | Provider registry — Anthropic, Fireworks, Workers AI, OpenAI |
| lib/ai/fallback.ts | generateWithFallback() with exponential backoff |
| lib/ai/chains.ts | Use-case-specific provider chains |
| lib/ai/index.ts | Public exports |

Files to Modify

| File | Change |
|---|---|
| lib/tenant-engine/process.ts | Replace direct Anthropic SDK with AI SDK + fallback |
| app/api/import/map-columns/route.ts | Replace direct Anthropic SDK with AI SDK + fallback |
| lib/rag/generate.ts | Replace completeWithFallback with AI SDK |
| lib/rag/intent.ts | Replace LLM path with AI SDK |
| lib/rag/summary.ts | Replace completeWithFallback with AI SDK |
| lib/rag/config.ts | Update model references |
| package.json | Add ai, @ai-sdk/anthropic, @ai-sdk/openai |
| .env.example | Add Fireworks + AI Gateway env vars |

Files to Delete

| File | Reason |
|---|---|
| lib/rag/providers/claude.ts | Replaced by lib/ai/providers.ts |
| lib/rag/providers/openai.ts | Replaced by lib/ai/providers.ts |
| lib/rag/providers/kimi.ts | Replaced by lib/ai/providers.ts |
| lib/rag/providers/glm.ts | Replaced by lib/ai/providers.ts |
| lib/rag/providers/index.ts | Replaced by lib/ai/fallback.ts |
| lib/rag/providers/types.ts | Replaced by AI SDK types |

Keep: lib/rag/providers/stub.ts (adapt to AI SDK interface for dev fallback)


Open-Source Model Tool-Use Quality

Models ranked by tool-use reliability for our use case (multi-tool selection with structured output):

| Tier | Model | Size | Tool-Use Quality | Notes |
|---|---|---|---|---|
| S | Claude Sonnet | — | Excellent | Best-in-class. Current baseline. |
| A | Llama 3.1 70B | 70B | Good | Most battle-tested OSS model for tool-use. Available on Fireworks. |
| A | Qwen3 30B | 30B | Good | Strong function calling. Available on Workers AI natively. |
| A | Mistral Small 3.1 | 24B | Good | Parallel function support. Available on Workers AI. |
| B | Llama 3.3 70B | 70B | Good | Improved instruction following over 3.1. |
| B | Llama 4 Scout | 17B (MoE) | Decent | Multimodal + tool-use. New, less battle-tested. |
| C | Llama 3.1 8B | 8B | Adequate | Fine for simple single-tool calls (summaries). Unreliable for multi-tool. |
| C | Hermes 2 Pro | 7B | Adequate | Fine-tuned for function calling. Small but specialised. |

For the tenant engine, only Tier A+ models should be in the fallback chain. For RAG/summaries, Tier B-C is fine.


Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Open-source tool-use picks wrong tool | Tenant gets wrong response, issue miscategorised | Gradual rollout (10% → 50% → 100%). Compare tool selection accuracy per provider. Keep Anthropic as primary for tool-use. |
| AI Gateway adds latency | Slower tenant responses | Gateway is on Cloudflare's edge — expect <5ms overhead. Monitor via cf-aig-step timing. |
| Fireworks outage | RAG/summaries fail | Workers AI as free fallback (native, no external dependency). Anthropic as last resort. |
| Workers AI model quality degrades | Bad fallback responses | Workers AI is last resort, not primary. Monitor quality via logging. |
| AI SDK doesn't support a provider feature | Can't use provider-specific capabilities | AI SDK has provider-specific extensions. Worst case: drop to raw HTTP for that one call. |
| Tool definitions diverge between Zod and current Anthropic format | Behavioural regression | Side-by-side testing: run same prompts through old and new paths, compare tool selections. |
| Vercel AI SDK too heavy for Workers bundle | 10MB bundle limit exceeded | ai core is ~50KB. Provider packages are small. Should be fine. Verify during Phase 1. |
| Cache poisoning via AI Gateway | Stale/wrong responses served | Disable caching for tenant engine (tool-use). Only cache RAG queries (idempotent). |
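The cache mitigation can be expressed per call site. The header names below assume Cloudflare AI Gateway's documented cf-aig-* request headers — verify them against current docs before relying on this:

```typescript
// Sketch: per-use-case cache policy for AI Gateway. No caching for tool-use
// (tenant engine); short TTL for idempotent RAG/summary queries.
// Header names are an assumption based on Cloudflare AI Gateway docs.
function cacheHeaders(useCase: 'tenant-engine' | 'rag' | 'summary'): Record<string, string> {
  if (useCase === 'tenant-engine') return { 'cf-aig-skip-cache': 'true' }
  return { 'cf-aig-cache-ttl': '300' }  // seconds — matches the gateway config above
}
```

These headers could be passed via the AI SDK provider's per-request header options so the policy travels with each call.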

What’s NOT in This Plan

  • Embeddings migration — OpenAI text-embedding-3-small stays. Already cheap ($0.02/M tokens).
  • Streaming responses — Current architecture doesn’t stream to tenants (SMS/WhatsApp are request-response). Not needed yet.
  • OpenRouter — Adds a middleman + 5.5% fee. Direct provider access via AI Gateway is cheaper and gives us more control. Reconsider if we need 50+ models.
  • Self-hosted models — Not worth the ops burden at current scale. Revisit if we hit 10,000+ properties.
  • Fine-tuning — Open-source models work well enough out of the box for our use cases. Revisit if tool-use accuracy on Llama is consistently <90%.

Verification

  1. Phase 1: All existing calls work through AI Gateway — no behaviour change, just proxy
  2. Phase 2: RAG responses from Fireworks are comparable quality to Claude (manual review of 20 sample outputs)
  3. Phase 3: Tenant engine tool selection accuracy on Fireworks Llama 3.1 70B is >95% match vs Claude on a test set of 50 conversations
  4. Phase 3: Emergency detection still works (keyword fast-path, no LLM regression)
  5. Phase 3: Identity gating blocks create_issue for unconfirmed tenants (no regression)
  6. Phase 4: AI Gateway analytics show cost reduction of 40-55%
  7. Phase 4: P99 latency for tenant engine stays under 3 seconds
  8. Phase 4: Zero-downtime during provider failover (test by temporarily disabling primary provider API key)