SoxAI Architecture: Building a Multi-Provider AI Gateway at Scale
SoxAI gives developers and enterprises a single OpenAI-compatible endpoint that fans out to 200+ models from 40+ providers. The pitch is simple; the implementation is not. This post is a detailed look at the system design (channel health scoring, atomic billing, and the multi-tenant isolation model), written for engineers who want to understand what's actually running.
High-Level Architecture
Client (OpenAI SDK)
  │  HTTPS /v1/chat/completions
  ▼
API Gateway (Rust, axum)
  │
  ├── Auth & Rate Limit ──────► Redis (token buckets)
  │
  ├── Request Router ─────────► Channel Health Store
  │                             (scored every 30s)
  │
  ├── Usage Pre-Reserve ──────► Postgres (atomic billing)
  │
  ▼
Upstream Adapter Pool
  ├── Anthropic adapter
  ├── OpenAI adapter
  ├── Google Vertex adapter
  ├── AWS Bedrock adapter
  └── ... (37 more)
  │
  ▼
Upstream Provider API
  │
  ▼
Response Transformer
  │
  ▼
Usage Finalize & Log ─────────► ClickHouse (analytics)
  │
  ▼
Client
Every component in this diagram runs as an independent service. The API Gateway is stateless: any instance can handle any request. State lives in Redis (hot), Postgres (authoritative), and ClickHouse (analytical).
The Channel Abstraction
SoxAI's internal unit of routing is a channel: a combination of a provider, a model, and a set of credentials. A single model like gpt-4o can have multiple channels, for example separate channels for the US and EU Azure OpenAI deployments, or a primary channel and a backup channel on different API keys.
interface Channel {
id: string;
provider: "openai" | "anthropic" | "google" | "aws_bedrock" | ...;
model_alias: string; // What the client sends ("gpt-4o")
upstream_model: string; // What we send upstream ("gpt-4o-2024-11-20")
endpoint: string;
api_key_ref: string; // Reference to secret store, never plaintext
region?: string;
priority: number; // Higher = preferred when healthy
weight: number; // Used in weighted round-robin within same priority
max_rpm: number; // Rate limit we enforce before even trying
health_score: number; // 0–100, computed every 30s
}
Health Scoring
Every 30 seconds, a background job runs synthetic probes against each active channel and recomputes its health_score from a five-minute metrics window. The score combines:
def compute_health_score(channel_id: str) -> float:
    m = get_metrics(channel_id, window_seconds=300)

    # Component scores (each 0–1)
    availability = 1.0 - m.error_rate_5m
    latency_score = sigmoid_normalize(m.p95_ttfb_ms, target=800, scale=400)
    consistency = 1.0 - clamp(m.p95_ttfb_ms / m.p50_ttfb_ms - 1.0, 0, 1)

    # Weighted composite
    score = (
        0.50 * availability +
        0.35 * latency_score +
        0.15 * consistency
    ) * 100

    # Hard penalties
    if m.consecutive_errors >= 3:
        score *= 0.1  # Near-zero until recovery

    return round(score, 1)
Channels with a score below 20 are removed from the active routing pool immediately and re-evaluated every 60 seconds. This is the mechanism that delivers transparent failover: when Anthropic has an incident, all Claude 3.5 Sonnet channels from Anthropic drop off, and requests automatically route to any configured backup (e.g., Claude on AWS Bedrock).
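compute_health_score leans on two helpers that the listing doesn't define. A plausible reading of them (the exact curve shapes are assumptions, not SoxAI's actual code) is:

```python
import math

def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def sigmoid_normalize(value_ms: float, target: float, scale: float) -> float:
    """Map a latency onto (0, 1): roughly 0.5 at `target`, approaching 1
    as latency drops well below it and 0 as it climbs well above it.
    `scale` controls how quickly the score falls off."""
    return 1.0 / (1.0 + math.exp((value_ms - target) / scale))
```

With target=800 and scale=400, a channel at 200 ms p95 TTFB scores well above 0.5 on the latency component, while one at 2,000 ms scores close to zero.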
Routing Algorithm
Given a model alias and a set of healthy channels:
- Filter to channels where health_score >= 20
- Group by priority; always prefer the highest priority first
- Within a priority group, use weighted random selection proportional to weight * health_score
- If the selected channel returns a 429 or 503, mark it temporarily degraded and retry once with a different channel in the same priority group
This keeps routing decisions local: no distributed coordination is needed per request. The health scores are pre-computed and cached in Redis with a 35-second TTL.
Atomic Billing
One of the hardest problems in AI gateway design is billing integrity. Token counts for LLM responses aren't known until the response completes (or finishes streaming), which creates a window in which usage can exceed a budget before any enforcement can react.
SoxAI solves this with a pre-consume, post-reconcile model:
-- Before forwarding to upstream:
BEGIN;
SELECT balance FROM accounts WHERE id = $1 FOR UPDATE;
-- If balance < estimated_cost: ROLLBACK and return 402
UPDATE accounts
SET balance = balance - $estimated_cost,
    reserved = reserved + $estimated_cost
WHERE id = $1;
INSERT INTO reservations (request_id, account_id, amount, created_at)
VALUES ($2, $1, $estimated_cost, NOW());
COMMIT;

-- After upstream responds:
BEGIN;
UPDATE accounts
SET reserved = reserved - $estimated_cost,
    balance = balance + $estimated_cost - $actual_cost
WHERE id = $1;
DELETE FROM reservations WHERE request_id = $2;
INSERT INTO usage_log (...) VALUES (...);
COMMIT;
$estimated_cost is computed from the input token count (known before forwarding) multiplied by a per-model cost coefficient, plus a 20% buffer for output tokens. The post-reconcile step adjusts for the actual output token count.
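The estimate and reconcile arithmetic can be sketched as below (the per-token prices are made-up placeholders, not SoxAI's actual coefficients):

```python
# Placeholder per-token prices; real price tables are per model.
PRICES = {"input_per_token": 1e-6, "output_per_token": 3e-6}

def estimate_cost(input_tokens: int, prices: dict) -> float:
    """Pre-reserve figure: the input side is exact (tokens are counted
    before forwarding); a 20% buffer stands in for the as-yet-unknown
    output tokens."""
    return input_tokens * prices["input_per_token"] * 1.20

def actual_cost(input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Reconcile figure once the output token count is known; the
    second transaction credits back (estimated - actual)."""
    return (input_tokens * prices["input_per_token"]
            + output_tokens * prices["output_per_token"])
```

Note that a long response can make the actual cost exceed the buffered estimate, which is exactly the residual risk the reconcile step absorbs.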
This design has one important property: the balance check and the debit happen under the same row lock, so concurrent requests cannot double-spend, and an account can only go negative by the amount an actual cost exceeds its 20%-buffered estimate. The Postgres row-level lock (FOR UPDATE) serializes access to each account's balance.
For streaming responses, we buffer the full output before finalizing billing. This adds a few hundred milliseconds to the perceived end-of-stream but is necessary for accuracy. We're investigating a streaming-reconcile approach for 2026.
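A minimal sketch of the buffer-then-finalize step (the chunk iterable and the whitespace "tokenizer" are stand-ins for real SSE parsing and the model's tokenizer):

```python
def finalize_stream(chunks) -> tuple[str, int]:
    """Accumulate streamed deltas, then return the full text plus an
    output-token count for the billing reconcile step. `chunks` is any
    iterable of text deltas; the whitespace split is a placeholder for
    the model's real tokenizer."""
    parts = []
    for delta in chunks:
        parts.append(delta)  # in production, also forwarded to the client here
    text = "".join(parts)
    return text, len(text.split())
```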
Multi-Tenant Isolation
SoxAI is a multi-tenant system. Tenant isolation operates at three levels:
Level 1 – API Key Namespacing
Every API key belongs to exactly one team. Keys are stored as argon2id hashes, never in plaintext. The lookup is team_id = key_table[hash(input_key)]. Cross-tenant key reuse is impossible because each key hash maps to exactly one team record.
Level 2 – Row-Level Security in Postgres
All application tables have a team_id column and a corresponding RLS policy:
ALTER TABLE usage_log ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON usage_log
  USING (team_id = current_setting('app.team_id')::uuid);
The application layer sets app.team_id at connection checkout. This means even a SQL injection bug cannot leak cross-tenant data: the database itself enforces the boundary.
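How the application layer sets app.team_id is not shown; a driver-agnostic sketch of the checkout statement (the pool wiring around it is assumed) is:

```python
def checkout_setup(team_id: str) -> tuple[str, tuple]:
    """Statement a connection pool runs when handing a connection to a
    request handler, so the RLS policy sees the right tenant. Returned
    as (sql, params) for a DB-API driver rather than interpolated, to
    keep the tenant id out of the SQL text.
    is_local=false: the setting persists for the session, not just the
    current transaction; the pool must reset it on check-in."""
    return ("SELECT set_config('app.team_id', %s, false)", (team_id,))
```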
Level 3 – Rate Limit Namespacing in Redis
Rate limit counters are keyed as ratelimit:{team_id}:{model}:{window}. There is no shared counter between tenants. A tenant consuming their full RPM allowance has no effect on other tenants.
WebAuthn MFA
All console logins require either TOTP or WebAuthn (passkey). We implemented WebAuthn using the webauthn-rs crate, which enforces:
- User verification required (UV = required)
- Resident key required for passkey flows
- RP ID pinned to soxai.io
Sessions are 8-hour JWTs signed with EdDSA (Ed25519). Refresh tokens are stored server-side and bound to the initiating IP address and user-agent hash. Any change to those attributes invalidates the refresh token.
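The refresh-token binding can be sketched with the standard library (record field names here are assumptions):

```python
import hashlib
import hmac
import secrets

def bind_refresh_token(ip: str, user_agent: str) -> dict:
    """Server-side refresh-token record, bound to the initiating IP
    and a user-agent hash as described above."""
    return {
        "token": secrets.token_urlsafe(32),
        "ip": ip,
        "ua_hash": hashlib.sha256(user_agent.encode()).hexdigest(),
    }

def refresh_allowed(record: dict, ip: str, user_agent: str) -> bool:
    """Any change to IP or user-agent invalidates the refresh token."""
    ua_hash = hashlib.sha256(user_agent.encode()).hexdigest()
    return record["ip"] == ip and hmac.compare_digest(record["ua_hash"], ua_hash)
```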
Observability
SoxAI emits OpenTelemetry traces for every request. Each trace includes:
- soxai.channel_id – which channel was selected
- soxai.upstream_model – exact model string sent upstream
- soxai.ttfb_ms – time to first token
- soxai.input_tokens, soxai.output_tokens
- soxai.upstream_status_code
- soxai.routing_reason – why this channel was chosen
These attributes are queryable in Grafana Tempo and, more importantly, they feed back into the channel health scoring pipeline. The full trace-to-health-score loop has an end-to-end latency of under 35 seconds.