SoxAI Architecture: Building a Multi-Provider AI Gateway at Scale
SoxAI gives developers and enterprises a single OpenAI-compatible endpoint that fans out to 200+ models from 40+ providers. The pitch is simple; the implementation is not. This post is a detailed look at the system design (channel health scoring, atomic billing, and the multi-tenant isolation model), written for engineers who want to understand what's actually running.
High-Level Architecture
Client (OpenAI SDK)
  │  HTTPS /v1/chat/completions
  ▼
API Gateway (Rust, axum)
  │
  ├── Auth & Rate Limit ──────► Redis (token buckets)
  │
  ├── Request Router ─────────► Channel Health Store
  │                             (scored every 30s)
  │
  ├── Usage Pre-Reserve ──────► Postgres (atomic billing)
  │
  ▼
Upstream Adapter Pool
  ├── Anthropic adapter
  ├── OpenAI adapter
  ├── Google Vertex adapter
  ├── AWS Bedrock adapter
  └── ... (37 more)
  │
  ▼
Upstream Provider API
  │
  ▼
Response Transformer
  │
  ▼
Usage Finalize & Log ─────────► ClickHouse (analytics)
  │
  ▼
Client
Every component in this diagram runs as an independent service. The API Gateway is stateless: any instance can handle any request. State lives in Redis (hot), Postgres (authoritative), and ClickHouse (analytical).
The Channel Abstraction
SoxAI's internal unit of routing is a channel: a combination of a provider, a model, and a set of credentials. A single model like gpt-4o can have multiple channels, for example separate channels for the US and EU Azure OpenAI deployments, or a primary channel and a backup channel on different API keys.
interface Channel {
id: string;
provider: "openai" | "anthropic" | "google" | "aws_bedrock" | ...;
model_alias: string; // What the client sends ("gpt-4o")
upstream_model: string; // What we send upstream ("gpt-4o-2024-11-20")
endpoint: string;
api_key_ref: string; // Reference to secret store, never plaintext
region?: string;
priority: number; // Higher = preferred when healthy
weight: number; // Used in weighted round-robin within same priority
max_rpm: number; // Rate limit we enforce before even trying
health_score: number; // 0–100, computed every 30s
}
Health Scoring
Every 30 seconds, a background job runs synthetic probes against each active channel and recomputes its health_score from a five-minute metrics window. The score combines:
def compute_health_score(channel_id: str) -> float:
    m = get_metrics(channel_id, window_seconds=300)

    # Component scores (each 0–1)
    availability = 1.0 - m.error_rate_5m
    latency_score = sigmoid_normalize(m.p95_ttfb_ms, target=800, scale=400)
    consistency = 1.0 - clamp(m.p95_ttfb_ms / m.p50_ttfb_ms - 1.0, 0, 1)

    # Weighted composite
    score = (
        0.50 * availability +
        0.35 * latency_score +
        0.15 * consistency
    ) * 100

    # Hard penalties
    if m.consecutive_errors >= 3:
        score *= 0.1  # Near-zero until recovery

    return round(score, 1)
Channels with a score below 20 are removed from the active routing pool immediately and re-evaluated every 60 seconds. This is the mechanism that delivers transparent failover: when Anthropic has an incident, all Claude 3.5 Sonnet channels from Anthropic drop off, and requests automatically route to any configured backup (e.g., Claude on AWS Bedrock).
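compute_health_score leans on two helpers that the listing doesn't define. A plausible reading of them (the exact curve shapes are assumptions, not SoxAI's actual code) is:

```python
import math

def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def sigmoid_normalize(value_ms: float, target: float, scale: float) -> float:
    """Map a latency onto (0, 1): roughly 0.5 at `target`, approaching 1
    as latency drops well below it and 0 as it climbs well above it.
    `scale` controls how quickly the score falls off."""
    return 1.0 / (1.0 + math.exp((value_ms - target) / scale))
```

With target=800 and scale=400, a channel at 200 ms p95 TTFB scores well above 0.5 on the latency component, while one at 2,000 ms scores close to zero.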
Routing Algorithm
Given a model alias and a set of healthy channels:
- Filter to channels where health_score >= 20
- Group by priority; always prefer the highest priority first
- Within a priority group, use weighted random selection proportional to weight * health_score
- If the selected channel returns a 429 or 503, mark it temporarily degraded and retry once with a different channel in the same priority group
This keeps routing decisions local: no distributed coordination is needed per request. The health scores are pre-computed and cached in Redis with a 35-second TTL.
Atomic Billing
One of the hardest problems in AI gateway design is billing integrity. Token counts for LLM responses aren't known until the response completes (or finishes streaming), which creates a window in which usage can exceed a budget before any enforcement can react.
SoxAI solves this with a pre-consume, post-reconcile model:
-- Before forwarding to upstream:
BEGIN;
SELECT balance FROM accounts WHERE id = $1 FOR UPDATE;
-- If balance < estimated_cost: ROLLBACK and return 402
UPDATE accounts
SET balance = balance - $estimated_cost,
    reserved = reserved + $estimated_cost
WHERE id = $1;
INSERT INTO reservations (request_id, account_id, amount, created_at)
VALUES ($2, $1, $estimated_cost, NOW());
COMMIT;

-- After upstream responds:
BEGIN;
UPDATE accounts
SET reserved = reserved - $estimated_cost,
    balance = balance + $estimated_cost - $actual_cost
WHERE id = $1;
DELETE FROM reservations WHERE request_id = $2;
INSERT INTO usage_log (...) VALUES (...);
COMMIT;
$estimated_cost is computed from the input token count (known before forwarding) multiplied by a per-model cost coefficient, plus a 20% buffer for output tokens. The post-reconcile step adjusts for the actual output token count.
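The estimate and reconcile arithmetic can be sketched as below (the per-token prices are made-up placeholders, not SoxAI's actual coefficients):

```python
# Placeholder per-token prices; real price tables are per model.
PRICES = {"input_per_token": 1e-6, "output_per_token": 3e-6}

def estimate_cost(input_tokens: int, prices: dict) -> float:
    """Pre-reserve figure: the input side is exact (tokens are counted
    before forwarding); a 20% buffer stands in for the as-yet-unknown
    output tokens."""
    return input_tokens * prices["input_per_token"] * 1.20

def actual_cost(input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Reconcile figure once the output token count is known; the
    second transaction credits back (estimated - actual)."""
    return (input_tokens * prices["input_per_token"]
            + output_tokens * prices["output_per_token"])
```

Note that a long response can make the actual cost exceed the buffered estimate, which is exactly the residual risk the reconcile step absorbs.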
This design has one important property: the balance check and the debit happen under the same row lock, so concurrent requests cannot double-spend, and an account can only go negative by the amount an actual cost exceeds its 20%-buffered estimate. The Postgres row-level lock (FOR UPDATE) serializes access to each account's balance.
For streaming responses, we buffer the full output before finalizing billing. This adds a few hundred milliseconds to the perceived end-of-stream but is necessary for accuracy. We're investigating a streaming-reconcile approach for 2026.
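A minimal sketch of the buffer-then-finalize step (the chunk iterable and the whitespace "tokenizer" are stand-ins for real SSE parsing and the model's tokenizer):

```python
def finalize_stream(chunks) -> tuple[str, int]:
    """Accumulate streamed deltas, then return the full text plus an
    output-token count for the billing reconcile step. `chunks` is any
    iterable of text deltas; the whitespace split is a placeholder for
    the model's real tokenizer."""
    parts = []
    for delta in chunks:
        parts.append(delta)  # in production, also forwarded to the client here
    text = "".join(parts)
    return text, len(text.split())
```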
Multi-Tenant Isolation
SoxAI is a multi-tenant system. Tenant isolation operates at three levels:
Level 1 – API Key Namespacing
Every API key belongs to exactly one team. Keys are stored as argon2id hashes, never in plaintext. The lookup is team_id = key_table[hash(input_key)]. Cross-tenant key reuse is impossible because each key hash maps to exactly one team record.
Level 2 – Row-Level Security in Postgres
All application tables have a team_id column and a corresponding RLS policy:
ALTER TABLE usage_log ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON usage_log
  USING (team_id = current_setting('app.team_id')::uuid);
The application layer sets app.team_id at connection checkout. This means even a SQL injection bug cannot leak cross-tenant data: the database itself enforces the boundary.
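How the application layer sets app.team_id is not shown; a driver-agnostic sketch of the checkout statement (the pool wiring around it is assumed) is:

```python
def checkout_setup(team_id: str) -> tuple[str, tuple]:
    """Statement a connection pool runs when handing a connection to a
    request handler, so the RLS policy sees the right tenant. Returned
    as (sql, params) for a DB-API driver rather than interpolated, to
    keep the tenant id out of the SQL text.
    is_local=false: the setting persists for the session, not just the
    current transaction; the pool must reset it on check-in."""
    return ("SELECT set_config('app.team_id', %s, false)", (team_id,))
```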
Level 3 – Rate Limit Namespacing in Redis
Rate limit counters are keyed as ratelimit:{team_id}:{model}:{window}. There is no shared counter between tenants. A tenant consuming their full RPM allowance has no effect on other tenants.
WebAuthn MFA
All console logins require either TOTP or WebAuthn (passkey). We implemented WebAuthn using the webauthn-rs crate, which enforces:
- User verification required (UV = required)
- Resident key required for passkey flows
- RP ID pinned to soxai.io
Sessions are 8-hour JWTs signed with EdDSA (Ed25519). Refresh tokens are stored server-side and bound to the initiating IP address and user-agent hash. Any change to those attributes invalidates the refresh token.
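The refresh-token binding can be sketched with the standard library (record field names here are assumptions):

```python
import hashlib
import hmac
import secrets

def bind_refresh_token(ip: str, user_agent: str) -> dict:
    """Server-side refresh-token record, bound to the initiating IP
    and a user-agent hash as described above."""
    return {
        "token": secrets.token_urlsafe(32),
        "ip": ip,
        "ua_hash": hashlib.sha256(user_agent.encode()).hexdigest(),
    }

def refresh_allowed(record: dict, ip: str, user_agent: str) -> bool:
    """Any change to IP or user-agent invalidates the refresh token."""
    ua_hash = hashlib.sha256(user_agent.encode()).hexdigest()
    return record["ip"] == ip and hmac.compare_digest(record["ua_hash"], ua_hash)
```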
Observability
SoxAI emits OpenTelemetry traces for every request. Each trace includes:
- soxai.channel_id – which channel was selected
- soxai.upstream_model – exact model string sent upstream
- soxai.ttfb_ms – time to first token
- soxai.input_tokens, soxai.output_tokens
- soxai.upstream_status_code
- soxai.routing_reason – why this channel was chosen
These attributes are queryable in Grafana Tempo and, more importantly, they feed back into the channel health scoring pipeline. The full trace-to-health-score loop has an end-to-end latency of under 35 seconds.