Engineering

SoxAI: From Internal Tool to AI Infrastructure — Development Journey & 2026 Roadmap

How SoxAI went from a weekend script for switching between Claude and GPT-4 to a production AI gateway used by thousands of developers — and where we're taking it next.

Liang Bo
7 min read
#SoxAI #Roadmap #Development #AI Gateway #OpenAI

SoxAI didn't start as a product. It started as a 200-line Python script that helped the OneDotNet engineering team switch between Claude and GPT-4 without changing code everywhere. This post tells the story of how it grew, what we got wrong along the way, and the concrete roadmap for the rest of 2026.

The Origin: A Problem We Had Ourselves

In early 2024, our team was building AI features into FastSox — anomaly detection, smart connection recommendations, traffic classification. We were calling both the Anthropic API and the OpenAI API directly. The code looked like this:

# The problem: every callsite branched on the provider
import anthropic
import openai

if use_claude:
    client = anthropic.Anthropic()
    response = client.messages.create(model="claude-3-opus-20240229", ...)
    text = response.content[0].text
else:
    client = openai.OpenAI()
    response = client.chat.completions.create(model="gpt-4o", ...)
    text = response.choices[0].message.content

Every callsite had two branches. Switching models for an experiment meant touching dozens of files. When Anthropic had a brief outage in March 2024, our features silently degraded with no automatic failover.

The fix was obvious: an abstraction layer that presented a single interface and handled provider differences internally. We called it sox-relay initially. It was a FastAPI service with three endpoints.
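To make the idea concrete, here is a minimal sketch of the single-interface pattern sox-relay implemented. All names are hypothetical, and the adapter bodies are stand-ins for the real provider SDK calls:

```python
# A minimal sketch (hypothetical names) of the single-interface idea:
# callers pass a model name, the gateway picks the provider adapter.

def complete(model: str, prompt: str) -> str:
    """Dispatch to a provider adapter based on the model name."""
    adapter = _anthropic_complete if model.startswith("claude") else _openai_complete
    return adapter(model, prompt)

def _anthropic_complete(model: str, prompt: str) -> str:
    # Real version wraps anthropic.Anthropic().messages.create(...)
    # and extracts response.content[0].text.
    return f"[anthropic:{model}]"

def _openai_complete(model: str, prompt: str) -> str:
    # Real version wraps openai.OpenAI().chat.completions.create(...)
    # and extracts response.choices[0].message.content.
    return f"[openai:{model}]"
```

With this in place, switching models for an experiment means changing one string, not dozens of files.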

2024 — Building the Right Abstractions

The first external users found us on Hacker News after a short "I built a thing" post. Within two weeks we had ~300 teams signed up. Their usage immediately revealed what we'd missed:

Billing transparency. Teams wanted to know exactly how much each call cost, attributed to each project or user. We had no billing at all — it was pure passthrough.

Rate limit visibility. Providers have rate limits. When we hit them, we were returning 429s to users with no context. Teams needed to know whether the limit was theirs or the provider's.

Audit logs. Enterprise teams couldn't use a system with no record of what was sent and to whom.

We spent Q3 2024 rebuilding the billing and audit infrastructure. This is when we moved from SQLite (fine for prototyping) to Postgres with the atomic pre-consume design described in our architecture post.

The Worst Bug We Shipped

In October 2024 we had a billing bug that underbilled by approximately 15% for streaming requests. The root cause: when a streaming response is interrupted mid-way (client disconnects), we were recording zero output tokens instead of counting what had already been streamed.

We caught it because our cost reconciliation check — which compares what we charged vs. what upstream invoiced us — was failing for one customer with heavy streaming use. The fix was straightforward (track tokens as they stream, commit on disconnect), but the experience convinced us to invest heavily in billing test coverage. We now run a suite of 400+ billing integration tests against real provider responses recorded in CI.
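The shape of the fix is worth showing. This is a hedged sketch (function names are illustrative, not our actual code): output tokens are counted as each chunk streams, and the running total is committed in a finally block, so a mid-stream disconnect still bills what was actually delivered:

```python
# Sketch of the fix: count output tokens chunk by chunk and commit
# whatever was counted, even if the client disconnects mid-stream.

def stream_response(chunks, count_tokens, commit_usage):
    """Yield chunks to the client; always bill the tokens actually streamed."""
    output_tokens = 0
    try:
        for chunk in chunks:
            output_tokens += count_tokens(chunk)
            yield chunk
    finally:
        # Runs on normal completion AND on client disconnect
        # (a disconnect surfaces here as GeneratorExit).
        commit_usage(output_tokens)
```

A disconnect can be simulated by closing the generator early: the finally clause still commits the tokens streamed so far instead of recording zero.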

2025 — Going Multi-Tenant

In 2025 we grew from hundreds to thousands of teams. The architecture that worked at small scale started showing cracks:

Single Redis instance. We were storing all rate limit state in one Redis. A noisy tenant could create enough key churn to spike latency for everyone. We partitioned to 8 Redis shards, keyed by team_id mod 8.
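The shard-selection logic is as simple as it sounds. A sketch (the real service maps each index to a Redis connection pool):

```python
# Rate limit keys for a team always land on the same one of 8 shards,
# so a noisy tenant's key churn only affects its own shard.

NUM_SHARDS = 8

def shard_for_team(team_id: int) -> int:
    """Pick a stable shard index for a team's rate limit state."""
    return team_id % NUM_SHARDS
```

The trade-off of a fixed modulus is that adding shards later requires re-mapping keys, which was acceptable for ephemeral rate limit state.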

API key security model. Early API keys were stored as bcrypt hashes. bcrypt is intentionally slow — at high request rates, key validation was consuming 8% of CPU. We migrated to argon2id, with its memory-cost parameter lowered and tuned to our latency budget (under 5ms per key lookup).

Provider onboarding time. Adding a new provider required writing an adapter, deploying it, and manually testing each model. We standardised the adapter interface and built an internal test harness that runs a standard suite (basic chat, streaming, function calling, vision) against every adapter in CI. New provider time-to-production dropped from ~2 weeks to ~3 days.
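The standardised adapter interface can be sketched as a Protocol plus a harness loop that checks every adapter against the same capability suite. All names here are hypothetical:

```python
# Hypothetical sketch of a standardised adapter interface and the CI
# harness loop that runs the same suite against every adapter.

from typing import Protocol

class ProviderAdapter(Protocol):
    name: str
    def chat(self, messages: list[dict]) -> str: ...
    def supports(self, capability: str) -> bool: ...

CAPABILITIES = ["basic_chat", "streaming", "function_calling", "vision"]

def run_suite(adapter: ProviderAdapter) -> dict[str, bool]:
    """Report which standard capabilities the adapter supports."""
    return {cap: adapter.supports(cap) for cap in CAPABILITIES}
```

Because every new adapter must pass the same suite before deployment, the manual per-model testing step disappears, which is where most of the two-week onboarding time went.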

2026 Roadmap

Q1 2026: Prompt Caching Integration (In Progress)

Anthropic's prompt caching reduces cost by up to 90% for repeated prompt prefixes (like long system prompts or few-shot examples). OpenAI has an equivalent. We're adding automatic cache-key management to SoxAI so clients don't need to implement this themselves.

The design: SoxAI hashes the first N tokens of each request. If the hash matches a recent request from the same team, we attach the provider-specific cache control headers automatically. The cache hit rate in our internal testing is 62% for teams with consistent system prompts.
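A sketch of that scheme (names and the TTL are illustrative assumptions; the real version hashes the first N tokens, where words stand in for tokens here):

```python
# Sketch: hash the prompt prefix and remember recent hashes per team
# to decide when to attach the provider's cache-control directives.

import hashlib
import time

PREFIX_TOKENS = 1024     # "first N tokens" from the design above
TTL_SECONDS = 300        # assumption: provider prefix caches are short-lived

_recent: dict[tuple[str, str], float] = {}   # (team_id, prefix_hash) -> last seen

def prefix_hash(prompt: str) -> str:
    prefix = " ".join(prompt.split()[:PREFIX_TOKENS])
    return hashlib.sha256(prefix.encode()).hexdigest()

def should_attach_cache_control(team_id: str, prompt: str, now=None) -> bool:
    """True if this team sent the same prefix recently."""
    now = time.time() if now is None else now
    key = (team_id, prefix_hash(prompt))
    seen = _recent.get(key)
    _recent[key] = now
    return seen is not None and now - seen < TTL_SECONDS
```

Keying on (team, prefix) rather than prefix alone keeps one team's cache state from leaking into another's routing decisions.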

Target: GA in April 2026.

Q2 2026: Semantic Router

Today routing is purely based on model alias and channel health. In Q2 we're adding a semantic routing layer that can dispatch requests to different models based on the content of the prompt:

# Example routing rule
routes:
  - name: code-tasks
    trigger:
      classifier: builtin/code-detection
      threshold: 0.85
    target_model: claude-3-7-sonnet-20250219
    fallback_model: gpt-4o
 
  - name: long-context-tasks
    trigger:
      input_tokens: ">= 50000"
    target_model: gemini-2.0-flash-exp
 
  - name: default
    target_model: gpt-4o-mini

The classifier runs as a local DistilBERT model (no API call required), adding under 10ms to routing latency. Initial categories: code, long-context, vision, reasoning, and a general-purpose fallback.
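Evaluated in code, the example rules above reduce to an ordered dispatch. A Python sketch (the real classifier score comes from the local DistilBERT model):

```python
# Mirror of the example routing rules: code tasks first,
# then long-context, then the default.

def route(code_score: float, input_tokens: int) -> str:
    """Pick a target model given the classifier score and input size."""
    if code_score >= 0.85:
        return "claude-3-7-sonnet-20250219"
    if input_tokens >= 50000:
        return "gemini-2.0-flash-exp"
    return "gpt-4o-mini"
```

Rules are checked in declaration order, so a 60k-token code prompt still routes to the code model, not the long-context one.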

Q3 2026: Agent Execution Layer

The biggest shift in AI usage in 2026 is agentic workloads — Claude or GPT-4 running multi-step tasks, calling tools, and deciding when to stop. These look different from single-turn chat in several ways:

  • Long-running: A task might take 5 minutes, not 500ms
  • Multi-call: One "task" maps to dozens of API calls
  • Stateful: Each call needs the full history of prior calls

We're building a lightweight agent execution layer on top of SoxAI's existing infrastructure:

  • Task ID groups all API calls belonging to the same agent run
  • Usage and cost is attributed per-task, not per-call
  • Long-running tasks get a webhook callback instead of blocking the HTTP connection
  • Tool definitions are registered once and referenced by ID rather than re-sent with every call

This is not a full agent framework (we're not building LangChain). It's a session management layer that makes agentic patterns first-class in the gateway.
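The per-task attribution piece can be sketched in a few lines (hypothetical names; the real layer persists this state rather than holding it in memory):

```python
# Sketch of the session layer: every API call carries a task_id,
# and usage rolls up per task rather than per call.

import uuid
from collections import defaultdict

_task_usage: dict[str, dict[str, int]] = defaultdict(
    lambda: {"calls": 0, "tokens": 0}
)

def start_task() -> str:
    """Create a task ID that groups all calls in one agent run."""
    return uuid.uuid4().hex

def record_call(task_id: str, tokens: int) -> None:
    usage = _task_usage[task_id]
    usage["calls"] += 1
    usage["tokens"] += tokens

def task_usage(task_id: str) -> dict[str, int]:
    return dict(_task_usage[task_id])
```

Billing then reads task_usage once per run instead of summing dozens of individual call records.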

Q4 2026: Private Deployment

Several enterprise prospects can't use a shared cloud service due to data sovereignty requirements — financial services, healthcare, government. In Q4 we'll release a self-hosted SoxAI package:

  • Docker Compose for single-server evaluation
  • Helm chart for Kubernetes production deployments
  • Postgres + Redis as the only external dependencies
  • Air-gapped mode: provider credentials stay on-premise, no telemetry to SoxAI cloud
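As a rough illustration of the single-server evaluation shape — service names, image tags, and environment variables here are assumptions, not the shipped artefact:

```yaml
# Sketch of a single-server evaluation setup (illustrative names only)
services:
  soxai:
    image: soxai/gateway:latest
    ports: ["8080:8080"]
    environment:
      DATABASE_URL: postgres://soxai:soxai@db:5432/soxai
      REDIS_URL: redis://redis:6379
      TELEMETRY: "off"    # air-gapped mode: no callbacks to SoxAI cloud
    depends_on: [db, redis]
  db:
    image: postgres:16
  redis:
    image: redis:7
```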

The UI and core API will be open-sourced under a BSL 1.1 licence at launch. Commercial deployments above a certain request volume require a licence.

Things We Got Wrong (and Learned From)

Underestimating provider API drift. OpenAI changes the tool_calls format in streaming responses semi-regularly. We learned to version-pin our response parsers per model and run regression tests against recorded API traffic.

Building billing before auth. Our first auth system was "any valid JWT works." We added RBAC (owner, admin, member, read-only) 9 months later, which required a painful data migration. Build auth correctly before you have 2,000 teams.

Not exposing enough raw provider headers. Some providers return useful metadata (X-Request-Id, rate limit remaining, cache hit status) in response headers that we were stripping. Enterprise teams needed these for their own observability. We now pass through all x- headers verbatim.


Read the technical architecture post: SoxAI Architecture: Building a Multi-Provider AI Gateway at Scale

Sign up at console.soxai.io — the free tier includes 5 million tokens/month.
