LLM Status Architecture: Monitoring 40+ AI Providers in Real Time
LLM Status answers one question: "Is this AI model working right now?" The question is simple. Building a reliable answer for 40+ providers, 200+ model endpoints, and an international audience requires more infrastructure than it might first appear. This post explains the full technical design.
What We're Actually Measuring
Most status pages monitor HTTP availability: is the endpoint returning 200? That's necessary but not sufficient for AI model APIs. A provider can return 200 while:
- Producing output 10× slower than normal (GPU degradation)
- Truncating responses mid-generation (OOM on the inference cluster)
- Returning garbage tokens (model serving misconfiguration)
- Working fine for short prompts but timing out on long ones
LLM Status runs semantic probes: real inference requests that measure what users actually care about.
Probe Design
Each provider gets a probe definition that specifies:
# Example probe for Anthropic Claude 3.5 Sonnet
provider: anthropic
model: claude-3-5-sonnet-20241022
probe_type: chat
request:
  system: "You are a calculator."
  messages:
    - role: user
      content: "What is 17 × 23? Reply with only the number."
expected:
  response_contains: "391"
  min_tokens: 1
  max_tokens: 10
timeout_ms: 8000
metrics:
  - ttfb_ms     # Time to first token
  - total_ms    # Full response time
  - tokens_output
sla:
  ttfb_p95_ms: 2000
  total_p95_ms: 5000

The response_contains check catches semantic failures: if the model returns "I cannot perform calculations" instead of a number, the probe fails even if the HTTP status is 200.
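The prober itself is written in Go (see the Open Source section), but as a rough sketch of the idea, the semantic check amounts to something like the following; the function and field names are illustrative, not the real prober's schema.

# Illustrative semantic check; field names mirror the YAML above but are
# not the real (Go) prober's schema.
from dataclasses import dataclass

@dataclass
class ProbeOutcome:
    success: bool
    reason: str | None = None

def check_response(text: str, tokens_out: int, expected_contains: str,
                   min_tokens: int, max_tokens: int) -> ProbeOutcome:
    # An HTTP 200 alone is not enough: the content itself must match.
    if expected_contains not in text:
        return ProbeOutcome(False, "semantic_mismatch")
    if not (min_tokens <= tokens_out <= max_tokens):
        return ProbeOutcome(False, "token_count_out_of_range")
    return ProbeOutcome(True)

# A refusal fails the probe even though the API call itself "succeeded".
print(check_response("I cannot perform calculations.", 6, "391", 1, 10))
print(check_response("391", 1, "391", 1, 10))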
Probe Frequency
We run probes every 60 seconds per model. With 200+ models, that's more than 200 outbound inference requests every minute. To avoid tripping provider rate limits, we spread the load three ways:
- A pool of 20 prober workers in different geographic regions (US-East, US-West, EU-West, AP-Southeast)
- Dedicated API keys for monitoring (separate from production keys)
- Jittered probe timing (±15 seconds random offset) to avoid thundering herd
Each prober worker runs independently and writes results to a central time-series store.
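As a rough illustration of that jitter, a worker's scheduling loop might look like this; the constants and run_probe() helper below are assumptions for the sketch, not the real Go implementation.

# Illustrative jittered scheduling loop for one prober worker.
import random
import time

PROBE_INTERVAL_S = 60
JITTER_S = 15  # ±15 s random offset, as described above

def run_probe(model_id: str) -> None:
    ...  # issue the inference request and send the result to the ingest API

def probe_loop(model_ids: list[str]) -> None:
    while True:
        for model_id in model_ids:
            run_probe(model_id)
        # Sleep roughly one interval, offset by jitter so workers in different
        # regions don't all hit the same provider at the same instant.
        time.sleep(PROBE_INTERVAL_S + random.uniform(-JITTER_S, JITTER_S))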
Data Pipeline
Prober Workers (20 regions)
        │
        │ HTTP + gRPC
        ▼
Ingest API (Rust, axum)
        │
        ├──► TimescaleDB (raw probe results, 30-day retention)
        │
        ├──► Redis Streams (real-time fan-out)
        │
        ├──► Anomaly Detector (Python, runs every 30s)
        │
        └──► WebSocket Hub (pushes to connected browsers)
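To give a feel for the fan-out stage, a downstream consumer (for example, the anomaly detector or the WebSocket hub) can read new probe results off a Redis Stream roughly like this; the stream name and payload field are assumptions, not the production schema.

# Sketch of a Redis Streams consumer on the fan-out path. The stream name
# ("probe_results") and the "payload" field are assumptions for illustration.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
last_id = "$"  # start from new entries only

while True:
    # Block for up to 5 seconds waiting for new probe results.
    entries = r.xread({"probe_results": last_id}, count=100, block=5000)
    for _stream, messages in entries:
        for msg_id, fields in messages:
            last_id = msg_id
            result = json.loads(fields[b"payload"])
            print(result["provider"], result["model"], result["success"])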
TimescaleDB Schema
Raw probe results are stored as time-series data with hypertables partitioned by day:
CREATE TABLE probe_results (
    ts         TIMESTAMPTZ NOT NULL,
    provider   TEXT NOT NULL,
    model      TEXT NOT NULL,
    region     TEXT NOT NULL,
    ttfb_ms    INTEGER,
    total_ms   INTEGER,
    tokens_out INTEGER,
    success    BOOLEAN NOT NULL,
    error_code TEXT,
    error_msg  TEXT
);

SELECT create_hypertable('probe_results', 'ts');
CREATE INDEX ON probe_results (provider, model, ts DESC);

Continuous aggregates pre-compute hourly and daily rollups to keep the public dashboard responsive:
CREATE MATERIALIZED VIEW probe_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', ts) AS bucket,
    provider,
    model,
    region,
    AVG(ttfb_ms) FILTER (WHERE success) AS avg_ttfb_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ttfb_ms)
        FILTER (WHERE success) AS p95_ttfb_ms,
    COUNT(*) FILTER (WHERE success)::float / COUNT(*) AS success_rate
FROM probe_results
GROUP BY bucket, provider, model, region;

Anomaly Detection
The anomaly detector runs every 30 seconds against the last 5 minutes of data. It uses a simple but effective algorithm: 3-sigma deviation from the trailing 7-day hourly baseline.
def detect_anomaly(model_id: str, metric: str, current_value: float) -> Anomaly | None:
    baseline = get_baseline(model_id, metric)  # 7-day hourly p50 and stddev
    hour_of_week = current_hour_of_week()
    mu = baseline[hour_of_week]["p50"]
    sigma = baseline[hour_of_week]["stddev"]
    z_score = (current_value - mu) / sigma if sigma > 0 else 0
    if z_score > 3.0:
        return Anomaly(
            model_id=model_id,
            metric=metric,
            severity="degraded" if z_score < 5.0 else "outage",
            value=current_value,
            baseline=mu,
        )
    return None

The 7-day hourly baseline accounts for regular traffic patterns: many providers are measurably slower during US business hours when GPU demand peaks. Without this, Monday morning traffic spikes would generate false alerts.
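For context, the hour-of-week baseline that detect_anomaly() indexes could be built along these lines; this is a sketch, and in production the values come from the TimescaleDB rollups rather than an in-memory sample list.

# Sketch of the hour-of-week baseline: p50 and stddev per hour slot (0..167)
# over the trailing 7 days, keyed the same way detect_anomaly() indexes it.
import statistics
from collections import defaultdict
from datetime import datetime

def build_baseline(samples: list[tuple[datetime, float]]) -> dict[int, dict[str, float]]:
    by_hour: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        by_hour[ts.weekday() * 24 + ts.hour].append(value)
    return {
        hour: {"p50": statistics.median(values), "stddev": statistics.pstdev(values)}
        for hour, values in by_hour.items()
    }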
Incident Lifecycle
HEALTHY ──[probe_success_rate < 0.7 for 2min]──► INVESTIGATING
                                                       │
                                        [success_rate < 0.3 for 5min]
                                                       │
                                                       ▼
OUTAGE ──[success_rate > 0.9 for 10min]──► RECOVERING
                                                │
                                 [success_rate > 0.99 for 5min]
                                                │
                                                ▼
                                            RESOLVED
State transitions require sustained evidence: single bad probes don't trigger incidents, and an outage is only declared after at least 5 minutes of sustained failures.
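A minimal sketch of that hysteresis, with state names and thresholds taken from the diagram above; the success-rate helper is hypothetical.

# Minimal sketch of the incident state machine above. success_rate_over(minutes)
# is a hypothetical helper returning the probe success rate for that window.
from typing import Callable

def next_state(state: str, success_rate_over: Callable[[int], float]) -> str:
    if state == "HEALTHY" and success_rate_over(2) < 0.7:
        return "INVESTIGATING"
    if state == "INVESTIGATING" and success_rate_over(5) < 0.3:
        return "OUTAGE"
    if state == "OUTAGE" and success_rate_over(10) > 0.9:
        return "RECOVERING"
    if state == "RECOVERING" and success_rate_over(5) > 0.99:
        return "RESOLVED"
    return state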
Public API
LLM Status exposes a public REST API (no auth required for reads):
# Current status of all providers
GET https://llmstatus.io/api/v1/status
# Historical uptime for a specific model (last 90 days)
GET https://llmstatus.io/api/v1/history?provider=anthropic&model=claude-3-5-sonnet
# Active incidents
GET https://llmstatus.io/api/v1/incidents?status=ongoing
# Subscribe to real-time updates
WebSocket: wss://llmstatus.io/api/v1/ws

The WebSocket endpoint pushes JSON events whenever a status changes. The public dashboard at llmstatus.io is entirely driven by this WebSocket connection, with no polling.
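For example, a quick status check from Python takes a few lines; the response fields used here are assumptions about the payload shape, not a documented schema.

# Quick status check against the public API. The response fields used here
# ("providers", "name", "status") are assumptions about the payload shape.
import requests

resp = requests.get("https://llmstatus.io/api/v1/status", timeout=10)
resp.raise_for_status()
for provider in resp.json().get("providers", []):
    print(provider.get("name"), provider.get("status"))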
Infrastructure
LLM Status runs on a deliberately small footprint:
| Component | Spec |
|---|---|
| API / Web | 2× 4-core VMs (active-active, load balanced) |
| TimescaleDB | 1× 8-core VM, 256GB NVMe, daily backup to R2 |
| Redis | 1× 4-core VM (Streams + cache) |
| Prober Workers | 20× 1-core VMs, one per monitoring region |
Total infrastructure cost: approximately $800/month. We've kept it minimal intentionally; LLM Status is an open-source project and we want the self-hosting option to be accessible to small teams.
Open Source
Everything that runs LLM Status is open source at github.com/llmstatus/llmstatus:
- Core prober (Go): probe execution, result ingest
- Anomaly detector (Python): baseline computation, incident management
- API server (Rust): REST endpoints, WebSocket hub
- Dashboard (Next.js): the UI you see at llmstatus.io
- Helm charts: for self-hosted Kubernetes deployments
- Probe definitions: all 200+ provider/model probe configurations (YAML)
Contributions welcome, especially new provider probe definitions.
Next: LLM Status: From Side Project to Open-Source Infrastructure & 2026 Roadmap