Engineering

LLM Status Architecture: Monitoring 40+ AI Providers in Real Time

How LLM Status probes, aggregates, and serves uptime data for every major AI model provider: the prober design, anomaly detection, and the open-source infrastructure behind llmstatus.io.

Liang Bo • 5 min read
Tags: LLM Status, Architecture, Monitoring, Open Source, AI Infrastructure

LLM Status answers one question: "Is this AI model working right now?" The question is simple. Building a reliable answer for 40+ providers, 200+ model endpoints, and an international audience requires more infrastructure than you might expect. This post explains the full technical design.

What We're Actually Measuring

Most status pages monitor HTTP availability: is the endpoint returning 200? That's necessary but not sufficient for AI model APIs. A provider can return 200 while:

  • Producing output 10× slower than normal (GPU degradation)
  • Truncating responses mid-generation (OOM on the inference cluster)
  • Returning garbage tokens (model serving misconfiguration)
  • Working fine for short prompts but timing out on long ones

LLM Status runs semantic probes: real inference requests that measure what users actually care about.

Probe Design

Each provider gets a probe definition that specifies:

# Example probe for Anthropic Claude 3.5 Sonnet
provider: anthropic
model: claude-3-5-sonnet-20241022
probe_type: chat
request:
  system: "You are a calculator."
  messages:
    - role: user
      content: "What is 17 ร— 23? Reply with only the number."
expected:
  response_contains: "391"
  min_tokens: 1
  max_tokens: 10
  timeout_ms: 8000
metrics:
  - ttfb_ms        # Time to first token
  - total_ms       # Full response time
  - tokens_output
sla:
  ttfb_p95_ms: 2000
  total_p95_ms: 5000

The response_contains check catches semantic failures: if the model returns "I cannot perform calculations" instead of a number, the probe fails even if the HTTP status is 200.
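
Before a result is written anywhere, it is validated against the expected block. The production prober is written in Go; the sketch below is an illustrative Python version of that check, and the ProbeResult shape is an assumption, not the real type.

# Illustrative check of a probe result against its "expected" block.
# The ProbeResult shape is an assumption; field names mirror the YAML above.
from dataclasses import dataclass

@dataclass
class ProbeResult:
    status_code: int
    text: str        # concatenated model output
    tokens_out: int
    total_ms: int

def check_probe(result: ProbeResult, expected: dict) -> bool:
    if result.status_code != 200:
        return False
    # HTTP 200 alone is not enough: the output must contain the expected answer.
    if expected["response_contains"] not in result.text:
        return False
    if not (expected["min_tokens"] <= result.tokens_out <= expected["max_tokens"]):
        return False
    # Anything slower than the probe timeout also counts as a failure.
    return result.total_ms <= expected["timeout_ms"]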

Probe Frequency

We run probes every 60 seconds per model. With 200+ models, that's 200+ outbound inference requests every minute. To avoid being rate-limited, we distribute probes across:

  1. A pool of 20 prober workers in different geographic regions (US-East, US-West, EU-West, AP-Southeast)
  2. Dedicated API keys for monitoring (separate from production keys)
  3. Jittered probe timing (±15 seconds random offset) to avoid a thundering herd

Each prober worker runs independently and writes results to a central time-series store.
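
A rough sketch of that scheduling loop, using the 60-second interval and ±15-second jitter described above. The production prober is Go; the names and structure here are illustrative only.

# Illustrative jittered probe loop: one task per model per prober worker.
import asyncio
import random

PROBE_INTERVAL_S = 60
JITTER_S = 15

async def probe_loop(model_id: str, run_probe, submit_result):
    while True:
        # Random offset each cycle keeps 200+ models from firing at the same instant.
        await asyncio.sleep(PROBE_INTERVAL_S + random.uniform(-JITTER_S, JITTER_S))
        result = await run_probe(model_id)
        await submit_result(model_id, result)  # ship to the central ingest API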

Data Pipeline

Prober Workers (20 regions)
       │
       │ HTTP + gRPC
       ▼
  Ingest API (Rust, axum)
       │
       ├──► TimescaleDB (raw probe results, 30-day retention)
       │
       └──► Redis Streams (real-time fan-out)
               │
               ├──► Anomaly Detector (Python, runs every 30s)
               │
               └──► WebSocket Hub (pushes to connected browsers)
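
The anomaly detector and the WebSocket hub both consume the same Redis Stream. A minimal consumer sketch with redis-py; the stream name probe_results and the field layout are assumptions for illustration.

# Minimal Redis Streams consumer sketch (redis-py).
# Stream name and field layout are assumptions.
import redis

def handle_probe_result(fields: dict) -> None:
    # Placeholder: in the real pipeline this feeds the anomaly detector or WS hub.
    print(fields)

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
last_id = "$"  # only read entries added after we connect

while True:
    # Block up to 5 seconds waiting for new probe results from the ingest API.
    for _stream, messages in r.xread({"probe_results": last_id}, count=100, block=5000):
        for msg_id, fields in messages:
            last_id = msg_id
            handle_probe_result(fields)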

TimescaleDB Schema

Raw probe results are stored as time-series data with hypertables partitioned by day:

CREATE TABLE probe_results (
  ts          TIMESTAMPTZ NOT NULL,
  provider    TEXT NOT NULL,
  model       TEXT NOT NULL,
  region      TEXT NOT NULL,
  ttfb_ms     INTEGER,
  total_ms    INTEGER,
  tokens_out  INTEGER,
  success     BOOLEAN NOT NULL,
  error_code  TEXT,
  error_msg   TEXT
);
 
SELECT create_hypertable('probe_results', 'ts');
CREATE INDEX ON probe_results (provider, model, ts DESC);

Continuous aggregates pre-compute hourly and daily rollups to keep the public dashboard responsive:

CREATE MATERIALIZED VIEW probe_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', ts) AS bucket,
  provider,
  model,
  region,
  AVG(ttfb_ms) FILTER (WHERE success) AS avg_ttfb_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ttfb_ms)
    FILTER (WHERE success) AS p95_ttfb_ms,
  COUNT(*) FILTER (WHERE success)::float / COUNT(*) AS success_rate
FROM probe_results
GROUP BY bucket, provider, model, region;
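
The dashboard-facing API can then read straight from the rollup instead of scanning raw probe rows. A hedged sketch of that read path with psycopg; the connection string is a placeholder, and the columns match the view above.

# Illustrative read path against the hourly rollup (psycopg 3).
# The connection string is a placeholder.
import psycopg

QUERY = """
    SELECT bucket, region, p95_ttfb_ms, success_rate
    FROM probe_hourly
    WHERE provider = %s AND model = %s
      AND bucket > now() - interval '7 days'
    ORDER BY bucket
"""

with psycopg.connect("postgresql://localhost/llmstatus") as conn:
    rows = conn.execute(QUERY, ("anthropic", "claude-3-5-sonnet-20241022")).fetchall()
    for bucket, region, p95_ttfb, success_rate in rows:
        print(bucket, region, p95_ttfb, f"{success_rate:.3f}")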

Anomaly Detection

The anomaly detector runs every 30 seconds against the last 5 minutes of data. It uses a simple but effective algorithm: 3-sigma deviation from the trailing 7-day hourly baseline.

def detect_anomaly(model_id: str, metric: str, current_value: float) -> Anomaly | None:
    baseline = get_baseline(model_id, metric)   # 7-day hourly p50 and stddev
    hour_of_week = current_hour_of_week()
 
    mu    = baseline[hour_of_week]["p50"]
    sigma = baseline[hour_of_week]["stddev"]
 
    z_score = (current_value - mu) / sigma if sigma > 0 else 0
 
    if z_score > 3.0:
        return Anomaly(
            model_id=model_id,
            metric=metric,
            severity="degraded" if z_score < 5.0 else "outage",
            value=current_value,
            baseline=mu,
        )
    return None

The 7-day hourly baseline accounts for regular traffic patterns: many providers are measurably slower during US business hours when GPU demand peaks. Without this, Monday morning traffic spikes would generate false alerts.
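
The get_baseline helper isn't shown above. A simplified sketch of how the hour-of-week baseline could be built from the trailing week of raw per-minute probe values; the production detector may store and cache this differently.

# Simplified hour-of-week baseline builder used by detect_anomaly (sketch).
# probe_rows: (timestamp, value) pairs from the trailing 7 days of raw probes,
# i.e. roughly 60 per-minute samples per hour-of-week slot.
from collections import defaultdict
from statistics import median, pstdev

def build_baseline(probe_rows):
    by_hour_of_week = defaultdict(list)
    for ts, value in probe_rows:
        # 0 = Monday 00:00 ... 167 = Sunday 23:00
        by_hour_of_week[ts.weekday() * 24 + ts.hour].append(value)
    return {
        how: {"p50": median(vals), "stddev": pstdev(vals)}
        for how, vals in by_hour_of_week.items()
    }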

Incident Lifecycle

HEALTHY ──[probe_success_rate < 0.7 for 2min]──► INVESTIGATING
            │
    [success_rate < 0.3 for 5min]
            │
            ▼
         OUTAGE ──[success_rate > 0.9 for 10min]──► RECOVERING
                                                          │
                              [success_rate > 0.99 for 5min]
                                                          │
                                                          ▼
                                                       RESOLVED

State transitions require sustained evidence; single bad probes don't trigger incidents, and an outage is only declared after at least 5 minutes of sustained failures.
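
A compact way to encode those transitions is shown below; the thresholds are copied from the diagram, and the structure is an illustrative sketch rather than the production incident manager.

# Illustrative encoding of the incident state machine above.
# Each rule: (current state, condition on success rate, sustained seconds, next state).
TRANSITIONS = [
    ("HEALTHY",       lambda r: r < 0.7,  120, "INVESTIGATING"),
    ("INVESTIGATING", lambda r: r < 0.3,  300, "OUTAGE"),
    ("OUTAGE",        lambda r: r > 0.9,  600, "RECOVERING"),
    ("RECOVERING",    lambda r: r > 0.99, 300, "RESOLVED"),
]

def next_state(state: str, success_rate: float, sustained_s: int) -> str:
    for from_state, condition, min_duration_s, to_state in TRANSITIONS:
        if state == from_state and condition(success_rate) and sustained_s >= min_duration_s:
            return to_state
    return state  # single bad probes never move the state machine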

Public API

LLM Status exposes a public REST API (no auth required for reads):

# Current status of all providers
GET https://llmstatus.io/api/v1/status
 
# Historical uptime for a specific model (last 90 days)
GET https://llmstatus.io/api/v1/history?provider=anthropic&model=claude-3-5-sonnet
 
# Active incidents
GET https://llmstatus.io/api/v1/incidents?status=ongoing
 
# Subscribe to real-time updates
WebSocket: wss://llmstatus.io/api/v1/ws

The WebSocket endpoint pushes JSON events whenever a status changes. The public dashboard at llmstatus.io is entirely driven by this WebSocket connection; no polling.
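
For example, a client can fetch one REST snapshot and then listen on the WebSocket for changes. The sketch below uses httpx and the websockets library; the JSON event shape in the comment is an assumption, not the documented schema.

# Sketch of an API client: one REST snapshot, then live pushed updates.
import asyncio
import json

import httpx
import websockets

async def watch_status():
    # Initial snapshot of all providers.
    snapshot = httpx.get("https://llmstatus.io/api/v1/status").json()
    print("snapshot:", snapshot)

    # Then subscribe to status-change events instead of polling.
    async with websockets.connect("wss://llmstatus.io/api/v1/ws") as ws:
        async for raw in ws:
            event = json.loads(raw)
            print(event)  # assumed shape: {"provider": ..., "model": ..., "status": ...}

asyncio.run(watch_status())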

Infrastructure

LLM Status runs on a deliberately small footprint:

Component         Spec
API / Web         2× 4-core VMs (active-active, load balanced)
TimescaleDB       1× 8-core VM, 256GB NVMe, daily backup to R2
Redis             1× 4-core VM (Streams + cache)
Prober Workers    20× 1-core VMs, one per monitoring region

Total infrastructure cost: approximately $800/month. We've kept it minimal intentionally; LLM Status is an open-source project and we want the self-hosting option to be accessible to small teams.

Open Source

Everything that runs LLM Status is open source at github.com/llmstatus/llmstatus:

  • Core prober (Go): probe execution, result ingest
  • Anomaly detector (Python): baseline computation, incident management
  • API server (Rust): REST endpoints, WebSocket hub
  • Dashboard (Next.js): the UI you see at llmstatus.io
  • Helm charts: for self-hosted Kubernetes deployments
  • Probe definitions: all 200+ provider/model probe configurations (YAML)

Contributions welcome, especially new provider probe definitions.


