Engineering

LLM Status Architecture: Monitoring 40+ AI Providers in Real Time

How LLM Status probes, aggregates, and serves uptime data for every major AI model provider: the prober design, anomaly detection, and the open-source infrastructure behind llmstatus.io.

Liang Bo • 5 min read
Tags: LLM Status, Architecture, Monitoring, Open Source, AI Infrastructure

LLM Status answers one question: "Is this AI model working right now?" The question is simple. Building a reliable answer for 40+ providers, 200+ model endpoints, and an international audience requires more infrastructure than you might expect. This post explains the full technical design.

What We're Actually Measuring

Most status pages monitor HTTP availability: is the endpoint returning 200? That's necessary but not sufficient for AI model APIs. A provider can return 200 while:

  • Producing output 10× slower than normal (GPU degradation)
  • Truncating responses mid-generation (OOM on the inference cluster)
  • Returning garbage tokens (model serving misconfiguration)
  • Working fine for short prompts but timing out on long ones

LLM Status runs semantic probes: real inference requests that measure what users actually care about.

Probe Design

Each provider gets a probe definition that specifies:

# Example probe for Anthropic Claude 3.5 Sonnet
provider: anthropic
model: claude-3-5-sonnet-20241022
probe_type: chat
request:
  system: "You are a calculator."
  messages:
    - role: user
      content: "What is 17 ร— 23? Reply with only the number."
expected:
  response_contains: "391"
  min_tokens: 1
  max_tokens: 10
  timeout_ms: 8000
metrics:
  - ttfb_ms        # Time to first token
  - total_ms       # Full response time
  - tokens_output
sla:
  ttfb_p95_ms: 2000
  total_p95_ms: 5000

The response_contains check catches semantic failures: if the model returns "I cannot perform calculations" instead of a number, the probe fails even if the HTTP status is 200.
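
Before a result is written anywhere, it is validated against the expected block. The production prober is written in Go; the sketch below is an illustrative Python version of that check, and the ProbeResult shape is an assumption, not the real type.

# Illustrative check of a probe result against its "expected" block.
# The ProbeResult shape is an assumption; field names mirror the YAML above.
from dataclasses import dataclass

@dataclass
class ProbeResult:
    status_code: int
    text: str        # concatenated model output
    tokens_out: int
    total_ms: int

def check_probe(result: ProbeResult, expected: dict) -> bool:
    if result.status_code != 200:
        return False
    # HTTP 200 alone is not enough: the output must contain the expected answer.
    if expected["response_contains"] not in result.text:
        return False
    if not (expected["min_tokens"] <= result.tokens_out <= expected["max_tokens"]):
        return False
    # Anything slower than the probe timeout also counts as a failure.
    return result.total_ms <= expected["timeout_ms"]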

Probe Frequency

We run probes every 60 seconds per model. With 200+ models, that's 200+ outbound inference requests every minute. To avoid being rate-limited, we distribute probes across:

  1. A pool of 20 prober workers in different geographic regions (US-East, US-West, EU-West, AP-Southeast)
  2. Dedicated API keys for monitoring (separate from production keys)
  3. Jittered probe timing (±15 seconds random offset) to avoid a thundering herd

Each prober worker runs independently and writes results to a central time-series store.
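
A rough sketch of that scheduling loop, using the 60-second interval and ±15-second jitter described above. The production prober is Go; the names and structure here are illustrative only.

# Illustrative jittered probe loop: one task per model per prober worker.
import asyncio
import random

PROBE_INTERVAL_S = 60
JITTER_S = 15

async def probe_loop(model_id: str, run_probe, submit_result):
    while True:
        # Random offset each cycle keeps 200+ models from firing at the same instant.
        await asyncio.sleep(PROBE_INTERVAL_S + random.uniform(-JITTER_S, JITTER_S))
        result = await run_probe(model_id)
        await submit_result(model_id, result)  # ship to the central ingest API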

Data Pipeline

Prober Workers (20 regions)
       │
       │ HTTP + gRPC
       ▼
  Ingest API (Rust, axum)
       │
       ├──► TimescaleDB (raw probe results, 30-day retention)
       │
       └──► Redis Streams (real-time fan-out)
               │
               ├──► Anomaly Detector (Python, runs every 30s)
               │
               └──► WebSocket Hub (pushes to connected browsers)
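
The anomaly detector and the WebSocket hub both consume the same Redis Stream. A minimal consumer sketch with redis-py; the stream name probe_results and the field layout are assumptions for illustration.

# Minimal Redis Streams consumer sketch (redis-py).
# Stream name and field layout are assumptions.
import redis

def handle_probe_result(fields: dict) -> None:
    # Placeholder: in the real pipeline this feeds the anomaly detector or WS hub.
    print(fields)

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
last_id = "$"  # only read entries added after we connect

while True:
    # Block up to 5 seconds waiting for new probe results from the ingest API.
    for _stream, messages in r.xread({"probe_results": last_id}, count=100, block=5000):
        for msg_id, fields in messages:
            last_id = msg_id
            handle_probe_result(fields)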

TimescaleDB Schema

Raw probe results are stored as time-series data with hypertables partitioned by day:

CREATE TABLE probe_results (
  ts          TIMESTAMPTZ NOT NULL,
  provider    TEXT NOT NULL,
  model       TEXT NOT NULL,
  region      TEXT NOT NULL,
  ttfb_ms     INTEGER,
  total_ms    INTEGER,
  tokens_out  INTEGER,
  success     BOOLEAN NOT NULL,
  error_code  TEXT,
  error_msg   TEXT
);
 
SELECT create_hypertable('probe_results', 'ts');
CREATE INDEX ON probe_results (provider, model, ts DESC);

Continuous aggregates pre-compute hourly and daily rollups to keep the public dashboard responsive:

CREATE MATERIALIZED VIEW probe_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', ts) AS bucket,
  provider,
  model,
  region,
  AVG(ttfb_ms) FILTER (WHERE success) AS avg_ttfb_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ttfb_ms)
    FILTER (WHERE success) AS p95_ttfb_ms,
  COUNT(*) FILTER (WHERE success)::float / COUNT(*) AS success_rate
FROM probe_results
GROUP BY bucket, provider, model, region;
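
The dashboard-facing API can then read straight from the rollup instead of scanning raw probe rows. A hedged sketch of that read path with psycopg; the connection string is a placeholder, and the columns match the view above.

# Illustrative read path against the hourly rollup (psycopg 3).
# The connection string is a placeholder.
import psycopg

QUERY = """
    SELECT bucket, region, p95_ttfb_ms, success_rate
    FROM probe_hourly
    WHERE provider = %s AND model = %s
      AND bucket > now() - interval '7 days'
    ORDER BY bucket
"""

with psycopg.connect("postgresql://localhost/llmstatus") as conn:
    rows = conn.execute(QUERY, ("anthropic", "claude-3-5-sonnet-20241022")).fetchall()
    for bucket, region, p95_ttfb, success_rate in rows:
        print(bucket, region, p95_ttfb, f"{success_rate:.3f}")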

Anomaly Detection

The anomaly detector runs every 30 seconds against the last 5 minutes of data. It uses a simple but effective algorithm: 3-sigma deviation from the trailing 7-day hourly baseline.

def detect_anomaly(model_id: str, metric: str, current_value: float) -> Anomaly | None:
    baseline = get_baseline(model_id, metric)   # 7-day hourly p50 and stddev
    hour_of_week = current_hour_of_week()
 
    mu    = baseline[hour_of_week]["p50"]
    sigma = baseline[hour_of_week]["stddev"]
 
    z_score = (current_value - mu) / sigma if sigma > 0 else 0
 
    if z_score > 3.0:
        return Anomaly(
            model_id=model_id,
            metric=metric,
            severity="degraded" if z_score < 5.0 else "outage",
            value=current_value,
            baseline=mu,
        )
    return None

The 7-day hourly baseline accounts for regular traffic patterns: many providers are measurably slower during US business hours when GPU demand peaks. Without this, Monday morning traffic spikes would generate false alerts.
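
The get_baseline helper isn't shown above. A simplified sketch of how the hour-of-week baseline could be built from the trailing week of raw per-minute probe values; the production detector may store and cache this differently.

# Simplified hour-of-week baseline builder used by detect_anomaly (sketch).
# probe_rows: (timestamp, value) pairs from the trailing 7 days of raw probes,
# i.e. roughly 60 per-minute samples per hour-of-week slot.
from collections import defaultdict
from statistics import median, pstdev

def build_baseline(probe_rows):
    by_hour_of_week = defaultdict(list)
    for ts, value in probe_rows:
        # 0 = Monday 00:00 ... 167 = Sunday 23:00
        by_hour_of_week[ts.weekday() * 24 + ts.hour].append(value)
    return {
        how: {"p50": median(vals), "stddev": pstdev(vals)}
        for how, vals in by_hour_of_week.items()
    }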

Incident Lifecycle

HEALTHY ──[probe_success_rate < 0.7 for 2min]──► INVESTIGATING
            │
    [success_rate < 0.3 for 5min]
            │
            ▼
         OUTAGE ──[success_rate > 0.9 for 10min]──► RECOVERING
                                                          │
                              [success_rate > 0.99 for 5min]
                                                          │
                                                          ▼
                                                       RESOLVED

State transitions require sustained evidence; single bad probes don't trigger incidents, and an outage is only declared after at least 5 minutes of sustained failures.
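
A compact way to encode those transitions is shown below; the thresholds are copied from the diagram, and the structure is an illustrative sketch rather than the production incident manager.

# Illustrative encoding of the incident state machine above.
# Each rule: (current state, condition on success rate, sustained seconds, next state).
TRANSITIONS = [
    ("HEALTHY",       lambda r: r < 0.7,  120, "INVESTIGATING"),
    ("INVESTIGATING", lambda r: r < 0.3,  300, "OUTAGE"),
    ("OUTAGE",        lambda r: r > 0.9,  600, "RECOVERING"),
    ("RECOVERING",    lambda r: r > 0.99, 300, "RESOLVED"),
]

def next_state(state: str, success_rate: float, sustained_s: int) -> str:
    for from_state, condition, min_duration_s, to_state in TRANSITIONS:
        if state == from_state and condition(success_rate) and sustained_s >= min_duration_s:
            return to_state
    return state  # single bad probes never move the state machine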

Public API

LLM Status exposes a public REST API (no auth required for reads):

# Current status of all providers
GET https://llmstatus.io/api/v1/status
 
# Historical uptime for a specific model (last 90 days)
GET https://llmstatus.io/api/v1/history?provider=anthropic&model=claude-3-5-sonnet
 
# Active incidents
GET https://llmstatus.io/api/v1/incidents?status=ongoing
 
# Subscribe to real-time updates
WebSocket: wss://llmstatus.io/api/v1/ws

The WebSocket endpoint pushes JSON events whenever a status changes. The public dashboard at llmstatus.io is entirely driven by this WebSocket connection; no polling.
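
For example, a client can fetch one REST snapshot and then listen on the WebSocket for changes. The sketch below uses httpx and the websockets library; the JSON event shape in the comment is an assumption, not the documented schema.

# Sketch of an API client: one REST snapshot, then live pushed updates.
import asyncio
import json

import httpx
import websockets

async def watch_status():
    # Initial snapshot of all providers.
    snapshot = httpx.get("https://llmstatus.io/api/v1/status").json()
    print("snapshot:", snapshot)

    # Then subscribe to status-change events instead of polling.
    async with websockets.connect("wss://llmstatus.io/api/v1/ws") as ws:
        async for raw in ws:
            event = json.loads(raw)
            print(event)  # assumed shape: {"provider": ..., "model": ..., "status": ...}

asyncio.run(watch_status())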

Infrastructure

LLM Status runs on a deliberately small footprint:

Component         Spec
API / Web         2× 4-core VMs (active-active, load balanced)
TimescaleDB       1× 8-core VM, 256GB NVMe, daily backup to R2
Redis             1× 4-core VM (Streams + cache)
Prober Workers    20× 1-core VMs, one per monitoring region

Total infrastructure cost: approximately $800/month. We've kept it minimal intentionally; LLM Status is an open-source project and we want the self-hosting option to be accessible to small teams.

Open Source

Everything that runs LLM Status is open source at github.com/llmstatus/llmstatus:

  • Core prober (Go): probe execution, result ingest
  • Anomaly detector (Python): baseline computation, incident management
  • API server (Rust): REST endpoints, WebSocket hub
  • Dashboard (Next.js): the UI you see at llmstatus.io
  • Helm charts: for self-hosted Kubernetes deployments
  • Probe definitions: all 200+ provider/model probe configurations (YAML)

Contributions welcome, especially new provider probe definitions.


