FastSox Architecture Deep Dive: How We Built a Sub-20ms VPN
When we started FastSox, most VPNs were adding 40–80ms of round-trip overhead even on a good day. We set an internal bar of under 20ms added latency for 90% of connections and built the architecture backwards from that constraint. This post explains every major layer of how we got there.
System Overview
FastSox has three independently deployed tiers:
```
Client Apps (iOS, Android, macOS, Windows, Linux)
          │
          ▼
Smart Connect Service ──── Telemetry & ML Store
          │
          ▼
Edge Nodes ─────────────── Control Plane API
(WireGuard + eBPF)
          │
          ▼
Destination
```
The Smart Connect Service makes routing decisions. Edge Nodes carry actual traffic. The Control Plane API manages configuration, keys, and health state.
These three components can scale and fail independently, which was a non-negotiable design constraint from day one.
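To make that split concrete, here is a minimal sketch of the kind of exchange a client might have with the Smart Connect Service. The field names are illustrative stand-ins, not our actual API.

```python
from dataclasses import dataclass

# Illustrative request/response shapes only -- these field names are
# hypothetical, not the production Smart Connect API.
@dataclass
class ConnectRequest:
    client_public_key: str    # WireGuard public key already registered with the Control Plane
    client_asn: int           # derived server-side from the source address
    app_category_hint: str    # e.g. "gaming", "streaming", "work"

@dataclass
class ConnectResponse:
    edge_endpoint: str        # "host:port" of the selected edge node
    edge_public_key: str      # peer key the client adds to its WireGuard config
    session_ttl_seconds: int  # how long the allowed-IP entry stays provisioned
```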
The Data Plane: WireGuard + eBPF
Why WireGuard
We evaluated OpenVPN, IKEv2/IPSec, and WireGuard before settling on WireGuard as our default protocol. The decision was straightforward:
| Protocol | Handshake | Crypto | Kernel LOC |
|---|---|---|---|
| OpenVPN | TLS 1.3 | AES-GCM | ~100k |
| IKEv2 | ISAKMP | AES-GCM | ~80k |
| WireGuard | Noise_IK | ChaCha20-Poly1305 | ~4k |
Fewer lines of kernel code means a smaller attack surface and faster security audits. The Noise_IK handshake completes in a single round trip, which is why our connection establishment time dropped from 3.2s (OpenVPN) to under 900ms.
eBPF for Per-Flow Telemetry
WireGuard gives us encrypted tunnels, but it doesn't tell us much about what's flowing through them at a per-connection level. We use eBPF programs attached to the WireGuard interface to collect:
- Per-flow RTT estimates (via TCP timestamp options)
- Packet loss rates (sequence gap analysis)
- Byte counts per destination ASN
This data feeds the ML routing engine every 5 seconds without any userspace overhead.
```c
// Simplified eBPF probe: attaches via XDP ingress on wg0.
// get_tsecr(), update_flow_rtt(), and the flow_key construction are
// helpers/maps defined elsewhere in the full program.
SEC("xdp")
int measure_rtt(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    struct iphdr *ip = data + sizeof(*eth);
    struct tcphdr *tcp = data + sizeof(*eth) + sizeof(*ip);

    // Bounds check so the verifier accepts the header accesses
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_TCP && tcp->syn && tcp->ack) {
        // SYN-ACK received: compute RTT from the TCP timestamp echo reply (TSecr)
        u32 rtt_us = bpf_ktime_get_ns() / 1000 - get_tsecr(tcp);
        update_flow_rtt(&flow_key, rtt_us);
    }
    return XDP_PASS;
}
```

The AI Routing Engine
What "Smart Connect" Actually Does
Smart Connect is not a neural network. Early prototypes used a deep reinforcement learning agent, but it was too hard to explain, too slow to update, and regularly made decisions that operators couldn't reason about. We replaced it with a gradient-boosted decision tree trained on 90 days of telemetry.
Features the model uses:
```python
FEATURES = [
"client_asn", # Your ISP
"client_country",
"destination_asn", # Target service's ISP
"destination_country",
"hour_of_day", # Congestion patterns differ by time
"day_of_week",
"node_rtt_p50", # 50th percentile RTT to candidate node
"node_rtt_p95",
"node_packet_loss_rate",
"node_cpu_utilization",
"node_active_sessions",
"protocol", # WireGuard vs IKEv2 vs OpenVPN
"application_category", # Inferred from SNI: streaming / gaming / work
]
```

Target: minimise rtt_p95 weighted by application_category (gaming gets 2× weight because jitter matters more for UDP games than for HTTP).
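At decision time, one way this can look is scoring every candidate edge node with the model and picking the lowest predicted p95. The sketch below assumes a scikit-learn-style `predict()` interface, pre-encoded categorical features, and per-node telemetry dicts; all of that is illustrative, since the production path is the Rust service described next.

```python
# Node-selection sketch -- the model interface, feature merge, and candidate
# node dicts are assumptions, not the production Rust implementation.
def pick_edge_node(model, session_features: dict, candidate_nodes: list[dict]) -> dict:
    """Predict p95 RTT for each candidate edge node and return the best one."""
    best_node, best_rtt = None, float("inf")
    for node in candidate_nodes:
        row = {**session_features, **node}   # client-side + per-node telemetry
        # Categorical values (ASNs, countries, protocol) assumed encoded upstream
        rtt = model.predict([[row[f] for f in FEATURES]])[0]
        if rtt < best_rtt:
            best_node, best_rtt = node, rtt
    return best_node
```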
Inference Latency
The model must respond before the connection completes its first handshake, which means under 300ms. We serve it as a Rust binary with the model serialized using bincode. Median inference time in production: 4ms.
Online Learning
Every completed session generates a training example. We retrain weekly on a rolling 90-day window. Model accuracy (correct node selection vs. retrospective optimal) sits at 76%, which sounds modest but represents a 2× improvement over nearest-geography selection.
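For illustration, here is roughly what that weekly retrain could look like, sketched with pandas and scikit-learn's histogram-based GBDT. The library choice and the `observed_rtt_p95_ms` label column are stand-ins, not our production pipeline.

```python
# Weekly retrain sketch -- library and column names are stand-ins; categorical
# features (ASNs, countries, protocol) are assumed to be encoded upstream.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

def weekly_retrain(sessions: pd.DataFrame) -> HistGradientBoostingRegressor:
    # Rolling 90-day window of completed sessions
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=90)
    window = sessions[sessions["completed_at"] >= cutoff]

    # Gaming gets 2x weight in the objective: jitter hurts UDP games more than HTTP
    weights = window["application_category"].map(
        lambda c: 2.0 if c == "gaming" else 1.0
    )

    model = HistGradientBoostingRegressor()
    model.fit(window[FEATURES], window["observed_rtt_p95_ms"], sample_weight=weights)
    return model
```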
Key Management
Each client generates a WireGuard keypair locally. The public key is registered with the Control Plane. Edge Nodes never see private keys. When a session ends, the allowed-IP entry is removed from the edge node within 60 seconds.
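A minimal client-side sketch of that flow follows; the registration endpoint and payload are hypothetical, but the important property is that only the public key ever leaves the device.

```python
# Client-side key generation sketch -- the registration route and JSON payload
# are hypothetical; only the "private key never leaves the device" flow is real.
import base64
import requests
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

def register_device(control_plane_url: str, device_id: str) -> X25519PrivateKey:
    private_key = X25519PrivateKey.generate()   # stays on the device
    public_bytes = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
    # Only the base64-encoded public key is sent to the Control Plane
    requests.post(
        f"{control_plane_url}/v1/devices/{device_id}/wireguard-key",  # hypothetical route
        json={"public_key": base64.b64encode(public_bytes).decode()},
        timeout=5,
    )
    return private_key
```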
For multi-hop sessions:
```
Client --[key_A]--> Node 1 --[key_B]--> Node 2 --> Destination
```
The client generates a separate ephemeral keypair for each hop. Node 1 sees the client's IP but not the destination. Node 2 sees the destination but not the client's IP. Neither node sees both.
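The per-hop key generation itself is simple; a sketch, with hypothetical hop names:

```python
# Multi-hop sketch -- hop names are hypothetical; the point is one ephemeral
# keypair per hop, so no single node can link client and destination.
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

def build_hop_keys(hops: list[str]) -> dict[str, X25519PrivateKey]:
    """Generate a distinct ephemeral keypair for each hop in the chain."""
    return {hop: X25519PrivateKey.generate() for hop in hops}

hop_keys = build_hop_keys(["node1.example", "node2.example"])
# Node 1 only ever learns hop_keys["node1.example"].public_key(); Node 2 likewise.
```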
Observability Stack
Production telemetry runs on:
- Metrics: Prometheus + VictoriaMetrics (long-term storage)
- Tracing: OpenTelemetry → Tempo
- Logs: Vector → ClickHouse (structured logs, queryable in seconds)
- Alerting: Alertmanager with PagerDuty escalation
The dashboard that oncall watches lives at grafana.internal/d/fastsox-edge. Key SLIs:
| SLI | Target | Current |
|---|---|---|
| Connection success rate | ≥ 99.5% | 99.73% |
| p95 added latency | ≤ 20ms | 17ms |
| Session establishment time | ≤ 1s | 780ms |
| Auth API availability | ≥ 99.9% | 99.96% |
Lessons from Three Years of Production
1. Kernel upgrades are your biggest operational risk. WireGuard has been in-kernel since Linux 5.6, but a minor kernel update can change scheduling behaviour and push your p95 latency out by 5ms. Canary one node per AZ before rolling updates.
2. Mobile clients need a different reconnection strategy than desktop. iOS will kill background sockets aggressively. Our iOS client maintains a keep-alive ping every 25 seconds and re-establishes the WireGuard handshake proactively when the foreground timer detects a network-type change.
3. CGNAT is everywhere. About 38% of our mobile users sit behind carrier-grade NAT. WireGuard's UDP makes this workable (no server-initiated packets needed), but you must handle the case where a user's public IP changes mid-session without any signal to the server.
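For illustration, one way an edge node could watch for that kind of silent roam is to poll `wg show` and diff peer endpoints. This is just a sketch to make the behaviour visible, not our production mechanism.

```python
# Roaming-detection sketch -- polling `wg show <iface> endpoints` and diffing
# the results; the interval and logging are illustrative.
import subprocess
import time

def watch_endpoints(interface: str = "wg0", interval: int = 10) -> None:
    last_seen: dict[str, str] = {}
    while True:
        out = subprocess.run(
            ["wg", "show", interface, "endpoints"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.strip().splitlines():
            pubkey, endpoint = line.split("\t")
            if last_seen.get(pubkey) not in (None, endpoint):
                # The same authenticated peer now arrives from a new address, e.g. a CGNAT rebind
                print(f"peer {pubkey[:8]}... roamed to {endpoint}")
            last_seen[pubkey] = endpoint
        time.sleep(interval)
```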
Next in this series: FastSox 2026 Roadmap – QUIC transport, post-quantum keys, and the mesh architecture