FastSox Architecture Deep Dive: How We Built a Sub-20ms VPN
When we started FastSox, most VPNs were adding 40–80ms of round-trip overhead even on a good day. We set an internal bar of under 20ms added latency for 90% of connections and built the architecture backwards from that constraint. This post explains every major layer of how we got there.
System Overview
FastSox has three independently deployed tiers:
```
Client Apps (iOS, Android, macOS, Windows, Linux)
          │
          ▼
Smart Connect Service ──── Telemetry & ML Store
          │
          ▼
Edge Nodes ─────────────── Control Plane API
(WireGuard + eBPF)
          │
          ▼
Destination
```
The Smart Connect Service makes routing decisions. Edge Nodes carry actual traffic. The Control Plane API manages configuration, keys, and health state.
These three components can scale and fail independently, which was a non-negotiable design constraint from day one.
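To make that split concrete, here is a minimal sketch of the kind of exchange a client might have with the Smart Connect Service. The field names are illustrative stand-ins, not our actual API.

```python
from dataclasses import dataclass

# Illustrative request/response shapes only -- these field names are
# hypothetical, not the production Smart Connect API.
@dataclass
class ConnectRequest:
    client_public_key: str    # WireGuard public key already registered with the Control Plane
    client_asn: int           # derived server-side from the source address
    app_category_hint: str    # e.g. "gaming", "streaming", "work"

@dataclass
class ConnectResponse:
    edge_endpoint: str        # "host:port" of the selected edge node
    edge_public_key: str      # peer key the client adds to its WireGuard config
    session_ttl_seconds: int  # how long the allowed-IP entry stays provisioned
```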
The Data Plane: WireGuard + eBPF
Why WireGuard
We evaluated OpenVPN, IKEv2/IPSec, and WireGuard before settling on WireGuard as our default protocol. The decision was straightforward:
| Protocol | Handshake | Crypto | Kernel LOC |
|---|---|---|---|
| OpenVPN | TLS 1.3 | AES-GCM | ~100k |
| IKEv2 | ISAKMP | AES-GCM | ~80k |
| WireGuard | Noise_IK | ChaCha20-Poly1305 | ~4k |
Fewer lines of kernel code means a smaller attack surface and faster security audits. The Noise_IK handshake completes in a single round trip, which is why our connection establishment time dropped from 3.2s (OpenVPN) to under 900ms.
eBPF for Per-Flow Telemetry
WireGuard gives us encrypted tunnels, but it doesn't tell us much about what's flowing through them at a per-connection level. We use eBPF programs attached to the WireGuard interface to collect:
- Per-flow RTT estimates (via TCP timestamp options)
- Packet loss rates (sequence gap analysis)
- Byte counts per destination ASN
This data feeds the ML routing engine every 5 seconds without any userspace overhead.
```c
// Simplified eBPF probe: attaches via XDP ingress on wg0.
// get_tsecr(), update_flow_rtt(), and the flow_key construction are
// helpers/maps defined elsewhere in the full program.
SEC("xdp")
int measure_rtt(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    struct iphdr *ip = data + sizeof(*eth);
    struct tcphdr *tcp = data + sizeof(*eth) + sizeof(*ip);

    // Bounds check so the verifier accepts the header accesses
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_TCP && tcp->syn && tcp->ack) {
        // SYN-ACK received: compute RTT from the TCP timestamp echo reply (TSecr)
        u32 rtt_us = bpf_ktime_get_ns() / 1000 - get_tsecr(tcp);
        update_flow_rtt(&flow_key, rtt_us);
    }
    return XDP_PASS;
}
```

The AI Routing Engine
What "Smart Connect" Actually Does
Smart Connect is not a neural network. Early prototypes used a deep reinforcement learning agent, but it was too hard to explain, too slow to update, and regularly made decisions that operators couldn't reason about. We replaced it with a gradient-boosted decision tree trained on 90 days of telemetry.
Features the model uses:
```python
FEATURES = [
"client_asn", # Your ISP
"client_country",
"destination_asn", # Target service's ISP
"destination_country",
"hour_of_day", # Congestion patterns differ by time
"day_of_week",
"node_rtt_p50", # 50th percentile RTT to candidate node
"node_rtt_p95",
"node_packet_loss_rate",
"node_cpu_utilization",
"node_active_sessions",
"protocol", # WireGuard vs IKEv2 vs OpenVPN
"application_category", # Inferred from SNI: streaming / gaming / work
]
```

Target: minimise rtt_p95 weighted by application_category (gaming gets 2× weight because jitter matters more for UDP games than for HTTP).
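At decision time, one way this can look is scoring every candidate edge node with the model and picking the lowest predicted p95. The sketch below assumes a scikit-learn-style `predict()` interface, pre-encoded categorical features, and per-node telemetry dicts; all of that is illustrative, since the production path is the Rust service described next.

```python
# Node-selection sketch -- the model interface, feature merge, and candidate
# node dicts are assumptions, not the production Rust implementation.
def pick_edge_node(model, session_features: dict, candidate_nodes: list[dict]) -> dict:
    """Predict p95 RTT for each candidate edge node and return the best one."""
    best_node, best_rtt = None, float("inf")
    for node in candidate_nodes:
        row = {**session_features, **node}   # client-side + per-node telemetry
        # Categorical values (ASNs, countries, protocol) assumed encoded upstream
        rtt = model.predict([[row[f] for f in FEATURES]])[0]
        if rtt < best_rtt:
            best_node, best_rtt = node, rtt
    return best_node
```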
Inference Latency
The model must respond before the connection completes its first handshake, which means under 300ms. We serve it as a Rust binary with the model serialized using bincode. Median inference time in production: 4ms.
Online Learning
Every completed session generates a training example. We retrain weekly on a rolling 90-day window. Model accuracy (correct node selection vs. retrospective optimal) sits at 76%, which sounds modest but represents a 2× improvement over nearest-geography selection.
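For illustration, here is roughly what that weekly retrain could look like, sketched with pandas and scikit-learn's histogram-based GBDT. The library choice and the `observed_rtt_p95_ms` label column are stand-ins, not our production pipeline.

```python
# Weekly retrain sketch -- library and column names are stand-ins; categorical
# features (ASNs, countries, protocol) are assumed to be encoded upstream.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

def weekly_retrain(sessions: pd.DataFrame) -> HistGradientBoostingRegressor:
    # Rolling 90-day window of completed sessions
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=90)
    window = sessions[sessions["completed_at"] >= cutoff]

    # Gaming gets 2x weight in the objective: jitter hurts UDP games more than HTTP
    weights = window["application_category"].map(
        lambda c: 2.0 if c == "gaming" else 1.0
    )

    model = HistGradientBoostingRegressor()
    model.fit(window[FEATURES], window["observed_rtt_p95_ms"], sample_weight=weights)
    return model
```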
Key Management
Each client generates a WireGuard keypair locally. The public key is registered with the Control Plane. Edge Nodes never see private keys. When a session ends, the allowed-IP entry is removed from the edge node within 60 seconds.
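A minimal client-side sketch of that flow follows; the registration endpoint and payload are hypothetical, but the important property is that only the public key ever leaves the device.

```python
# Client-side key generation sketch -- the registration route and JSON payload
# are hypothetical; only the "private key never leaves the device" flow is real.
import base64
import requests
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

def register_device(control_plane_url: str, device_id: str) -> X25519PrivateKey:
    private_key = X25519PrivateKey.generate()   # stays on the device
    public_bytes = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
    # Only the base64-encoded public key is sent to the Control Plane
    requests.post(
        f"{control_plane_url}/v1/devices/{device_id}/wireguard-key",  # hypothetical route
        json={"public_key": base64.b64encode(public_bytes).decode()},
        timeout=5,
    )
    return private_key
```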
For multi-hop sessions:
```
Client --[key_A]--> Node 1 --[key_B]--> Node 2 --> Destination
```
The client generates a separate ephemeral keypair for each hop. Node 1 sees the client's IP but not the destination. Node 2 sees the destination but not the client's IP. Neither node sees both.
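The per-hop key generation itself is simple; a sketch, with hypothetical hop names:

```python
# Multi-hop sketch -- hop names are hypothetical; the point is one ephemeral
# keypair per hop, so no single node can link client and destination.
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

def build_hop_keys(hops: list[str]) -> dict[str, X25519PrivateKey]:
    """Generate a distinct ephemeral keypair for each hop in the chain."""
    return {hop: X25519PrivateKey.generate() for hop in hops}

hop_keys = build_hop_keys(["node1.example", "node2.example"])
# Node 1 only ever learns hop_keys["node1.example"].public_key(); Node 2 likewise.
```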
Observability Stack
Production telemetry runs on:
- Metrics: Prometheus + VictoriaMetrics (long-term storage)
- Tracing: OpenTelemetry → Tempo
- Logs: Vector → ClickHouse (structured logs, queryable in seconds)
- Alerting: Alertmanager with PagerDuty escalation
The dashboard that oncall watches lives at grafana.internal/d/fastsox-edge. Key SLIs:
| SLI | Target | Current |
|---|---|---|
| Connection success rate | ≥ 99.5% | 99.73% |
| p95 added latency | ≤ 20ms | 17ms |
| Session establishment time | ≤ 1s | 780ms |
| Auth API availability | ≥ 99.9% | 99.96% |
Lessons from Three Years of Production
1. Kernel upgrades are your biggest operational risk. WireGuard has been in-kernel since Linux 5.6, but a minor kernel update can change scheduling behaviour and push your p95 latency out by 5ms. Canary one node per AZ before rolling updates.
2. Mobile clients need a different reconnection strategy than desktop. iOS will kill background sockets aggressively. Our iOS client maintains a keep-alive ping every 25 seconds and re-establishes the WireGuard handshake proactively when the foreground timer detects a network-type change.
3. CGNAT is everywhere. About 38% of our mobile users sit behind carrier-grade NAT. WireGuard's UDP makes this workable (no server-initiated packets needed), but you must handle the case where a user's public IP changes mid-session without any signal to the server.
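For illustration, one way an edge node could watch for that kind of silent roam is to poll `wg show` and diff peer endpoints. This is just a sketch to make the behaviour visible, not our production mechanism.

```python
# Roaming-detection sketch -- polling `wg show <iface> endpoints` and diffing
# the results; the interval and logging are illustrative.
import subprocess
import time

def watch_endpoints(interface: str = "wg0", interval: int = 10) -> None:
    last_seen: dict[str, str] = {}
    while True:
        out = subprocess.run(
            ["wg", "show", interface, "endpoints"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.strip().splitlines():
            pubkey, endpoint = line.split("\t")
            if last_seen.get(pubkey) not in (None, endpoint):
                # The same authenticated peer now arrives from a new address, e.g. a CGNAT rebind
                print(f"peer {pubkey[:8]}... roamed to {endpoint}")
            last_seen[pubkey] = endpoint
        time.sleep(interval)
```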
Next in this series: FastSox 2026 Roadmap – QUIC transport, post-quantum keys, and the mesh architecture