cgn-router: KV-Aware Request Router
OpenAI-compatible HTTP gateway with sub-millisecond KV-aware routing, disaggregated dispatch, and per-tenant admission control.
Overview
cgn-router is the entry point for all inference requests. It accepts OpenAI-compatible HTTP traffic, computes a four-term routing score for each candidate node, and dispatches each request to the highest-scoring node, which favors nodes that already hold the matching KV prefix cache. For long prompts, it can split prefill and decode across separate worker pools via disaggregated dispatch.
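As a rough illustration of the four-term score, score(n) = w_kv·overlap + w_load·(1−util) + w_pwr·(1−watts) + w_cap·capacity, the sketch below evaluates it in Python. The default weights are the ones listed under Configuration. The `Weights` and `NodeStats` types, their field names, and the assumption that every input (including `watts`) is normalized to the range 0 to 1 are illustrative only and are not taken from cgn-router itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Defaults mirror router.score_weights.* in the Configuration table.
    kv: float = 0.55
    load: float = 0.25
    power: float = 0.10
    capacity: float = 0.10


@dataclass(frozen=True)
class NodeStats:
    # Hypothetical per-node inputs, each assumed to be normalized to [0, 1].
    overlap: float   # share of the prompt prefix already in the node's KV cache
    util: float      # current load
    watts: float     # power draw relative to some maximum (assumption)
    capacity: float  # available capacity


def score(n: NodeStats, w: Weights = Weights()) -> float:
    """Four-term linear score; higher is better."""
    return (
        w.kv * n.overlap
        + w.load * (1.0 - n.util)
        + w.power * (1.0 - n.watts)
        + w.capacity * n.capacity
    )


# A busy node with a warm cache versus an idle node with a cold cache.
warm = NodeStats(overlap=1.0, util=0.8, watts=0.5, capacity=0.2)
cold = NodeStats(overlap=0.0, util=0.1, watts=0.5, capacity=0.9)
best = max([warm, cold], key=score)
```

With the default weights, the KV term dominates: the warm node wins even though it is far busier, which matches the stated goal of routing to where the prefix cache already lives.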
Features
- Four-term linear scoring: score(n) = w_kv·overlap + w_load·(1−util) + w_pwr·(1−watts) + w_cap·capacity
- BLAKE3 sequence-chained digests for prompt prefix hashing (first ~64 tokens, 16-token blocks)
- Stable-hash tie-breaking preserves KV cache locality under burst conditions
- Per-(model, role) admission control: immediate 503 on queue overflow, with no buffering
- Disaggregated dispatch: separate prefill and decode agents for long-prompt workloads
- Multi-model cascade routing: requests escalate through progressively larger models when confidence is low
- Cross-cluster federation forwarding
- Live policy reload via etcd + lock-free arc_swap (zero hot-path cost)
- Auth via API key (Bearer token) or OIDC with JWKS rotation
- Rate limiting with configurable RPS and burst per endpoint
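The per-(model, role) admission behavior listed above can be sketched as a bounded counter that rejects immediately instead of queueing. This is a minimal sketch, not the router's implementation: the `AdmissionController` class, the `max_in_flight` limit, and the role names `"prefill"` and `"decode"` are assumptions made for illustration, and the code is not thread-safe.

```python
from collections import defaultdict


class AdmissionController:
    """Hypothetical admission gate keyed by (model, role)."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = defaultdict(int)  # (model, role) -> active requests

    def try_admit(self, model: str, role: str) -> int:
        """Return an HTTP-style status: 200 if admitted, 503 if the key is full."""
        key = (model, role)
        if self._in_flight[key] >= self.max_in_flight:
            return 503  # reject right away; nothing is buffered
        self._in_flight[key] += 1
        return 200

    def release(self, model: str, role: str) -> None:
        """Free one slot when a request finishes."""
        key = (model, role)
        if self._in_flight[key] > 0:
            self._in_flight[key] -= 1


gate = AdmissionController(max_in_flight=2)
statuses = [gate.try_admit("llama-3.1-70b", "prefill") for _ in range(3)]
```

Because limits are tracked per key, a saturated prefill pool for one model does not block decode traffic or other models.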
Architecture
Client → HTTP listener (port 8080) → auth check (cgn-auth) → rate limit → BLAKE3 prefix hash → score all candidate agents → select best → open gRPC stream to agent (mTLS) → re-encode tokens as SSE chunks → stream to client. Score weights are hot-reloaded from etcd. Metrics are emitted asynchronously to cgn-metrics.
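To make the prefix-hash step more concrete, the sketch below computes sequence-chained digests over 16-token blocks of the first 64 tokens, then measures how many leading blocks a node already holds. Several things here are assumptions: Python's standard library has no BLAKE3, so `hashlib.blake2b` stands in for it; the little-endian token encoding, the choice to hash only full blocks, and the function names `chained_digests` and `overlap` are all hypothetical.

```python
import hashlib
import struct

BLOCK_TOKENS = 16        # block size from the feature list
MAX_PREFIX_TOKENS = 64   # only the first ~64 tokens are hashed


def chained_digests(token_ids: list[int]) -> list[bytes]:
    """Digest i covers blocks 0..i, because each hash includes the previous digest."""
    prefix = token_ids[:MAX_PREFIX_TOKENS]
    digests = []
    prev = b""
    # Step through full blocks only; a trailing partial block is ignored (assumption).
    for start in range(0, len(prefix) - BLOCK_TOKENS + 1, BLOCK_TOKENS):
        block = prefix[start:start + BLOCK_TOKENS]
        h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
        h.update(prev)
        h.update(struct.pack(f"<{len(block)}I", *block))
        prev = h.digest()
        digests.append(prev)
    return digests


def overlap(request: list[bytes], cached: set[bytes]) -> float:
    """Fraction of leading request digests present in a node's cached set."""
    if not request:
        return 0.0
    matched = 0
    for d in request:
        if d not in cached:
            break  # chaining means later blocks cannot match either
        matched += 1
    return matched / len(request)


a = list(range(100))
b = list(range(32)) + [9999] * 68   # shares only the first two blocks with a
da, db = chained_digests(a), chained_digests(b)
shared = overlap(db, set(da))
```

Chaining is what makes the digests usable as prefix keys: two prompts that contain the same block at the same position still produce different digests if anything earlier differs.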
Configuration
| Key | Type | Default | Description |
|---|---|---|---|
| router.listen_http | string | 0.0.0.0:8080 | HTTP listen address for OpenAI-compatible API |
| router.listen_grpc | string | 0.0.0.0:9090 | gRPC listen address for internal communication |
| router.listen_admin | string | 0.0.0.0:9091 | Listen address for the admin/health endpoint |
| router.node_id | string | "" | Unique node identifier |
| router.score_weights.kv | f64 | 0.55 | Weight for KV prefix overlap score |
| router.score_weights.load | f64 | 0.25 | Weight for inverse load score |
| router.score_weights.power | f64 | 0.10 | Weight for power efficiency score |
| router.score_weights.capacity | f64 | 0.10 | Weight for available capacity score |
| router.rate_limit.rps | u32 | 1000 | Requests per second limit |
| router.rate_limit.burst | u32 | 2000 | Burst bucket size |
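The `rps` and `burst` settings read like the two parameters of a token bucket, so the sketch below shows how such a limiter could behave with those values. Whether cgn-router actually uses a token bucket is an assumption based on the "burst bucket size" wording; the `TokenBucket` class and its injectable clock are hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical limiter: refills at `rps` tokens per second, holds at most `burst`."""

    def __init__(self, rps: float = 1000, burst: float = 2000, now=time.monotonic):
        self.rate = float(rps)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self._now = now             # injectable clock, useful for testing
        self._last = now()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        t = self._now()
        elapsed = t - self._last
        self._last = t
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Small numbers and a fake clock keep the behavior easy to follow.
clock = [0.0]
bucket = TokenBucket(rps=10, burst=5, now=lambda: clock[0])
first_six = [bucket.allow() for _ in range(6)]
```

Under this model, the defaults (1000 RPS, burst 2000) would let a client send up to 2000 requests at once and then sustain 1000 per second.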
Example
[cluster]
name = "production"
state_backend = "etcd"
etcd_endpoints = ["http://etcd-0:2379"]
[security]
require_mtls = true
[auth]
enabled = true
[router]
listen_http = "0.0.0.0:8080"
listen_grpc = "0.0.0.0:9090"
node_id = "router-01"
[router.score_weights]
kv = 0.55
load = 0.25
power = 0.10
capacity = 0.10
[models."llama-3.1-70b"]
tp = 4
Performance
Routing decision: < 500µs per vCPU
HTTP overhead vs direct engine: < 3ms p99
Score tie-break threshold (ε): 0.01
Policy reload: zero hot-path cost (lock-free arc_swap)
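One way to read the tie-break threshold is that nodes scoring within ε of the best score are treated as tied, and a stable hash then picks among them so the same prefix keeps landing on the same node. The sketch below follows that reading. The rendezvous-style ranking, the use of `hashlib.blake2b`, and the `pick_node` function are assumptions made for illustration, not the documented algorithm.

```python
import hashlib

EPSILON = 0.01  # score tie-break threshold from the list above


def pick_node(scores: dict[str, float], prefix_key: bytes, eps: float = EPSILON) -> str:
    """Choose the best node; break near-ties with a hash of (prefix, node id)."""
    best = max(scores.values())
    tied = [node for node, s in scores.items() if best - s <= eps]

    def rank(node_id: str) -> bytes:
        # Deterministic per (prefix, node), independent of candidate ordering.
        return hashlib.blake2b(prefix_key + node_id.encode(), digest_size=8).digest()

    return max(tied, key=rank)


scores = {"agent-a": 0.700, "agent-b": 0.695, "agent-c": 0.400}
choice = pick_node(scores, prefix_key=b"example-prefix-digest")
```

Without such a rule, small score fluctuations during a burst could scatter requests with the same prefix across several nodes and dilute cache locality.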