cgn-router

KV-Aware Request Router

OpenAI-compatible HTTP gateway with sub-millisecond KV-aware routing, disaggregated dispatch, and per-tenant admission control.

Overview

cgn-router is the entry point for all inference requests. It accepts OpenAI-compatible HTTP traffic, computes a four-term routing score for each candidate node, and dispatches each request to the best-scoring node, which favors nodes that already hold the prompt's KV prefix cache. For long prompts, it can split prefill and decode across separate worker pools via disaggregated dispatch.

Features

Four-term linear scoring: score(n) = w_kv·overlap + w_load·(1−util) + w_pwr·(1−watts) + w_cap·capacity
BLAKE3 sequence-chained digests for prompt prefix hashing (first ~64 tokens, 16-token blocks)
Stable-hash tie-breaking preserves KV cache locality under burst conditions
Per-(model, role) admission with immediate 503 on queue overflow — no buffering
Disaggregated dispatch: separate prefill and decode agents for long-prompt workloads
Multi-model cascade routing: escalates through progressively larger models when confidence is low
Cross-cluster federation forwarding
Live policy reload via etcd + lock-free arc_swap (zero hot-path cost)
Auth via API key (Bearer token) or OIDC with JWKS rotation
Rate limiting with configurable RPS and burst per endpoint
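
The sequence-chained prefix digests in the list above can be illustrated with a short sketch. It is not the router's implementation. It assumes tokens are integer ids, that each 16-token block is hashed together with the digest of the previous block, and that only full blocks within the first 64 tokens are considered. Python's standard library has no BLAKE3, so `hashlib.blake2b` stands in for it here. The function names `chained_digests` and `overlap_blocks` are hypothetical.

```python
import hashlib
import struct

BLOCK_TOKENS = 16   # block size from the feature list
PREFIX_TOKENS = 64  # "first ~64 tokens" from the feature list


def chained_digests(token_ids, block=BLOCK_TOKENS, limit=PREFIX_TOKENS):
    """Return one digest per full block; each digest also covers all earlier blocks."""
    digests = []
    prev = b""  # empty seed for the first block (an assumption)
    prefix = token_ids[:limit]
    for start in range(0, len(prefix) - block + 1, block):
        chunk = prefix[start:start + block]
        h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
        h.update(prev)                       # chain in the previous block's digest
        h.update(struct.pack(f"<{len(chunk)}I", *chunk))
        prev = h.digest()
        digests.append(prev)
    return digests


def overlap_blocks(request_digests, cached_digests):
    """Count the leading blocks of a request that a node already has cached."""
    count = 0
    for d in request_digests:
        if d not in cached_digests:
            break
        count += 1
    return count
```

Because each digest depends on everything before it, two prompts share a digest at block k only if they agree on all tokens up to and including block k. Prefix overlap can then be measured by comparing digests from the front, without comparing tokens.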

Architecture

Client → HTTP listener (port 8080) → auth check (cgn-auth) → rate limit → BLAKE3 prefix hash → score all candidate agents → select best → open gRPC stream to agent (mTLS) → re-encode tokens as SSE chunks → stream to client. Score weights are hot-reloaded from etcd. Metrics are emitted asynchronously to cgn-metrics.
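
The last stage of the pipeline, re-encoding tokens as SSE chunks, might look roughly like the sketch below. It assumes the OpenAI-style streaming format for chat completions, where each event is a `data:` line carrying a JSON chunk and the stream ends with a `[DONE]` sentinel. The payload is abbreviated, and the function name `sse_chunks` and its parameters are hypothetical.

```python
import json


def sse_chunks(token_texts, model, request_id):
    """Yield SSE frames for a stream of decoded token strings."""
    for text in token_texts:
        payload = {
            "id": request_id,
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [
                {"index": 0, "delta": {"content": text}, "finish_reason": None}
            ],
        }
        # Each SSE event is a "data:" line terminated by a blank line.
        yield f"data: {json.dumps(payload)}\n\n"
    # OpenAI-compatible streams end with a literal [DONE] sentinel.
    yield "data: [DONE]\n\n"
```

A generator keeps the re-encoding incremental: each frame can be written to the client as soon as the corresponding token arrives from the agent's gRPC stream.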

Configuration

Key	Type	Default	Description
router.listen_http	string	0.0.0.0:8080	HTTP listen address for OpenAI-compatible API
router.listen_grpc	string	0.0.0.0:9090	gRPC listen address for internal communication
router.listen_admin	string	0.0.0.0:9091	Admin/health endpoint
router.node_id	string	""	Unique node identifier
router.score_weights.kv	f64	0.55	Weight for KV prefix overlap score
router.score_weights.load	f64	0.25	Weight for inverse load score
router.score_weights.power	f64	0.10	Weight for power efficiency score
router.score_weights.capacity	f64	0.10	Weight for available capacity score
router.rate_limit.rps	u32	1000	Requests per second limit
router.rate_limit.burst	u32	2000	Burst bucket size
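
The `router.rate_limit.rps` and `router.rate_limit.burst` keys read naturally as token-bucket parameters. The source does not state which algorithm the router uses, so the sketch below is an assumption: a bucket that refills at `rps` tokens per second and holds at most `burst` tokens. The class name `TokenBucket` is hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Token bucket: refills at `rps` tokens/second, capped at `burst` tokens."""

    def __init__(self, rps=1000, burst=2000, now=0.0):
        self.rps = rps
        self.burst = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        """Return True and consume a token if a request at time `now` is admitted."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the defaults, a client could send up to 2000 requests back to back, after which requests are admitted at a sustained 1000 per second.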

Example

toml

[cluster]
name           = "production"
state_backend  = "etcd"
etcd_endpoints = ["http://etcd-0:2379"]

[security]
require_mtls = true

[auth]
enabled = true

[router]
listen_http  = "0.0.0.0:8080"
listen_grpc  = "0.0.0.0:9090"
node_id      = "router-01"

[router.score_weights]
kv       = 0.55
load     = 0.25
power    = 0.10
capacity = 0.10

[models."llama-3.1-70b"]
tp = 4
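
With the weights from `[router.score_weights]`, the four-term score listed under Features can be evaluated as in the sketch below. It assumes every input is already normalized to the range 0 to 1. The `NodeStats` type and its field names are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    overlap: float   # fraction of the prompt prefix already in the node's KV cache
    util: float      # current utilization (0 = idle, 1 = saturated)
    watts: float     # normalized power draw
    capacity: float  # normalized available capacity


# Weights copied from the example configuration.
WEIGHTS = {"kv": 0.55, "load": 0.25, "power": 0.10, "capacity": 0.10}


def score(n, w=WEIGHTS):
    """score(n) = w_kv*overlap + w_load*(1-util) + w_pwr*(1-watts) + w_cap*capacity"""
    return (w["kv"] * n.overlap
            + w["load"] * (1.0 - n.util)
            + w["power"] * (1.0 - n.watts)
            + w["capacity"] * n.capacity)


warm_busy = NodeStats(overlap=1.0, util=0.8, watts=0.5, capacity=0.2)
cold_idle = NodeStats(overlap=0.0, util=0.1, watts=0.5, capacity=0.9)
```

Under these inputs the busy node with a fully cached prefix scores about 0.67, while the nearly idle node with no cached prefix scores about 0.365, which shows how strongly the default `kv` weight favors cache locality.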

Performance

Routing decision: < 500µs per vCPU

HTTP overhead vs direct engine: < 3ms p99

Score tie-break threshold (ε): 0.01

Policy reload: zero hot-path cost (lock-free arc_swap)
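
One way to read the tie-break threshold ε together with stable-hash tie-breaking is sketched below. The source does not describe the exact procedure, so the details are assumptions: any candidate whose score is within ε of the best is treated as tied, and the winner among the tied set is chosen by hashing the prompt prefix key together with the node id (in the style of rendezvous hashing). `hashlib.blake2b` is used only because it is in the standard library, and `pick_node` is a hypothetical name.

```python
import hashlib

EPSILON = 0.01  # tie-break threshold from the Performance section


def pick_node(scores, prefix_key, eps=EPSILON):
    """Pick a node from `scores` (node_id -> score) for a given prefix key (bytes)."""
    best = max(scores.values())  # raises ValueError if there are no candidates
    tied = [n for n, s in scores.items() if best - s <= eps]

    def rank(node_id):
        # Stable per-(prefix, node) rank: independent of score jitter and input order.
        digest = hashlib.blake2b(prefix_key + node_id.encode(), digest_size=8).digest()
        return (digest, node_id)

    return max(tied, key=rank)
```

Under this reading, small score fluctuations during a burst do not move a given prefix between near-equal nodes, so repeated requests with the same prefix keep landing where the cache is warm.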

cgn-agent