Routing — KV-aware scoring deep dive
The router's job is to pick the best node for each request with sub-millisecond p99 decision latency. "Best" is a four-term linear combination of signals scraped from every live agent.
The score
For every healthy node n that hosts the requested model, we
compute:
score(n) = w_kv · overlap(prefix, n)
+ w_load · (1 − util(n))
+ w_pwr · (1 − norm_watts(n))
+ w_cap · capacity(n)
Each term is normalised to [0, 1]; the four weights are read live
from etcd at /cognitora/routing/policy (defaults 0.55, 0.25, 0.10,
0.10) and applied via an arc_swap::ArcSwap, so reading the policy
is a lock-free load on the hot path.
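The linear combination above can be sketched directly. This is a minimal illustration, not the real cgn-router types: the `Weights` and `NodeSignals` structs and their field names are assumptions, and only the formula and the default weights come from this doc.

```rust
// Illustrative sketch of the four-term score; struct and field names
// are hypothetical, the formula and default weights are from the doc.
#[derive(Clone, Copy)]
struct Weights { kv: f64, load: f64, pwr: f64, cap: f64 }

#[derive(Clone, Copy)]
struct NodeSignals { overlap: f64, util: f64, norm_watts: f64, capacity: f64 }

fn score(w: Weights, n: NodeSignals) -> f64 {
    w.kv * n.overlap
        + w.load * (1.0 - n.util)
        + w.pwr * (1.0 - n.norm_watts)
        + w.cap * n.capacity
}

fn main() {
    // The etcd defaults named above: 0.55, 0.25, 0.10, 0.10.
    let w = Weights { kv: 0.55, load: 0.25, pwr: 0.10, cap: 0.10 };
    let hot = NodeSignals { overlap: 0.9, util: 0.8, norm_watts: 0.7, capacity: 0.5 };
    let cold = NodeSignals { overlap: 0.0, util: 0.1, norm_watts: 0.2, capacity: 0.5 };
    // With w_kv at 0.55, a well-cached but busy node can still beat
    // an idle node with a cold cache.
    println!("hot={:.3} cold={:.3}", score(w, hot), score(w, cold));
}
```

The example shows why w_kv dominates: a 90% prefix overlap outweighs an 80% utilisation penalty under the default policy.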
overlap(prefix, n)
The router hashes the prompt's first ~64 tokens to a chain of
BLAKE3 digests (one per 16-token block). For each candidate node it
asks cgn-kvcached (over the local UDS) how many leading blocks are
present in the warm tier. The fraction of matched blocks is the
overlap signal. This is the single most important term: it is why
KV-aware routing exists.
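The chained block hashing can be sketched as follows. The real router uses BLAKE3 (an external crate); to stay self-contained this sketch substitutes std's DefaultHasher, and the function names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for a BLAKE3 digest: each block's digest chains in the
// previous one, so a digest identifies the whole prefix up to that block.
fn block_digest(prev: u64, block: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    block.hash(&mut h);
    h.finish()
}

// Fraction of leading 16-token blocks (over the first ~64 tokens)
// whose digests are already present in the node's warm tier.
fn overlap(prompt_tokens: &[u32], warm: &[u64]) -> f64 {
    let head = &prompt_tokens[..prompt_tokens.len().min(64)];
    let mut prev = 0u64;
    let mut chain = Vec::new();
    for block in head.chunks(16) {
        prev = block_digest(prev, block);
        chain.push(prev);
    }
    if chain.is_empty() { return 0.0; }
    let matched = chain.iter().zip(warm).take_while(|(a, b)| a == b).count();
    matched as f64 / chain.len() as f64
}

fn main() {
    let tokens: Vec<u32> = (0..64).collect();
    // Suppose this node's warm tier holds the first two 16-token blocks.
    let mut prev = 0u64;
    let warm: Vec<u64> = tokens[..32]
        .chunks(16)
        .map(|b| { prev = block_digest(prev, b); prev })
        .collect();
    println!("overlap = {}", overlap(&tokens, &warm)); // prints 0.5
}
```

Chaining matters: because each digest covers everything before it, a single `take_while` over leading blocks is enough, and a divergence anywhere invalidates all later blocks.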
util(n)
Each cgn-agent publishes its in-flight request count and total
GPU SM utilisation every [agent].heartbeat (default 5 s). The
router computes
util = max(in_flight / max_concurrent, sm_pct)
and uses (1 − util) so a busy node loses score linearly.
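The max() semantics can be made concrete; the parameter names here are illustrative, and the clamp to 1.0 is an added assumption so the term stays in [0, 1].

```rust
// Sketch of the utilisation signal: whichever bottleneck is tighter
// (request slots or GPU SMs) defines how busy the node is.
// Clamping to 1.0 is an assumption to keep the term in [0, 1].
fn util(in_flight: u32, max_concurrent: u32, sm_pct: f64) -> f64 {
    (in_flight as f64 / max_concurrent as f64).max(sm_pct).min(1.0)
}

fn main() {
    // SM-bound: few requests in flight, but the GPU reads 90% busy.
    assert_eq!(util(4, 16, 0.9), 0.9);
    // Slot-bound: 12/16 = 0.75 dominates a 50% SM reading.
    assert_eq!(util(12, 16, 0.5), 0.75);
    println!("ok");
}
```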
norm_watts(n)
cgn-metrics exports cgn_power_watts{component=...} per host.
The router fetches a 30 s rolling average and normalises by the
fleet's max so an idle node with low watts wins, all else equal.
This is the only term whose default weight is small (0.10); turn
it up for energy-aware clusters.
capacity(n)
Static metadata: TP size, dtype, VRAM, free model slots. This term mostly breaks ties when two nodes are equally busy and equally cached.
Tie-break
When two nodes score within ε (0.01 default) the router falls back
to a stable hash of (node_id, prefix_hash). That keeps a
prefix-bound request hitting the same node across retries — the
stickiness is what makes KV-aware routing worth it under burst
traffic.
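One way to realise that stickiness, sketched under assumptions (the candidate representation and the use of std's DefaultHasher in place of whatever the router actually hashes with are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Among candidates within eps of the top score, pick the one with the
// lowest stable hash of (node_id, prefix_hash). Deterministic inputs
// give a deterministic winner, so retries of the same prefix stay
// on the same node.
fn tie_break<'a>(candidates: &[(&'a str, f64)], prefix_hash: u64, eps: f64) -> &'a str {
    let best = candidates.iter().map(|(_, s)| *s).fold(f64::MIN, f64::max);
    candidates
        .iter()
        .filter(|(_, s)| best - s <= eps)
        .min_by_key(|(id, _)| {
            let mut h = DefaultHasher::new();
            id.hash(&mut h);
            prefix_hash.hash(&mut h);
            h.finish()
        })
        .map(|(id, _)| *id)
        .unwrap()
}

fn main() {
    let prefix_hash: u64 = 0xBEEF;
    // 0.701 vs 0.700 is inside the default eps of 0.01, so the hash,
    // not the raw score, decides, and it decides the same way every time.
    let picked = tie_break(&[("node-a", 0.701), ("node-b", 0.700)], prefix_hash, 0.01);
    println!("picked {}", picked);
}
```

Hashing the pair rather than the node alone means different prefixes spread across the tied nodes instead of all piling onto one.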
Admission
After scoring, the router calls Admission::try_admit(model, role).
The admission counter is per-(model, role) and bounded by
[router.admission].max_queue. A Permit is held for the lifetime
of the request and decrements on drop (RAII). When the queue is
full the router returns 503 immediately without calling the agent.
We deliberately don't queue: queueing inflates TTFT, and the client's deadline budget is more useful at the source. The queue parameter is therefore a hard cap on concurrency, not a buffer.
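The counter-plus-RAII-permit shape can be sketched like this. It models a single (model, role) key; the real Admission is keyed per pair and its try_admit takes the model and role, so the signature and field names here are assumptions.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal sketch of a bounded admission counter for one (model, role)
// key; the real type holds one such counter per key.
struct Admission { in_flight: Arc<AtomicUsize>, max_queue: usize }

// Held for the lifetime of the request; dropping it releases the slot.
struct Permit { in_flight: Arc<AtomicUsize> }

impl Admission {
    fn new(max_queue: usize) -> Self {
        Self { in_flight: Arc::new(AtomicUsize::new(0)), max_queue }
    }

    // Returns None when the cap is hit; the router maps that to an
    // immediate 503 without touching the agent.
    fn try_admit(&self) -> Option<Permit> {
        let mut cur = self.in_flight.load(Ordering::Relaxed);
        loop {
            if cur >= self.max_queue { return None; }
            match self.in_flight.compare_exchange_weak(
                cur, cur + 1, Ordering::AcqRel, Ordering::Relaxed,
            ) {
                Ok(_) => return Some(Permit { in_flight: self.in_flight.clone() }),
                Err(actual) => cur = actual,
            }
        }
    }
}

impl Drop for Permit {
    // RAII: the slot is freed on any exit path, success, error, or panic.
    fn drop(&mut self) { self.in_flight.fetch_sub(1, Ordering::AcqRel); }
}

fn main() {
    let adm = Admission::new(2);
    let p1 = adm.try_admit().expect("admitted");
    let _p2 = adm.try_admit().expect("admitted");
    assert!(adm.try_admit().is_none()); // cap hit: would be a 503
    drop(p1);                           // releasing a permit frees a slot
    assert!(adm.try_admit().is_some());
    println!("ok");
}
```

The compare-exchange loop avoids the check-then-increment race: two requests can both read `cur < max_queue`, but only one wins the swap for the last slot.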
Cascade
Optional. When [router.cascade].enabled = true and the request's
model has a cascade chain configured, the router runs the request
against the smallest model first. After the response comes back, the
mean log-probability across emitted tokens is compared against
[router.cascade].confidence_threshold. If it's below threshold the
router escalates to the next model in the chain.
The score function above runs once per cascade step, so each step
goes to the best-suited node for that model. The cascade FSM lives
in cgn-router/src/cascade.rs; the gateway invokes it as a wrapper
around routing::pick.
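The escalation decision itself is small; a hedged sketch, with illustrative model names and threshold (the real FSM in cascade.rs carries more state than this):

```rust
// Mean log-probability across the tokens a step emitted.
fn mean_logprob(logprobs: &[f64]) -> f64 {
    logprobs.iter().sum::<f64>() / logprobs.len() as f64
}

// If the step was confident enough, stop; otherwise name the next
// (larger) model in the chain, or None when the chain is exhausted.
fn next_model<'a>(
    chain: &[&'a str],
    step: usize,
    logprobs: &[f64],
    threshold: f64,
) -> Option<&'a str> {
    if mean_logprob(logprobs) >= threshold {
        None // confident: this step's response is the answer
    } else {
        chain.get(step + 1).copied()
    }
}

fn main() {
    let chain = ["m-small", "m-medium", "m-large"];
    // Low confidence on the small model escalates one step.
    assert_eq!(next_model(&chain, 0, &[-2.0, -3.0], -1.0), Some("m-medium"));
    // High confidence stops the cascade.
    assert_eq!(next_model(&chain, 0, &[-0.1, -0.2], -1.0), None);
    // At the end of the chain there is nowhere left to escalate.
    assert_eq!(next_model(&chain, 2, &[-5.0], -1.0), None);
    println!("ok");
}
```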
Disaggregation
Optional. When [router.disagg].enabled = true and the prompt
exceeds colocate_below_tokens, the router asks for two nodes:
a prefill agent that runs the first forward pass and emits KV blocks
into cgn-kvcached, and a decode agent that picks up those blocks
(via QUIC/RDMA) and streams tokens. The handshake adds one round
trip but pays for itself on long-prompt / short-completion workloads
because decode-only nodes can be smaller / cheaper.
The QUIC transport is the default; RDMA is available behind the
rdma feature flag on Linux hosts with ibverbs.
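A minimal config sketch of the two knobs named above; the values are illustrative, not shipped defaults.

```toml
[router.disagg]
enabled = true
colocate_below_tokens = 2048  # prompts shorter than this stay on one node
```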
Where the code lives
| File | Role |
|---|---|
| rust/services/cgn-router/src/routing/score.rs | the score function above |
| rust/services/cgn-router/src/routing/selector.rs | pick: scoring + admission + tie-break |
| rust/services/cgn-router/src/admission.rs | per-(model, role) queue + RAII permit |
| rust/services/cgn-router/src/cascade.rs | cascade FSM |
| rust/services/cgn-router/src/disagg.rs | prefill/decode plan |
| rust/services/cgn-router/src/cluster/{registry,watcher}.rs | etcd-backed node registry |