Cognitora architecture
One-page technical overview. For repo and crate layout see
architecture/repo-layout.md. For the full configuration reference see reference/config.md.
1. Topology
Every box is a single statically-linked Rust binary. cgn-metrics,
cgn-ctl, and cgn-operator are shown elsewhere; they sit alongside
the four hot-path daemons above.
2. Hot path
A request travels:
- HTTP ingress — `cgn-router` accepts an OpenAI-compatible request on `:8080`. Auth is checked by `cgn-auth` (API key in `Authorization: Bearer ...` or OIDC bearer with JWKS rotation). Rate limiting is applied by `cgn-ratelimit`, keyed on the authenticated subject.
- Approximation — the gateway hashes the prompt prefix into a tree of sequence-chained BLAKE3 digests (each chunk's digest covers all preceding tokens). The digests are the input to the routing score and ensure that two requests with identical chunks in different positions never collide.
- Score — for every healthy node that hosts the requested model, `routing::score` computes `total = w_kv·longest_prefix + w_load·load + w_pwr·power + w_cap·capacity` using the policy from etcd (live-updated via `arc_swap`). The router selects the highest-scoring node; ties break on the prefix hash.
- Admission — `admission::try_admit` increments an inflight counter for the `(model, role)` pair; the request is rejected with 429 if the global queue is full.
- Forward — the gateway opens an `Agent.Generate` gRPC stream over mTLS. `cgn-agent` proxies the stream to the engine subprocess (`vllm`, `sglang`, `llama_cpp`, or `openai_compat` — selected by `[engine].kind`; pluggable through the `Engine` trait).
- Stream out — tokens flow back through the gRPC stream and are re-encoded as OpenAI SSE chunks by `gateway::sse`.
For long prompts, the optional disagg path replaces step 5 with a
two-stage flow: a prefill agent runs the first forward pass and emits
KV blocks via the engine's `--kv-transfer-config` connector
(`NixlConnector`, optionally stacked with `LMCacheConnectorV1` or
`DynamoConnector` (KVBM) — see architecture/kv-strategy.md);
the decode agent consumes them and streams the rest of the response.
3. Cluster state
etcd holds:
- `/cognitora/nodes/<node_id>` — `NodeHealth` JSON written by every `cgn-agent` every `[agent].heartbeat` (default 5 s). Stale entries are filtered out by `routing::pick`.
- `/cognitora/routing/policy` — score weights + admission tunables, written by `cgn-operator` from a `RoutingPolicy` CRD or by `cgn-ctl cluster set-policy` for non-K8s deploys.
There is no PostgreSQL, MySQL, or KV-store-of-the-week dependency: etcd is the only coordination service.
4. KV cache
Cognitora separates KV management into four layers that compose:
| # | Layer | Owner |
|---|---|---|
| 1 | Engine-internal KV (GPU HBM) | vLLM / SGLang / llama.cpp / TRT-LLM |
| 2 | Engine-side offload connector | LMCache · HiCache · KVBM · NIXL (selected via engine.kv_offload) |
| 3 | Cross-worker KV transfer | NixlConnector (vLLM) or NIXL inside KVBM/HiCache |
| 4 | Cross-cluster KV-aware routing | cgn-kvcached + cgn-router |
cgn-kvcached runs once per GPU host and exposes
`cognitora.v1.Kv` over gRPC plus a QUIC transport. It owns:
- RAM tier (warm) — a `RamTier` keyed by `BlockAddress` (model digest + layer). Eviction is approximate-LRU.
- SSD tier (cold) — block files under `/var/lib/cognitora/kv/ssd/`. The metadata index is RocksDB by default; an in-memory fallback exists for dev builds (see the `persistent-index` feature).
- GPU residency index — a window of pinned-block addresses reported by the engine, so the router can answer "is this prefix on this GPU right now?" without traversing the engine.
Cross-host fetches use QUIC (1-RTT, with 0-RTT for retries to the same
peer). The wire format is a binary `Frame { addr, model, layer, bytes }`
with bincode-encoded headers.
For the deep dive on layer 2 (engine-side offload) — including the
LMCache vs HiCache vs KVBM matrix — see
architecture/kv-strategy.md.
5. Power and energy
cgn-power reads:
- Redfish out-of-band power (chassis + per-PSU draw),
- IPMI as a fallback,
- NVML per-GPU power.
cgn-metrics exports those readings as cgn_power_watts{component=...}
gauges. The router subscribes to the metrics endpoint and incorporates
the result into the power term of the routing score, biasing requests
toward energy-efficient nodes when the operator turns the weight up.
6. Security
- mTLS everywhere by default. `cgn-tls` provides one helper for loading PEM material and another for generating dev PKI material on install (rcgen).
- API auth: API keys (sha256-hashed, hot-reloaded from disk) and OpenID Connect bearer tokens (validated against JWKS).
- Distroless images with a `nonroot` UID; no shell in production containers.
- All commits land on a release branch only after `clippy -D warnings`, `cargo test`, and `helm lint` pass in CI.
7. Capabilities
| Area | What ships |
|---|---|
| Engines | vLLM · SGLang · llama.cpp · OpenAI-compat (TRT-LLM via thin driver) |
| Single-node serving | cgn-router + cgn-agent + cgn-kvcached + OpenAI HTTP/SSE |
| Multi-node routing | Sequence-chained BLAKE3 digests + longest-prefix overlap + load / power / capacity scoring |
| KV offload backends | none / nixl / lmcache / hicache / kvbm — one TOML knob, auto-rendered into engine argv |
| Cross-node KV transport | QUIC frame codec; RDMA behind a feature flag; prefill/decode disagg |
| Multi-tenancy | OIDC SSO with group → scope mapping; in-process and Redis rate limit |
| Cascade | SLM → Mid → LLM via cascade::Cascade::run |
| Kubernetes | cgn-operator reconciles InferenceCluster, ModelPool, RoutingPolicy |
| Federation | Cross-cluster forwarder in cgn-router::federation |
| Energy-aware autoscale | Closed-loop drain hints written to etcd, picked up by the operator |
| SLO admission | Per-tenant deadline propagation in cgn-router::deadline |
The tests/perf/ harness gates every PR against the
performance targets in the README.