# KV strategy: how Cognitora composes engines, offload backends, and the cross-cluster index
This document explains what KV caching is in Cognitora, which third-party KV systems we integrate with (LMCache, SGLang HiCache, NVIDIA KVBM, NIXL), and why we don't reinvent any of them. It also positions us against NVIDIA Dynamo — the closest analogue — and highlights the points where we are deliberately ahead.
If you are looking for the runtime data plane (RAM tier, SSD tier, cross-node QUIC fetch), see `kv-tiering.md`. This page is about strategy, not implementation.
## The four KV layers
A request that hits Cognitora touches up to four distinct KV systems, each with a different owner:
| # | Layer | Owner | What lives here | Latency target |
|---|---|---|---|---|
| 1 | Engine-internal KV | vLLM / SGLang / llama.cpp | The KV blocks the model writes during prefill. Pinned in GPU HBM. | µs (GPU local) |
| 2 | Engine-side offload connector | The connector (`LMCacheConnectorV1`, `DynamoConnector`, `--enable-hierarchical-cache`) | Spillover blocks in CPU RAM, NVMe, Redis, Mooncake, S3. Stacks on layer 1 via vLLM/SGLang's connector ABI. | sub-ms → ms |
| 3 | Cross-worker KV transfer | `NixlConnector` (vLLM) or NIXL inside KVBM/HiCache | The KV produced on a prefill GPU streamed to a decode GPU in disagg topologies. | RDMA-bound, sub-ms |
| 4 | Cross-cluster KV-aware routing | `cgn-kvcached` + `cgn-router` | A persistent index of which node holds which prefix. Used to score candidate workers before a request is dispatched. | sub-ms (same DC) |
Cognitora owns layer 4. We integrate layers 2 and 3. Layer 1 is left untouched — it's the engine's internal accounting.
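To make layer 4 concrete, the sketch below shows one plausible shape for an index entry: the router only needs to map a prefix digest to the node that holds it, plus enough metadata to score and age entries. The struct and field names are illustrative assumptions, not `cgn-kvcached`'s actual schema; per the table, the real store is RocksDB.

```rust
/// A plausible layer-4 index record: which node holds the KV blocks for a
/// given prefix. Illustrative only; not cgn-kvcached's actual schema.
struct PrefixEntry {
    /// Sequence-chained BLAKE3 digest of the prefix (see cgn-core::hash).
    prefix_digest: [u8; 32],
    /// Worker that currently has these blocks resident (RAM or SSD tier).
    node_id: String,
    /// Number of KV blocks covered, used when scoring candidate workers.
    block_count: u32,
    /// Last-touched timestamp, used to age out stale entries.
    last_touch_unix_ms: u64,
}
```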
## What we use today
| Capability | Today | Future-work tag |
|---|---|---|
| Layer 4 — cross-cluster prefix index (Rust, RocksDB, QUIC peer fetch) | yes | mature |
| Layer 3 — `NixlConnector` for vLLM disagg | yes | mature |
| Layer 2 — `LMCacheConnectorV1` for vLLM (agg + disagg) | yes (auto-wired by `kv_offload = "lmcache"`) | mature |
| Layer 2 — SGLang HiCache (`--enable-hierarchical-cache`) | yes (auto-wired by `kv_offload = "hicache"`) | mature |
| Layer 2 — `DynamoConnector` (KVBM) | yes (auto-wired by `kv_offload = "kvbm"`) — for parity benchmarks | benchmarking |
| Layer 2 — FlexKV | no | considering |
| Mooncake-backed shared HiCache pool | no, but the recipe TOML can pass through `--hicache-storage-backend mooncake` | compatible |
LMCache is not enabled by default. It is opt-in via `[engine].kv_offload = "lmcache"` and ships in two reference recipes:

- `recipes/llama3-8b/vllm/agg-lmcache/`
- `recipes/llama3-8b/vllm/disagg-lmcache/`
## Why we don't ship our own KVBM-equivalent
Three reasons:
- Coupling tax. KVBM lives behind vLLM's `KVConnectorBase_V1` and TRT-LLM's PyTorch backend. Reimplementing it would require tracking two engine ABIs whose only consumers in practice are KVBM, LMCache, and FlexKV. We would spend most of our time chasing breaking changes in those ABIs.
- Diminishing returns. Once you have a working connector (LMCache, KVBM, FlexKV, HiCache), the marginal gain from yet another connector is small — they all hit the same ~3-10× TTFT improvement on RAG workloads. The big wins come from layer 4 (smart routing) and layer 3 (disagg topologies), which we already own.
- Coverage matters more than depth. Dynamo is vLLM/TRT-LLM-centric for KVBM; we get strictly broader coverage by using LMCache for vLLM, HiCache for SGLang, and KVBM for benchmark parity, all selectable via one TOML knob.
## Single dial: `engine.kv_offload`
```toml
[engine]
kind = "vllm"          # or "sglang", "llama_cpp", "openai_compat"
kv_offload = "lmcache" # or "none" | "nixl" | "hicache" | "kvbm"
```
`cgn-agent` translates this into the right CLI flags at engine launch.
The compatibility matrix is:
| engine | none | nixl | lmcache | hicache | kvbm |
|---|---|---|---|---|---|
| vllm | yes | yes | yes | no | yes |
| sglang | yes | yes | no | yes | no |
| llama_cpp | yes | no | no | no | no |
| openai_compat | yes | no | no | no | no |
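As a sketch of how that matrix can be enforced at launch, the snippet below encodes it as one exhaustive match. The enum and function names are illustrative assumptions, not the actual `cgn-agent` API:

```rust
#[derive(Clone, Copy)]
enum Engine { Vllm, Sglang, LlamaCpp, OpenaiCompat }

#[derive(Clone, Copy)]
enum KvOffload { None, Nixl, Lmcache, Hicache, Kvbm }

/// Encodes the compatibility matrix above. Rejecting an unsupported pair
/// here keeps the failure out of the engine's own argv parsing.
fn is_supported(engine: Engine, offload: KvOffload) -> bool {
    match (engine, offload) {
        // every engine supports running without an offload backend
        (_, KvOffload::None) => true,
        // vLLM: NIXL, LMCache, and KVBM via the KVConnectorBase_V1 ABI
        (Engine::Vllm, KvOffload::Nixl | KvOffload::Lmcache | KvOffload::Kvbm) => true,
        // SGLang: NIXL transfer plus HiCache hierarchical offload
        (Engine::Sglang, KvOffload::Nixl | KvOffload::Hicache) => true,
        // llama.cpp and openai_compat take no offload connectors
        _ => false,
    }
}
```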
Disagg topologies (`[agent].role = "prefill"` or `"decode"`) auto-stack the chosen offload backend with NIXL, mirroring NVIDIA's recommended pattern from `docs/integrations/lmcache-integration.md`.
The full rendering table is in `cgn-agent::engine::spawn`. For convenience:

- `vllm × lmcache × prefill` → `--kv-transfer-config '{"kv_connector":"PdConnector",...,"connectors":[LMCache,NIXL]}'`
- `vllm × lmcache × decode` → `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`
- `vllm × kvbm × *` → `--kv-transfer-config '{"kv_connector":"DynamoConnector",...,"kv_connector_module_path":"kvbm.vllm_integration.connector"}'`
- `sglang × hicache × *` → `--enable-hierarchical-cache --hicache-ratio 2 --hicache-write-policy write_through --hicache-storage-backend nixl`
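For illustration, here is a minimal sketch of that rendering as a single match, assuming a hypothetical `render_kv_flags` helper; the real logic lives in `cgn-agent::engine::spawn`, and the `...` inside the JSON literals is the elision from the list above, not valid JSON.

```rust
// Hypothetical helper: map (engine, kv_offload, role) to engine CLI flags.
fn render_kv_flags(engine: &str, offload: &str, role: &str) -> Vec<String> {
    let args: &[&str] = match (engine, offload, role) {
        // prefill side stacks LMCache and NIXL behind PdConnector
        ("vllm", "lmcache", "prefill") => &[
            "--kv-transfer-config",
            r#"{"kv_connector":"PdConnector",...,"connectors":[LMCache,NIXL]}"#,
        ],
        // decode side receives KV over NIXL
        ("vllm", "lmcache", "decode") => &[
            "--kv-transfer-config",
            r#"{"kv_connector":"NixlConnector","kv_role":"kv_both"}"#,
        ],
        // KVBM plugs in through its Python connector module
        ("vllm", "kvbm", _) => &[
            "--kv-transfer-config",
            r#"{"kv_connector":"DynamoConnector",...,"kv_connector_module_path":"kvbm.vllm_integration.connector"}"#,
        ],
        // HiCache is a flag set rather than a connector JSON blob
        ("sglang", "hicache", _) => &[
            "--enable-hierarchical-cache",
            "--hicache-ratio", "2",
            "--hicache-write-policy", "write_through",
            "--hicache-storage-backend", "nixl",
        ],
        _ => &[], // other combinations render no KV flags
    };
    args.iter().map(|s| s.to_string()).collect()
}
```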
## Comparison: Cognitora vs Dynamo
| Concern | Dynamo | Cognitora |
|---|---|---|
| Built-in offload backend | KVBM (Rust + Python) | None — we integrate LMCache / HiCache / KVBM as alternatives |
| Cross-cluster prefix index | `kv-router` + NATS events | `cgn-kvcached` + `cgn-router` (single-binary, RocksDB, QUIC peer fetch) |
| Routing correctness | Radix tree of GPU-resident blocks | Same idea, plus sequence-chained BLAKE3 digests so the router never confuses repeated chunks at different positions (see `cgn-core::hash::hash_seq_chunks`) |
| Engine support | vLLM, TRT-LLM, SGLang | vLLM, SGLang, llama.cpp, openai-compat |
| Disaggregation | Yes (1P1D, 2P2D) | Yes (`vllm/disagg-single-node`, `vllm/disagg-lmcache`) |
| Deployment artefact | Kubernetes operator + CRDs | Flat TOML profiles + `up.sh`; optional `cgn-ctl recipe up` |
| Python footprint | Required (frontend + most backends) | None for the Rust binaries; engines bring their own |
| LMCache support | Yes (one-off launch script) | Yes (`kv_offload = "lmcache"`, agg & disagg) |
| HiCache support | Yes (Mooncake-backed) | Yes (`kv_offload = "hicache"`, NIXL backend by default; Mooncake via passthrough) |
| KVBM support | Yes (native) | Yes (`kv_offload = "kvbm"`) — benchmark parity |
| Single-binary install | No | Yes — three Rust binaries, no Python |
How "beating Dynamo" looks concretely
We don't claim a faster KV offload backend in absolute terms — KVBM and LMCache are both well-tuned and our recipes piggyback on them. What we do claim:
- Better routing correctness. Sequence-chained digests plus longest-prefix overlap make our KV scoring positionally correct where Dynamo's chunk-overlap can mis-score interleaved prefixes (see the sketch after this list).
- Broader engine coverage. SGLang HiCache and llama.cpp are first-class, not afterthoughts.
- Lower operational tax. Three Rust binaries, no Kubernetes operator, no Python control plane. The same recipes work bare-metal and in Kubernetes.
- One dial for KV. `engine.kv_offload` swaps backends without hand-editing the engine argv, so recipes don't drift.
- Federation-ready. `cgn-kvcached`'s QUIC peer fetch composes across clusters; KVBM's leader/worker topology is single-cluster.
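A minimal sketch of the first claim, assuming the `blake3` crate; the function bodies are illustrative, not the actual `cgn-core::hash::hash_seq_chunks` implementation, and the chain seed is an arbitrary choice here:

```rust
/// Digest each chunk chained to its predecessor, so an identical chunk at
/// a different position produces a different digest. Sketch only.
fn hash_seq_chunks(chunks: &[&[u8]]) -> Vec<blake3::Hash> {
    let mut prev = blake3::hash(b""); // illustrative chain seed
    chunks
        .iter()
        .map(|chunk| {
            let mut h = blake3::Hasher::new();
            h.update(prev.as_bytes()); // bind this chunk to its position
            h.update(chunk);           // then hash the chunk's own tokens
            prev = h.finalize();
            prev
        })
        .collect()
}

/// Score a candidate worker: length of the longest common digest prefix
/// between the request's chunk digests and the worker's cached digests.
fn prefix_overlap(request: &[blake3::Hash], cached: &[blake3::Hash]) -> usize {
    request
        .iter()
        .zip(cached.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

With plain per-chunk hashing, two identical chunks at different positions would collide and an overlap count could credit a worker for KV it cannot actually reuse; the chain makes the second occurrence hash differently, so the longest-prefix score stays positionally correct.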
## When to pick which `kv_offload`
| Workload | Recommended `kv_offload` | Recipe |
|---|---|---|
| Dev / smoke tests / one-shot completions | `none` | `vllm/agg` |
| Long sessions, RAG, chat, repeated system prompts | `lmcache` (vLLM) / `hicache` (SGLang) | `vllm/agg-lmcache`, `sglang/agg-hicache` |
| 2-GPU node, want best TTFT under load | `lmcache` (disagg) | `vllm/disagg-lmcache` |
| Head-to-head benchmark vs Dynamo | `kvbm` | `vllm/agg-kvbm` |
| Air-gapped / no extra Python deps | `none` + `cgn-kvcached` | `vllm/agg`, `llama-cpp/cpu` |
## Future work
kv_offload = "flexkv"— Tencent's FlexKV connector. The rendering shape is the same as LMCache; we just need a renderer branch and a recipe.[engine.sglang].hicache_storage_backend— first-class TOML for pickingnixl | mooncake | nvme | s3instead of overriding viaextra_args.- WSPT prefill scheduling in
cgn-router— Dynamo's "weighted shortest predicted task" admission. The KV-overlap signal needed for it already exists; the queue restructure is what's missing. - Federated peer fetch policy — a routing knob for "prefer intra-cluster cache hit over remote LMCache hit" to bound egress cost.