cgn-agent: Engine Supervisor
Per-node agent that drives any inference engine exposing the OpenAI HTTP surface. Handles telemetry, KV handoff, and multi-engine clusters.
Overview
cgn-agent sits between the router and the inference engine subprocess. It exposes a gRPC stream interface (Agent.Generate), proxies requests to the engine via a pluggable Engine trait, manages GPU residency tracking for KV cache awareness, and publishes health heartbeats to etcd. It supports vLLM, SGLang, llama.cpp, TensorRT-LLM, and any OpenAI-compatible endpoint.
Features
- Pluggable Engine trait: vLLM, SGLang, llama.cpp, TensorRT-LLM, OpenAI-compatible
- Agent.Generate gRPC stream interface for router communication
- NodeHealth JSON heartbeats to etcd every 5 seconds (configurable)
- GPU residency index tracking for KV cache awareness
- NixlConnector for prefill/decode KV handoff in disaggregated mode
- NVML telemetry: in-flight request count, GPU SM utilization
- Role-based operation: prefill-only, decode-only, or both
- Pluggable KV offload via single TOML knob: none, nixl, lmcache, hicache, kvbm
- Auto-renders engine-specific launch arguments from configuration
- Simultaneous multi-engine clusters on the same node
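The pluggable backend abstraction above can be sketched as a Rust trait. The names below (`Engine`, `GenerateRequest`, `StubEngine`) are illustrative assumptions, not the actual cgn-agent API; the real trait is presumably async and streams tokens back over the `Agent.Generate` gRPC stream, while this simplified synchronous sketch only shows the shape of the abstraction.

```rust
/// Minimal request/response types standing in for the
/// OpenAI-compatible payloads (illustrative, not the real schema).
struct GenerateRequest {
    model: String,
    prompt: String,
}

struct GenerateResponse {
    text: String,
}

/// Each backend (vLLM, SGLang, llama.cpp, ...) implements this trait.
trait Engine {
    /// Base URL of the engine subprocess, e.g. "http://127.0.0.1:8000".
    fn endpoint(&self) -> &str;
    /// Proxy one request to the engine's /v1/chat/completions surface.
    fn generate(&self, req: &GenerateRequest) -> Result<GenerateResponse, String>;
}

/// Stub backend used here only to demonstrate the trait.
struct StubEngine {
    url: String,
}

impl Engine for StubEngine {
    fn endpoint(&self) -> &str {
        &self.url
    }
    fn generate(&self, req: &GenerateRequest) -> Result<GenerateResponse, String> {
        // A real implementation would POST to
        // format!("{}/v1/chat/completions", self.url) and stream the reply.
        Ok(GenerateResponse {
            text: format!("echo from {} for model {}", self.url, req.model),
        })
    }
}

fn main() {
    let engine = StubEngine { url: "http://127.0.0.1:8000".to_string() };
    let req = GenerateRequest { model: "llama-3.1-70b".into(), prompt: "hi".into() };
    let resp = engine.generate(&req).unwrap();
    println!("{}", resp.text);
}
```

Because the router only depends on the trait, swapping engines (or running several at once on one node) is a matter of instantiating a different implementation per configured `[engine]` block.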
Architecture
Router → gRPC stream (Agent.Generate) → cgn-agent → Engine trait → HTTP proxy to the engine subprocess (/v1/chat/completions). The agent writes NodeHealth heartbeats to etcd, reports GPU residency to cgn-kvcached over a Unix domain socket, and samples NVML telemetry for utilization metrics.
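The NodeHealth heartbeat can be sketched as follows. The field names (`node_id`, `role`, `inflight`, `gpu_sm_util`) are assumptions based on the telemetry the agent is described as collecting; the real schema may differ. JSON is rendered by hand here to keep the sketch dependency-free.

```rust
// Illustrative sketch of the NodeHealth heartbeat payload written to etcd.
// Field names are assumptions, not the actual cgn-agent schema.
struct NodeHealth {
    node_id: String,
    role: String,     // "prefill", "decode", or "both"
    inflight: u32,    // in-flight request count
    gpu_sm_util: f32, // NVML SM utilization, 0.0..=1.0
}

impl NodeHealth {
    /// Render the heartbeat as the JSON document stored under the node's etcd key.
    fn to_json(&self) -> String {
        format!(
            "{{\"node_id\":\"{}\",\"role\":\"{}\",\"inflight\":{},\"gpu_sm_util\":{:.2}}}",
            self.node_id, self.role, self.inflight, self.gpu_sm_util
        )
    }
}

fn main() {
    let hb = NodeHealth {
        node_id: "agent-gpu-01".into(),
        role: "both".into(),
        inflight: 3,
        gpu_sm_util: 0.87,
    };
    // The agent PUTs this value every `agent.heartbeat` interval (default 5s);
    // pairing the key with an etcd lease slightly longer than the interval
    // is a common pattern so that entries for dead nodes expire automatically.
    println!("{}", hb.to_json());
}
```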
Configuration
| Key | Type | Default | Description |
|---|---|---|---|
| agent.listen | string | 0.0.0.0:7080 | gRPC listen address |
| agent.role | enum | "both" | Agent role: "prefill", "decode", or "both" |
| agent.node_id | string | "" | Unique node identifier |
| agent.heartbeat | duration | "5s" | Health heartbeat interval to etcd |
| agent.kv_uds | string | "" | Unix domain socket path for cgn-kvcached |
| engine.kind | enum | "vllm" | Engine backend: vllm, sglang, llama_cpp, openai_compat |
| engine.url | string | "http://127.0.0.1:8000" | Engine HTTP endpoint |
| engine.kv_offload | enum | "none" | KV offload strategy: none, nixl, lmcache, hicache, kvbm |
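The "auto-renders engine-specific launch arguments" feature can be sketched from this configuration surface. The mapping below (`tp` → `--tensor-parallel-size` for vLLM, `--tp-size` for SGLang) reflects those engines' public CLIs, but the function name and exact flag set cgn-agent emits are assumptions.

```rust
// Hedged sketch: render per-engine launch arguments from config values.
// Flag names follow the vLLM and SGLang server CLIs; the real renderer
// in cgn-agent may cover more options (quantization, KV offload, etc.).
fn render_launch_args(kind: &str, model: &str, tp: u32, port: u16) -> Vec<String> {
    match kind {
        "vllm" => vec![
            "--model".into(), model.into(),
            "--tensor-parallel-size".into(), tp.to_string(),
            "--port".into(), port.to_string(),
        ],
        "sglang" => vec![
            "--model-path".into(), model.into(),
            "--tp-size".into(), tp.to_string(),
            "--port".into(), port.to_string(),
        ],
        other => panic!("unsupported engine kind: {other}"),
    }
}

fn main() {
    // Corresponds to engine.kind = "vllm" with a model configured at tp = 4.
    let args = render_launch_args("vllm", "llama-3.1-70b", 4, 8000);
    println!("{}", args.join(" "));
}
```

The same config keys thus drive different argument vectors per backend, which is what lets one TOML file describe a heterogeneous, multi-engine node.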
Example
```toml
[cluster]
name = "production"
state_backend = "etcd"
etcd_endpoints = ["http://etcd-0:2379"]

[security]
require_mtls = true

[agent]
listen = "0.0.0.0:7080"
role = "both"
node_id = "agent-gpu-01"
kv_uds = "/tmp/cognitora-kv.sock"

[engine]
kind = "vllm"
url = "http://127.0.0.1:8000"
kv_offload = "nixl"

[models."llama-3.1-70b"]
tp = 4
```
Performance
- Heartbeat interval: 5s (configurable)
- Engine proxy overhead: sub-millisecond per request
- Supports simultaneous multi-engine clusters on a single node