cgn-agent: Engine Supervisor
Per-node agent that drives any inference engine exposing the OpenAI HTTP surface. Handles telemetry, KV handoff, and multi-engine clusters.
Overview
cgn-agent sits between the router and the inference engine subprocess. It exposes a gRPC stream interface (Agent.Generate), proxies requests to the engine via a pluggable Engine trait, manages GPU residency tracking for KV cache awareness, and publishes health heartbeats to etcd. It supports vLLM, SGLang, llama.cpp, TensorRT-LLM, and any OpenAI-compatible endpoint.
Features
- Pluggable Engine trait: vLLM, SGLang, llama.cpp, TensorRT-LLM, OpenAI-compatible
- Agent.Generate gRPC stream interface for router communication
- NodeHealth JSON heartbeats to etcd every 5 seconds (configurable)
- GPU residency index tracking for KV cache awareness
- NixlConnector for prefill/decode KV handoff in disaggregated mode
- NVML telemetry: in-flight request count, GPU SM utilization
- Role-based operation: prefill-only, decode-only, or both
- Pluggable KV offload via single TOML knob: none, nixl, lmcache, hicache, kvbm
- Auto-renders engine-specific launch arguments from configuration
- Simultaneous multi-engine clusters on the same node
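The pluggable backend abstraction above can be sketched as a Rust trait. The names below (`Engine`, `GenerateRequest`, `StubEngine`) are illustrative assumptions, not the actual cgn-agent API; the real trait is presumably async and streams tokens back over the `Agent.Generate` gRPC stream, while this simplified synchronous sketch only shows the shape of the abstraction.

```rust
/// Minimal request/response types standing in for the
/// OpenAI-compatible payloads (illustrative, not the real schema).
struct GenerateRequest {
    model: String,
    prompt: String,
}

struct GenerateResponse {
    text: String,
}

/// Each backend (vLLM, SGLang, llama.cpp, ...) implements this trait.
trait Engine {
    /// Base URL of the engine subprocess, e.g. "http://127.0.0.1:8000".
    fn endpoint(&self) -> &str;
    /// Proxy one request to the engine's /v1/chat/completions surface.
    fn generate(&self, req: &GenerateRequest) -> Result<GenerateResponse, String>;
}

/// Stub backend used here only to demonstrate the trait.
struct StubEngine {
    url: String,
}

impl Engine for StubEngine {
    fn endpoint(&self) -> &str {
        &self.url
    }
    fn generate(&self, req: &GenerateRequest) -> Result<GenerateResponse, String> {
        // A real implementation would POST to
        // format!("{}/v1/chat/completions", self.url) and stream the reply.
        Ok(GenerateResponse {
            text: format!("echo from {} for model {}", self.url, req.model),
        })
    }
}

fn main() {
    let engine = StubEngine { url: "http://127.0.0.1:8000".to_string() };
    let req = GenerateRequest { model: "llama-3.1-70b".into(), prompt: "hi".into() };
    let resp = engine.generate(&req).unwrap();
    println!("{}", resp.text);
}
```

Because the router only depends on the trait, swapping engines (or running several at once on one node) is a matter of instantiating a different implementation per configured `[engine]` block.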
Architecture
Router → gRPC stream (Agent.Generate) → cgn-agent → Engine trait → HTTP proxy to the engine subprocess (/v1/chat/completions). The agent writes NodeHealth heartbeats to etcd, reports GPU residency to cgn-kvcached over a Unix domain socket, and samples NVML telemetry for utilization metrics.
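The NodeHealth heartbeat can be sketched as follows. The field names (`node_id`, `role`, `inflight`, `gpu_sm_util`) are assumptions based on the telemetry the agent is described as collecting; the real schema may differ. JSON is rendered by hand here to keep the sketch dependency-free.

```rust
// Illustrative sketch of the NodeHealth heartbeat payload written to etcd.
// Field names are assumptions, not the actual cgn-agent schema.
struct NodeHealth {
    node_id: String,
    role: String,     // "prefill", "decode", or "both"
    inflight: u32,    // in-flight request count
    gpu_sm_util: f32, // NVML SM utilization, 0.0..=1.0
}

impl NodeHealth {
    /// Render the heartbeat as the JSON document stored under the node's etcd key.
    fn to_json(&self) -> String {
        format!(
            "{{\"node_id\":\"{}\",\"role\":\"{}\",\"inflight\":{},\"gpu_sm_util\":{:.2}}}",
            self.node_id, self.role, self.inflight, self.gpu_sm_util
        )
    }
}

fn main() {
    let hb = NodeHealth {
        node_id: "agent-gpu-01".into(),
        role: "both".into(),
        inflight: 3,
        gpu_sm_util: 0.87,
    };
    // The agent PUTs this value every `agent.heartbeat` interval (default 5s);
    // pairing the key with an etcd lease slightly longer than the interval
    // is a common pattern so that entries for dead nodes expire automatically.
    println!("{}", hb.to_json());
}
```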
Configuration
| Key | Type | Default | Description |
|---|---|---|---|
| agent.listen | string | 0.0.0.0:7080 | gRPC listen address |
| agent.role | enum | "both" | Agent role: "prefill", "decode", or "both" |
| agent.node_id | string | "" | Unique node identifier |
| agent.heartbeat | duration | "5s" | Health heartbeat interval to etcd |
| agent.kv_uds | string | "" | Unix domain socket path for cgn-kvcached |
| engine.kind | enum | "vllm" | Engine backend: vllm, sglang, llama_cpp, openai_compat |
| engine.url | string | "http://127.0.0.1:8000" | Engine HTTP endpoint |
| engine.kv_offload | enum | "none" | KV offload strategy: none, nixl, lmcache, hicache, kvbm |
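The "auto-renders engine-specific launch arguments" feature can be sketched from this configuration surface. The mapping below (`tp` → `--tensor-parallel-size` for vLLM, `--tp-size` for SGLang) reflects those engines' public CLIs, but the function name and exact flag set cgn-agent emits are assumptions.

```rust
// Hedged sketch: render per-engine launch arguments from config values.
// Flag names follow the vLLM and SGLang server CLIs; the real renderer
// in cgn-agent may cover more options (quantization, KV offload, etc.).
fn render_launch_args(kind: &str, model: &str, tp: u32, port: u16) -> Vec<String> {
    match kind {
        "vllm" => vec![
            "--model".into(), model.into(),
            "--tensor-parallel-size".into(), tp.to_string(),
            "--port".into(), port.to_string(),
        ],
        "sglang" => vec![
            "--model-path".into(), model.into(),
            "--tp-size".into(), tp.to_string(),
            "--port".into(), port.to_string(),
        ],
        other => panic!("unsupported engine kind: {other}"),
    }
}

fn main() {
    // Corresponds to engine.kind = "vllm" with a model configured at tp = 4.
    let args = render_launch_args("vllm", "llama-3.1-70b", 4, 8000);
    println!("{}", args.join(" "));
}
```

The same config keys thus drive different argument vectors per backend, which is what lets one TOML file describe a heterogeneous, multi-engine node.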
Example
```toml
[cluster]
name = "production"
state_backend = "etcd"
etcd_endpoints = ["http://etcd-0:2379"]

[security]
require_mtls = true

[agent]
listen = "0.0.0.0:7080"
role = "both"
node_id = "agent-gpu-01"
kv_uds = "/tmp/cognitora-kv.sock"

[engine]
kind = "vllm"
url = "http://127.0.0.1:8000"
kv_offload = "nixl"

[models."llama-3.1-70b"]
tp = 4
```
Performance
- Heartbeat interval: 5s (configurable)
- Engine proxy overhead: sub-millisecond per request
- Supports simultaneous multi-engine clusters on a single node