cgn-agent

Engine Supervisor

Per-node agent that drives any inference engine exposing the OpenAI HTTP surface. Handles telemetry, KV handoff, and multi-engine clusters.

Overview

cgn-agent sits between the router and the inference engine subprocess. It exposes a gRPC stream interface (Agent.Generate), proxies requests to the engine via a pluggable Engine trait, manages GPU residency tracking for KV cache awareness, and publishes health heartbeats to etcd. It supports vLLM, SGLang, llama.cpp, TensorRT-LLM, and any OpenAI-compatible endpoint.
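The pluggable backend abstraction can be pictured as a small trait object held by the agent. The sketch below is illustrative only: the trait, method names, and types are assumptions, not the actual cgn-agent API.

```rust
// Illustrative sketch only: trait and method names are assumptions,
// not the actual cgn-agent API.

/// A completed (non-streamed) generation result.
struct Completion {
    text: String,
}

/// Pluggable backend abstraction: each supported engine (vLLM, SGLang,
/// llama.cpp, ...) would implement this over its OpenAI-compatible HTTP API.
trait Engine {
    /// Backend identifier, e.g. "vllm" or "sglang".
    fn kind(&self) -> &'static str;
    /// Base URL of the engine subprocess, e.g. http://127.0.0.1:8000.
    fn base_url(&self) -> &str;
    /// Proxy a chat-completion request to the engine. A real implementation
    /// would POST to {base_url}/v1/chat/completions and stream tokens back.
    fn generate(&self, prompt: &str) -> Result<Completion, String>;
}

/// Minimal mock standing in for any OpenAI-compatible endpoint.
struct MockEngine {
    url: String,
}

impl Engine for MockEngine {
    fn kind(&self) -> &'static str { "openai_compat" }
    fn base_url(&self) -> &str { &self.url }
    fn generate(&self, prompt: &str) -> Result<Completion, String> {
        Ok(Completion { text: format!("echo: {prompt}") })
    }
}

fn main() {
    // The agent holds the engine behind a trait object, so backends swap
    // without touching the gRPC-facing code.
    let engine: Box<dyn Engine> = Box::new(MockEngine {
        url: "http://127.0.0.1:8000".to_string(),
    });
    let out = engine.generate("hello").unwrap();
    println!("{} via {}: {}", engine.kind(), engine.base_url(), out.text);
}
```

Keeping the engine behind a trait object is what lets one agent binary drive any of the supported backends purely from configuration.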

Features

  • Pluggable Engine trait: vLLM, SGLang, llama.cpp, TensorRT-LLM, OpenAI-compatible
  • Agent.Generate gRPC stream interface for router communication
  • NodeHealth JSON heartbeats to etcd every 5 seconds (configurable)
  • GPU residency index tracking for KV cache awareness
  • NixlConnector for prefill/decode KV handoff in disaggregated mode
  • NVML telemetry: in-flight request count, GPU SM utilization
  • Role-based operation: prefill-only, decode-only, or both
  • Pluggable KV offload via single TOML knob: none, nixl, lmcache, hicache, kvbm
  • Auto-renders engine-specific launch arguments from configuration
  • Simultaneous multi-engine clusters on the same node
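Role-based operation from the list above amounts to gating which request phases a node will accept. A minimal sketch, assuming hypothetical `Role` and `Phase` enums (not the actual cgn-agent types):

```rust
// Illustrative sketch of role-based gating; enum and function names are
// assumptions, not the actual cgn-agent types.

#[derive(PartialEq)]
enum Role {
    Prefill,
    Decode,
    Both,
}

#[derive(PartialEq)]
enum Phase {
    Prefill,
    Decode,
}

/// Decide whether an agent with the given role should accept a request for
/// the given phase. In disaggregated mode a prefill-only node computes the
/// prompt KV cache and hands it off (e.g. via NIXL) to a decode-only node.
fn accepts(role: &Role, phase: &Phase) -> bool {
    match (role, phase) {
        (Role::Both, _) => true,
        (Role::Prefill, Phase::Prefill) => true,
        (Role::Decode, Phase::Decode) => true,
        _ => false,
    }
}

fn main() {
    assert!(accepts(&Role::Both, &Phase::Decode));
    assert!(accepts(&Role::Prefill, &Phase::Prefill));
    assert!(!accepts(&Role::Prefill, &Phase::Decode));
    println!("role gating ok");
}
```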

Architecture

Router → gRPC stream (Agent.Generate) → cgn-agent → Engine trait → HTTP proxy to engine subprocess (/v1/chat/completions). The agent writes NodeHealth heartbeats to etcd and reports GPU residency to cgn-kvcached over a Unix domain socket; NVML telemetry is sampled for utilization metrics.
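The NodeHealth heartbeat can be sketched as a small struct serialized to JSON. The field names below are assumptions for illustration, not the actual schema cgn-agent writes to etcd:

```rust
// Illustrative sketch of a NodeHealth heartbeat payload; field names are
// assumptions, not the actual schema cgn-agent writes to etcd.

struct NodeHealth {
    node_id: String,
    role: String,
    in_flight: u32,   // in-flight request count
    gpu_sm_util: f32, // 0.0..=1.0, sampled via NVML
}

impl NodeHealth {
    /// Serialize by hand to keep the sketch dependency-free; a real agent
    /// would use a JSON library and write the payload under an etcd key
    /// with a lease somewhat longer than the heartbeat interval, so that a
    /// crashed node disappears from the router's view automatically.
    fn to_json(&self) -> String {
        format!(
            "{{\"node_id\":\"{}\",\"role\":\"{}\",\"in_flight\":{},\"gpu_sm_util\":{:.2}}}",
            self.node_id, self.role, self.in_flight, self.gpu_sm_util
        )
    }
}

fn main() {
    let hb = NodeHealth {
        node_id: "agent-gpu-01".to_string(),
        role: "both".to_string(),
        in_flight: 3,
        gpu_sm_util: 0.42,
    };
    println!("{}", hb.to_json());
}
```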

Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `agent.listen` | string | `0.0.0.0:7080` | gRPC listen address |
| `agent.role` | enum | `"both"` | Agent role: `"prefill"`, `"decode"`, or `"both"` |
| `agent.node_id` | string | `""` | Unique node identifier |
| `agent.heartbeat` | duration | `"5s"` | Health heartbeat interval to etcd |
| `agent.kv_uds` | string | `""` | Unix domain socket path for cgn-kvcached |
| `engine.kind` | enum | `"vllm"` | Engine backend: `vllm`, `sglang`, `llama_cpp`, `openai_compat` |
| `engine.url` | string | `"http://127.0.0.1:8000"` | Engine HTTP endpoint |
| `engine.kv_offload` | enum | `"none"` | KV offload strategy: `none`, `nixl`, `lmcache`, `hicache`, `kvbm` |

Example

```toml
[cluster]
name           = "production"
state_backend  = "etcd"
etcd_endpoints = ["http://etcd-0:2379"]

[security]
require_mtls = true

[agent]
listen   = "0.0.0.0:7080"
role     = "both"
node_id  = "agent-gpu-01"
kv_uds   = "/tmp/cognitora-kv.sock"

[engine]
kind       = "vllm"
url        = "http://127.0.0.1:8000"
kv_offload = "nixl"

[models."llama-3.1-70b"]
tp = 4
```

Performance

  • Heartbeat interval: 5s (configurable via agent.heartbeat)
  • Engine proxy overhead: sub-millisecond
  • Runs multiple engines simultaneously on a single node