Runbook: agent stuck / engine not ready
Symptoms: `cgn_agent_engine_ready{node_id=...} == 0` for > 2 min, and the router rejects traffic for the model with `503 unavailable: no live node serving model …`.
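A Prometheus alerting rule that captures this symptom, as a minimal sketch: the metric and the 2 min window come from this runbook, while the group name, severity label, and annotation wording are assumptions to fold into your existing rule files.

```yaml
groups:
  - name: cognitora-agent            # assumed group name
    rules:
      - alert: CgnEngineNotReady
        expr: cgn_agent_engine_ready == 0
        for: 2m                      # the "> 2 min" window from the symptom above
        labels:
          severity: page             # assumption: match your alert routing
        annotations:
          summary: "Engine not ready on node {{ $labels.node_id }} for > 2 min"
```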
Triage
- Identify the offender: query `cgn_agent_engine_ready == 0`, then on the host (if you only have the `node_id` label, see the pod-lookup sketch after this list):

  ```sh
  kubectl -n cognitora logs <agent-pod> --tail=500 --timestamps
  ```
- Common signals:
  - `engine readiness probe timed out` → vLLM is still loading weights. Some models take 5+ min on first start. The agent waits for `[agent].ready_timeout` (default 120s) and then restarts the engine; a config sketch for raising it follows this list.
  - `engine exited with code N` → vLLM crashed. Look one frame deeper in the log; usually OOM or a bad model spec.
  - `nvml: insufficient permissions` → the pod is missing the `nvidia.com/gpu` resource request, or the agent's user lacks `video`/`render` group membership on the host.
- Force-restart:

  ```sh
  kubectl -n cognitora delete pod <agent-pod>
  ```

  The DaemonSet recreates it; the engine reloads from cache. If weights are on object storage, pre-pull a copy to a local PVC to shorten future restarts.
- Drain instead of bouncing:

  ```sh
  cgn-ctl cluster drain <node_id>
  ```

  Drain stops the router from picking the node and lets in-flight requests finish. Once the in-flight count is 0, the agent is safe to restart; a wait-loop sketch follows this list.
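If the alert only carries the `node_id` label, the following maps it to the agent pod. This is a sketch: it assumes `node_id` matches the Kubernetes node name and that the DaemonSet labels its pods `app=cgn-agent`; substitute your own labels if they differ.

```sh
# Find the cgn-agent pod scheduled on the failing node.
# Assumptions: node_id == Kubernetes node name, pods carry an app=cgn-agent label.
kubectl -n cognitora get pods -l app=cgn-agent \
  --field-selector spec.nodeName=<node_id> -o name
```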
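For models that need more than the default 120s to load, raise `[agent].ready_timeout` so the agent does not restart an engine that is merely slow. A minimal config sketch, assuming the agent reads TOML; the duration format and everything besides the key itself are assumptions:

```toml
[agent]
# Allow slow first-time weight loads before the engine is considered hung.
# "600s" is an assumed duration format; check your agent's config reference.
ready_timeout = "600s"
```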
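To tell when a drain has finished, poll the node's in-flight gauge until it reads 0 before restarting. A sketch against the Prometheus HTTP API; the metric name `cgn_agent_inflight_requests` and the Prometheus address are assumptions, so substitute whichever gauge backs the in-flight count on your cluster:

```sh
# Wait until the drained node reports zero in-flight requests, then restart.
# cgn_agent_inflight_requests and the Prometheus URL are assumptions.
PROM=http://prometheus:9090
until [ "$(curl -s "$PROM/api/v1/query" \
    --data-urlencode 'query=cgn_agent_inflight_requests{node_id="<node_id>"}' \
    | jq -r '.data.result[0].value[1] // "0"')" = "0" ]; do
  sleep 5
done
kubectl -n cognitora delete pod <agent-pod>
```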
Prevention
- Pre-pull model weights (the operator's `ModelPool.spec.preload = true` does this; sketched below).
- Set a generous `livenessProbe.timeoutSeconds` on the agent for multi-tens-of-GB models; see the probe sketch below.
- Run NVML telemetry alongside the engine: `cgn_agent_gpu_mem_used` near max for > 1 min is the canary for OOM crashes.
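For reference, the preload flag in context. Everything around `spec.preload` here (apiVersion, name, the rest of the spec) is a hypothetical sketch, since this runbook only documents the field itself:

```yaml
apiVersion: cognitora.io/v1      # assumed group/version
kind: ModelPool
metadata:
  name: example-pool             # hypothetical
spec:
  preload: true                  # pre-pull weights so agents start warm
```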
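And a sketch of generous probe settings on the agent container for large models; the endpoint path, port, and every number are assumptions to tune per model size:

```yaml
livenessProbe:
  httpGet:
    path: /healthz               # assumed health endpoint
    port: 8080                   # assumed agent port
  timeoutSeconds: 10             # generous per-probe timeout
  periodSeconds: 30
  failureThreshold: 20           # tolerate ~10 min of slow loading before a kill
```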