Runbook: router not responding
Symptoms: external clients receive 502/504, or curl http://router:8080/healthz hangs / errors.
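To tell a hang from an error when reproducing the symptom, the health probe can be bounded with a timeout. This is a minimal sketch: the function name probe_healthz and the 5 second limit are illustrative assumptions, not part of the router tooling. curl reports 000 as the status code when no HTTP response was received at all.

```shell
# Hypothetical helper: probe a health endpoint with a hard timeout so a hung
# router cannot hang the shell. Prints the HTTP status code; curl prints 000
# when the request timed out or the connection failed.
probe_healthz() {
  url="${1:-http://router:8080/healthz}"
  curl --silent --output /dev/null --max-time 5 \
       --write-out '%{http_code}\n' "$url"
}

# Usage:
#   probe_healthz                              # default URL from the symptoms line
#   probe_healthz http://localhost:9091/healthz
```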
Triage
1. Check the deployment:
   kubectl -n cognitora get pods -l app.kubernetes.io/component=router
   - Pod Running and Ready 1/1? → continue to step 2.
   - Pod CrashLoopBackOff → step 4.
   - Pod stuck Pending → step 5.
2. Hit the admin port directly:
   kubectl -n cognitora port-forward svc/cognitora-router 9091:9091
   curl localhost:9091/healthz
   curl localhost:9091/metrics | grep cgn_router_admission_inflight
   - 200 + low inflight → 502 originates further upstream (LB, ingress). Check ingress controller and external LB.
   - 200 + inflight at [router.admission].max_queue → admission queue saturated. Step 6.
3. Tail logs:
   kubectl -n cognitora logs -l app.kubernetes.io/component=router --tail=200 --timestamps
   Common signals:
   - etcd watch failure → step 7.
   - engine unreachable (rare; agent-side) → look at the agent.
4. Crash loop. Pull the last log:
   kubectl -n cognitora logs <pod> --previous
   Mostly config errors at this stage:
   - require_mtls=true but cert/key/ca not set → check the pki Secret.
   - failed to bind 0.0.0.0:8080 → another container is using the host port; restart the node.
5. Pod Pending. Usually a missing PVC or a node that no longer matches the nodeSelector. The single best signal is:
   kubectl -n cognitora describe pod <name>
6. Admission queue saturated. The router is healthy; downstream agents are slow.
   - Increase [router.admission].max_queue if it's a transient spike.
   - More common cause: an agent is stuck waiting on the engine. Check cgn_agent_engine_ready per node; restart the agent on the offender.
7. etcd watcher failed. The router will keep serving traffic from its last-good policy snapshot but won't observe new agents.
   ETCDCTL_API=3 etcdctl --endpoints=$ETCD endpoint health
   - If etcd is degraded → fix etcd first. The router stays up during the outage.
   - If etcd is healthy → check the network policy between the router pod and etcd. mTLS material may have rotated.
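The decision points in steps 1 to 3 can be sketched as small shell filters that read command output on stdin. This is a minimal sketch assuming POSIX sh and awk. The function names are hypothetical, the max_queue value is passed as an argument rather than read from config, and the inflight gauge is assumed to be exposed as a single unlabeled sample line.

```shell
# Hypothetical triage helpers; none of these ship with the router.

# Step 1: map `kubectl get pods --no-headers` rows (NAME READY STATUS ...)
# to the next runbook step.
triage_router_pods() {
  awk '{
    step = "inspect manually"
    if ($3 == "Running" && $2 == "1/1") { step = "step 2" }
    else if ($3 == "CrashLoopBackOff") { step = "step 4" }
    else if ($3 == "Pending") { step = "step 5" }
    print $1 ": " step
  }'
}

# Step 2: compare the inflight gauge in /metrics output with max_queue ($1).
# Assumes a sample line of the form: cgn_router_admission_inflight <value>
admission_state() {
  awk -v max="$1" '
    $1 == "cgn_router_admission_inflight" { inflight = $2; seen = 1 }
    END {
      if (!seen) { print "metric not found"; exit 1 }
      if (inflight + 0 >= max + 0) { print "saturated: step 6" }
      else { print "ok: check ingress and external LB" }
    }'
}

# Step 3: count the two log signals named in the runbook.
log_signals() {
  awk '
    /etcd watch failure/ { etcd++ }
    /engine unreachable/ { engine++ }
    END {
      printf "etcd watch failure: %d (step 7)\n", etcd
      printf "engine unreachable: %d (check the agent)\n", engine
    }'
}

# Usage (256 is a placeholder for your configured max_queue):
#   kubectl -n cognitora get pods -l app.kubernetes.io/component=router --no-headers | triage_router_pods
#   curl -s localhost:9091/metrics | admission_state 256
#   kubectl -n cognitora logs -l app.kubernetes.io/component=router --tail=200 | log_signals
```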
Prevention
- Set router.replicas=2 minimum in production.
- Set a PodDisruptionBudget with minAvailable: 1.
- Run a Prometheus alert on cgn:router_p99_routing_us > 1500 for 10 min; the routing fast-path is the canary for everything else.
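The last two items above could look roughly like the manifests below. This is a sketch under stated assumptions: the resource name cognitora-router-pdb, the alert name RouterRoutingP99High, and the severity label are placeholders, and the selector assumes the router pods carry the same app.kubernetes.io/component=router label used in the triage commands. The helper only writes files; applying them is left to your deployment tooling.

```shell
# Hypothetical helper: write example guardrail manifests into the directory
# given as $1 (default: current directory). All resource names are placeholders.
write_router_guardrails() {
  dir="${1:-.}"

  # PodDisruptionBudget: keep at least one router pod during voluntary disruptions.
  cat > "$dir/router-pdb.yaml" <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cognitora-router-pdb
  namespace: cognitora
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: router
EOF

  # Prometheus alerting rule: fire when the condition has held for 10 minutes.
  cat > "$dir/router-alerts.yaml" <<'EOF'
groups:
  - name: cognitora-router
    rules:
      - alert: RouterRoutingP99High
        expr: cgn:router_p99_routing_us > 1500
        for: 10m
        labels:
          severity: page
EOF
}

# Usage:
#   write_router_guardrails ./manifests
#   kubectl apply -f ./manifests/router-pdb.yaml
#   # load router-alerts.yaml through the rule_files setting of your Prometheus config
```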