Runbook: router not responding

Symptoms: external clients receive 502/504, or curl http://router:8080/healthz hangs or returns an error.

Triage

  1. Check the deployment:

    kubectl -n cognitora get pods -l app.kubernetes.io/component=router
    
    • Pod Running and Ready 1/1? → continue to step 2.
    • Pod CrashLoopBackOff → step 4.
    • Pod stuck Pending → step 5.
  2. Hit the admin port directly:

    kubectl -n cognitora port-forward svc/cognitora-router 9091:9091
    curl localhost:9091/healthz
    curl localhost:9091/metrics | grep cgn_router_admission_inflight
    
    • 200 + low inflight → the 502 originates further upstream (LB, ingress). Check the ingress controller and external LB (sketch after this list).
    • 200 + inflight at [router.admission].max_queue → admission queue saturated. Step 6.
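
    If the router answers cleanly here, a sketch for checking the ingress layer (assumes an ingress-nginx controller in the ingress-nginx namespace; adjust names and labels for your cluster):

    # recent errors from the ingress controller
    kubectl -n ingress-nginx logs -l app.kubernetes.io/name=ingress-nginx --tail=100 --timestamps
    # confirm the Service backing the router still has healthy endpoints
    kubectl -n cognitora get endpoints cognitora-router
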
  3. Tail logs:

    kubectl -n cognitora logs -l app.kubernetes.io/component=router --tail=200 --timestamps
    

    Common signals (grep sketch after this list):

    • etcd watch failure → step 7.
    • engine unreachable (rare; agent-side) → look at the agent.
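
    A quick way to scan for both signals at once (the exact log strings may differ in your version):

    kubectl -n cognitora logs -l app.kubernetes.io/component=router --tail=500 \
      | grep -Ei 'etcd watch|engine unreachable'
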
  4. Crash loop. Pull the logs from the previous (crashed) container:

    kubectl -n cognitora logs <pod> --previous
    

    Mostly config errors at this stage:

    • require_mtls=true but cert/key/ca not set → check the pki Secret (sketch after this list).
    • failed to bind 0.0.0.0:8080 → another container is using the host port; restart the node.
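
    A sketch for inspecting the mTLS material; the Secret name cognitora-router-pki is an assumption, substitute whatever your deployment references:

    # list the keys in the Secret; expect cert, key, and CA entries
    kubectl -n cognitora get secret cognitora-router-pki -o jsonpath='{.data}' | tr ',' '\n'
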
  5. Pod Pending. Usually a missing PVC or a node that no longer matches the nodeSelector. kubectl describe pod <name> is the single best signal.
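
    For example (substitute the pod name):

    kubectl -n cognitora describe pod <name>
    # the Events section at the bottom usually names the blocker (unbound PVC, no matching node)
    kubectl -n cognitora get events --field-selector involvedObject.name=<name> --sort-by=.lastTimestamp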

  6. Admission queue saturated. The router is healthy; downstream agents are slow.

    • Increase [router.admission].max_queue if it's a transient spike.
    • More common cause: an agent is stuck waiting on the engine. Check cgn_agent_engine_ready per node and restart the agent on the offender (query sketch after this list).
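
    A sketch for finding and bouncing the offender; the agent label below mirrors the router's and is an assumption, as is the node label on the metric:

    # PromQL: nodes whose agent reports the engine as not ready
    min by (node) (cgn_agent_engine_ready) == 0

    # restart the agent pod on that node
    kubectl -n cognitora delete pod -l app.kubernetes.io/component=agent \
      --field-selector spec.nodeName=<node>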
  7. etcd watcher failed. The router will keep serving traffic from its last-good policy snapshot but won't observe new agents.

    ETCDCTL_API=3 etcdctl --endpoints=$ETCD endpoint health
    
    • If etcd is degraded → fix etcd first. The router stays up during the outage.
    • If etcd is healthy → check the network policy between the router pod and etcd; mTLS material may have rotated (sketch after this list).
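
    A sketch for the healthy-etcd case; the Secret name cognitora-router-pki and its tls.crt key are assumptions:

    # any NetworkPolicy that could block router-to-etcd traffic
    kubectl -n cognitora get networkpolicy

    # has the client certificate the router presents to etcd expired?
    kubectl -n cognitora get secret cognitora-router-pki -o jsonpath='{.data.tls\.crt}' \
      | base64 -d | openssl x509 -noout -enddate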

Prevention

  • Set router.replicas to at least 2 in production.
  • Set a PodDisruptionBudget with minAvailable: 1 (example below).
  • Run a Prometheus alert on cgn:router_p99_routing_us > 1500 for 10 minutes; the routing fast-path is the canary for everything else.
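
Hedged examples for the last two items; names, namespaces, and the alert wiring are assumptions to adapt:

    # pdb.yaml
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: cognitora-router
      namespace: cognitora
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app.kubernetes.io/component: router

    # prometheus-rules.yaml (assumes the recording rule cgn:router_p99_routing_us already exists)
    groups:
      - name: cognitora-router
        rules:
          - alert: RouterRoutingSlow
            expr: cgn:router_p99_routing_us > 1500
            for: 10m
            annotations:
              summary: Router p99 routing latency above 1.5 ms (1500 us) for 10 minutes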