Runbook: cache hit ratio collapsed
Symptoms: `cgn:cache_hit_ratio_5m` < 0.30 on a multi-replica model, p99 TTFT spikes, GPU utilisation low (no prefix reuse).
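A minimal Prometheus alert for this symptom might look like the sketch below. The rule name, `for` window, and severity label are illustrative, and it assumes the `cgn:cache_hit_ratio_5m` recording rule carries a `model` label; adjust to your own rule files.

```yaml
groups:
  - name: cgn-cache
    rules:
      - alert: CgnCacheHitRatioCollapsed        # illustrative name
        expr: cgn:cache_hit_ratio_5m < 0.30     # recording rule from the Symptoms line
        for: 10m                                # rides out cold-start windows
        labels:
          severity: page                        # assumption: match your paging policy
        annotations:
          summary: "Cache hit ratio below 30% for {{ $labels.model }}"
```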
Triage
- Confirm the metric:

  ```promql
  avg by (model) (avg_over_time(cgn_router_cache_hit_ratio[5m]))
  ```

  Was the model recently deployed? Cold starts always look bad for the first ~5 min; give it a window and re-check.
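  A quick way to check the cold-start theory is to look at how long the router processes have been up. This assumes the router exposes the standard Prometheus `process_start_time_seconds` metric and that its scrape job is named `cgn-router` (both assumptions, not confirmed by this runbook).

  ```promql
  # Seconds since each router process started; a few minutes of uptime
  # explains a low hit ratio on its own (cache not warmed yet)
  time() - process_start_time_seconds{job="cgn-router"}
  ```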
- Check sticky routing:

  ```promql
  sum by (model, node_id) (rate(cgn_router_requests_total[5m]))
  ```

  If the load is fanned out evenly across nodes for a high-prefix-reuse workload, the score weights are off:

  ```shell
  kubectl edit routingpolicy default
  # bump scoreWeights.kv to 0.7 (default 0.55), drop load to 0.15
  ```

  The router picks up the change in < 1 s via etcd.
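  If you prefer a non-interactive change (for a script or a change ticket), a merge patch along these lines should work; the `spec.scoreWeights.kv` / `load` field paths are taken from the comment above, but the exact CRD schema is an assumption.

  ```shell
  # Illustrative: assumes the RoutingPolicy CRD nests the weights under spec.scoreWeights
  kubectl patch routingpolicy default --type merge \
    -p '{"spec":{"scoreWeights":{"kv":0.7,"load":0.15}}}'
  ```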
- Check kvcached health:

  ```promql
  # Warm-tier block count
  cgn_kvcached_blocks{tier="ram"}

  # Overall lookup miss rate over the last 5m
  sum(rate(cgn_kvcached_lookup_seconds_count{outcome="miss"}[5m]))
    / sum(rate(cgn_kvcached_lookup_seconds_count[5m]))
  ```

  - Blocks near 0 → the daemon restarted and lost its in-memory state. Expected; it warms back in a few minutes.
  - High miss rate on RAM with low miss rate on SSD → SSD is holding the working set but the eviction policy on RAM is too aggressive. Bump `[kv].ram_gib`.
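  To separate those two cases, a per-tier breakdown helps. This assumes the lookup counter carries the same `tier` label as `cgn_kvcached_blocks`, which this runbook does not confirm.

  ```promql
  # Miss rate per cache tier over the last 5 minutes (assumes a tier label on lookups)
  sum by (tier) (rate(cgn_kvcached_lookup_seconds_count{outcome="miss"}[5m]))
    / sum by (tier) (rate(cgn_kvcached_lookup_seconds_count[5m]))
  ```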
- Check that prompts actually share prefixes. The most common "false miss" is a client adding a per-request timestamp to the system prompt. Sample:

  ```shell
  kubectl -n cognitora logs <router-pod> \
    | jq 'select(.target=="cgn_router::routing") | .prefix_hash' \
    | sort -u | head -20
  ```

  If every hash is unique, the workload doesn't reuse prefixes: the metric is reporting the truth.
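  To see how much reuse there actually is, count how many requests share each hash (same assumed log shape as the command above).

  ```shell
  # Top prefix hashes by request count; a long tail of count=1 means little reuse
  kubectl -n cognitora logs <router-pod> \
    | jq -r 'select(.target=="cgn_router::routing") | .prefix_hash' \
    | sort | uniq -c | sort -rn | head -20
  ```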
Knobs to twist (in order of safety)
- `routingpolicy.spec.scoreWeights.kv` ↑: favours nodes that already have the prefix cached.
- `[kv].ram_gib` ↑: bigger warm tier, fewer evictions.
- `[kv].ssd_gib` ↑: bigger cold tier, fewer cold-disk fetches.
- `[router.disagg].enabled = true` for prompts > 256 tokens: frees decode-only nodes to retain their KV better.
- (Last resort) Drop `[router.cascade]` if it's enabled: the cascade can fragment prefix sharing across model tiers.
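For reference, the node-config knobs above might look like this in the config file. Section and key names are taken from the list; the values and overall file shape are illustrative assumptions, not recommendations.

```toml
# Illustrative sketch only: sizes are examples, tune to your hardware
[kv]
ram_gib = 96     # warm tier; raise if RAM evictions are too aggressive
ssd_gib = 512    # cold tier; raise to keep more cold prefixes local

[router.disagg]
enabled = true   # prefill/decode disaggregation for long prompts
```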