SLOs

Cognitora defines two SLO tiers: platform-SLOs that apply to the control plane (the routing decision, the admission queue, the KV cache) and workload-SLOs that the operator owns (TTFT and TPS for each model).

Platform SLOs (CI-gated)

These are the same numbers exposed in the README's perf table. Each PR runs the perf harness in tests/perf/ and a regression > 5% of any of them blocks merge.

SLI	Target	Source
`cgn_router_routing_decision_seconds` p99	< 500 µs / vCPU	`cgn-router::routing::score`
HTTP overhead (router p99 − engine p99)	< 3 ms	`tests/perf/router_overhead.rs`
`cgn_kvcached_lookup_seconds` p99 (warm)	< 200 µs	`cgn-kvcached` RAM tier
`cgn_kvcached_lookup_seconds` p99 (cold)	< 5 ms	`cgn-kvcached` SSD tier
Cross-node 1 MiB block fetch p99 (10 GbE)	< 12 ms	QUIC transport
Cache hit ratio (representative trace)	≥ 0.55	`cgn:cache_hit_ratio_5m`
Energy efficiency vs round-robin baseline	≥ 1.4×	tokens/s ÷ Σ watts

Workload SLOs

Per-model, owned by whoever runs the deployment. Example targets we ship as defaults:

Model class	TTFT p95	Tokens/s p50 (per stream)
Small (≤8B)	200 ms	60
Mid (≤30B)	500 ms	35
Large	1.0 s	18

The router's admission control rejects requests once a node would violate the configured [router.admission].ttft_slo_ms for the model. That ttft_slo_ms value should be set per-cluster from the matching tier above.

Error budgets

A 30-day rolling window. Every SLO above gets a 99.5% target by default — that's 3.6 hours of error budget per month. The Helm chart's PrometheusRules emit:

cgn:slo_burn_rate_5m{model=...} — fast burn detector (1h, 14.4×)
cgn:slo_burn_rate_1h{model=...} — slow burn detector (6h, 6×)

Alerts fire when both windows are burning above their multiplier.

How we keep them honest

Every PR runs the perf harness. It uses synthetic prompts and pre-warmed KV blocks so the numbers are reproducible. CI fails if p99 regresses > 5%.
Release builds publish the numbers to the GitHub Release page so every tag has a reproducible baseline.
Production scrapes the same series. The dashboards alert on real traffic; we don't lean on synthetic-only.