Kubernetes guide

Cognitora ships as a Helm chart at deploy/kubernetes/helm/cognitora/.

Prerequisites

  • Kubernetes 1.28+ with the NVIDIA GPU Operator installed, so its driver and device-plugin DaemonSets run on every GPU node.
  • Helm 3.13+.
  • etcd reachable from the cluster (you can run it as a StatefulSet alongside Cognitora; future versions will package it).
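Until etcd is packaged with the chart, a minimal single-member StatefulSet is enough for evaluation. This is a sketch only: the image tag, service name, and lack of TLS/persistence are assumptions, not project defaults.

```yaml
# Sketch: single-member etcd for evaluation (no TLS, no quorum).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
  namespace: cognitora
spec:
  serviceName: etcd          # headless Service of the same name assumed to exist
  replicas: 1
  selector:
    matchLabels: { app: etcd }
  template:
    metadata:
      labels: { app: etcd }
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.12   # pin your own tag
          command:
            - etcd
            - --listen-client-urls=http://0.0.0.0:2379
            - --advertise-client-urls=http://etcd.cognitora.svc:2379
          ports:
            - name: client
              containerPort: 2379
```

For anything beyond a test cluster, run at least three members with TLS and persistent volumes.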

Install the CRDs

The CRDs are versioned independently of the chart so you can upgrade the operator without re-running helm install:

kubectl apply -f deploy/kubernetes/crds/
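To confirm the CRDs registered before installing the chart (the group name follows the apiVersion used in the resources below):

```shell
# Expect one CRD per Cognitora kind (ModelPool, RoutingPolicy, ...)
kubectl get crds | grep cognitora.dev
```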

Install the chart

Pin the version in production. From the repo:

helm install cognitora ./deploy/kubernetes/helm/cognitora \
  --namespace cognitora --create-namespace \
  --set router.replicas=2 \
  --set agent.resources.limits."nvidia\.com/gpu"=1 \
  --set kvcached.ramGib=16

Or from the GHCR OCI repository:

helm install cognitora oci://ghcr.io/cognitora/charts/cognitora \
  --namespace cognitora --create-namespace --version 0.1.0
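For anything beyond a quick test, the same settings are easier to review as a values file than as --set flags. This fragment mirrors the flags shown above:

```yaml
# values.yaml: equivalent to the --set flags above
router:
  replicas: 2
agent:
  resources:
    limits:
      nvidia.com/gpu: 1
kvcached:
  ramGib: 16
```

Then install with `helm install cognitora ... -f values.yaml` and keep the file in version control.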

Declarative model loading

apiVersion: cognitora.dev/v1alpha1
kind: ModelPool
metadata:
  name: llama3-70b
  namespace: cognitora
spec:
  model: meta-llama/Meta-Llama-3-70B-Instruct
  tensorParallel: 4
  dtype: bfloat16
  replicas: 2
  cascade:
    - llama3-8b           # try the SLM first
    - llama3-70b          # escalate when confidence drops

The operator translates the spec into Agent.LoadModel RPCs against the agents that match the pool's nodeSelector.
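After applying the resource you can watch the pool through the CRD as with any Kubernetes object. The local filename and the exact status output are assumptions; `kubectl get`/`describe` work for any registered CRD kind:

```shell
kubectl apply -f modelpool.yaml          # the ModelPool manifest above, saved locally
kubectl get modelpools -n cognitora      # list pools
kubectl describe modelpool llama3-70b -n cognitora   # events from the operator
```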

Routing policy

apiVersion: cognitora.dev/v1alpha1
kind: RoutingPolicy
metadata:
  name: default
  namespace: cognitora
spec:
  scoreWeights: { kv: 0.7, load: 0.2, power: 0.05, capacity: 0.05 }
  admission: { maxQueue: 8192, ttftSloMs: 600 }
  cascade:   { enabled: true, confidenceThreshold: -1.2 }

Editing this resource updates the live router without a restart: the operator publishes the JSON to etcd at /cognitora/routing/policy, and cgn-router's arc_swap watcher picks up the new weights within a second.
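One way to exercise that path is to patch the policy and read the key back from etcd. The key path comes from the text above; the patch payload is an example, and etcdctl must run somewhere that can reach your etcd endpoints:

```shell
# Shift weight from KV locality toward load (example values)
kubectl patch routingpolicy default -n cognitora --type merge \
  -p '{"spec":{"scoreWeights":{"kv":0.6,"load":0.3,"power":0.05,"capacity":0.05}}}'

# Verify the operator republished the policy
etcdctl get /cognitora/routing/policy
```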

Exposing the OpenAI surface

The chart's router.service.type defaults to ClusterIP. To expose publicly:

helm upgrade cognitora ./deploy/kubernetes/helm/cognitora \
  --reuse-values \
  --set router.service.type=LoadBalancer

Or wire it into your existing ingress controller:

router:
  ingress:
    enabled: true
    className: nginx
    host: api.example.com
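With either option in place, the router answers on the OpenAI wire format, so a smoke test is a plain chat completion. The host matches the ingress example above; the model name and request shape are assumptions based on the standard OpenAI API:

```shell
curl https://api.example.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3-70b", "messages": [{"role": "user", "content": "ping"}]}'
```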

Observability

Every binary exposes Prometheus metrics on its admin port (:9091 for router/agent/kvcached, :9092 for metrics). The chart ships a PodMonitor (templates/podmonitor.yaml), rendered only when the Prometheus Operator CRDs are present.
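To eyeball the metrics without Prometheus, port-forward an admin port and scrape it by hand. The Service name here is an assumption based on the release name; the port follows the text above:

```shell
# Forward the router's admin port locally (Service name assumed)
kubectl port-forward -n cognitora svc/cognitora-router 9091:9091 &
curl -s localhost:9091/metrics | head
```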