Serve LLM inference
at datacenter scale.
The open-source inference orchestration layer. Engine-agnostic, KV-aware, and Kubernetes-native — built in Rust for bare-metal-to-cloud deployments.
Overview
The inference orchestration layer
Cognitora doesn't replace your inference engine — it coordinates multiple engines into a unified, distributed cluster.
Sub-millisecond routing
Sequence-chained BLAKE3 digests with longest-prefix overlap. Under 500µs per routing decision (sketched below).
Engine-agnostic
Orchestrates vLLM, SGLang, TensorRT-LLM, and llama.cpp through a single OpenAI-compatible gateway.
Production-hardened
mTLS everywhere. Cosign-verified releases. OIDC, API-key, and RBAC out of the box.
1 to 1,000+ GPUs
Cross-datacenter federation via QUIC. Scale from a single node to a global fleet.
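To make the routing claim concrete, here is a minimal sketch of sequence-chained digests with longest-prefix overlap, assuming the `blake3` crate and a fixed block size; the types, block size, and per-node index are illustrative, not cgn-router's actual internals.

```rust
// Illustrative sketch of sequence-chained digests with longest-prefix
// overlap. Requires `blake3 = "1"` in Cargo.toml. Block size, types,
// and the per-node index are assumptions, not Cognitora's real layout.
use std::collections::HashMap;

const TOKENS_PER_BLOCK: usize = 16;

type Digest = [u8; 32];
type NodeId = u32;

/// Chain a digest per block: d[i] = BLAKE3(d[i-1] || block[i]).
/// Because each digest commits to everything before it, equal chains
/// imply equal token prefixes.
fn digest_chain(tokens: &[u32]) -> Vec<Digest> {
    let mut chain = Vec::new();
    let mut prev: Digest = [0; 32];
    for block in tokens.chunks(TOKENS_PER_BLOCK) {
        let mut hasher = blake3::Hasher::new();
        hasher.update(&prev); // chain in the previous digest
        for t in block {
            hasher.update(&t.to_le_bytes());
        }
        prev = *hasher.finalize().as_bytes();
        chain.push(prev);
    }
    chain
}

/// Pick the node whose cached chain shares the longest prefix
/// with the incoming request's chain.
fn route(request: &[Digest], nodes: &HashMap<NodeId, Vec<Digest>>) -> Option<NodeId> {
    nodes
        .iter()
        .map(|(id, cached)| {
            let overlap = request
                .iter()
                .zip(cached.iter())
                .take_while(|(a, b)| a == b)
                .count();
            (overlap, *id)
        })
        .max() // longest overlap wins; ties break on the larger NodeId
        .map(|(_, id)| id)
}

fn main() {
    let req: Vec<u32> = (0..64).collect();
    let chain = digest_chain(&req);
    let mut nodes = HashMap::new();
    nodes.insert(1, digest_chain(&req[..32])); // 2 blocks cached
    nodes.insert(2, digest_chain(&req[..48])); // 3 blocks cached
    println!("route to {:?}", route(&chain, &nodes)); // Some(2)
}
```

Since chains commit to their full history, comparing them block by block is equivalent to comparing token prefixes, so a routing decision reduces to a handful of 32-byte comparisons per node.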
Platform
Modular by design
Each component is independently deployable. Run the full stack on Kubernetes or pick the pieces you need.
cgn-router: KV-aware request router
OpenAI-compatible gateway. BLAKE3 sequence digests + longest-prefix overlap for KV-aware routing. Disaggregates prefill from decode.
cgn-agent: Engine supervisor
Per-node agent driving any OpenAI HTTP surface. NVML telemetry, KV handoff via NixlConnector, multi-engine clusters.
cgn-kvcached: Multi-tier KV cache
GPU/RAM/SSD tiered cache with cross-node QUIC/RDMA fetch. Pluggable offload: nixl, lmcache, hicache, or kvbm.
cgn-metrics: Power-aware telemetry
Prometheus aggregator with Redfish, IPMI, and DCGM. Power-aware routing deprioritizes thermally stressed nodes.
cgn-ctl: Admin CLI
Embeds Helm. Cluster bring-up, model deploys, traffic shaping, debugging. Same CRDs the operator manages.
cgn-operator: Kubernetes operator
kube-rs. CRDs for InferenceCluster, ModelPool, RoutingPolicy. Declarative model rollouts and traffic splits.
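For a feel of the declarative surface, here is a hedged sketch of what a ModelPool custom resource might look like with kube-rs derive macros; the API group, fields, and defaults are assumptions for illustration, not the operator's real schema.

```rust
// Hypothetical ModelPool CRD sketch using kube-rs derive macros.
// Requires `kube` (with the "derive" feature), `k8s-openapi`,
// `schemars`, `serde`, and `serde_json`. Group, version, and fields
// are illustrative, not the real schema.
use kube::{CustomResource, CustomResourceExt};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "cognitora.io", // assumed API group
    version = "v1alpha1",
    kind = "ModelPool",
    namespaced
)]
pub struct ModelPoolSpec {
    /// Model to serve, e.g. "llama-3.1-8b-instruct".
    pub model: String,
    /// Backend engine: "vllm", "sglang", "trtllm", or "llamacpp".
    pub engine: String,
    /// Replicas dedicated to prefill vs decode (disaggregated serving).
    pub prefill_replicas: u32,
    pub decode_replicas: u32,
}

fn main() {
    // The derive generates the full ModelPool resource type; printing
    // ModelPool::crd() yields the manifest the operator would register.
    println!(
        "{}",
        serde_json::to_string_pretty(&ModelPool::crd()).unwrap()
    );
}
```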
Features
Built for production GPU fleets
KV-Aware Routing
Sequence-chained BLAKE3 digests eliminate redundant prefill across the cluster.
Prefill/Decode Split
Separate replica pools via NixlConnector handoff. Tune hardware per phase.
Multi-Tier KV Cache
RAM (DashMap), SSD (RocksDB), and cross-node QUIC peer fetch. A sketch follows this list.
Energy-Aware Scheduling
Redfish/IPMI/DCGM telemetry. Deprioritize thermally stressed nodes.
Cross-Cluster Federation
Router forwards across clusters. KV cache peers via QUIC.
Multi-Model Cascade
SLM→LLM gating on log-probability. Save costs, preserve quality.
Pluggable KV Offload
One TOML knob: none, nixl, lmcache, hicache, or kvbm.
Engine-Agnostic Agent
Drives any process exposing /v1/chat/completions. Zero engine-specific code.
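As a rough illustration of the tiered lookup order (hot RAM map, then local SSD, then a peer fetch), here is a sketch using the `dashmap` and `rocksdb` crates. The peer tier is stubbed out, and all names are illustrative rather than cgn-kvcached's actual API.

```rust
// Minimal tiered KV lookup sketch: RAM (DashMap) -> SSD (RocksDB) -> peer.
// Requires `dashmap = "6"` and `rocksdb = "0.22"`; the peer tier is a stub.
// Types and method names are illustrative, not cgn-kvcached's real API.
use dashmap::DashMap;
use rocksdb::DB;

struct TieredKv {
    ram: DashMap<Vec<u8>, Vec<u8>>, // hot tier, concurrent in-memory map
    ssd: DB,                        // warm tier, persistent
}

impl TieredKv {
    fn open(path: &str) -> Self {
        Self {
            ram: DashMap::new(),
            ssd: DB::open_default(path).expect("open rocksdb"),
        }
    }

    /// Look up a KV block, promoting to RAM on a colder-tier hit.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        // 1. RAM tier: the < 200µs warm-hit path.
        if let Some(v) = self.ram.get(key) {
            return Some(v.value().clone());
        }
        // 2. SSD tier: the < 5ms cold-hit path; promote on hit.
        if let Ok(Some(v)) = self.ssd.get(key) {
            self.ram.insert(key.to_vec(), v.clone());
            return Some(v);
        }
        // 3. Cross-node fetch over QUIC would go here (stubbed).
        self.fetch_from_peer(key)
    }

    fn fetch_from_peer(&self, _key: &[u8]) -> Option<Vec<u8>> {
        None // placeholder: real code would issue a QUIC/RDMA read
    }

    fn put(&self, key: &[u8], val: &[u8]) {
        self.ram.insert(key.to_vec(), val.to_vec());
        let _ = self.ssd.put(key, val); // write-through to the SSD tier
    }
}

fn main() {
    let kv = TieredKv::open("/tmp/cgn-kv-demo");
    kv.put(b"block-0", b"kv-bytes");
    assert!(kv.get(b"block-0").is_some());
}
```

In the real system the cold path is selected by the offload knob (none, nixl, lmcache, hicache, or kvbm) rather than a stub.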
Compatibility
Engine support matrix
Any engine exposing the OpenAI HTTP surface. Feature depth varies by backend.
Capabilities tracked per backend (vLLM, SGLang, llama.cpp, TRT-LLM, OpenAI):

- OpenAI HTTP gateway
- KV-aware routing
- Prefill/decode disaggregation
- KV offload — LMCache
- KV offload — HiCache
- KV offload — KVBM
- Multi-tier KV (RAM/SSD)
- Multi-model cascade
- Energy-aware admission

Some combinations are wired through config, awaiting upstream support; see the repository for the per-engine matrix.
Architecture
Disaggregated by design
Prefill and decode on separate pools. KV-aware routing places requests where caches live. Internal gRPC with mTLS everywhere.
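As a flavor of what mTLS on the internal gRPC plane can look like, here is a minimal client-channel sketch using tonic's TLS support; the certificate paths, hostname, and port are assumptions for illustration, not Cognitora's actual defaults.

```rust
// Minimal mTLS gRPC channel sketch using tonic (with the "tls" feature)
// and tokio. Paths, hostname, and port are illustrative assumptions.
use tonic::transport::{Certificate, Channel, ClientTlsConfig, Identity};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Cluster CA plus this node's client certificate and key.
    let ca = Certificate::from_pem(std::fs::read("certs/ca.pem")?);
    let identity = Identity::from_pem(
        std::fs::read("certs/client.pem")?,
        std::fs::read("certs/client.key")?,
    );

    // Mutual TLS: the server presents a CA-signed cert, and this
    // client presents its own identity in return.
    let tls = ClientTlsConfig::new()
        .ca_certificate(ca)
        .identity(identity)
        .domain_name("cgn-agent.internal"); // must match the server cert

    let channel: Channel = Channel::from_static("https://cgn-agent.internal:50051")
        .tls_config(tls)?
        .connect()
        .await?;

    // Generated gRPC clients (tonic-build output) wrap this channel.
    let _ = channel;
    Ok(())
}
```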
Bare Metal
systemd on Linux
Kubernetes
Helm + operator CRDs
Cloud
Terraform: AWS, GCP, Azure, Hetzner
Local Dev
Docker, Ollama for macOS
Use Cases
From single models to fleet orchestration
Reasoning Models
MoE and chain-of-thought workloads benefit from disaggregated prefill/decode. Split compute phases across specialized pools.
Kubernetes AI Scaling
Deploy across GKE, EKS, AKS with Helm and a native operator. CRDs for InferenceCluster, ModelPool, RoutingPolicy.
AI Agents
Agent workloads reuse context heavily. KV-aware routing lands follow-ups where the prefix cache lives.
Code Generation
Large repo contexts cached across RAM, SSD, and peer nodes. Hot contexts stay resident, cold ones page transparently.
Cost-Optimized Serving
SLM→LLM cascade gates on log-probability. Cut costs on straightforward queries without sacrificing quality.
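To make the gate concrete, here is a small sketch of log-probability gating: serve the small model's draft when its mean token log-probability clears a threshold, otherwise escalate. The `query` helper and the -0.8 threshold are hypothetical stand-ins.

```rust
// SLM -> LLM cascade sketch gated on mean token log-probability.
// `query` is a hypothetical stand-in for a gateway call with logprobs
// enabled; the -0.8 threshold is illustrative, not a default.

struct Completion {
    text: String,
    token_logprobs: Vec<f64>, // per-token log-probabilities
}

fn mean_logprob(c: &Completion) -> f64 {
    if c.token_logprobs.is_empty() {
        return f64::NEG_INFINITY; // empty output: always escalate
    }
    c.token_logprobs.iter().sum::<f64>() / c.token_logprobs.len() as f64
}

fn cascade(prompt: &str, gate: f64) -> Completion {
    let draft = query("slm-pool", prompt); // cheap small model first
    if mean_logprob(&draft) >= gate {
        draft // confident enough: serve the cheap answer
    } else {
        query("llm-pool", prompt) // low confidence: escalate
    }
}

// Stub so the sketch compiles; real code would hit the OpenAI-compatible
// gateway with logprobs requested and read them from the response.
fn query(pool: &str, prompt: &str) -> Completion {
    Completion {
        text: format!("[{pool}] answer to: {prompt}"),
        token_logprobs: vec![-0.3, -0.5, -0.2],
    }
}

fn main() {
    let out = cascade("What is 2 + 2?", -0.8);
    println!("{}", out.text);
}
```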
Performance
CI-gated targets
Every merge to main is gated on these numbers. Harnesses live in the repo; a minimal gate is sketched below.
Routing decision
< 500µs
per vCPU, CI-gated
HTTP overhead
< 3ms
p99 vs direct engine
Warm cache hit
< 200µs
RAM tier lookup
Cold cache hit
< 5ms
SSD/RocksDB tier
Cross-node fetch
< 12ms
1 MiB, QUIC, 10 GbE
Cache hit ratio
≥ 55%
representative traces
Energy efficiency
1.4×
vs round-robin
Model bring-up
< 30s
using recipes
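As an idea of how such a budget can be enforced, here is a hedged sketch of a CI latency gate for the routing target; the hashed-prompt workload and iteration count are stand-ins, not the repo's actual harness.

```rust
// Hypothetical CI latency gate for the "< 500µs routing decision" budget.
// The hashed-prompt workload stands in for a real routing decision; the
// repo's actual harnesses will differ.
use std::time::{Duration, Instant};

fn routing_decision(prompt: &[u8]) -> u64 {
    // Stand-in work: fold the prompt into a node index (FNV-1a).
    let mut acc: u64 = 0xcbf29ce484222325;
    for b in prompt {
        acc ^= *b as u64;
        acc = acc.wrapping_mul(0x100000001b3);
    }
    acc % 64 // pick one of 64 hypothetical nodes
}

#[test]
fn routing_decision_under_500us() {
    let prompt = vec![7u8; 8192];
    let iters: u32 = 10_000;

    let start = Instant::now();
    let mut sink = 0u64;
    for _ in 0..iters {
        sink = sink.wrapping_add(routing_decision(&prompt));
    }
    let per_call = start.elapsed() / iters;

    assert!(sink != u64::MAX); // keep the loop from being optimized away
    assert!(
        per_call < Duration::from_micros(500),
        "routing decision took {per_call:?} per call"
    );
}
```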
Recipes
Production deployment profiles
Each recipe lives at model/engine/topology/ — a 3-line up.sh has you serving in under 30s.
| Model | Engine | Topology |
|---|---|---|
| Llama-3.1 8B | vLLM | 1×A100 aggregated |
| Llama-3.1 8B | vLLM | 1×A100 disaggregated |
| Llama-3.1 8B | SGLang | 1×A100 |
| Llama-3.1 8B | vLLM + LMCache | 1×A100 |
| Llama-3.3 70B FP8 | vLLM | TP=4 multi-GPU |
| Qwen3-7B | vLLM | 1×GPU |
| Qwen3-7B | SGLang | 1×GPU |
| DeepSeek-v4-Flash | vLLM | Multi-GPU |
| DeepSeek-v4-Flash | SGLang | Multi-GPU |
$ cd recipes/llama-3.1-8b/vllm/1xa100 && ./up.sh

Quickstart
One command to install
# x86_64 and aarch64 — auto-detected
curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh

# pin a specific version
CGN_VERSION=v0.1.0 \
curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh

# verify
cgn-ctl version

# send a first request through the gateway
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
  }'

Community
Built in the open
Contribute
Issues, PRs, design discussions. Apache-2.0, CLA-free.
Discussions
Architecture proposals, deployment questions, roadmap.
Docs
Guides, CRD references, runbooks — live from GitHub.
Releases
Changelogs, migration notes. Cosign-verified artifacts.
Ready to deploy?
Star the repo, follow the quickstart, bring distributed inference to your fleet.