Serve LLM inference
at datacenter scale.
The open-source inference orchestration layer. Engine-agnostic, KV-aware, and Kubernetes-native — built in Rust for bare-metal-to-cloud deployments.
Overview
The inference orchestration layer
Cognitora doesn't replace your inference engine — it coordinates multiple engines into a unified, distributed cluster.
Sub-millisecond routing
Sequence-chained BLAKE3 digests with longest-prefix overlap. Under 500µs per routing decision (sketched below).
Engine-agnostic
Orchestrates vLLM, SGLang, TensorRT-LLM, and llama.cpp through a single OpenAI-compatible gateway.
Production-hardened
mTLS everywhere. Cosign-verified releases. OIDC, API-key, and RBAC out of the box.
1 to 1,000+ GPUs
Cross-datacenter federation via QUIC. Scale from a single node to a global fleet.
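To make the routing claim concrete, here is a minimal sketch of sequence-chained digests with longest-prefix overlap, assuming the `blake3` crate and a fixed block size; the types, block size, and per-node index are illustrative, not cgn-router's actual internals.

```rust
// Illustrative sketch of sequence-chained digests with longest-prefix
// overlap. Requires `blake3 = "1"` in Cargo.toml. Block size, types,
// and the per-node index are assumptions, not Cognitora's real layout.
use std::collections::HashMap;

const TOKENS_PER_BLOCK: usize = 16;

type Digest = [u8; 32];
type NodeId = u32;

/// Chain a digest per block: d[i] = BLAKE3(d[i-1] || block[i]).
/// Because each digest commits to everything before it, equal chains
/// imply equal token prefixes.
fn digest_chain(tokens: &[u32]) -> Vec<Digest> {
    let mut chain = Vec::new();
    let mut prev: Digest = [0; 32];
    for block in tokens.chunks(TOKENS_PER_BLOCK) {
        let mut hasher = blake3::Hasher::new();
        hasher.update(&prev); // chain in the previous digest
        for t in block {
            hasher.update(&t.to_le_bytes());
        }
        prev = *hasher.finalize().as_bytes();
        chain.push(prev);
    }
    chain
}

/// Pick the node whose cached chain shares the longest prefix
/// with the incoming request's chain.
fn route(request: &[Digest], nodes: &HashMap<NodeId, Vec<Digest>>) -> Option<NodeId> {
    nodes
        .iter()
        .map(|(id, cached)| {
            let overlap = request
                .iter()
                .zip(cached.iter())
                .take_while(|(a, b)| a == b)
                .count();
            (overlap, *id)
        })
        .max() // longest overlap wins; ties break on the larger NodeId
        .map(|(_, id)| id)
}

fn main() {
    let req: Vec<u32> = (0..64).collect();
    let chain = digest_chain(&req);
    let mut nodes = HashMap::new();
    nodes.insert(1, digest_chain(&req[..32])); // 2 blocks cached
    nodes.insert(2, digest_chain(&req[..48])); // 3 blocks cached
    println!("route to {:?}", route(&chain, &nodes)); // Some(2)
}
```

Since chains commit to their full history, comparing them block by block is equivalent to comparing token prefixes, so a routing decision reduces to a handful of 32-byte comparisons per node.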
Platform
Modular by design
Each component is independently deployable. Run the full stack on Kubernetes or pick the pieces you need.
cgn-router: KV-aware request router
OpenAI-compatible gateway. BLAKE3 sequence digests + longest-prefix overlap for KV-aware routing. Disaggregates prefill from decode.
cgn-agent: Engine supervisor
Per-node agent driving any OpenAI HTTP surface. NVML telemetry, KV handoff via NixlConnector, multi-engine clusters.
cgn-kvcached: Multi-tier KV cache
GPU/RAM/SSD tiered cache with cross-node QUIC/RDMA fetch. Pluggable offload: nixl, lmcache, hicache, or kvbm.
cgn-metrics: Power-aware telemetry
Prometheus aggregator with Redfish, IPMI, and DCGM. Power-aware routing deprioritizes thermally stressed nodes.
cgn-ctl: Admin CLI
Embeds Helm. Cluster bring-up, model deploys, traffic shaping, debugging. Same CRDs the operator manages.
cgn-operator: Kubernetes operator
kube-rs. CRDs for InferenceCluster, ModelPool, RoutingPolicy. Declarative model rollouts and traffic splits.
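For a feel of the declarative surface, here is a hedged sketch of what a ModelPool custom resource might look like with kube-rs derive macros; the API group, fields, and defaults are assumptions for illustration, not the operator's real schema.

```rust
// Hypothetical ModelPool CRD sketch using kube-rs derive macros.
// Requires `kube` (with the "derive" feature), `k8s-openapi`,
// `schemars`, `serde`, and `serde_json`. Group, version, and fields
// are illustrative, not the real schema.
use kube::{CustomResource, CustomResourceExt};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "cognitora.io", // assumed API group
    version = "v1alpha1",
    kind = "ModelPool",
    namespaced
)]
pub struct ModelPoolSpec {
    /// Model to serve, e.g. "llama-3.1-8b-instruct".
    pub model: String,
    /// Backend engine: "vllm", "sglang", "trtllm", or "llamacpp".
    pub engine: String,
    /// Replicas dedicated to prefill vs decode (disaggregated serving).
    pub prefill_replicas: u32,
    pub decode_replicas: u32,
}

fn main() {
    // The derive generates the full ModelPool resource type; printing
    // ModelPool::crd() yields the manifest the operator would register.
    println!(
        "{}",
        serde_json::to_string_pretty(&ModelPool::crd()).unwrap()
    );
}
```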
Features
Built for production GPU fleets
KV-Aware Routing
Sequence-chained BLAKE3 digests eliminate redundant prefill across the cluster.
Prefill/Decode Split
Separate replica pools via NixlConnector handoff. Tune hardware per phase.
Multi-Tier KV Cache
RAM (DashMap), SSD (RocksDB), and cross-node QUIC peer fetch. A sketch follows this list.
Energy-Aware Scheduling
Redfish/IPMI/DCGM telemetry. Deprioritize thermally stressed nodes.
Cross-Cluster Federation
Router forwards across clusters. KV cache peers via QUIC.
Multi-Model Cascade
SLM→LLM gating on log-probability. Save costs, preserve quality.
Pluggable KV Offload
One TOML knob: none, nixl, lmcache, hicache, or kvbm.
Engine-Agnostic Agent
Drives any process exposing /v1/chat/completions. Zero engine-specific code.
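As a rough illustration of the tiered lookup order (hot RAM map, then local SSD, then a peer fetch), here is a sketch using the `dashmap` and `rocksdb` crates. The peer tier is stubbed out, and all names are illustrative rather than cgn-kvcached's actual API.

```rust
// Minimal tiered KV lookup sketch: RAM (DashMap) -> SSD (RocksDB) -> peer.
// Requires `dashmap = "6"` and `rocksdb = "0.22"`; the peer tier is a stub.
// Types and method names are illustrative, not cgn-kvcached's real API.
use dashmap::DashMap;
use rocksdb::DB;

struct TieredKv {
    ram: DashMap<Vec<u8>, Vec<u8>>, // hot tier, concurrent in-memory map
    ssd: DB,                        // warm tier, persistent
}

impl TieredKv {
    fn open(path: &str) -> Self {
        Self {
            ram: DashMap::new(),
            ssd: DB::open_default(path).expect("open rocksdb"),
        }
    }

    /// Look up a KV block, promoting to RAM on a colder-tier hit.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        // 1. RAM tier: the < 200µs warm-hit path.
        if let Some(v) = self.ram.get(key) {
            return Some(v.value().clone());
        }
        // 2. SSD tier: the < 5ms cold-hit path; promote on hit.
        if let Ok(Some(v)) = self.ssd.get(key) {
            self.ram.insert(key.to_vec(), v.clone());
            return Some(v);
        }
        // 3. Cross-node fetch over QUIC would go here (stubbed).
        self.fetch_from_peer(key)
    }

    fn fetch_from_peer(&self, _key: &[u8]) -> Option<Vec<u8>> {
        None // placeholder: real code would issue a QUIC/RDMA read
    }

    fn put(&self, key: &[u8], val: &[u8]) {
        self.ram.insert(key.to_vec(), val.to_vec());
        let _ = self.ssd.put(key, val); // write-through to the SSD tier
    }
}

fn main() {
    let kv = TieredKv::open("/tmp/cgn-kv-demo");
    kv.put(b"block-0", b"kv-bytes");
    assert!(kv.get(b"block-0").is_some());
}
```

In the real system the cold path is selected by the offload knob (none, nixl, lmcache, hicache, or kvbm) rather than a stub.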
Compatibility
Engine support matrix
Any engine exposing the OpenAI HTTP surface. Feature depth varies by backend.
Capabilities tracked per backend (vLLM, SGLang, llama.cpp, TRT-LLM, OpenAI):

- OpenAI HTTP gateway
- KV-aware routing
- Prefill/decode disaggregation
- KV offload — LMCache
- KV offload — HiCache
- KV offload — KVBM
- Multi-tier KV (RAM/SSD)
- Multi-model cascade
- Energy-aware admission

Some combinations are wired through config, awaiting upstream support; see the repository for the per-engine matrix.
Architecture
Disaggregated by design
Prefill and decode on separate pools. KV-aware routing places requests where caches live. Internal gRPC with mTLS everywhere.
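As a flavor of what mTLS on the internal gRPC plane can look like, here is a minimal client-channel sketch using tonic's TLS support; the certificate paths, hostname, and port are assumptions for illustration, not Cognitora's actual defaults.

```rust
// Minimal mTLS gRPC channel sketch using tonic (with the "tls" feature)
// and tokio. Paths, hostname, and port are illustrative assumptions.
use tonic::transport::{Certificate, Channel, ClientTlsConfig, Identity};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Cluster CA plus this node's client certificate and key.
    let ca = Certificate::from_pem(std::fs::read("certs/ca.pem")?);
    let identity = Identity::from_pem(
        std::fs::read("certs/client.pem")?,
        std::fs::read("certs/client.key")?,
    );

    // Mutual TLS: the server presents a CA-signed cert, and this
    // client presents its own identity in return.
    let tls = ClientTlsConfig::new()
        .ca_certificate(ca)
        .identity(identity)
        .domain_name("cgn-agent.internal"); // must match the server cert

    let channel: Channel = Channel::from_static("https://cgn-agent.internal:50051")
        .tls_config(tls)?
        .connect()
        .await?;

    // Generated gRPC clients (tonic-build output) wrap this channel.
    let _ = channel;
    Ok(())
}
```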
Bare Metal
systemd on Linux
Kubernetes
Helm + operator CRDs
Cloud
Terraform: AWS, GCP, Azure, Hetzner
Local Dev
Docker, Ollama for macOS
Use Cases
From single models to fleet orchestration
Reasoning Models
MoE and chain-of-thought workloads benefit from disaggregated prefill/decode. Split compute phases across specialized pools.
Kubernetes AI Scaling
Deploy across GKE, EKS, AKS with Helm and a native operator. CRDs for InferenceCluster, ModelPool, RoutingPolicy.
AI Agents
Agent workloads reuse context heavily. KV-aware routing lands follow-ups where the prefix cache lives.
Code Generation
Large repo contexts cached across RAM, SSD, and peer nodes. Hot contexts stay resident, cold ones page transparently.
Cost-Optimized Serving
SLM→LLM cascade gates on log-probability. Cut costs on straightforward queries without sacrificing quality.
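To make the gate concrete, here is a small sketch of log-probability gating: serve the small model's draft when its mean token log-probability clears a threshold, otherwise escalate. The `query` helper and the -0.8 threshold are hypothetical stand-ins.

```rust
// SLM -> LLM cascade sketch gated on mean token log-probability.
// `query` is a hypothetical stand-in for a gateway call with logprobs
// enabled; the -0.8 threshold is illustrative, not a default.

struct Completion {
    text: String,
    token_logprobs: Vec<f64>, // per-token log-probabilities
}

fn mean_logprob(c: &Completion) -> f64 {
    if c.token_logprobs.is_empty() {
        return f64::NEG_INFINITY; // empty output: always escalate
    }
    c.token_logprobs.iter().sum::<f64>() / c.token_logprobs.len() as f64
}

fn cascade(prompt: &str, gate: f64) -> Completion {
    let draft = query("slm-pool", prompt); // cheap small model first
    if mean_logprob(&draft) >= gate {
        draft // confident enough: serve the cheap answer
    } else {
        query("llm-pool", prompt) // low confidence: escalate
    }
}

// Stub so the sketch compiles; real code would hit the OpenAI-compatible
// gateway with logprobs requested and read them from the response.
fn query(pool: &str, prompt: &str) -> Completion {
    Completion {
        text: format!("[{pool}] answer to: {prompt}"),
        token_logprobs: vec![-0.3, -0.5, -0.2],
    }
}

fn main() {
    let out = cascade("What is 2 + 2?", -0.8);
    println!("{}", out.text);
}
```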
Performance
CI-gated targets
Every merge to main is gated on these numbers. Harnesses live in the repo; a minimal gate is sketched below.
Routing decision
< 500µs
per vCPU, CI-gated
HTTP overhead
< 3ms
p99 vs direct engine
Warm cache hit
< 200µs
RAM tier lookup
Cold cache hit
< 5ms
SSD/RocksDB tier
Cross-node fetch
< 12ms
1 MiB, QUIC, 10 GbE
Cache hit ratio
≥ 55%
representative traces
Energy efficiency
1.4×
vs round-robin
Model bring-up
< 30s
using recipes
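As an idea of how such a budget can be enforced, here is a hedged sketch of a CI latency gate for the routing target; the hashed-prompt workload and iteration count are stand-ins, not the repo's actual harness.

```rust
// Hypothetical CI latency gate for the "< 500µs routing decision" budget.
// The hashed-prompt workload stands in for a real routing decision; the
// repo's actual harnesses will differ.
use std::time::{Duration, Instant};

fn routing_decision(prompt: &[u8]) -> u64 {
    // Stand-in work: fold the prompt into a node index (FNV-1a).
    let mut acc: u64 = 0xcbf29ce484222325;
    for b in prompt {
        acc ^= *b as u64;
        acc = acc.wrapping_mul(0x100000001b3);
    }
    acc % 64 // pick one of 64 hypothetical nodes
}

#[test]
fn routing_decision_under_500us() {
    let prompt = vec![7u8; 8192];
    let iters: u32 = 10_000;

    let start = Instant::now();
    let mut sink = 0u64;
    for _ in 0..iters {
        sink = sink.wrapping_add(routing_decision(&prompt));
    }
    let per_call = start.elapsed() / iters;

    assert!(sink != u64::MAX); // keep the loop from being optimized away
    assert!(
        per_call < Duration::from_micros(500),
        "routing decision took {per_call:?} per call"
    );
}
```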
Recipes
Production deployment profiles
Each recipe lives at model/engine/topology/ — a 3-line up.sh has you serving in under 30s.
| Model | Engine | Topology |
|---|---|---|
| Llama-3.1 8B | vLLM | 1×A100 aggregated |
| Llama-3.1 8B | vLLM | 1×A100 disaggregated |
| Llama-3.1 8B | SGLang | 1×A100 |
| Llama-3.1 8B | vLLM + LMCache | 1×A100 |
| Llama-3.3 70B FP8 | vLLM | TP=4 multi-GPU |
| Qwen3-7B | vLLM | 1×GPU |
| Qwen3-7B | SGLang | 1×GPU |
| DeepSeek-v4-Flash | vLLM | Multi-GPU |
| DeepSeek-v4-Flash | SGLang | Multi-GPU |
$ cd recipes/llama-3.1-8b/vllm/1xa100 && ./up.sh

Quickstart
One command to install
# x86_64 and aarch64 — auto-detected
curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh

# pin a specific version
CGN_VERSION=v0.1.0 \
curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh

# verify
cgn-ctl version

# send a first request through the gateway
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
  }'

Community
Built in the open
Contribute
Issues, PRs, design discussions. Apache-2.0, CLA-free.
Discussions
Architecture proposals, deployment questions, roadmap.
Docs
Guides, CRD references, runbooks — live from GitHub.
Releases
Changelogs, migration notes. Cosign-verified artifacts.
Ready to deploy?
Star the repo, follow the quickstart, bring distributed inference to your fleet.