cognitora
Open-source · Apache-2.0

Serve LLM inference
at datacenter scale.

The open-source inference orchestration layer. Engine-agnostic, KV-aware, and Kubernetes-native — built in Rust for bare-metal-to-cloud deployments.

Rust · OpenAI API · vLLM · SGLang · llama.cpp · TensorRT-LLM · Kubernetes · QUIC/RDMA

Overview

The inference orchestration layer

Cognitora doesn't replace your inference engine — it coordinates multiple engines into a unified, distributed cluster.

Sub-millisecond routing

Sequence-chained BLAKE3 digests with longest-prefix overlap. Under 500µs per routing decision.

Engine-agnostic

Orchestrates vLLM, SGLang, TensorRT-LLM, and llama.cpp through a single OpenAI-compatible gateway.

Production-hardened

mTLS everywhere. Cosign-verified releases. OIDC, API-key, and RBAC out of the box.

1 to 1,000+ GPUs

Cross-datacenter federation via QUIC. Scale from a single node to a global fleet.

Features

Built for production GPU fleets

KV-Aware Routing

Sequence-chained BLAKE3 digests eliminate redundant prefill across the cluster.
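The routing trick in miniature: chain per-block BLAKE3 digests so that two requests sharing a prefix share a digest chain, then send each request to the node with the deepest overlap. A minimal sketch assuming the blake3 crate; the block size and function names are illustrative, not Cognitora's internal API.

// Illustrative sketch, not Cognitora's internal API (assumes the blake3 crate).
// Each block digest folds in the previous digest, so equal prefixes yield
// equal digest chains no matter how the suffixes differ.
use std::collections::HashSet;

const BLOCK: usize = 16; // tokens per block -- made-up value for this sketch

fn chain_digests(tokens: &[u32]) -> Vec<[u8; 32]> {
    let mut prev = *blake3::hash(b"").as_bytes();
    tokens
        .chunks(BLOCK)
        .map(|block| {
            let mut h = blake3::Hasher::new();
            h.update(&prev); // chaining: block i depends on blocks 0..i
            for t in block {
                h.update(&t.to_le_bytes());
            }
            prev = *h.finalize().as_bytes();
            prev
        })
        .collect()
}

// Longest-prefix overlap: how many leading blocks a node already holds.
fn prefix_overlap(request: &[[u8; 32]], cached: &HashSet<[u8; 32]>) -> usize {
    request.iter().take_while(|d| cached.contains(*d)).count()
}

fn main() {
    let shared: Vec<u32> = (0..64).collect(); // e.g. a common system prompt
    let a = [shared.clone(), vec![7, 8, 9]].concat();
    let b = [shared, vec![1, 2, 3]].concat();

    // Suppose one node already holds A's KV blocks; B should land there too.
    let node_cache: HashSet<[u8; 32]> = chain_digests(&a).into_iter().collect();
    println!("reusable prefix blocks: {}", prefix_overlap(&chain_digests(&b), &node_cache));
}

Because the chain diverges at the first differing block, the overlap check is a plain prefix scan over 32-byte digests, which is what keeps per-request routing cheap.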

Prefill/Decode Split

Separate replica pools via NixlConnector handoff. Tune hardware per phase.

Multi-Tier KV Cache

RAM (DashMap), SSD (RocksDB), and cross-node QUIC peer fetch.
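A rough sketch of that tier order, assuming the dashmap, rocksdb, and blake3 crates; the struct, promote-on-hit policy, and stubbed peer fetch are illustrative rather than the actual cgn-kvcached code.

// Illustrative tiered lookup -- not the cgn-kvcached implementation.
// Order: RAM (DashMap) -> SSD (RocksDB) -> cross-node peer fetch (stubbed).
use dashmap::DashMap;
use rocksdb::DB;

struct TieredKv {
    ram: DashMap<[u8; 32], Vec<u8>>,
    ssd: DB,
}

impl TieredKv {
    fn get(&self, key: &[u8; 32]) -> Option<Vec<u8>> {
        // Tier 1: in-process RAM, lock-free concurrent map.
        if let Some(hit) = self.ram.get(key) {
            return Some(hit.value().clone());
        }
        // Tier 2: local SSD via RocksDB; promote to RAM on hit.
        if let Ok(Some(bytes)) = self.ssd.get(key) {
            self.ram.insert(*key, bytes.clone());
            return Some(bytes);
        }
        // Tier 3: ask a peer node (the QUIC path is omitted in this sketch).
        self.peer_fetch(key)
    }

    fn peer_fetch(&self, _key: &[u8; 32]) -> Option<Vec<u8>> {
        None // placeholder for the cross-node QUIC fetch
    }
}

fn main() -> Result<(), rocksdb::Error> {
    let kv = TieredKv {
        ram: DashMap::new(),
        ssd: DB::open_default("/tmp/cgn-kv-demo")?,
    };
    let key = *blake3::hash(b"prefix block 0").as_bytes();
    kv.ssd.put(key, b"kv bytes")?; // simulate a block that only lives on SSD
    assert_eq!(kv.get(&key).as_deref(), Some(&b"kv bytes"[..]));
    Ok(())
}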

Energy-Aware Scheduling

Redfish/IPMI/DCGM telemetry. Deprioritize thermally stressed nodes.
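One way such telemetry can shape placement, sketched below; the field names, temperature reference, and weighting are invented for illustration, and only the deprioritization idea comes from the feature itself.

// Illustrative scoring only -- thresholds and weights here are made up.
// Idea: fold power/thermal telemetry (Redfish/IPMI/DCGM) into the routing
// weight so thermally stressed nodes receive less new work.
#[derive(Debug)]
struct NodeTelemetry {
    gpu_temp_c: f64,   // e.g. reported by DCGM
    power_draw_w: f64, // e.g. reported by Redfish/IPMI
    power_cap_w: f64,
}

// Returns a multiplier in (0, 1]; lower means "send less work here".
fn energy_score(t: &NodeTelemetry) -> f64 {
    let thermal_headroom = ((85.0 - t.gpu_temp_c) / 85.0).clamp(0.05, 1.0);
    let power_headroom = (1.0 - t.power_draw_w / t.power_cap_w).clamp(0.05, 1.0);
    thermal_headroom.min(power_headroom)
}

fn main() {
    let cool = NodeTelemetry { gpu_temp_c: 55.0, power_draw_w: 280.0, power_cap_w: 400.0 };
    let hot = NodeTelemetry { gpu_temp_c: 82.0, power_draw_w: 390.0, power_cap_w: 400.0 };
    // A scheduler can multiply its placement score by this factor per node.
    println!("cool={:.2} hot={:.2}", energy_score(&cool), energy_score(&hot));
}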

Cross-Cluster Federation

Router forwards across clusters. KV cache peers via QUIC.

Multi-Model Cascade

SLM→LLM gating on log-probability. Save costs, preserve quality.
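The gate itself is small; here is a sketch under the assumption that the statistic is a mean per-token log-probability and that the threshold is tuned per model pair (both details are assumptions, not documented behavior).

// Illustrative cascade gate -- the statistic and threshold are assumptions.
// Idea: keep the small model's (SLM) answer when it is confident, otherwise
// escalate the request to the large model (LLM).
fn mean_logprob(token_logprobs: &[f64]) -> f64 {
    if token_logprobs.is_empty() {
        return f64::NEG_INFINITY;
    }
    token_logprobs.iter().sum::<f64>() / token_logprobs.len() as f64
}

// Accept the SLM draft if its average per-token log-probability clears the bar.
fn accept_slm_answer(token_logprobs: &[f64], threshold: f64) -> bool {
    mean_logprob(token_logprobs) >= threshold
}

fn main() {
    let confident = [-0.05, -0.12, -0.03, -0.20]; // SLM is sure of its tokens
    let uncertain = [-1.9, -2.4, -0.8, -3.1];     // low confidence

    let threshold = -0.5; // would be tuned per model pair in practice
    assert!(accept_slm_answer(&confident, threshold));  // serve the SLM answer
    assert!(!accept_slm_answer(&uncertain, threshold)); // escalate to the LLM
}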

Pluggable KV Offload

One TOML knob: none, nixl, lmcache, hicache, or kvbm.
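To make the shape of that knob concrete, a parsing sketch using serde and the toml crate; the key name and struct are hypothetical, and only the five backend values come from the list above.

// Hypothetical config shape -- the key name is invented for this sketch;
// only the backend names (none/nixl/lmcache/hicache/kvbm) are from the docs.
use serde::Deserialize;

#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "lowercase")]
enum KvOffload {
    None,
    Nixl,
    Lmcache,
    Hicache,
    Kvbm,
}

#[derive(Debug, Deserialize)]
struct KvCacheConfig {
    offload: KvOffload, // the single knob
}

fn main() {
    // Switching backends is a one-line config change.
    let cfg: KvCacheConfig = toml::from_str(r#"offload = "lmcache""#).unwrap();
    assert_eq!(cfg.offload, KvOffload::Lmcache);
}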

Engine-Agnostic Agent

Drives any process exposing /v1/chat/completions. Zero engine-specific code.
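In practice that contract is just the OpenAI chat-completions request shape; a hypothetical probe is sketched below, assuming reqwest (with its json feature), tokio, and serde_json. The ENGINE_URL variable and request body are illustrative, not part of cgn-agent.

// Illustrative client, not the cgn-agent source. The point: any engine that
// answers /v1/chat/completions (vLLM, SGLang, llama.cpp, TRT-LLM, ...) can be
// driven with the same request, so no engine-specific code is required.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Swap the base URL for whichever engine process is being supervised.
    let engine = std::env::var("ENGINE_URL")
        .unwrap_or_else(|_| "http://127.0.0.1:8000".to_string());

    let resp = reqwest::Client::new()
        .post(format!("{engine}/v1/chat/completions"))
        .json(&json!({
            "model": "llama-3.1-8b-instruct",
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 8
        }))
        .send()
        .await?;

    println!("engine answered: HTTP {}", resp.status());
    Ok(())
}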

Compatibility

Engine support matrix

Any engine exposing the OpenAI HTTP surface. Feature depth varies by backend.

Feature | vLLM | SGLang | llama.cpp | TRT-LLM | OpenAI
OpenAI HTTP gateway
KV-aware routing
Prefill/decode disaggregate
KV offload — LMCache
KV offload — HiCache
KV offload — KVBM
Multi-tier KV (RAM/SSD)
Multi-model cascade
Energy-aware admission

Some combinations are wired through config, awaiting upstream support.

Architecture

Disaggregated by design

Prefill and decode on separate pools. KV-aware routing places requests where caches live. Internal gRPC with mTLS everywhere.

Control plane: cgn-operator (CRDs) · cgn-ctl (CLI) · etcd (state + policies)

Data plane: Client (OpenAI SDK) → HTTP → cgn-router (BLAKE3 KV routing, disaggregated dispatch; :8080 HTTP · :9090 gRPC) → gRPC → cgn-agent ×N, prefill (vLLM / SGLang / TRT-LLM) → NIXL handoff → cgn-agent ×N, decode (token streaming · NVML) → UDS → cgn-kvcached (GPU → RAM → SSD; DashMap · RocksDB; cross-node QUIC fetch; :7090 gRPC · :7091 QUIC). Responses stream back to the client over SSE.

Observability: cgn-metrics (Redfish · DCGM) → Prometheus → Grafana; power score feeds energy-aware scheduling.

Bare Metal

systemd on Linux

Kubernetes

Helm + operator CRDs

Cloud

Terraform: AWS, GCP, Azure, Hetzner

Local Dev

Docker, Ollama for macOS

Use Cases

From single models to fleet orchestration

Reasoning Models

MoE and chain-of-thought workloads benefit from disaggregated prefill/decode. Split compute phases across specialized pools.

MoE · CoT · Long context

Kubernetes AI Scaling

Deploy across GKE, EKS, AKS with Helm and a native operator. CRDs for InferenceCluster, ModelPool, RoutingPolicy.

Helm · CRDs · Multi-cloud

AI Agents

Agent workloads reuse context heavily. KV-aware routing lands follow-ups where the prefix cache lives.

KV cache · Prefix reuse · Low TTFT

Code Generation

Large repo contexts cached across RAM, SSD, and peer nodes. Hot contexts stay resident, cold ones page transparently.

Context reuse · Multi-tier · Streaming

Cost-Optimized Serving

SLM→LLM cascade gates on log-probability. Cut costs on straightforward queries without sacrificing quality.

SLM→LLM · Cost · Quality gates

Performance

CI-gated targets

Every merge to main is gated on these numbers. Harnesses in the repo.

Routing decision

< 500µs

per vCPU, CI-gated

HTTP overhead

< 3ms

p99 vs direct engine

Warm cache hit

< 200µs

RAM tier lookup

Cold cache hit

< 5ms

SSD/RocksDB tier

Cross-node fetch

< 12ms

1 MiB, QUIC, 10 GbE

Cache hit ratio

≥ 55%

representative traces

Energy efficiency

1.4×

vs round-robin

Model bring-up

< 30s

using recipes

Recipes

Production deployment profiles

model/engine/topology/ — 3-line up.sh, serving in <30s.

Model | Engine | Topology
Llama-3.1 8B | vLLM | 1×A100 aggregated
Llama-3.1 8B | vLLM | 1×A100 disaggregated
Llama-3.1 8B | SGLang | 1×A100
Llama-3.1 8B | vLLM + LMCache | 1×A100
Llama-3.3 70B FP8 | vLLM | TP=4 multi-GPU
Qwen3-7B | vLLM | 1×GPU
Qwen3-7B | SGLang | 1×GPU
DeepSeek-v4-Flash | vLLM | Multi-GPU
DeepSeek-v4-Flash | SGLang | Multi-GPU
$ cd recipes/llama-3.1-8b/vllm/1xa100 && ./up.sh

Quickstart

One command to install

$ curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh
Linux
# x86_64 and aarch64 — auto-detected
curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh

# pin a specific version
CGN_VERSION=v0.1.0 \
  curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh

# verify
cgn-ctl version
NEXT — Send a request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'