Quickstart — 5 minutes from zero

This page walks from "I just heard about Cognitora" to "I'm streaming tokens through it" without committing to a Kubernetes cluster. We'll run everything on one host, with mTLS off and a fake engine that returns canned tokens — perfect for a demo, a laptop test, or a CI smoke check.

Prerequisites

  • Linux for prebuilt binaries (x86_64 or aarch64). macOS is supported as a dev platform via the from-source path below.
  • Rust 1.89+ if building from source.
  • ~2 minutes to build.

1. Install

Pick one of:

# Option A — prebuilt binaries (Linux x86_64 / aarch64).
# Pulls a sha256-verified release tarball from GitHub. Override CGN_PREFIX
# to install somewhere other than /usr/local/bin or ~/.cognitora/bin.
curl -fsSL https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | sh

# Option B — from source (any platform; required on macOS)
git clone https://github.com/antonellof/cognitora-inference cognitora
cd cognitora
cargo build --release --no-default-features \
  -p cgn-router -p cgn-agent -p cgn-kvcached -p cgn-ctl

If you used option B, the binaries are under target/release/. Either prepend that to your PATH or use the full path below.
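
For example, from the repo root:

export PATH="$PWD/target/release:$PATH"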

2. Bootstrap dev PKI

cgn-ctl pki bootstrap --out /tmp/pki --san localhost

You'll get four PEM files in /tmp/pki. We won't enable mTLS for this run — but the files prove cgn-ctl pki works.
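
A quick way to confirm the bootstrap worked (exact file names depend on your cgn-ctl version):

ls /tmp/pki
# expect four .pem files, as described above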

3. Issue an API key

mkdir -p /tmp/cognitora
cgn-ctl key create --file /tmp/cognitora/api-keys --scopes "chat,embed"
# prints: cgn-c782d73a8c914c3da49191626f95737e
export CGN_KEY=cgn-c782...   # paste the printed token
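
If you'd rather skip the copy/paste, you can mint a key and capture it in one step. This assumes the token is the only cgn-… string in the output; creating a second demo key is harmless. Adjust the grep if your build prints the token differently.

export CGN_KEY="$(cgn-ctl key create --file /tmp/cognitora/api-keys --scopes "chat,embed" \
  | grep -oE 'cgn-[0-9a-f]+' | tail -n 1)"   # assumes a cgn-<hex> token in the output
echo "$CGN_KEY"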

4. Minimal config

cat > /tmp/cognitora/cognitora.toml <<'EOF'
[cluster]
name     = "demo"
data_dir = "/tmp/cognitora"
etcd     = []                      # single-node, no etcd

[security]
require_mtls = false

[auth]
enabled       = true
api_keys_file = "/tmp/cognitora/api-keys"

[router]
listen_http  = "127.0.0.1:8080"
listen_grpc  = "127.0.0.1:7070"
listen_admin = "127.0.0.1:9091"

[models.demo]
hf_repo = "fake://demo"            # fake engine for the smoke test
EOF

5. Boot the router

cgn-router --config /tmp/cognitora/cognitora.toml &
sleep 2

You'll see a JSON log line per listener (HTTP on 8080, gRPC on 7070, admin on 9091).
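
Before moving on, you can confirm all three listeners are up. The check below uses ss (Linux); on macOS, lsof -iTCP -sTCP:LISTEN does the same job.

ss -ltn | grep -E '(:8080|:7070|:9091)'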

6. Hit the OpenAI-compatible API

The endpoints below are OpenAI-compatible, so the OpenAI SDKs work too: point the client's base URL at http://127.0.0.1:8080/v1 and use $CGN_KEY as the API key.

curl -sSfL http://127.0.0.1:8080/v1/models \
  -H "authorization: bearer $CGN_KEY"
# {"object":"list","data":[{"id":"demo","object":"model","owned_by":"cognitora"}]}

A streaming chat completion (returns 503 until you wire up an agent and an engine):

curl -N -sS http://127.0.0.1:8080/v1/chat/completions \
  -H "authorization: bearer $CGN_KEY" \
  -H "content-type: application/json" \
  -d '{
    "model": "demo",
    "messages": [{"role":"user","content":"Hello"}],
    "stream": true
  }'

7. Run a real engine (vLLM or llama.cpp)

The fake engine above is fine for proving the gateway/auth/routing path, but to run an actual model end to end, use one of the bundled engine drivers:

Engine kind     When to pick it
vllm            NVIDIA GPU node. The agent spawns vllm serve <model> ….
llama_cpp       CPU node, Apple Silicon, or GPU offload via n_gpu_layers.
openai_compat   The engine is managed by systemd / k8s; the agent only proxies.

Two ready-to-run profiles live under examples/:

Profile              Engine                          Best for
examples/local-mac   openai_compat → Ollama          macOS laptop. No Python venv, no GGUF download — just ollama pull phi3:mini.
examples/multi-llm   vllm (GPU) or llama_cpp (CPU)   Linux box, server, or CI. Multi-model with a real engine.

macOS (Ollama path)

brew install jq unzip
ollama serve &
ollama pull phi3:mini
ollama pull llama3.2

cargo build --release --no-default-features \
  -p cgn-router -p cgn-agent -p cgn-kvcached -p cgn-ctl
bash scripts/install/install-etcd.sh
bash scripts/run/up.sh examples/local-mac
bash examples/local-mac/demo.sh

Linux (vLLM or llama.cpp)

bash scripts/install/bootstrap-debian.sh        # apt + rustup
bash scripts/install/install-engine-cpu.sh      # or install-engine-gpu.sh
bash scripts/install/install-etcd.sh
bash scripts/install/download-model.sh \
  --gguf qwen2.5-0.5b-instruct-q4_k_m.gguf  Qwen/Qwen2.5-0.5B-Instruct-GGUF
cargo build --release --no-default-features \
  -p cgn-router -p cgn-agent -p cgn-kvcached -p cgn-ctl
bash scripts/run/up.sh examples/multi-llm
bash examples/multi-llm/demo.sh

The same TOML profile boots a vLLM stack on a GPU host — only the [engine] block in agent-*.toml changes.
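
To see exactly what differs between the CPU and GPU variants, compare the [engine] blocks in the profile. The grep below is just a convenience; what it prints depends on what the repo ships.

# prints the [engine] block (plus a few following lines) from each agent config
grep -A 8 '^\[engine\]' examples/multi-llm/agent-*.toml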

8. Run the smoke tests

These tests need only the binaries and Python 3 — no models, no GPUs:

# Engine-plugin layer + auth + rate-limit middleware. ~3 s.
./tests/e2e/multi_engine.sh

# Multi-node KV transport. Skips with REQUIRE_MULTINODE=0 if the second
# host isn't available; runs full QUIC handoff when it is.
./tests/e2e/multi_node_kv.sh

For a tighter dev loop, drop CGN_SKIP_BUILD=1 in front of the test once your target/release is warm.
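
For example, to rerun the engine-plugin test without rebuilding:

CGN_SKIP_BUILD=1 ./tests/e2e/multi_engine.sh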

9. What's next

  • Turn on mTLS: flip require_mtls to true and point the router at the certs from step 2 (/tmp/pki).
  • Go multi-node: list your etcd endpoints in [cluster].etcd instead of leaving it empty.
  • Replace the fake engine with vLLM, llama.cpp, or an openai_compat backend, as in step 7.

10. Tear down

kill %1                                # the backgrounded router
rm -rf /tmp/cognitora /tmp/pki

# or, if you used scripts/run/up.sh with a profile:
bash scripts/run/down.sh examples/local-mac      # or examples/multi-llm