# OpenAI-compatible HTTP surface

`cgn-router` listens on `[router].listen_http` (default `:8080`) and
speaks the OpenAI HTTP/SSE protocol. Any OpenAI SDK can target it
unchanged: point `OPENAI_BASE_URL` at the router and use any API key
issued by `cgn-ctl key create`.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://router.cognitora.local:8080/v1",
    api_key="cgn-c782d73a8c914c3da49191626f95737e",
)

stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "Explain KV-aware routing in one sentence."}],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
## Endpoints

| Method | Path | Status | Notes |
|---|---|---|---|
| POST | `/v1/chat/completions` | implemented | streaming + buffered |
| POST | `/v1/completions` | implemented | alias of `chat/completions` for legacy SDKs |
| POST | `/v1/embeddings` | implemented | KV-aware routing → `Agent.Embed` → engine `/v1/embeddings` round-trip |
| GET | `/v1/models` | implemented | union of `[models.*]` config + live agents |
| GET | `/healthz` | implemented | liveness probe (admin port) |
| GET | `/readyz` | implemented | readiness probe (admin port) |
Not yet implemented (tracked, no committed timeline): assistants / threads, tool calls, fine-tunes, audio, image, batch.
## Auth

Every `/v1/*` request goes through `cgn-auth::middleware`. Two flows:

- **API key**: `Authorization: Bearer cgn-<32hex>`. The token's sha256
  is matched against `[auth].api_keys_file`. Use
  `cgn-ctl key create --scopes "chat,embed"` to issue one. Tokens are
  shown once; the file stores only hashes.
- **OIDC**: the same header, but carrying a JWT. `cgn-auth` validates
  the signature against the issuer's JWKS (rotated every
  `[auth].oidc_jwks_ttl`, default 10m). The `sub` claim becomes the
  rate-limit subject.

When `[auth].enabled = false` the middleware is a no-op (CI / dev only).
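The hash matching in the API-key flow can be sketched in a few lines. This is an illustrative sketch, not the `cgn-auth` source; it assumes `[auth].api_keys_file` holds one sha256 hex digest per key, represented below by an in-memory set.

```python
import hashlib

def key_matches(presented: str, stored_hashes: set) -> bool:
    # Hash the presented bearer token and look it up; the plaintext
    # key never needs to be stored server-side.
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return digest in stored_hashes

# Stand-in for the parsed contents of [auth].api_keys_file.
token = "cgn-c782d73a8c914c3da49191626f95737e"
stored = {hashlib.sha256(token.encode()).hexdigest()}

print(key_matches(token, stored))                   # True
print(key_matches("cgn-0000000000000000", stored))  # False
```

Storing only digests means a leaked key file reveals no usable credentials, which is why tokens are shown exactly once at creation time.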
## Streaming

Responses stream as Server-Sent Events (`text/event-stream`). Each
token chunk is one `data: {...}\n\n` frame, followed by a terminal
`data: [DONE]\n\n`. The chunk shape matches OpenAI exactly:

```text
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1714566600,"model":"llama3-8b","choices":[{"index":0,"delta":{"content":"hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1714566600,"model":"llama3-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

The router never buffers a streaming response; tokens flow directly
from `Agent.Generate` through `gateway::sse::SseEncoder` to the
client.
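Because each frame is a single `data:` line, decoding needs no SSE library. A minimal sketch, fed from a literal byte string shaped like the frames above rather than a live connection:

```python
import json

raw = (
    b'data: {"id":"chatcmpl-1","object":"chat.completion.chunk",'
    b'"created":1714566600,"model":"llama3-8b","choices":[{"index":0,'
    b'"delta":{"content":"hello"},"finish_reason":null}]}\n\n'
    b"data: [DONE]\n\n"
)

tokens = []
for frame in raw.decode().split("\n\n"):
    if not frame.startswith("data: "):
        continue  # skip the empty trailing split
    payload = frame[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel, not JSON
    delta = json.loads(payload)["choices"][0]["delta"]
    tokens.append(delta.get("content", ""))

print("".join(tokens))  # hello
```

The final chunk carries an empty `delta` with `finish_reason` set, so `delta.get("content", "")` handles both shapes.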
## Error envelope

All errors return the OpenAI-shaped JSON:

```json
{ "error": { "code": null, "message": "<text>", "type": "<type>" } }
```

| HTTP | `type` | When |
|---|---|---|
| 401 | `auth_error` | missing / invalid bearer |
| 403 | `auth_error` | valid token but scope insufficient |
| 404 | `model_not_found` | unknown model name |
| 429 | `rate_limit` | `cgn-ratelimit` quota exhausted |
| 429 | `server_error` | router admission queue full (`max_queue` reached) |
| 503 | `server_error` | no live node serving the model for the requested role |
| 5xx | `server_error` | upstream failure (engine crash, transport) |
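One practical use of this table: a client can decide from the status code plus the envelope's `type` whether a failure is worth retrying. A sketch; the retry policy itself is the caller's choice, not router behaviour:

```python
import json

TRANSIENT = {"rate_limit", "server_error"}

def should_retry(status: int, body: str) -> bool:
    # 429/503 with a transient type usually clear after backoff;
    # auth and model errors never succeed on retry.
    err = json.loads(body).get("error", {})
    return status in {429, 503} and err.get("type") in TRANSIENT

print(should_retry(429, '{"error":{"code":null,"message":"quota exhausted","type":"rate_limit"}}'))  # True
print(should_retry(401, '{"error":{"code":null,"message":"invalid bearer","type":"auth_error"}}'))   # False
```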
## Headers

`cgn-router` honours and emits a small set of beyond-spec headers:

| Header | Direction | Purpose |
|---|---|---|
| `x-request-id` | both | propagated to traces and `cgn-agent` logs |
| `x-cgn-subject` | inbound* | set by `cgn-auth`; the downstream rate limiter reads this |
| `x-cgn-cache-hit` | outbound | `true`/`false`: was the prefix found in `cgn-kvcached` |
| `x-cgn-node` | outbound | which node id served the request |
| `x-cgn-cascade-step` | outbound | which model in a cascade chain produced the response |

\* Set internally by the middleware; an inbound request that sets
`x-cgn-subject` directly is rejected.
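Reading the outbound headers from a response is straightforward. The sketch below uses a plain dict in place of a live HTTP response object, with illustrative values:

```python
# Stand-in for response.headers on a completed request.
headers = {
    "x-request-id": "req-0001",
    "x-cgn-cache-hit": "true",
    "x-cgn-node": "node-3",
    "x-cgn-cascade-step": "llama3-8b",
}

# The cache-hit header carries the string "true"/"false", not a JSON bool.
cache_hit = headers.get("x-cgn-cache-hit") == "true"
print(f"node={headers['x-cgn-node']} cache_hit={cache_hit}")
```

Logging `x-request-id` alongside these values makes it easy to correlate a client-side observation with the router's traces and `cgn-agent` logs.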
## OpenAPI

A machine-readable spec (`openapi.yaml`) lives next to this doc and is
regenerated from `cgn-router::gateway::types` on every release. Use it
with `openapi-generator-cli` to scaffold typed clients in any
language.