OpenAI-compatible HTTP surface

cgn-router listens on [router].listen_http (default :8080) and speaks the OpenAI HTTP/SSE protocol. Any OpenAI SDK can target it unchanged — point OPENAI_BASE_URL at the router and use any API key from cgn-ctl key create.

from openai import OpenAI

client = OpenAI(
    base_url="http://router.cognitora.local:8080/v1",
    api_key="cgn-c782d73a8c914c3da49191626f95737e",
)

stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "Explain KV-aware routing in one sentence."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Endpoints

| Method | Path | Status | Notes |
| --- | --- | --- | --- |
| POST | /v1/chat/completions | implemented | streaming + buffered |
| POST | /v1/completions | implemented | alias of chat/completions for legacy SDKs |
| POST | /v1/embeddings | implemented | KV-aware routing → Agent.Embed → engine /v1/embeddings round-trip (example below) |
| GET | /v1/models | implemented | union of [models.*] config + live agents |
| GET | /healthz | implemented | liveness probe (admin port) |
| GET | /readyz | implemented | readiness probe (admin port) |

Not yet implemented (tracked, no committed timeline): assistants / threads, tool calls, fine-tunes, audio, image, batch.
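
For example, an embeddings request through the router is a standard OpenAI embeddings call. A minimal sketch (the model name below is a placeholder; pick an embedding-capable model from GET /v1/models):

from openai import OpenAI

client = OpenAI(
    base_url="http://router.cognitora.local:8080/v1",
    api_key="cgn-c782d73a8c914c3da49191626f95737e",
)

# Placeholder model name; use any embedding-capable model listed by GET /v1/models.
emb = client.embeddings.create(
    model="bge-base",
    input=["KV-aware routing", "prefix caching"],
)
print(len(emb.data), len(emb.data[0].embedding))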

Auth

Every /v1/* request goes through cgn-auth::middleware. Two flows:

  • API key — Authorization: Bearer cgn-<32hex>. The token's sha256 is matched against [auth].api_keys_file. Use cgn-ctl key create --scopes "chat,embed" to issue one. Tokens are shown once; the file stores hashes.
  • OIDC — same header but with a JWT. cgn-auth validates the signature against the issuer's JWKS (rotated every [auth].oidc_jwks_ttl, default 10m). The sub claim becomes the rate-limit subject.

When [auth].enabled = false the middleware is a no-op (CI / dev only).
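
A minimal sketch of the API-key flow over plain HTTP (the token is the placeholder key from the example above; the response body is assumed to follow the standard OpenAI list envelope):

import requests

# Placeholder token; substitute the value printed once by `cgn-ctl key create`.
headers = {"Authorization": "Bearer cgn-c782d73a8c914c3da49191626f95737e"}

resp = requests.get("http://router.cognitora.local:8080/v1/models", headers=headers)
resp.raise_for_status()  # 401/403 surface here as auth_error (see Error envelope below)
for model in resp.json()["data"]:
    print(model["id"])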

Streaming

Server-Sent Events (text/event-stream). Each token chunk is one data: {...}\n\n event; the stream terminates with data: [DONE]\n\n. The chunk shape matches OpenAI exactly:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1714566600,"model":"llama3-8b","choices":[{"index":0,"delta":{"content":"hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1714566600,"model":"llama3-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]

The router never buffers a streaming response — tokens flow directly from Agent.Generate through gateway::sse::SseEncoder to the client.
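
For clients that do not use an OpenAI SDK, a minimal sketch of consuming the stream over raw HTTP with httpx, parsing the data: lines shown above (the API key is the placeholder from the first example):

import json

import httpx

payload = {
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Explain KV-aware routing in one sentence."}],
    "stream": True,
}
headers = {"Authorization": "Bearer cgn-c782d73a8c914c3da49191626f95737e"}

with httpx.stream(
    "POST",
    "http://router.cognitora.local:8080/v1/chat/completions",
    json=payload,
    headers=headers,
    timeout=None,
) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)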

Error envelope

All errors return the OpenAI-shaped JSON envelope:

{ "error": { "code": null, "message": "<text>", "type": "<type>" } }

| HTTP | type | When |
| --- | --- | --- |
| 401 | auth_error | missing / invalid bearer |
| 403 | auth_error | valid token but scope insufficient |
| 404 | model_not_found | unknown model name |
| 429 | rate_limit | cgn-ratelimit quota exhausted |
| 429 | server_error | router admission queue full (max_queue reached) |
| 503 | server_error | no live node serving the model for the requested role |
| 5xx | server_error | upstream failure (engine crash, transport) |
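
Because the envelope matches OpenAI's, the official SDK raises its usual exception classes for these statuses. A sketch of handling them (model name and key are placeholders):

import openai
from openai import OpenAI

client = OpenAI(
    base_url="http://router.cognitora.local:8080/v1",
    api_key="cgn-c782d73a8c914c3da49191626f95737e",
)

try:
    client.chat.completions.create(
        model="no-such-model",
        messages=[{"role": "user", "content": "hi"}],
    )
except openai.NotFoundError as err:    # 404 model_not_found
    print("unknown model:", err)
except openai.RateLimitError:          # 429 rate_limit / queue full
    print("back off and retry")
except openai.APIStatusError as err:   # everything else, including 5xx server_error
    print("router/engine error:", err.status_code)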

Headers

cgn-router honours and emits a small set of headers beyond the OpenAI spec:

| Header | Direction | Purpose |
| --- | --- | --- |
| x-request-id | both | propagated to traces and cgn-agent logs |
| x-cgn-subject | inbound* | set by cgn-auth; downstream rate limit reads this |
| x-cgn-cache-hit | outbound | true/false — was the prefix found in cgn-kvcached |
| x-cgn-node | outbound | which node id served the request |
| x-cgn-cascade-step | outbound | which model in a cascade chain produced the response |

* set internally by middleware; an inbound request setting x-cgn-subject directly is rejected.
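
With the Python SDK, the outbound headers are reachable through the raw-response wrapper. A sketch (model name and key are from the first example):

from openai import OpenAI

client = OpenAI(
    base_url="http://router.cognitora.local:8080/v1",
    api_key="cgn-c782d73a8c914c3da49191626f95737e",
)

raw = client.chat.completions.with_raw_response.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "hello"}],
)
print("cache hit:", raw.headers.get("x-cgn-cache-hit"))
print("served by:", raw.headers.get("x-cgn-node"))

completion = raw.parse()  # the regular ChatCompletion object
print(completion.choices[0].message.content)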

OpenAPI

A machine-readable spec (openapi.yaml) lives next to this doc and is regenerated from cgn-router::gateway::types on every release. Use it with openapi-generator-cli to scaffold typed clients in any language.