cgn-metrics

Power-Aware Telemetry Aggregator

Prometheus aggregator surfacing power telemetry from Redfish, IPMI, and NVML/DCGM. Feeds the router's energy-aware scoring term.

Overview

cgn-metrics collects power and thermal readings from hardware management interfaces (Redfish, IPMI, NVML/DCGM), aggregates them into Prometheus-compatible gauges, and exposes them for the router's power-aware scoring. The router subscribes to the metrics endpoint and incorporates power data into its routing decisions, enabling energy-aware scheduling that deprioritizes thermally stressed nodes.

Features

  • Consumes power readings from Redfish, IPMI, and NVML/DCGM
  • Exports Prometheus gauges: cgn_power_watts{component=...}
  • 30-second rolling average for power normalization
  • Router subscribes to the metrics endpoint and incorporates power data into its power scoring term
  • Per-node thermal stress detection and alerting
  • First-class OpenTelemetry export support
  • Queue depth, KV hit rate, TTFT, ITL metrics
  • Per-tenant utilization tracking
  • Drop-in Grafana dashboards included in the repo
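The 30-second rolling average listed above can be pictured with a small time-based window. This is only an illustrative sketch, not the cgn-metrics implementation: the class name `RollingPowerAverage`, its methods, and the choice to keep a sample that is exactly 30 seconds old are all assumptions made for the example.

```python
from collections import deque
from typing import Optional


class RollingPowerAverage:
    """Mean of the power samples seen in the last `window_s` seconds.

    Hypothetical helper for illustration; not part of cgn-metrics.
    """

    def __init__(self, window_s: float = 30.0) -> None:
        self.window_s = window_s
        self._samples = deque()  # (timestamp_s, watts), oldest first

    def add(self, timestamp_s: float, watts: float) -> None:
        """Record a reading and drop readings older than the window."""
        self._samples.append((timestamp_s, watts))
        # Assumption: a sample exactly `window_s` old is still kept.
        while timestamp_s - self._samples[0][0] > self.window_s:
            self._samples.popleft()

    def mean(self) -> Optional[float]:
        """Return the windowed mean, or None before the first reading."""
        if not self._samples:
            return None
        return sum(watts for _, watts in self._samples) / len(self._samples)
```

With the default 15-second scrape interval, a 30-second window holds at most three readings under this sketch, so a single spike moves the average by at most a third of its size.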

Architecture

Data flows in one direction: hardware BMC (Redfish/IPMI) → cgn-power collector → cgn-metrics aggregator → Prometheus scrape endpoint. DCGM/NVML provide GPU-level telemetry. The router queries the metrics endpoint for power scores. All metrics are available via the /metrics HTTP endpoint.
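To make the consumer side of this pipeline concrete, the sketch below reads `cgn_power_watts` samples out of a Prometheus text-format payload such as the one served on /metrics. It is a simplified illustration: the function name `parse_power_gauges` and the component label values (`gpu0`, `psu`) are hypothetical, and the parser ignores parts of the exposition format such as escaped quotes inside label values.

```python
import re

# Matches lines such as: cgn_power_watts{component="gpu0"} 312.5
_SAMPLE = re.compile(r'^cgn_power_watts\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical payload; the label values are invented for this example.
SAMPLE_PAYLOAD = """\
# TYPE cgn_power_watts gauge
cgn_power_watts{component="gpu0"} 312.5
cgn_power_watts{component="psu"} 845
"""


def parse_power_gauges(exposition: str) -> dict[str, float]:
    """Map each `component` label value to its power reading in watts."""
    readings = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # some other metric family
        labels = dict(_LABEL.findall(match.group("labels")))
        component = labels.get("component")
        if component is not None:
            readings[component] = float(match.group("value"))
    return readings
```

Calling `parse_power_gauges(SAMPLE_PAYLOAD)` yields one entry per component. A production consumer would more likely use an existing Prometheus client library than a hand-written parser.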

Configuration

Key                      Type      Default        Description
metrics.listen           string    0.0.0.0:9100   Prometheus metrics endpoint
metrics.power_sources    array     ["nvml"]       Power data sources: redfish, ipmi, nvml, dcgm
metrics.scrape_interval  duration  "15s"          Power telemetry scrape interval
metrics.rolling_window   duration  "30s"          Rolling average window for power normalization

Example

toml
[cluster]
name           = "production"
state_backend  = "etcd"
etcd_endpoints = ["http://etcd-0:2379"]

[security]
require_mtls = true

[metrics]
listen          = "0.0.0.0:9100"
power_sources   = ["redfish", "nvml"]
scrape_interval = "15s"
rolling_window  = "30s"

Performance

Energy efficiency vs round-robin: ≥ 1.4×

Scrape interval: configurable (default 15s)

30s rolling average for stable power scores