cgn-metrics
Power-Aware Telemetry Aggregator
Prometheus aggregator that surfaces power telemetry from Redfish, IPMI, and NVML/DCGM. It feeds the router's energy-aware scoring term.
Overview
cgn-metrics collects power and thermal readings from hardware management interfaces (Redfish, IPMI, NVML/DCGM), aggregates them into Prometheus-compatible gauges, and exposes them for the router's power-aware scoring. The router subscribes to the metrics endpoint and incorporates power data into its routing decisions, enabling energy-aware scheduling that deprioritizes thermally stressed nodes.
Features
- Consumes power readings from Redfish, IPMI, and NVML/DCGM
- Exports Prometheus gauges: cgn_power_watts{component=...}
- 30-second rolling average for power normalization
- Router subscribes to the exported metrics and incorporates them into its power scoring term
- Per-node thermal stress detection and alerting
- First-class OpenTelemetry export support
- Queue depth, KV hit rate, TTFT, ITL metrics
- Per-tenant utilization tracking
- Drop-in Grafana dashboards included in the repo
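The 30-second rolling average listed above can be illustrated with a small sketch. This is not the cgn-metrics implementation. The class name `RollingAverage`, its methods, and the choice to keep samples that sit exactly on the window boundary are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RollingAverage:
    """Time-windowed mean of power samples (hypothetical helper, not the project's API)."""

    def __init__(self, window_seconds: float = 30.0) -> None:
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, watts)

    def add(self, timestamp: float, watts: float) -> None:
        """Record a sample and drop samples older than the window."""
        self._samples.append((timestamp, watts))
        cutoff = timestamp - self.window_seconds
        # Assumption: a sample exactly at the cutoff is still inside the window.
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def average(self) -> Optional[float]:
        """Mean of the retained samples, or None if nothing has been recorded."""
        if not self._samples:
            return None
        return sum(watts for _, watts in self._samples) / len(self._samples)


avg = RollingAverage(window_seconds=30.0)
for ts, watts in [(0.0, 100.0), (15.0, 200.0), (30.0, 300.0), (45.0, 400.0)]:
    avg.add(ts, watts)
print(avg.average())  # the sample at t=0 has aged out of the 30s window
```

With the default 15s scrape interval, a 30s window holds only two or three samples at a time, so the average smooths short spikes without lagging far behind the hardware.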
Architecture
Hardware BMC (Redfish/IPMI) → cgn-power collector → cgn-metrics aggregator → Prometheus scrape endpoint. The router queries the metrics endpoint for power scores. DCGM/NVML provide GPU-level telemetry. All metrics are available via the /metrics HTTP endpoint.
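As a rough sketch of what a consumer of the /metrics endpoint might do, the snippet below extracts cgn_power_watts gauges from Prometheus text exposition output. The function name `parse_power_gauges`, the component values "gpu" and "cpu", and the assumption that each sample carries only the component label are illustrative and not confirmed by the project.

```python
import re
from typing import Dict

# Matches lines such as: cgn_power_watts{component="gpu"} 312.5
# Assumption: `component` is the only label on the sample.
_SAMPLE = re.compile(r'^cgn_power_watts\{component="([^"]*)"\}\s+(\S+)')


def parse_power_gauges(text: str) -> Dict[str, float]:
    """Return {component: watts} from Prometheus text exposition output."""
    readings: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match:
            readings[match.group(1)] = float(match.group(2))
    return readings


# Hypothetical scrape output; the values are made up for the example.
example = """\
# TYPE cgn_power_watts gauge
cgn_power_watts{component="gpu"} 312.5
cgn_power_watts{component="cpu"} 95
other_metric 1
"""
print(parse_power_gauges(example))
```

A real deployment would more likely use a Prometheus client library or query a Prometheus server than parse the text format by hand. The hand-written parser is only meant to show the shape of the data.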
Configuration
| Key | Type | Default | Description |
|---|---|---|---|
| metrics.listen | string | "0.0.0.0:9100" | Listen address for the Prometheus metrics endpoint |
| metrics.power_sources | array | ["nvml"] | Power data sources: redfish, ipmi, nvml, dcgm |
| metrics.scrape_interval | duration | "15s" | Power telemetry scrape interval |
| metrics.rolling_window | duration | "30s" | Rolling average window for power normalization |
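To show how the keys in the table relate to each other, here is a minimal sketch that applies the documented defaults and converts the duration strings to seconds. The helper names, the accepted unit suffixes (ms, s, m, h), and the check that the rolling window is not shorter than the scrape interval are assumptions. The source only shows values expressed in seconds.

```python
import re

# Defaults copied from the configuration table.
DEFAULTS = {
    "listen": "0.0.0.0:9100",
    "power_sources": ["nvml"],
    "scrape_interval": "15s",
    "rolling_window": "30s",
}

# Assumed unit suffixes; only "s" appears in the documented values.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m|h)$")


def parse_duration(value: str) -> float:
    """Convert a duration string such as "15s" to seconds."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def resolve_metrics_config(overrides: dict) -> dict:
    """Merge user overrides onto the defaults and sanity-check the durations."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown metrics keys: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    interval = parse_duration(config["scrape_interval"])
    window = parse_duration(config["rolling_window"])
    # Assumed sanity check: a window shorter than one scrape would average nothing.
    if window < interval:
        raise ValueError("rolling_window is shorter than scrape_interval")
    return config


config = resolve_metrics_config({"power_sources": ["redfish", "nvml"]})
intervals = parse_duration(config["rolling_window"]) / parse_duration(config["scrape_interval"])
print(config["listen"], intervals)  # the default window spans two scrape intervals
```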
Example
[cluster]
name = "production"
state_backend = "etcd"
etcd_endpoints = ["http://etcd-0:2379"]
[security]
require_mtls = true
[metrics]
listen = "0.0.0.0:9100"
power_sources = ["redfish", "nvml"]
scrape_interval = "15s"
rolling_window = "30s"
Performance
- Energy efficiency vs round-robin: ≥ 1.4×
- Scrape interval: configurable (default 15s)
- Rolling average window: 30s, for stable power scores