32. GPU observability

Every metric on the AI Factory floor — what it means, when to worry, and why an idle GPU can sit at 90% memory.

This chapter is the reference for the hardware view: the five metric cards, the trend chart, the hardware-error events table, the alerts list, and the exact health thresholds behind every coloured dot. If you haven’t met the page yet, start with Chapter 30.

Everything here is for owners and providers — tenants whose card is reporting metrics. Consumers see a usage view instead (Chapter 30).

Where the numbers come from

The data comes from the NVIDIA DCGM exporter (the DCGM_FI_DEV_* series, plus DCGM_FI_PROF_* for SM activity), scraped by Prometheus. Daalu doesn’t invent metrics: it reads the same counters that dcgmi and Grafana would, then renders them natively, scoped to your tenant on the server (see “Tenant scoping” below).

The page refreshes on a timer: the live summary every 15 seconds, the trend chart and events every 30 seconds. That is frequent enough to watch load arrive in close to real time.

The five metric cards

Open a GPU’s detail view and you get five metric cards. When a tenant has more than one GPU, the metric cards show an aggregate and a per-GPU breakdown table appears underneath.

Card	Source	What it tells you
Temperature	`DCGM_FI_DEV_GPU_TEMP`	Core temperature in °C. The single best early-warning signal; thermal limits throttle the card before anything breaks.
Utilisation	`DCGM_FI_DEV_GPU_UTIL`	Percentage of time the GPU was busy. Tracks live inference work — 0% when the model is loaded but idle.
VRAM	`DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE`	Used / total GB and a percentage. Expect this high and stable (see below).
Power	`DCGM_FI_DEV_POWER_USAGE`	Draw in watts, summed across GPUs when you have more than one. Rises with real work; a useful proxy for “is the card actually doing something.”
SM-active %	`DCGM_FI_PROF_SM_ACTIVE`	Fraction of time the streaming multiprocessors had work assigned, averaged across SMs. A finer-grained “how busy is the compute” than utilisation.

Why it matters: high VRAM at ~0% utilisation is expected, not a fault. The vLLM server reserves roughly 90% of VRAM up front — the model weights plus a pre-allocated KV-cache pool — and holds it whether or not requests are flowing. So memory sits high and flat from the moment the model loads. Utilisation and power, by contrast, track live work: they read near zero while the model is loaded but idle, and climb as traffic hits the card. If you see 90% VRAM and 0% utilisation, your GPU is ready and waiting, not stuck. The detail view prints this reminder inline whenever it detects the pattern.
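The "loaded but idle" pattern can be checked mechanically against one GPU record. The following is a minimal sketch, assuming the `mem_pct` and `util_pct` fields shown in the summary response under “The read-only API”. The function names and both cut-offs are hypothetical, chosen for illustration, and are not the rule the detail view itself applies. The percentage helper also assumes that total VRAM is simply used plus free.

```python
def vram_percent(used_gb, free_gb):
    """VRAM percentage, assuming total = used + free (an assumption)."""
    total = used_gb + free_gb
    return 0.0 if total == 0 else 100.0 * used_gb / total


def is_loaded_but_idle(gpu, mem_floor_pct=85.0, util_ceiling_pct=1.0):
    """Return True when VRAM is held high while the GPU does no work.

    Both cut-offs are illustrative assumptions, not product thresholds.
    """
    return gpu["mem_pct"] >= mem_floor_pct and gpu["util_pct"] <= util_ceiling_pct
```

Applied to the sample record later in this chapter (`mem_pct` 90.0, `util_pct` 0.0), `is_loaded_but_idle` returns `True`: the card is ready and waiting.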

The Trend chart

Below the cards, the Trend chart plots one metric over a time window. Pick from:

Metric: Utilisation, Temp, Memory, Power.
Range: 1h, 6h, 24h, 7d.

Percentage metrics (utilisation, memory) use a fixed 0–100 axis on purpose: an idle, all-zero GPU reads as a flat line at the bottom of a full scale, not a misleading auto-zoomed wiggle. When utilisation or memory is flat at zero across the whole window, the chart says so explicitly — the card has simply been idle, and the line will rise as requests arrive.
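The axis rule is simple enough to sketch. This is an illustration of the idea rather than the page's actual charting code: the function names, the 5% padding, and the fallback ranges are assumptions, and the metric keys reuse the `util` / `mem` names from the API's `metric` parameter.

```python
PERCENT_METRICS = {"util", "mem"}


def y_axis_range(metric, values, pad=0.05):
    """Pick a y-axis range: fixed 0-100 for percentages, padded auto-fit otherwise."""
    if metric in PERCENT_METRICS:
        # A fixed scale keeps an idle GPU as a flat line at the bottom.
        return (0.0, 100.0)
    if not values:
        return (0.0, 1.0)  # arbitrary placeholder range for an empty series
    lo, hi = min(values), max(values)
    if lo == hi:
        return (lo - 1.0, hi + 1.0)  # avoid a zero-height axis
    margin = (hi - lo) * pad
    return (lo - margin, hi + margin)


def is_flat_zero(values):
    """True when the window has data and every point is zero."""
    return bool(values) and all(v == 0 for v in values)
```

With an auto-fitted axis, a series that wobbles between 0% and 0.3% would fill the whole chart and look like activity; the fixed range avoids that.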

Use the trend to answer questions the live cards can’t: Did utilisation actually climb during last night’s batch job? Is the card creeping warmer week over week? Did power track the traffic we expect?

The XID / ECC events table

This table surfaces hardware error events — the ones that actually predict failure. An empty table (“No hardware error events recorded — that’s a good sign”) is what you want to see.

Event	Source	What it means
XID	`DCGM_FI_DEV_XID_ERRORS`	An NVIDIA driver-level error code. XIDs cover a wide range of faults — from a transient app error to GPU-falling-off-the-bus. Any XID on a serving card is worth investigating.
ECC double-bit (DBE)	`DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`	An uncorrectable memory error. The hardware caught corruption it could not fix. Treat as critical — the card’s memory is suspect.
ECC single-bit (SBE)	`DCGM_FI_DEV_ECC_SBE_VOL_TOTAL`	A corrected memory error. The hardware fixed it transparently. A few are normal; a rising rate is an early sign of degrading memory.

The short version: SBE is a warning, DBE is critical, and any XID deserves a look. The events table is your hardware’s flight recorder — if a card starts misbehaving, this is where the first evidence shows up.
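To make "a rising rate" concrete, one option is to compare how much the SBE counter grew in two consecutive windows. The sketch below assumes you already have ordered counter samples (for example, from your own Prometheus queries). The function names are hypothetical, and the definition of "rising" used here is an assumption rather than anything Daalu computes. Volatile counters typically restart from zero when the driver reloads, so a drop is treated as a reset instead of a negative change.

```python
def counter_increase(samples):
    """Total increase across ordered counter samples, tolerating resets."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted; count what accrued since.
        total += cur - prev if cur >= prev else cur
    return total


def rate_is_rising(previous_window, current_window):
    """True when the counter grew more in the current window than the previous one."""
    return counter_increase(current_window) > counter_increase(previous_window)
```

A counter that stays flat, or grows by the same small amount each window, would not trip this check; one that grows faster each window would.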

The alerts list

Alongside events, the Alerts list shows GPU alert rules currently firing or pending, scoped to the card you’re viewing (the overview shows them across all your cards). Each row carries a severity (info / warning / critical) or a pending badge, the rule name and summary, and how long ago it started.

These are the same alerts that drive notifications elsewhere in Daalu — the AI Factory just shows you the GPU-specific ones in context. Operator-wide GPU alerts with no tenant label are shown to owners and providers; alerts carrying another tenant’s label never appear.
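The visibility rule for owners and providers can be sketched as a filter. The alert shape (a dict with a `labels` mapping) and the `tenant` label name are assumptions made for illustration; the real filtering happens on the server and may be implemented differently.

```python
def visible_alerts(alerts, viewer_tenant, label="tenant"):
    """Keep the viewer's own alerts plus operator-wide ones with no tenant label.

    Intended for owner/provider viewers; the label name is an assumption.
    """
    shown = []
    for alert in alerts:
        tenant = alert.get("labels", {}).get(label)
        if tenant is None or tenant == viewer_tenant:
            shown.append(alert)
    return shown
```

An alert labelled for any other tenant is dropped before the list is built, so it never reaches the viewer.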

The health model

Every coloured dot, the per-card health, and the header pill come from one deterministic rule applied to each GPU. There’s no fuzzy scoring — just thresholds:

Health	Condition (any of)
Critical (`crit`)	Any XID error, or any uncorrectable ECC (DBE), or temperature ≥ 90 °C
Degraded (`warn`)	Temperature ≥ 85 °C, or VRAM ≥ 97%
Healthy (`ok`)	None of the above

A few things worth internalising:

A single hardware error flips you to critical. One XID or one DBE is enough. This is intentional — these are the events that precede real failures.
The VRAM warning fires at 97%, not 90%. Because the serving stack intends to hold ~90% of VRAM, a high number is the normal resting state. Degraded triggers only when you’re genuinely close to the ceiling.
The header pill is the worst of your cards. If you run several, one critical card makes the whole pill critical — so you notice.
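The thresholds in the table translate directly into code. This is a minimal sketch of the rule as described above, not Daalu's source: the function and parameter names are hypothetical, and returning `ok` for an empty list of GPUs is an assumption.

```python
SEVERITY = {"ok": 0, "warn": 1, "crit": 2}


def gpu_health(temp_c, mem_pct, xid_errors=0, ecc_dbe=0):
    """Apply the health table to one GPU and return 'crit', 'warn' or 'ok'."""
    if xid_errors > 0 or ecc_dbe > 0 or temp_c >= 90:
        return "crit"
    if temp_c >= 85 or mem_pct >= 97:
        return "warn"
    return "ok"


def header_pill(healths):
    """The header pill is the worst health across all of a tenant's GPUs."""
    # default="ok" for a tenant with no GPUs is an assumption.
    return max(healths, key=SEVERITY.__getitem__, default="ok")
```

For example, a GPU at 41 °C and 90% VRAM with no errors evaluates to `ok`, while the same GPU with a single XID evaluates to `crit` and pulls the header pill to `crit` with it.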

Tenant scoping — why you only see your own GPUs

You never pick a tenant filter, and you can’t see another tenant’s hardware. Every query the page issues is rewritten on the server to carry your tenant’s label before it reaches Prometheus. The metrics, events, alerts, trend, and reliability endpoints all run through the same scoping. A consumer’s response deliberately omits the provider card’s raw utilisation entirely — it isn’t filtered out in the browser, it’s never sent.

This is the same isolation guarantee the rest of Daalu enforces at every endpoint (Chapter 10).
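To illustrate what "rewritten on the server" can mean, here is a deliberately simplified sketch that forces a tenant matcher onto a plain Prometheus selector. The `tenant` label name and the function are assumptions. It only handles a bare metric name with simple matchers (no commas inside label values, no functions or operators), whereas a production implementation would work on a parsed PromQL expression.

```python
import re

_SELECTOR = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<matchers>.*)\})?$"
)


def scope_selector(selector, tenant, label="tenant"):
    """Return the selector with exactly one matcher pinning it to `tenant`."""
    m = _SELECTOR.match(selector.strip())
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    parts = [p.strip() for p in (m.group("matchers") or "").split(",") if p.strip()]
    # Drop any caller-supplied matcher on the tenant label so it cannot be spoofed.
    own = re.compile(rf"{re.escape(label)}\s*(=~|!~|!=|=)")
    kept = [p for p in parts if not own.match(p)]
    escaped = tenant.replace("\\", "\\\\").replace('"', '\\"')
    kept.append(f'{label}="{escaped}"')
    return m.group("name") + "{" + ",".join(kept) + "}"
```

Under these assumptions, `scope_selector('DCGM_FI_DEV_GPU_TEMP{gpu="0", tenant="other"}', "acme")` yields `DCGM_FI_DEV_GPU_TEMP{gpu="0",tenant="acme"}`: whatever the client asked for, the query that reaches Prometheus carries the caller's own tenant.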

The read-only API

Everything the page shows is also available over the API, scoped to your token’s tenant. These endpoints are read-only — handy for a status board or a scheduled check. Authenticate with a personal access token (Chapter 43).

# Live summary of your GPUs (the metric cards + health)
curl -s https://ops.daalu.io/api/ai-factory/gpu/summary \
  -H "Authorization: Bearer $DAALU_TOKEN"

{
  "gpus": [
    {
      "gpu": "0",
      "model": "NVIDIA RTX 2000 Ada Generation",
      "temp_c": 41.0,
      "util_pct": 0.0,
      "mem_used_gb": 14.4,
      "mem_total_gb": 16.0,
      "mem_pct": 90.0,
      "power_w": 19.0,
      "sm_active_pct": 0.0,
      "health": "ok"
    }
  ],
  "updated_at": "2026-06-07T12:00:00+00:00"
}

Other read-only endpoints under /api/ai-factory:

Endpoint	Returns
`GET /gpu/summary`	The live per-GPU metric snapshot above.
`GET /gpu/timeseries?metric=util&range=6h`	Trend points for one metric/range (`metric` = `util`/`temp`/`mem`/`power`; `range` = `1h`/`6h`/`24h`/`7d`). Optional `gpu=<index>`.
`GET /gpu/events`	The XID / ECC events table.
`GET /alerts`	Firing / pending GPU alerts.

The full surface is in Chapter 53 — API reference.
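For a scheduled check, the same calls can be made from a script. The sketch below uses only the Python standard library and the endpoints and parameters listed above. The helper names are hypothetical, and because the shape of the timeseries response is not shown in this chapter, the code returns the decoded JSON as-is rather than assuming field names.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://ops.daalu.io/api/ai-factory"
METRICS = {"util", "temp", "mem", "power"}
RANGES = {"1h", "6h", "24h", "7d"}


def timeseries_url(metric, range_, gpu=None):
    """Build the /gpu/timeseries URL, rejecting values the table does not list."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    if range_ not in RANGES:
        raise ValueError(f"unknown range: {range_!r}")
    params = {"metric": metric, "range": range_}
    if gpu is not None:
        params["gpu"] = str(gpu)
    return f"{BASE_URL}/gpu/timeseries?{urllib.parse.urlencode(params)}"


def get_json(url, token, timeout=10):
    """GET a read-only endpoint with a bearer token and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A status board could then call `get_json(f"{BASE_URL}/gpu/summary", token)` on a timer and read the `health` field of each entry in `gpus`.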

You can now read every signal on the floor. The next chapter is for admins who want to act on it — run diagnostics, validate the pipeline, and see the reliability posture protecting your card.

Next: Chapter 33 — Diagnostics and reliability
