33. Diagnostics and reliability

On-demand health checks, a one-click pipeline self-test, and the auto-remediation posture watching your card — for the people who keep the factory running.

The observability in Chapter 32 tells you what a card is doing. This chapter is about acting on it: running diagnostics when you suspect a fault, confirming your metrics pipeline is trustworthy, and understanding the reliability layer that contains GPU faults before they take down your inference.

Diagnostics and observability validation are tenant-admin tools. The Reliability panel is read-only and visible to owners, providers, and tenant admins.

Diagnostics (admins)

The Diagnostics panel runs real, on-demand checks against your GPU node. Two kinds:

dcgmi diag — NVIDIA’s Data Center GPU Manager diagnostic, at one of three levels:

Level	Depth	Duration and impact
1 — quick	A fast health pass.	Seconds; safe to run anytime.
2 — medium	Deeper functional checks.	Minutes; puts real load on the card.
3 — long	The full stress + diagnostic suite.	Longest; heavy sustained load.

NCCL test — exercises the GPU-to-GPU interconnect. This only makes sense with more than one GPU, so on a single-card node the button is disabled with an explanatory note.
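For orientation, the three levels correspond to the `-r` flag of NVIDIA's `dcgmi diag` command. The sketch below builds that command line and mirrors the single-card rule for the NCCL test. It is an illustration only: the function names and the level keys are hypothetical, and it makes no claim about how Daalu invokes the tool internally.

```python
from typing import List

# Level names follow the table above; the number is what `dcgmi diag -r` receives.
DIAG_LEVELS = {"quick": 1, "medium": 2, "long": 3}


def dcgmi_diag_command(level: str) -> List[str]:
    """Build the argv for a DCGM diagnostic at the named level (hypothetical helper)."""
    if level not in DIAG_LEVELS:
        raise ValueError(f"unknown diagnostic level: {level!r}")
    return ["dcgmi", "diag", "-r", str(DIAG_LEVELS[level])]


def nccl_test_available(gpu_count: int) -> bool:
    """The interconnect test needs more than one GPU to have anything to exercise."""
    return gpu_count > 1
```

Calling `dcgmi_diag_command("quick")` yields the argv for a Level-1 pass, and `nccl_test_available(1)` returns `False`, which matches the disabled button on a single-card node.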

Each run is recorded and shows up in the runs list, moving through pending → running → passed / failed. Expand a run to see the exact command and its full output; a failure surfaces its reason inline so you don’t have to dig. The list auto-refreshes while anything is still running.
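The run lifecycle above is a small state machine. A minimal sketch of it follows; the class and function names are hypothetical, and treating a pending run as still "in flight" for auto-refresh purposes is an assumption rather than documented behavior.

```python
from enum import Enum
from typing import Iterable


class RunStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PASSED = "passed"
    FAILED = "failed"


# Allowed moves: pending -> running -> passed | failed. Terminal states have none.
TRANSITIONS = {
    RunStatus.PENDING: {RunStatus.RUNNING},
    RunStatus.RUNNING: {RunStatus.PASSED, RunStatus.FAILED},
    RunStatus.PASSED: set(),
    RunStatus.FAILED: set(),
}


def advance(current: RunStatus, new: RunStatus) -> RunStatus:
    """Return the new status, rejecting any move the lifecycle does not allow."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


def should_keep_refreshing(statuses: Iterable[RunStatus]) -> bool:
    """Keep polling while any run has not reached a terminal state.

    Assumption: pending runs count as in flight alongside running ones.
    """
    return any(TRANSITIONS[s] for s in statuses)
```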

The acknowledgement gate

A quick Level-1 pass is harmless. The deeper runs are not — they deliberately load the card, which spikes latency for anything else it’s serving.

Warning: stressful runs require explicit acknowledgement. When you start a higher-level dcgmi diag or an NCCL test, Daalu does not run it immediately. The request comes back with a warning describing the impact, and you must re-submit with an explicit “I understand this loads the GPU” confirmation before it executes. This is a shared production card — the gate exists so a heavy diagnostic is never one stray click away. Run deep diagnostics off-peak, or against a candidate card before you put it into rotation.

Under the hood this is an HTTP 412 (Precondition Failed) response, which Daalu uses to signal “confirmation required.” The UI turns it into a confirmation dialog, and the run only proceeds once you acknowledge it.
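The two-step flow can be sketched as follows. Everything here is a stand-in: `submit_diagnostic` simulates the server in-process, the 202 success status and the `acknowledged` parameter are assumptions made for illustration, and none of the names reflect Daalu's actual API. Only the 412 status and the re-submit-after-confirmation behavior come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

HTTP_ACCEPTED = 202              # assumed success status for this sketch
HTTP_PRECONDITION_FAILED = 412   # the "confirmation required" response


@dataclass
class Response:
    status: int
    message: str


def submit_diagnostic(kind: str, level: int = 1, acknowledged: bool = False) -> Response:
    """Simulated server: NCCL tests and dcgmi levels above 1 need an acknowledgement."""
    stressful = kind == "nccl" or (kind == "dcgmi" and level > 1)
    if stressful and not acknowledged:
        return Response(HTTP_PRECONDITION_FAILED,
                        "This run loads the GPU. Re-submit with acknowledgement.")
    return Response(HTTP_ACCEPTED, "run queued")


def start_run(kind: str, level: int, confirm: Callable[[str], bool]) -> Optional[Response]:
    """Client flow: submit once, and on 412 ask the user before re-submitting."""
    first = submit_diagnostic(kind, level)
    if first.status != HTTP_PRECONDITION_FAILED:
        return first
    if not confirm(first.message):
        return None  # user declined, so nothing was executed
    return submit_diagnostic(kind, level, acknowledged=True)
```

The `confirm` callback plays the role of the dialog: it receives the warning text and returns whether the user accepted it. A Level-1 run never reaches the callback.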

Validate the observability stack (admins)

Before you trust any dashboard, you want to know the pipeline feeding it is actually wired up. The Validate observability stack panel runs a read-only end-to-end self-check and renders a pass / fail / skip checklist:

Check	Confirms
Prometheus configured	Daalu has a metrics source to query at all.
dcgm-exporter target up	The DCGM exporter is being scraped (its scrape target is healthy).
DCGM series present	GPU metric series actually exist in Prometheus.
Per-tenant labels present	Your card’s series carry your tenant label — the basis of scoping. (Skipped if your tenant owns no card.)
Alert pipeline alive	The alerting path is delivering (the always-firing heartbeat alert is present).

Nothing here changes state — it only reads. Run it after connecting a GPU, or whenever the AI Factory looks empty and you want to know whether it’s “no traffic” or “broken plumbing.” A clean checklist means the floor is wired up and the numbers can be trusted.
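To make the checklist concrete, here is a rough sketch of how such a self-check could be evaluated. It rests on several assumptions that are not taken from Daalu: the `query` callable (PromQL string in, number of matching series out), the `dcgm-exporter` job name, the `tenant` label name, the `Watchdog` heartbeat alert name, and the choice to mark downstream checks as skipped when no metrics source is configured. `DCGM_FI_DEV_GPU_TEMP` is a standard dcgm-exporter metric, used here as a representative GPU series.

```python
from typing import Callable, Dict, Optional

# Hypothetical query function: PromQL expression -> number of series returned.
Query = Callable[[str], int]


def validate_stack(query: Optional[Query], tenant: Optional[str]) -> Dict[str, str]:
    """Return a mapping of check name to 'pass', 'fail', or 'skip'. Read-only."""
    if query is None:
        # Assumption: with no metrics source, the remaining checks cannot run.
        return {
            "prometheus_configured": "fail",
            "exporter_target_up": "skip",
            "dcgm_series_present": "skip",
            "tenant_labels_present": "skip",
            "alert_pipeline_alive": "skip",
        }

    def check(promql: str) -> str:
        return "pass" if query(promql) > 0 else "fail"

    return {
        "prometheus_configured": "pass",
        "exporter_target_up": check('up{job="dcgm-exporter"} == 1'),
        "dcgm_series_present": check("DCGM_FI_DEV_GPU_TEMP"),
        # Skipped when the tenant owns no card, as in the table above.
        "tenant_labels_present": (
            "skip" if tenant is None
            else check(f'DCGM_FI_DEV_GPU_TEMP{{tenant="{tenant}"}}')
        ),
        "alert_pipeline_alive": check('ALERTS{alertname="Watchdog", alertstate="firing"}'),
    }
```

Because the function only issues queries, it shares the property the panel relies on: running it cannot change the state of the stack.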

Reliability (read-only)

The Reliability panel surfaces the automated layer keeping your card alive. It’s role-scoped — you see the posture for your card — and entirely read-only.

Health signals

Three signals, derived from the same DCGM stream as everything else, each with its own ok / warn / crit level:

XID errors: any nonzero count is critical.
Uncorrectable ECC (DBE): any nonzero count is critical.
Max temperature: warn at ≥ 85 °C, critical at ≥ 90 °C.

These are the exact conditions a fault-containment system acts on, so the panel shows them right next to the system that acts.
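The thresholds above translate directly into a classification function. This is a minimal sketch with hypothetical function and key names; only the thresholds themselves come from the list above.

```python
from typing import Dict


def level_for_count(count: int) -> str:
    """XID errors and uncorrectable ECC: any nonzero count is critical."""
    return "crit" if count > 0 else "ok"


def level_for_temperature(max_temp_c: float) -> str:
    """Warn at 85 degrees C or above, critical at 90 degrees C or above."""
    if max_temp_c >= 90:
        return "crit"
    if max_temp_c >= 85:
        return "warn"
    return "ok"


def health_signals(xid_errors: int, ecc_dbe: int, max_temp_c: float) -> Dict[str, str]:
    """Classify the three signals independently, one level per signal."""
    return {
        "xid_errors": level_for_count(xid_errors),
        "ecc_dbe": level_for_count(ecc_dbe),
        "max_temperature": level_for_temperature(max_temp_c),
    }
```

For example, a card with no XID or ECC errors running at 87 °C reports `ok`, `ok`, and `warn` respectively.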

NVSentinel auto-remediation

NVSentinel is GPU-fault containment. It watches the DCGM fault stream (XID / ECC / thermal) and, when a card faults, can cordon and recover the affected node in seconds — the reliability layer behind a bring-your-own-GPU SLA. The panel shows whether it’s active, what mode it’s in, and a count of remediations it has taken.

Note: NVSentinel ships in observe / dry-run mode first. A system that can cordon and reboot nodes is one you want to trust before you let it act. So it’s piloted in observe mode — it watches the fault stream and reports what it would do — before it’s promoted to actually drive remediation. While it’s in observe mode (or not deployed), GPU faults page a human via the runbook instead.
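One way to picture the difference between the modes is as a mapping from mode to what happens when a fault arrives. The sketch below is purely illustrative: the mode strings, the function, and the action descriptions are hypothetical and do not represent NVSentinel's configuration or output.

```python
from typing import List

RUNBOOK_PAGE = "page a human via the runbook"


def planned_actions(mode: str, fault: str) -> List[str]:
    """Describe the response to a fault under each (hypothetical) mode name."""
    if mode == "active":
        # Promoted: the system drives remediation itself.
        return [f"cordon node ({fault})", "recover node"]
    if mode == "observe":
        # Dry run: report the intended action, leave the node alone, page a human.
        return [f"report: would cordon and recover node ({fault})", RUNBOOK_PAGE]
    if mode == "not_deployed":
        return [RUNBOOK_PAGE]
    raise ValueError(f"unknown mode: {mode!r}")
```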

Just as important: the Daalu hub only ever reads NVSentinel’s metrics — it never drives remediation. The containment decision lives next to the hardware, in your cluster. The hub reports posture; it does not pull the lever.

cuda-checkpoint (gated)

The panel also shows a cuda-checkpoint status. CUDA checkpoint/restore can snapshot and resume GPU state — powerful for draining a card without losing in-flight work — but it is proprietary NVIDIA software under its own EULA. Daalu shows the capability so you know it exists, but it’s gated behind legal sign-off and off by default. Until it’s cleared it reads “gated — legal sign-off” and does nothing. We’d rather show you an honest “not yet” than ship something that hasn’t cleared licensing.

Who can do what

Action	Owner / Provider	Tenant admin	Consumer
View reliability posture	✓	✓	—
Run dcgmi diag / NCCL	—	✓	—
Acknowledge a stressful run	—	✓	—
Validate observability stack	—	✓	—

Diagnostics and validation act on the GPU node as a whole, so they live on the AI Factory overview — not inside a single card’s detail view. A tenant must own or provide a card for diagnostics to have something to run against.
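The permission table can be read as a simple lookup. The sketch below encodes it as data; the role and action identifiers are hypothetical names chosen for illustration, not Daalu's internal ones.

```python
from typing import Dict, Set

# Action -> roles allowed to perform it, transcribed from the table above.
PERMISSIONS: Dict[str, Set[str]] = {
    "view_reliability_posture": {"owner_provider", "tenant_admin"},
    "run_diagnostics": {"tenant_admin"},
    "acknowledge_stressful_run": {"tenant_admin"},
    "validate_observability_stack": {"tenant_admin"},
}


def can(role: str, action: str) -> bool:
    """Return True if the role may perform the action. Unknown actions are denied."""
    return role in PERMISSIONS.get(action, set())
```

Denying unknown actions by default is a design choice of the sketch: a typo in an action name fails closed rather than granting access.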

With diagnostics and reliability covered, the last operational tool is load testing — proving how much your endpoint can actually take before latency degrades.

Next: Chapter 34 — AIPerf load testing
