30. The AI Factory
Live observability for the GPUs that manufacture your tokens — no Grafana detour required.
Every answer the Assistant gives, every alert it triages, every line the coding agent writes is manufactured: a GPU turns your prompt into tokens. The AI Factory is the page where you watch that factory floor — utilisation, thermals, memory, hardware health, and the diagnostics that keep it honest — in the same app you do everything else, scoped so you only ever see your own hardware.
At a glance
- What it is. A native, role-aware dashboard for the GPUs serving your inference: metrics, events, alerts, diagnostics, reliability, and load testing.
- Where to find it. AI Factory in the left sidebar (/ai-factory).
- Who can use it. Anyone in a tenant that owns, provides, or consumes GPU capacity. Diagnostics and validation are gated to tenant admins.
This chapter is the page tour. The metrics deep-dive lives in Chapter 32, admin diagnostics in Chapter 33, and load testing in Chapter 34. If you haven’t connected a GPU yet, start at Chapter 31.
Why it exists
Operators who run their own GPUs already have dcgmi, Grafana, and a
Prometheus full of DCGM_FI_DEV_* series. The problem isn’t the data
— it’s the context-switch. To answer “is my card healthy?” you’d leave
Daalu, open Grafana, find the right dashboard, and remember which row
is yours.
The AI Factory collapses that. It reads the same DCGM metrics through Prometheus, but renders them natively and tenant-scoped on the server. There is no dashboard to pick and no way to see another tenant’s hardware — the backend resolves what you’re allowed to see before a single series is returned.
The header — role, class, health
Three things sit at the top of the page and tell you almost everything at a glance.
- Role badge. The backend resolves your relationship to GPU capacity into one of four roles and renders the page accordingly:

  | Role | What it means | What you see |
  | --- | --- | --- |
  | Owner | Your tenant owns a dedicated GPU. | Full hardware view. |
  | Provider | You contribute a GPU to a shared pool. | Full hardware view, plus a pool note. |
  | Consumer | You draw on a shared pool you don’t own. | A usage view, never someone else’s hardware. |
  | none | No GPU relationship yet. | An explainer and a link to get started. |

- GPU-class chip. A small monospace tag, ada-16 (RTX 2000 Ada, 16 GB) or ada-48 (RTX 6000 Ada, 48 GB), that tells you which hardware tier is behind your tokens. See Chapter 31 for what each class can serve.
- Health pill. For owners and providers, a live Healthy / Degraded / Critical pill summarising the worst card you have. It reflects the same health model the per-GPU dots use (thresholds in Chapter 32).
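The "worst card wins" rule behind the health pill can be sketched as a maximum over per-GPU states. This is a minimal illustration, not the backend's implementation: the Health enum and the overall_health function are hypothetical names, and the per-GPU states are taken as given (the thresholds that produce them are in Chapter 32).

```python
from enum import IntEnum
from typing import List


class Health(IntEnum):
    """Hypothetical severity ordering: a higher value is worse."""
    HEALTHY = 0
    DEGRADED = 1
    CRITICAL = 2


def overall_health(per_gpu: List[Health]) -> Health:
    """Return the worst state across all GPUs (assumed rollup rule)."""
    # With no GPUs reported, this sketch falls back to HEALTHY; a real
    # backend may well treat "no data" as a state of its own.
    return max(per_gpu, default=Health.HEALTHY)


pill = overall_health([Health.HEALTHY, Health.DEGRADED, Health.HEALTHY])
print(pill.name.capitalize())  # Degraded
```

Ordering the states as integers keeps the rollup to a single max call, so one degraded card is enough to change the pill even when every other card is healthy.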
The page polls itself: the overview every 30 seconds, the live summary and alerts every 15 seconds. You can leave it open as a wall display.
The hardware view (owner / provider)
If you own or provide a GPU, the page opens on a list of your cards — one selectable card each, showing model, host, and three mini-stats (utilisation, temperature, VRAM %) plus a coloured health dot. Below the list sit your firing alerts and the Reliability panel.
Click any card to drill into its detail view:
- Five metric cards — Temperature, Utilisation, VRAM (used / total GB), Power (W), and SM-active %. When a card has more than one GPU, a per-GPU breakdown table appears underneath.
- A Trend chart — pick a metric (utilisation, temperature, memory, power) and a range (1h / 6h / 24h / 7d) to see how the card has behaved over time.
- An XID / ECC events table — hardware error events, or a reassuring “none recorded”.
- A firing / pending alerts list scoped to that card.
Chapter 32 explains every one of these in detail.
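As a rough illustration of what a Trend chart request could translate to, the sketch below builds the parameters for a Prometheus range query (the /api/v1/query_range endpoint takes query, start, end, and step). The series names are standard dcgm-exporter fields, but the mapping from the metric picker to those series, the step sizes, and the trend_params helper are all assumptions made for this example. Tenant scoping is left out here.

```python
import time
from typing import Optional

# Assumed mapping from the chart's metric picker to dcgm-exporter series.
METRIC_SERIES = {
    "utilisation": "DCGM_FI_DEV_GPU_UTIL",   # percent
    "temperature": "DCGM_FI_DEV_GPU_TEMP",   # degrees Celsius
    "memory": "DCGM_FI_DEV_FB_USED",         # MiB of framebuffer in use
    "power": "DCGM_FI_DEV_POWER_USAGE",      # watts
}

# Range -> (window in seconds, step in seconds). The steps are illustrative,
# chosen so that every range yields a few hundred points.
RANGES = {
    "1h": (3600, 15),
    "6h": (6 * 3600, 60),
    "24h": (24 * 3600, 300),
    "7d": (7 * 24 * 3600, 1800),
}


def trend_params(metric: str, range_key: str, now: Optional[float] = None) -> dict:
    """Build query_range parameters for one metric over one range."""
    window, step = RANGES[range_key]
    end = time.time() if now is None else now
    return {
        "query": METRIC_SERIES[metric],
        "start": end - window,
        "end": end,
        "step": step,
    }


params = trend_params("temperature", "6h")
print(params["query"], params["step"])  # DCGM_FI_DEV_GPU_TEMP 60
```

Widening the step with the range keeps the number of returned points roughly constant, which is a common way to stop a 7d chart from being far heavier than a 1h one.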
Why it matters: high VRAM at 0% utilisation is normal. The first time you open the detail view you may see ~90% of VRAM in use while utilisation reads 0%. That is not a fault. The vLLM server that serves your model reserves roughly 90% of VRAM up front — the model weights plus a pre-allocated KV-cache pool — and holds it whether or not requests are flowing. Utilisation tracks live inference work, so it sits at 0% while the model is loaded but idle and climbs only as traffic hits the card. The page calls this out inline so nobody pages an engineer over a healthy idle GPU.
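The arithmetic behind that reading is short enough to work through. The sketch below assumes the 16 GB ada-16 class, a 0.90 reserve fraction (the default of vLLM's gpu_memory_utilization setting), and a purely hypothetical 8 GB of model weights. The vram_budget helper is illustrative, and it ignores activation and runtime overhead, so the real KV-cache pool would be somewhat smaller.

```python
def vram_budget(total_gb: float, weights_gb: float, reserve_fraction: float = 0.90) -> dict:
    """Split a card's VRAM into the up-front reservation and what is left for KV cache."""
    reserved = total_gb * reserve_fraction
    kv_cache = reserved - weights_gb
    if kv_cache < 0:
        raise ValueError("model weights do not fit inside the reserved VRAM")
    return {
        "reserved_gb": reserved,
        "kv_cache_gb": kv_cache,
        "vram_used_pct": 100.0 * reserved / total_gb,
    }


# Hypothetical example: 16 GB card, 8 GB of weights, no traffic at all.
budget = vram_budget(total_gb=16.0, weights_gb=8.0)
print(f"{budget['reserved_gb']:.1f} GB reserved, "
      f"{budget['kv_cache_gb']:.1f} GB of it for KV cache, "
      f"VRAM shows {budget['vram_used_pct']:.0f}% used")
# 14.4 GB reserved, 6.4 GB of it for KV cache, VRAM shows 90% used
```

None of these numbers depend on request volume, which is why the VRAM card can read about 90% while utilisation reads 0%.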
The usage view (consumer)
If your tenant draws on a shared pool you don’t own, you get a usage-centric panel instead — and you never see another tenant’s hardware health. It shows:
- Tokens in / out and Requests for the current period.
- Average latency across your calls.
- A quota bar (used / limit) that turns red as you approach your cap.
- Shared-pool utilisation, when available, so you can tell whether a slow response is your quota or pool pressure.
This is the right view for teams on the Daalu-hosted tier: you care about throughput and quota, not fan curves on a card you don’t run.
Not connected yet (role “none”)
If your tenant has no GPU relationship, the page shows a short
explainer and a button to Usage & Pricing (/billing), where you
either connect your own GPU or pick a plan that routes you to a pool.
Once either is live, the factory floor lights up here automatically.
Admin and operator panels
Three more panels appear for the right people, always on the overview (never inside a single card’s detail):
- Diagnostics (tenant admins): run dcgmi diag or NCCL tests on demand. Stressful runs require an explicit acknowledgement.
- Validate observability stack (tenant admins): a read-only self-check that the whole metrics pipeline is wired up.
- Performance benchmarking / AIPerf (GPU owners, providers, and site operators): a load test that produces an SLO and capacity curve.
Diagnostics and validation are covered in Chapter 33; AIPerf in Chapter 34.
How tenant-scoping works
You never choose a tenant filter, because there isn’t one to choose.
Every metrics query the page makes is rewritten on the server to carry
your tenant label before it reaches Prometheus. A consumer endpoint
deliberately returns usage numbers and null for pool hardware
utilisation — the provider’s raw card health is simply not in the
response. The isolation is structural, not cosmetic.
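To make the idea concrete, here is a minimal sketch of server-side scoping. It is not Daalu's implementation: rather than rewriting an arbitrary query, it takes the simpler route of building the selector on the server, so the tenant matcher always comes from the authenticated session and never from the client. The label name tenant, the scoped_selector helper, and the example tenant id are all assumptions.

```python
import re
from typing import Dict, Optional

_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
_LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def _escape(value: str) -> str:
    """Escape a label value for use inside a double-quoted PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def scoped_selector(metric: str, tenant_id: str,
                    extra: Optional[Dict[str, str]] = None) -> str:
    """Build a PromQL selector that always carries the caller's tenant label."""
    if not _METRIC_NAME.fullmatch(metric):
        raise ValueError(f"invalid metric name: {metric!r}")
    labels = dict(extra or {})
    # The server-side value always wins, even if the client sent its own
    # "tenant" label (hypothetical label name).
    labels["tenant"] = tenant_id
    parts = []
    for name in sorted(labels):
        if not _LABEL_NAME.fullmatch(name):
            raise ValueError(f"invalid label name: {name!r}")
        parts.append(f'{name}="{_escape(labels[name])}"')
    return metric + "{" + ",".join(parts) + "}"


# A client-supplied tenant label is overwritten, not merged.
print(scoped_selector("DCGM_FI_DEV_GPU_UTIL", "acme", {"gpu": "0", "tenant": "other"}))
# DCGM_FI_DEV_GPU_UTIL{gpu="0",tenant="acme"}
```

Building the selector from validated parts sidesteps parsing PromQL altogether. A system that accepts free-form queries would need a real PromQL parser to inject the matcher safely.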
The AI Factory is the observability half of running inference on your own hardware. The conceptual model behind it — owners, providers, consumers, and how routing chooses a tier — is Chapter 16. To put a GPU behind it, read the next chapter.