31. Connecting your GPU
Bring your own NVIDIA card. Daalu routes inference to it first, at zero per-call cost, and lights up the AI Factory around it.
Most of Daalu’s AI runs on inference you don’t have to think about. But if you already own GPUs — or you have a hard requirement that prompts never touch a third-party model provider — you can put your own NVIDIA card behind the platform. Once it’s connected, Daalu routes inference to it first, the AI Factory starts streaming its health, and the coding workspace becomes available.
This chapter is the path from “I have a GPU” to “Daalu is using it.” It is deliberately free of cluster internals — the heavy lifting is a Helm chart and a card on the Usage & Pricing page.
Supported GPU classes
Daalu describes GPUs by class — a label that captures the card model and, more importantly, how much VRAM it has, which decides what you can serve.
| Class | Card | VRAM | What it can serve |
|---|---|---|---|
| ada-16 | RTX 2000 Ada | 16 GB | The everyday 14B coder (AWQ-INT4), classifier-tier inference, alert triage. The single-card workhorse. |
| ada-48 | RTX 6000 Ada | 48 GB | Everything ada-16 does, plus the larger 32B dense coder and the long-context Qwen3-Coder MoE. |
The class shows up as a chip in the AI Factory header and decides which coding models you can pick. A 16 GB card serves one model exclusively; a 48 GB card has the headroom for the bigger models and longer context windows.
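To make the link between VRAM and "what you can serve" concrete, here is a rough back-of-the-envelope sketch. It estimates weight memory as parameters times bits per weight. The 4-bit width for the 14B coder comes from the AWQ-INT4 label in the table; the 4-bit width for the 32B coder, the function names, and the decision to ignore KV cache and runtime overhead are assumptions made for illustration, not Daalu's sizing logic.

```python
# Rough weight-memory estimate: billions of parameters x bits per weight / 8.
# Illustrative only: real serving also needs VRAM for KV cache, activations,
# and runtime overhead, which this sketch deliberately leaves out.

GPU_CLASSES = {"ada-16": 16, "ada-48": 48}  # class label -> VRAM in GB (from the table)


def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def headroom_gb(gpu_class: str, params_billion: float, bits_per_weight: int) -> float:
    """VRAM left for KV cache and overhead after loading the weights."""
    return GPU_CLASSES[gpu_class] - weight_gb(params_billion, bits_per_weight)


# 14B coder at 4 bits (AWQ-INT4, per the table).
print(weight_gb(14, 4))              # 7.0
print(headroom_gb("ada-16", 14, 4))  # 9.0
# 32B at 4 bits is an assumption; the text does not state its quantization.
print(headroom_gb("ada-16", 32, 4))  # 0.0, nothing left for KV cache
print(headroom_gb("ada-48", 32, 4))  # 32.0
```

Under these assumptions the 14B coder leaves a 16 GB card several gigabytes for context, while a 32B model would consume the whole card before serving a single token, which matches the split between the two classes.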
Tip. You don’t need a data-center GPU to get real value. A single RTX 2000 Ada is enough to take the high-volume, low-stakes inference — alert classification, log triage, the everyday coding model — off commercial LLM billing entirely.
What “owner” vs “provider” means
When you connect a GPU, Daalu records your tenant’s relationship to it (the model behind this is covered in Chapter 16):
- Owner — the card is dedicated to your tenant. Your inference runs on it; nobody else’s does. This is the default when you connect a GPU for your own use.
- Provider — you contribute the card to a shared pool that other tenants can draw on. You still see its full hardware health; you also see pool-level signals.
Either way you get the full hardware view in the AI Factory. The difference is who else’s tokens your silicon manufactures.
The two-step path
Connecting a GPU is two moves: join your cluster to Daalu, then deploy the serving stack onto it from the UI.
Step 1 — Join your cluster via Daalu Edge
Daalu reaches your GPU the same way it reaches anything behind your firewall: an outbound-only WireGuard tunnel opened by the Daalu Edge agent. You never open an inbound port and never hand over a long-lived credential.
The mechanics are covered end-to-end in Chapter 15 — Cluster federation and Chapter 41 — Deploying Daalu Edge. In short:
- In the operator app, request an invite for a new cluster. You get a single-use bootstrap token, valid for one hour.
- Install the Daalu Edge Helm chart (oci://ghcr.io/daalu/charts/daalu-edge) into the cluster that has the GPU, passing that token.
- The agent dials home over HTTPS, exchanges the token for a long-lived tunnel config, and brings up WireGuard. The cluster row turns connected within ~30 seconds.
Your GPU node needs a working NVIDIA driver and the standard GPU add-ons your cluster already uses to schedule GPU pods. If kubectl get nodes shows the GPU node Ready and your cluster can already run a GPU workload, you’re ready for Step 2.
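If you would rather script the readiness check than eyeball it, the sketch below is one way to do it. It assumes your cluster exposes GPUs through the NVIDIA device plugin (or GPU Operator), which advertises them as the nvidia.com/gpu resource; if your cluster uses a different resource name, adjust accordingly. The helper names are ours, not part of Daalu.

```python
import json
import subprocess


def ready_gpu_nodes(node_list: dict) -> list[str]:
    """Return names of nodes that are Ready and advertise nvidia.com/gpu."""
    names = []
    for node in node_list.get("items", []):
        status = node.get("status", {})
        is_ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        # Allocatable quantities are strings in the Kubernetes API, e.g. "1".
        gpus = status.get("allocatable", {}).get("nvidia.com/gpu", "0")
        if is_ready and int(gpus) > 0:
            names.append(node["metadata"]["name"])
    return names


def kubectl_node_list() -> dict:
    """Fetch the node list using the current kubectl context (kubectl must be on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling ready_gpu_nodes(kubectl_node_list()) against the cluster you just joined should return at least one node name. An empty list usually means the driver or the GPU add-ons are not in place yet.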
Step 2 — Deploy the serving stack from Usage & Pricing
With the cluster connected, the rest happens in the browser.
- Open Usage & Pricing (/billing) and find the local-GPU onboarding card.
- Point it at your connected cluster and pick the GPU class. Daalu deploys the vLLM serving stack onto the joined cluster for you; you don’t hand-write manifests.
- Watch the card’s status. It moves from configured to healthy once the model has loaded and Daalu has successfully called the endpoint. When it’s healthy you’ll see the base URL Daalu is using and the served model name.
The serving stack hosts an open-weight model over an OpenAI-compatible API that only Daalu can reach through the tunnel. First start takes a few minutes while the model weights download and compile; after that, restarts are quick.
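For a sense of what "OpenAI-compatible" means in practice, the sketch below lists the served model names from such an endpoint. It relies on the standard GET /models route that OpenAI-style servers, vLLM included, expose under their base URL. This is not Daalu's health check, and the function names are hypothetical. Because only Daalu can reach the endpoint through the tunnel, a probe like this only works from somewhere that can already reach the base URL.

```python
import json
import urllib.request


def served_models(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style GET /models response body."""
    return [m["id"] for m in models_response.get("data", [])]


def probe(base_url: str, timeout: float = 5.0) -> list[str]:
    """Fetch /models from an OpenAI-compatible base URL (typically ending in /v1)."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return served_models(json.load(resp))
```

The ids returned correspond to the served model name shown on the onboarding card once the status reads healthy.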
What you get once it’s connected
The moment the status reads healthy, three things change:
- The AI Factory lights up. Your role badge, GPU-class chip, and live health pill appear, and the page starts streaming utilisation, thermals, memory, and hardware health for your card. See Chapter 30.
- Inference routes to your GPU first. Daalu’s router tries your own GPU tier ahead of the Daalu-hosted tier and ahead of any commercial model. When your card can serve a request, it does, at zero per-call cost to you. If the card is offline or can’t serve a given request, the router falls back to the next tier. The routing rules are covered in Chapter 17 and Chapter 42.
- The coding workspace becomes available. Because you now have an in-perimeter inference tier, the browser IDE unlocks and coding prompts route to your GPU — never to an external provider. See Chapter 35.
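The tier ordering described above can be pictured as a first-match walk over a priority list. The sketch below is a minimal illustration of that idea only: the Tier class, the tier names, and the model names are hypothetical, and Daalu's real routing rules (Chapters 17 and 42) consider more than health and model availability.

```python
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str
    healthy: bool
    models: set[str] = field(default_factory=set)

    def can_serve(self, model: str) -> bool:
        return self.healthy and model in self.models


def route(model: str, tiers: list[Tier]) -> Tier:
    """Return the first tier, in priority order, that can serve the model."""
    for tier in tiers:
        if tier.can_serve(model):
            return tier
    raise LookupError(f"no tier can serve {model!r}")


# Priority order from the text: own GPU, then Daalu-hosted, then commercial.
tiers = [
    Tier("own-gpu", healthy=True, models={"triage-small"}),
    Tier("daalu-hosted", healthy=True, models={"triage-small", "large"}),
    Tier("commercial", healthy=True, models={"triage-small", "large", "specialty"}),
]

print(route("triage-small", tiers).name)  # own-gpu
print(route("large", tiers).name)         # daalu-hosted (own card can't serve it)
tiers[0].healthy = False                  # simulate the card going offline
print(route("triage-small", tiers).name)  # daalu-hosted
```

The point of the ordering is that fallback costs you nothing to configure: the same walk that prefers your card also skips it when it is unhealthy.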
Why it matters. Connecting a GPU is the difference between renting every token and owning the high-volume ones. The commercial LLM is still there for the heavy, specialty requests that benefit from it — but the steady-state classifier and coding traffic that dominates by volume runs on hardware you already paid for.
What to expect — sizing
| If you want to… | Use | Notes |
|---|---|---|
| Take classifier + everyday coding off commercial billing | One ada-16 (RTX 2000 Ada) | The default 14B coder and triage traffic fit comfortably. |
| Serve the larger 32B coder or long-context agent model | One ada-48 (RTX 6000 Ada) | 48 GB headroom for bigger weights and KV cache. |
| Provide capacity to several tenants | ada-48, set as a provider | You’ll see pool-level signals alongside card health. |
| Survive a single-pod outage | A second GPU node of the same class | The router already falls back to the Daalu-hosted tier on a health-check miss, so losing a single card means degraded service, not downtime. |
A single card plus automatic fallback is enough for most teams. Add redundancy when inference becomes load-bearing for you.
When the card isn’t healthy
If the onboarding card stalls before healthy:
- Confirm the cluster row is still connected in the operator app — the serving stack is unreachable if the tunnel is down.
- Give the first start a few minutes; the model weights download on first boot.
- Check the AI Factory’s observability validation (admins) to confirm metrics are flowing.
Chapter 50 — Troubleshooting has the full checklist.
With a GPU connected, the rest of Part V is about living with it: what every metric means, how to run diagnostics, how to load-test for capacity, and how the coding workspace uses your card.