9. Connect a GPU (quickstart)

Point Daalu at your own NVIDIA hardware and serve inference at zero per-call cost — three steps, about fifteen minutes.

If your team owns an NVIDIA GPU, you can make it the engine behind Daalu’s AI. Once it’s connected, chat with the Assistant, classifier calls, and the coding Workspace all route to your hardware first, falling back to the Daalu-hosted tier only when your GPU is busy or offline. You pay for the GPU once; the inference calls are effectively free.

This is the quickstart. It gets you from idle hardware to served inference. For the full deploy options, model choices, and tuning, see Chapter 31 — Connecting your GPU. For the concept — how routing, cost, and sharing actually work — see Chapter 16 — The AI Factory model.

Supported GPUs. The serving stack targets RTX-class Ada hardware: the RTX 2000 Ada (16 GB) for smaller models, and the RTX 6000 Ada (48 GB) when you want headroom for larger models or longer context. Other CUDA-capable NVIDIA cards may work, but these two are the classes we test and tune against.

The three steps

1. Join your cluster

Daalu reaches your GPU the same way it reaches anything behind your firewall: an outbound-only Daalu Edge tunnel. Nothing on your side accepts inbound traffic, and nothing about your GPU is exposed to the public internet.

If you’ve already onboarded the cluster that hosts your GPU — you’ll see it as connected under Managed infra → Kubernetes clusters — skip ahead to step 2.

If not, onboard it now. The flow is a single Helm install of the Daalu Edge chart, which brings up a WireGuard tunnel back to the Daalu hub. The concept is covered in Chapter 15 — Cluster federation, and the step-by-step deploy is in Chapter 41 — Deploying Daalu Edge. Once the cluster row turns connected, come back here.

2. Deploy the serving stack

Open Usage & Pricing. Near the top you’ll find the local-GPU onboarding card. It walks you through deploying the inference serving stack onto the cluster you just connected:

Pick the connected cluster that has the GPU.
Confirm the GPU class and the model you want to serve (the card suggests a sensible default for your hardware — a smaller model on the RTX 2000 Ada, a larger one on the RTX 6000 Ada).
Apply. Daalu deploys a small serving pod that exposes an OpenAI-compatible HTTP API, reachable only over your Edge tunnel.

The card then reports live status as the stack comes up:

Configured — the serving stack has been deployed to your cluster.
Healthy — the Daalu router has successfully called the endpoint and is getting valid completions back.
Base URL — the in-cluster address the router is using.
Model — the model currently being served.

Give the pod a minute or two to pull its image and load weights. When the card flips to healthy, you’re done.

Tip: If the card stays at configured but never reaches healthy, it’s almost always the GPU itself — the node has no schedulable GPU, the driver isn’t present, or the model is too large for the card’s memory. The card surfaces the pod’s status so you can see which. Chapter 31 has the full troubleshooting flow; Chapter 33 covers diagnostics and reliability once you’re running.

3. The AI Factory lights up

As soon as the router marks your GPU healthy, two things happen:

The AI Factory page comes alive. The previously-empty AI Factory page now shows your GPU: its health, utilization, the model it’s serving, and the share of inference now running locally. This is your day-to-day view of the hardware — Chapter 32 covers its observability panels in depth.

Inference starts routing to your GPU first. Every Assistant chat turn, every classifier call, and every routine inference task tries your GPU before anything else. When it can serve the request, it does — at zero per-call cost. When it’s saturated or offline, the router transparently falls back to the Daalu-hosted tier, so nothing breaks. You’ll see the local-vs-hosted split reflected on Usage & Pricing.

The coding Workspace becomes available. With your GPU online, the Workspace — a browser-based, AI-assisted coding environment — is backed by your own inference. Use it to edit runbooks, automations, and infrastructure code without your code or prompts leaving your hardware.

What you’ve built

You now have an AI Factory: your own GPU serving the bulk of Daalu’s inference, with hosted models as a safety net and a single place — Usage & Pricing — to watch the cost difference. From here:

Go deeper on the deploy — model selection, multiple GPUs, sizing: Chapter 31 — Connecting your GPU.
Understand the model — how routing, cost, and GPU sharing work: Chapter 16 — The AI Factory model.
Watch and tune — utilization, diagnostics, and load testing: Chapters 32–34.

Next: Chapter 10 — Tenants and access

Onboarding your observability stack Tenants, users, and access control