17. The LLM router

One decision, made per call: serve it on your own GPU, on the Daalu-hosted tier, or — only when nothing else can — on an external commercial model.

If you’ve ever looked at an OpenAI bill from a tool that uses LLMs heavily, you know why “we’re going to pay vendors for every prompt” shouldn’t be the only way to build an operations product. Daalu lets you bring your own GPU and run inference on your hardware first, with shared and commercial models as a fallback for the small fraction of prompts that need a model your card can’t serve.

This chapter covers the customer-facing view: the tiers the router chooses among, what each one costs you, and how the decision is made. The economics behind sharing a GPU — owner, provider, consumer — are covered in Chapter 16 — The AI Factory model. The hands-on setup guide is Chapter 31 — Connecting your GPU.

Why run inference on your own GPU

Three reasons customers turn this on:

Cost. Even at small scale, the inference cost of a single operator using the Assistant heavily, plus a few automations, plus the alert-explanation agent, runs to hundreds of dollars a month at commercial-LLM prices. A used data-center GPU pays for itself within months, and once it’s connected, calls that land on it cost you nothing per call.
Privacy. When a prompt runs on your own GPU, it never leaves your network. For some industries, this is the difference between “we can use this” and “we cannot.”
Latency. A GPU a couple of hops away over the federation tunnel can respond in 100–300 ms. A commercial LLM round-trips closer to 1–3 seconds. Across thousands of small calls a day, that adds up.
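
To get a feel for how the latency difference accumulates, here is a back-of-the-envelope sketch. It uses the midpoints of the two ranges quoted above; the daily call count is a made-up assumption, and the result is cumulative wait across all calls, not time any one person spends waiting.

```python
# Hypothetical workload: the call count is an assumption, not a measured figure.
CALLS_PER_DAY = 5_000

# Midpoints of the latency ranges quoted in the text.
OWN_GPU_LATENCY_S = 0.2      # 100-300 ms
COMMERCIAL_LATENCY_S = 2.0   # 1-3 s


def daily_wait_saved_hours(calls: int, slow_s: float, fast_s: float) -> float:
    """Cumulative wait removed per day, in hours, if every call moved to the faster tier."""
    return calls * (slow_s - fast_s) / 3600


saved = daily_wait_saved_hours(CALLS_PER_DAY, COMMERCIAL_LATENCY_S, OWN_GPU_LATENCY_S)
print(f"{saved:.1f} hours of cumulative wait saved per day")
```

With these assumed numbers the difference comes to about 2.5 hours of cumulative wait per day; your own figure depends on call volume and on how many calls actually land on your GPU.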

Daalu works perfectly well without a connected GPU: the router falls through to the Daalu-hosted tier or to commercial models. But if you have GPU capacity sitting idle, this is the way to put it to use.

Why it matters: Your own GPU keeps inference on your hardware. Nothing about a prompt that runs there (the question, the data the Assistant pulled to answer it, or the response) touches an external vendor. That property is enforced by the router, not by a setting you have to remember to flip.

What you need

The minimum setup to put your own GPU in the cascade:

A federated cluster (Chapter 15 — Cluster federation). The GPU node lives inside one of your clusters; the router reaches it through the federation tunnel.
A supported GPU in that cluster — an RTX 2000 Ada (16 GB) or RTX 6000 Ada (48 GB). The card size determines which models you can serve.
The vLLM serving stack, deployed for you when you connect the GPU. Daalu publishes a tested chart that handles model selection, the device plugin, and resource limits.
Open model weights — Llama and Qwen families are the tested defaults. The chart can pull them on first start or use weights you’ve pre-staged.

Connecting a GPU is a guided flow, not a manual chart install — Chapter 31 — Connecting your GPU walks through it end-to-end.

The tiers the router chooses among

The LLM router is a small component in the Daalu cloud that decides, for each inference call, where to send it. The decision is per call, not per tenant — the router can serve one prompt on your GPU and the next on a commercial model, inside the same conversation.

It works down an ordered cascade, trying each tier in turn and falling through on failure:

Your own GPU. If your tenant has a connected GPU reachable over the federation tunnel, the router tries it first. Calls served here cost you nothing per call — you already paid for the card.
The Daalu-hosted tier. Operator-run shared GPUs, reached through the inference gateway. The gateway enforces your monthly quota and per-minute rate limit, meters the usage, and, if the call lands on a GPU another tenant contributed, credits that contributor. Usage is billed against your quota. See Chapter 44 — Daalu-hosted AI.
External commercial models. Anthropic and OpenAI-compatible vendors, used only as a last-resort fallback for prompts the first two tiers can’t serve.
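
The ordered cascade with fall-through can be sketched as follows. This is an illustration of the idea only, not Daalu’s implementation: the `Tier` type, the `TierUnavailable` exception, the tier names, and the stub serving functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class TierUnavailable(Exception):
    """Raised by a tier that cannot serve this call (offline, over quota, ...)."""


@dataclass
class Tier:
    name: str
    serve: Callable[[str], str]  # takes a prompt, returns a completion


def route(prompt: str, cascade: list[Tier]) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, completion) from the first success."""
    last_error: Optional[Exception] = None
    for tier in cascade:
        try:
            return tier.name, tier.serve(prompt)
        except TierUnavailable as err:
            last_error = err  # fall through to the next tier
    raise RuntimeError("no tier could serve the call") from last_error


# Stub tiers for the demo: the GPU is unreachable, the hosted tier answers.
def offline_gpu(prompt: str) -> str:
    raise TierUnavailable("GPU not reachable over the federation tunnel")


def hosted(prompt: str) -> str:
    return f"hosted answer to: {prompt}"


def external(prompt: str) -> str:
    return f"external answer to: {prompt}"


cascade = [
    Tier("own_gpu", offline_gpu),
    Tier("daalu_hosted", hosted),
    Tier("external", external),
]
tier_name, answer = route("why is this pod crash-looping?", cascade)
print(tier_name)
```

Because the loop runs once per call, two consecutive prompts in the same conversation can end up on different tiers, which matches the per-call behaviour described above.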

Which tiers are even eligible depends on your plan’s routing policy. A plan can prefer your own hardware first, or pin everything to external vendors, or anything in between — admins set this and can override specific routes in Settings.

Note: Coding requests never touch the external tier. When the in-IDE coding assistant (Chapter 35 — Coding workspace) calls a model, the router drops tier 3 (external commercial models) entirely: code is only ever served on your own GPU or the Daalu-hosted tier. Your source never leaves the customer-or-operator perimeter. This is enforced at the router, the single chokepoint every call passes through.
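
One way to picture how plan policy and the coding rule combine is as a filter applied before any tier is tried. The sketch below is an assumption about the shape of that logic, not the router’s actual code; the tier names and the `eligible_tiers` function are hypothetical.

```python
# Cascade order as described in the text; the identifiers are illustrative.
CASCADE_ORDER = ["own_gpu", "daalu_hosted", "external"]


def eligible_tiers(plan_allowed: set[str], is_coding_request: bool) -> list[str]:
    """Return the tiers a call may use, in cascade order."""
    tiers = [t for t in CASCADE_ORDER if t in plan_allowed]
    if is_coding_request:
        # Coding requests are never sent to an external vendor.
        tiers = [t for t in tiers if t != "external"]
    return tiers


everything = {"own_gpu", "daalu_hosted", "external"}
print(eligible_tiers(everything, is_coding_request=False))
print(eligible_tiers(everything, is_coding_request=True))
print(eligible_tiers({"external"}, is_coding_request=True))
```

Note the last case: in this sketch a coding request under a plan pinned to external vendors gets an empty list, so the call would be refused rather than sent out. How Daalu reports that situation is not covered here.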

In a healthy tenant with a connected GPU, 80–95 % of inference (by volume) runs on your hardware, with the remainder spilling to the next tier. The exact ratio depends on what your team does with the Assistant.

Tool-using investigations

The router doesn’t only serve one-shot text completions. When the Assistant investigates an alert, it runs a multi-step, tool-calling loop — querying metrics, reading logs, listing resources — and each step is an inference call routed by the same cascade. So an alert triage runs on your own GPU first, then the Daalu-hosted tier, exactly like a chat turn does.

One practical detail: the external Anthropic tier is skipped for these tool-using calls, because its function-calling wire format differs from the OpenAI-compatible tiers the rest of the cascade uses. In practice this is invisible — your own GPU and the Daalu-hosted tier already cover tool use — but it’s why a heavy alert investigation behaves slightly differently from a plain chat question when you have no GPU connected at all.
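
The wire-format rule can be sketched as a second filter keyed on each tier’s API dialect. Everything named here is hypothetical (`TierSpec`, the tier identifiers, the `wire_format` labels), and the relative order of the two external entries is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSpec:
    name: str
    wire_format: str  # "openai" or "anthropic"; labels are illustrative


TIERS = [
    TierSpec("own_gpu", "openai"),
    TierSpec("daalu_hosted", "openai"),
    TierSpec("external_anthropic", "anthropic"),
    TierSpec("external_openai_compatible", "openai"),
]


def tiers_for(uses_tools: bool) -> list[str]:
    """Tool-using calls keep only tiers that speak the OpenAI-compatible format."""
    return [t.name for t in TIERS if not uses_tools or t.wire_format == "openai"]


print(tiers_for(uses_tools=False))
print(tiers_for(uses_tools=True))
```

A plain chat call sees all four entries, while a tool-using call sees the list without the Anthropic entry.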

Where the router shows up in the product

Two pages tell the story.

The AI Factory page (Chapter 30 — Overview) shows the hardware side: whether your GPU is online, how each tier is being used, and — depending on your role in the GPU-sharing model — utilisation, diagnostics, or your consumption of a shared pool.

The Billing page (Chapter 27 — Billing) shows the money side:

Inference last 7 days — a stacked breakdown of calls by tier, split into classifier / chat / agent.
Spend — what you paid, with a comparison to “what this would have cost on commercial models alone.” Tenants with a connected GPU commonly see 70–90 % savings.
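
The savings comparison is simple arithmetic: what you paid versus what the same calls would have cost on commercial models alone. A minimal sketch, with made-up dollar amounts and a hypothetical function name:

```python
def savings_percent(actual_spend: float, commercial_only_cost: float) -> float:
    """Percentage saved relative to serving every call on commercial models."""
    if commercial_only_cost <= 0:
        raise ValueError("commercial_only_cost must be positive")
    return 100 * (commercial_only_cost - actual_spend) / commercial_only_cost


# Illustrative figures only: $60 actually paid vs. $400 at commercial prices.
print(f"{savings_percent(60.0, 400.0):.0f}% saved")
```

With these example inputs the result is 85%, which falls inside the range quoted above.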

The SKU picker lets you see what each equivalent commercial model would have cost, which is useful when you’re sizing how much GPU hardware to buy.

When connecting a GPU is and isn’t worth it

Honest assessment:

Worth it if:

You’re already running a GPU for other workloads — the Daalu serving pod is light and schedules alongside other GPU work on the same card.
Your team uses the Assistant heavily — multiple operators, daily use, agents running on a schedule.
You have privacy requirements that make external commercial models hard to justify.
Inference latency has become noticeable: you’re calling the Assistant in the middle of incident response and the multi-second wait is friction.

Not worth it if:

You’re a small team using the Assistant occasionally. Your fallback-tier bill is going to be modest; a dedicated GPU won’t pay for itself.
You don’t have spare physical infrastructure, don’t want to operate any, and would rather lean on the Daalu-hosted tier, which needs no hardware on your side.
Your primary workload is long-context summarization (e.g., digesting 100k tokens of logs at a time). Open models can do this but it stresses VRAM; the economics of a long-context commercial model are better.
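
To see why long context stresses VRAM, it helps to estimate the key-value cache a transformer keeps per token. The sketch below uses the standard estimate (keys plus values, per layer, per KV head). The architecture numbers are assumptions chosen to resemble an 8B-class model with grouped-query attention and a 16-bit cache; they are not taken from Daalu’s tested set.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: 2 (keys and values) x layers x heads x dim x bytes x tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed config for illustration: 32 layers, 8 KV heads, head dimension 128.
per_100k = kv_cache_bytes(100_000, layers=32, kv_heads=8, head_dim=128)
print(f"{per_100k / 2**30:.1f} GiB of KV cache for a 100k-token context")
```

Under these assumptions a single 100k-token context needs roughly 12 GiB for the cache alone, before the model weights are counted, which leaves little headroom on a 16 GB card.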

The decision is not hard to reverse. You can connect a GPU, evaluate it for a month, and disconnect it if you don’t like it; nothing else about your tenant changes.

Model choices

Which models you can serve on your own GPU is governed by the card you bring:

RTX 2000 Ada (16 GB) — fits a strong general 8B-class model (Llama and Qwen families are the tested defaults) and, for the coding workspace, a 14B coder.
RTX 6000 Ada (48 GB) — fits larger 30–32B models, including the long-context coder used for agentic, tool-driven work.

The router auto-detects which model is being served on your endpoint and routes accordingly, so you don’t have to declare it separately. For the coding workspace specifically you pick the model per workspace in the create form; see Chapter 35 — Coding workspace.
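
How the router performs this detection is not described here. One plausible mechanism, offered purely as an assumption, is the model listing that vLLM’s OpenAI-compatible server exposes at `/v1/models`. The sketch below parses a response of that shape; the model id in the sample payload is made up.

```python
import json


def served_model_ids(models_response: str) -> list[str]:
    """Extract model ids from an OpenAI-style model listing (JSON text)."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Sample payload in the OpenAI list-models shape; the id is hypothetical.
sample = json.dumps({
    "object": "list",
    "data": [{"id": "example-8b-instruct", "object": "model"}],
})
print(served_model_ids(sample))
```

In a real check the JSON text would come from an HTTP GET against your serving endpoint rather than from a hard-coded sample.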

Newer models are added on a rolling basis as we validate them. The current tested set is at Help → Hardware compatibility.

What your own GPU can’t do

To set expectations — these are the cases the router deliberately falls through to the Daalu-hosted or external tier:

Frontier-class quality on the hardest prompts. Open models are catching up but aren’t always there. The router knows this and falls through when a prompt requires it (outside coding, where it never leaves the perimeter).
Vision and audio. Open vision-language models exist but aren’t in the tested set yet; multi-modal inference still goes to a fallback tier.
Fine-tuning your own model. The serving stack runs base models. Custom-trained variants aren’t supported at the customer level yet — talk to us if this matters.

Next: Chapter 18 — Home
