44. The Daalu-hosted AI tier

Inference on Daalu’s GPUs, billed per token — for buyers who want neither a GPU on premises nor their prompts going to a public LLM vendor.

Chapter 31 is about running inference on your own GPU. Chapter 42 explains how the router chooses between your GPU and cloud LLMs. This chapter covers the third option: letting Daalu host the GPU and sell you inference on it.

If you don’t want to buy a GPU and you don’t want your prompt data flowing through a public LLM vendor, this is the tier you take.

What it is

Daalu operates a pool of GPUs in our own cluster running open-weight models (Llama-class models for chat, a Qwen Coder model for the coding assistant). When you enable the Daalu-hosted tier on your tenant, the inference calls that would otherwise have gone to a cloud LLM get routed to this pool instead.

You pay per million tokens, the same way you’d pay a cloud LLM vendor — but roughly an order of magnitude less, because the underlying cost is amortised GPU plus electricity, not a third-party vendor margin.

When this tier is the right choice

Pick Daalu-hosted when:

You don’t want a GPU on premises. Buying, racking, and maintaining a card is real work; this tier lets you skip it.
You don’t want your prompts going to a public LLM vendor. Maybe legal flagged a vendor’s data-handling clause, or you have a compliance constraint that says “no third-party AI.” Daalu runs the GPU, and we’re already inside your trust boundary as your ops-platform vendor — so this tier doesn’t widen it.
Your usage is bursty. Owning a GPU pays off only with consistent use. If your AI calls are quiet most of the time with occasional spikes, you pay as you go here and never pay for idle silicon.
You want to A/B test against your own GPU. Run the same prompts through both and compare quality and cost.

Stick with your own GPU (Chapter 31) when:

You have volume — past roughly 5 million tokens/month, owning the hardware wins on unit economics.
Your prompts contain data your legal team won’t even let Daalu see. Then the only acceptable option is your own hardware.
You’ve already bought the card.

Stick with cloud LLMs when:

You need a frontier-quality model. The open-weights catalogue is good, but it doesn’t include a frontier model.
You depend on a specific provider feature that hasn’t been ported to an OpenAI-compatible API.

The tiers aren’t mutually exclusive — you can enable Daalu-hosted and keep cloud as a fallback. The router prefers the cheapest available tier per call.
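As a rough illustration of “cheapest available tier per call”, the sketch below picks the lowest-cost tier among those currently available. The Tier type, the pick_tier function, and the use of the prompt rate as a single cost figure are assumptions made for this example, not the router’s actual interface; Chapter 42 describes the real selection logic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str                 # hypothetical tier identifier
    cost_per_million: float   # illustrative USD cost per million tokens
    available: bool           # whether the tier can take this call right now


def pick_tier(tiers: Sequence[Tier]) -> Optional[Tier]:
    """Return the cheapest tier that is currently available, or None."""
    candidates = [t for t in tiers if t.available]
    return min(candidates, key=lambda t: t.cost_per_million, default=None)


# Prompt rates from the billing table, used here only as stand-in costs.
tiers = [
    Tier("daalu_hosted", 0.30, available=True),
    Tier("cloud", 3.00, available=True),
]
print(pick_tier(tiers).name)  # daalu_hosted
```

If the hosted tier is marked unavailable, the same call returns the cloud tier, which is the fallback behaviour described above.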

How it’s billed

Two line items on the billing page, both per million tokens:

Token type	Daalu-hosted rate	Frontier cloud rate	Savings
Prompt (input)	$0.30	$3.00	~10×
Completion (output)	$0.90	$15.00	~17×

These are indicative list prices; volume discounts apply on larger plans — talk to your account contact. Your billing page (the spend view, Chapter 48) breaks Daalu-hosted spend out by source — “chat”, “classifier”, “coding” — so you can see which feature is burning the budget.
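To see how the two line items combine, here is a minimal sketch that prices a month of usage at the indicative list rates from the table. The workload (4 million prompt tokens, 1 million completion tokens) and the function name are made up for the example; your actual invoice depends on your plan and any discounts.

```python
# Indicative list prices from the table above, in USD per million tokens.
DAALU_RATES = {"prompt": 0.30, "completion": 0.90}
CLOUD_RATES = {"prompt": 3.00, "completion": 15.00}


def monthly_cost(prompt_tokens: int, completion_tokens: int, rates: dict) -> float:
    """Cost in USD for the given token counts at per-million-token rates."""
    total = prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]
    return total / 1_000_000


# Hypothetical month: 4M prompt tokens and 1M completion tokens.
hosted = monthly_cost(4_000_000, 1_000_000, DAALU_RATES)
cloud = monthly_cost(4_000_000, 1_000_000, CLOUD_RATES)
print(f"hosted ${hosted:.2f}, cloud ${cloud:.2f}")  # hosted $2.10, cloud $27.00
```

Because completion tokens carry the larger discount, the blended saving for a given workload depends on its prompt-to-completion ratio.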

Quotas and rate limits

When we enable the tier for your tenant, we set:

A monthly token cap — your plan’s included tokens plus any overage policy you’ve configured.
A per-minute rate limit — defaults to 60 requests/min and 50,000 tokens/min. High enough that you never notice it in normal use; tight enough that a runaway loop in one of your agents can’t burn your whole month in five minutes.
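One common way to enforce a pair of per-minute limits like these is a token bucket per limit. The sketch below is an illustration of that technique using the default numbers above; the class names are hypothetical and this is not a description of how Daalu’s gateway is actually implemented.

```python
class Bucket:
    """Holds up to `per_minute` units and refills continuously at that rate."""

    def __init__(self, per_minute: float, now: float) -> None:
        self.capacity = per_minute
        self.level = per_minute
        self.updated = now

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.capacity / 60.0)
        self.updated = now


class TenantLimiter:
    """Admits a call only if both the request and the token bucket can cover it."""

    def __init__(self, now: float, requests_per_min: int = 60,
                 tokens_per_min: int = 50_000) -> None:
        self.requests = Bucket(requests_per_min, now)
        self.tokens = Bucket(tokens_per_min, now)

    def allow(self, token_count: int, now: float) -> bool:
        self.requests.refill(now)
        self.tokens.refill(now)
        if self.requests.level < 1 or self.tokens.level < token_count:
            return False
        self.requests.level -= 1
        self.tokens.level -= token_count
        return True


# A runaway loop firing 1,000 calls in the same instant: only 60 get through.
limiter = TenantLimiter(now=0.0)
allowed = sum(limiter.allow(token_count=500, now=0.0) for _ in range(1_000))
print(allowed)  # 60
```

Timestamps are passed in explicitly so the behaviour is easy to test; a real limiter would read a monotonic clock instead.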

If you hit the monthly cap, your overage policy decides what happens:

hard_stop — Daalu-hosted calls return 429; the router does not fall through to cloud. For strict budgets. Default for trial plans.
throttle — calls succeed but slow down, letting a non-critical workflow finish without paying for cloud overage. Default for budget plans.
cloud_overflow — overage requests fall through to cloud at the cloud passthrough rate. Default for production plans where AI uptime matters more than a few dollars of overflow.

Switch policies any time on Settings → AI tier; the change takes effect on the next call.
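The three policies can be summarised as a small decision function. This is a sketch of the behaviour described above, not platform code: the policy names come from this chapter, while the function name and the returned action labels are invented for the example.

```python
from enum import Enum


class OveragePolicy(Enum):
    HARD_STOP = "hard_stop"
    THROTTLE = "throttle"
    CLOUD_OVERFLOW = "cloud_overflow"


def action_for_call(tokens_used: int, monthly_cap: int, policy: OveragePolicy) -> str:
    """Return a hypothetical action label for the next inference call."""
    if tokens_used < monthly_cap:
        return "serve_hosted"          # still under the cap: nothing changes
    if policy is OveragePolicy.HARD_STOP:
        return "reject_429"            # no fall-through to cloud
    if policy is OveragePolicy.THROTTLE:
        return "serve_hosted_slowly"   # call succeeds, but slower
    return "serve_cloud"               # cloud_overflow: billed at the cloud rate


print(action_for_call(1_200_000, 1_000_000, OveragePolicy.HARD_STOP))  # reject_429
```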

Data handling

We treat this tier the same as the rest of the platform:

Prompts and completions are not persisted by default. The serving runtime holds them in memory long enough to answer; our gateway holds them long enough to relay; neither writes them to disk.
Token counts are persisted. We record what you used and what it cost — we need that to bill you — but not the prompt or completion text alongside it.
The audit log records that a call happened: the tenant, the user, the source feature, the model, the token counts, and the latency. No prompt body.
Opt-in full capture. If you want full prompt+completion capture for debugging, enable it under Settings → AI tier → Full capture and point it at an object store you control. Captures go straight there, not to Daalu’s database.

We never use customer prompts for training, fine-tuning, or anything else.
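To make the audit-log description concrete, the sketch below shows what a record with those fields might look like and how the prompt text is left out when it is built. The field names, the AuditRecord type, and the shape of the input dictionary are assumptions for illustration; the actual log schema may differ.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    # Hypothetical field names mirroring the list above; note there is no prompt body.
    tenant: str
    user: str
    source: str            # e.g. "chat", "classifier", "coding"
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float


def to_audit_record(call: dict) -> AuditRecord:
    """Copy only metadata from a call; prompt and completion text are dropped."""
    return AuditRecord(
        tenant=call["tenant"],
        user=call["user"],
        source=call["source"],
        model=call["model"],
        prompt_tokens=call["prompt_tokens"],
        completion_tokens=call["completion_tokens"],
        latency_ms=call["latency_ms"],
    )


call = {
    "tenant": "acme", "user": "u-17", "source": "chat", "model": "example-model",
    "prompt_tokens": 812, "completion_tokens": 164, "latency_ms": 930.0,
    "prompt": "text that must not be logged", "completion": "also not logged",
}
record = asdict(to_audit_record(call))
print(sorted(record))
```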

Enabling the tier

Ask your account contact to turn it on, or — if your plan allows self-serve — toggle it under Settings → AI tier:

Click Enable Daalu-hosted inference.
Pick a monthly limit (your plan’s included tokens, or whatever you’ve added on top).
Pick an overage policy (above).
Save.

Within about a minute, your next inference call starts using the Daalu-hosted pool. Watch Billing → Source to confirm — calls that previously showed a cloud provider start showing the Daalu-hosted tier.

When the pool is busy

We over-provision capacity, but unlike cloud LLMs we don’t have infinite scale. Under heavy load:

Your rate-limit bucket is unaffected: pool load never shrinks your per-minute allowance, and you’re never denied for being a heavy user.
Latency rises as requests queue (continuous batching). You might see p99 chat-completion latency climb for a few minutes.
If you’ve set cloud_overflow, the router skips the busy tier and goes straight to cloud for that period.

We monitor sustained saturation and add capacity. You don’t have to do anything.
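The busy-pool behaviour comes down to a simple rule, sketched here under the assumption that the router sees a single saturated/not-saturated signal for the pool. The function name and return labels are hypothetical.

```python
def route_when_busy(pool_saturated: bool, overage_policy: str) -> str:
    """Pick a tier for one call, given pool load and the tenant's overage policy."""
    if pool_saturated and overage_policy == "cloud_overflow":
        return "cloud"         # skip the busy tier for this period
    return "daalu_hosted"      # otherwise the call queues on the hosted pool


print(route_when_busy(True, "cloud_overflow"))  # cloud
print(route_when_busy(True, "throttle"))        # daalu_hosted
```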

Good to know

Responses arrive as a complete answer, not token-by-token. If you need streaming output, your own GPU or a cloud provider is the path.
The catalogue is fixed. If your workflow needs a model we don’t run, the answer is your own GPU (Chapter 31).
The hosted GPUs run in one region. Users far from it may see modestly higher latency than an on-prem GPU would give.