47. SKU breakdown

A SKU pins a routing policy. The policy decides which tier serves each call — and your own GPU serves at zero per-call cost.

The SKU picker on Usage & Pricing has a card per plan. This chapter explains what a SKU actually controls, the tiers your calls can land on, and how to pick the right one for your team.

A SKU is a purchasable plan. Beyond a flat monthly base and a bundle of included calls, the thing it really pins is a routing policy — the rule the LLM router follows when it decides where each inference call runs. Different policies produce very different bills for the same workload.

The two things a call has

Every inference call carries an intent and produces an outcome. They are not the same, and billing is always against the outcome.

Intent — what the caller asked for:

Intent	Work	Typical shape
Classifier	Short routing/scoring decisions: “which team owns this alert?”, “does this need a bigger model?”	100–500 tokens in, a handful out. Very high volume.
Quality	Real reasoning — answering an operator, explaining an alert, writing a briefing, reviewing a diff	1k–100k tokens in, hundreds to thousands out. Lower volume, higher value.

Outcome — which tier actually served it:

Tier	What it is	Per-call cost
Your own GPU	A card you connected through Daalu Edge	$0 — you own the hardware
Daalu-hosted	A Daalu-operated shared GPU	Low metered rate
External classifier	A cheap commercial model for classification	Pass-through tokens
External quality	A premium commercial model for hard reasoning	Pass-through tokens

Why it matters. You request an intent; the router picks the tier. A “quality” intent might land on your own GPU (free), the Daalu-hosted GPU (cheap), or a premium commercial model (pass-through) depending on your policy and what’s online. You pay for the silicon that did the work — never for the one you wished had.
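To make the intent/outcome split concrete, here is a minimal sketch that prices one call by the tier that served it. The rates, field names, and the `call_cost` helper are illustrative assumptions, not Daalu's actual rate card or API.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical per-million-token rates in USD; real rates come from your SKU.
RATE_PER_MILLION = {
    "own_gpu": Decimal("0"),
    "daalu_hosted": Decimal("0.50"),
    "external_classifier": Decimal("1.00"),
    "external_quality": Decimal("15.00"),
}


@dataclass(frozen=True)
class Call:
    intent: str  # what the caller asked for: "classifier" or "quality"
    tier: str    # which tier actually served the call
    tokens: int


def call_cost(call: Call) -> Decimal:
    """Price a call by the tier that served it; the intent plays no part."""
    return Decimal(call.tokens) / Decimal(1_000_000) * RATE_PER_MILLION[call.tier]


# The same 20,000-token "quality" request, served by three different tiers.
same_request = [
    Call("quality", tier, 20_000)
    for tier in ("own_gpu", "daalu_hosted", "external_quality")
]
costs = {call.tier: call_cost(call) for call in same_request}
```

The intent is identical in all three calls; only the serving tier changes the cost.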

The routing policies

Each SKU pins one of four policies. This is the most important choice on the page.

Local-First (default)

Classifier intents run on the Daalu-hosted GPU; quality intents go to a commercial model. Cheapest unit economics for a team that hasn’t connected its own GPU yet — the high-volume classifier traffic stays off commercial billing entirely.

Hybrid

Both intents try a Daalu-operated GPU first and fall back to commercial models on a miss or a model-fit problem. The best quality-and-cost mix once there’s enough GPU headroom in the pool (a 48 GB card, for instance) to serve quality prompts locally.

External-Only

Never routes to a Daalu-operated GPU. For customers whose data-residency clauses forbid prompts transiting a shared host. Every call is pass-through commercial. The most expensive policy, chosen for compliance rather than cost.

Your own GPU

Once you’ve connected a card (Owner or Provider — see Chapter 16), your inference routes to your hardware first. Calls served there are zero per-call cost; the SKU carries a flat licence instead of per-token metering. Anything your card can’t serve falls back gracefully to the Daalu-hosted or commercial tiers.

Tip. A 16 GB ada-16 card on a Local-First or Hybrid policy already takes nearly all classifier traffic off the meter. Step up to a 48 GB ada-48 when you want quality prompts served locally too. See Chapter 31.
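One way to picture the four policies is as ordered preference lists over tiers, one list per intent. The sketch below is a reading of the descriptions above, not the router's real implementation: the policy keys, the `pick_tier` function, and the exact fallback orders are assumptions. In particular, it assumes a connected card is tried first under every policy except External-Only, and that `able` already reflects what is online and what fits the model.

```python
from enum import Enum


class Intent(Enum):
    CLASSIFIER = "classifier"
    QUALITY = "quality"


class Tier(Enum):
    OWN_GPU = "own_gpu"
    DAALU_HOSTED = "daalu_hosted"
    EXTERNAL_CLASSIFIER = "external_classifier"
    EXTERNAL_QUALITY = "external_quality"


C, Q = Intent.CLASSIFIER, Intent.QUALITY

# Hypothetical preference orders inferred from the policy descriptions.
# In this sketch "hybrid" and "own_gpu" route identically; the source
# describes their difference as licensing (flat licence vs. metering).
PREFERENCES = {
    ("local_first", C): [Tier.OWN_GPU, Tier.DAALU_HOSTED, Tier.EXTERNAL_CLASSIFIER],
    ("local_first", Q): [Tier.OWN_GPU, Tier.EXTERNAL_QUALITY],
    ("hybrid", C): [Tier.OWN_GPU, Tier.DAALU_HOSTED, Tier.EXTERNAL_CLASSIFIER],
    ("hybrid", Q): [Tier.OWN_GPU, Tier.DAALU_HOSTED, Tier.EXTERNAL_QUALITY],
    ("external_only", C): [Tier.EXTERNAL_CLASSIFIER],
    ("external_only", Q): [Tier.EXTERNAL_QUALITY],
    ("own_gpu", C): [Tier.OWN_GPU, Tier.DAALU_HOSTED, Tier.EXTERNAL_CLASSIFIER],
    ("own_gpu", Q): [Tier.OWN_GPU, Tier.DAALU_HOSTED, Tier.EXTERNAL_QUALITY],
}


def pick_tier(policy, intent, able):
    """Return the first preferred tier that can serve the call, or None."""
    for tier in PREFERENCES[(policy, intent)]:
        if tier in able:
            return tier
    return None


# No card connected: only the shared and commercial tiers can serve.
able = {Tier.DAALU_HOSTED, Tier.EXTERNAL_CLASSIFIER, Tier.EXTERNAL_QUALITY}
local_first_quality = pick_tier("local_first", Q, able)
hybrid_quality = pick_tier("hybrid", Q, able)
```

With no card connected, the same quality call lands on the commercial tier under Local-First and on the Daalu-hosted GPU under Hybrid.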

How a call is priced

The billing engine is small and predictable. For each tenant:

monthly bill  =  SKU monthly base
              +  Σ over metered calls( tokens ÷ 1,000,000 × that tier's per-million rate )

The first N calls each month are included in the base (your SKU’s bundle); calls beyond N are metered.
A tier priced at zero on your SKU isn’t offered, so the router won’t use it. Your-own-GPU calls are the exception: they are zero by construction.
The dollar amount for each call is computed and stamped at the time of the call, against the rates in force then. Historical rows survive future price changes; we never back-bill.

There’s no separate ledger and no running balance — every figure on the billing page is a live sum over the recorded calls.
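The rules above can be sketched as a live sum over recorded calls. This is an illustration only: the `RecordedCall` shape and `monthly_bill` function are hypothetical, and it assumes the included bundle covers the first N calls of the month in chronological order, whichever tier served them.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class RecordedCall:
    at: datetime
    tier: str
    cost_usd: Decimal  # stamped when the call ran, at the rates in force then


def monthly_bill(base_usd: Decimal, included_calls: int,
                 calls: list[RecordedCall]) -> Decimal:
    """Base plus the stamped cost of every call beyond the included bundle."""
    ordered = sorted(calls, key=lambda call: call.at)
    metered = ordered[included_calls:]  # the first N calls are in the base
    return base_usd + sum((call.cost_usd for call in metered), Decimal("0"))


calls = [
    RecordedCall(datetime(2024, 5, 1, 9, 0), "external_quality", Decimal("0.30")),
    RecordedCall(datetime(2024, 5, 1, 9, 5), "daalu_hosted", Decimal("0.01")),
    RecordedCall(datetime(2024, 5, 2, 9, 0), "external_quality", Decimal("0.30")),
    RecordedCall(datetime(2024, 5, 2, 9, 1), "own_gpu", Decimal("0")),
]
bill = monthly_bill(Decimal("100"), 2, calls)
```

Because each row carries its own stamped cost, a later price change never alters the sum for past months.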

What the Usage & Pricing page shows

For the current month, the page breaks your inference down two ways:

By tier — how many calls and how many dollars landed on your own GPU vs. Daalu-hosted vs. each external tier. This is your savings story at a glance.
By source — which agent, briefing, or automation is spending the budget (e.g. infra.alert.triage, infra.brief.daily). This is how you find a runaway workflow.

A per-day trend chart sits alongside both. Everything updates in near-real-time.
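Both breakdowns are the same aggregation grouped by a different key. A minimal sketch, assuming each recorded call carries `tier`, `source`, and `cost_usd` fields (hypothetical names; the source labels are the examples used above):

```python
from collections import defaultdict
from decimal import Decimal


def breakdown(calls, key):
    """Count calls and sum stamped cost per distinct value of `key`."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": Decimal("0")})
    for call in calls:
        bucket = totals[call[key]]
        bucket["calls"] += 1
        bucket["cost_usd"] += call["cost_usd"]
    return dict(totals)


calls = [
    {"tier": "own_gpu", "source": "infra.alert.triage", "cost_usd": Decimal("0")},
    {"tier": "own_gpu", "source": "infra.alert.triage", "cost_usd": Decimal("0")},
    {"tier": "external_quality", "source": "infra.brief.daily", "cost_usd": Decimal("0.30")},
    {"tier": "daalu_hosted", "source": "infra.alert.triage", "cost_usd": Decimal("0.01")},
]
by_tier = breakdown(calls, "tier")
by_source = breakdown(calls, "source")
```

A runaway workflow shows up as one source whose call count or cost dwarfs the others.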

How to read the savings

A typical Local-First tenant with a GPU connected sees something like:

Tier	Calls / month	Cost
Your own GPU	820,000	$0
Daalu-hosted	60,000	$18
External quality	2,000	$400
External classifier	5,000	$3

The pattern: the overwhelming majority of calls land on your own GPU at zero cost; the hard, long-context reasoning that genuinely benefits from a frontier model is the only line that moves the bill. Without a connected GPU, that first row would have been hundreds of dollars of Daalu-hosted and external classifier spend instead. Your card paid for itself.
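The shares behind that pattern can be checked directly from the table. The snippet below uses only the figures shown above; the variable names are illustrative.

```python
# (calls per month, cost in USD) for each row of the example table.
rows = {
    "Your own GPU": (820_000, 0),
    "Daalu-hosted": (60_000, 18),
    "External quality": (2_000, 400),
    "External classifier": (5_000, 3),
}

total_calls = sum(calls for calls, _ in rows.values())  # 887,000
total_cost = sum(cost for _, cost in rows.values())     # $421

own_gpu_call_share = rows["Your own GPU"][0] / total_calls
quality_cost_share = rows["External quality"][1] / total_cost

print(f"{own_gpu_call_share:.1%} of calls ran on your own GPU")
print(f"{quality_cost_share:.1%} of per-call spend was external quality")
```

Roughly 92% of calls cost nothing, while about 95% of the $421 in per-call spend comes from the 2,000 external quality calls.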

Picking a policy

If you…	Pick
Haven’t connected a GPU yet	Local-First — keeps classifier traffic off commercial billing
Have a 48 GB card and want quality served locally	Hybrid
Have a strict no-shared-host clause	External-Only
Own your card and want maximum control + zero per-call cost	Your own GPU

Only a tenant admin can change the SKU, under Usage & Pricing → Plan. The switch takes effect immediately for new calls; in-flight calls finish on the old policy.

Quotas and overage

If your plan includes a call bundle, the included allowance applies across tiers. The billing page shows what percentage of the bundle you’ve consumed. Overage is billed at the per-tier rates above. An optional soft cap can be set so that, above a dollar threshold, the API sheds non-essential calls before they run.
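As a rough sketch of the two checks described here: the percentage of the bundle consumed, and the soft-cap decision made before a call runs. How a call is marked essential is not specified above, so the `essential` flag and both function names are assumptions for illustration.

```python
from decimal import Decimal
from typing import Optional


def bundle_used_pct(calls_so_far: int, included_calls: int) -> float:
    """Percentage of the included bundle consumed, counted across all tiers."""
    if included_calls <= 0:
        return 0.0  # plan has no bundle (assumed convention for this sketch)
    return 100.0 * calls_so_far / included_calls


def should_shed(spend_so_far_usd: Decimal,
                soft_cap_usd: Optional[Decimal],
                essential: bool) -> bool:
    """Decide, before the call runs, whether the soft cap sheds it."""
    if soft_cap_usd is None or essential:
        return False  # no cap configured, or the call is marked essential
    return spend_so_far_usd > soft_cap_usd


pct = bundle_used_pct(45_000, 60_000)
shed = should_shed(Decimal("520"), Decimal("500"), essential=False)
```

Because the check runs before the call, a shed call incurs no cost on any tier.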

Next: Chapter 48 — Spend analysis and forecasting
