47. SKU breakdown
A SKU pins a routing policy. The policy decides which tier serves each call — and your own GPU serves at zero per-call cost.
The SKU picker on Usage & Pricing has a card per plan. This chapter explains what a SKU actually controls, the tiers your calls can land on, and how to pick the right one for your team.
A SKU is a purchasable plan. Beyond a flat monthly base and a bundle of included calls, the thing it really pins is a routing policy — the rule the LLM router follows when it decides where each inference call runs. Different policies produce very different bills for the same workload.
The two things a call has
Every inference call carries an intent and produces an outcome. They are not the same, and billing is always against the outcome.
Intent — what the caller asked for:
| Intent | Work | Typical shape |
|---|---|---|
| Classifier | Short routing/scoring decisions, such as “which team owns this alert?” or “does this need a bigger model?” | 100–500 tokens in, a handful out. Very high volume. |
| Quality | Real reasoning — answering an operator, explaining an alert, writing a briefing, reviewing a diff | 1k–100k tokens in, hundreds to thousands out. Lower volume, higher value. |
Outcome — which tier actually served it:
| Tier | What it is | Per-call cost |
|---|---|---|
| Your own GPU | A card you connected through Daalu Edge | $0 — you own the hardware |
| Daalu-hosted | A Daalu-operated shared GPU | Low metered rate |
| External classifier | A cheap commercial model for classification | Pass-through tokens |
| External quality | A premium commercial model for hard reasoning | Pass-through tokens |
Why it matters. You request an intent; the router picks the tier. A “quality” intent might land on your own GPU (free), the Daalu-hosted GPU (cheap), or a premium commercial model (pass-through), depending on your policy and what’s online. You pay for the silicon that did the work, never for the silicon you wished had done it.
The routing policies
Each SKU pins one of four policies. This is the most important choice on the page.
Local-First (default)
Classifier intents run on the Daalu-hosted GPU; quality intents go to a commercial model. Cheapest unit economics for a team that hasn’t connected its own GPU yet — the high-volume classifier traffic stays off commercial billing entirely.
Hybrid
Both intents try a Daalu-operated GPU first and fall back to commercial models on a miss or a model-fit problem. The best quality-and-cost mix once there’s enough GPU headroom in the pool (a 48 GB card, for instance) to serve quality prompts locally.
External-Only
Never routes to a Daalu-operated GPU. For customers whose data-residency clauses forbid prompts transiting a shared host. Every call is pass-through commercial. The most expensive policy, chosen for compliance rather than cost.
Your own GPU
Once you’ve connected a card (Owner or Provider — see Chapter 16), your inference routes to your hardware first. Calls served there are zero per-call cost; the SKU carries a flat licence instead of per-token metering. Anything your card can’t serve falls back gracefully to the Daalu-hosted or commercial tiers.
Tip. A 16 GB ada-16 card on a Local-First or Hybrid policy already takes nearly all classifier traffic off the meter. Step up to a 48 GB ada-48 when you want quality prompts served locally too. See Chapter 31.
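One way to picture the four policies is as an ordered list of tiers to try for each intent. The sketch below is illustrative only: the tier and policy names mirror this chapter, but `PREFERENCES`, `route`, and the `available` set are hypothetical names, not Daalu's router API. It models availability only (not model-fit misses), and the fallback order under the Your-own-GPU policy is an assumption.

```python
from enum import Enum


class Tier(Enum):
    OWN_GPU = "your_own_gpu"
    DAALU_HOSTED = "daalu_hosted"
    EXTERNAL_CLASSIFIER = "external_classifier"
    EXTERNAL_QUALITY = "external_quality"


# Hypothetical table: (policy, intent) -> tiers to try, most preferred first.
# Local-First, Hybrid and External-Only follow the descriptions above.
# ASSUMPTION: under "own_gpu", Daalu-hosted is tried before commercial tiers.
PREFERENCES = {
    ("local_first", "classifier"): [Tier.DAALU_HOSTED],
    ("local_first", "quality"): [Tier.EXTERNAL_QUALITY],
    ("hybrid", "classifier"): [Tier.DAALU_HOSTED, Tier.EXTERNAL_CLASSIFIER],
    ("hybrid", "quality"): [Tier.DAALU_HOSTED, Tier.EXTERNAL_QUALITY],
    ("external_only", "classifier"): [Tier.EXTERNAL_CLASSIFIER],
    ("external_only", "quality"): [Tier.EXTERNAL_QUALITY],
    ("own_gpu", "classifier"): [Tier.OWN_GPU, Tier.DAALU_HOSTED, Tier.EXTERNAL_CLASSIFIER],
    ("own_gpu", "quality"): [Tier.OWN_GPU, Tier.DAALU_HOSTED, Tier.EXTERNAL_QUALITY],
}


def route(policy, intent, available):
    """Return the first preferred tier that is online, or None if none is."""
    for tier in PREFERENCES[(policy, intent)]:
        if tier in available:
            return tier
    return None


online = {Tier.DAALU_HOSTED, Tier.EXTERNAL_CLASSIFIER, Tier.EXTERNAL_QUALITY}
print(route("hybrid", "quality", online))         # Daalu-hosted GPU is tried first
print(route("external_only", "quality", online))  # never a Daalu-operated GPU
print(route("own_gpu", "classifier", online))     # no card online, so it falls back
```

The same intent resolves to a different tier under each policy, which is why the policy choice drives the bill more than the workload does.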
How a call is priced
The billing engine is small and predictable. For each tenant:
monthly bill = SKU monthly base
             + Σ over metered calls ( tokens ÷ 1,000,000 × that tier’s per-million rate )
- The first N calls each month are included in the base (your SKU’s bundle); calls beyond N are metered.
- A tier priced at zero on your SKU isn’t offered, so the router won’t use it. Calls served by your own GPU are the exception: they are zero by construction.
- The dollar amount for each call is computed and stamped at the time of the call, against the rates in force then. Historical rows survive future price changes; we never back-bill.
There’s no separate ledger and no running balance — every figure on the billing page is a live sum over the recorded calls.
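The pricing rules above can be sketched in a few lines. This is a minimal illustration, not Daalu's billing engine: the rates, the $99 base, the bundle size, and the names `CallRecord`, `stamp_call`, and `monthly_bill` are all hypothetical. It also assumes the included calls are simply the first N calls of the month in time order.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical per-million-token rates in dollars; real rates depend on the SKU.
RATES = {
    "your_own_gpu": Decimal("0"),
    "daalu_hosted": Decimal("0.50"),
    "external_quality": Decimal("15.00"),
}


@dataclass(frozen=True)
class CallRecord:
    tier: str
    tokens: int
    amount: Decimal  # dollars stamped when the call ran; never recomputed later


def stamp_call(tier, tokens, rates, included):
    """Price one call at the rates in force now; bundled calls cost nothing."""
    if included:
        return CallRecord(tier, tokens, Decimal("0"))
    return CallRecord(tier, tokens, Decimal(tokens) / Decimal(1_000_000) * rates[tier])


def monthly_bill(base, records):
    """Live sum over recorded calls: the base plus every stamped amount."""
    return base + sum((r.amount for r in records), Decimal("0"))


INCLUDED_CALLS = 2  # the bundle size N; ASSUMPTION: the first N calls in time order
calls = [("daalu_hosted", 400), ("daalu_hosted", 400),
         ("external_quality", 20_000), ("your_own_gpu", 5_000)]
records = [stamp_call(tier, tokens, RATES, included=i < INCLUDED_CALLS)
           for i, (tier, tokens) in enumerate(calls)]
bill = monthly_bill(Decimal("99.00"), records)
print(bill)
```

Because each record carries its own stamped amount, changing `RATES` later leaves earlier records untouched, which matches the no-back-billing rule.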
What the Usage & Pricing page shows
For the current month, the page breaks your inference down two ways:
- By tier: how many calls and how many dollars landed on your own GPU vs. Daalu-hosted vs. each external tier. This is your savings story at a glance.
- By source: which agent, briefing, or automation is spending the budget (e.g. infra.alert.triage, infra.brief.daily). This is how you find a runaway workflow.
A per-day trend chart sits alongside both. Everything updates in near-real-time.
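Both views are the same aggregation over recorded calls, grouped by a different field. The sketch below assumes a hypothetical record shape (a dict with `tier`, `source`, and `amount` keys) and made-up amounts; only the source names come from this chapter.

```python
from collections import defaultdict


def breakdown(records, key):
    """Group call records by `key`; return {group: (call_count, dollars)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += 1
        bucket[1] += rec["amount"]
    return {group: (count, round(dollars, 2)) for group, (count, dollars) in totals.items()}


# Hypothetical records for illustration only.
records = [
    {"tier": "your_own_gpu", "source": "infra.alert.triage", "amount": 0.0},
    {"tier": "your_own_gpu", "source": "infra.alert.triage", "amount": 0.0},
    {"tier": "external_quality", "source": "infra.brief.daily", "amount": 0.25},
    {"tier": "daalu_hosted", "source": "infra.alert.triage", "amount": 0.5},
]

by_tier = breakdown(records, "tier")      # the savings view
by_source = breakdown(records, "source")  # the runaway-workflow view
print(by_tier)
print(by_source)
```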
How to read the savings
A typical Local-First tenant with a GPU connected sees something like:
| Tier | Calls / month | Cost |
|---|---|---|
| Your own GPU | 820,000 | $0 |
| Daalu-hosted | 60,000 | $18 |
| External quality | 2,000 | $400 |
| External classifier | 5,000 | $3 |
The pattern: the overwhelming majority of calls land on your own GPU at zero cost; the hard, long-context reasoning that genuinely benefits from a frontier model is the only line that moves the bill. Without a connected GPU, that first row would have been hundreds of dollars of Daalu-hosted and external classifier spend instead. Your card paid for itself.
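The arithmetic behind that reading can be checked directly from the table. The counterfactual below is a rough estimate under one stated assumption: that calls served by your own GPU would otherwise have cost the same per call as the Daalu-hosted row. The variable names are illustrative.

```python
# (calls per month, dollars) per tier, copied from the table above.
table = {
    "your_own_gpu": (820_000, 0),
    "daalu_hosted": (60_000, 18),
    "external_quality": (2_000, 400),
    "external_classifier": (5_000, 3),
}

total_calls = sum(calls for calls, _ in table.values())
total_cost = sum(dollars for _, dollars in table.values())

own_calls, _ = table["your_own_gpu"]
hosted_calls, hosted_cost = table["daalu_hosted"]

# Share of all calls that cost nothing, as a percentage.
own_share_pct = round(100 * own_calls / total_calls, 1)

# Share of the bill that comes from the external quality row.
quality_share_pct = round(100 * table["external_quality"][1] / total_cost, 1)

# ASSUMPTION: without a card, own-GPU calls are priced at the average
# Daalu-hosted cost per call implied by the table.
counterfactual = own_calls * hosted_cost / hosted_calls

print(total_calls, total_cost, own_share_pct, quality_share_pct, counterfactual)
```

Under that assumption the first row alone would have added a few hundred dollars, in line with the estimate in the paragraph above.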
Picking a policy
| If you… | Pick |
|---|---|
| Haven’t connected a GPU yet | Local-First — keeps classifier traffic off commercial billing |
| Have a 48 GB card and want quality served locally | Hybrid |
| Have a strict no-shared-host clause | External-Only |
| Own your card and want maximum control + zero per-call cost | Your own GPU |
Only a tenant admin can change the SKU, under Usage & Pricing → Plan. The switch takes effect immediately for new calls; in-flight calls finish on the old policy.
Quotas and overage
If your plan includes a call bundle, the included allowance applies across tiers. The billing page shows what percentage of the bundle you’ve consumed. Overage is billed at the per-tier rates above. An optional soft cap can be set so that, above a dollar threshold, the API sheds non-essential calls before they run.
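A minimal sketch of the two checks described here, assuming hypothetical function names and an `essential` flag supplied by the caller (this chapter does not define how a call is classified as non-essential):

```python
def bundle_used_pct(calls_this_month, bundle_size):
    """Percentage of the included call bundle consumed so far this month."""
    return 100.0 * calls_this_month / bundle_size


def should_run(essential, month_spend, soft_cap=None):
    """Decide before a call runs whether the optional soft cap sheds it."""
    if soft_cap is None or month_spend <= soft_cap:
        return True  # no cap configured, or spend is still at or under the threshold
    return essential  # above the cap, only essential calls go through


print(bundle_used_pct(25_000, 100_000))
print(should_run(essential=False, month_spend=120.0, soft_cap=100.0))
print(should_run(essential=True, month_spend=120.0, soft_cap=100.0))
```

The check runs before the call, so a shed call never reaches a tier and never produces a billed record.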