
47. SKU breakdown

A SKU pins a routing policy. The policy decides which tier serves each call — and your own GPU serves at zero per-call cost.

The SKU picker on Usage & Pricing has a card per plan. This chapter explains what a SKU actually controls, the tiers your calls can land on, and how to pick the right one for your team.

A SKU is a purchasable plan. Beyond a flat monthly base and a bundle of included calls, the thing it really pins is a routing policy — the rule the LLM router follows when it decides where each inference call runs. Different policies produce very different bills for the same workload.


The two things a call has

Every inference call carries an intent and produces an outcome. They are not the same, and billing is always against the outcome.

Intent — what the caller asked for:

| Intent | Work | Typical shape |
|---|---|---|
| Classifier | Short routing and scoring decisions, e.g. “which team owns this alert?” or “does this need a bigger model?” | 100–500 tokens in, a handful out. Very high volume. |
| Quality | Real reasoning: answering an operator, explaining an alert, writing a briefing, reviewing a diff | 1k–100k tokens in, hundreds to thousands out. Lower volume, higher value. |

Outcome — which tier actually served it:

| Tier | What it is | Per-call cost |
|---|---|---|
| Your own GPU | A card you connected through Daalu Edge | $0 (you own the hardware) |
| Daalu-hosted | A Daalu-operated shared GPU | Low metered rate |
| External classifier | A cheap commercial model for classification | Pass-through tokens |
| External quality | A premium commercial model for hard reasoning | Pass-through tokens |

Why it matters. You request an intent; the router picks the tier. A “quality” intent might land on your own GPU (free), the Daalu-hosted GPU (cheap), or a premium commercial model (pass-through), depending on your policy and what’s online. You pay for the silicon that did the work, never for the silicon you wished had done it.


The routing policies

Each SKU pins one of four policies. This is the most important choice on the page.

Local-First (default)

Classifier intents run on the Daalu-hosted GPU; quality intents go to a commercial model. Cheapest unit economics for a team that hasn’t connected its own GPU yet — the high-volume classifier traffic stays off commercial billing entirely.

Hybrid

Both intents try a Daalu-operated GPU first and fall back to commercial models on a miss or a model-fit problem. The best quality-and-cost mix once there’s enough GPU headroom in the pool (a 48 GB card, for instance) to serve quality prompts locally.

External-Only

Never routes to a Daalu-operated GPU. For customers whose data-residency clauses forbid prompts transiting a shared host. Every call is pass-through commercial. The most expensive policy, chosen for compliance rather than cost.

Your own GPU

Once you’ve connected a card (Owner or Provider; see Chapter 16), your inference routes to your hardware first. Calls served there cost nothing per call; the SKU charges a flat licence instead of per-token metering. Anything your card can’t serve falls back gracefully to the Daalu-hosted or commercial tiers.
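The four policies can be read as a small decision procedure. The Python sketch below is one way to express it, not Daalu’s router: the enum names, the `route` function, and the `can_serve` set (the GPU tiers that are online and fit this particular call) are all hypothetical. It also assumes, from the Tip below and the savings example later in this chapter, that a connected card that fits the call is tried first under every policy except External-Only, and that Local-First classifier traffic falls back to the external classifier tier if the Daalu-hosted GPU is unavailable, which this chapter doesn’t state. The returned tier, not the intent, is what gets billed.

```python
from enum import Enum
from typing import AbstractSet


class Intent(Enum):
    CLASSIFIER = "classifier"
    QUALITY = "quality"


class Tier(Enum):
    OWN_GPU = "own_gpu"
    DAALU_HOSTED = "daalu_hosted"
    EXTERNAL_CLASSIFIER = "external_classifier"
    EXTERNAL_QUALITY = "external_quality"


class Policy(Enum):
    LOCAL_FIRST = "local_first"
    HYBRID = "hybrid"
    EXTERNAL_ONLY = "external_only"
    OWN_GPU = "own_gpu"


def external_tier(intent: Intent) -> Tier:
    """Commercial tier matched to the intent (pass-through tokens)."""
    if intent is Intent.CLASSIFIER:
        return Tier.EXTERNAL_CLASSIFIER
    return Tier.EXTERNAL_QUALITY


def route(intent: Intent, policy: Policy, can_serve: AbstractSet[Tier]) -> Tier:
    """Pick the tier that serves a call; billing follows the returned tier.

    can_serve (hypothetical): the GPU tiers that are online and fit this call.
    """
    if policy is Policy.EXTERNAL_ONLY:
        # Compliance policy: every call goes to a commercial model.
        return external_tier(intent)
    if Tier.OWN_GPU in can_serve:
        # Assumption: a connected card that fits the call is tried first.
        return Tier.OWN_GPU
    if policy is Policy.LOCAL_FIRST and intent is Intent.QUALITY:
        # Local-First sends quality work straight to a commercial model.
        return Tier.EXTERNAL_QUALITY
    # Hybrid, the Your-own-GPU fallback, and Local-First classifier
    # traffic try the Daalu-hosted GPU next.
    if Tier.DAALU_HOSTED in can_serve:
        return Tier.DAALU_HOSTED
    # Miss or model-fit problem: fall back to the matching commercial tier.
    return external_tier(intent)
```

For example, `route(Intent.QUALITY, Policy.HYBRID, set())` returns `Tier.EXTERNAL_QUALITY`: with no GPU headroom available, a Hybrid quality call falls back to a commercial model.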

Tip. A 16 GB ada-16 card on a Local-First or Hybrid policy already takes nearly all classifier traffic off the meter. Step up to a 48 GB ada-48 when you want quality prompts served locally too. See Chapter 31.


How a call is priced

The billing engine is small and predictable. For each tenant:

monthly bill  =  SKU monthly base
              +  Σ over metered calls( tokens ÷ 1,000,000 × that tier's per-million rate )
  • The first N calls each month are included in the base (your SKU’s bundle); calls beyond N are metered.
  • A tier priced at zero on your SKU isn’t offered; the router won’t use it. Your own GPU is the exception: its calls cost zero by construction, not because the tier is switched off.
  • The dollar amount for each call is computed and stamped at the time of the call, against the rates in force then. Historical rows survive future price changes; we never back-bill.

There’s no separate ledger and no running balance — every figure on the billing page is a live sum over the recorded calls.
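A minimal sketch of the bill arithmetic in Python. The rate table, `CallRow`, `record_call`, and `monthly_bill` are hypothetical names, and the dollar rates are placeholders rather than real Daalu prices. It shows two properties described above: each call is priced once, against the rates in force when it ran, and the bill is a plain sum over the recorded rows, with the first N calls of the month covered by the base. It assumes `rows` holds one tenant’s calls for a single month.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

# Placeholder per-million-token rates in USD; real ones come from your SKU.
CURRENT_RATES: Dict[str, float] = {
    "own_gpu": 0.0,  # zero by construction
    "daalu_hosted": 0.30,
    "external_classifier": 0.50,
    "external_quality": 10.00,
}


@dataclass(frozen=True)
class CallRow:
    at: datetime
    tier: str
    tokens: int
    cost_usd: float  # stamped when the call ran, never recomputed


def record_call(at: datetime, tier: str, tokens: int, rates: Dict[str, float]) -> CallRow:
    """Price a call once, against the rates in force at call time."""
    cost = tokens / 1_000_000 * rates[tier]
    return CallRow(at=at, tier=tier, tokens=tokens, cost_usd=cost)


def monthly_bill(base_usd: float, included_calls: int, rows: List[CallRow]) -> float:
    """Live sum over one month's recorded rows; there is no running balance."""
    ordered = sorted(rows, key=lambda row: row.at)
    metered = ordered[included_calls:]  # the first N calls are covered by the base
    return base_usd + sum(row.cost_usd for row in metered)
```

Because `cost_usd` is stamped inside `record_call`, changing the rate table afterwards leaves existing rows, and therefore past bills, untouched.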


What the Usage & Pricing page shows

For the current month, the page breaks your inference down two ways:

  • By tier — how many calls and how many dollars landed on your own GPU vs. Daalu-hosted vs. each external tier. This is your savings story at a glance.
  • By source — which agent, briefing, or automation is spending the budget (e.g. infra.alert.triage, infra.brief.daily). This is how you find a runaway workflow.

A per-day trend chart sits alongside both. Everything updates in near-real-time.
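Both breakdowns and the daily trend can be seen as the same group-and-sum over recorded calls, keyed differently. A sketch in Python; `UsageRow` and `breakdown` are illustrative names, not the page’s actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Hashable, Iterable


@dataclass(frozen=True)
class UsageRow:
    day: date
    tier: str  # e.g. "own_gpu", "external_quality"
    source: str  # e.g. "infra.alert.triage", "infra.brief.daily"
    cost_usd: float


def breakdown(
    rows: Iterable[UsageRow], key: Callable[[UsageRow], Hashable]
) -> Dict[Hashable, Dict[str, float]]:
    """Count calls and sum dollars per group, recomputed from the rows each time."""
    totals: Dict[Hashable, Dict[str, float]] = defaultdict(lambda: {"calls": 0, "usd": 0.0})
    for row in rows:
        group = totals[key(row)]
        group["calls"] += 1
        group["usd"] += row.cost_usd
    return dict(totals)
```

`breakdown(rows, lambda r: r.tier)` gives the by-tier view, `lambda r: r.source` the by-source view, and `lambda r: r.day` the per-day trend.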


How to read the savings

A typical Local-First tenant with a GPU connected sees something like:

| Tier | Calls / month | Cost |
|---|---:|---:|
| Your own GPU | 820,000 | $0 |
| Daalu-hosted | 60,000 | $18 |
| External quality | 2,000 | $400 |
| External classifier | 5,000 | $3 |

The pattern: the overwhelming majority of calls land on your own GPU at zero cost, and the hard, long-context reasoning that genuinely benefits from a frontier model is the only line that moves the bill ($400 of the $421 total). Without a connected GPU, that first row would have been hundreds of dollars of Daalu-hosted and external classifier spend instead. Your card paid for itself.
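The arithmetic behind that paragraph, worked from the example table in Python. The repricing step is an assumption: it treats the own-GPU calls as roughly the same size as the calls in each comparison row, which the example doesn’t state.

```python
# Rows from the example table: tier -> (calls per month, cost in USD).
EXAMPLE = {
    "Your own GPU": (820_000, 0),
    "Daalu-hosted": (60_000, 18),
    "External quality": (2_000, 400),
    "External classifier": (5_000, 3),
}

total_calls = sum(calls for calls, _ in EXAMPLE.values())  # 887,000
total_usd = sum(usd for _, usd in EXAMPLE.values())  # $421
own_calls = EXAMPLE["Your own GPU"][0]

own_share_of_calls = own_calls / total_calls
quality_share_of_usd = EXAMPLE["External quality"][1] / total_usd


def reprice(calls: int, ref_calls: int, ref_usd: float) -> float:
    """Cost of `calls` at a reference row's average per-call cost.

    Assumption: the repriced calls are about the same size as the
    reference row's calls, so the per-call cost carries over unchanged.
    """
    return calls * ref_usd / ref_calls


without_gpu_hosted = reprice(own_calls, *EXAMPLE["Daalu-hosted"])  # $246
without_gpu_classifier = reprice(own_calls, *EXAMPLE["External classifier"])  # $492

print(f"{own_share_of_calls:.1%} of calls on your GPU")  # 92.4%
print(f"{quality_share_of_usd:.1%} of dollars on external quality")  # 95.0%
```

At those implied per-call costs, the 820,000 own-GPU calls would have cost about $246 on the Daalu-hosted tier or $492 on the external classifier tier, which is where “hundreds of dollars” comes from.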


Picking a policy

| If you… | Pick |
|---|---|
| Haven’t connected a GPU yet | Local-First (keeps classifier traffic off commercial billing) |
| Have a 48 GB card and want quality served locally | Hybrid |
| Have a strict no-shared-host clause | External-Only |
| Own your card and want maximum control + zero per-call cost | Your own GPU |

Only a tenant admin can change the SKU — on Usage & Pricing → Plan. The switch takes effect immediately for new calls; in-flight calls finish on the old policy.


Quotas and overage

If your plan includes a call bundle, the included allowance applies across tiers. The billing page shows what percentage of the bundle you’ve consumed. Overage is billed at the per-tier rates above. An optional soft cap can be set so that, above a dollar threshold, the API sheds non-essential calls before they run.
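A sketch of the bundle meter and the optional soft cap, in Python. `MonthToDate`, `bundle_used_pct`, `admit`, and the `essential` flag are hypothetical. This chapter doesn’t say how a call is marked non-essential, or whether the cap is compared against metered spend or the whole bill, so the sketch assumes month-to-date metered spend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonthToDate:
    calls: int  # calls so far this month, counted across all tiers
    spend_usd: float  # metered spend so far (live sum of stamped costs)


def bundle_used_pct(mtd: MonthToDate, included_calls: int) -> float:
    """Share of the call bundle consumed; above 100 means you're in overage."""
    if included_calls <= 0:
        raise ValueError("plan has no call bundle")
    return 100.0 * mtd.calls / included_calls


def admit(mtd: MonthToDate, essential: bool, soft_cap_usd: Optional[float]) -> bool:
    """Decide, before a call runs, whether the soft cap lets it through."""
    if soft_cap_usd is None or essential:
        return True  # no cap configured, or the call must run regardless
    return mtd.spend_usd <= soft_cap_usd  # above the threshold, shed it
```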


Next: Chapter 48 — Spend analysis and forecasting