16. The AI Factory model

A GPU-sharing economy: own a card and run for free, lend its spare capacity and earn, or rent a shared pool and pay only for what you use.

Most AI products have exactly one economic relationship with you: you send prompts, they bill you per token, and the inference runs on someone else’s hardware. Daalu’s AI Factory is built on a different idea. A GPU is a capital asset. If you own one, your inference should be free. If it has spare capacity, you should be able to lend that capacity and get paid. And if you don’t own one, you should be able to rent time on a shared pool without buying hardware at all.

This chapter explains that model — the roles, the shared pool, and the front door that makes the accounting work. It’s the why behind Chapter 17 — The LLM router, which covers the per-call decision, and behind Chapter 30 — Overview, which covers the page you’ll actually look at.

The four roles

Every tenant sits in one of four roles at any time. You don’t pick a role from a menu — Daalu resolves it automatically from what your tenant has connected and how it’s configured. As your setup changes, your role changes with it.

Owner

Your tenant has its own connected GPU, reached over the federation tunnel. Your inference runs on your card, and calls served there cost you nothing per call — you already paid for the hardware. You also get the full hardware view: utilisation, temperature, diagnostics, reliability signals.

This is the role most teams aim for. Connecting a GPU is covered in Chapter 31 — Connecting your GPU.

Provider

You’re an owner who has gone one step further: you contribute your GPU’s spare capacity to a shared pool. When another tenant’s inference lands on your card, the inference gateway credits you for that call — a revenue share. Your own inference still runs free; the contribution only fills cycles you weren’t using.

A provider sees everything an owner sees, plus the consumption their pool is serving for others.

Consumer

Your tenant routes to a shared pool rather than running on its own hardware. The pool can be the Daalu-hosted pool (operator-run GPUs) or a provider’s pool. Either way you pay per usage against a quota: no card to buy, no hardware to operate. A consumer’s view is usage-centric: tokens used, requests made, quota remaining, average latency. You never see another tenant’s raw hardware health, only your own consumption.

None

You’re not on a GPU yet. The Assistant still works — the router simply falls through to the Daalu-hosted tier or to external commercial models — but the AI Factory page invites you to connect a card or enrol in the shared tier. This is where most tenants start.

Why it matters: Roles are resolved per tenant and enforced server-side. A consumer can never see a provider’s card-level metrics, and a provider’s card never serves a call without the gateway recording who to credit. You don’t configure any of this; the separation is structural.
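To make the "resolved, not chosen" idea concrete, here is a minimal sketch of how a role could be derived from a tenant's setup. The field names on TenantSetup and both function names are hypothetical; the inputs Daalu actually reads are not described in this chapter.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    PROVIDER = "provider"
    CONSUMER = "consumer"
    NONE = "none"


@dataclass(frozen=True)
class TenantSetup:
    # Hypothetical fields, used only for illustration.
    has_connected_gpu: bool = False
    contributes_spare_capacity: bool = False
    routes_to_shared_pool: bool = False


def resolve_role(setup: TenantSetup) -> Role:
    """Derive the role from the current setup; there is no role menu."""
    if setup.has_connected_gpu:
        # A provider is an owner who also lends spare capacity.
        if setup.contributes_spare_capacity:
            return Role.PROVIDER
        return Role.OWNER
    if setup.routes_to_shared_pool:
        return Role.CONSUMER
    return Role.NONE


def can_see_hardware_metrics(role: Role) -> bool:
    """Card-level metrics are only for tenants with their own card."""
    return role in (Role.OWNER, Role.PROVIDER)
```

Because the role is a pure function of the setup, connecting a card or enrolling in a pool changes the result on the next evaluation, with no separate "switch role" step.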

How the pieces fit together

An owner runs their own inference directly on their card. A provider’s spare capacity flows into a shared pool. A consumer’s requests go through the inference gateway, which dispatches them to the pool and — when the call lands on a contributed card — credits the provider who owns it.

The tiers the router chooses among

The same hardware shows up to the router as an ordered set of tiers. For any given call it tries them in turn and falls through on failure (the full logic is Chapter 17 — The LLM router):

Your own GPU. Reached over the federation tunnel. No per-call cost. Tried first whenever your tenant owns a card.
The Daalu-hosted tier. Operator-run shared GPUs, reached through the inference gateway, billed per usage against your quota. See Chapter 44 — Daalu-hosted AI.
External commercial models. A last-resort fallback for prompts the first two tiers can’t serve.

Note: Coding requests never use the external tier. When the in-IDE coding assistant calls a model, the router serves it only on your own GPU or the Daalu-hosted tier, so your source never leaves the customer-or-operator perimeter. The privacy guarantee is enforced at the router, not left to each caller.
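The tier order and the coding rule can be sketched in a few lines. This is an illustration under assumptions, not the router's implementation (that is Chapter 17): the tier labels, the TierUnavailable exception, and the backends mapping are all hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative tier labels, listed in the order the router tries them.
OWN_GPU = "own_gpu"
DAALU_HOSTED = "daalu_hosted"
EXTERNAL = "external"


class TierUnavailable(Exception):
    """A tier could not serve the call (card down, over quota, and so on)."""


def candidate_tiers(owns_gpu: bool, is_coding_request: bool) -> List[str]:
    tiers = []
    if owns_gpu:
        tiers.append(OWN_GPU)       # tried first when the tenant owns a card
    tiers.append(DAALU_HOSTED)
    if not is_coding_request:
        tiers.append(EXTERNAL)      # coding requests never reach this tier
    return tiers


def route(
    prompt: str,
    owns_gpu: bool,
    is_coding_request: bool,
    backends: Dict[str, Callable[[str], str]],
) -> Tuple[str, str]:
    """Try each allowed tier in order and fall through on failure."""
    for tier in candidate_tiers(owns_gpu, is_coding_request):
        try:
            return tier, backends[tier](prompt)
        except TierUnavailable:
            continue
    raise TierUnavailable("no tier could serve this call")
```

Filtering the candidate list before the loop, rather than checking inside each caller, mirrors the point of the note: the external tier is simply never a candidate for a coding request.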

The inference gateway

The inference gateway is the front door to every shared GPU. It speaks the standard OpenAI-compatible chat-completions API, so the rest of Daalu talks to it the same way it talks to any model endpoint. What makes it more than a passthrough is the policy it wraps around each call:

Monthly quota. Every consumer tenant has a token budget for the billing period. The gateway checks it before dispatching and refuses cleanly once you’re over — handing the router a signal to fall through rather than a hard error.
Per-minute rate limit. A short-window limit so one tenant’s burst can’t starve the pool for everyone else.
Metering. Each successful call is recorded — tokens in, tokens out, latency, model, and the cost computed against your plan. This is what feeds the spend views.
Revenue-share accounting. When a call is served on a card a provider contributed, the gateway records a credit to that provider. The owner of the hardware gets paid for the cycles their card served.

Because all of this happens at one chokepoint, quota, rate limiting, billing, and provider credits stay consistent no matter which feature made the call — the Assistant, an automation, the alert-explanation agent, or the coding workspace.
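The four policies can be pictured as one function that every shared-pool call passes through. The sketch below is a simplified model under assumptions: the Gateway class, its fields, a flat per-token price, and a fixed provider share are all hypothetical, and the real pricing and share rules are not specified in this chapter.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Signal for the router to fall through to another tier."""


class RateLimited(Exception):
    """Too many calls from one tenant inside the last minute."""


@dataclass
class Gateway:
    monthly_quota: dict            # tenant -> token budget for the period
    rate_limit_per_minute: int
    price_per_token: float         # assumed flat price, for illustration
    provider_share: float          # assumed fraction credited to providers
    used: dict = field(default_factory=lambda: defaultdict(int))
    recent: dict = field(default_factory=lambda: defaultdict(deque))
    meter: list = field(default_factory=list)
    credits: dict = field(default_factory=lambda: defaultdict(float))

    def handle(self, tenant, card_owner, serve, now=None):
        """serve() runs the model and returns (tokens_in, tokens_out)."""
        now = time.monotonic() if now is None else now
        # 1. Monthly quota, checked before dispatching.
        if self.used[tenant] >= self.monthly_quota.get(tenant, 0):
            raise QuotaExceeded(tenant)
        # 2. Per-minute rate limit over a sliding 60-second window.
        window = self.recent[tenant]
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.rate_limit_per_minute:
            raise RateLimited(tenant)
        window.append(now)
        # 3. Dispatch, then meter the successful call.
        started = time.monotonic()
        tokens_in, tokens_out = serve()
        latency = time.monotonic() - started
        cost = (tokens_in + tokens_out) * self.price_per_token
        self.used[tenant] += tokens_in + tokens_out
        self.meter.append({"tenant": tenant, "tokens_in": tokens_in,
                           "tokens_out": tokens_out, "latency": latency,
                           "cost": cost})
        # 4. Revenue share, only when a provider's card served the call.
        if card_owner is not None:
            self.credits[card_owner] += cost * self.provider_share
        return cost
```

Here card_owner is None for an operator-run card, so no credit is recorded. Keeping all four steps in one method is the point of the chokepoint: no caller can meter without also being quota-checked.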

GPU classes you can bring

Daalu’s serving stack is tested on two NVIDIA Ada-generation cards, and the card you bring determines which models you can serve:

RTX 2000 Ada (16 GB). Fits a strong general 8B-class model and, for the coding workspace, a 14B coder as the sole model on the card. The everyday entry point.
RTX 6000 Ada (48 GB). Fits larger 30–32B models, including the long-context coder used for agentic, tool-driven work.

You don’t hand-tune any of this. When you connect a card, Daalu detects its class and offers the models that fit. Hardware setup is Chapter 31 — Connecting your GPU.
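As a rough picture of "detect the class, offer what fits", the lookup below maps a detected card name to the model families listed above. The GpuClass structure, the offered_models function, and the name-matching rule are assumptions for illustration; only the card names, memory sizes, and model families come from this chapter.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GpuClass:
    name: str
    vram_gb: int
    models: Tuple[str, ...]


# The two tested classes and what this chapter says each one fits.
SUPPORTED_CLASSES = (
    GpuClass("RTX 2000 Ada", 16,
             ("general 8B-class model", "14B coder (sole model on the card)")),
    GpuClass("RTX 6000 Ada", 48,
             ("30-32B models", "long-context coder")),
)


def offered_models(detected_name: str) -> Tuple[str, ...]:
    """Return the model families for a detected card, or () if untested."""
    lowered = detected_name.lower()
    for gpu in SUPPORTED_CLASSES:
        if gpu.name.lower() in lowered:
            return gpu.models
    return ()
```

For example, offered_models("NVIDIA RTX 6000 Ada Generation") returns the two 48 GB entries, while a card outside the tested classes returns an empty tuple.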

Where the AI Factory shows up

Two surfaces in the product reflect this model.

The AI Factory page (Chapter 30 — Overview) is role-scoped: an owner or provider sees their card’s hardware metrics and can run diagnostics, while a consumer sees usage-centric panels for their use of a shared card and never another tenant’s hardware. The page reads your role and shows you exactly the view that role is entitled to.

Usage & Pricing (Chapter 27 — Billing) is the money side: tokens and spend by tier, the savings from running on your own GPU, and — if you’re a provider — the revenue-share credits the gateway has recorded for capacity you contributed.
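The numbers on that page can be thought of as a roll-up of the gateway's metering records. The sketch below assumes a hypothetical record shape (tier, tokens, cost) and assumes savings are estimated as "what your own-GPU tokens would have cost at the hosted per-token price"; the actual formula is not given in this chapter.

```python
from collections import defaultdict


def summarize_usage(records, hosted_price_per_token):
    """Roll metering records up into tokens and spend per tier."""
    tokens_by_tier = defaultdict(int)
    spend_by_tier = defaultdict(float)
    for record in records:
        tokens_by_tier[record["tier"]] += record["tokens"]
        spend_by_tier[record["tier"]] += record["cost"]
    # Assumed definition: own-GPU tokens priced at the hosted rate.
    own_tokens = tokens_by_tier.get("own_gpu", 0)
    return {
        "tokens_by_tier": dict(tokens_by_tier),
        "spend_by_tier": dict(spend_by_tier),
        "own_gpu_savings": own_tokens * hosted_price_per_token,
    }
```

Calls served on your own card carry a cost of zero in this model, so they add to the token totals and the savings estimate but not to spend.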

The mental model in one line

Own a card and your inference is free. Lend its spare capacity and you earn. Own nothing and you rent a shared pool by the token. The router and the gateway make the routing and the accounting happen automatically — your only decision is which role you want to be in.

Next: Chapter 17 — The LLM router
