34. AIPerf load testing

Drive real load at your endpoint, watch the SLO curve bend, and learn exactly how much your GPU can serve before latency degrades.

Metrics tell you how a card is behaving right now. They don’t tell you how it behaves under pressure — how many concurrent users it can serve before time-to-first-token blows past your SLO, or where the throughput ceiling is for capacity planning. That’s what the Performance benchmarking panel (AIPerf) is for.

AIPerf is NVIDIA’s load generator for inference endpoints. A run is real traffic: it fires structured requests at your serving stack and measures latency and throughput across a range of concurrency levels.

Why you’d run it

Capacity. “How many concurrent coding sessions can this card handle before it falls over?”
Latency / SLO. “At our expected concurrency, what’s TTFT and inter-token latency?” — the numbers your users actually feel.
Pricing. The SLO-vs-concurrency curve is the data behind capacity planning and the pricing model (Chapter 46).
Validation. Confirm that a new card holds up under load before you flip the router to it.

What a run measures

A sweep produces a per-concurrency SLO curve. The headline metrics:

Metric	Meaning
TTFT	Time to first token — how long until the user sees anything.
ITL	Inter-token latency — the gap between streamed tokens (how fast the answer “types”).
Output tok/s	Output token throughput — total generation rate.
Req/s	Request throughput — completed requests per second.

You read the curve to find the knee: the concurrency where latency starts climbing faster than throughput. That’s your practical ceiling.
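One way to make the knee concrete is to compare, level by level, how fast latency grows against how fast throughput grows. The sketch below is a minimal illustration of that reading, not AIPerf's own logic: the `Level` record, the `find_knee` helper, and every number in `sweep` are hypothetical, and it assumes TTFT and throughput are strictly positive.

```python
from dataclasses import dataclass


@dataclass
class Level:
    concurrency: int
    ttft_ms: float     # time to first token at this level
    output_tps: float  # output tokens per second at this level


def find_knee(levels):
    """Return the first concurrency whose TTFT grew by a larger factor
    than throughput did, compared with the previous level.

    Returns None when no level in the sweep meets that condition.
    """
    ordered = sorted(levels, key=lambda lvl: lvl.concurrency)
    for prev, cur in zip(ordered, ordered[1:]):
        latency_growth = cur.ttft_ms / prev.ttft_ms
        throughput_growth = cur.output_tps / prev.output_tps
        if latency_growth > throughput_growth:
            return cur.concurrency
    return None


# Made-up numbers for illustration only; these are not measurements.
sweep = [
    Level(1, 80.0, 60.0),
    Level(2, 82.0, 118.0),
    Level(4, 85.0, 230.0),
    Level(8, 95.0, 420.0),
    Level(16, 240.0, 520.0),
    Level(32, 900.0, 540.0),
]

knee = find_knee(sweep)
print(knee)  # 16 for the made-up numbers above
```

A pairwise ratio like this is sensitive to noise between adjacent levels, so treat the result as a pointer to where to look in the table rather than a verdict.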

The run config

Configure the sweep, then hit Run sweep. The caps in the table are the ones the server enforces:

Field	Default	Cap / values	What it does
Model	The endpoint’s served model	—	Which model to benchmark (pinned to your endpoint’s model when you’re scoped to your own GPU).
Concurrency sweep	`1,2,4,8,16,32`	Comma-separated list	Each level is benchmarked in turn — this is what produces the curve.
Requests / level	200	1 – 2000	How many requests to fire at each concurrency level.
Input length (ISL)	512	1 – 8192	Prompt token length per request.
Output length (OSL)	256	1 – 8192	Generated token length per request.
Endpoint type	chat	—	Benchmarks the chat completions path.
Streaming	on	—	Streams tokens, so TTFT and ITL are measured the way users experience them.

Bigger sweeps and longer token lengths mean more load and a longer run. Start modest.
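To get a feel for how much load a configuration implies before you run it, you can multiply it out. The sketch below is a rough pre-flight check under stated assumptions: the `SweepConfig` class and its field names are hypothetical (they are not the panel's API), the caps are copied from the table above and treated as inclusive, and the output-token total assumes every request generates the full OSL. The server remains the authority on what is accepted.

```python
from dataclasses import dataclass

# Caps copied from the run config table; assumed inclusive.
REQUESTS_RANGE = (1, 2000)
TOKEN_RANGE = (1, 8192)


@dataclass
class SweepConfig:
    # Hypothetical field names; defaults mirror the table.
    concurrency: str = "1,2,4,8,16,32"
    requests_per_level: int = 200
    isl: int = 512
    osl: int = 256


def parse_levels(text):
    """Turn the comma-separated sweep string into a list of positive ints."""
    levels = [int(part) for part in text.split(",") if part.strip()]
    if not levels or any(level < 1 for level in levels):
        raise ValueError("concurrency sweep needs at least one positive integer")
    return levels


def check(name, value, bounds):
    """Raise ValueError when value falls outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}-{high}")


def estimate(cfg):
    """Validate the config locally and total up requests and tokens."""
    levels = parse_levels(cfg.concurrency)
    check("requests_per_level", cfg.requests_per_level, REQUESTS_RANGE)
    check("isl", cfg.isl, TOKEN_RANGE)
    check("osl", cfg.osl, TOKEN_RANGE)
    total_requests = len(levels) * cfg.requests_per_level
    return {
        "levels": levels,
        "total_requests": total_requests,
        "input_tokens": total_requests * cfg.isl,
        "output_tokens": total_requests * cfg.osl,  # upper bound
    }


result = estimate(SweepConfig())
print(result)
```

With the defaults, six levels at 200 requests each come to 1,200 requests, about 614,000 prompt tokens and up to about 307,000 generated tokens, which is a useful sanity check on what "modest" means for your card.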

Warning: a sweep is real load. A full-concurrency sweep is genuine traffic on the endpoint under test — it will spike latency for anything else the card is serving while it runs. The panel carries a hazard banner for exactly this reason. Run it off-peak, or against a candidate card before the router is flipped to it. This is a benchmark, not a synthetic probe.

Who can run it

Access is gated on the server, and each scope is pinned to a specific target — you cannot point a sweep at an arbitrary URL.

You are…	You can benchmark	Target choice	Runs you see
Site operator	The shared serving stack	Free choice of target	Every run
GPU owner / provider	Only your own endpoint	Pinned to your endpoint	Your own runs
Everyone else	— (denied)	—	—

The distinction matters: a site operator benchmarks the platform’s shared serving stack and can compare, say, the raw model endpoint against the gateway front door. A GPU owner or provider benchmarks their own card’s endpoint and nothing else — passing an arbitrary target URL is rejected outright. Consumers and non-admins can’t run AIPerf at all.

Note: execution has a second safety switch. Beyond the access gate, AIPerf execution is governed by an operator flag. When it’s off on a deployment, you can still queue a run, but it fails fast with a clear message rather than loading the card — so a sweep can never accidentally hammer a shared production GPU. The panel tells you when execution is disabled.
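The two gates described above, the access scope and the execution flag, can be sketched as a pair of checks that run in order. This is an illustration of the rules in this section, not the platform's implementation: the role strings, the `Caller` record, the `shared_targets` set, and the shape of the returned run record are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Caller:
    role: str                           # hypothetical: "operator", "gpu_owner", or other
    own_endpoint: Optional[str] = None  # set only for GPU owners / providers


class Denied(Exception):
    """Raised when the access gate rejects the request."""


def resolve_target(caller, requested, shared_targets):
    """Apply the access gate and return the target the run is pinned to."""
    if caller.role == "operator":
        # Operators choose freely, but only among shared-stack targets.
        if requested not in shared_targets:
            raise Denied("target is not part of the shared serving stack")
        return requested
    if caller.role == "gpu_owner":
        # Owners are pinned to their own endpoint; anything else is rejected.
        if requested is not None and requested != caller.own_endpoint:
            raise Denied("GPU owners can only benchmark their own endpoint")
        return caller.own_endpoint
    raise Denied("AIPerf is not available for this account")


def queue_run(caller, requested, shared_targets, execution_enabled):
    """Queue a run; with the flag off it fails fast without sending load."""
    target = resolve_target(caller, requested, shared_targets)
    if not execution_enabled:
        return {
            "target": target,
            "status": "failed",
            "message": "AIPerf execution is disabled on this deployment",
        }
    return {"target": target, "status": "pending", "message": ""}


shared = {"raw-model", "gateway"}  # hypothetical target names
operator_run = queue_run(Caller("operator"), "gateway", shared, True)
owner_run = queue_run(Caller("gpu_owner", "my-endpoint"), None, shared, False)
print(operator_run["status"], owner_run["status"])
```

Checking access before the flag means a denied caller never learns anything about the deployment's execution state, while an allowed caller gets a run record either way, matching the "queue, then fail fast" behaviour described in the note.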

Reading results and downloading artifacts

Each run appears in the list and moves through pending → running → passed / failed; the panel polls every few seconds while a run is live. Expand a run to see:

Headline metrics inline (TTFT, ITL, throughput) once they’re in.
The per-concurrency SLO curve as a table — one row per concurrency level, so you can see exactly where latency bends.
The run’s parameters — model, target, request count, ISL / OSL.
Downloadable artifacts — the structured outputs, streamed from object storage scoped to that run:

Artifact	What it is
`profile_export_aiperf.csv`	The full results table, ready for a spreadsheet.
`profile_export_aiperf.json`	The same data, structured for tooling.
logs	The raw run logs.

Downloads are bounded to that run’s own files — you can only fetch artifacts that belong to a run you’re allowed to see.

A worked reading

Suppose you sweep `1,2,4,8,16,32` against an ada-16 card. You’ll typically see TTFT stay flat and low through the lower levels while throughput climbs; then, somewhere in the sweep, TTFT starts rising sharply while throughput flattens. That inflection is the knee, the card’s comfortable concurrency ceiling. Set your capacity plan a step below it, and you have an SLO you can actually promise.

That’s the full AI Factory observability and operations story. The last chapter of Part V covers the place where you use all this inference directly: the browser coding workspace.

Next: Chapter 35 — The coding workspace
