34. AIPerf load testing

Drive real load at your endpoint, watch the SLO curve bend, and learn exactly how much your GPU can serve before latency degrades.

Metrics tell you how a card is behaving right now. They don’t tell you how it behaves under pressure — how many concurrent users it can serve before time-to-first-token blows past your SLO, or where the throughput ceiling is for capacity planning. That’s what the Performance benchmarking panel (AIPerf) is for.

AIPerf is NVIDIA’s load generator for inference endpoints. A run is real traffic: it fires structured requests at your serving stack and measures latency and throughput across a range of concurrency levels.


Why you’d run it

  • Capacity. “How many concurrent coding sessions can this card handle before it falls over?”
  • Latency / SLO. “At our expected concurrency, what’s TTFT and inter-token latency?” — the numbers your users actually feel.
  • Pricing. The SLO-vs-concurrency curve is the data behind capacity planning and the pricing model (Chapter 46).
  • Validating a new card before you flip the router to it.

What a run measures

A sweep produces a per-concurrency SLO curve. The headline metrics:

Metric | Meaning
TTFT | Time to first token: how long until the user sees anything.
ITL | Inter-token latency: the gap between streamed tokens (how fast the answer “types”).
Output tok/s | Output token throughput: total generation rate.
Req/s | Request throughput: completed requests per second.
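
To make the four metrics concrete, here is a minimal sketch of how they could be derived from per-token arrival timestamps. The function names and the choice of a mean gap for ITL are assumptions for illustration; AIPerf's own definitions and aggregation (percentiles, warm-up handling) may differ.

```python
from statistics import mean


def ttft_and_itl(request_sent_at: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for one streamed response.

    token_times holds the arrival time of each streamed token, ascending.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_times[0] - request_sent_at
    # ITL: average gap between consecutive tokens (0.0 for a single token).
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl


def throughput(output_tokens: int, completed_requests: int, wall_seconds: float) -> tuple[float, float]:
    """Return (output tok/s, req/s) over one measurement window."""
    return output_tokens / wall_seconds, completed_requests / wall_seconds
```

Measuring from token arrival times only works when streaming is on, which is why TTFT and ITL are tied to the streamed path.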

You read the curve to find the knee: the concurrency where latency starts climbing faster than throughput. That’s your practical ceiling.
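
One way to locate the knee programmatically is to compare relative growth between adjacent sweep levels. The sketch below is an assumption about how you might post-process the curve, not something the panel does for you, and the numbers in the usage note are invented purely to show the shape.

```python
from typing import Optional

# One row of the curve: (concurrency, ttft_ms, output_tok_per_s).
Row = tuple[int, float, float]


def find_knee(curve: list[Row]) -> Optional[int]:
    """Return the first concurrency level at which TTFT grew by a larger
    fraction than throughput did, compared with the previous level.

    Rows must be sorted by ascending concurrency, with positive metric values.
    Returns None if latency never outpaces throughput within the sweep.
    """
    for (_, prev_ttft, prev_tps), (level, ttft, tps) in zip(curve, curve[1:]):
        latency_growth = (ttft - prev_ttft) / prev_ttft
        throughput_growth = (tps - prev_tps) / prev_tps
        if latency_growth > throughput_growth:
            return level
    return None
```

With an invented curve such as `[(1, 100, 50), (2, 102, 98), (4, 105, 190), (8, 112, 360), (16, 180, 520), (32, 420, 560)]`, the function returns 16: between 8 and 16, TTFT grows by about 61% while throughput grows by about 44%. Real measurements are noisier, so treat a single-step comparison like this as a starting point rather than a verdict.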


The run config

You configure the sweep, then hit Run sweep. The caps listed below are enforced server-side, not only in the panel:

Field | Default | Cap / values | What it does
Model | The endpoint’s served model | | Which model to benchmark (pinned to your endpoint’s model when you’re scoped to your own GPU).
Concurrency sweep | 1,2,4,8,16,32 | Comma-separated list | Each level is benchmarked in turn; this is what produces the curve.
Requests / level | 200 | 1 – 2000 | How many requests to fire at each concurrency level.
Input length (ISL) | 512 | 1 – 8192 | Prompt token length per request.
Output length (OSL) | 256 | 1 – 8192 | Generated token length per request.
Endpoint type | chat | | Benchmarks the chat completions path.
Streaming | on | | Streams tokens, so TTFT and ITL are measured the way users experience them.

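Load scales with the number of concurrency levels, the requests per level, and the token lengths, so bigger sweeps mean a longer run: the defaults alone issue 6 × 200 = 1,200 requests. Start modest.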
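
To estimate the size of a sweep before running it, the fields and caps from the table can be mirrored in a small helper. This is a sketch under stated assumptions: `SweepConfig`, `parse_sweep`, and the method names are hypothetical and are not the server's real schema or validation code; only the defaults and ranges come from the table.

```python
from dataclasses import dataclass


def parse_sweep(text: str) -> tuple[int, ...]:
    """Parse a comma-separated concurrency list such as '1,2,4,8'."""
    return tuple(int(part) for part in text.split(","))


@dataclass(frozen=True)
class SweepConfig:
    # Defaults and ranges taken from the run-config table.
    concurrency: tuple[int, ...] = (1, 2, 4, 8, 16, 32)
    requests_per_level: int = 200
    input_len: int = 512   # ISL
    output_len: int = 256  # OSL

    def validate(self) -> None:
        """Raise ValueError if a field falls outside the documented caps."""
        if not self.concurrency or any(level < 1 for level in self.concurrency):
            raise ValueError("concurrency levels must be positive integers")
        if not 1 <= self.requests_per_level <= 2000:
            raise ValueError("requests per level must be in 1..2000")
        if not 1 <= self.input_len <= 8192:
            raise ValueError("ISL must be in 1..8192")
        if not 1 <= self.output_len <= 8192:
            raise ValueError("OSL must be in 1..8192")

    def total_requests(self) -> int:
        return len(self.concurrency) * self.requests_per_level

    def total_tokens(self) -> int:
        # Estimate that assumes every request generates the full OSL.
        return self.total_requests() * (self.input_len + self.output_len)
```

For the defaults, `SweepConfig().total_tokens()` comes to 921,600 tokens processed, which gives a feel for why doubling ISL or OSL lengthens a run noticeably.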

Warning: a sweep is real load. A full-concurrency sweep is genuine traffic on the endpoint under test — it will spike latency for anything else the card is serving while it runs. The panel carries a hazard banner for exactly this reason. Run it off-peak, or against a candidate card before the router is flipped to it. This is a benchmark, not a synthetic probe.


Who can run it

Access is gated on the server, and each scope is pinned to a specific target — you cannot point a sweep at an arbitrary URL.

You are… | You can benchmark | Target choice | Runs you see
Site operator | The shared serving stack | Free choice of target | Every run
GPU owner / provider | Only your own endpoint | Pinned to your endpoint | Your own runs
Everyone else | None (denied) | |

The distinction matters: a site operator benchmarks the platform’s shared serving stack and can compare, say, the raw model endpoint against the gateway front door. A GPU owner or provider benchmarks their own card’s endpoint and nothing else — passing an arbitrary target URL is rejected outright. Consumers and non-admins can’t run AIPerf at all.

Note: execution has a second safety switch. Beyond the access gate, AIPerf execution is governed by an operator flag. When it’s off on a deployment, you can still queue a run, but it fails fast with a clear message rather than loading the card — so a sweep can never accidentally hammer a shared production GPU. The panel tells you when execution is disabled.
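
The two layers described above, the access gate and the execution flag, can be sketched as follows. Everything here is hypothetical: the role names, the idea of a named set of shared targets, and the function signatures are assumptions made for illustration, not the platform's actual implementation.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    SITE_OPERATOR = "site_operator"
    GPU_OWNER = "gpu_owner"  # also covers providers in this sketch
    OTHER = "other"


class AccessDenied(Exception):
    pass


def resolve_target(role: Role, requested: Optional[str],
                   shared_targets: set[str], own_endpoint: Optional[str]) -> str:
    """Decide which endpoint a sweep may hit; never trust a caller-supplied URL."""
    if role is Role.SITE_OPERATOR:
        # Operators choose freely, but only among the shared stack's known targets.
        if requested not in shared_targets:
            raise AccessDenied("unknown shared target")
        return requested
    if role is Role.GPU_OWNER:
        if own_endpoint is None:
            raise AccessDenied("no endpoint registered for this owner")
        # Any explicit target other than the owner's own endpoint is rejected.
        if requested is not None and requested != own_endpoint:
            raise AccessDenied("arbitrary target rejected")
        return own_endpoint
    raise AccessDenied("role may not run AIPerf")


def execute_queued_run(execution_enabled: bool, target: str) -> tuple[str, str]:
    """Second safety switch: a queued run fails fast when execution is off."""
    if not execution_enabled:
        return ("failed", "AIPerf execution is disabled on this deployment")
    return ("running", target)
```

Keeping target resolution on the server is the design point: the client can ask, but the scope decides.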


Reading results and downloading artifacts

Each run appears in the list and moves through pending → running → passed / failed. The panel polls for updates every few seconds while a run is live. Expand a run to see:

  • Headline metrics inline (TTFT, ITL, throughput) once they’re in.

  • The per-concurrency SLO curve as a table — one row per concurrency level, so you can see exactly where latency bends.

  • The run’s parameters — model, target, request count, ISL / OSL.

  • Downloadable artifacts — the structured outputs, streamed from object storage scoped to that run:

    Artifact | What it is
    profile_export_aiperf.csv | The full results table, ready for a spreadsheet.
    profile_export_aiperf.json | The same data, structured for tooling.
    logs | The raw run logs.

Downloads are bounded to that run’s own files — you can only fetch artifacts that belong to a run you’re allowed to see.
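
If you script around the run lifecycle described in this section (pending → running → passed / failed), the polling pattern looks roughly like the sketch below. `fetch_status` is a hypothetical callable that you would supply; this chapter does not document an API for reading run status, and the 3-second interval is just one reading of "every few seconds".

```python
import time
from typing import Callable

TERMINAL_STATES = {"passed", "failed"}


def wait_for_run(fetch_status: Callable[[], str],
                 interval_s: float = 3.0,
                 timeout_s: float = 3600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the run reaches a terminal state, then return that state.

    Raises TimeoutError if the run is still live after timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.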


A worked reading

Suppose you sweep 1,2,4,8,16,32 against an ada-16 card. You’ll typically see TTFT flat and low through the lower levels, throughput climbing, then — somewhere in the sweep — TTFT starts rising sharply while throughput flattens. That inflection is the card’s comfortable concurrency ceiling. Set your capacity plan a step below it, and you have an SLO you can actually promise.
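
The "one step below the knee" rule can be written down directly. In this sketch the helper name is hypothetical and the knee value used in the usage note is an invented example, not a measured result for any card.

```python
from typing import Optional


def planned_concurrency(sweep: list[int], knee: int) -> Optional[int]:
    """Return the sweep level immediately below the knee.

    Returns None when the knee is already the lowest level tested, which
    means the sweep needs finer or lower levels before you can plan.
    """
    levels = sorted(sweep)
    idx = levels.index(knee)  # raises ValueError if knee is not a sweep level
    return levels[idx - 1] if idx > 0 else None
```

For the default sweep, a knee observed at 16 would give a planning figure of 8. Because sweep levels double, that step is coarse; adding intermediate levels such as 10 and 12 to a follow-up sweep narrows it.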


That’s the full AI Factory observability and operations story. The last chapter of Part V is the place all this inference gets used by you directly: the browser coding workspace.

Next: Chapter 35 — The coding workspace