README.md

<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Bench Entrypoints

`multiturn_bench` simulates concurrent multi-turn conversations against an
OpenAI-compatible chat endpoint and reports per-turn TTFT and total latency
statistics. It can optionally enable **speculative prefill** — a technique that
pre-warms the KV cache with the predicted next-turn prefix after each assistant
response, cutting TTFT on subsequent turns.

`offline_replay_bench` runs the Rust-native replay loop directly for profiling
and throughput measurements without going through the Python wrapper. It uses
the mocker's internal polynomial perf model so the results stay focused on
replay overhead instead of external timing backends.

## Quick start

```bash
# Smoke test (1 user, 1 turn, ~50 tokens)
cargo bench --package dynamo-bench --bench multiturn_bench -- --ping
```

## Speculative prefill demo

Speculative prefill works best with multi-turn workloads where the conversation
grows incrementally (e.g. reasoning models in agentic loops). After each
assistant turn the frontend constructs the next-turn prompt prefix and sends a
`max_tokens=1` request to warm the KV cache, so the real follow-up hits a warm
cache and gets a much lower TTFT.

### 1. Launch the backend and frontend

```bash
# Terminal 1 — backend (vLLM example, any supported backend works)
python -m dynamo.vllm \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B

# Terminal 2 — frontend with KV router
python -m dynamo.frontend \
  --router-mode kv \
  --http-port 8000
```

### 2. Run baseline (no speculative prefill)

```bash
cargo bench --package dynamo-bench --bench multiturn_bench -- \
  --url http://localhost:8000 \
  --num-users 10 \
  --num-turns 5 \
  --num-user-tokens 128 \
  --max-completion-tokens 256 \
  --mean-delay-ms 5000 \
  --output baseline.json \
  --verbose
```

### 3. Run with speculative prefill

```bash
cargo bench --package dynamo-bench --bench multiturn_bench -- \
  --url http://localhost:8000 \
  --num-users 10 \
  --num-turns 5 \
  --num-user-tokens 128 \
  --max-completion-tokens 256 \
  --mean-delay-ms 5000 \
  --speculative-prefill \
  --output specprefill.json \
  --verbose
```

Compare the per-turn TTFT columns: turns 2+ should show a significant TTFT
reduction (up to ~3x) because the KV cache is already warm when the real
request arrives.

## CLI reference

| Flag | Default | Description |
|------|---------|-------------|
| `--url` | `http://localhost:8000` | Frontend HTTP endpoint |
| `--model` | auto-detected | Model name (queries `/v1/models` if omitted) |
| `--num-users` | `10` | Concurrent simulated users |
| `--num-turns` | `5` | Conversation turns per user |
| `--num-user-tokens` | `128` | Approximate user-prompt token count per turn |
| `--max-completion-tokens` | `1000` | Output sequence length cap |
| `--ignore-eos` | `true` | Force generation to max tokens |
| `--mean-delay-ms` | `5000` | Mean inter-turn delay (exponential distribution) |
| `--speculative-prefill` | `false` | Enable speculative prefill via `nvext.agent_hints` |
| `--output <path>` | none | Write results to JSON file |
| `--verbose` / `-v` | `false` | Print per-turn logging |
| `--seed` | `42` | Random seed |
| `--ping` | `false` | Smoke-test mode (1 user, 1 turn, ~50 tokens, no delay) |

## How speculative prefill works

1. The client sends `{"nvext": {"agent_hints": {"speculative_prefill": true}}}` in each request.
2. As the assistant response streams back, the frontend accumulates the full response text.
3. Once `finish_reason` is set, a background task constructs the next-turn prompt (conversation history + assistant response, thinking content stripped) and sends a `max_tokens=1` prefill-only request through the pipeline.
4. The KV router routes the speculative request to the same worker, warming its cache.
5. When the real next-turn request arrives, the KV router sees high cache overlap on that worker and routes there, yielding a much lower TTFT.

See also: [Agent Hints documentation](../../docs/components/frontend/nvext.md#agent-hints)

## Offline replay

```bash
cargo bench --package dynamo-bench --bench offline_replay_bench -- \
  /path/to/mooncake_trace.jsonl \
  --num-workers 4 \
  --router-mode kv-router \
  --arrival-speedup-ratio 4 \
  --trace-block-size 512 \
  --block-size 64
```

Use `--speedup-ratio` and `--decode-speedup-ratio` if you want a simple scaling
knob while keeping the same internal polynomial model.