"docs/vscode:/vscode.git/clone" did not exist on "8409e41281f545ba3fe8290a0b53b5a44f1add3f"
README.md 4.18 KB
Newer Older
1
2
3
4
5
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

6
# Bench Entrypoints
7
8
9
10
11
12
13

`multiturn_bench` simulates concurrent multi-turn conversations against an
OpenAI-compatible chat endpoint and reports per-turn TTFT and total latency
statistics. It can optionally enable **speculative prefill** — a technique that
pre-warms the KV cache with the predicted next-turn prefix after each assistant
response, cutting TTFT on subsequent turns.

14
15
16
`offline_replay_bench` runs the Rust-native replay loop directly for profiling
and throughput measurements without going through the Python wrapper.

17
18
19
20
## Quick start

```bash
# Smoke test (1 user, 1 turn, ~50 tokens)
21
cargo bench --package dynamo-bench --bench multiturn_bench -- --ping
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
```

## Speculative prefill demo

Speculative prefill works best with multi-turn workloads where the conversation
grows incrementally (e.g. reasoning models in agentic loops). After each
assistant turn the frontend constructs the next-turn prompt prefix and sends a
`max_tokens=1` request to warm the KV cache, so the real follow-up hits a warm
cache and gets a much lower TTFT.

### 1. Launch the backend and frontend

```bash
# Terminal 1 — backend (vLLM example, any supported backend works)
python -m dynamo.vllm \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B

# Terminal 2 — frontend with KV router
python -m dynamo.frontend \
  --router-mode kv \
  --http-port 8000
```

### 2. Run baseline (no speculative prefill)

```bash
48
cargo bench --package dynamo-bench --bench multiturn_bench -- \
49
50
51
52
53
54
55
56
57
58
59
60
61
  --url http://localhost:8000 \
  --num-users 10 \
  --num-turns 5 \
  --num-user-tokens 128 \
  --max-completion-tokens 256 \
  --mean-delay-ms 5000 \
  --output baseline.json \
  --verbose
```

### 3. Run with speculative prefill

```bash
62
cargo bench --package dynamo-bench --bench multiturn_bench -- \
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
  --url http://localhost:8000 \
  --num-users 10 \
  --num-turns 5 \
  --num-user-tokens 128 \
  --max-completion-tokens 256 \
  --mean-delay-ms 5000 \
  --speculative-prefill \
  --output specprefill.json \
  --verbose
```

Compare the per-turn TTFT columns: turns 2+ should show a significant TTFT
reduction (up to ~3x) because the KV cache is already warm when the real
request arrives.

## CLI reference

| Flag | Default | Description |
|------|---------|-------------|
| `--url` | `http://localhost:8000` | Frontend HTTP endpoint |
| `--model` | auto-detected | Model name (queries `/v1/models` if omitted) |
| `--num-users` | `10` | Concurrent simulated users |
| `--num-turns` | `5` | Conversation turns per user |
| `--num-user-tokens` | `128` | Approximate user-prompt token count per turn |
| `--max-completion-tokens` | `1000` | Output sequence length cap |
| `--ignore-eos` | `true` | Force generation to max tokens |
| `--mean-delay-ms` | `5000` | Mean inter-turn delay (exponential distribution) |
| `--speculative-prefill` | `false` | Enable speculative prefill via `nvext.agent_hints` |
| `--output <path>` | none | Write results to JSON file |
| `--verbose` / `-v` | `false` | Print per-turn logging |
| `--seed` | `42` | Random seed |
| `--ping` | `false` | Smoke-test mode (1 user, 1 turn, ~50 tokens, no delay) |

## How speculative prefill works

1. The client sends `{"nvext": {"agent_hints": {"speculative_prefill": true}}}` in each request.
2. As the assistant response streams back, the frontend accumulates the full response text.
3. Once `finish_reason` is set, a background task constructs the next-turn prompt (conversation history + assistant response, thinking content stripped) and sends a `max_tokens=1` prefill-only request through the pipeline.
4. The KV router routes the speculative request to the same worker, warming its cache.
5. When the real next-turn request arrives, the KV router sees high cache overlap on that worker and routes there, yielding a much lower TTFT.

104
105
106
107
108
109
110
111
112
113
114
115
116
See also: [Agent Hints documentation](../../docs/components/frontend/nvext.md#agent-hints)

## Offline replay

```bash
cargo bench --package dynamo-bench --bench offline_replay_bench -- \
  /path/to/mooncake_trace.jsonl \
  --num-workers 4 \
  --router-mode kv-router \
  --arrival-speedup-ratio 4 \
  --trace-block-size 512 \
  --block-size 64
```