---
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Mocker Trace Replay
subtitle: Replay Mooncake-style traces through the mocker in offline or online mode
---

This guide covers the mocker's trace replay support for Mooncake-style JSONL traces. The replay
surface is available in two forms:

- `python -m dynamo.mocker --trace-file ...`, which writes a report file and prints a replay summary
- `python -m dynamo.replay ...`, which prints an AIPerf-style summary table, writes the full
  replay report JSON to disk, and exposes `offline|online`, `round_robin|kv_router`,
  `arrival_speedup_ratio`, and synthetic replay inputs directly

Unlike normal `dynamo.mocker` usage, offline replay does not launch workers, register endpoints, or
require NATS, etcd, or a frontend. Online replay does exercise the live mock-worker runtime path.

Use this when you want to:

- benchmark scheduler behavior from a saved trace
- compare timing and cache behavior across mocker configurations
- validate replay logic in CI without bringing up a distributed stack

## Quick Start

Run offline replay through the dedicated replay CLI:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --num-workers 4 \
    --replay-mode offline \
    --router-mode round_robin \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
    --report-json /tmp/replay-report.json
```

Run synthetic replay through the same CLI when you want fixed request shapes without a trace file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 1000 \
    --arrival-interval-ms 1.0 \
    --num-workers 1 \
    --replay-mode offline \
    --replay-concurrency 100 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
    --report-json /tmp/replay-report.json
```

You can also run replay through the mocker CLI by passing `--trace-file`:

```bash
python -m dynamo.mocker \
    --trace-file /path/to/mooncake_trace.jsonl \
    --model-path Qwen/Qwen3-0.6B
```

This writes a JSON report next to the trace file by default:

```text
/path/to/mooncake_trace.replay.json
```

`python -m dynamo.replay` prints an AIPerf-style summary table to stdout and writes the full replay
report JSON to disk. The mocker CLI prints a `Replay Summary` table to stdout and writes the report
JSON to disk.

## Input Format

The trace file must be Mooncake-style JSONL. Each line should contain:

- `timestamp` or `created_time`
- `input_length` or `input_tokens`
- `output_length` or `output_tokens`
- `hash_ids`

Example:

```json
{"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}
```
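The accepted field aliases can be normalized before any further processing. The following is a minimal sketch, not the mocker's own loader: `parse_trace_line` and the choice of canonical names are assumptions made for illustration.

```python
import json

# Aliases listed in this guide. Picking the first name of each pair as the
# canonical one is an assumption of this sketch.
ALIASES = {
    "timestamp": ("timestamp", "created_time"),
    "input_length": ("input_length", "input_tokens"),
    "output_length": ("output_length", "output_tokens"),
}


def parse_trace_line(line: str) -> dict:
    """Parse one JSONL line and map accepted aliases to canonical names."""
    raw = json.loads(line)
    record = {}
    for canonical, names in ALIASES.items():
        present = [name for name in names if name in raw]
        if not present:
            raise ValueError(f"missing one of {names}")
        record[canonical] = raw[present[0]]
    if "hash_ids" not in raw:
        raise ValueError("missing hash_ids")
    record["hash_ids"] = list(raw["hash_ids"])
    return record
```

Running this over a trace before replay is a cheap way to surface malformed lines early.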

The mocker synthesizes token blocks from `hash_ids` using the configured `--block-size`, so the
replay block size must match the block size used when the trace was generated. Public Mooncake
traces are commonly block-level hashes at `512` tokens per hash ID, so replaying them with the
default mocker `block_size=64` will fail once `input_length > len(hash_ids) * 64`. For
`engine_type=sglang`, replay still uses canonical `block_size` internally; `sglang.page_size` is
accepted as a compatibility alias and is normalized into `block_size` before replay starts.
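The failure condition above can be checked against a trace before replay starts. This is a sketch under the stated rule only (`input_length > len(hash_ids) * block_size`); the helper name `find_block_size_mismatches` is hypothetical and the mocker may report the problem differently.

```python
import json


def find_block_size_mismatches(lines, block_size):
    """Return (line_number, input_length, capacity) for records whose
    input length exceeds what their hash_ids can cover at block_size."""
    mismatches = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        if "input_length" in record:
            input_length = record["input_length"]
        else:
            input_length = record["input_tokens"]
        capacity = len(record["hash_ids"]) * block_size
        if input_length > capacity:
            mismatches.append((lineno, input_length, capacity))
    return mismatches
```

For a trace hashed at 512 tokens per ID, calling this with `block_size=64` should flag most long requests, while `block_size=512` should return an empty list.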

## Replay Surfaces

### `python -m dynamo.replay`

The dedicated replay CLI exposes:

- either a positional `trace_file`, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-mode offline|online`
- `--router-mode round_robin|kv_router`
- `--num-workers`
- `--replay-concurrency`
- `--arrival-interval-ms`
- `--arrival-speedup-ratio`
- `--extra-engine-args` (JSON string)
- `--router-config` (JSON string)
- `--report-json`

Example:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
    --router-config '{"router_queue_policy":"fcfs","router_temperature":0.0}' \
    --report-json /tmp/replay-report.json
```

SGLang replay uses the same CLI surface. A minimal `--extra-engine-args` object can use either
`block_size` directly or the compatibility alias `sglang.page_size`:

```json
{
  "engine_type": "sglang",
  "num_gpu_blocks": 512,
  "speedup_ratio": 1000.0,
  "sglang": {
    "page_size": 2
  }
}
```

Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Unspecified fields
fall back to the same defaults used by `MockEngineArgs::default()` and
`KvRouterConfig::default()`.
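When driving the replay CLI from a script, serializing the partial objects with `json.dumps` avoids shell-quoting mistakes. The sketch below only builds an argument list from flags documented in this guide; `build_replay_command` is a hypothetical helper, not part of `dynamo.replay`.

```python
import json
import sys


def build_replay_command(trace_file, engine_args, router_config=None, **flags):
    """Build an argv list for `python -m dynamo.replay`.

    Keyword flags use underscores (replay_mode="offline") and are
    rendered with dashes (--replay-mode offline).
    """
    argv = [sys.executable, "-m", "dynamo.replay", str(trace_file)]
    for name, value in flags.items():
        argv += [f"--{name.replace('_', '-')}", str(value)]
    # Partial objects are fine: unspecified fields fall back to defaults.
    argv += ["--extra-engine-args", json.dumps(engine_args)]
    if router_config is not None:
        argv += ["--router-config", json.dumps(router_config)]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` without going through a shell.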

### `python -m dynamo.mocker --trace-file`

The mocker CLI supports offline replay and remains useful when you want the historical
`Replay Summary` output and report-file workflow.

### Synthetic Replay

Synthetic replay bypasses trace loading and generates in-memory requests with fixed input/output
lengths and optional synthetic arrival spacing:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --arrival-interval-ms 0.5 \
    --replay-mode offline \
    --replay-concurrency 50 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
```

This is useful for parameter sweeps where Mooncake-style prefix structure is not required.
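Conceptually, the synthetic workload is a list of identical requests with evenly spaced arrivals. The sketch below illustrates that shape only; the `SyntheticRequest` type and the assumption that the first request arrives at time zero are not taken from the replay implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticRequest:
    arrival_ms: float
    input_tokens: int
    output_tokens: int


def make_synthetic_requests(input_tokens, output_tokens, request_count,
                            arrival_interval_ms=0.0):
    """Generate fixed-shape requests spaced arrival_interval_ms apart.

    Assumption: request i arrives at i * arrival_interval_ms.
    """
    return [
        SyntheticRequest(i * arrival_interval_ms, input_tokens, output_tokens)
        for i in range(request_count)
    ]
```

Because no `hash_ids` are involved, every request is independent, which is why this mode does not exercise prefix reuse the way a Mooncake-style trace does.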

## Modes

### Fixed-Schedule Replay

Default trace replay preserves the timestamps from the trace and simulates arrivals according to
those timestamps:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
```

This is the right mode when you want deterministic replay of the original arrival pattern.

### Closed-Loop Concurrency Replay

Use `--replay-concurrency` to ignore trace arrival timing and keep a fixed number of requests in
flight:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --replay-concurrency 16
```

This mode is useful when you want to compare scheduler behavior under a fixed offered concurrency rather than the original trace schedule.
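The difference from fixed-schedule replay is that a new request is admitted only when an in-flight one completes. The toy model below illustrates that admission rule with per-request service times supplied by the caller; it is a conceptual sketch, not the mocker's scheduler, and `closed_loop_start_times` is a hypothetical name.

```python
import heapq


def closed_loop_start_times(service_times, concurrency):
    """Return the start time of each request when at most `concurrency`
    requests are in flight and arrival timestamps are ignored."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    in_flight = []  # min-heap of finish times for running requests
    starts = []
    now = 0.0
    for service in service_times:
        if len(in_flight) == concurrency:
            # All slots are busy: wait for the earliest completion.
            now = heapq.heappop(in_flight)
        starts.append(now)
        heapq.heappush(in_flight, now + service)
    return starts
```

With service times `[3, 1, 2, 1]` and a concurrency of 2, the third request starts as soon as the second finishes, regardless of when the trace says it arrived.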

### Online Replay

Online replay launches the mock workers and replays the trace against the live runtime path. This
is useful when you want the replay to include live request dispatch, live output handling, and the
same async KV-event propagation model used by the current router integration.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
```

### Arrival Speedup

Use `--arrival-speedup-ratio` to compress or stretch the trace arrival process without changing the
mocker compute model. Larger values make arrivals happen sooner relative to the original trace.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --arrival-speedup-ratio 5 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
```
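One way to picture the effect is as a rescaling of arrival offsets. The sketch below assumes each offset from the first timestamp is divided by the ratio, so values above 1 compress the schedule and values below 1 stretch it; the exact transformation used by the replay runtime is not specified in this guide.

```python
def scale_arrivals(timestamps_ms, arrival_speedup_ratio):
    """Rescale arrival times relative to the first request.

    Assumption: offset_new = offset_old / arrival_speedup_ratio.
    """
    if arrival_speedup_ratio <= 0:
        raise ValueError("arrival_speedup_ratio must be positive")
    if not timestamps_ms:
        return []
    first = timestamps_ms[0]
    return [first + (ts - first) / arrival_speedup_ratio for ts in timestamps_ms]
```

Under this assumption, a ratio of 5 turns arrivals at 0, 1000, and 4000 ms into arrivals at 0, 200, and 800 ms, while per-request compute time is unchanged.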

### Router Modes

Replay currently supports:

- `round_robin`
- `kv_router`

`kv_router` uses the shared local scheduler and an in-process KV indexer. Router policy tuning is
provided through `--router-config`, not a dedicated top-level replay flag. In offline replay:

- `kv_router` is supported only when `num_workers > 1`
- router queueing is enabled and uses simulation time rather than wall-clock time
- KV visibility is delayed slightly relative to request lifecycle events
- queue admission is driven by router lifecycle edges (`add_request`, `mark_prefill_completed`, and `free`)
- transient in-pass prefill occupancy is still approximated at the router level rather than modeled exactly

To compare queue policies manually, keep the same trace and engine args fixed and swap only
`router_queue_policy` inside `--router-config`:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
    --router-config '{"router_queue_policy":"fcfs"}'

python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
    --router-config '{"router_queue_policy":"lcfs"}'
```

`lcfs` is intentionally a worse comparison policy under saturation; use it for experiments, not as
an expected production default.
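To compare two runs, it can help to diff the numeric fields of their report files. The report schema is not reproduced in this guide, so the following sketch is deliberately schema-agnostic: it flattens nested objects into dotted paths and subtracts matching numeric values. Both function names are hypothetical.

```python
def numeric_leaves(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": number} for numeric leaves only."""
    leaves = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            leaves.update(numeric_leaves(value, f"{prefix}{key}."))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        leaves[prefix[:-1]] = obj  # drop the trailing dot
    return leaves


def diff_reports(baseline, candidate):
    """Return candidate - baseline for every numeric path present in both."""
    base = numeric_leaves(baseline)
    cand = numeric_leaves(candidate)
    return {path: cand[path] - base[path] for path in sorted(base.keys() & cand.keys())}
```

Load each report with `json.load` and pass the two dictionaries to `diff_reports`; non-numeric fields and fields present in only one report are ignored.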

## Output

With the mocker CLI, use `--output-file` to override the default report location:

```bash
python -m dynamo.mocker \
    --trace-file /path/to/mooncake_trace.jsonl \
    --model-path Qwen/Qwen3-0.6B \
    --output-file /tmp/replay-report.json
```

If `--output-file` is not set, the report path defaults to `TRACE_STEM.replay.json` in the same directory as the input trace.
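Scripts that post-process the report can derive the default location with the same naming rule. A minimal sketch that mirrors the documented rule rather than the mocker's own code; `default_report_path` is a hypothetical helper.

```python
from pathlib import Path


def default_report_path(trace_file):
    """Return TRACE_STEM.replay.json in the trace file's directory."""
    trace = Path(trace_file)
    return trace.with_name(f"{trace.stem}.replay.json")
```

For `/path/to/mooncake_trace.jsonl` this yields `/path/to/mooncake_trace.replay.json`, matching the path shown in Quick Start.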

The report contains:

- request counts
- input and output token totals
- virtual duration and wall-clock runtime
- request and token throughput
- prefix cache reuse ratio
- TTFT, TTST, TPOT, ITL, and end-to-end latency summaries
- output-token-throughput-per-user summaries

The dedicated replay CLI writes a report with the same schema as the one returned by the Python APIs
`dynamo.replay.run_trace_replay(...)` and `dynamo.replay.run_synthetic_trace_replay(...)`.

If `--report-json` is not provided, `python -m dynamo.replay` writes a timestamped
`dynamo_replay_report_*.json` file in the current working directory.
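When the report name is auto-generated, a script needs to locate it after the run. The sketch below picks the most recently modified file matching the documented pattern; it relies on file modification time rather than parsing the timestamp in the name, since the timestamp format is not specified here. `load_latest_replay_report` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_latest_replay_report(directory="."):
    """Load the newest dynamo_replay_report_*.json in `directory`.

    Returns (path, parsed_report). Newest is decided by mtime.
    """
    candidates = sorted(
        Path(directory).glob("dynamo_replay_report_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no replay report found in {directory}")
    latest = candidates[-1]
    with latest.open() as handle:
        return latest, json.load(handle)
```

Passing `--report-json` explicitly avoids the lookup entirely and is the safer choice in CI.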

## Replay Constraints

Shared replay constraints:

- aggregated mode
- `--engine-type vllm|sglang`
- `--data-parallel-size 1`

Additional offline constraints:

- offline `kv_router` requires `num_workers > 1`
- public single-worker offline replay still uses the legacy single-worker runtime for `vllm`,
  while `sglang` goes through the shared multi-worker replay runtime even when `num_workers=1`

Additional online constraints:

- the current live replay path is also limited to aggregated workers

If you violate those constraints, replay fails immediately with a validation error.
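The rules listed above can also be checked in a wrapper script before a run is launched. This is a minimal sketch covering only the constraints stated in this section; `validate_replay_config` is a hypothetical helper, and the CLI's own validation may check more than this.

```python
def validate_replay_config(replay_mode, router_mode, num_workers,
                           engine_type, data_parallel_size=1):
    """Return a list of human-readable constraint violations (empty if none)."""
    errors = []
    if replay_mode not in ("offline", "online"):
        errors.append(f"unknown replay mode: {replay_mode}")
    if router_mode not in ("round_robin", "kv_router"):
        errors.append(f"unknown router mode: {router_mode}")
    if engine_type not in ("vllm", "sglang"):
        errors.append(f"unsupported engine type: {engine_type}")
    if data_parallel_size != 1:
        errors.append("replay requires data-parallel size 1")
    if replay_mode == "offline" and router_mode == "kv_router" and num_workers <= 1:
        errors.append("offline kv_router requires num_workers > 1")
    return errors
```

Returning all violations at once, instead of raising on the first, makes parameter sweeps easier to debug.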

## Practical Notes

- `python -m dynamo.replay` requires exactly one input source: a trace file, or all of
  `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-concurrency` works with both trace replay and synthetic replay
- `--speedup-ratio` still affects simulated timing
- `--arrival-speedup-ratio` affects trace timestamps, not worker compute speed
- `--arrival-interval-ms` only applies to synthetic replay
- `--extra-engine-args` and `--router-config` are JSON strings on the standalone replay CLI
- offline replay does not need planner runtime setup, router registration, or external event transport
- the replay block size should match the trace block size, because token synthesis expands `hash_ids`
  using the configured block size

## When To Use This vs AIPerf

Use offline replay when:

- you want a fast scheduler-only simulation
- you want deterministic CI coverage of replay behavior
- you do not need HTTP serving, frontend behavior, or network effects

Use [Dynamo Benchmarking](benchmarking.md) when:

- you want end-to-end benchmarking against a live endpoint
- you need frontend, transport, or cluster-level behavior
- you want AIPerf dashboards and endpoint-facing metrics