---
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Mocker Trace Replay
subtitle: Replay Mooncake-style traces through the mocker in offline or online mode
---

This guide covers trace replay support for Mooncake-style JSONL traces via `python -m dynamo.replay`,
which prints an AIPerf-style summary table, writes the full replay report JSON to disk, and exposes
`offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, closed-loop concurrency, and
synthetic workload inputs directly.

Unlike normal `dynamo.mocker` usage, offline replay does not launch workers, register endpoints, or
require NATS, etcd, or a frontend. Online replay does exercise the live mock-worker runtime path.

Use this when you want to:

- benchmark scheduler behavior from a saved trace
- compare timing and cache behavior across mocker configurations
- validate replay logic in CI without bringing up a distributed stack

## Quick Start

Run offline replay through the dedicated replay CLI:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --num-workers 4 \
    --replay-mode offline \
    --router-mode round_robin \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

Run synthetic replay through the same CLI when you want fixed request shapes without a trace file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 1000 \
    --arrival-interval-ms 1.0 \
    --num-workers 1 \
    --replay-mode offline \
    --replay-concurrency 100 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

Run synthetic workload replay when you want shared-prefix or multi-turn structure without a trace
file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --turns-per-session 3 \
    --shared-prefix-ratio 0.5 \
    --num-prefix-groups 8 \
    --inter-turn-delay-ms 250 \
    --replay-mode offline \
    --replay-concurrency 32 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

`python -m dynamo.replay` prints an AIPerf-style summary table to stdout and writes the full replay
report JSON to disk.

## Input Format

The trace file must be Mooncake-style JSONL. Each line should contain:

- `timestamp` or `created_time`
- `input_length` or `input_tokens`
- `output_length` or `output_tokens`
- `hash_ids`

Example:

```json
{"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}
```
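
A trace in this shape can be generated with a few lines of Python. The sketch below is illustrative only: `make_record` and `write_trace` are hypothetical helper names, not part of `dynamo.replay`, and the sketch assumes one hash ID per `block_size` tokens (rounded up), following the block-level hashing described for public Mooncake traces.

```python
import json
import math
from pathlib import Path


def make_record(timestamp, input_length, output_length, first_hash_id=0, block_size=512):
    """Build one trace record with enough hash IDs to cover the input.

    Assumption: one hash ID per `block_size` input tokens, rounded up.
    """
    num_blocks = math.ceil(input_length / block_size)
    return {
        "timestamp": timestamp,
        "input_length": input_length,
        "output_length": output_length,
        "hash_ids": list(range(first_hash_id, first_hash_id + num_blocks)),
    }


def write_trace(path, records):
    """Write records as JSONL: one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

For example, `write_trace("trace.jsonl", [make_record(0, 6755, 500)])` produces a one-line trace whose record carries 14 hash IDs at the assumed 512-token block size.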

Replay also supports multi-turn sessions. Use the same `session_id` on all turns in a session. The
first turn uses `timestamp` or `created_time`; later turns may use either:

- `delay` or `delay_ms` directly
- or an absolute later `timestamp`, in which case replay infers the inter-turn delay from the
  previous turn timestamp

Example:

```json
{"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]}
{"session_id":"session-a","delay":250,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]}
{"session_id":"session-b","timestamp":1010,"input_length":1024,"output_length":64,"hash_ids":[9,10]}
{"session_id":"session-b","delay_ms":50,"input_length":1536,"output_length":64,"hash_ids":[9,10,11]}
```
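
The delay rules above can be sketched in Python as follows. This is a simplified model, not the replay loader itself: `resolve_inter_turn_delays` is a hypothetical helper, and it assumes `delay`, `delay_ms`, and timestamps all use the same time unit.

```python
def resolve_inter_turn_delays(records):
    """Return (session_id, delay) for every later turn, in input order.

    A later turn uses `delay` or `delay_ms` when present; otherwise the delay
    is inferred from the gap to the previous turn's absolute timestamp.
    """
    last_ts = {}  # session_id -> timestamp of the previous turn (None if it had none)
    delays = []
    for rec in records:
        sid = rec["session_id"]
        ts = rec.get("timestamp", rec.get("created_time"))
        if sid in last_ts:  # a later turn of a session already seen
            delay = rec.get("delay", rec.get("delay_ms"))
            if delay is None:
                prev = last_ts[sid]
                if ts is None or prev is None:
                    raise ValueError(f"cannot infer delay for session {sid!r}")
                delay = ts - prev
            delays.append((sid, delay))
        last_ts[sid] = ts
    return delays
```

In this sketch, a later turn with `"timestamp": 1060` that follows a first turn at `1010` resolves to a delay of `50`, the same result as writing `"delay_ms": 50` directly.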

The mocker synthesizes token blocks from `hash_ids` using the configured mocker `block_size`, so the
replay block size must match the block size used when the trace was generated. Public Mooncake
traces are commonly block-level hashes at `512` tokens per hash ID, so replaying them with the
default mocker `block_size=64` will fail once `input_length > len(hash_ids) * 64`. Set that
through `--extra-engine-args '{"block_size":512}'`. For `engine_type=sglang`, replay still uses
canonical `block_size` internally; `sglang.page_size` is accepted as a compatibility alias and is
normalized into `block_size` before replay starts.
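
A quick pre-flight check can catch a block-size mismatch before replay starts. The sketch below applies the `input_length > len(hash_ids) * block_size` condition stated above; `fits_block_size` is a hypothetical helper name, not a mocker API.

```python
def fits_block_size(record, block_size):
    """Return True if the record's hash IDs can cover its input at this block size."""
    input_length = record.get("input_length", record.get("input_tokens"))
    capacity = len(record["hash_ids"]) * block_size
    return input_length <= capacity


record = {"input_length": 2048, "output_length": 128, "hash_ids": [1, 2, 3, 4]}
print(fits_block_size(record, 512))  # 4 * 512 = 2048 tokens of capacity
print(fits_block_size(record, 64))   # 4 * 64 = 256 tokens of capacity
```

With this record, the check passes at `block_size=512` and fails at the default `block_size=64`.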

## Replay Surfaces

### `python -m dynamo.replay`

The dedicated replay CLI exposes:

- either a positional `trace_file`, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-mode offline|online`
- `--router-mode round_robin|kv_router`
- `--num-workers`
- `--replay-concurrency`
- `--arrival-interval-ms`
- `--arrival-speedup-ratio`
- `--turns-per-session`
- `--shared-prefix-ratio`
- `--num-prefix-groups`
- `--inter-turn-delay-ms`
- `--extra-engine-args` (JSON string)
- `--router-config` (JSON string)
- `--report-json`

Defaults:

- `--replay-mode offline`
- `--router-mode round_robin`

Example:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --extra-engine-args '{"block_size":512}' \
    --router-config '{"router_queue_policy":"fcfs","router_temperature":0.0}' \
    --report-json /tmp/replay-report.json
```

SGLang replay uses the same CLI surface. A minimal `--extra-engine-args` JSON object can use either
`block_size` directly or the compatibility alias `sglang.page_size`:

```json
{
  "engine_type": "sglang",
  "num_gpu_blocks": 512,
  "sglang": {
    "page_size": 2
  }
}
```
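
The alias handling can be pictured with a small Python sketch. It is a guess at the shape of the normalization, not the replay code: `normalize_block_size` is a hypothetical helper, and raising when `block_size` and `sglang.page_size` disagree is an assumption, since the behavior for conflicting values is not documented here.

```python
import copy


def normalize_block_size(engine_args):
    """Copy `sglang.page_size` into `block_size` without mutating the input.

    Assumption: conflicting values are treated as an error.
    """
    args = copy.deepcopy(engine_args)
    page_size = args.get("sglang", {}).get("page_size")
    if page_size is not None:
        if "block_size" in args and args["block_size"] != page_size:
            raise ValueError("block_size and sglang.page_size disagree")
        args["block_size"] = page_size
    return args


engine_args = {"engine_type": "sglang", "num_gpu_blocks": 512, "sglang": {"page_size": 2}}
print(normalize_block_size(engine_args)["block_size"])
```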

Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Engine settings such
as `block_size`, `engine_type`, `dp_size`, `speedup_ratio`, and `decode_speedup_ratio` belong in
`--extra-engine-args`, not as top-level replay CLI flags. Unspecified fields fall back to the same
defaults used by `MockEngineArgs::default()` and `KvRouterConfig::default()`.
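
When replay is driven from a script, serializing the JSON arguments with `json.dumps` avoids shell-quoting mistakes. The sketch below only assembles an argument list from flags documented on this page; `build_replay_command` is a hypothetical wrapper, not part of `dynamo.replay`.

```python
import json
import sys


def build_replay_command(trace_file, *, num_workers, replay_mode="offline",
                         router_mode="round_robin", engine_args=None,
                         router_config=None, report_json=None):
    """Assemble an argv list for `python -m dynamo.replay`."""
    cmd = [
        sys.executable, "-m", "dynamo.replay", str(trace_file),
        "--replay-mode", replay_mode,
        "--router-mode", router_mode,
        "--num-workers", str(num_workers),
    ]
    if engine_args is not None:
        # Partial objects are fine; unspecified fields use the engine defaults.
        cmd += ["--extra-engine-args", json.dumps(engine_args)]
    if router_config is not None:
        cmd += ["--router-config", json.dumps(router_config)]
    if report_json is not None:
        cmd += ["--report-json", str(report_json)]
    return cmd


cmd = build_replay_command(
    "/path/to/mooncake_trace.jsonl",
    num_workers=4,
    router_mode="kv_router",
    engine_args={"block_size": 512},
    router_config={"router_queue_policy": "fcfs"},
)
print(cmd)
```

Passing the resulting list to `subprocess.run(cmd, check=True)` launches replay without going through a shell.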

### Synthetic Replay

Synthetic replay bypasses trace loading and generates in-memory requests with fixed input/output
lengths and optional synthetic arrival spacing:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --arrival-interval-ms 0.5 \
    --replay-mode offline \
    --replay-concurrency 50 \
    --extra-engine-args '{"block_size":512}'
```

This is useful for parameter sweeps where Mooncake-style prefix structure is not required.

When `--turns-per-session > 1`, `--request-count` is interpreted as the number of sessions rather
than the total number of emitted turns. The total completed request count becomes:

- `request_count * turns_per_session`

Synthetic workload options:

- `--turns-per-session`: number of turns in each synthetic session
- `--shared-prefix-ratio`: fraction of prompt blocks shared inside a prefix group
- `--num-prefix-groups`: number of shared-prefix groups; `0` disables grouping
- `--inter-turn-delay-ms`: constant delay applied after each completed turn before the next turn in
  the same session becomes eligible
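
As a worked example of the session arithmetic, the snippet below computes the expected completed request count for a multi-turn synthetic run. `total_completed_requests` is a hypothetical helper that restates the formula above.

```python
def total_completed_requests(request_count, turns_per_session=1):
    """Sessions times turns per session, as described for synthetic replay."""
    if request_count < 0 or turns_per_session < 1:
        raise ValueError("request_count must be >= 0 and turns_per_session >= 1")
    return request_count * turns_per_session


# --request-count 200 --turns-per-session 3
print(total_completed_requests(200, 3))  # 600
```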

## Modes

### Fixed-Schedule Replay

Default trace replay preserves the timestamps from the trace and simulates arrivals according to
those timestamps:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512}'
```

This is the right mode when you want deterministic replay of the original arrival pattern.

### Closed-Loop Concurrency Replay

Use `--replay-concurrency` to ignore first-turn trace arrival timing and keep a fixed number of
requests in flight:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --replay-concurrency 16
```

This mode is useful when you want to compare scheduler behavior under a fixed offered concurrency rather than the original trace schedule.

For multi-turn sessions, concurrency mode still enforces session order and inter-turn delays:

- first-turn timestamps are ignored
- turn `n+1` is not eligible until turn `n` completes
- `delay` / `delay_ms` / synthetic `--inter-turn-delay-ms` are still applied after completion
- TTFT is measured from actual dispatch under the cap, not from the ignored trace timestamp
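
To build intuition for dispatch under a concurrency cap, here is a toy closed-loop model in Python. It is a deliberately simplified sketch, not the mocker scheduler: it handles single-turn requests only, takes each request's service time as a given input, and uses a hypothetical function name.

```python
import heapq


def closed_loop_dispatch_times(service_times_ms, concurrency):
    """Return each request's dispatch time when at most `concurrency` are in flight."""
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    in_flight = []  # min-heap of completion times for dispatched requests
    dispatch_times = []
    now = 0.0
    for service in service_times_ms:
        if len(in_flight) >= concurrency:
            # The cap is reached: wait until the earliest in-flight request completes.
            now = heapq.heappop(in_flight)
        dispatch_times.append(now)
        heapq.heappush(in_flight, now + service)
    return dispatch_times


print(closed_loop_dispatch_times([10, 1, 1, 1], concurrency=2))
```

In this model the third request is dispatched only when the second one completes, regardless of any timestamp it carried in the trace, which is why latency metrics are anchored to dispatch time rather than trace time.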

### Online Replay

Online replay launches the mock workers and replays the trace against the live runtime path. This
is useful when you want the replay to include live request dispatch, live output handling, and the
same async KV-event propagation model used by the current router integration.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --extra-engine-args '{"block_size":512}'
```

### Arrival Speedup

Use `--arrival-speedup-ratio` to compress or stretch the trace arrival process without changing the
mocker compute model. Larger values make arrivals happen sooner relative to the original trace.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --arrival-speedup-ratio 5 \
    --extra-engine-args '{"block_size":512}'
```
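
One way to picture the effect on arrivals is the sketch below, which divides each gap from the first arrival by the ratio. The exact transformation used by replay is not specified on this page, so treat the formula and the `scale_arrivals` name as assumptions.

```python
def scale_arrivals(timestamps, speedup_ratio):
    """Compress (ratio > 1) or stretch (ratio < 1) arrival gaps relative to the first arrival."""
    if speedup_ratio <= 0:
        raise ValueError("speedup_ratio must be positive")
    if not timestamps:
        return []
    start = timestamps[0]
    return [start + (t - start) / speedup_ratio for t in timestamps]


print(scale_arrivals([0, 1000, 5000], 5))  # [0.0, 200.0, 1000.0]
```

Under this assumption a ratio of `5` makes a 5-second trace arrive within 1 second, while the simulated compute time per request is unchanged.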

### Router Modes

Replay currently supports:

- `round_robin`
- `kv_router`

`kv_router` uses the shared local scheduler and an in-process KV indexer. Router policy tuning is
provided through `--router-config`, not a dedicated top-level replay flag. In offline replay:

- `kv_router` is supported only when `num_workers > 1`
- router queueing is enabled and uses simulation time rather than wall-clock time
- KV visibility is delayed slightly relative to request lifecycle events
- queue admission is driven by router lifecycle edges (`add_request`, `mark_prefill_completed`, and `free`)
- transient in-pass prefill occupancy is still approximated at the router level rather than modeled exactly

To compare queue policies manually, keep the same trace and engine args fixed and swap only
`router_queue_policy` inside `--router-config`:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512}' \
    --router-config '{"router_queue_policy":"fcfs"}'

python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512}' \
    --router-config '{"router_queue_policy":"lcfs"}'
```

`lcfs` is included as a deliberately weak baseline and is expected to perform worse under
saturation; use it for comparison experiments, not as a production default.

## Output

The report contains:

- request counts
- input and output token totals
- virtual duration and wall-clock runtime
- request and token throughput
- prefix cache reuse ratio
- TTFT, TTST, TPOT, ITL, and end-to-end latency summaries
- output-token-throughput-per-user summaries

The dedicated replay CLI returns the same report schema as the Python APIs
`dynamo.replay.run_trace_replay(...)` and `dynamo.replay.run_synthetic_trace_replay(...)`.

If `--report-json` is not provided, `python -m dynamo.replay` writes a timestamped
`dynamo_replay_report_*.json` file in the current working directory.
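
When the default filename is used, a script can pick up the most recent report by globbing for that pattern. The sketch below makes no assumption about the report schema; `load_latest_report` is a hypothetical helper, and it selects by file modification time.

```python
import json
from pathlib import Path


def load_latest_report(directory="."):
    """Return (path, parsed JSON) for the newest dynamo_replay_report_*.json file."""
    candidates = sorted(
        Path(directory).glob("dynamo_replay_report_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no replay report found in {directory!r}")
    latest = candidates[-1]
    with latest.open(encoding="utf-8") as f:
        return latest, json.load(f)
```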

## Replay Constraints

Shared replay constraints:

- aggregated mode only
- `extra_engine_args.engine_type` must be `vllm` or `sglang`
- `extra_engine_args.dp_size` must be `1`

Additional offline constraints:

- offline `kv_router` requires `num_workers > 1`
- single-worker offline replay is still a dedicated fast path for `vllm`, but it now supports both
  flat request replay and workload-driven multi-turn replay
- `sglang` still goes through the shared multi-worker replay runtime even when `num_workers=1`

Additional online constraints:

- the current live replay path is also limited to aggregated workers

If you violate those constraints, replay fails immediately with a validation error.
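
For scripted sweeps it can help to catch these cases before launching replay. The checker below restates only the constraints listed in this section. It is a hypothetical helper, not the replay validator, and it skips fields that are absent from `engine_args` because their defaults are not documented here.

```python
def check_replay_constraints(replay_mode, router_mode, num_workers, engine_args):
    """Return a list of human-readable constraint violations (empty if none)."""
    errors = []
    engine_type = engine_args.get("engine_type")
    if engine_type is not None and engine_type not in ("vllm", "sglang"):
        errors.append(f"engine_type must be 'vllm' or 'sglang', got {engine_type!r}")
    dp_size = engine_args.get("dp_size")
    if dp_size is not None and dp_size != 1:
        errors.append(f"dp_size must be 1, got {dp_size!r}")
    if replay_mode == "offline" and router_mode == "kv_router" and num_workers <= 1:
        errors.append("offline kv_router requires num_workers > 1")
    return errors


print(check_replay_constraints("offline", "kv_router", 1, {"engine_type": "vllm", "dp_size": 2}))
```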

## Practical Notes

- `python -m dynamo.replay` requires exactly one of:
  either a trace file, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-concurrency` works with both trace replay and synthetic replay
- mocker compute-speed knobs such as `speedup_ratio` still affect simulated timing when passed via
  `--extra-engine-args`
- `--arrival-speedup-ratio` affects trace timestamps, not worker compute speed
- `--arrival-interval-ms` only applies to synthetic replay
- `--turns-per-session`, `--shared-prefix-ratio`, `--num-prefix-groups`, and
  `--inter-turn-delay-ms` only apply to synthetic replay
- `--extra-engine-args` and `--router-config` are JSON strings on the standalone replay CLI
- offline replay does not need planner runtime setup, router registration, or external event transport
- the replay block size should match the trace block size, because token synthesis expands `hash_ids`
  using the configured block size

## When To Use This vs AIPerf

Use offline replay when:

- you want a fast scheduler-only simulation
- you want deterministic CI coverage of replay behavior
- you do not need HTTP serving, frontend behavior, or network effects

Use [Dynamo Benchmarking](benchmarking.md) when:

- you want end-to-end benchmarking against a live endpoint
- you need frontend, transport, or cluster-level behavior
- you want AIPerf dashboards and endpoint-facing metrics