---
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Mocker Trace Replay
subtitle: Replay Mooncake-style traces through the mocker in offline or online mode
---

This guide covers trace replay support for Mooncake-style JSONL traces via `python -m dynamo.replay`,
which prints an AIPerf-style summary table, writes the full replay report JSON to disk, and exposes
`offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, closed-loop concurrency, and
synthetic workload inputs directly.

Unlike normal `dynamo.mocker` usage, offline replay does not launch workers, register endpoints, or
require NATS, etcd, or a frontend. Online replay does exercise the live mock-worker runtime path.

Use this when you want to:

- benchmark scheduler behavior from a saved trace
- compare timing and cache behavior across mocker configurations
- validate replay logic in CI without bringing up a distributed stack

## Quick Start

Run offline replay through the dedicated replay CLI:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --num-workers 4 \
    --replay-mode offline \
    --router-mode round_robin \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

Run synthetic replay through the same CLI when you want fixed request shapes without a trace file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 1000 \
    --arrival-interval-ms 1.0 \
    --num-workers 1 \
    --replay-mode offline \
    --replay-concurrency 100 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

Run synthetic workload replay when you want shared-prefix or multi-turn structure without a trace
file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --turns-per-session 3 \
    --shared-prefix-ratio 0.5 \
    --num-prefix-groups 8 \
    --inter-turn-delay-ms 250 \
    --replay-mode offline \
    --replay-concurrency 32 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

`python -m dynamo.replay` prints an AIPerf-style summary table to stdout and writes the full replay
report JSON to disk.

## Input Format

The trace file must be Mooncake-style JSONL. Each line should contain:

- `timestamp` or `created_time`
- `input_length` or `input_tokens`
- `output_length` or `output_tokens`
- `hash_ids`

Example:

```json
{"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3]}
```
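
The accepted field aliases can be checked before a replay run. The following is a minimal sketch, not the loader that `dynamo.replay` uses: `FIELD_ALIASES` and `normalize_record` are hypothetical names introduced here, and the sketch only covers flat (single-turn) records.

```python
import json

# Accepted aliases per field, taken from the list above.
FIELD_ALIASES = {
    "timestamp": ("timestamp", "created_time"),
    "input_length": ("input_length", "input_tokens"),
    "output_length": ("output_length", "output_tokens"),
    "hash_ids": ("hash_ids",),
}


def normalize_record(line: str) -> dict:
    """Parse one JSONL line and map aliased keys onto canonical names."""
    raw = json.loads(line)
    record = {}
    for canonical, aliases in FIELD_ALIASES.items():
        present = [name for name in aliases if name in raw]
        if not present:
            raise ValueError(f"missing field: expected one of {aliases}")
        record[canonical] = raw[present[0]]
    return record


line = '{"created_time": 0, "input_tokens": 2048, "output_tokens": 128, "hash_ids": [0, 1, 2, 3]}'
print(normalize_record(line))
```

Running a check like this over every line of a trace surfaces missing fields early, before the replay itself starts.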

Replay also supports multi-turn sessions. Use the same `session_id` on all turns in a session. The
first turn uses `timestamp` or `created_time`; later turns may use either:

- `delay` or `delay_ms` directly
- or an absolute later `timestamp`, in which case replay infers the inter-turn delay from the
  previous turn timestamp

Example:

```json
{"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]}
{"session_id":"session-a","delay":250,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]}
{"session_id":"session-b","timestamp":1010,"input_length":1024,"output_length":64,"hash_ids":[9,10]}
{"session_id":"session-b","delay_ms":50,"input_length":1536,"output_length":64,"hash_ids":[9,10,11]}
```
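
The two ways of expressing a later turn can be illustrated with a short sketch. It assumes that an absolute `timestamp` on a later turn implies a delay equal to the difference from the previous turn's timestamp, and that a session does not mix the two forms. `inter_turn_delays` is a hypothetical helper, not part of `dynamo.replay`.

```python
import json


def inter_turn_delays(lines):
    """Return (session_id, turn_index, delay) for every turn after the first."""
    last_timestamp = {}  # session_id -> timestamp of the most recent turn that had one
    turn_index = {}      # session_id -> number of turns seen so far
    results = []
    for line in lines:
        record = json.loads(line)
        session = record["session_id"]
        index = turn_index.get(session, 0)
        turn_index[session] = index + 1
        timestamp = record.get("timestamp", record.get("created_time"))
        if index > 0:
            if "delay" in record or "delay_ms" in record:
                # Explicit delay wins when present.
                delay = record.get("delay", record.get("delay_ms"))
            else:
                # Assumption: infer the delay from the previous turn's timestamp.
                delay = timestamp - last_timestamp[session]
            results.append((session, index, delay))
        if timestamp is not None:
            last_timestamp[session] = timestamp
    return results


trace = [
    '{"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]}',
    '{"session_id":"session-a","delay":250,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]}',
    '{"session_id":"session-b","timestamp":1010,"input_length":1024,"output_length":64,"hash_ids":[9,10]}',
    '{"session_id":"session-b","timestamp":1060,"input_length":1536,"output_length":64,"hash_ids":[9,10,11]}',
]
print(inter_turn_delays(trace))
```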

The mocker synthesizes token blocks from `hash_ids` using the configured mocker `block_size`, so the
replay block size must match the block size used when the trace was generated. Public Mooncake
traces are commonly block-level hashes at `512` tokens per hash ID, so replaying them with the
default mocker `block_size=64` will fail once `input_length > len(hash_ids) * 64`. Set the block size
through `--extra-engine-args '{"block_size":512}'`. For `engine_type=sglang`, replay still uses
canonical `block_size` internally; `sglang.page_size` is accepted as a compatibility alias and is
normalized into `block_size` before replay starts.
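
The failure condition above is easy to check ahead of time. This is a minimal pre-flight sketch under the stated rule `input_length > len(hash_ids) * block_size`; `find_block_size_violations` is a hypothetical helper and is not shipped with the replay CLI.

```python
import json


def find_block_size_violations(lines, block_size):
    """Return (line_number, input_length, capacity) for records that cannot fit."""
    violations = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines in the trace
        record = json.loads(line)
        input_length = record.get("input_length", record.get("input_tokens"))
        capacity = len(record["hash_ids"]) * block_size
        if input_length > capacity:
            violations.append((line_number, input_length, capacity))
    return violations


trace = [
    '{"timestamp": 0, "input_length": 2048, "output_length": 128, "hash_ids": [1, 2, 3, 4]}',
    '{"timestamp": 10, "input_length": 1024, "output_length": 64, "hash_ids": [9, 10]}',
]
print(find_block_size_violations(trace, block_size=512))  # no violations
print(find_block_size_violations(trace, block_size=64))   # both records exceed capacity
```

An empty result for the block size you plan to pass in `--extra-engine-args` suggests the trace and the replay configuration agree.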

## Replay Surfaces

### `python -m dynamo.replay`

The dedicated replay CLI exposes:

- either a positional `trace_file`, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-mode offline|online`
- `--router-mode round_robin|kv_router`
- `--num-workers`
- `--num-prefill-workers`
- `--num-decode-workers`
- `--replay-concurrency`
- `--arrival-interval-ms`
- `--arrival-speedup-ratio`
- `--turns-per-session`
- `--shared-prefix-ratio`
- `--num-prefix-groups`
- `--inter-turn-delay-ms`
- `--extra-engine-args` (JSON string)
- `--prefill-engine-args` (JSON string)
- `--decode-engine-args` (JSON string)
- `--router-config` (JSON string)
- `--report-json`

Defaults:

- `--replay-mode offline`
- `--router-mode round_robin`

Example:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --extra-engine-args '{"block_size":512}' \
    --router-config '{"router_queue_policy":"fcfs","router_temperature":0.0}' \
    --report-json /tmp/replay-report.json
```

SGLang replay uses the same CLI surface. A minimal extra-engine-args file can use either
`block_size` directly or the compatibility alias `sglang.page_size`:

```json
{
  "engine_type": "sglang",
  "num_gpu_blocks": 512,
  "sglang": {
    "page_size": 2
  }
}
```

Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Engine settings such
as `block_size`, `engine_type`, `dp_size`, `speedup_ratio`, and `decode_speedup_ratio` belong in
`--extra-engine-args`, not as top-level replay CLI flags. Unspecified fields fall back to the same
defaults used by `MockEngineArgs::default()` and `KvRouterConfig::default()`.
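
The partial-object behavior can be pictured as an overlay on a defaults mapping. The sketch below assumes a shallow merge and uses illustrative defaults: only `block_size=64` is stated in this guide, and the other value is a placeholder rather than the real `MockEngineArgs::default()`. Nested keys such as `sglang.page_size` are not handled here.

```python
import json

# Hypothetical defaults for illustration only. The real values come from
# MockEngineArgs::default(); only block_size=64 is stated in this guide.
ILLUSTRATIVE_DEFAULTS = {"block_size": 64, "speedup_ratio": 1.0}


def resolve_engine_args(partial_json, defaults=ILLUSTRATIVE_DEFAULTS):
    """Overlay a partial JSON object on top of a defaults mapping (shallow merge)."""
    overrides = json.loads(partial_json)
    if not isinstance(overrides, dict):
        raise ValueError("engine args must be a JSON object")
    return {**defaults, **overrides}


print(resolve_engine_args('{"block_size":512}'))
```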

Offline disagg replay uses staged engine args instead of `--extra-engine-args`:

- `--prefill-engine-args` for the prefill worker config
- `--decode-engine-args` for the decode worker config
- `--num-prefill-workers` and `--num-decode-workers` for pool sizes

For offline disagg replay, the staged JSON must set `worker_type` explicitly:

- `--prefill-engine-args` must use `worker_type: "prefill"`
- `--decode-engine-args` must use `worker_type: "decode"`

The staged configs must also use the same `block_size`.
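
The staged-config rules above can be checked locally before launching a run. This is a sketch of those three rules only; `check_disagg_args` is a hypothetical helper, and the replay CLI performs its own validation independently.

```python
import json


def check_disagg_args(prefill_json, decode_json):
    """Return a list of problems with the staged engine-args JSON strings."""
    prefill = json.loads(prefill_json)
    decode = json.loads(decode_json)
    problems = []
    if prefill.get("worker_type") != "prefill":
        problems.append('--prefill-engine-args must set worker_type to "prefill"')
    if decode.get("worker_type") != "decode":
        problems.append('--decode-engine-args must set worker_type to "decode"')
    if prefill.get("block_size") != decode.get("block_size"):
        problems.append("prefill and decode block_size must match")
    return problems


prefill_args = '{"worker_type": "prefill", "block_size": 512}'
decode_args = '{"worker_type": "decode", "block_size": 512}'
print(check_disagg_args(prefill_args, decode_args))  # empty list when the rules hold
```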

### Synthetic Replay

Synthetic replay bypasses trace loading and generates in-memory requests with fixed input/output
lengths and optional synthetic arrival spacing:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --arrival-interval-ms 0.5 \
    --replay-mode offline \
    --replay-concurrency 50 \
    --extra-engine-args '{"block_size":512}'
```

This is useful for parameter sweeps where Mooncake-style prefix structure is not required.

When `--turns-per-session > 1`, `--request-count` is interpreted as the number of sessions rather
than the total number of emitted turns. The total completed request count becomes:

- `request_count * turns_per_session`

Synthetic workload options:

- `--turns-per-session`: number of turns in each synthetic session
- `--shared-prefix-ratio`: fraction of prompt blocks shared inside a prefix group
- `--num-prefix-groups`: number of shared-prefix groups; `0` disables grouping
- `--inter-turn-delay-ms`: constant delay applied after each completed turn before the next turn in
  the same session becomes eligible
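
As a worked example of the session arithmetic, the multi-turn command in the Quick Start uses `--request-count 200` with `--turns-per-session 3`. The helper name below is hypothetical and only restates the formula given above.

```python
def total_completed_requests(request_count: int, turns_per_session: int = 1) -> int:
    """With multi-turn sessions, request_count counts sessions, not turns."""
    if request_count < 0 or turns_per_session < 1:
        raise ValueError("request_count must be >= 0 and turns_per_session >= 1")
    return request_count * turns_per_session


# 200 sessions with 3 turns each complete 600 requests in total.
print(total_completed_requests(200, turns_per_session=3))
```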

## Modes

### Fixed-Schedule Replay

Default trace replay preserves the timestamps from the trace and simulates arrivals according to
those timestamps:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512}'
```

This is the right mode when you want deterministic replay of the original arrival pattern.

### Closed-Loop Concurrency Replay

Use `--replay-concurrency` to ignore first-turn trace arrival timing and keep a fixed number of
requests in flight:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --replay-concurrency 16
```

This mode is useful when you want to compare scheduler behavior under a fixed offered concurrency rather than the original trace schedule.

For multi-turn sessions, concurrency mode still enforces session order and inter-turn delays:

- first-turn timestamps are ignored
- turn `n+1` is not eligible until turn `n` completes
- `delay` / `delay_ms` / synthetic `--inter-turn-delay-ms` are still applied after completion
- TTFT is measured from actual dispatch under the cap, not from the ignored trace timestamp
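
The closed-loop idea can be illustrated with a toy model. The sketch below assumes flat requests with fixed, known service times, which is a simplification: the mocker derives timing from its own compute model, and this is not the replay scheduler. `closed_loop_dispatch_times` is a hypothetical name.

```python
import heapq


def closed_loop_dispatch_times(service_times, concurrency):
    """Return the dispatch time of each request under a fixed in-flight cap."""
    in_flight = []  # min-heap of completion times for dispatched requests
    dispatch_times = []
    now = 0.0
    for service_time in service_times:
        if len(in_flight) >= concurrency:
            # The cap is reached: wait for the earliest in-flight request to finish.
            now = max(now, heapq.heappop(in_flight))
        dispatch_times.append(now)
        heapq.heappush(in_flight, now + service_time)
    return dispatch_times


# Trace timestamps play no role here; only the cap and completions drive dispatch.
print(closed_loop_dispatch_times([4.0, 2.0, 3.0, 1.0], concurrency=2))
```

In this model the first two requests start immediately, and each later request starts when a slot frees up, which is why latency is measured from dispatch rather than from the trace timestamp.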

### Online Replay

Online replay launches the mock workers and replays the trace against the live runtime path. This
is useful when you want the replay to include live request dispatch, live output handling, and the
same async KV-event propagation model used by the current router integration.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --extra-engine-args '{"block_size":512}'
```

### Arrival Speedup

Use `--arrival-speedup-ratio` to compress or stretch the trace arrival process without changing the
mocker compute model. Larger values make arrivals happen sooner relative to the original trace.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --arrival-speedup-ratio 5 \
    --extra-engine-args '{"block_size":512}'
```
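
One way to picture the effect on arrivals is to divide each trace timestamp by the ratio. That exact formula is an assumption made for illustration; this guide only states that larger values make arrivals happen sooner. `scale_arrivals` is a hypothetical helper.

```python
def scale_arrivals(timestamps, arrival_speedup_ratio):
    """Compress (ratio > 1) or stretch (ratio < 1) arrival times; compute speed is untouched."""
    if arrival_speedup_ratio <= 0:
        raise ValueError("arrival_speedup_ratio must be positive")
    # Assumption: arrivals scale linearly with the inverse of the ratio.
    return [timestamp / arrival_speedup_ratio for timestamp in timestamps]


# With a ratio of 5, a request that arrived at 1000 in the trace arrives at 200.
print(scale_arrivals([0, 1000, 2500], arrival_speedup_ratio=5))
```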

### Router Modes

Replay currently supports:

- `round_robin`
- `kv_router`

`kv_router` uses the shared local scheduler and an in-process KV indexer. Router policy tuning is
provided through `--router-config`, not a dedicated top-level replay flag. In offline replay:

- `kv_router` is supported only when `num_workers > 1`
- router queueing is enabled and uses simulation time rather than wall-clock time
- KV visibility is delayed slightly relative to request lifecycle events
- queue admission is driven by router lifecycle edges (`add_request`, `mark_prefill_completed`, and `free`)
- transient in-pass prefill occupancy is still approximated at the router level rather than modeled exactly

To compare queue policies manually, keep the same trace and engine args fixed and swap only
`router_queue_policy` inside `--router-config`:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512}' \
    --router-config '{"router_queue_policy":"fcfs"}'

python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --extra-engine-args '{"block_size":512}' \
    --router-config '{"router_queue_policy":"lcfs"}'
```

`lcfs` is intentionally a worse comparison policy under saturation; use it for experiments, not as
an expected production default.
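
The manual comparison can also be scripted. This sketch only builds the command lines, using flags documented in this guide; `policy_sweep_commands` and the per-policy report paths are illustrative choices, not part of the CLI.

```python
import json
import shlex
import sys


def policy_sweep_commands(trace_path, policies=("fcfs", "lcfs"), num_workers=4, block_size=512):
    """Build one replay command per queue policy, holding everything else fixed."""
    commands = []
    for policy in policies:
        commands.append([
            sys.executable, "-m", "dynamo.replay", trace_path,
            "--replay-mode", "offline",
            "--router-mode", "kv_router",
            "--num-workers", str(num_workers),
            "--extra-engine-args", json.dumps({"block_size": block_size}),
            "--router-config", json.dumps({"router_queue_policy": policy}),
            # Hypothetical output location: one report file per policy.
            "--report-json", f"/tmp/replay-report-{policy}.json",
        ])
    return commands


for argv in policy_sweep_commands("/path/to/mooncake_trace.jsonl"):
    print(shlex.join(argv))
    # To execute, pass argv to subprocess.run(argv, check=True) where dynamo is installed.
```

Writing each run to its own report file keeps the two results side by side for comparison.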

## Output

The report contains:

- request counts
- input and output token totals
- virtual duration and wall-clock runtime
- request and token throughput
- prefix cache reuse ratio
- TTFT, TTST, TPOT, ITL, and end-to-end latency summaries
- output-token-throughput-per-user summaries

The dedicated replay CLI returns the same report schema as the Python APIs
`dynamo.replay.run_trace_replay(...)` and `dynamo.replay.run_synthetic_trace_replay(...)`.

If `--report-json` is not provided, `python -m dynamo.replay` writes a timestamped
`dynamo_replay_report_*.json` file in the current working directory.
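
When the default file name is used, the newest report can be located by its name pattern. The sketch below relies only on the `dynamo_replay_report_*.json` pattern mentioned above and makes no assumption about the report schema; `latest_default_report` is a hypothetical helper.

```python
import json
from pathlib import Path


def latest_default_report(directory="."):
    """Return the parsed newest dynamo_replay_report_*.json, or None if absent."""
    candidates = sorted(
        Path(directory).glob("dynamo_replay_report_*.json"),
        key=lambda path: path.stat().st_mtime,  # newest file sorts last
    )
    if not candidates:
        return None
    with candidates[-1].open() as handle:
        return json.load(handle)


report = latest_default_report()
if isinstance(report, dict):
    # Print top-level keys only; the field names are not assumed here.
    print(sorted(report))
```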

## Replay Constraints

Shared replay constraints:

- `extra_engine_args.engine_type` must be `vllm` or `sglang`
- aggregated replay requires the existing aggregated args path
- disagg replay requires both `prefill_engine_args` and `decode_engine_args`
- disagg replay requires `router_mode=kv_router`
- replay `dp_size` must be `1`
- disagg replay requires matching `block_size` in `prefill_engine_args` and `decode_engine_args`

Additional offline constraints:

- offline `kv_router` requires `num_workers > 1`
- single-worker offline replay is still a dedicated fast path for `vllm`, but it now supports both
  flat request replay and workload-driven multi-turn replay
- `sglang` still goes through the shared multi-worker replay runtime even when `num_workers=1`
- offline disagg replay is a separate two-stage runtime with prefill and decode worker pools

Additional online constraints:

- the current live replay path is limited to aggregated workers

If you violate those constraints, replay fails immediately with a validation error.

## Practical Notes

- `python -m dynamo.replay` requires exactly one of:
  either a trace file, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-concurrency` works with both trace replay and synthetic replay
- mocker compute-speed knobs such as `speedup_ratio` still affect simulated timing when passed via
  the engine-args JSON for the chosen replay mode
- `--arrival-speedup-ratio` affects trace timestamps, not worker compute speed
- `--arrival-interval-ms` only applies to synthetic replay
- `--turns-per-session`, `--shared-prefix-ratio`, `--num-prefix-groups`, and
  `--inter-turn-delay-ms` only apply to synthetic replay
- `--extra-engine-args`, `--prefill-engine-args`, `--decode-engine-args`, and `--router-config`
  are JSON strings on the standalone replay CLI
- offline replay does not need planner runtime setup, router registration, or external event transport
- the replay block size should match the trace block size, because token synthesis expands `hash_ids`
  using the configured block size

## When To Use This vs AIPerf

Use offline replay when:

- you want a fast scheduler-only simulation
- you want deterministic CI coverage of replay behavior
- you do not need HTTP serving, frontend behavior, or network effects

Use [Dynamo Benchmarking](benchmarking.md) when:

- you want end-to-end benchmarking against a live endpoint
- you need frontend, transport, or cluster-level behavior
- you want AIPerf dashboards and endpoint-facing metrics