---
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Mocker Trace Replay
subtitle: Replay Mooncake-style traces through the mocker in offline or online mode
---

This guide covers trace replay support for Mooncake-style JSONL traces via `python -m dynamo.replay`,
which prints an AIPerf-style summary table, writes the full replay report JSON to disk, and exposes
`offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, closed-loop concurrency, and
synthetic workload inputs directly.

Unlike normal `dynamo.mocker` usage, offline replay does not launch workers, register endpoints, or
require NATS, etcd, or a frontend. Online replay does exercise the live mock-worker runtime path.

Use this when you want to:

- benchmark scheduler behavior from a saved trace
- compare timing and cache behavior across mocker configurations
- validate replay logic in CI without bringing up a distributed stack

## Harness Overview

The replay harness wires a load driver (trace file or synthetic workload generator) into one or more mocker engine simulations and tees request/token timing into a trace collector.

```mermaid
flowchart LR
    LD[Load Driver] --> H[Replay Harness]

    H --> SES[Single Engine Simulation]
    H --> MES[Multi Engine Simulation]

    SES --> H
    MES --> H

    H --> TC[Trace Collector]
```

The load driver is either a Mooncake-style JSONL trace (timestamps, ISL/OSL, `hash_ids`) or a synthetic generator parameterized by `isl`/`osl`/`concurrency`. Single-engine simulation (`SES`) is the fast path for `num_workers == 1` with the vLLM engine; multi-engine simulation (`MES`) covers aggregated multi-worker replay, disaggregated prefill/decode replay, and KV-router replay. The trace collector produces the AIPerf-style summary table, the JSON report, and the per-request timing fields consumed by downstream analysis.

Each simulation composes a different set of components. SES drives the engine core directly (scheduler + forward-pass modeling). MES composes multiple engine cores with KV transfer/offloading, KV routing, and planner simulation layered on top:

```mermaid
flowchart TD
    subgraph SEC[Single Engine Core]
        subgraph SCH[Scheduler Modeling]
            F[Fwd Pass Modeling]
        end
    end

    KV[KV Transfer + Offloading Simulation]
    KR[KV Router Simulation]
    P[Planner Simulation]

    SES[Single Engine Simulation]
    MES[Multi Engine Simulation]

    SES --> SEC

    MES --> SEC
    MES --> KV
    MES --> KR
    MES --> P
```

See [`lib/mocker/src/replay/offline/README.md`](../../lib/mocker/src/replay/offline/README.md) for offline-harness internals (logical clock, event queue, worker model) and [`docs/mocker/mocker.md`](../mocker/mocker.md) for engine-core details (scheduler, KV block manager).

## Quick Start

Run offline replay through the dedicated replay CLI:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --num-workers 4 \
    --replay-mode offline \
    --router-mode round_robin \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}' \
    --report-json /tmp/replay-report.json
```

Run synthetic replay through the same CLI when you want fixed request shapes without a trace file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 1000 \
    --arrival-interval-ms 1.0 \
    --num-workers 1 \
    --replay-mode offline \
    --replay-concurrency 100 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

Run synthetic workload replay when you want shared-prefix or multi-turn structure without a trace
file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --turns-per-session 3 \
    --shared-prefix-ratio 0.5 \
    --num-prefix-groups 8 \
    --inter-turn-delay-ms 250 \
    --replay-mode offline \
    --replay-concurrency 32 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

`python -m dynamo.replay` prints an AIPerf-style summary table to stdout and writes the full replay
report JSON to disk.

## Input Format

The trace file must be Mooncake-style JSONL. Each line should contain:

- `timestamp` or `created_time`
- `input_length` or `input_tokens`
- `output_length` or `output_tokens`
- `hash_ids`

Example:

```json
{"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3]}
```
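
The field aliases listed above can be checked before a trace is handed to replay. The following is a minimal sketch, not part of `dynamo.replay`: `ALIASES` and `normalize_trace_line` are hypothetical names, and the check only applies to single-turn or first-turn lines, since later turns in a session may omit the timestamp.

```python
import json

# Accepted key aliases, copied from the field list above.
ALIASES = {
    "timestamp": ("timestamp", "created_time"),
    "input_length": ("input_length", "input_tokens"),
    "output_length": ("output_length", "output_tokens"),
}


def normalize_trace_line(line: str) -> dict:
    """Parse one JSONL line and map aliased keys onto one canonical name.

    Hypothetical helper for pre-checking a trace; replay does its own parsing.
    """
    record = json.loads(line)
    out = {}
    for canonical, names in ALIASES.items():
        present = [name for name in names if name in record]
        if not present:
            raise ValueError(f"missing one of {names}")
        out[canonical] = record[present[0]]
    if not isinstance(record.get("hash_ids"), list):
        raise ValueError("hash_ids must be a list")
    out["hash_ids"] = record["hash_ids"]
    return out


row = normalize_trace_line(
    '{"created_time": 0, "input_tokens": 6755, "output_tokens": 500, "hash_ids": [0, 1, 2, 3]}'
)
print(row)
```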

Replay also supports multi-turn sessions. Use the same `session_id` on all turns in a session. The
first turn uses `timestamp` or `created_time`; later turns may use either:

- `delay` or `delay_ms` directly
- or an absolute later `timestamp`, in which case replay infers the inter-turn delay from the
  previous turn timestamp

Example:

```json
{"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]}
{"session_id":"session-a","delay":250,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]}
{"session_id":"session-b","timestamp":1010,"input_length":1024,"output_length":64,"hash_ids":[9,10]}
{"session_id":"session-b","delay_ms":50,"input_length":1536,"output_length":64,"hash_ids":[9,10,11]}
```
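
The two ways of expressing a later turn can be reduced to one per-session delay list. This sketch only illustrates the rule described above; `resolve_turn_delays` is a hypothetical helper, and it assumes that a later turn given as an absolute `timestamp` directly follows a turn that also carried a timestamp.

```python
import json


def resolve_turn_delays(lines):
    """Return {session_id: [first_turn_timestamp, delay_2, delay_3, ...]}."""
    plan = {}
    prev_ts = {}
    for line in lines:
        rec = json.loads(line)
        sid = rec["session_id"]
        ts = rec.get("timestamp", rec.get("created_time"))
        if sid not in plan:
            # First turn of the session: keep its arrival timestamp.
            plan[sid] = [ts]
        elif "delay" in rec or "delay_ms" in rec:
            # Explicit inter-turn delay.
            plan[sid].append(rec.get("delay", rec.get("delay_ms")))
        elif ts is not None and prev_ts[sid] is not None:
            # Absolute timestamp: infer the delay from the previous turn.
            plan[sid].append(ts - prev_ts[sid])
        else:
            raise ValueError(f"cannot infer delay for a turn in {sid}")
        prev_ts[sid] = ts
    return plan


trace = [
    '{"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]}',
    '{"session_id":"session-a","delay":250,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]}',
    '{"session_id":"session-b","timestamp":1010,"input_length":1024,"output_length":64,"hash_ids":[9,10]}',
    '{"session_id":"session-b","timestamp":1060,"input_length":1536,"output_length":64,"hash_ids":[9,10,11]}',
]
plan = resolve_turn_delays(trace)
print(plan)
```

Here `session-b` uses an absolute second timestamp, so its delay is inferred as the difference from the first turn.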

Replay uses two different block-size concepts for trace files:

- `--trace-block-size`: how many tokens each `hash_id` in the dataset represents
- engine `block_size`: the block size used by the replay engine and router when they re-chunk the
  synthesized tokens into sequence hashes

Public Mooncake/toolagent traces use `512` tokens per `hash_id`, so replaying them should normally
use `--trace-block-size 512`. The engine `block_size` can still be smaller; for example, the live
vLLM benchmark setup uses `block_size=64`. For `engine_type=sglang`, replay still uses the canonical
`block_size` internally; `sglang.page_size` is accepted as a compatibility alias and is normalized
into `block_size` before replay starts.
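
The relationship between the two block sizes is simple arithmetic. As a rough sketch (the helper name `block_counts` is hypothetical, and rounding partial trailing blocks up is an assumption about how the engine counts them):

```python
import math


def block_counts(input_length: int, trace_block_size: int, engine_block_size: int) -> dict:
    """Count trace-level hash_ids and engine-level blocks for one prompt.

    Assumption: a partially filled trailing block still counts as one block.
    """
    return {
        "hash_ids": math.ceil(input_length / trace_block_size),
        "engine_blocks": math.ceil(input_length / engine_block_size),
    }


counts = block_counts(2048, trace_block_size=512, engine_block_size=64)
print(counts)  # {'hash_ids': 4, 'engine_blocks': 32}
```

With these values each `hash_id` spans `512 / 64 = 8` engine blocks, which matches the four `hash_ids` on the 2048-token turn in the multi-turn example.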

## Replay Surfaces

### `python -m dynamo.replay`

The dedicated replay CLI exposes:

- either a positional `trace_file`, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-mode offline|online`
- `--router-mode round_robin|kv_router`
- `--num-workers`
- `--num-prefill-workers`
- `--num-decode-workers`
- `--replay-concurrency`
- `--arrival-interval-ms`
- `--arrival-speedup-ratio`
- `--trace-block-size`
- `--turns-per-session`
- `--shared-prefix-ratio`
- `--num-prefix-groups`
- `--inter-turn-delay-ms`
- `--extra-engine-args` (JSON string)
- `--prefill-engine-args` (JSON string)
- `--decode-engine-args` (JSON string)
- `--router-config` (JSON string)
- `--aic-backend`
- `--aic-system`
- `--aic-backend-version`
- `--aic-tp-size`
- `--aic-model-path`
- `--report-json`

Defaults:

- `--replay-mode offline`
- `--router-mode round_robin`

Example:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}' \
    --router-config '{"router_queue_policy":"fcfs","router_temperature":0.0}' \
    --report-json /tmp/replay-report.json
```

SGLang replay uses the same CLI surface. A minimal `--extra-engine-args` JSON object can use either
`block_size` directly or the compatibility alias `sglang.page_size`:

```json
{
  "engine_type": "sglang",
  "num_gpu_blocks": 512,
  "sglang": {
    "page_size": 2
  }
}
```
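
The alias handling can be pictured as a small normalization step. This is a sketch only: `normalize_block_size` is a hypothetical helper, the real normalization happens inside replay, and rejecting a conflicting `block_size` is an assumption rather than documented behavior.

```python
def normalize_block_size(engine_args: dict) -> dict:
    """Copy `sglang.page_size` into the canonical `block_size` field.

    Assumption: if both are set and disagree, treat it as an error.
    """
    args = dict(engine_args)
    page_size = args.get("sglang", {}).get("page_size")
    if page_size is not None:
        if "block_size" in args and args["block_size"] != page_size:
            raise ValueError("block_size and sglang.page_size disagree")
        args["block_size"] = page_size
    return args


normalized = normalize_block_size(
    {"engine_type": "sglang", "num_gpu_blocks": 512, "sglang": {"page_size": 2}}
)
print(normalized["block_size"])  # 2
```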

Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Engine settings such
as `block_size`, `engine_type`, `dp_size`, `speedup_ratio`, and `decode_speedup_ratio` belong in
`--extra-engine-args`, not as top-level replay CLI flags. `--trace-block-size` is separate and is
used only for trace-file replay. Unspecified fields fall back to the same defaults used by
`MockEngineArgs::default()` and `KvRouterConfig::default()`.

Replay has two independent AIC surfaces:

- engine timing AIC via `--extra-engine-args` / staged engine JSON
- router-side prompt-load AIC via top-level `--aic-*` flags together with
  `router_prefill_load_model: "aic"` in `--router-config`

Offline disagg replay uses staged engine args instead of `--extra-engine-args`:

- `--prefill-engine-args` for the prefill worker config
- `--decode-engine-args` for the decode worker config
- `--num-prefill-workers` and `--num-decode-workers` for pool sizes

For offline disagg replay, the staged JSON must set `worker_type` explicitly:

- `--prefill-engine-args` must use `worker_type: "prefill"`
- `--decode-engine-args` must use `worker_type: "decode"`

The staged configs must also use the same engine `block_size`. `--trace-block-size` remains a
separate trace-file input knob.
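
One way to keep the staged configs consistent is to generate both JSON strings from the same values. The sketch below only assembles an argument list from flags documented on this page; `build_disagg_argv` is a hypothetical helper, and the worker counts are placeholders.

```python
import json
import shlex


def build_disagg_argv(trace_file, block_size, num_prefill, num_decode, trace_block_size=512):
    """Assemble an offline disagg replay command with matching staged configs."""
    # Both stages share one block_size and set worker_type explicitly.
    prefill = {"worker_type": "prefill", "block_size": block_size}
    decode = {"worker_type": "decode", "block_size": block_size}
    return [
        "python", "-m", "dynamo.replay", trace_file,
        "--replay-mode", "offline",
        "--router-mode", "kv_router",
        "--num-prefill-workers", str(num_prefill),
        "--num-decode-workers", str(num_decode),
        "--trace-block-size", str(trace_block_size),
        "--prefill-engine-args", json.dumps(prefill),
        "--decode-engine-args", json.dumps(decode),
    ]


argv = build_disagg_argv("/path/to/mooncake_trace.jsonl", block_size=64, num_prefill=2, num_decode=2)
print(shlex.join(argv))
```

The list can be passed to `subprocess.run(argv, check=True)` as-is; `shlex.join` is used here only to print a copy-pasteable shell line.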

### Synthetic Replay

Synthetic replay bypasses trace loading and generates in-memory requests with fixed input/output
lengths and optional synthetic arrival spacing:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --arrival-interval-ms 0.5 \
    --replay-mode offline \
    --replay-concurrency 50 \
    --extra-engine-args '{"block_size":512}'
```

This is useful for parameter sweeps where Mooncake-style prefix structure is not required.

When `--turns-per-session > 1`, `--request-count` is interpreted as the number of sessions rather
than the total number of emitted turns. The total completed request count becomes:

- `request_count * turns_per_session`
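
As a quick worked example of that formula (the helper name is hypothetical), the multi-turn Quick Start command with `--request-count 200` and `--turns-per-session 3` completes 600 requests:

```python
def total_completed_requests(request_count: int, turns_per_session: int = 1) -> int:
    """Sessions times turns; with one turn per session this is just request_count."""
    return request_count * turns_per_session


total = total_completed_requests(request_count=200, turns_per_session=3)
print(total)  # 600
```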

Synthetic workload options:

- `--turns-per-session`: number of turns in each synthetic session
- `--shared-prefix-ratio`: fraction of prompt blocks shared inside a prefix group
- `--num-prefix-groups`: number of shared-prefix groups; `0` disables grouping
- `--inter-turn-delay-ms`: constant delay applied after each completed turn before the next turn in
  the same session becomes eligible

## Modes

### Fixed-Schedule Replay

Default trace replay preserves the timestamps from the trace and simulates arrivals according to
those timestamps:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}'
```

This is the right mode when you want deterministic replay of the original arrival pattern.

### Closed-Loop Concurrency Replay

Use `--replay-concurrency` to ignore first-turn trace arrival timing and keep a fixed number of
requests in flight:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --replay-concurrency 16
```

This mode is useful when you want to compare scheduler behavior under a fixed offered concurrency rather than the original trace schedule.

For multi-turn sessions, concurrency mode still enforces session order and inter-turn delays:

- first-turn timestamps are ignored
- turn `n+1` is not eligible until turn `n` completes
- `delay` / `delay_ms` / synthetic `--inter-turn-delay-ms` are still applied after completion
- TTFT is measured from actual dispatch under the cap, not from the ignored trace timestamp
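
The rules above can be illustrated with a toy discrete-event loop. This is a sketch under strong assumptions and not the replay harness itself: every turn takes the same fixed `service_ms` (a stand-in for the mocker engine), and `simulate_closed_loop` is a hypothetical name.

```python
import heapq
from collections import deque


def simulate_closed_loop(sessions, concurrency, service_ms):
    """Return [(dispatch_time, session_id, turn_index), ...].

    sessions maps session_id -> list of inter-turn delays, so a session
    with k delays has k + 1 turns. First-turn timestamps are not modeled
    because concurrency mode ignores them.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    ready = deque((sid, 0) for sid in sessions)  # eligible, waiting for a slot
    waiting = []  # (eligible_at, seq, sid, turn): serving an inter-turn delay
    running = []  # (finish_at, seq, sid, turn): currently in flight
    dispatch_log = []
    now, seq = 0.0, 0
    while ready or waiting or running:
        # Promote turns whose inter-turn delay has elapsed.
        while waiting and waiting[0][0] <= now:
            _, _, sid, turn = heapq.heappop(waiting)
            ready.append((sid, turn))
        # Fill free slots up to the concurrency cap.
        while ready and len(running) < concurrency:
            sid, turn = ready.popleft()
            dispatch_log.append((now, sid, turn))
            heapq.heappush(running, (now + service_ms, seq, sid, turn))
            seq += 1
        # Jump to the next completion or delay expiry.
        now = min(heap[0][0] for heap in (running, waiting) if heap)
        # Complete finished turns; the next turn becomes eligible after its delay.
        while running and running[0][0] <= now:
            _, _, sid, turn = heapq.heappop(running)
            if turn < len(sessions[sid]):
                heapq.heappush(waiting, (now + sessions[sid][turn], seq, sid, turn + 1))
                seq += 1
    return dispatch_log


log = simulate_closed_loop({"a": [250.0], "b": []}, concurrency=1, service_ms=100.0)
print(log)
```

With a cap of one, session `b` is dispatched only when `a`'s first turn completes at 100, and `a`'s second turn waits for completion plus its 250 delay, so it is dispatched at 350.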

### Online Replay

Online replay launches the mock workers and replays the trace against the live runtime path. This
is useful when you want the replay to include live request dispatch, live output handling, and the
same async KV-event propagation model used by the current router integration.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode online \
    --router-mode kv_router \
    --num-workers 4 \
    --arrival-speedup-ratio 10 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}'
```

### Arrival Speedup

Use `--arrival-speedup-ratio` to compress or stretch the trace arrival process without changing the
mocker compute model. Larger values make arrivals happen sooner relative to the original trace.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --num-workers 4 \
    --arrival-speedup-ratio 5 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}'
```
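
Conceptually, the ratio rescales arrival times while leaving per-request compute untouched. The sketch below assumes the ratio simply divides each arrival timestamp; that is an illustration of "larger values make arrivals happen sooner", not a statement about the exact implementation, and `scale_arrivals` is a hypothetical helper.

```python
def scale_arrivals(timestamps, arrival_speedup_ratio):
    """Compress (ratio > 1) or stretch (ratio < 1) a list of arrival times."""
    if arrival_speedup_ratio <= 0:
        raise ValueError("arrival_speedup_ratio must be positive")
    return [t / arrival_speedup_ratio for t in timestamps]


scaled = scale_arrivals([0, 1000, 2500], arrival_speedup_ratio=5)
print(scaled)  # [0.0, 200.0, 500.0]
```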

### Router Modes

Replay currently supports:

- `round_robin`
- `kv_router`

`kv_router` uses the shared local scheduler and an in-process KV indexer. Router policy tuning is
provided through `--router-config`, not a dedicated top-level replay flag. In offline replay:

- `kv_router` is supported only when `num_workers > 1`
- router queueing is enabled and uses simulation time rather than wall-clock time
- KV visibility is delayed slightly relative to request lifecycle events
- queue admission is driven by router lifecycle edges (`add_request`, `mark_prefill_completed`, and `free`)
- transient in-pass prefill occupancy is still approximated at the router level rather than modeled exactly
- when `router_prefill_load_model` is `"aic"`, replay predicts one expected prefill duration per
  admitted request and decays only the oldest active prefill request on each worker

To compare queue policies manually, keep the same trace and engine args fixed and swap only
`router_queue_policy` inside `--router-config`:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}' \
    --router-config '{"router_queue_policy":"fcfs"}'

python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}' \
    --router-config '{"router_queue_policy":"lcfs"}'
```

`lcfs` is included as a deliberately worse baseline under saturation; use it for experiments, not as
a production default.
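
To compare two such runs, the report JSON files can be diffed field by field. The sketch below makes no assumption about the report schema: it flattens whatever numeric fields both reports share. The helper names and the placeholder keys in the demo are hypothetical, not real report fields.

```python
import json


def numeric_leaves(obj, prefix=""):
    """Flatten nested dicts into {"dotted.path": number}."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(numeric_leaves(value, f"{prefix}{key}."))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        out[prefix.rstrip(".")] = obj
    return out


def diff_reports(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every shared numeric field that changed."""
    a, b = numeric_leaves(baseline), numeric_leaves(candidate)
    return {key: b[key] - a[key] for key in a.keys() & b.keys() if a[key] != b[key]}


def load_report(path: str) -> dict:
    """Load one report written via --report-json."""
    with open(path) as handle:
        return json.load(handle)


# Placeholder dictionaries standing in for two loaded reports.
delta = diff_reports(
    {"metric_group": {"x": 1.0, "y": 2}},
    {"metric_group": {"x": 1.5, "y": 2}},
)
print(delta)  # {'metric_group.x': 0.5}
```

Passing a distinct `--report-json` path to each run keeps the two reports from being confused with each other.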

To enable router-side AIC prefill-load modeling during replay:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}' \
    --router-config '{"router_track_prefill_tokens":true,"router_prefill_load_model":"aic"}' \
    --aic-backend vllm \
    --aic-system h200_sxm \
    --aic-model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
    --aic-tp-size 1
```

For offline disagg replay, the same top-level `--aic-*` flags are supported, but the estimator is
applied only to the prefill-stage router.

## Output

The report contains:

- request counts
- input and output token totals
- virtual duration and wall-clock runtime
- request and token throughput
- prefix cache reuse ratio
- TTFT, TTST, TPOT, ITL, and end-to-end latency summaries
- output-token-throughput-per-user summaries

The dedicated replay CLI writes the same report schema that the Python APIs
`dynamo.replay.run_trace_replay(...)` and `dynamo.replay.run_synthetic_trace_replay(...)` return.

If `--report-json` is not provided, `python -m dynamo.replay` writes a timestamped
`dynamo_replay_report_*.json` file in the current working directory.
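
When the default filename is used, the newest report can be picked up by globbing for that pattern. A minimal sketch, assuming modification time is a good enough ordering; `load_latest_report` is a hypothetical helper, and the demo file name and contents are placeholders rather than a real report.

```python
import json
import tempfile
from pathlib import Path


def load_latest_report(directory="."):
    """Load the most recently modified dynamo_replay_report_*.json, or None."""
    candidates = sorted(
        Path(directory).glob("dynamo_replay_report_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        return None
    with candidates[-1].open() as handle:
        return json.load(handle)


# Demo against a temporary directory with a placeholder report file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dynamo_replay_report_demo.json").write_text('{"ok": true}')
    report = load_latest_report(tmp)
print(report)
```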

## Replay Constraints

Shared replay constraints:

- `extra_engine_args.engine_type` must be `vllm` or `sglang`
- aggregated replay uses the existing aggregated args path (`extra_engine_args`)
- disagg replay requires both `prefill_engine_args` and `decode_engine_args`
- disagg replay requires `router_mode=kv_router`
- replay `dp_size` must be `1`
- disagg replay requires matching `block_size` in `prefill_engine_args` and `decode_engine_args`

Additional offline constraints:

- offline `kv_router` requires `num_workers > 1`
- single-worker offline replay is still a dedicated fast path for `vllm`, but it now supports both
  flat request replay and workload-driven multi-turn replay
- `sglang` still goes through the shared multi-worker replay runtime even when `num_workers=1`
- offline disagg replay is a separate two-stage runtime with prefill and decode worker pools

Additional online constraints:

- the current live replay path is limited to aggregated workers (no disagg prefill/decode pools)

If you violate any of these constraints, replay fails immediately with a validation error.
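
For scripted sweeps it can help to catch these cases before launching a run. The following pre-flight check is a sketch that mirrors the constraints listed in this section; it is not the validation code used by replay, `check_replay_config` is a hypothetical name, and fields left unset are skipped because their defaults are not restated here.

```python
def check_replay_config(replay_mode, router_mode, num_workers=1,
                        extra_engine_args=None,
                        prefill_engine_args=None,
                        decode_engine_args=None):
    """Return a list of constraint violations (empty when none are found)."""
    errors = []
    staged = (prefill_engine_args, decode_engine_args)
    disagg = any(args is not None for args in staged)

    # Shared constraints apply to every engine-args object that was given.
    for args in (extra_engine_args, *staged):
        if args is None:
            continue
        if "engine_type" in args and args["engine_type"] not in ("vllm", "sglang"):
            errors.append("engine_type must be vllm or sglang")
        if args.get("dp_size", 1) != 1:
            errors.append("dp_size must be 1")

    if disagg:
        if any(args is None for args in staged):
            errors.append("disagg needs both prefill_engine_args and decode_engine_args")
        else:
            if prefill_engine_args.get("worker_type") != "prefill":
                errors.append('prefill_engine_args must set worker_type "prefill"')
            if decode_engine_args.get("worker_type") != "decode":
                errors.append('decode_engine_args must set worker_type "decode"')
            if prefill_engine_args.get("block_size") != decode_engine_args.get("block_size"):
                errors.append("prefill and decode block_size must match")
        if router_mode != "kv_router":
            errors.append("disagg requires router_mode=kv_router")
        if replay_mode == "online":
            errors.append("online replay is limited to aggregated workers")
    elif replay_mode == "offline" and router_mode == "kv_router" and num_workers <= 1:
        errors.append("offline kv_router requires num_workers > 1")
    return errors


problems = check_replay_config("offline", "kv_router", num_workers=1,
                               extra_engine_args={"block_size": 64})
print(problems)  # ['offline kv_router requires num_workers > 1']
```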

## Practical Notes

- `python -m dynamo.replay` requires exactly one of:
  either a trace file, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
- `--replay-concurrency` works with both trace replay and synthetic replay
- mocker compute-speed knobs such as `speedup_ratio` still affect simulated timing when passed via
  the engine-args JSON for the chosen replay mode
- `--arrival-speedup-ratio` affects trace timestamps, not worker compute speed
- `--trace-block-size` affects only how trace `hash_ids` expand into tokens
- `--arrival-interval-ms` only applies to synthetic replay
- `--turns-per-session`, `--shared-prefix-ratio`, `--num-prefix-groups`, and
  `--inter-turn-delay-ms` only apply to synthetic replay
- `--extra-engine-args`, `--prefill-engine-args`, `--decode-engine-args`, and `--router-config`
  are JSON strings on the standalone replay CLI
- top-level `--aic-*` flags are used only for router-side prompt-load modeling; engine timing AIC
  still belongs in the engine-args JSON
- offline replay does not need planner runtime setup, router registration, or external event transport
- trace-file replay can use different values for `--trace-block-size` and engine `block_size`
- Mooncake/toolagent traces typically use `--trace-block-size 512`, while engine `block_size`
  often stays `64`

## When To Use This vs AIPerf

Use offline replay when:

- you want a fast scheduler-only simulation
- you want deterministic CI coverage of replay behavior
- you do not need HTTP serving, frontend behavior, or network effects

Use [Dynamo Benchmarking](benchmarking.md) when:

- you want end-to-end benchmarking against a live endpoint
- you need frontend, transport, or cluster-level behavior
- you want AIPerf dashboards and endpoint-facing metrics