# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Mocker Offline Trace Replay
subtitle: Replay Mooncake-style traces offline without launching a runtime or router
---
This guide covers the mocker's offline trace replay mode, which replays a Mooncake-style JSONL trace directly through the mock scheduler and writes a metrics report. Unlike normal `dynamo.mocker` usage, this mode does not launch workers, register endpoints, or require NATS, etcd, or a frontend.
Use this when you want to:
- benchmark scheduler behavior from a saved trace
- compare timing and cache behavior across mocker configurations
- validate replay logic in CI without bringing up a distributed stack
## Quick Start
Run offline replay by passing `--trace-file`:
```bash
python -m dynamo.mocker \
--trace-file /path/to/mooncake_trace.jsonl \
--model-path Qwen/Qwen3-0.6B
```
This writes a JSON report next to the trace file by default:
```text
/path/to/mooncake_trace.replay.json
```
The CLI also prints a `Replay Summary` table to stdout with request counts, throughput, and latency statistics.
## Input Format
The trace file must be Mooncake-style JSONL: one JSON object per line, each describing a single request.
The mocker synthesizes token blocks from `hash_ids` using the configured `--block-size`, so the replay block size should match the block size used when the trace was generated.
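The reason the block size has to match can be sketched with a small consistency check. The field names below (`timestamp`, `input_length`, `output_length`, `hash_ids`) follow the publicly released Mooncake trace format and are an assumption here; this guide does not confirm exactly which fields the mocker requires. The helper `blocks_match` is hypothetical and is not part of `dynamo.mocker`. It assumes the usual Mooncake convention that each hash ID stands for one block of tokens.

```python
import json
import math


def blocks_match(record: dict, block_size: int) -> bool:
    """Return True if the number of hash IDs covers the input at this block size.

    Assumption: one hash ID per block, with a partial last block still
    counting as a block (hence the ceiling).
    """
    expected_blocks = math.ceil(record["input_length"] / block_size)
    return len(record["hash_ids"]) == expected_blocks


# One illustrative trace line (made-up values, assumed field names).
sample_line = json.dumps(
    {"timestamp": 0, "input_length": 1200, "output_length": 64, "hash_ids": [0, 1, 2]}
)
record = json.loads(sample_line)

print(blocks_match(record, 512))  # True: 3 blocks of 512 tokens cover 1200 tokens
print(blocks_match(record, 64))   # False: 1200 tokens would need 19 blocks of 64
```

A mismatch like the second call suggests the trace was generated with a different block size than the one passed to replay.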
## Modes
### Fixed-Schedule Replay
The default replay mode preserves the timestamps from the trace and simulates arrivals in virtual time:
```bash
python -m dynamo.mocker \
--trace-file /path/to/mooncake_trace.jsonl \
--model-path Qwen/Qwen3-0.6B \
--block-size 512
```
This is the right mode when you want deterministic replay of the original arrival pattern.
### Closed-Loop Concurrency Replay
Use `--replay-concurrency` to ignore trace arrival timing and keep a fixed number of requests in flight:
```bash
python -m dynamo.mocker \
--trace-file /path/to/mooncake_trace.jsonl \
--model-path Qwen/Qwen3-0.6B \
--block-size 512 \
--replay-concurrency 16
```
This mode is useful when you want to compare scheduler behavior under a fixed offered concurrency rather than the original trace schedule.
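The difference between the two modes can be illustrated with a toy model. This is only a sketch and not the mocker's scheduler: it assumes every request has a fixed, known duration that does not depend on load, and the function names are hypothetical. Fixed-schedule replay starts each request at its trace timestamp, while closed-loop replay ignores timestamps and starts the next request as soon as one of the in-flight slots frees up.

```python
import heapq


def fixed_schedule_starts(timestamps_ms):
    """Fixed-schedule: each request starts at its recorded trace timestamp."""
    return list(timestamps_ms)


def closed_loop_starts(durations_ms, concurrency):
    """Closed-loop: keep at most `concurrency` requests in flight.

    Timestamps are ignored. A new request starts immediately while a slot is
    free, otherwise at the earliest completion time of an in-flight request.
    """
    in_flight = []  # min-heap of completion times
    starts = []
    now = 0.0
    for duration in durations_ms:
        if len(in_flight) >= concurrency:
            # All slots busy: advance virtual time to the next completion.
            now = heapq.heappop(in_flight)
        starts.append(now)
        heapq.heappush(in_flight, now + duration)
    return starts


timestamps_ms = [0, 100, 200, 300]
durations_ms = [250, 250, 250, 250]  # toy values, not measured

print(fixed_schedule_starts(timestamps_ms))   # [0, 100, 200, 300]
print(closed_loop_starts(durations_ms, 2))    # [0.0, 0.0, 250.0, 250.0]
```

In the toy output, the fixed schedule reproduces the original arrival pattern, whereas the closed-loop run issues requests in waves of two regardless of when they arrived in the trace.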
## Output
Use `--output-file` to override the default report location:
```bash
python -m dynamo.mocker \
--trace-file /path/to/mooncake_trace.jsonl \
--model-path Qwen/Qwen3-0.6B \
--output-file /tmp/replay-report.json
```
If `--output-file` is not set, the report path defaults to `<trace stem>.replay.json` in the same directory as the input trace.
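For scripts that need to locate the report afterwards, the default naming rule can be mirrored as follows. This is a sketch of the rule as stated in this guide, not the mocker's own code; `default_report_path` is a hypothetical helper, and how the mocker treats multi-suffix names such as `trace.jsonl.gz` is not covered here.

```python
from pathlib import Path


def default_report_path(trace_file: str) -> Path:
    """Return <trace stem>.replay.json in the same directory as the trace."""
    trace = Path(trace_file)
    # Path.stem drops only the final suffix: "mooncake_trace.jsonl" -> "mooncake_trace".
    return trace.with_name(f"{trace.stem}.replay.json")


report = default_report_path("/path/to/mooncake_trace.jsonl")
print(report.name)  # mooncake_trace.replay.json
```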
The report contains:
- request counts
- input and output token totals
- virtual duration and wall-clock runtime
- request and token throughput
- prefix cache reuse ratio
- TTFT, TTST, TPOT, ITL, and end-to-end latency summaries
- output-token-throughput-per-user summaries
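When comparing two mocker configurations, it can help to diff the numeric fields of two reports. The sketch below deliberately makes no assumption about the report's key names, since the exact schema is not listed in this guide: it flattens whatever nested numeric values it finds. The helper names and the sample dictionaries are hypothetical.

```python
def numeric_leaves(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": number}; non-numeric values are skipped."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(numeric_leaves(value, f"{prefix}{key}."))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        out[prefix.rstrip(".")] = obj
    return out


def diff_reports(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every numeric field present in both."""
    base, cand = numeric_leaves(baseline), numeric_leaves(candidate)
    return {key: cand[key] - base[key] for key in sorted(base.keys() & cand.keys())}


# Made-up report fragments; real reports would be read with json.load().
run_a = {"requests": 10, "latency": {"p50": 1.5}}
run_b = {"requests": 10, "latency": {"p50": 2.0}}

print(diff_reports(run_a, run_b))  # {'latency.p50': 0.5, 'requests': 0}
```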
## Replay Constraints
Offline replay currently supports only this configuration:
- `--num-workers 1`
- aggregated mode
- `--engine-type vllm`
- `--data-parallel-size 1`
If you violate those constraints, replay fails immediately with a validation error.
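A wrapper script can run the same checks up front to fail before invoking the CLI. The function below is a hypothetical pre-flight check written from the list above; it is not the mocker's validation code, and the parameter names (including the `aggregated` boolean) are assumptions made for illustration.

```python
def check_replay_constraints(num_workers=1, engine_type="vllm",
                             data_parallel_size=1, aggregated=True):
    """Raise ValueError if the settings fall outside what offline replay supports."""
    problems = []
    if num_workers != 1:
        problems.append("--num-workers must be 1")
    if not aggregated:
        problems.append("only aggregated mode is supported")
    if engine_type != "vllm":
        problems.append("--engine-type must be vllm")
    if data_parallel_size != 1:
        problems.append("--data-parallel-size must be 1")
    if problems:
        raise ValueError("offline replay: " + "; ".join(problems))


check_replay_constraints()  # supported configuration, no error

try:
    check_replay_constraints(num_workers=4)
except ValueError as err:
    print(err)  # offline replay: --num-workers must be 1
```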
## Practical Notes
- `--replay-concurrency` requires `--trace-file`
- `--speedup-ratio` still affects simulated timing
- `--extra-engine-args` can be used to provide a full mocker config JSON instead of individual CLI flags
- offline replay does not need planner runtime setup, router registration, or event transport
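For CI use, the invocation can be assembled from Python. The sketch below uses only flags that appear in this guide; the helper names `build_replay_cmd` and `run_replay` are hypothetical. Because `--trace-file` is a required argument of the builder, `--replay-concurrency` can never be passed without it.

```python
import subprocess
import sys


def build_replay_cmd(trace_file, model_path, block_size=None,
                     replay_concurrency=None, output_file=None):
    """Build the argv list for an offline replay run."""
    cmd = [sys.executable, "-m", "dynamo.mocker",
           "--trace-file", str(trace_file),
           "--model-path", model_path]
    if block_size is not None:
        cmd += ["--block-size", str(block_size)]
    if replay_concurrency is not None:
        cmd += ["--replay-concurrency", str(replay_concurrency)]
    if output_file is not None:
        cmd += ["--output-file", str(output_file)]
    return cmd


def run_replay(**kwargs):
    """Run the replay; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_replay_cmd(**kwargs), check=True)


cmd = build_replay_cmd("/path/to/mooncake_trace.jsonl", "Qwen/Qwen3-0.6B",
                       block_size=512, replay_concurrency=16)
print(" ".join(cmd[1:]))
```

`run_replay` is defined but not called here, so the snippet runs without Dynamo installed.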
## When To Use This vs AIPerf
Use offline replay when:
- you want a fast scheduler-only simulation
- you want deterministic CI coverage of replay behavior
- you do not need HTTP serving, frontend behavior, or network effects
Use [Dynamo Benchmarking](benchmarking.md) when:
- you want end-to-end benchmarking against a live endpoint
- you need frontend, transport, or cluster-level behavior
- you want AIPerf dashboards and endpoint-facing metrics
| `--model-path` | Required | HuggingFace model ID or local path for tokenizer |
| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--trace-file` | None | Run offline trace replay from a Mooncake-style JSONL trace file |
| `--output-file` | `<trace stem>.replay.json` | Write replay metrics JSON to this path |
| `--replay-concurrency` | None | Run offline replay in closed-loop concurrency mode with this many in-flight requests |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
...
@@ -119,6 +122,20 @@ python -m dynamo.mocker \
> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.
## Offline Trace Replay
The mocker also supports an offline replay mode for Mooncake-style traces:
```bash
python -m dynamo.mocker \
--trace-file /path/to/mooncake_trace.jsonl \
--model-path Qwen/Qwen3-0.6B
```
This mode writes a replay report JSON and prints a `Replay Summary` table without launching a runtime or router.
For full usage, constraints, and benchmarking guidance, see [Mocker Offline Trace Replay](../benchmarks/mocker-trace-replay.md).
## Performance Modeling Setup
By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, pass `--planner-profile-data` with either: