# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: Apache-2.0
title: Mocker Offline Trace Replay
title: Mocker Trace Replay
subtitle: Replay Mooncake-style traces offline without launching a runtime or router
subtitle: Replay Mooncake-style traces through the mocker in offline or online mode
---
---
This guide covers the mocker's offline trace replay mode, which replays a Mooncake-style JSONL trace directly through the mock scheduler and writes a metrics report. Unlike normal `dynamo.mocker` usage, this mode does not launch workers, register endpoints, or require NATS, etcd, or a frontend.
This guide covers the mocker's trace replay support for Mooncake-style JSONL traces. The replay
surface is available in two forms:
- `python -m dynamo.mocker --trace-file ...`, which writes a report file and prints a replay summary
- `python -m dynamo.replay ...`, which returns the replay report JSON on stdout and exposes
  `offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, and synthetic replay inputs
  directly
Unlike normal `dynamo.mocker` usage, offline replay does not launch workers, register endpoints, or
require NATS, etcd, or a frontend. Online replay does exercise the live mock-worker runtime path.
Use this when you want to:
Use this when you want to:
...
@@ -15,7 +24,31 @@ Use this when you want to:
...
@@ -15,7 +24,31 @@ Use this when you want to:
## Quick Start
## Quick Start
Run offline replay by passing `--trace-file`:
Run offline replay through the dedicated replay CLI:
The mocker synthesizes token blocks from `hash_ids` using the configured `--block-size`, so the replay block size should match the block size used when the trace was generated.
The mocker synthesizes token blocks from `hash_ids` using the configured `--block-size`, so the
replay block size must match the block size used when the trace was generated. Public Mooncake
traces are commonly block-level hashes at `512` tokens per hash ID, so replaying them with the
default mocker `block_size=64` will fail once `input_length > len(hash_ids) * 64`. For
`engine_type=sglang`, replay still uses canonical `block_size` internally; `sglang.page_size` is
accepted as a compatibility alias and is normalized into `block_size` before replay starts.
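The block-size constraint above can be checked before a replay run. The following is a minimal sketch, not part of the mocker: the helper name `find_block_size_mismatches` is hypothetical, and it assumes each JSONL record carries the `input_length` and `hash_ids` fields referenced above.

```python
import json


def find_block_size_mismatches(trace_lines, block_size):
    """Return indices of records where input_length > len(hash_ids) * block_size.

    Hypothetical helper for illustration; it only mirrors the inequality
    stated in the text and does not call into the mocker.
    """
    mismatched = []
    for index, line in enumerate(trace_lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        record = json.loads(line)
        capacity = len(record["hash_ids"]) * block_size
        if record["input_length"] > capacity:
            mismatched.append(index)
    return mismatched


# Two made-up records shaped like block-level hashes at 512 tokens per hash ID.
sample = [
    json.dumps({"input_length": 1200, "hash_ids": [0, 1, 2]}),
    json.dumps({"input_length": 500, "hash_ids": [3]}),
]

print(find_block_size_mismatches(sample, 64))   # [0, 1]
print(find_block_size_mismatches(sample, 512))  # []
```

A non-empty result for the block size you intend to use suggests the trace was generated with a larger block size.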
## Replay Surfaces
### `python -m dynamo.replay`
The dedicated replay CLI exposes:
- either a positional `trace_file`, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
| `--router-queue-policy <str>` | `fcfs` | Scheduling policy for the queue: `fcfs` (tail TTFT) or `wspt` (avg TTFT) |
| `--router-queue-policy <str>` | `fcfs` | Scheduling policy for the queue: `fcfs` (tail TTFT), `wspt` (avg TTFT), or `lcfs` (comparison-only reverse ordering) |
For all available options: `python -m dynamo.frontend --help`
For all available options: `python -m dynamo.frontend --help`
...
@@ -231,10 +231,11 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na
...
@@ -231,10 +231,11 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--router-queue-threshold`: Queue threshold fraction for prefill token capacity (default: 2.0). The router holds incoming requests in a priority queue while all workers exceed this fraction of `max_num_batched_tokens`, releasing them when capacity frees up. This defers dispatch (not rejection) so that routing decisions use the most up-to-date load metrics at the moment the request is actually sent to a worker. It also enables **priority scheduling** via `priority` hints in `nvext.agent_hints`: higher values shift a request's effective arrival time earlier in the queue, giving it priority over lower-valued requests. Must be > 0. Set to None to disable queueing (requests are dispatched immediately).
- `--router-queue-threshold`: Queue threshold fraction for prefill token capacity (default: 4.0). The router holds incoming requests in a priority queue while all workers exceed this fraction of `max_num_batched_tokens`, releasing them when capacity frees up. This defers dispatch (not rejection) so that routing decisions use the most up-to-date load metrics at the moment the request is actually sent to a worker. It also enables **priority scheduling** via `priority` hints in `nvext.agent_hints`: higher values shift a request's effective arrival time earlier in the queue, giving it priority over lower-valued requests. Must be > 0. Set to None to disable queueing (requests are dispatched immediately).
- `--router-queue-policy`: Scheduling policy for the router queue (default: `fcfs`). Two policies are available:
- `--router-queue-policy`: Scheduling policy for the router queue (default: `fcfs`). Three policies are available:
  - **`fcfs`** (first-come first-served): Orders by adjusted arrival time (`priority_jump - arrival_offset`). Optimizes **tail TTFT**: no request waits longer than necessary.
  - **`fcfs`** (first-come first-served): Orders by adjusted arrival time (`priority_jump - arrival_offset`). Optimizes **tail TTFT**: no request waits longer than necessary.
  - **`lcfs`** (last-come first-served): Orders by adjusted reverse arrival time (`priority_jump + arrival_offset`). Intentionally favors newer arrivals under saturation and is mainly useful for controlled comparison experiments.
  - **`wspt`** (weighted shortest processing time, Smith's rule): Orders by `(1 + priority_jump) / isl_tokens`. Optimizes **average TTFT**: short or high-priority requests are scheduled before long low-priority ones, minimizing total weighted completion time.
  - **`wspt`** (weighted shortest processing time, Smith's rule): Orders by `(1 + priority_jump) / isl_tokens`. Optimizes **average TTFT**: short or high-priority requests are scheduled before long low-priority ones, minimizing total weighted completion time.
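To make the `--router-temperature` behavior described in the list above concrete, here is a minimal sketch of temperature-scaled softmax sampling. It is an illustration only: `select_worker` is a hypothetical name, and it assumes the logits are the negated per-worker costs, which the text does not spell out.

```python
import math
import random


def select_worker(costs, temperature=0.0, rng=None):
    """Pick a worker index from per-worker costs (lower cost is better).

    Assumption: logits are -cost / temperature. This is a sketch, not the
    router's actual implementation.
    """
    if temperature == 0:
        # Deterministic case: always the lowest-cost worker.
        return min(range(len(costs)), key=costs.__getitem__)
    rng = rng or random.Random()
    logits = [-cost / temperature for cost in costs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    return rng.choices(range(len(costs)), weights=weights, k=1)[0]


costs = [3.0, 1.0, 2.0]
print(select_worker(costs))                        # 1
print(select_worker(costs, temperature=1.0, rng=random.Random(0)))
```

With a small positive temperature the choice stays concentrated on the cheapest worker. Larger values spread selections across workers with similar costs.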
### KV Event Transport and Persistence
### KV Event Transport and Persistence
...
@@ -281,7 +282,7 @@ Use `--no-router-assume-kv-reuse` in disaggregated setups where the decode worke
...
@@ -281,7 +282,7 @@ Use `--no-router-assume-kv-reuse` in disaggregated setups where the decode worke
Use `--router-track-output-blocks` **(experimental)** when your workload is output-heavy and you want the router to account for output-side KV cache growth in load balancing. This is useful in two scenarios: (1) workloads with long output sequences and little multi-turn reuse, where output blocks dominate the KV cache footprint; (2) agentic schedulers (e.g. NAT or other LLM routers) that can accurately predict the expected output sequence length per request. When enabled, the router adds placeholder blocks as tokens are generated. If you additionally pass `nvext.agent_hints.osl` (expected output sequence length in tokens) per request, the router applies fractional decay to output blocks: each output block's weight starts at 1.0 and decays linearly toward 0.0 as generation approaches the expected OSL. This lets the router predict that a request nearing completion will soon free its blocks, effectively modeling the future load trajectory rather than just the current snapshot. Without `osl`, output blocks are added at full weight with no decay. The flag requires `--router-track-active-blocks` (the default).
Use `--router-track-output-blocks` **(experimental)** when your workload is output-heavy and you want the router to account for output-side KV cache growth in load balancing. This is useful in two scenarios: (1) workloads with long output sequences and little multi-turn reuse, where output blocks dominate the KV cache footprint; (2) agentic schedulers (e.g. NAT or other LLM routers) that can accurately predict the expected output sequence length per request. When enabled, the router adds placeholder blocks as tokens are generated. If you additionally pass `nvext.agent_hints.osl` (expected output sequence length in tokens) per request, the router applies fractional decay to output blocks: each output block's weight starts at 1.0 and decays linearly toward 0.0 as generation approaches the expected OSL. This lets the router predict that a request nearing completion will soon free its blocks, effectively modeling the future load trajectory rather than just the current snapshot. Without `osl`, output blocks are added at full weight with no decay. The flag requires `--router-track-active-blocks` (the default).
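The linear decay described above can be sketched as a small function. This is an assumed form rather than the router's code: `output_block_weight` is a hypothetical name, and the exact formula (one shared weight of `1 - generated / osl`, clamped to `[0, 1]`) is an assumption consistent with "starts at 1.0 and decays linearly toward 0.0".

```python
def output_block_weight(generated_tokens, expected_osl=None):
    """Weight applied to a request's output blocks under the assumed decay.

    Without an expected OSL the weight stays at 1.0 (no decay), matching the
    text. The clamping and the per-request (not per-block) weight are
    assumptions made for this sketch.
    """
    if expected_osl is None or expected_osl <= 0:
        return 1.0
    progress = min(max(generated_tokens / expected_osl, 0.0), 1.0)
    return 1.0 - progress


print(output_block_weight(0, 100))    # 1.0
print(output_block_weight(50, 100))   # 0.5
print(output_block_weight(150, 100))  # 0.0
print(output_block_weight(50))        # 1.0
```

Under this form, a request halfway to its expected OSL contributes half the output-block load it would contribute with no `osl` hint.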
The `--router-queue-threshold` (default: 2.0) controls when incoming requests are held in a priority queue. The router holds requests while all workers exceed the given fraction of `max_num_batched_tokens`, releasing them as capacity frees up. This defers the routing decision so it is made with the freshest load metrics, rather than dispatching into an already-saturated system. It also enables priority scheduling via `nvext.agent_hints.priority`. Set to None to disable queueing entirely.
The `--router-queue-threshold` (default: 4.0) controls when incoming requests are held in a priority queue. The router holds requests while all workers exceed the given fraction of `max_num_batched_tokens`, releasing them as capacity frees up. This defers the routing decision so it is made with the freshest load metrics, rather than dispatching into an already-saturated system. It also enables priority scheduling via `nvext.agent_hints.priority`. Set to None to disable queueing entirely.
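The hold condition for `--router-queue-threshold` can be expressed as a single predicate. The sketch below is illustrative: `should_hold` and its per-worker load list are hypothetical, and measuring load in pending prefill tokens is an assumption based on the "prefill token capacity" wording.

```python
def should_hold(worker_prefill_tokens, max_num_batched_tokens, queue_threshold):
    """Return True when a new request should wait in the router queue.

    Sketch under stated assumptions: a request is held only while every
    worker exceeds queue_threshold * max_num_batched_tokens, and a threshold
    of None disables queueing.
    """
    if queue_threshold is None:
        return False  # queueing disabled, dispatch immediately
    if not worker_prefill_tokens:
        return False  # no workers to compare against in this sketch
    limit = queue_threshold * max_num_batched_tokens
    return all(tokens > limit for tokens in worker_prefill_tokens)


print(should_hold([20000, 17000], 8192, 2.0))   # True: both exceed 16384
print(should_hold([20000, 1000], 8192, 2.0))    # False: one worker has room
print(should_hold([20000, 17000], 8192, 4.0))   # False: limit is 32768
print(should_hold([20000, 17000], 8192, None))  # False: queueing disabled
```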
Use `--router-queue-policy wspt` when your workload has a mix of short and long requests and you want to minimize **average** TTFT. WSPT (Smith's rule) schedules short or high-priority requests first, reducing mean latency across the batch. Use the default `fcfs` when you want to minimize **tail** TTFT — no request waits longer than necessary, since ordering is purely by (adjusted) arrival time.
Use `--router-queue-policy wspt` when your workload has a mix of short and long requests and you want to minimize **average** TTFT. WSPT (Smith's rule) schedules short or high-priority requests first, reducing mean latency across the batch. Use the default `fcfs` when you want to minimize **tail** TTFT — no request waits longer than necessary, since ordering is purely by (adjusted) arrival time.
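The difference between the policies is easiest to see by computing the ordering keys listed for `--router-queue-policy`. The sketch below assumes the request with the largest key is dispatched first, which is consistent with the formulas but not stated explicitly. The function names and the sample requests are hypothetical.

```python
def queue_key(policy, priority_jump, arrival_offset, isl_tokens):
    """Ordering key per policy, using the formulas from the flag description."""
    if policy == "fcfs":
        return priority_jump - arrival_offset
    if policy == "lcfs":
        return priority_jump + arrival_offset
    if policy == "wspt":
        return (1 + priority_jump) / isl_tokens
    raise ValueError(f"unknown policy: {policy}")


def dispatch_order(policy, requests):
    """Names of requests, largest key first (assumed dispatch direction)."""
    ranked = sorted(
        requests,
        key=lambda r: queue_key(
            policy, r["priority_jump"], r["arrival_offset"], r["isl_tokens"]
        ),
        reverse=True,
    )
    return [r["name"] for r in ranked]


# Made-up workload: a long request arrives first, a short one arrives later.
requests = [
    {"name": "long-early", "priority_jump": 0.0, "arrival_offset": 0.0, "isl_tokens": 8000},
    {"name": "short-late", "priority_jump": 0.0, "arrival_offset": 1.0, "isl_tokens": 200},
]

print(dispatch_order("fcfs", requests))  # ['long-early', 'short-late']
print(dispatch_order("wspt", requests))  # ['short-late', 'long-early']
print(dispatch_order("lcfs", requests))  # ['short-late', 'long-early']
```

Under `wspt` the short request jumps ahead of the long one, which lowers mean TTFT at the cost of delaying the long request. `fcfs` keeps arrival order.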
@@ -11,7 +11,7 @@ The Mocker is a lightweight, high-fidelity simulation of an LLM inference engine
...
@@ -11,7 +11,7 @@ The Mocker is a lightweight, high-fidelity simulation of an LLM inference engine
The mocker simulates:
The mocker simulates:
- **Block-based KV cache management** with LRU eviction
- **Block-based KV cache management** with LRU eviction
- **Continuous batching scheduler** with watermark-based admission control
- **Engine-specific continuous batching schedulers** for vLLM and SGLang
- **Prefix caching** with hash-based block deduplication
- **Prefix caching** with hash-based block deduplication
- **Chunked prefill** for better batching efficiency
- **Chunked prefill** for better batching efficiency
- **Realistic timing models** for prefill and decode phases
- **Realistic timing models** for prefill and decode phases
...
@@ -74,10 +74,10 @@ python -m dynamo.mocker \
...
@@ -74,10 +74,10 @@ python -m dynamo.mocker \
| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--trace-file` | None | Run offline trace replay from a Mooncake-style JSONL trace file |
| `--trace-file` | None | Run offline trace replay from a Mooncake-style JSONL trace file |
| `--output-file` | `<trace stem>.replay.json` | Write replay metrics JSON to this path |
| `--output-file` | `TRACE_STEM.replay.json` | Write replay metrics JSON to this path |
| `--replay-concurrency` | None | Run offline replay in closed-loop concurrency mode with this many in-flight requests |
| `--replay-concurrency` | None | Run offline replay in closed-loop concurrency mode with this many in-flight requests |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `--block-size` | 64 (`vllm`) / engine-specific | Tokens per KV cache block. For `sglang`, if omitted, the effective page/block size defaults to 1 or to `--sglang-page-size` when provided |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Maximum tokens per batch |
| `--max-num-batched-tokens` | 8192 | Maximum tokens per batch |
| `--sglang-page-size` | 1 | SGLang radix-cache page size in tokens. Also becomes the effective block size when `--engine-type sglang` and `--block-size` is omitted |
| `--sglang-max-prefill-tokens` | 16384 | SGLang max prefill-token budget per batch |
| `--sglang-max-prefill-tokens` | 16384 | SGLang max prefill-token budget per batch |
| `--sglang-clip-max-new-tokens` | 4096 | SGLang admission-budget cap for max new tokens |
| `--sglang-clip-max-new-tokens` | 4096 | SGLang admission-budget cap for max new tokens |
...
@@ -126,9 +125,12 @@ python -m dynamo.mocker \
...
@@ -126,9 +125,12 @@ python -m dynamo.mocker \
> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.
> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.
## Offline Trace Replay
## Trace Replay
The mocker also supports an offline replay mode for Mooncake-style traces:
The mocker also supports replaying Mooncake-style traces through both the original mocker CLI and
the dedicated replay harness.
For the original mocker CLI flow:
```bash
```bash
python -m dynamo.mocker \
python -m dynamo.mocker \
...
@@ -136,9 +138,41 @@ python -m dynamo.mocker \
...
@@ -136,9 +138,41 @@ python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B
--model-path Qwen/Qwen3-0.6B
```
```
This mode writes a replay report JSON and prints a `Replay Summary` table without launching a runtime or router.
For the standalone replay CLI, which exposes `offline|online`, `round_robin|kv_router`,
`arrival_speedup_ratio`, `router_queue_policy`, and the synthetic replay path directly:
2. **Prefill Queue** - Requests scheduled for prefill
3. **Decode Queue** - Requests actively decoding (ordered by age for preemption)
- **vLLM mocker** uses an upstream-style `waiting + running` scheduler. Each request tracks
  computed tokens, the scheduler spends one token budget across the running set first, and decode
  pressure triggers inline preemption of running requests.
- **SGLang mocker** uses a cache-aware waiting/running scheduler around a radix-style prefix cache.
  It batches prefill work with decode-state awareness and handles pressure primarily through decode
  retraction while preserving cached prefixes.
Each iteration, the scheduler receives incoming requests, moves eligible requests from waiting to prefill based on available memory and compute budgets, simulates the prefill phase for queued requests, runs one decode step for all active sequences, and publishes metrics about current resource utilization.
Both schedulers simulate continuous batching, prefix reuse, chunked prefill, memory pressure, and
decode token emission while publishing metrics about current resource utilization.
When resources become constrained, the scheduler employs preemption: the oldest decoding request is evicted back to the waiting queue, its KV blocks are freed, and it will be rescheduled later. This mirrors how real engines handle memory pressure.
When resources become constrained, the mocker simulates the engine's real recovery path:
- vLLM-style decode preemption and recompute
- SGLang-style decode retraction plus prefix-preserving cache updates
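As a rough illustration of decode preemption under memory pressure, the sketch below evicts the oldest decoding request, frees its KV blocks, and returns it to the waiting queue. It is a simplified model, not the mocker's scheduler: `preempt_oldest`, the request dictionaries, and the choice to append the victim to the back of the waiting queue are all assumptions.

```python
from collections import deque


def preempt_oldest(decoding, waiting, free_blocks, blocks_needed):
    """Evict oldest decoding requests until enough KV blocks are free.

    `decoding` is ordered oldest first. Returns the updated free block count.
    Hypothetical helper; where the victim re-enters the waiting queue is an
    assumption of this sketch.
    """
    while free_blocks < blocks_needed and decoding:
        victim = decoding.popleft()      # oldest decoding request
        free_blocks += victim["blocks"]  # its KV blocks are freed
        victim["blocks"] = 0             # it must recompute when rescheduled
        waiting.append(victim)
    return free_blocks


decoding = deque([{"id": "a", "blocks": 4}, {"id": "b", "blocks": 2}])
waiting = deque()
free_blocks = preempt_oldest(decoding, waiting, free_blocks=1, blocks_needed=4)

print(free_blocks)                 # 5
print([r["id"] for r in waiting])  # ['a']
print([r["id"] for r in decoding]) # ['b']
```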