---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Mocker
---

The mocker is a lightweight, high-fidelity simulation of an LLM inference engine, implemented entirely in Rust. It replicates the core scheduling, memory management, and timing behaviors of production engines without requiring a GPU, which makes it useful for testing Dynamo's routing, KV cache events, disaggregated serving, and planner components.

## Overview

The mocker simulates:

- **Block-based KV cache management** with pluggable eviction (lineage-aware by default, with LRU variants available)
- **Engine-specific continuous batching schedulers** for vLLM and SGLang
- **Prefix caching** with hash-based block deduplication
- **Chunked prefill** for better batching efficiency
- **Realistic timing models** for prefill and decode phases
- **Disaggregated serving** (prefill/decode separation)
- **KV event publishing** for router integration
- **Data parallelism** (multiple DP ranks per engine)

> **Note:** While the mocker uses vLLM as its primary reference implementation, these core components—block-based KV cache management, continuous batching schedulers, LRU evictors, and prefix caching—are fundamental to all modern LLM inference engines, including SGLang and TensorRT-LLM. The architectural patterns simulated here are engine-agnostic and apply broadly across the inference ecosystem.

## Quick Start

### Basic Usage

```bash
# Launch a single mocker worker
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B

# Launch with custom KV cache configuration
python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
    --num-gpu-blocks-override 8192 \
    --block-size 64 \
    --max-num-seqs 256

# Launch with timing speedup for faster testing
python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
    --speedup-ratio 10.0
```

### Disaggregated Serving

```bash
# Launch prefill worker
python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
    --disaggregation-mode prefill \
    --bootstrap-ports 50100

# Launch decode worker (in another terminal)
python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
    --disaggregation-mode decode
```

### Multiple Workers in One Process

```bash
# Launch 4 mocker workers sharing the same tokio runtime
python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
    --num-workers 4
```

## CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | Required | HuggingFace model ID or local path for tokenizer |
| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `--block-size` | 64 (`vllm`) / engine-specific | Tokens per KV cache block. For `sglang`, if omitted, the effective page/block size defaults to 1 or to `--sglang-page-size` when provided |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Maximum tokens per batch |
| `--enable-prefix-caching` | True | Enable prefix caching |
| `--no-enable-prefix-caching` | - | Disable prefix caching |
| `--enable-chunked-prefill` | True | Enable chunked prefill |
| `--no-enable-chunked-prefill` | - | Disable chunked prefill |
| `--preemption-mode` | `lifo` | Decode eviction policy under memory pressure: `lifo` (vLLM v1 style) or `fifo` |
| `--speedup-ratio` | 1.0 | Timing speedup factor |
| `--decode-speedup-ratio` | 1.0 | Decode-only speedup multiplier (e.g. for Eagle speculation) |
| `--data-parallel-size` | 1 | Number of DP replicas |
| `--startup-time` | None | Simulated startup delay (seconds) |
| `--planner-profile-data` | None | Path to either a mocker-format `.npz` file or a profiler results directory |
| `--num-workers` | 1 | Workers per process |
| `--reasoning` | None | JSON config for emitting reasoning token spans, with `start_thinking_token_id`, `end_thinking_token_id`, and `thinking_ratio` |
| `--engine-type` | `vllm` | Engine simulation type: `vllm` or `sglang` |
| `--sglang-schedule-policy` | `fifo` / `fcfs` | SGLang scheduling policy: `fifo`/`fcfs` (default) or `lpm` (longest prefix match) |
| `--sglang-page-size` | 1 | SGLang radix-cache page size in tokens. Also becomes the effective block size when `--engine-type sglang` and `--block-size` is omitted |
| `--sglang-max-prefill-tokens` | 16384 | SGLang max prefill-token budget per batch |
| `--sglang-chunked-prefill-size` | 8192 | SGLang chunked-prefill chunk size |
| `--sglang-clip-max-new-tokens` | 4096 | SGLang admission-budget cap for max new tokens |
| `--sglang-schedule-conservativeness` | 1.0 | SGLang schedule conservativeness factor |
| `--aic-perf-model` | False | Use AIC SDK for latency prediction instead of interpolated/polynomial models. Opt-in only: default mocker and replay paths do not use AIC. Requires `aiconfigurator` installed and usable AIC systems/perf data for the requested `system/backend/version` tuple |
| `--aic-system` | `h200_sxm` | AIC system name (e.g., `h200_sxm`). Used with `--aic-perf-model` |
| `--aic-backend-version` | Auto | AIC backend engine version (e.g., `0.12.0` for vLLM). If not set, uses the default version for the backend |
| `--aic-tp-size` | 1 | Tensor parallel size for AIC latency prediction. Only affects AIC performance model lookups, not mocker scheduling |
| `--extra-engine-args` | None | Path to a JSON file with mocker configuration; overrides individual CLI arguments |
| `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
| `--disaggregation-mode` | `agg` | Worker mode: `agg` (aggregated), `prefill`, or `decode` |
| `--durable-kv-events` | False | Deprecated JetStream KV-event mode; prefer the local indexer / event-plane subscriber path |
| `--zmq-kv-events-ports` | None | Comma-separated ZMQ PUB base ports for KV event publishing, one per worker |
| `--zmq-replay-ports` | None | Comma-separated ZMQ ROUTER base ports for gap recovery, one per worker |
| `--bootstrap-ports` | None | Comma-separated rendezvous base ports, one per worker in disaggregated mode |
| `--kv-transfer-bandwidth` | 64.0 | KV cache transfer bandwidth in GB/s. Set to 0 to disable |
| `--kv-cache-dtype` | auto | KV cache dtype for bytes-per-token computation |
| `--kv-bytes-per-token` | Auto-computed | KV cache bytes per token (override auto-computation) |
| `--discovery-backend` | Env-driven (`etcd`) | Discovery backend: `kubernetes`, `etcd`, `file`, or `mem` |
| `--request-plane` | Env-driven (`tcp`) | Request transport: `nats`, `http`, or `tcp` |
| `--event-plane` | Env-driven (`nats`) | Event transport: `nats` or `zmq` |

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_MOCKER_KV_CACHE_TRACE` | off | Set to `1` or `true` to log structured KV cache allocation and eviction traces |

> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.

## Trace Replay

The mocker supports replaying Mooncake-style traces through the dedicated replay CLI. The CLI
exposes the replay mode (`offline` or `online`), the router mode (`round_robin` or `kv_router`),
an arrival speedup ratio, closed-loop concurrency admission, synthetic workload generation, and
offline disaggregated prefill/decode replay.

The replay CLI defaults to `--replay-mode offline` and `--router-mode round_robin`. Aggregated
replay uses `--extra-engine-args`. Offline disagg replay instead uses
`--prefill-engine-args` plus `--decode-engine-args`, together with
`--num-prefill-workers` and `--num-decode-workers`.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
    --num-workers 4 \
    --replay-mode offline \
    --router-mode kv_router \
    --arrival-speedup-ratio 5 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}' \
    --router-config '{"router_queue_policy":"fcfs"}' \
    --report-json /tmp/replay-report.json
```

The same CLI also supports synthetic replay without a trace file:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 1000 \
    --arrival-interval-ms 1.0 \
    --num-workers 1 \
    --replay-mode offline \
    --replay-concurrency 100 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

Synthetic replay also supports workload-style generation for shared-prefix and multi-turn tests:

```bash
python -m dynamo.replay \
    --input-tokens 5000 \
    --output-tokens 500 \
    --request-count 200 \
    --turns-per-session 3 \
    --shared-prefix-ratio 0.5 \
    --num-prefix-groups 8 \
    --inter-turn-delay-ms 250 \
    --replay-mode offline \
    --replay-concurrency 32 \
    --extra-engine-args '{"block_size":512}' \
    --report-json /tmp/replay-report.json
```

For trace files, replay also understands multi-turn sessions when records share `session_id`. The
first turn uses `timestamp`/`created_time`; later turns can use `delay` or `delay_ms`:

```json
{"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]}
{"session_id":"session-a","delay":250,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]}
```
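
A trace in this shape can also be generated programmatically. The following is a minimal sketch that assumes only the fields shown in the example records; the helper names (`build_session`, `to_jsonl`, `group_by_session`) are hypothetical and are not part of the replay CLI.

```python
import json
from collections import defaultdict


def build_session(session_id, first_timestamp, turns, delay=250):
    """Build records for one session.

    turns is a list of (input_length, output_length, hash_ids) tuples.
    The first turn carries `timestamp`; later turns carry `delay`.
    """
    records = []
    for index, (input_length, output_length, hash_ids) in enumerate(turns):
        record = {"session_id": session_id}
        if index == 0:
            record["timestamp"] = first_timestamp
        else:
            record["delay"] = delay
        record.update(
            input_length=input_length,
            output_length=output_length,
            hash_ids=hash_ids,
        )
        records.append(record)
    return records


def to_jsonl(records):
    """Serialize records as compact JSON, one record per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records) + "\n"


def group_by_session(jsonl_text):
    """Parse JSONL text and group records by their session_id."""
    sessions = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            sessions[record["session_id"]].append(record)
    return dict(sessions)


records = build_session(
    "session-a",
    1000,
    [(2048, 128, [1, 2, 3, 4]), (2560, 128, [1, 2, 3, 4, 5])],
)
text = to_jsonl(records)
print(text, end="")
```

Writing `text` to a `.jsonl` file produces input in the same shape as the two example records.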

For trace-file replay, `--trace-block-size` controls how many tokens each `hash_id` represents in
the dataset, while engine `block_size` still controls the replay engine and router hashing. Public
Mooncake/toolagent traces use `--trace-block-size 512`; engine `block_size` can still stay at `64`
to match the live runtime configuration.
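
The relationship between the two block sizes is plain arithmetic. The sketch below is illustrative only: `describe_trace_record` is a hypothetical helper, and the way replay expands each `hash_id` internally may differ.

```python
import math


def describe_trace_record(input_length, hash_ids, trace_block_size=512, engine_block_size=64):
    """Relate dataset-level hash_ids to engine-level KV blocks for one record."""
    return {
        # Tokens the dataset's hash_ids can describe at the trace block size.
        "tokens_covered_by_hash_ids": len(hash_ids) * trace_block_size,
        # KV blocks the engine needs for the prompt at its own block size.
        "engine_blocks": math.ceil(input_length / engine_block_size),
        # Engine blocks per trace block (assumes the sizes divide evenly).
        "engine_blocks_per_hash_id": trace_block_size // engine_block_size,
    }


summary = describe_trace_record(input_length=2048, hash_ids=[1, 2, 3, 4])
print(summary)
```

With the defaults above, one 512-token trace block spans eight 64-token engine blocks.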

The standalone replay CLI prints an AIPerf-style summary table to stdout and writes the full replay
report JSON to disk.

Timing semantics:

- trace mode honors first-turn timestamps and inter-turn delays
- concurrency mode ignores first-turn timestamps but still enforces inter-turn delays
- in concurrency mode, TTFT is measured from actual dispatch under the in-flight cap

For full usage, constraints, and benchmarking guidance, see [Mocker Trace Replay](../benchmarks/mocker-trace-replay.md).

Replay supports aggregated `vllm` and `sglang` engine configs. Internally replay uses canonical
`block_size`; for `sglang`, `sglang.page_size` is still accepted as a compatibility alias as long
as it matches `block_size` when both are provided.

Offline replay also supports disaggregated `kv_router` mode. In that mode:

- `--prefill-engine-args` must describe a prefill worker
- `--decode-engine-args` must describe a decode worker
- `--router-mode` must be `kv_router`
- only offline replay is supported

Example:

```bash
python -m dynamo.replay \
    --input-tokens 4096 \
    --output-tokens 256 \
    --request-count 100 \
    --replay-mode offline \
    --router-mode kv_router \
    --replay-concurrency 32 \
    --num-prefill-workers 2 \
    --num-decode-workers 6 \
    --prefill-engine-args '{"worker_type":"prefill","block_size":512}' \
    --decode-engine-args '{"worker_type":"decode","block_size":512}' \
    --router-config '{"router_queue_policy":"wspt"}' \
    --report-json /tmp/replay-report.json
```

## Performance Modeling Setup

By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, pass `--planner-profile-data` with either:

- a mocker-format `.npz` file, or
- a profiler output directory

The mocker automatically accepts profiler-style results directories and converts them internally.

It also accepts older raw-data directories containing:

- `prefill_raw_data.json`
- `decode_raw_data.json`

```bash
python -m dynamo.mocker \
    --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
    --planner-profile-data components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D \
    --speedup-ratio 1.0
```

### AIC Performance Model

To use the AIC SDK for latency prediction:

```bash
uv pip install '.[mocker]'

python -m dynamo.mocker \
    --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
    --engine-type vllm \
    --aic-perf-model \
    --aic-system h200_sxm
```

The AIC model automatically uses `--model-path` and `--engine-type` to select the appropriate performance data. Available systems include `h200_sxm`, `h100_sxm`, etc. (see AIC SDK documentation for the full list).

Important notes:

- AIC is opt-in. If you do not pass `--aic-perf-model`, `python -m dynamo.mocker` does not use AIC.
- `python -m dynamo.replay` has two separate AIC surfaces:
  - engine timing AIC through `--extra-engine-args` / staged engine JSON
  - router-side prefill-load AIC through top-level `--aic-*` flags plus `router_prefill_load_model="aic"` in `--router-config`
- The Python AIC session bridge is now shared with the live KV router path via the internal `dynamo._internal.aic` module. Mocker CLI behavior is unchanged; this just removes duplicate AIC session code.
- `aiconfigurator` must be able to load the requested performance database for the selected `system/backend/version`. If the SDK is installed but the backing systems data is missing or unreadable, mocker now fails fast at startup with a clear error instead of failing later on first request.
- In development environments, this may require pointing Python at a source checkout of `aiconfigurator` with real Git LFS payloads materialized in its `systems/` directory.

This mocker AIC path is separate from the router-side prefill-load estimator. Live router,
frontend, and replay all use `router_prefill_load_model="aic"` plus top-level `--aic-*` flags for
oldest-prefill prompt-load decay. Replay still uses engine-args AIC separately when you want the
mocked worker timing model itself to come from AIC.

For aggregated replay, engine timing AIC still comes from `--extra-engine-args`:

```bash
python -m dynamo.replay /path/to/trace.jsonl \
    --extra-engine-args '{"aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1}'
```

For offline disagg replay, pass the staged engine configs instead:

```bash
python -m dynamo.replay /path/to/trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --prefill-engine-args '{"worker_type":"prefill","aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1,"block_size":512}' \
    --decode-engine-args '{"worker_type":"decode","aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1,"block_size":512}' \
    --num-prefill-workers 2 \
    --num-decode-workers 6
```

The `aic_backend` field enables the AIC perf model and should match `engine_type` (`"vllm"` or `"sglang"`). The `aic_model_path` field is the equivalent of `--model-path` in `dynamo.mocker`.

Replay router-side AIC prompt-load modeling is configured separately with top-level flags:

```bash
python -m dynamo.replay /path/to/trace.jsonl \
    --replay-mode offline \
    --router-mode kv_router \
    --num-workers 4 \
    --trace-block-size 512 \
    --extra-engine-args '{"block_size":64}' \
    --router-config '{"router_track_prefill_tokens":true,"router_prefill_load_model":"aic"}' \
    --aic-backend vllm \
    --aic-system h200_sxm \
    --aic-model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
    --aic-tp-size 1
```

For offline disagg replay, the same top-level `--aic-*` flags drive the prefill-stage router only;
the decode-stage router keeps prompt tracking disabled.

Example `--reasoning` configuration:

```bash
python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
    --reasoning '{"start_thinking_token_id":123,"end_thinking_token_id":456,"thinking_ratio":0.6}'
```

The profile results directory should contain:

- `selected_prefill_interpolation/raw_data.npz`
- `selected_decode_interpolation/raw_data.npz`

To generate profile data for your own model and hardware, run the profiler and then point `--planner-profile-data` at the resulting output directory.

## Event Transport and Router Testing

The default event path uses the local indexer / event-plane subscriber flow. The older durable KV-events mode is still available through `--durable-kv-events`, but it is deprecated and should not be the preferred setup for new tests.

For router and indexer experiments that need native wire-format event forwarding, the mocker also supports a ZMQ path:

- `--event-plane zmq`
- `--zmq-kv-events-ports` for per-worker PUB base ports
- `--zmq-replay-ports` for optional replay/gap-recovery ROUTER base ports

When set, each worker binds on its base port plus `dp_rank`, so the number of comma-separated base ports must match `--num-workers`.
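
The port layout can be sketched as follows. This is an illustration of the rule stated above, not the mocker's own argument parsing; `zmq_bind_ports` and the port numbers are hypothetical.

```python
def zmq_bind_ports(base_ports, num_workers, data_parallel_size=1):
    """Return the bind ports for each worker: base port + dp_rank per DP rank."""
    bases = [int(port) for port in base_ports.split(",") if port.strip()]
    if len(bases) != num_workers:
        raise ValueError(
            f"expected {num_workers} base ports (one per worker), got {len(bases)}"
        )
    return [
        [base + dp_rank for dp_rank in range(data_parallel_size)]
        for base in bases
    ]


# Two workers, each with two DP ranks, using made-up base ports.
layout = zmq_bind_ports("5557,5567", num_workers=2, data_parallel_size=2)
print(layout)
```

Spacing the base ports at least `--data-parallel-size` apart keeps the per-rank ports of different workers from colliding.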

## Disaggregation Port Layout

`--bootstrap-ports` takes a comma-separated list of base ports, one per worker. In multi-worker mode, the number of listed ports must exactly match `--num-workers`.

Prefill workers listen on these ports and publish the bootstrap endpoint through discovery. Decode workers use the matching ports to rendezvous before decode begins.

## Kubernetes Deployment

The mocker can be deployed through example `DynamoGraphDeployment` manifests for both aggregated and disaggregated setups:

```bash
kubectl apply -f examples/backends/mocker/deploy/agg.yaml
kubectl apply -f examples/backends/mocker/deploy/disagg.yaml
```

## Architecture

The mocker is organized into several cooperating components that mirror the internal architecture of production LLM inference engines. The scheduler (vLLM-style and SGLang-style variants) and KV block manager live inside the engine core. Multi-engine behavior — KV transfer/offloading simulation, KV router simulation, planner simulation — is added by the replay harness on top of multiple engine cores; see [Mocker Trace Replay](../benchmarks/mocker-trace-replay.md) for the component-level diagram and for offline replay internals under [`lib/mocker/src/replay/offline/`](../../lib/mocker/src/replay/offline/README.md).

### Scheduler

The mocker now has two scheduler shapes rather than one generic queue model:

- **vLLM mocker** uses an upstream-style `waiting + running` scheduler. Each request tracks
  computed tokens, the scheduler spends one token budget across the running set first, and decode
  pressure triggers inline preemption of running requests.
- **SGLang mocker** uses a cache-aware waiting/running scheduler around a radix-style prefix cache.
  It batches prefill work with decode-state awareness and handles pressure primarily through decode
  retraction while preserving cached prefixes.

Both schedulers simulate continuous batching, prefix reuse, chunked prefill, memory pressure, and
decode token emission while publishing metrics about current resource utilization.

When resources become constrained, the mocker simulates the engine's real recovery path:
- vLLM-style decode preemption and recompute
- SGLang-style decode retraction plus prefix-preserving cache updates
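
As a rough illustration of the vLLM-style token budget described in this section, the sketch below schedules one step at a time: running requests are served first, then waiting requests are admitted with whatever budget remains, which yields chunked prefill. It is a simplified Python model, not the mocker's Rust scheduler; `Request` and `schedule_step` are hypothetical names, and preemption, prefix caching, and KV block accounting are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int
    computed_tokens: int = 0

    @property
    def remaining_prefill(self):
        return max(self.prompt_tokens - self.computed_tokens, 0)


def schedule_step(running, waiting, max_num_batched_tokens, max_num_seqs):
    """Schedule one step and return {request name: tokens scheduled}."""
    budget = max_num_batched_tokens
    scheduled = {}
    # Running requests first: the next prefill chunk, or one decode token.
    for req in running:
        if budget == 0:
            break
        take = min(req.remaining_prefill or 1, budget)
        scheduled[req.name] = take
        budget -= take
    # Then admit waiting requests while budget and sequence slots remain.
    while waiting and budget > 0 and len(running) < max_num_seqs:
        req = waiting.pop(0)
        take = min(req.remaining_prefill, budget)
        scheduled[req.name] = take
        budget -= take
        running.append(req)
    for req in running:
        req.computed_tokens += scheduled.get(req.name, 0)
    return scheduled


running = [Request("a", prompt_tokens=100, computed_tokens=100)]  # already decoding
waiting = [Request("b", prompt_tokens=300), Request("c", prompt_tokens=50)]
step1 = schedule_step(running, waiting, max_num_batched_tokens=256, max_num_seqs=8)
step2 = schedule_step(running, waiting, max_num_batched_tokens=256, max_num_seqs=8)
print(step1)
print(step2)
```

In the first step, request `b` receives only part of its prompt because the budget runs out; it finishes prefill in the second step, when `c` is also admitted.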

### KV Block Manager

The mocker's KV block manager is now built on [`kvbm-logical::BlockManager<G1>`](../../lib/kvbm-logical/), the same logical block manager the real Dynamo runtime uses. The mocker wraps it in [`lib/mocker/src/kv_manager/kvbm_backend.rs`](../../lib/mocker/src/kv_manager/kvbm_backend.rs) and translates its own `MoveBlock` protocol onto kvbm-logical's RAII lifecycle (`allocate → stage → register → drop`).

Blocks still conceptually live in one of two pools:

- **Active** — blocks currently held by at least one sequence. Partial (still-filling) blocks are held as `MutableBlock<G1>`; full blocks are held as `ImmutableBlock<G1>` clones (the clone vec length is the mocker's refcount, one per `Use`).
- **Inactive** — blocks no longer referenced by any sequence but kept for prefix-cache reuse. Handled entirely by kvbm-logical's inactive pool; the mocker never tracks them manually.

The lifecycle is RAII: dropping the last `ImmutableBlock` clone transitions the block from active to inactive (kvbm-logical's `reset` pool), with no explicit `deref`/`evict` bookkeeping on the mocker side. When a sequence completes or is preempted, the mocker simply drops its handles; kvbm-logical recovers the capacity.

```mermaid
stateDiagram-v2
    [*] --> Active : allocate + stage + register
    Active --> Inactive : last handle dropped (RAII)
    Inactive --> Active : match_blocks(PLH) reuse
    Inactive --> Freed : evicted by backend
    Active --> Freed : explicit Removed (Destroy)
    Freed --> [*]

    state Active {
        [*] --> Partial : MutableBlock<G1>
        Partial --> Full : promote (PLH / SequenceHash)
        [*] --> Full : ImmutableBlock<G1> clones
    }
```

Three `Use` outcomes are tracked for KV-event emission: `ActiveHit` (bump refcount on an already-pinned block), `InactiveHit` (reactivate via `match_blocks(plh)`), and `NewStore` (fresh allocation). Only `NewStore` emits a `Stored` KV event — the router radix tree already knows about the other two and only forgets on explicit `Removed`.
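
The three outcomes can be modeled with a small pool that keeps a refcount per active block and a set of inactive blocks. This is a simplified Python sketch of the classification only; `BlockPool` is a hypothetical stand-in and does not reflect the kvbm-logical API.

```python
class BlockPool:
    """Toy model of the active/inactive pools and Stored-event emission."""

    def __init__(self):
        self.active = {}       # block hash -> refcount
        self.inactive = set()  # block hashes kept for prefix-cache reuse
        self.events = []       # KV events that would be published

    def use(self, block_hash):
        if block_hash in self.active:
            # Already pinned by another sequence: bump the refcount.
            self.active[block_hash] += 1
            return "ActiveHit"
        if block_hash in self.inactive:
            # Reactivate a cached block; the router already knows about it.
            self.inactive.remove(block_hash)
            self.active[block_hash] = 1
            return "InactiveHit"
        # Fresh allocation: the only case that emits a Stored event.
        self.active[block_hash] = 1
        self.events.append(("Stored", block_hash))
        return "NewStore"

    def release(self, block_hash):
        """Drop one handle; the last drop moves the block to inactive."""
        self.active[block_hash] -= 1
        if self.active[block_hash] == 0:
            del self.active[block_hash]
            self.inactive.add(block_hash)


pool = BlockPool()
outcomes = [pool.use(1), pool.use(1)]
pool.release(1)
pool.release(1)
outcomes.append(pool.use(1))
print(outcomes, pool.events)
```

Only the first `use` of a hash records a `Stored` event in this model; reactivating the block after both handles are released does not.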

### Eviction Backends

The kvbm-logical inactive pool selects eviction victims via one of three backends, exposed as `MockerEvictionBackend` in [`lib/mocker/src/common/protocols.rs`](../../lib/mocker/src/common/protocols.rs):

- **`Lineage`** (default) — parent-chain aware: evicts leaf blocks first, preserving shared prefix chains. Subsumes the preemption-priority behavior the old hand-rolled `LRUEvictor::push_front` used to provide.
- **`Lru`** — plain recency-based LRU.
- **`MultiLru`** — 4-tier frequency-aware LRU built on a TinyLFU tracker.

All three give the same "suffix blocks evicted before shared prefixes" outcome that the old evictor was designed to produce; `Lineage` does it structurally (via the block parent chain) rather than via monotonic counters.
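
To make the leaf-first idea concrete, here is a minimal Python sketch of a lineage-style eviction order over a set of inactive blocks. It is an illustration under stated assumptions rather than the `Lineage` backend itself: `lineage_eviction_order` is a hypothetical function, and breaking ties among leaves by oldest use is an assumption.

```python
def lineage_eviction_order(inactive):
    """Return block ids in eviction order, always evicting a leaf block.

    inactive maps block_id -> (parent_id or None, last_used_tick).
    A leaf is a block that no other remaining block names as its parent.
    """
    remaining = dict(inactive)
    order = []
    while remaining:
        parents = {parent for parent, _ in remaining.values() if parent is not None}
        leaves = [block for block in remaining if block not in parents]
        # Assumption: among leaves, evict the least recently used first.
        victim = min(leaves, key=lambda block: remaining[block][1])
        order.append(victim)
        del remaining[victim]
    return order


# Shared prefix block "A" with two suffix blocks "B" and "C".
order = lineage_eviction_order({"A": (None, 1), "B": ("A", 2), "C": ("A", 3)})
print(order)
```

The shared prefix `A` is evicted last even though it has the oldest tick, because it only becomes a leaf once both suffix blocks are gone.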

### Sequence Tracking

Each active request is tracked as a sequence, managing its token blocks and generation state. As tokens are generated, the sequence tracks which blocks are partial (`MutableBlock<G1>`, still being filled) versus full (`ImmutableBlock<G1>`, complete and hashable for prefix caching). When a partial block fills up, it gets "promoted" to a full block with a content-based `SequenceHash` (or collapses onto an existing registered handle if the PLH is already present), enabling future cache hits from requests with matching prefixes.
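
The promotion step can be illustrated with a toy sequence that hashes each full block together with its parent's hash, so that two requests with the same prefix produce the same block hashes. This is a simplified Python sketch; the `Sequence` class and the SHA-256 chaining are assumptions made for illustration and are not the mocker's actual `SequenceHash` computation.

```python
import hashlib


class Sequence:
    """Tracks full-block hashes plus the tokens of the still-filling block."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.full_block_hashes = []  # one chained hash per completed block
        self.partial = []            # tokens in the partial block

    def append(self, token):
        self.partial.append(token)
        if len(self.partial) == self.block_size:
            self._promote()

    def _promote(self):
        # Chain the parent hash so a block's identity depends on its prefix.
        parent = self.full_block_hashes[-1] if self.full_block_hashes else ""
        payload = parent + "|" + ",".join(str(t) for t in self.partial)
        self.full_block_hashes.append(hashlib.sha256(payload.encode()).hexdigest())
        self.partial = []


first = Sequence(block_size=4)
second = Sequence(block_size=4)
for token in [1, 2, 3, 4, 5, 6, 7, 8, 9]:
    first.append(token)
for token in [1, 2, 3, 4, 5, 6, 7, 0]:
    second.append(token)
print(first.full_block_hashes[0] == second.full_block_hashes[0])
```

Both sequences share the hash of their first block, while their second blocks differ because the token contents diverge.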

### Performance Model

The mocker supports three timing prediction modes:

**Polynomial Model (Default):** Uses hardcoded polynomial formulas that approximate typical GPU behavior. Prefill time scales quadratically with token count, while decode time depends on the total active KV cache size.

**Interpolated Model:** Loads actual profiling data from an NPZ file containing measured prefill and decode latencies. The mocker interpolates between data points to predict timing for any input size. This enables high-fidelity simulation matching a specific hardware configuration.

**AIC Model (`--aic-perf-model`):** Uses the NVIDIA AI Configurator (AIC) SDK for latency prediction. AIC provides calibrated performance models for specific GPU/model/engine combinations, predicting prefill and decode latency as a function of batch size, sequence length, and prefix cache hits. The model path is automatically derived from `--model-path`, and the engine type from `--engine-type`. This mode is opt-in and requires both the `aiconfigurator` SDK and loadable systems/perf data for the requested tuple.
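
To show what interpolating between profiled data points means in practice, the sketch below performs piecewise-linear interpolation over a handful of measurements. The data points are made up, `interpolate` is a hypothetical helper, and the mocker's actual interpolation scheme (dimensions, extrapolation behavior) may differ.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation over sorted xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical profile: prefill latency (ms) measured at three prompt lengths.
prompt_lengths = [128, 512, 2048]
prefill_ms = [10.0, 40.0, 200.0]

estimate = interpolate(prompt_lengths, prefill_ms, 1280)
print(estimate)
```

Inputs outside the profiled range are clamped to the nearest measurement here; that choice is part of the sketch, not a documented mocker behavior.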

### Bootstrap Rendezvous (Disaggregated Serving)

For disaggregated prefill/decode deployments, prefill and decode workers coordinate via a simple TCP-based rendezvous protocol. The decode worker connects to the prefill worker's bootstrap port and waits until the prefill phase completes and KV cache is ready. Either side can arrive first—the rendezvous completes when both are ready.
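
The arrival-order property can be demonstrated with an in-process stand-in that uses two events instead of a TCP connection. This is only a sketch of the rendezvous semantics described above; the `Rendezvous` class and `run_both` helper are hypothetical and do not mirror the mocker's wire protocol.

```python
import threading


class Rendezvous:
    """Completes once the prefill side is done and the decode side has arrived."""

    def __init__(self):
        self._prefill_done = threading.Event()
        self._decode_arrived = threading.Event()

    def prefill_complete(self, timeout=None):
        """Called by the prefill side when the KV cache is ready."""
        self._prefill_done.set()
        return self._decode_arrived.wait(timeout)

    def decode_wait(self, timeout=None):
        """Called by the decode side; blocks until prefill has completed."""
        self._decode_arrived.set()
        return self._prefill_done.wait(timeout)


def run_both(order):
    """Start both sides in the given order and report whether each completed."""
    rendezvous = Rendezvous()
    results = {}

    def prefill():
        results["prefill"] = rendezvous.prefill_complete(timeout=5)

    def decode():
        results["decode"] = rendezvous.decode_wait(timeout=5)

    targets = {"prefill": prefill, "decode": decode}
    threads = [threading.Thread(target=targets[name]) for name in order]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


print(run_both(["prefill", "decode"]))
print(run_both(["decode", "prefill"]))
```

Both orderings complete, since each side sets its own event before waiting on the other.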

### KV Transfer Latency Simulation

The mocker simulates KV cache transfer time between prefill and decode workers. Before the prefill worker emits its first (and only) token, it sleeps for a duration based on:

- **kv_bytes_per_token** (auto-computed from model config): `num_layers * 2 * num_kv_heads * head_dim * dtype_bytes`. The `dtype_bytes` is determined by `--kv-cache-dtype`: when set to `auto` (default), it uses the model's `dtype` from config; when explicitly set (e.g., `fp8`), it uses the specified dtype instead. It can also be overridden directly with `--kv-bytes-per-token`.
- **kv_transfer_bandwidth** (default: 64.0 GB/s, inter-node InfiniBand)
- **Transfer time**: `num_input_tokens * kv_bytes_per_token / bandwidth`, with the bandwidth expressed in bytes per second

This delay is injected after the scheduler's prefill compute simulation completes, modeling the sequential flow: prefill computation → KV transfer → decode begins. Set `--kv-transfer-bandwidth 0` to disable.
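
The formulas above can be checked with a few lines of Python. This is a worked example, not the mocker's code: the function names and the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte dtype) are hypothetical, and 1 GB is assumed to mean 1e9 bytes.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV cache bytes per token: one K and one V tensor per layer."""
    return num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


def kv_transfer_seconds(num_input_tokens, bytes_per_token, bandwidth_gb_s=64.0):
    """Simulated transfer delay in seconds; a bandwidth of 0 disables it."""
    if bandwidth_gb_s <= 0:
        return 0.0
    # Assumption: 1 GB/s = 1e9 bytes per second.
    return num_input_tokens * bytes_per_token / (bandwidth_gb_s * 1e9)


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
delay = kv_transfer_seconds(num_input_tokens=4096, bytes_per_token=per_token)
print(per_token, f"{delay * 1e3:.3f} ms")
```

For this shape, each token occupies 131072 bytes of KV cache, so a 4096-token prompt takes roughly 8.4 ms to transfer at the default 64 GB/s.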

## Integration with Dynamo

### KV Event Publishing

When prefix caching is enabled, the mocker publishes KV cache events to the distributed runtime. These events notify the system when blocks are stored (new content cached) or removed (evicted). This enables the KV-aware router to make intelligent routing decisions based on which workers have which prefixes cached.

### Metrics Publishing

Each scheduler publishes metrics about its current state, including the number of active decode blocks per DP rank. The router uses these metrics for load-aware routing decisions.

## Testing Scenarios

The mocker is particularly useful for:

1. **Router Testing** - Validate KV-aware routing without GPUs
2. **Planner Testing** - Test SLA-based planners with realistic timing
3. **Fault Tolerance** - Test request migration, graceful shutdown
4. **Disaggregation** - Test P/D separation and KV transfer coordination
5. **Performance Modeling** - Prototype scheduling policies
6. **CI/CD** - Fast integration tests without hardware dependencies

## Comparison with Real Engines

| Feature | Real Engine | Mocker |
|---------|-------------|--------|
| GPU Required | Yes | No |
| Block Manager | Paged KV cache | Simulated blocks |
| Scheduler | Continuous batching | Continuous batching |
| Prefix Caching | Hash-based | Hash-based |
| Chunked Prefill | Supported | Supported |
| Preemption | Recompute/swap | Recompute (simulated) |
| Timing | Real execution | Model-based |
| KV Events | Native | Compatible |
| Data Parallelism | Multi-GPU | Simulated |

## Next Steps

| Document | Description |
|----------|-------------|
| [Benchmarking Dynamo Deployments](../benchmarks/benchmarking.md) | Run AIPerf against a mocker-backed deployment to measure latency, TTFT, throughput, and scaling behavior |
| [Aggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/agg.yaml) | Deploy a mocker-backed aggregated DynamoGraphDeployment on Kubernetes |
| [Disaggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/disagg.yaml) | Deploy separate prefill and decode mocker workers for disaggregated-serving benchmarks |
| [Global Planner Mocker Example](../../examples/global_planner/global-planner-mocker-test.yaml) | Advanced multi-pool mocker setup for planner and global-router experiments |

## Feature Gaps (WIP)

> For the broader mocker enhancement roadmap, see [#6383](https://github.com/ai-dynamo/dynamo/issues/6383).

The following features are not yet supported by the mocker:

- **Multi-tier memory** - No support for offloading KV cache to CPU/disk or onboarding back to GPU; potential future integration with KVBM
- **Multimodal support** - Currently only simulates text token processing; no vision encoder or cross-attention simulation
- **Native Rust reference counting** - Work in progress to use native Rc/Arc for block reference counting, enabling natural RAII patterns for simpler tracking