docs: add mocker under testing (#7488)

Signed-off-by: PeaBrane <yanrpei@gmail.com>

Unverified commit 3357c53f, authored Mar 18, 2026 by Yan Ru Pei, committed by GitHub Mar 18, 2026
Showing 3 changed files with 111 additions and 87 deletions

components/src/dynamo/mocker/README.md +6 -83

docs/index.yml +4 -0

docs/mocker/mocker.md +101 -4

--- a/components/src/dynamo/mocker/README.md
+++ b/components/src/dynamo/mocker/README.md
 # Mocker engine
-The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:
+The canonical user-facing documentation for the mocker lives at
+[`docs/mocker/mocker.md`](../../../../docs/mocker/mocker.md).
-- Testing distributed system components without GPU resources
+Useful adjacent references:
-- Benchmarking infrastructure and networking overhead
-- Developing and debugging Dynamo components
-- Load testing and performance analysis
-## Basic usage
+- Aggregated deployment example: [`examples/backends/mocker/deploy/agg.yaml`](../../../../examples/backends/mocker/deploy/agg.yaml)
+- Disaggregated deployment example: [`examples/backends/mocker/deploy/disagg.yaml`](../../../../examples/backends/mocker/deploy/disagg.yaml)
-The mocker engine now supports a vLLM-style CLI interface with individual arguments for all configuration options.
+- Global planner mocker example: [`examples/global_planner/global-planner-mocker-test.yaml`](../../../../examples/global_planner/global-planner-mocker-test.yaml)
-### Required arguments:
-- `--model-path`: Path to model directory or HuggingFace model ID (required for tokenizer)
-### MockEngineArgs parameters (vLLM-style):
-- `--num-gpu-blocks-override`: Number of GPU blocks for KV cache (default: 16384)
-- `--block-size`: Token block size for KV cache blocks (default: 64)
-- `--max-num-seqs`: Maximum number of sequences per iteration (default: 256)
-- `--max-num-batched-tokens`: Maximum number of batched tokens per iteration (default: 8192)
-- `--enable-prefix-caching` / `--no-enable-prefix-caching`: Enable/disable automatic prefix caching (default: True)
-- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`: Enable/disable chunked prefill (default: True)
-- `--preemption-mode`: Preemption mode for decode eviction under memory pressure: `lifo` (default, matches vLLM v1) or `fifo`
-- `--speedup-ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster. Use `0` for infinite speedup (no simulation delays)
-- `--decode-speedup-ratio`: Additional speedup multiplier applied only to decode steps (default: 1.0). Models speculative decoding (e.g. Eagle) where decode throughput improves without affecting prefill latency. Effective decode speedup is `speedup_ratio * decode_speedup_ratio`
-- `--data-parallel-size`: Number of data parallel workers to simulate (default: 1)
-- `--num-workers`: Number of mocker workers to launch in the same process (default: 1). All workers share the same tokio runtime and thread pool
-- `--stagger-delay`: Delay in seconds between launching each worker to avoid overwhelming etcd/NATS/frontend. Set to 0 to disable staggering. Use -1 for auto mode (stagger dependent on number of workers). Default: -1 (auto)
-- `--disaggregation-mode prefill` / `--disaggregation-mode decode`: Whether the worker is a prefill or decode worker for disaggregated deployment. If not specified, mocker will be in aggregated mode.
-- `--kv-transfer-bandwidth`: KV cache transfer bandwidth in GB/s for disaggregated serving latency simulation (default: 64.0, inter-node InfiniBand). Set to 0 to disable. For intra-node NVLink, typical value is ~450.
-- `--kv-cache-dtype`: Data type for KV cache, used to compute kv_bytes_per_token. "auto" uses the model's torch dtype (default).
-- `--kv-bytes-per-token`: KV cache bytes per token. If not specified, auto-computed from model config.
-**Environment variables:**
-- `DYN_MOCKER_KV_CACHE_TRACE`: Set to `1` or `true` to log structured KV cache allocation/eviction trace (timestamp_ms, block_ids, etc.). Default: off.
-### Example with individual arguments (vLLM-style):
-```bash
-# Start mocker with custom configuration
-python -m dynamo.mocker \
-  --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
-  --num-gpu-blocks-override 8192 \
-  --block-size 16 \
-  --speedup-ratio 10.0 \
-  --max-num-seqs 512 \
-  --num-workers 4 \
-  --enable-prefix-caching
-# Start frontend server
-python -m dynamo.frontend --http-port 8000
-```
-> [!Note]
-> Each mocker instance runs as a single process, and each DP worker (specified by `--data-parallel-size`) is spawned as a lightweight async task within that process. For benchmarking (e.g., router testing), you can use `--num-workers` to launch multiple mocker engines in the same process, which is more efficient than launching separate processes since they all share the same tokio runtime and thread pool.
-## Performance modeling with planner profile data
-By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, you can load performance data from actual profiling results using `--planner-profile-data`:
-```bash
-python -m dynamo.mocker \
-  --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
-  --planner-profile-data tests/planner/profiling_results/H200_TP1P_TP1D \
-  --speedup-ratio 1.0
-```
-The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.
-To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler-guide.md) for details):
-```bash
-python components/src/dynamo/profiler/profile_sla.py \
-  --profile-config your_profile_config.yaml
-```
-Then use the resulting profile results directory directly with `--planner-profile-data`.
-## Deploying Mocker in K8s
-We provide the example DGD yaml configurations for aggregated and disaggregated deployment in `examples/backends/mocker/deploy/`. You can deploy the mocker engine in K8s by running:
-```bash
-kubectl apply -f examples/backends/mocker/deploy/agg.yaml # or, for disaggregated
-kubectl apply -f examples/backends/mocker/deploy/disagg.yaml
-```
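Reviewer note on the speedup flags: the removed README text says that the effective decode speedup is `speedup_ratio * decode_speedup_ratio` and that `--speedup-ratio 0` means no simulation delays. The new table in `docs/mocker/mocker.md` keeps both flags but no longer spells out how they combine. A minimal Python sketch of that arithmetic follows. The function name `simulated_delay_ms` is hypothetical, and dividing the base delay by the multiplier is an assumption about how a "speed multiplier" is applied, not the mocker's actual code.

```python
def simulated_delay_ms(
    base_ms: float,
    speedup_ratio: float = 1.0,
    decode_speedup_ratio: float = 1.0,
    is_decode: bool = False,
) -> float:
    """Scale a modeled step delay by the configured speedup (hypothetical helper)."""
    # Per the README text, a speedup ratio of 0 means "infinite speedup".
    if speedup_ratio == 0:
        return 0.0
    # The decode-only multiplier stacks on top of the global one for decode steps.
    effective = speedup_ratio * (decode_speedup_ratio if is_decode else 1.0)
    if effective <= 0:
        raise ValueError("effective speedup must be positive")
    # Assumption: a speed multiplier shortens the delay proportionally.
    return base_ms / effective
```

Under these assumptions, a prefill step modeled at 100 ms with `--speedup-ratio 10` takes 10 ms, and a decode step with an additional `--decode-speedup-ratio 2` takes 5 ms.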
--- a/docs/index.yml
+++ b/docs/index.yml
@@ -93,6 +93,10 @@ navigation:
        path: components/kvbm/kvbm-guide.md
      - page: Dynamo Benchmarking
        path: benchmarks/benchmarking.md
+      - section: Testing
+        contents:
+          - page: Mocker
+            path: mocker/mocker.md
      - section: Multimodal
        path: features/multimodal/README.md
        contents:

--- a/docs/mocker/mocker.md
+++ b/docs/mocker/mocker.md
@@ -71,28 +71,116 @@ python -m dynamo.mocker \
 | Argument | Default | Description |
 |----------|---------|-------------|
 | `--model-path` | Required | HuggingFace model ID or local path for tokenizer |
-| `--endpoint` | `dyn://dynamo.backend.generate` | Dynamo endpoint string |
+| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
 | `--model-name` | Derived from model-path | Model name for API responses |
 | `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
 | `--block-size` | 64 | Tokens per KV cache block |
 | `--max-num-seqs` | 256 | Maximum concurrent sequences |
 | `--max-num-batched-tokens` | 8192 | Maximum tokens per batch |
 | `--enable-prefix-caching` | True | Enable prefix caching |
+| `--no-enable-prefix-caching` | - | Disable prefix caching |
 | `--enable-chunked-prefill` | True | Enable chunked prefill |
+| `--no-enable-chunked-prefill` | - | Disable chunked prefill |
+| `--preemption-mode` | `lifo` | Decode eviction policy under memory pressure: `lifo` (vLLM v1 style) or `fifo` |
 | `--watermark` | 0.01 | KV cache watermark (fraction reserved) |
 | `--speedup-ratio` | 1.0 | Timing speedup factor |
 | `--decode-speedup-ratio` | 1.0 | Decode-only speedup multiplier (e.g. for Eagle speculation) |
 | `--data-parallel-size` | 1 | Number of DP replicas |
 | `--startup-time` | None | Simulated startup delay (seconds) |
-| `--planner-profile-data` | None | Path to NPZ file with timing data |
+| `--planner-profile-data` | None | Path to either a mocker-format `.npz` file or a profiler results directory |
 | `--num-workers` | 1 | Workers per process |
+| `--reasoning` | None | JSON config for emitting reasoning token spans, with `start_thinking_token_id`, `end_thinking_token_id`, and `thinking_ratio` |
+| `--engine-type` | `vllm` | Engine simulation type: `vllm` or `sglang` |
+| `--sglang-schedule-policy` | `fifo` / `fcfs` | SGLang scheduling policy override |
+| `--sglang-page-size` | 1 | SGLang radix-cache page size in tokens |
+| `--sglang-max-prefill-tokens` | 16384 | SGLang max prefill-token budget per batch |
+| `--sglang-chunked-prefill-size` | 8192 | SGLang chunked-prefill chunk size |
+| `--sglang-clip-max-new-tokens` | 4096 | SGLang admission-budget cap for max new tokens |
+| `--sglang-schedule-conservativeness` | 1.0 | SGLang schedule conservativeness factor |
+| `--extra-engine-args` | None | Path to a JSON file with mocker configuration; overrides individual CLI arguments |
 | `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
 | `--disaggregation-mode` | `agg` | Worker mode: `agg` (aggregated), `prefill`, or `decode` |
-| `--durable-kv-events` | False | Enable durable KV events via JetStream (disables local indexer) |
+| `--durable-kv-events` | False | Deprecated JetStream KV-event mode; prefer the local indexer / event-plane subscriber path |
-| `--bootstrap-ports` | None | Ports for P/D rendezvous |
+| `--zmq-kv-events-ports` | None | Comma-separated ZMQ PUB base ports for KV event publishing, one per worker |
+| `--zmq-replay-ports` | None | Comma-separated ZMQ ROUTER base ports for gap recovery, one per worker |
+| `--bootstrap-ports` | None | Comma-separated rendezvous base ports, one per worker in disaggregated mode |
 | `--kv-transfer-bandwidth` | 64.0 | KV cache transfer bandwidth in GB/s. Set to 0 to disable |
 | `--kv-cache-dtype` | auto | KV cache dtype for bytes-per-token computation |
 | `--kv-bytes-per-token` | Auto-computed | KV cache bytes per token (override auto-computation) |
+| `--discovery-backend` | Env-driven (`etcd`) | Discovery backend: `kubernetes`, `etcd`, `file`, or `mem` |
+| `--request-plane` | Env-driven (`tcp`) | Request transport: `nats`, `http`, or `tcp` |
+| `--event-plane` | Env-driven (`nats`) | Event transport: `nats` or `zmq` |
+## Environment Variables
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DYN_MOCKER_KV_CACHE_TRACE` | off | Set to `1` or `true` to log structured KV cache allocation and eviction traces |
+> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.
+## Performance Modeling Setup
+By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, pass `--planner-profile-data` with either:
+- a mocker-format `.npz` file, or
+- a profiler output directory
+The mocker automatically accepts profiler-style results directories and converts them internally.
+It also accepts older raw-data directories containing:
+- `prefill_raw_data.json`
+- `decode_raw_data.json`
+```bash
+python -m dynamo.mocker \
+    --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
+    --planner-profile-data tests/planner/profiling_results/H200_TP1P_TP1D \
+    --speedup-ratio 1.0
+```
+Example `--reasoning` configuration:
+```bash
+python -m dynamo.mocker \
+    --model-path Qwen/Qwen3-0.6B \
+    --reasoning '{"start_thinking_token_id":123,"end_thinking_token_id":456,"thinking_ratio":0.6}'
+```
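Reviewer note: hand-quoting the JSON in the `--reasoning` example is easy to get wrong in scripts. A small Python sketch that builds the same argument from the three documented keys is shown here. The helper name `reasoning_flag` is hypothetical, and the token IDs are the placeholder values from the example above, not real IDs for any tokenizer.

```python
import json
import shlex


def reasoning_flag(start_id: int, end_id: int, thinking_ratio: float) -> str:
    """Build a shell-safe `--reasoning` argument (hypothetical helper)."""
    config = {
        "start_thinking_token_id": start_id,
        "end_thinking_token_id": end_id,
        "thinking_ratio": thinking_ratio,
    }
    # Compact separators keep the JSON on one line without spaces;
    # shlex.quote wraps it so the shell passes it as a single argument.
    return "--reasoning " + shlex.quote(json.dumps(config, separators=(",", ":")))


if __name__ == "__main__":
    print(reasoning_flag(123, 456, 0.6))
```

The printed string can be appended to a `python -m dynamo.mocker` command line.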
+The profile results directory should contain:
+- `selected_prefill_interpolation/raw_data.npz`
+- `selected_decode_interpolation/raw_data.npz`
+To generate profile data for your own model and hardware, run the profiler and then point `--planner-profile-data` at the resulting output directory.
+## Event Transport and Router Testing
+The default event path uses the local indexer / event-plane subscriber flow. The older durable KV-events mode is still available through `--durable-kv-events`, but it is deprecated and should not be the preferred setup for new tests.
+For router and indexer experiments that need native wire-format event forwarding, the mocker also supports a ZMQ path:
+- `--event-plane zmq`
+- `--zmq-kv-events-ports` for per-worker PUB base ports
+- `--zmq-replay-ports` for optional replay/gap-recovery ROUTER base ports
+When set, each worker binds on its base port plus `dp_rank`, so the number of comma-separated base ports must match `--num-workers`.
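Reviewer note: the "base port plus `dp_rank`" rule is easier to check with a worked example. The Python sketch below expands a comma-separated base-port list into the ports each worker would bind. The function name `expand_ports` is hypothetical, and it assumes `dp_rank` runs from 0 to `--data-parallel-size` minus 1, which the text does not state explicitly.

```python
def expand_ports(
    base_ports: str, num_workers: int, data_parallel_size: int = 1
) -> list[list[int]]:
    """Expand per-worker base ports into per-rank ports (hypothetical helper)."""
    bases = [int(part) for part in base_ports.split(",") if part.strip()]
    # The documentation requires exactly one base port per worker.
    if len(bases) != num_workers:
        raise ValueError(
            f"expected {num_workers} base ports, got {len(bases)}"
        )
    # Assumption: dp_rank takes the values 0 .. data_parallel_size - 1.
    return [
        [base + dp_rank for dp_rank in range(data_parallel_size)]
        for base in bases
    ]


if __name__ == "__main__":
    # Two workers, two DP ranks each.
    print(expand_ports("5557,5567", num_workers=2, data_parallel_size=2))
```

Under these assumptions, base ports should be spaced at least `--data-parallel-size` apart so that the per-rank ranges of different workers do not collide.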
+## Disaggregation Port Layout
+`--bootstrap-ports` takes a comma-separated list of base ports, one per worker. In multi-worker mode, the number of listed ports must exactly match `--num-workers`.
+Prefill workers listen on these ports and publish the bootstrap endpoint through discovery. Decode workers use the matching ports to rendezvous before decode begins.
+## Kubernetes Deployment
+The mocker can be deployed through example `DynamoGraphDeployment` manifests for both aggregated and disaggregated setups:
+```bash
+kubectl apply -f examples/backends/mocker/deploy/agg.yaml
+kubectl apply -f examples/backends/mocker/deploy/disagg.yaml
+```
 ## Architecture
@@ -207,6 +295,15 @@ The mocker is particularly useful for:
 | KV Events | Native | Compatible |
 | Data Parallelism | Multi-GPU | Simulated |
+## Next Steps
+| Document | Description |
+|----------|-------------|
+| [Benchmarking Dynamo Deployments](../benchmarks/benchmarking.md) | Run AIPerf against a mocker-backed deployment to measure latency, TTFT, throughput, and scaling behavior |
+| [Aggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/agg.yaml) | Deploy a mocker-backed aggregated DynamoGraphDeployment on Kubernetes |
+| [Disaggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/disagg.yaml) | Deploy separate prefill and decode mocker workers for disaggregated-serving benchmarks |
+| [Global Planner Mocker Example](../../examples/global_planner/global-planner-mocker-test.yaml) | Advanced multi-pool mocker setup for planner and global-router experiments |
 ## Feature Gaps (WIP)
 The following features are not yet supported by the mocker: