Commit 3357c53f authored by Yan Ru Pei, committed by GitHub

docs: add mocker under testing (#7488)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent 0b20745e
# Mocker engine
The canonical user-facing documentation for the mocker lives at
[`docs/mocker/mocker.md`](../../../../docs/mocker/mocker.md).

Useful adjacent references:
- Aggregated deployment example: [`examples/backends/mocker/deploy/agg.yaml`](../../../../examples/backends/mocker/deploy/agg.yaml)
- Disaggregated deployment example: [`examples/backends/mocker/deploy/disagg.yaml`](../../../../examples/backends/mocker/deploy/disagg.yaml)
- Global planner mocker example: [`examples/global_planner/global-planner-mocker-test.yaml`](../../../../examples/global_planner/global-planner-mocker-test.yaml)

The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:
- Testing distributed system components without GPU resources
- Benchmarking infrastructure and networking overhead
- Developing and debugging Dynamo components
- Load testing and performance analysis
## Basic usage
The mocker engine now supports a vLLM-style CLI interface with individual arguments for all configuration options.
### Required arguments:
- `--model-path`: Path to model directory or HuggingFace model ID (required for tokenizer)
### MockEngineArgs parameters (vLLM-style):
- `--num-gpu-blocks-override`: Number of GPU blocks for KV cache (default: 16384)
- `--block-size`: Token block size for KV cache blocks (default: 64)
- `--max-num-seqs`: Maximum number of sequences per iteration (default: 256)
- `--max-num-batched-tokens`: Maximum number of batched tokens per iteration (default: 8192)
- `--enable-prefix-caching` / `--no-enable-prefix-caching`: Enable/disable automatic prefix caching (default: True)
- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`: Enable/disable chunked prefill (default: True)
- `--preemption-mode`: Preemption mode for decode eviction under memory pressure: `lifo` (default, matches vLLM v1) or `fifo`
- `--speedup-ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster. Use `0` for infinite speedup (no simulation delays)
- `--decode-speedup-ratio`: Additional speedup multiplier applied only to decode steps (default: 1.0). Models speculative decoding (e.g. Eagle) where decode throughput improves without affecting prefill latency. Effective decode speedup is `speedup_ratio * decode_speedup_ratio`
- `--data-parallel-size`: Number of data parallel workers to simulate (default: 1)
- `--num-workers`: Number of mocker workers to launch in the same process (default: 1). All workers share the same tokio runtime and thread pool
- `--stagger-delay`: Delay in seconds between launching each worker to avoid overwhelming etcd/NATS/frontend. Set to 0 to disable staggering. Use -1 for auto mode (stagger dependent on number of workers). Default: -1 (auto)
- `--disaggregation-mode prefill` / `--disaggregation-mode decode`: Whether the worker is a prefill or decode worker for disaggregated deployment. If not specified, mocker will be in aggregated mode.
- `--kv-transfer-bandwidth`: KV cache transfer bandwidth in GB/s for disaggregated serving latency simulation (default: 64.0, inter-node InfiniBand). Set to 0 to disable. For intra-node NVLink, typical value is ~450.
- `--kv-cache-dtype`: Data type for KV cache, used to compute kv_bytes_per_token. "auto" uses the model's torch dtype (default).
- `--kv-bytes-per-token`: KV cache bytes per token. If not specified, auto-computed from model config.
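The interaction between `--speedup-ratio` and `--decode-speedup-ratio` can be sketched as follows. The multiplication comes from the description above; dividing a base delay by the speedup, and treating `0` as "no delay", is an assumption about how the ratio is applied, and both function names are hypothetical rather than part of the mocker's API.

```python
def effective_decode_speedup(speedup_ratio: float, decode_speedup_ratio: float = 1.0) -> float:
    """Decode steps see both multipliers; prefill sees only speedup_ratio."""
    return speedup_ratio * decode_speedup_ratio


def scaled_delay(base_delay_s: float, speedup: float) -> float:
    """Assumed scaling rule: delay shrinks in proportion to the speedup.

    A speedup of 0 is documented as "infinite speedup", modeled here as zero delay.
    """
    if speedup < 0:
        raise ValueError("speedup must be non-negative")
    if speedup == 0:
        return 0.0
    return base_delay_s / speedup


# Illustrative base timings (made up for this example).
prefill_delay = scaled_delay(0.200, 10.0)
decode_delay = scaled_delay(0.020, effective_decode_speedup(10.0, 2.0))
print(f"prefill: {prefill_delay:.4f}s, decode: {decode_delay:.4f}s")
```

Under these assumptions, raising `--decode-speedup-ratio` shortens only the decode delay, which matches the stated intent of modeling speculative decoding without changing prefill latency.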
**Environment variables:**
- `DYN_MOCKER_KV_CACHE_TRACE`: Set to `1` or `true` to log structured KV cache allocation/eviction trace (timestamp_ms, block_ids, etc.). Default: off.
### Example with individual arguments (vLLM-style):
```bash
# Start mocker with custom configuration
python -m dynamo.mocker \
  --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-gpu-blocks-override 8192 \
  --block-size 16 \
  --speedup-ratio 10.0 \
  --max-num-seqs 512 \
  --num-workers 4 \
  --enable-prefix-caching

# Start frontend server
python -m dynamo.frontend --http-port 8000
```
> [!Note]
> Each mocker instance runs as a single process, and each DP worker (specified by `--data-parallel-size`) is spawned as a lightweight async task within that process. For benchmarking (e.g., router testing), you can use `--num-workers` to launch multiple mocker engines in the same process, which is more efficient than launching separate processes since they all share the same tokio runtime and thread pool.
## Performance modeling with planner profile data
By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, you can load performance data from actual profiling results using `--planner-profile-data`:
```bash
python -m dynamo.mocker \
  --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
  --planner-profile-data tests/planner/profiling_results/H200_TP1P_TP1D \
  --speedup-ratio 1.0
```
The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.
To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler-guide.md) for details):
```bash
python components/src/dynamo/profiler/profile_sla.py \
  --profile-config your_profile_config.yaml
```
Then use the resulting profile results directory directly with `--planner-profile-data`.
## Deploying Mocker in K8s
Example DGD YAML configurations for aggregated and disaggregated deployments are provided in `examples/backends/mocker/deploy/`. To deploy the mocker engine in K8s, run:
```bash
# Aggregated deployment
kubectl apply -f examples/backends/mocker/deploy/agg.yaml
# Or, for a disaggregated deployment
kubectl apply -f examples/backends/mocker/deploy/disagg.yaml
```
...@@ -93,6 +93,10 @@ navigation:
    path: components/kvbm/kvbm-guide.md
  - page: Dynamo Benchmarking
    path: benchmarks/benchmarking.md
  - section: Testing
    contents:
      - page: Mocker
        path: mocker/mocker.md
  - section: Multimodal
    path: features/multimodal/README.md
    contents:
......
...@@ -71,28 +71,116 @@ python -m dynamo.mocker \
| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | Required | HuggingFace model ID or local path for tokenizer |
| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `--block-size` | 64 | Tokens per KV cache block |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Maximum tokens per batch |
| `--enable-prefix-caching` | True | Enable prefix caching |
| `--no-enable-prefix-caching` | - | Disable prefix caching |
| `--enable-chunked-prefill` | True | Enable chunked prefill |
| `--no-enable-chunked-prefill` | - | Disable chunked prefill |
| `--preemption-mode` | `lifo` | Decode eviction policy under memory pressure: `lifo` (vLLM v1 style) or `fifo` |
| `--watermark` | 0.01 | KV cache watermark (fraction reserved) |
| `--speedup-ratio` | 1.0 | Timing speedup factor |
| `--decode-speedup-ratio` | 1.0 | Decode-only speedup multiplier (e.g. for Eagle speculation) |
| `--data-parallel-size` | 1 | Number of DP replicas |
| `--startup-time` | None | Simulated startup delay (seconds) |
| `--planner-profile-data` | None | Path to either a mocker-format `.npz` file or a profiler results directory |
| `--num-workers` | 1 | Workers per process |
| `--reasoning` | None | JSON config for emitting reasoning token spans, with `start_thinking_token_id`, `end_thinking_token_id`, and `thinking_ratio` |
| `--engine-type` | `vllm` | Engine simulation type: `vllm` or `sglang` |
| `--sglang-schedule-policy` | `fifo` / `fcfs` | SGLang scheduling policy override |
| `--sglang-page-size` | 1 | SGLang radix-cache page size in tokens |
| `--sglang-max-prefill-tokens` | 16384 | SGLang max prefill-token budget per batch |
| `--sglang-chunked-prefill-size` | 8192 | SGLang chunked-prefill chunk size |
| `--sglang-clip-max-new-tokens` | 4096 | SGLang admission-budget cap for max new tokens |
| `--sglang-schedule-conservativeness` | 1.0 | SGLang schedule conservativeness factor |
| `--extra-engine-args` | None | Path to a JSON file with mocker configuration; overrides individual CLI arguments |
| `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
| `--disaggregation-mode` | `agg` | Worker mode: `agg` (aggregated), `prefill`, or `decode` |
| `--durable-kv-events` | False | Deprecated JetStream KV-event mode; prefer the local indexer / event-plane subscriber path |
| `--zmq-kv-events-ports` | None | Comma-separated ZMQ PUB base ports for KV event publishing, one per worker |
| `--zmq-replay-ports` | None | Comma-separated ZMQ ROUTER base ports for gap recovery, one per worker |
| `--bootstrap-ports` | None | Comma-separated rendezvous base ports, one per worker in disaggregated mode |
| `--kv-transfer-bandwidth` | 64.0 | KV cache transfer bandwidth in GB/s. Set to 0 to disable |
| `--kv-cache-dtype` | auto | KV cache dtype for bytes-per-token computation |
| `--kv-bytes-per-token` | Auto-computed | KV cache bytes per token (override auto-computation) |
| `--discovery-backend` | Env-driven (`etcd`) | Discovery backend: `kubernetes`, `etcd`, `file`, or `mem` |
| `--request-plane` | Env-driven (`tcp`) | Request transport: `nats`, `http`, or `tcp` |
| `--event-plane` | Env-driven (`nats`) | Event transport: `nats` or `zmq` |
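
As a rough illustration of how `--kv-transfer-bandwidth` and `--kv-bytes-per-token` could combine into a simulated transfer delay, the sketch below divides the total KV payload by the bandwidth. The formula, the decimal interpretation of GB (10^9 bytes), and the function name are assumptions for illustration; the mocker's actual computation may differ.

```python
def kv_transfer_latency_s(
    num_tokens: int,
    kv_bytes_per_token: int,
    bandwidth_gb_per_s: float = 64.0,
) -> float:
    """Estimate KV cache transfer time in seconds (assumed model: bytes / bandwidth)."""
    if bandwidth_gb_per_s == 0:
        # The table above documents 0 as "disable", modeled here as no added latency.
        return 0.0
    total_bytes = num_tokens * kv_bytes_per_token
    return total_bytes / (bandwidth_gb_per_s * 1e9)


# Illustrative numbers only: a 4096-token prompt at 131072 KV bytes per token.
inter_node = kv_transfer_latency_s(4096, 131072, 64.0)
intra_node = kv_transfer_latency_s(4096, 131072, 450.0)
print(f"inter-node: {inter_node * 1e3:.2f} ms, intra-node: {intra_node * 1e3:.2f} ms")
```

Under this model the latency scales linearly with prompt length, so long-context disaggregated runs are the ones most sensitive to the bandwidth setting.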
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_MOCKER_KV_CACHE_TRACE` | off | Set to `1` or `true` to log structured KV cache allocation and eviction traces |
> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.
## Performance Modeling Setup
By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, pass `--planner-profile-data` with either:
- a mocker-format `.npz` file, or
- a profiler output directory
The mocker automatically accepts profiler-style results directories and converts them internally.
It also accepts older raw-data directories containing:
- `prefill_raw_data.json`
- `decode_raw_data.json`
```bash
python -m dynamo.mocker \
  --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
  --planner-profile-data tests/planner/profiling_results/H200_TP1P_TP1D \
  --speedup-ratio 1.0
```
Example `--reasoning` configuration:
```bash
python -m dynamo.mocker \
  --model-path Qwen/Qwen3-0.6B \
  --reasoning '{"start_thinking_token_id":123,"end_thinking_token_id":456,"thinking_ratio":0.6}'
```
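When the `--reasoning` value is generated from a script, hand-writing the JSON invites quoting mistakes. The following minimal sketch builds the same argument with the standard library; the helper name is hypothetical, and the token IDs are the placeholder values from the example above, not real tokenizer IDs.

```python
import json
import shlex


def reasoning_arg(start_id: int, end_id: int, thinking_ratio: float) -> str:
    """Return a shell-safe value for --reasoning using the three documented keys."""
    config = {
        "start_thinking_token_id": start_id,
        "end_thinking_token_id": end_id,
        "thinking_ratio": thinking_ratio,
    }
    # Compact separators keep the JSON on one line; shlex.quote adds shell quoting.
    return shlex.quote(json.dumps(config, separators=(",", ":")))


arg = reasoning_arg(123, 456, 0.6)
print(f"python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --reasoning {arg}")
```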
The profile results directory should contain:
- `selected_prefill_interpolation/raw_data.npz`
- `selected_decode_interpolation/raw_data.npz`
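
Before launching, it can help to check which of the accepted layouts a given path matches. The sketch below is a hypothetical pre-flight helper based only on the file names listed in this section; it is not the mocker's own detection logic, and the returned labels are made up for illustration.

```python
import tempfile
from pathlib import Path

PROFILER_FILES = (
    "selected_prefill_interpolation/raw_data.npz",
    "selected_decode_interpolation/raw_data.npz",
)
LEGACY_FILES = ("prefill_raw_data.json", "decode_raw_data.json")


def classify_profile_data(path) -> str:
    """Classify a --planner-profile-data path by the layouts described above."""
    p = Path(path)
    if p.is_file() and p.suffix == ".npz":
        return "npz-file"
    if p.is_dir():
        if all((p / rel).is_file() for rel in PROFILER_FILES):
            return "profiler-dir"
        if all((p / rel).is_file() for rel in LEGACY_FILES):
            return "legacy-raw-data-dir"
    return "unknown"


# Demo: build an empty profiler-style directory and classify it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in PROFILER_FILES:
        (root / rel).parent.mkdir(parents=True)
        (root / rel).touch()
    demo_kind = classify_profile_data(root)
print(demo_kind)
```

The check only inspects file names, so it catches a mis-mounted ConfigMap or a wrong path early but says nothing about whether the `.npz` contents are valid.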
To generate profile data for your own model and hardware, run the profiler and then point `--planner-profile-data` at the resulting output directory.
## Event Transport and Router Testing
The default event path uses the local indexer / event-plane subscriber flow. The older durable KV-events mode is still available through `--durable-kv-events`, but it is deprecated and should not be the preferred setup for new tests.
For router and indexer experiments that need native wire-format event forwarding, the mocker also supports a ZMQ path:
- `--event-plane zmq`
- `--zmq-kv-events-ports` for per-worker PUB base ports
- `--zmq-replay-ports` for optional replay/gap-recovery ROUTER base ports
When set, each worker binds on its base port plus `dp_rank`, so the number of comma-separated base ports must match `--num-workers`.
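
The port rule above can be sketched as follows: one base port per worker, with each data-parallel rank bound at `base + dp_rank`. The function names and the example ports are hypothetical, and the overlap check is an extra safeguard added for this sketch rather than documented mocker behavior.

```python
from typing import Dict, List, Tuple


def parse_base_ports(value: str, num_workers: int) -> List[int]:
    """Parse a comma-separated port list and enforce one base port per worker."""
    ports = [int(part) for part in value.split(",") if part.strip()]
    if len(ports) != num_workers:
        raise ValueError(f"expected {num_workers} base ports, got {len(ports)}")
    return ports


def port_layout(base_ports: List[int], data_parallel_size: int) -> Dict[Tuple[int, int], int]:
    """Map (worker index, dp_rank) to the bound port, base port plus dp_rank."""
    layout = {
        (worker, dp_rank): base + dp_rank
        for worker, base in enumerate(base_ports)
        for dp_rank in range(data_parallel_size)
    }
    if len(set(layout.values())) != len(layout):
        raise ValueError("port ranges overlap; space base ports at least data_parallel_size apart")
    return layout


layout = port_layout(parse_base_ports("5557,5567", num_workers=2), data_parallel_size=2)
for (worker, dp_rank), port in sorted(layout.items()):
    print(f"worker {worker} dp_rank {dp_rank} -> port {port}")
```

With `--data-parallel-size` greater than 1, adjacent base ports would collide under this rule, which is why the example spaces them ten apart.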
## Disaggregation Port Layout
`--bootstrap-ports` takes a comma-separated list of base ports, one per worker. In multi-worker mode, the number of listed ports must exactly match `--num-workers`.
Prefill workers listen on these ports and publish the bootstrap endpoint through discovery. Decode workers use the matching ports to rendezvous before decode begins.
## Kubernetes Deployment
The mocker can be deployed through example `DynamoGraphDeployment` manifests for both aggregated and disaggregated setups:
```bash
kubectl apply -f examples/backends/mocker/deploy/agg.yaml
kubectl apply -f examples/backends/mocker/deploy/disagg.yaml
```
## Architecture
...@@ -207,6 +295,15 @@ The mocker is particularly useful for:
| KV Events | Native | Compatible |
| Data Parallelism | Multi-GPU | Simulated |
## Next Steps
| Document | Description |
|----------|-------------|
| [Benchmarking Dynamo Deployments](../benchmarks/benchmarking.md) | Run AIPerf against a mocker-backed deployment to measure latency, TTFT, throughput, and scaling behavior |
| [Aggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/agg.yaml) | Deploy a mocker-backed aggregated DynamoGraphDeployment on Kubernetes |
| [Disaggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/disagg.yaml) | Deploy separate prefill and decode mocker workers for disaggregated-serving benchmarks |
| [Global Planner Mocker Example](../../examples/global_planner/global-planner-mocker-test.yaml) | Advanced multi-pool mocker setup for planner and global-router experiments |
## Feature Gaps (WIP)
The following features are not yet supported by the mocker:
......