The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:
The canonical user-facing documentation for the mocker lives at
- `--preemption-mode`: Preemption mode for decode eviction under memory pressure: `lifo` (default, matches vLLM v1) or `fifo`
- `--speedup-ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulated engines run faster. Use `0` for infinite speedup (no simulation delays)
- `--decode-speedup-ratio`: Additional speedup multiplier applied only to decode steps (default: 1.0). This models speculative decoding (e.g. Eagle), where decode throughput improves without affecting prefill latency. The effective decode speedup is `speedup_ratio * decode_speedup_ratio`
- `--data-parallel-size`: Number of data parallel workers to simulate (default: 1)
- `--num-workers`: Number of mocker workers to launch in the same process (default: 1). All workers share the same tokio runtime and thread pool
- `--stagger-delay`: Delay in seconds between launching each worker, to avoid overwhelming etcd/NATS/frontend. Set to `0` to disable staggering. Use `-1` for auto mode (the delay depends on the number of workers). Default: `-1` (auto)
- `--disaggregation-mode prefill` / `--disaggregation-mode decode`: Whether the worker is a prefill or decode worker in a disaggregated deployment. If not specified, the mocker runs in aggregated mode.
- `--kv-transfer-bandwidth`: KV cache transfer bandwidth in GB/s, used to simulate transfer latency in disaggregated serving (default: 64.0, inter-node InfiniBand). Set to `0` to disable. For intra-node NVLink, a typical value is ~450.
- `--kv-cache-dtype`: Data type for the KV cache, used to compute `kv_bytes_per_token`. `auto` (default) uses the model's torch dtype.
- `--kv-bytes-per-token`: KV cache bytes per token. If not specified, it is auto-computed from the model config.
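The timing-related flags above combine through simple arithmetic, which the following sketch works through. The effective decode speedup comes directly from the flag description. The transfer-time model (total KV bytes divided by bandwidth), the reading of GB as 10^9 bytes, and all function names are assumptions made for illustration, not the mocker's actual implementation.

```python
import math

GB = 1e9  # assumption: "GB/s" is read as 10^9 bytes per second


def effective_decode_speedup(speedup_ratio: float, decode_speedup_ratio: float = 1.0) -> float:
    """Effective decode speedup: speedup_ratio * decode_speedup_ratio.

    A speedup_ratio of 0 means infinite speedup (no simulated delay).
    """
    if speedup_ratio == 0:
        return math.inf
    return speedup_ratio * decode_speedup_ratio


def scaled_delay(base_seconds: float, speedup: float) -> float:
    """Hypothetical helper: shrink a simulated step time by a speedup factor."""
    return 0.0 if math.isinf(speedup) else base_seconds / speedup


def kv_transfer_seconds(num_tokens: int, kv_bytes_per_token: int, bandwidth_gb_s: float) -> float:
    """Assumed model: transfer time = total KV bytes / bandwidth.

    A bandwidth of 0 disables the transfer-latency simulation.
    """
    if bandwidth_gb_s == 0:
        return 0.0
    return num_tokens * kv_bytes_per_token / (bandwidth_gb_s * GB)
```

For example, `--speedup-ratio 10.0` combined with `--decode-speedup-ratio 2.0` gives `effective_decode_speedup(10.0, 2.0)`, an effective decode speedup of 20, while prefill steps are still sped up by only 10.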
**Environment variables:**
- `DYN_MOCKER_KV_CACHE_TRACE`: Set to `1` or `true` to log a structured KV cache allocation/eviction trace (`timestamp_ms`, `block_ids`, etc.). Default: off.
### Example with individual arguments (vLLM-style):
```bash
# Start the mocker with a custom configuration
python -m dynamo.mocker \
  --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-gpu-blocks-override 8192 \
  --block-size 16 \
  --speedup-ratio 10.0 \
  --max-num-seqs 512 \
  --num-workers 4 \
  --enable-prefix-caching

# Start the frontend server (in a separate terminal)
python -m dynamo.frontend --http-port 8000
```
> [!Note]
> Each mocker instance runs as a single process, and each DP worker (specified by `--data-parallel-size`) is spawned as a lightweight async task within that process. For benchmarking (e.g., router testing), you can use `--num-workers` to launch multiple mocker engines in the same process, which is more efficient than launching separate processes since they all share the same tokio runtime and thread pool.
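
The single-process model described in the note can be illustrated with a small Python `asyncio` analogy. This is only a sketch: the mocker itself runs its workers on a tokio runtime, and `run_worker`, `launch_workers`, and the fixed stagger delay are hypothetical names and simplifications (the auto stagger mode is not modeled).

```python
import asyncio


async def run_worker(worker_id: int, started: list[int]) -> None:
    """Hypothetical worker: records its start, then yields to the event loop."""
    started.append(worker_id)
    await asyncio.sleep(0)  # stand-in for simulated engine work


async def launch_workers(num_workers: int, stagger_delay: float = 0.0) -> list[int]:
    """Start workers as lightweight tasks on one shared event loop.

    stagger_delay is the pause in seconds between launches; 0 disables it.
    """
    started: list[int] = []
    tasks = []
    for worker_id in range(num_workers):
        tasks.append(asyncio.create_task(run_worker(worker_id, started)))
        # Pause between launches, but not after the last worker.
        if stagger_delay > 0 and worker_id < num_workers - 1:
            await asyncio.sleep(stagger_delay)
    await asyncio.gather(*tasks)
    return started
```

Calling `asyncio.run(launch_workers(4, stagger_delay=0.01))` starts four workers inside one process and one event loop, which mirrors why `--num-workers` is cheaper than launching four separate processes.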
## Performance modeling with planner profile data
By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, you can load performance data from actual profiling results using `--planner-profile-data`:
The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.
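Before pointing `--planner-profile-data` at a directory, it can help to confirm that the layout described above is present. The following is a minimal sketch of such a check; `missing_profile_files` is a hypothetical helper, not part of the mocker, and it only tests that the files exist, not that their contents are valid.

```python
from pathlib import Path

# Subdirectories named in the text above; each should hold a raw_data.npz file.
EXPECTED_SUBDIRS = ("selected_prefill_interpolation", "selected_decode_interpolation")


def missing_profile_files(profile_dir: str) -> list[Path]:
    """Return the expected raw_data.npz paths that are absent from profile_dir."""
    root = Path(profile_dir)
    expected = [root / sub / "raw_data.npz" for sub in EXPECTED_SUBDIRS]
    return [path for path in expected if not path.is_file()]
```

An empty return value means both `raw_data.npz` files were found; otherwise the list names the paths that still need to be mounted or generated.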
To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler-guide.md) for details):
Then use the resulting profile results directory directly with `--planner-profile-data`.
## Deploying Mocker in K8s
Example DGD (`DynamoGraphDeployment`) YAML configurations for aggregated and disaggregated deployments are provided in `examples/backends/mocker/deploy/`. To deploy the mocker engine in K8s, run:
```bash
kubectl apply -f examples/backends/mocker/deploy/agg.yaml # or, for disaggregated
```
| `--model-path` | Required | HuggingFace model ID or local path for tokenizer |
| `--endpoint` | `dyn://dynamo.backend.generate` | Dynamo endpoint string |
| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `DYN_MOCKER_KV_CACHE_TRACE` | off | Set to `1` or `true` to log structured KV cache allocation and eviction traces |
> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.
## Performance Modeling Setup
By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, pass `--planner-profile-data` with either:
- a mocker-format `.npz` file, or
- a profiler output directory
The mocker automatically accepts profiler-style results directories and converts them internally.
It also accepts older raw-data directories containing:
To generate profile data for your own model and hardware, run the profiler and then point `--planner-profile-data` at the resulting output directory.
## Event Transport and Router Testing
The default event path uses the local indexer / event-plane subscriber flow. The older durable KV-events mode is still available through `--durable-kv-events`, but it is deprecated and should not be the preferred setup for new tests.
For router and indexer experiments that need native wire-format event forwarding, the mocker also supports a ZMQ path:
- `--event-plane zmq`
- `--zmq-kv-events-ports` for per-worker PUB base ports
- `--zmq-replay-ports` for optional replay/gap-recovery ROUTER base ports
When set, each worker binds on its base port plus `dp_rank`, so the number of comma-separated base ports must match `--num-workers`.
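The port rule above can be sketched as a small function. It assumes that `dp_rank` runs from 0 to `--data-parallel-size` minus 1 for each worker; the function name and the port numbers in the usage note are hypothetical.

```python
def zmq_bind_ports(
    base_ports: list[int], num_workers: int, data_parallel_size: int = 1
) -> dict[tuple[int, int], int]:
    """Map (worker_index, dp_rank) to its bind port: base port + dp_rank."""
    if len(base_ports) != num_workers:
        raise ValueError(
            f"expected {num_workers} base ports, got {len(base_ports)}"
        )
    return {
        (worker, dp_rank): base + dp_rank
        for worker, base in enumerate(base_ports)
        for dp_rank in range(data_parallel_size)  # assumed range of dp_rank
    }
```

With two workers, two DP ranks each, and base ports 5557 and 5567, worker 0 would bind 5557 and 5558 and worker 1 would bind 5567 and 5568. Base ports therefore need to be spaced at least `data_parallel_size` apart to avoid collisions.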
## Disaggregation Port Layout
`--bootstrap-ports` takes a comma-separated list of base ports, one per worker. In multi-worker mode, the number of listed ports must exactly match `--num-workers`.
Prefill workers listen on these ports and publish the bootstrap endpoint through discovery. Decode workers use the matching ports to rendezvous before decode begins.
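The count requirement for `--bootstrap-ports` can be checked up front. The following is a minimal sketch of that validation, not the mocker's own argument parsing; `parse_bootstrap_ports` is a hypothetical helper.

```python
def parse_bootstrap_ports(spec: str, num_workers: int) -> list[int]:
    """Parse a comma-separated list of base ports, one per worker."""
    ports = [int(part) for part in spec.split(",") if part.strip()]
    if len(ports) != num_workers:
        raise ValueError(
            f"--bootstrap-ports lists {len(ports)} ports, "
            f"but --num-workers is {num_workers}"
        )
    return ports
```

A call such as `parse_bootstrap_ports("8001,8002", 2)` succeeds, while passing the same string with `num_workers=3` raises a `ValueError` before any worker starts.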
## Kubernetes Deployment
The mocker can be deployed through example `DynamoGraphDeployment` manifests for both aggregated and disaggregated setups:
@@ -207,6 +295,15 @@ The mocker is particularly useful for:
...
| KV Events | Native | Compatible |
| Data Parallelism | Multi-GPU | Simulated |
## Next Steps
| Document | Description |
|----------|-------------|
| [Benchmarking Dynamo Deployments](../benchmarks/benchmarking.md) | Run AIPerf against a mocker-backed deployment to measure latency, TTFT, throughput, and scaling behavior |
| [Aggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/agg.yaml) | Deploy a mocker-backed aggregated DynamoGraphDeployment on Kubernetes |
| [Disaggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/disagg.yaml) | Deploy separate prefill and decode mocker workers for disaggregated-serving benchmarks |
| [Global Planner Mocker Example](../../examples/global_planner/global-planner-mocker-test.yaml) | Advanced multi-pool mocker setup for planner and global-router experiments |
## Feature Gaps (WIP)
The following features are not yet supported by the mocker: