@@ -27,6 +27,9 @@ The mocker engine now supports a vLLM-style CLI interface with individual argume
-`--num-workers`: Number of mocker workers to launch in the same process (default: 1). All workers share the same tokio runtime and thread pool.
-`--stagger-delay`: Delay in seconds between launching each worker, to avoid overwhelming etcd/NATS/frontend. Set to 0 to disable staggering, or to -1 for auto mode, where the delay depends on the number of workers (default: -1).
-`--disaggregation-mode prefill` / `--disaggregation-mode decode`: Whether the worker is a prefill or decode worker in a disaggregated deployment. If not specified, the mocker runs in aggregated mode.
-`--kv-transfer-bandwidth`: KV cache transfer bandwidth in GB/s, used to simulate transfer latency in disaggregated serving (default: 64.0, representing inter-node InfiniBand). Set to 0 to disable. For intra-node NVLink, a typical value is ~450.
-`--kv-cache-dtype`: Data type for KV cache, used to compute kv_bytes_per_token. "auto" uses the model's torch dtype (default).
-`--kv-bytes-per-token`: KV cache bytes per token. If not specified, auto-computed from model config.
@@ -160,6 +163,16 @@ The mocker supports two timing prediction modes:
For disaggregated prefill/decode deployments, prefill and decode workers coordinate via a simple TCP-based rendezvous protocol. The decode worker connects to the prefill worker's bootstrap port and waits until the prefill phase completes and the KV cache is ready. Either side can arrive first; the rendezvous completes once both are ready.
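
The ordering guarantee of the rendezvous described above can be illustrated with a minimal in-process sketch. It models only the "either side can arrive first" semantics with `asyncio.Event`, not the TCP wire protocol or the bootstrap port. The class and function names (`Rendezvous`, `prefill_done`, `decode_connect`, `run`) are hypothetical and do not come from the mocker's code.

```python
import asyncio


class Rendezvous:
    """In-process stand-in for the rendezvous (hypothetical, not the mocker's API)."""

    def __init__(self):
        self.kv_ready = asyncio.Event()        # set when prefill has finished
        self.decode_arrived = asyncio.Event()  # set when decode has connected

    async def prefill_done(self):
        # Prefill announces that the KV cache is ready, then waits for decode.
        self.kv_ready.set()
        await self.decode_arrived.wait()

    async def decode_connect(self):
        # Decode announces its arrival, then waits for the KV cache.
        self.decode_arrived.set()
        await self.kv_ready.wait()


async def run(prefill_delay, decode_delay):
    """Run both sides with the given arrival delays and return the arrival order."""
    rv = Rendezvous()
    order = []

    async def prefill():
        await asyncio.sleep(prefill_delay)  # simulated prefill compute
        order.append("prefill ready")
        await rv.prefill_done()

    async def decode():
        await asyncio.sleep(decode_delay)   # simulated late or early connection
        order.append("decode connected")
        await rv.decode_connect()

    # gather() returns only after both sides have passed the rendezvous.
    await asyncio.gather(prefill(), decode())
    return order
```

Calling `asyncio.run(run(0.02, 0.0))` lets decode arrive first, and `asyncio.run(run(0.0, 0.02))` lets prefill arrive first. In both cases the call returns only once both sides have reached the rendezvous.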
### KV Transfer Latency Simulation
The mocker simulates KV cache transfer time between prefill and decode workers. Before the prefill worker emits its first (and only) token, it sleeps for a duration based on:
-**kv_bytes_per_token** (auto-computed from model config): `num_layers * 2 * num_kv_heads * head_dim * dtype_bytes`. The `dtype_bytes` is determined by `--kv-cache-dtype`: when set to `auto` (default), it uses the model's `dtype` from config; when explicitly set (e.g., `fp8`), it uses the specified dtype instead. It can also be overridden directly with `--kv-bytes-per-token`.
This delay is injected after the scheduler's prefill compute simulation completes, modeling the sequential flow: prefill computation → KV transfer → decode begins. Set `--kv-transfer-bandwidth 0` to disable.
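
As a rough illustration of the calculation above, the following sketch computes `kv_bytes_per_token` from the stated formula and derives a transfer delay from it. The delay formula (prompt tokens × bytes per token ÷ bandwidth, with 1 GB taken as 10^9 bytes) is an assumption about how the inputs combine, not a confirmed description of the mocker's implementation. The function names and the `DTYPE_BYTES` table are likewise hypothetical.

```python
# Bytes per element for a few common dtypes (illustrative table, not the mocker's).
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1}


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                       model_dtype, kv_cache_dtype="auto"):
    """num_layers * 2 * num_kv_heads * head_dim * dtype_bytes (2 = keys + values)."""
    # "auto" falls back to the model's dtype; otherwise the explicit dtype wins.
    dtype = model_dtype if kv_cache_dtype == "auto" else kv_cache_dtype
    return num_layers * 2 * num_kv_heads * head_dim * DTYPE_BYTES[dtype]


def kv_transfer_delay_s(num_tokens, bytes_per_token, bandwidth_gb_s=64.0):
    """Assumed delay model: total KV bytes divided by bandwidth (1 GB = 1e9 bytes)."""
    if bandwidth_gb_s <= 0:
        return 0.0  # mirrors `--kv-transfer-bandwidth 0` disabling the delay
    return num_tokens * bytes_per_token / (bandwidth_gb_s * 1e9)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, bfloat16 weights.
per_token = kv_bytes_per_token(32, 8, 128, "bfloat16")       # 131072 bytes
per_token_fp8 = kv_bytes_per_token(32, 8, 128, "bfloat16",
                                   kv_cache_dtype="fp8")     # 65536 bytes
delay = kv_transfer_delay_s(8192, per_token)                 # about 0.0168 s
```

Under these assumptions, an 8192-token prompt transfers 1 GiB of KV cache, which takes roughly 17 ms at the default 64 GB/s. Switching the cache to `fp8` halves both the bytes per token and the delay.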
## Integration with Dynamo
### KV Event Publishing
...
...
@@ -199,7 +212,6 @@ The mocker is particularly useful for:
The following features are not yet supported by the mocker:
-**KV transfer latency simulation** - Disaggregated serving simulates the rendezvous handshake but does not model the actual KV cache transfer time between prefill and decode workers
-**Multi-tier memory** - No support for offloading KV cache to CPU/disk or onboarding back to GPU; potential future integration with KVBM
-**Multimodal support** - Currently only simulates text token processing; no vision encoder or cross-attention simulation
-**Native Rust reference counting** - Work in progress to use native Rc/Arc for block reference counting, enabling natural RAII patterns for simpler tracking