@@ -25,6 +25,7 @@ The mocker engine now supports a vLLM-style CLI interface with individual argume
...
- `--speedup-ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster
- `--data-parallel-size`: Number of data parallel workers to simulate (default: 1)
- `--num-workers`: Number of mocker workers to launch in the same process (default: 1). All workers share the same tokio runtime and thread pool
- `--stagger-delay`: Delay in seconds between launching each worker to avoid overwhelming etcd/NATS/frontend. Set to 0 to disable staggering. Use -1 for auto mode (stagger dependent on number of workers). Default: -1 (auto)
- `--is-prefill-worker` / `--is-decode-worker`: Whether the worker is a prefill or decode worker for disaggregated deployment. If not specified, mocker will be in aggregated mode.
### Example with individual arguments (vLLM-style):
...
@@ -75,4 +76,4 @@ We provide the example DGD yaml configurations for aggregated and disaggregated
```bash
kubectl apply -f examples/backends/mocker/deploy/agg.yaml # or, for disaggregated
```
The Mocker is a lightweight, high-fidelity simulation of an LLM inference engine, implemented entirely in Rust. It replicates the core scheduling, memory management, and timing behaviors of production engines without requiring a GPU, making it invaluable for testing Dynamo's routing, KV cache events, disaggregated serving, and planner components.
## Overview
The mocker simulates:
- **Block-based KV cache management** with LRU eviction
- **Continuous batching scheduler** with watermark-based admission control
- **Prefix caching** with hash-based block deduplication
- **Chunked prefill** for better batching efficiency
- **Realistic timing models** for prefill and decode phases
- **Data parallelism** (multiple DP ranks per engine)
> **Note:** While the mocker uses vLLM as its primary reference implementation, these core components—block-based KV cache management, continuous batching schedulers, LRU evictors, and prefix caching—are fundamental to all modern LLM inference engines, including SGLang and TensorRT-LLM. The architectural patterns simulated here are engine-agnostic and apply broadly across the inference ecosystem.
2. **Prefill Queue** - Requests scheduled for prefill
3. **Decode Queue** - Requests actively decoding (ordered by age for preemption)
Each iteration, the scheduler receives incoming requests, moves eligible requests from waiting to prefill based on available memory and compute budgets, simulates the prefill phase for queued requests, runs one decode step for all active sequences, and publishes metrics about current resource utilization.
When resources become constrained, the scheduler employs preemption: the oldest decoding request is evicted back to the waiting queue, its KV blocks are freed, and it will be rescheduled later. This mirrors how real engines handle memory pressure.
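The iteration and preemption behavior described above can be sketched as follows. This is a minimal illustration, not the mocker's actual code: the `Scheduler` and `Request` types, the interpretation of the watermark as a fraction of blocks held in reserve, the per-step `grow` parameter, and the choice to put a preempted request at the front of the waiting queue are all assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Hypothetical request record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub id: u64,
    pub prompt_blocks: usize, // KV blocks needed to hold the prompt
    pub held_blocks: usize,   // KV blocks currently allocated to this request
}

pub struct Scheduler {
    pub total_blocks: usize,
    pub used_blocks: usize,
    pub watermark: f64, // assumed: fraction of all blocks kept free as headroom
    pub waiting: VecDeque<Request>,
    pub prefill: VecDeque<Request>,
    pub decode: VecDeque<Request>, // front = oldest decoding request
}

impl Scheduler {
    pub fn new(total_blocks: usize, watermark: f64) -> Self {
        Self {
            total_blocks,
            used_blocks: 0,
            watermark,
            waiting: VecDeque::new(),
            prefill: VecDeque::new(),
            decode: VecDeque::new(),
        }
    }

    fn free_blocks(&self) -> usize {
        self.total_blocks - self.used_blocks
    }

    /// Move requests from waiting to prefill while the watermark reserve holds.
    pub fn admit(&mut self) {
        let reserve = (self.total_blocks as f64 * self.watermark).ceil() as usize;
        loop {
            let needed = match self.waiting.front() {
                Some(req) => req.prompt_blocks,
                None => break,
            };
            if self.free_blocks() < needed + reserve {
                break;
            }
            if let Some(mut req) = self.waiting.pop_front() {
                req.held_blocks = needed;
                self.used_blocks += needed;
                self.prefill.push_back(req);
            }
        }
    }

    /// Simulated prefill: every queued request becomes an active decode sequence.
    pub fn run_prefill(&mut self) {
        while let Some(req) = self.prefill.pop_front() {
            self.decode.push_back(req);
        }
    }

    /// Evict the oldest decoding request, free its blocks, and requeue it.
    pub fn preempt_oldest(&mut self) -> Option<u64> {
        let mut victim = self.decode.pop_front()?;
        self.used_blocks -= victim.held_blocks;
        victim.held_blocks = 0;
        let id = victim.id;
        self.waiting.push_front(victim); // assumption: preempted work is retried first
        Some(id)
    }

    /// One decode step in which each active sequence needs `grow` new blocks.
    /// Preempts the oldest sequences until the growth fits in free memory.
    pub fn decode_step(&mut self, grow: usize) {
        while !self.decode.is_empty() && self.free_blocks() < grow * self.decode.len() {
            self.preempt_oldest();
        }
        for req in self.decode.iter_mut() {
            req.held_blocks += grow;
        }
        self.used_blocks += grow * self.decode.len();
    }
}
```

With 8 blocks and a 0.25 watermark, two 3-block requests are admitted and a third waits; once decoding fills the cache, the next step preempts the oldest sequence to make room.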
### KV Block Manager
The block manager tracks KV cache blocks using reference counting and an LRU eviction policy. Blocks exist in one of two pools:
- **Active Pool** - Blocks currently in use by one or more sequences, tracked with reference counts
- **Inactive Pool** - Blocks no longer actively referenced but kept for potential reuse (prefix caching)
When a sequence needs blocks, the manager first checks if they already exist (cache hit). If not, it allocates new blocks, potentially evicting the least-recently-used inactive blocks to make room. When a sequence completes or is preempted, its blocks are either moved to the inactive pool (for potential reuse) or freed entirely.
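As a rough sketch of the two-pool bookkeeping, the example below keeps reference counts for active blocks and a queue of inactive blocks ordered from least to most recently used. The `BlockManager` type, its method names, and the use of a `u64` hash as the block identity are assumptions for illustration; the linear scan of the inactive queue is a simplification that a real implementation would avoid.

```rust
use std::collections::{HashMap, VecDeque};

pub type BlockHash = u64;

/// Hypothetical block manager with an active pool and an LRU inactive pool.
pub struct BlockManager {
    capacity: usize,
    active: HashMap<BlockHash, usize>, // block -> reference count
    inactive: VecDeque<BlockHash>,     // front = least recently used
}

impl BlockManager {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, active: HashMap::new(), inactive: VecDeque::new() }
    }

    pub fn used(&self) -> usize {
        self.active.len() + self.inactive.len()
    }

    /// Returns Ok(true) on a cache hit, Ok(false) on a fresh allocation,
    /// and Err when the cache is full and nothing can be evicted.
    pub fn acquire(&mut self, hash: BlockHash) -> Result<bool, &'static str> {
        if let Some(rc) = self.active.get_mut(&hash) {
            *rc += 1;
            return Ok(true);
        }
        // Hit in the inactive pool: move the block back to the active pool.
        if let Some(pos) = self.inactive.iter().position(|h| *h == hash) {
            self.inactive.remove(pos);
            self.active.insert(hash, 1);
            return Ok(true);
        }
        // Miss: evict the least recently used inactive block if we are full.
        if self.used() >= self.capacity && self.inactive.pop_front().is_none() {
            return Err("no free or evictable blocks");
        }
        self.active.insert(hash, 1);
        Ok(false)
    }

    /// Drop one reference; the block becomes inactive when the count hits zero.
    pub fn release(&mut self, hash: BlockHash) {
        let remaining = match self.active.get_mut(&hash) {
            Some(rc) => {
                *rc -= 1;
                *rc
            }
            None => return,
        };
        if remaining == 0 {
            self.active.remove(&hash);
            self.inactive.push_back(hash);
        }
    }

    /// Free an active block outright (for example on preemption).
    pub fn destroy(&mut self, hash: BlockHash) {
        self.active.remove(&hash);
    }
}
```

Keeping released blocks in the inactive pool, instead of freeing them at once, is what lets a later request with the same prefix hit the cache.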
The following diagram illustrates the block lifecycle, based on vLLM's block manager design:
```
┌───────────┐       ┌───────────┐       ┌─────────────────┐       ┌───────────┐
│ New Block │──────►│  Active   │──────►│    Inactive     │──────►│   Freed   │
└───────────┘ alloc │   Pool    │ deref │      Pool       │ evict └───────────┘
                    │(ref_count)│       │   (LRU order)   │
                    └─────┬─────┘       └─────────────────┘
                          │
                          │ destroy (preemption)
                          ▼
                    ┌───────────┐
                    │   Freed   │
                    └───────────┘
```
### Evictor
The LRU evictor maintains blocks ordered by their last access time, enabling O(1) eviction of the oldest unused block. It supports both normal insertion (for completed sequences) and front-insertion (for preempted sequences that should be evicted first if memory pressure continues).
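One way to support both insertion modes is to give each block a priority and always evict the smallest one. The sketch below does this with a `BTreeMap`, so eviction costs O(log n), not the O(1) described above; a linked hash structure would be needed to match that bound. The `LruEvictor` type and its method names are assumptions for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical evictor: the block with the smallest priority is evicted first.
pub struct LruEvictor {
    order: BTreeMap<i64, u64>, // priority -> block hash
    index: HashMap<u64, i64>,  // block hash -> priority
    next_back: i64,            // increasing priorities for normal insertion
    next_front: i64,           // decreasing priorities for front insertion
}

impl LruEvictor {
    pub fn new() -> Self {
        Self { order: BTreeMap::new(), index: HashMap::new(), next_back: 0, next_front: -1 }
    }

    fn put(&mut self, block: u64, priority: i64) {
        // Re-inserting a block replaces its old position.
        if let Some(old) = self.index.insert(block, priority) {
            self.order.remove(&old);
        }
        self.order.insert(priority, block);
    }

    /// Normal insertion: the block becomes the most recently used entry.
    pub fn insert(&mut self, block: u64) {
        let priority = self.next_back;
        self.next_back += 1;
        self.put(block, priority);
    }

    /// Front insertion: the block becomes the next eviction candidate.
    pub fn insert_front(&mut self, block: u64) {
        let priority = self.next_front;
        self.next_front -= 1;
        self.put(block, priority);
    }

    /// Remove a block that is being reused; returns false if it was not tracked.
    pub fn remove(&mut self, block: u64) -> bool {
        match self.index.remove(&block) {
            Some(priority) => {
                self.order.remove(&priority);
                true
            }
            None => false,
        }
    }

    /// Evict and return the block with the smallest priority, if any.
    pub fn evict(&mut self) -> Option<u64> {
        let key = *self.order.keys().next()?;
        let block = self.order.remove(&key)?;
        self.index.remove(&block);
        Some(block)
    }
}
```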
### Sequence Tracking
Each active request is tracked as a sequence, managing its token blocks and generation state. As tokens are generated, the sequence tracks which blocks are partial (still being filled) versus full (complete and hashable for prefix caching). When a partial block fills up, it gets "promoted" to a full block with a content-based hash, enabling future cache hits from requests with matching prefixes.
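The promotion step can be illustrated with a small sketch. It assumes that a full block's hash covers both its own tokens and the hash of the preceding block, so that equal hashes imply equal prefixes; the document does not state how the mocker derives its hashes, and the `Sequence` type and its methods are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical per-request sequence state.
pub struct Sequence {
    block_size: usize,
    full_blocks: Vec<u64>, // hashes of completed blocks, in order
    partial: Vec<u32>,     // tokens of the block still being filled
}

impl Sequence {
    pub fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { block_size, full_blocks: Vec::new(), partial: Vec::new() }
    }

    /// Append one token. Returns the hash of the block that was just
    /// promoted from partial to full, or None if the block is not full yet.
    pub fn push(&mut self, token: u32) -> Option<u64> {
        self.partial.push(token);
        if self.partial.len() < self.block_size {
            return None;
        }
        let mut hasher = DefaultHasher::new();
        // Assumption: chain in the parent hash so the hash identifies the whole prefix.
        self.full_blocks.last().hash(&mut hasher);
        self.partial.hash(&mut hasher);
        let hash = hasher.finish();
        self.full_blocks.push(hash);
        self.partial.clear();
        Some(hash)
    }

    pub fn full_blocks(&self) -> &[u64] {
        &self.full_blocks
    }

    pub fn partial_len(&self) -> usize {
        self.partial.len()
    }
}
```

Two sequences that share their first tokens produce the same hash for the shared blocks, which is what allows the second request to reuse blocks cached by the first.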
### Performance Model
The mocker supports two timing prediction modes:
**Polynomial Model (Default):** Uses hardcoded polynomial formulas that approximate typical GPU behavior. Prefill time scales quadratically with token count, while decode time depends on the total active KV cache size.
**Interpolated Model:** Loads actual profiling data from an NPZ file containing measured prefill and decode latencies. The mocker interpolates between data points to predict timing for any input size. This enables high-fidelity simulation matching a specific hardware configuration.
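Both modes reduce to evaluating a function of the input size. The sketch below shows a generic polynomial evaluator and a clamped linear interpolation over sorted sample points. The function names are hypothetical, the coefficients and sample values used with them are placeholders instead of the mocker's constants, linear interpolation is an assumption about the method, and loading the NPZ file is out of scope here.

```rust
/// Evaluate a polynomial with coefficients ordered from the constant term up,
/// using Horner's rule. For a quadratic prefill model, pass three coefficients.
pub fn poly(coeffs: &[f64], x: f64) -> f64 {
    coeffs.iter().rev().fold(0.0, |acc, c| acc * x + *c)
}

/// Linearly interpolate a latency from measured `(size, latency)` points.
/// `points` must be sorted by size with no duplicate sizes. Inputs outside
/// the measured range are clamped to the nearest endpoint.
pub fn interpolate(points: &[(f64, f64)], x: f64) -> Option<f64> {
    let (first, last) = (points.first()?, points.last()?);
    if x <= first.0 {
        return Some(first.1);
    }
    if x >= last.0 {
        return Some(last.1);
    }
    // Index of the first sample whose size is strictly greater than x.
    let hi = points.partition_point(|p| p.0 <= x);
    let (x0, y0) = points[hi - 1];
    let (x1, y1) = points[hi];
    Some(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
}
```

For example, with samples at sizes 0, 100, and 200 and latencies 0, 10, and 40, a query at size 150 falls halfway between the last two samples and yields 25.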
### Bootstrap Rendezvous (Disaggregated Serving)
For disaggregated prefill/decode deployments, prefill and decode workers coordinate via a simple TCP-based rendezvous protocol. The decode worker connects to the prefill worker's bootstrap port and waits until the prefill phase completes and KV cache is ready. Either side can arrive first—the rendezvous completes when both are ready.
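The order-independent matching can be modeled without any networking. The sketch below tracks which side has arrived for each request and reports completion when the other side shows up. It illustrates the rendezvous logic only: the `Rendezvous` and `Side` types are hypothetical, and the TCP transport and wire format of the real protocol are not shown.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Side {
    Prefill, // prefill finished and the KV cache is ready
    Decode,  // decode worker connected and is waiting
}

/// Hypothetical in-memory model of the rendezvous state, keyed by request id.
pub struct Rendezvous {
    pending: HashMap<u64, Side>,
}

impl Rendezvous {
    pub fn new() -> Self {
        Self { pending: HashMap::new() }
    }

    /// Record that `side` has arrived for `request_id`.
    /// Returns true when the opposite side was already waiting, which
    /// completes the rendezvous; otherwise the arrival is stored.
    pub fn arrive(&mut self, request_id: u64, side: Side) -> bool {
        let waiting = self.pending.get(&request_id).copied();
        match waiting {
            Some(first) if first != side => {
                self.pending.remove(&request_id);
                true
            }
            _ => {
                self.pending.insert(request_id, side);
                false
            }
        }
    }

    pub fn pending_count(&self) -> usize {
        self.pending.len()
    }
}
```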
## Integration with Dynamo
### KV Event Publishing
When prefix caching is enabled, the mocker publishes KV cache events to the distributed runtime. These events notify the system when blocks are stored (new content cached) or removed (evicted). This enables the KV-aware router to make intelligent routing decisions based on which workers have which prefixes cached.
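To show how a consumer might use these events, the sketch below applies stored and removed events to a per-block index and scores workers by how many leading blocks of a request they already hold. The `KvEvent` and `PrefixIndex` types are hypothetical and do not reflect Dynamo's actual event schema or router implementation.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical event shape: which worker stored or removed which blocks.
pub enum KvEvent {
    Stored { worker: u32, block_hashes: Vec<u64> },
    Removed { worker: u32, block_hashes: Vec<u64> },
}

/// Hypothetical router-side index: block hash -> workers that cache it.
pub struct PrefixIndex {
    holders: HashMap<u64, HashSet<u32>>,
}

impl PrefixIndex {
    pub fn new() -> Self {
        Self { holders: HashMap::new() }
    }

    pub fn apply(&mut self, event: KvEvent) {
        match event {
            KvEvent::Stored { worker, block_hashes } => {
                for hash in block_hashes {
                    self.holders.entry(hash).or_default().insert(worker);
                }
            }
            KvEvent::Removed { worker, block_hashes } => {
                for hash in block_hashes {
                    let now_empty = match self.holders.get_mut(&hash) {
                        Some(set) => {
                            set.remove(&worker);
                            set.is_empty()
                        }
                        None => false,
                    };
                    if now_empty {
                        self.holders.remove(&hash);
                    }
                }
            }
        }
    }

    /// Number of leading request blocks that `worker` already has cached.
    pub fn overlap(&self, worker: u32, request_blocks: &[u64]) -> usize {
        request_blocks
            .iter()
            .take_while(|h| self.holders.get(*h).map_or(false, |w| w.contains(&worker)))
            .count()
    }

    /// Pick the worker with the longest cached prefix for this request.
    pub fn best_worker(&self, workers: &[u32], request_blocks: &[u64]) -> Option<u32> {
        workers.iter().copied().max_by_key(|w| self.overlap(*w, request_blocks))
    }
}
```

Because the overlap counts only an unbroken run of leading blocks, evicting a block in the middle of a prefix shortens the usable match for that worker.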
### Metrics Publishing
Each scheduler publishes metrics about its current state, including the number of active decode blocks per DP rank. The router uses these metrics for load-aware routing decisions.
## Testing Scenarios
The mocker is particularly useful for:
1. **Router Testing** - Validate KV-aware routing without GPUs
2. **Planner Testing** - Test SLA-based planners with realistic timing
3. **Fault Tolerance** - Test request migration, graceful shutdown
4. **Disaggregation** - Test P/D separation and KV transfer coordination
The following features are not yet supported by the mocker:
- **KV transfer latency simulation** - Disaggregated serving simulates the rendezvous handshake but does not model the actual KV cache transfer time between prefill and decode workers
- **Multi-tier memory** - No support for offloading KV cache to CPU/disk or onboarding back to GPU; potential future integration with KVBM
- **Multimodal support** - Currently only simulates text token processing; no vision encoder or cross-attention simulation
- **Native Rust reference counting** - Work in progress to use native Rc/Arc for block reference counting, enabling natural RAII patterns for simpler tracking