@@ -66,6 +66,24 @@ First, start the vLLM worker engines in a terminal.
--tensor-parallel-size 2
```
#### Prefill Workers
You can also launch separate decode and prefill workers for disaggregated serving. This allows you to dedicate specific GPUs to prefill (prompt processing) and decode (token generation) tasks:
We also supports running lightweight mock engines that simulate vLLM behavior without performing actual model inference. Mocker engines are useful for testing router logic and performance without GPU requirements. Use the `--mockers` flag to run mocker engines instead of real vLLM workers.
For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/architecture/kv_cache_routing.md).
#### Launching a Prefill Router (Optional)
If you're using disaggregated serving with separate prefill and decode workers, you should also launch a prefill router. The prefill router handles routing prefill requests to dedicated prefill workers. When using a prefill router, it's recommended to start the frontend (decode router) with `--kv-overlap-score-weight 0` for pure load balancing (as prefix-aware routing is now handled by the prefill router):
```bash
# Start the decode router with pure load balancing
python -m dynamo.frontend \
--router-mode kv \
--kv-cache-block-size 64 \
--router-reset-states\
--http-port 8000 \
--kv-overlap-score-weight 0
# In another terminal, start the prefill router (currently only supports vLLM)
python -m dynamo.vllm_prefill_router \
--namespace dynamo \
--block-size 64
```
The prefill router will automatically coordinate with the decode router to handle request routing between prefill and decode workers.
**Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead: