@@ -190,17 +190,23 @@ For vLLM multinode deployments, the operator automatically selects and configure
...
@@ -190,17 +190,23 @@ For vLLM multinode deployments, the operator automatically selects and configure
The operator automatically determines the deployment mode based on your parallelism configuration:
The operator automatically determines the deployment mode based on your parallelism configuration:
**1. Ray-Based Mode (Tensor/Pipeline Parallelism across nodes)**
**1. Tensor/Pipeline Parallelism Mode (Single model across nodes)**
-**When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
-**When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
-**Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
-**Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
**Leader Node:**
**Leader Node:**
-**Ray Head**: The operator prepends `ray start --head --port=6379` to your existing command
-**Probes**: All probes (liveness, readiness, startup) are automatically removed since workers don't expose health endpoints
-**Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
-**Probes**: All probes (liveness, readiness, startup) are automatically removed
> **Note**: vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
**2. Data Parallel Mode (Multiple model instances across nodes)**
**2. Data Parallel Mode (Multiple model instances across nodes)**
-**When used**: When `world_size × data_parallel_size > GPUs_per_node`
-**When used**: When `world_size × data_parallel_size > GPUs_per_node`
...
@@ -216,6 +222,18 @@ The operator automatically determines the deployment mode based on your parallel
...
@@ -216,6 +222,18 @@ The operator automatically determines the deployment mode based on your parallel
**Note**: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers)
**Note**: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers)
#### Why Ray for Multi-Node TP/PP?
vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments:
-**Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything.
-**mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate.
The Dynamo operator uses Ray because:
1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`)
2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents
3. vLLM automatically handles placement group creation and worker management
#### Compilation Cache Support
#### Compilation Cache Support
When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
-**`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
-**`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.
**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.
## V
## V
**vLLM** - High-throughput LLM serving engine with Ray distributed support and PagedAttention.
**vLLM** - High-throughput LLM serving engine with distributed tensor/pipeline parallelism and PagedAttention.
## W
## W
**Wide Expert Parallelism (WideEP)** - Mixture-of-Experts deployment strategy that spreads experts across many GPUs (e.g., 64-way EP) so each GPU hosts only a few experts.
**Wide Expert Parallelism (WideEP)** - Mixture-of-Experts deployment strategy that spreads experts across many GPUs (e.g., 64-way EP) so each GPU hosts only a few experts.