Unverified commit 7893f268, authored by Alec, committed by GitHub

feat: add --disaggregation-mode enum to vLLM backend (#6483)


Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
parent 6d3e0137
...@@ -55,4 +55,5 @@ spec: ...@@ -55,4 +55,5 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-0.6B - Qwen/Qwen3-0.6B
- --is-prefill-worker - --disaggregation-mode
- prefill
...@@ -194,7 +194,7 @@ The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml` ...@@ -194,7 +194,7 @@ The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml`
- Check deployment mode and request routing configuration - Check deployment mode and request routing configuration
### KV Cache metrics only showing decode workers: ### KV Cache metrics only showing decode workers:
**Important Limitation**: In disaggregated mode, prefill workers (`--is-prefill-worker`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these. **Important Limitation**: In disaggregated mode, prefill workers (`--disaggregation-mode prefill`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
**Why this happens:** **Why this happens:**
- Prefill workers transfer KV cache to decode workers via NIXL - Prefill workers transfer KV cache to decode workers via NIXL
......
...@@ -159,7 +159,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo ...@@ -159,7 +159,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo
vLLM workers are configured through command-line arguments. Key parameters include: vLLM workers are configured through command-line arguments. Key parameters include:
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`) - `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving - `--disaggregation-mode <mode>`: Worker role for disaggregated serving. Accepted values: `prefill`, `decode`, `agg` (default)
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo - `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
- `--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig. - `--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig.
- `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled) - `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled)
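
The `--disaggregation-mode <mode>` bullet above describes a single flag with three accepted values and a default of `agg`. As an illustration only, here is a minimal sketch of how such an enum-valued flag can be declared with Python's standard `argparse`. The `DisaggregationMode` class, the `build_parser` helper, and the `worker-sketch` program name are hypothetical and are not taken from the Dynamo source; only the flag name and its three values come from the text above.

```python
import argparse
from enum import Enum


class DisaggregationMode(str, Enum):
    """Hypothetical enum mirroring the three documented values."""

    AGG = "agg"
    PREFILL = "prefill"
    DECODE = "decode"


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser; this is NOT the real dynamo.vllm parser."""
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument("--model", required=True)
    parser.add_argument(
        "--disaggregation-mode",
        type=DisaggregationMode,          # converts "prefill" -> DisaggregationMode.PREFILL
        choices=list(DisaggregationMode),  # rejects anything outside the enum
        default=DisaggregationMode.AGG,    # aggregated mode when the flag is omitted
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model", "Qwen/Qwen3-0.6B", "--disaggregation-mode", "prefill"]
    )
    print(args.disaggregation_mode.value)
```

One flag with a closed set of values makes contradictory combinations (for example, a worker marked as both prefill and decode) unrepresentable, which two independent boolean flags cannot guarantee.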
......
...@@ -46,7 +46,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m dynamo.vllm \ ...@@ -46,7 +46,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m dynamo.vllm \
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \ CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
--model openai/gpt-oss-120b \ --model openai/gpt-oss-120b \
--tensor-parallel-size 4 \ --tensor-parallel-size 4 \
--is-prefill-worker \ --disaggregation-mode prefill \
--dyn-reasoning-parser gpt_oss \ --dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony --dyn-tool-call-parser harmony
``` ```
......
...@@ -84,7 +84,7 @@ python -m dynamo.vllm \ ...@@ -84,7 +84,7 @@ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \ --model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager \ --enforce-eager \
--is-decode-worker --disaggregation-mode decode
``` ```
**Node 2**: Run prefill worker **Node 2**: Run prefill worker
...@@ -94,7 +94,7 @@ python -m dynamo.vllm \ ...@@ -94,7 +94,7 @@ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \ --model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager \ --enforce-eager \
--is-prefill-worker --disaggregation-mode prefill
``` ```
### Multi-node Tensor/Pipeline Parallelism ### Multi-node Tensor/Pipeline Parallelism
......
...@@ -506,7 +506,8 @@ spec: ...@@ -506,7 +506,8 @@ spec:
- "fp8" - "fp8"
- "--max-num-seqs" - "--max-num-seqs"
- "1" # Prefill workers use batch size 1 - "1" # Prefill workers use batch size 1
- --is-prefill-worker - --disaggregation-mode
- prefill
VLLMDecodeWorker: VLLMDecodeWorker:
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
...@@ -553,7 +554,8 @@ spec: ...@@ -553,7 +554,8 @@ spec:
- "fp8" - "fp8"
- "--max-num-seqs" - "--max-num-seqs"
- "1024" # Decode workers handle high concurrency - "1024" # Decode workers handle high concurrency
- --is-decode-worker - --disaggregation-mode
- decode
``` ```
**Critical RDMA settings:** **Critical RDMA settings:**
......
...@@ -47,7 +47,7 @@ vLLM supports all multimodal deployment patterns. See [Architecture Patterns](RE ...@@ -47,7 +47,7 @@ vLLM supports all multimodal deployment patterns. See [Architecture Patterns](RE
| Processor | `--multimodal-processor` | HTTP entry, tokenization | | Processor | `--multimodal-processor` | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | Media encoding | | Encode Worker | `--multimodal-encode-worker` | Media encoding |
| PD Worker | `--multimodal-worker` | Prefill + Decode | | PD Worker | `--multimodal-worker` | Prefill + Decode |
| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only | | Prefill Worker | `--multimodal-worker --disaggregation-mode prefill` | Prefill only |
| Decode Worker | `--multimodal-decode-worker` | Decode only | | Decode Worker | `--multimodal-decode-worker` | Decode only |
## Use the Latest Release ## Use the Latest Release
......
...@@ -107,7 +107,7 @@ FLEXKV_CPU_CACHE_GB=32 \ ...@@ -107,7 +107,7 @@ FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \ CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \ python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \ --model Qwen/Qwen3-0.6B \
--is-prefill-worker \ --disaggregation-mode prefill \
--connector nixl flexkv --connector nixl flexkv
``` ```
......
...@@ -88,7 +88,7 @@ This will: ...@@ -88,7 +88,7 @@ This will:
- **Purpose**: Handles prompt processing (prefill phase) - **Purpose**: Handles prompt processing (prefill phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=1 - **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for KV offloading and use NIXL for KV transfer between prefill and decode workers. - **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for KV offloading and use NIXL for KV transfer between prefill and decode workers.
- **Flag**: `--is-prefill-worker` - **Flag**: `--disaggregation-mode prefill`
## Architecture ## Architecture
......
...@@ -219,7 +219,7 @@ Key customization points include: ...@@ -219,7 +219,7 @@ Key customization points include:
- **Resource Allocation**: Configure GPU requirements under `resources.limits` - **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for number of worker instances - **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs - **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers - **Worker Specialization**: Add `--disaggregation-mode prefill` flag for disaggregated prefill workers
## Additional Resources ## Additional Resources
......
...@@ -186,7 +186,8 @@ Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MO ...@@ -186,7 +186,8 @@ Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MO
```yaml ```yaml
args: args:
- --is-prefill-worker # For disaggregated prefill workers - --disaggregation-mode
- prefill # For disaggregated prefill workers
``` ```
### Image Pull Secret Configuration ### Image Pull Secret Configuration
......
...@@ -50,13 +50,13 @@ python -m dynamo.mocker \ ...@@ -50,13 +50,13 @@ python -m dynamo.mocker \
# Launch prefill worker # Launch prefill worker
python -m dynamo.mocker \ python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \ --model-path Qwen/Qwen3-0.6B \
--is-prefill-worker \ --disaggregation-mode prefill \
--bootstrap-ports 50100 --bootstrap-ports 50100
# Launch decode worker (in another terminal) # Launch decode worker (in another terminal)
python -m dynamo.mocker \ python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \ --model-path Qwen/Qwen3-0.6B \
--is-decode-worker --disaggregation-mode decode
``` ```
### Multiple Workers in One Process ### Multiple Workers in One Process
...@@ -88,8 +88,7 @@ python -m dynamo.mocker \ ...@@ -88,8 +88,7 @@ python -m dynamo.mocker \
| `--planner-profile-data` | None | Path to NPZ file with timing data | | `--planner-profile-data` | None | Path to NPZ file with timing data |
| `--num-workers` | 1 | Workers per process | | `--num-workers` | 1 | Workers per process |
| `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode | | `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
| `--is-prefill-worker` | False | Prefill-only mode | | `--disaggregation-mode` | `agg` | Worker mode: `agg` (aggregated), `prefill`, or `decode` |
| `--is-decode-worker` | False | Decode-only mode |
| `--durable-kv-events` | False | Enable durable KV events via JetStream (disables local indexer) | | `--durable-kv-events` | False | Enable durable KV events via JetStream (disables local indexer) |
| `--bootstrap-ports` | None | Ports for P/D rendezvous | | `--bootstrap-ports` | None | Ports for P/D rendezvous |
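
The table change above replaces the two boolean flags with one `--disaggregation-mode` option. For anyone updating existing launch scripts or manifest `args` lists, the rewrite is mechanical. The sketch below is a hypothetical helper (the `migrate_args` name and the `LEGACY_TO_MODE` mapping are assumptions for illustration, not part of this commit) that applies the same substitution the diffs in this commit make by hand.

```python
# Hypothetical mapping from the removed boolean flags to the new enum value.
LEGACY_TO_MODE = {
    "--is-prefill-worker": "prefill",
    "--is-decode-worker": "decode",
}


def migrate_args(args: list[str]) -> list[str]:
    """Return a copy of args with legacy flags rewritten to --disaggregation-mode."""
    migrated: list[str] = []
    for arg in args:
        mode = LEGACY_TO_MODE.get(arg)
        if mode is None:
            # Not a legacy flag: keep the argument unchanged.
            migrated.append(arg)
        else:
            # A legacy boolean flag becomes the new flag plus its value.
            migrated.extend(["--disaggregation-mode", mode])
    return migrated


if __name__ == "__main__":
    old = ["--model-path", "Qwen/Qwen3-0.6B", "--is-prefill-worker", "--bootstrap-ports", "50100"]
    print(migrate_args(old))
```

Arguments that carried neither legacy flag are left untouched, which corresponds to the default `agg` mode listed in the table.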
......
...@@ -105,7 +105,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \ ...@@ -105,7 +105,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \ --model Qwen/Qwen3-0.6B \
--enforce-eager \ --enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \ --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
--is-prefill-worker \ --disaggregation-mode prefill \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' & --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
``` ```
......
...@@ -35,7 +35,8 @@ spec: ...@@ -35,7 +35,8 @@ spec:
- "1.0" - "1.0"
- --planner-profile-data - --planner-profile-data
- /workspace/tests/planner/profiling_results/H200_TP1P_TP1D - /workspace/tests/planner/profiling_results/H200_TP1P_TP1D
- --is-prefill-worker - --disaggregation-mode
- prefill
decode: decode:
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
componentType: worker componentType: worker
...@@ -58,4 +59,5 @@ spec: ...@@ -58,4 +59,5 @@ spec:
- "1.0" - "1.0"
- --planner-profile-data - --planner-profile-data
- /workspace/tests/planner/profiling_results/H200_TP1P_TP1D - /workspace/tests/planner/profiling_results/H200_TP1P_TP1D
- --is-decode-worker - --disaggregation-mode
\ No newline at end of file - decode
\ No newline at end of file
...@@ -24,7 +24,7 @@ High-performance deployment with separated prefill and decode workers. ...@@ -24,7 +24,7 @@ High-performance deployment with separated prefill and decode workers.
**Architecture:** **Architecture:**
- `Frontend`: HTTP API server coordinating between workers - `Frontend`: HTTP API server coordinating between workers
- `VLLMDecodeWorker`: Specialized decode-only worker - `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`) - `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via NIXL transfer backend - Communication via NIXL transfer backend
### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`) ### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
...@@ -33,7 +33,7 @@ Advanced disaggregated deployment with KV cache routing capabilities. ...@@ -33,7 +33,7 @@ Advanced disaggregated deployment with KV cache routing capabilities.
**Architecture:** **Architecture:**
- `Frontend`: HTTP API server with KV-aware routing - `Frontend`: HTTP API server with KV-aware routing
- `VLLMDecodeWorker`: Specialized decode-only worker - `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`) - `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
## CRD Structure ## CRD Structure
...@@ -85,7 +85,7 @@ extraPodSpec: ...@@ -85,7 +85,7 @@ extraPodSpec:
**Common vLLM Flags:** **Common vLLM Flags:**
- `--enable-prompt-embeds`: Enable prompt embeddings feature - `--enable-prompt-embeds`: Enable prompt embeddings feature
- `--enable-multimodal`: Enable multimodal (vision) support - `--enable-multimodal`: Enable multimodal (vision) support
- `--is-prefill-worker`: Prefill-only mode for disaggregated serving - `--disaggregation-mode prefill`: Prefill-only mode for disaggregated serving
- `--connector [nixl|lmcache|kvbm|none]`: KV transfer backend selection - `--connector [nixl|lmcache|kvbm|none]`: KV transfer backend selection
## Prerequisites ## Prerequisites
...@@ -198,7 +198,7 @@ spec: ...@@ -198,7 +198,7 @@ spec:
vLLM workers are configured through command-line arguments. Key parameters include: vLLM workers are configured through command-line arguments. Key parameters include:
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`) - `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving - `--disaggregation-mode prefill`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo - `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the full list of configuration options. See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the full list of configuration options.
......
...@@ -71,10 +71,10 @@ spec: ...@@ -71,10 +71,10 @@ spec:
- Qwen/Qwen3-0.6B - Qwen/Qwen3-0.6B
- --tensor-parallel-size - --tensor-parallel-size
- "2" - "2"
- --is-prefill-worker - --disaggregation-mode
- prefill
- --connector - --connector
- none - none
- --kv-transfer-config - --kv-transfer-config
- '{"kv_connector": "NixlConnector", "kv_role": "kv_both", "engine_id": - '{"kv_connector": "NixlConnector", "kv_role": "kv_both", "engine_id":
"vllm-disagg-prefill-engine-0abc123"}' "vllm-disagg-prefill-engine-0abc123"}'
...@@ -32,7 +32,8 @@ spec: ...@@ -32,7 +32,8 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-0.6B - Qwen/Qwen3-0.6B
- --is-decode-worker - --disaggregation-mode
- decode
VllmPrefillWorker: VllmPrefillWorker:
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
componentType: worker componentType: worker
...@@ -52,4 +53,5 @@ spec: ...@@ -52,4 +53,5 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-0.6B - Qwen/Qwen3-0.6B
- --is-prefill-worker - --disaggregation-mode
- prefill
...@@ -31,6 +31,8 @@ spec: ...@@ -31,6 +31,8 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-8B - Qwen/Qwen3-8B
- --disaggregation-mode
- decode
- --max-model-len - --max-model-len
- "32000" - "32000"
- --enforce-eager - --enforce-eager
...@@ -59,7 +61,8 @@ spec: ...@@ -59,7 +61,8 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-8B - Qwen/Qwen3-8B
- --is-prefill-worker - --disaggregation-mode
- prefill
- --max-model-len - --max-model-len
- "32000" - "32000"
- --enforce-eager - --enforce-eager
......
...@@ -31,6 +31,8 @@ spec: ...@@ -31,6 +31,8 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-8B - Qwen/Qwen3-8B
- --disaggregation-mode
- decode
- --max-model-len - --max-model-len
- "32000" - "32000"
- --enforce-eager - --enforce-eager
...@@ -59,7 +61,8 @@ spec: ...@@ -59,7 +61,8 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-8B - Qwen/Qwen3-8B
- --is-prefill-worker - --disaggregation-mode
- prefill
- --max-model-len - --max-model-len
- "32000" - "32000"
- --enforce-eager - --enforce-eager
......
...@@ -33,6 +33,8 @@ spec: ...@@ -33,6 +33,8 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-8B - Qwen/Qwen3-8B
- --disaggregation-mode
- decode
- --gpu-memory-utilization - --gpu-memory-utilization
- "0.23" - "0.23"
- --max-model-len - --max-model-len
...@@ -65,7 +67,8 @@ spec: ...@@ -65,7 +67,8 @@ spec:
args: args:
- --model - --model
- Qwen/Qwen3-8B - Qwen/Qwen3-8B
- --is-prefill-worker - --disaggregation-mode
- prefill
- --gpu-memory-utilization - --gpu-memory-utilization
- "0.23" - "0.23"
- --max-model-len - --max-model-len
......