Commit 7893f268, authored by Alec, committed by GitHub

feat: add --disaggregation-mode enum to vLLM backend (#6483)


Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
parent 6d3e0137
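For orientation, here is a minimal sketch of how a single enum-valued flag of this kind could be parsed. It is not the backend's actual implementation: the `DisaggregationMode` class, `build_parser`, and `parse_mode` are hypothetical names, and only the flag name, the three accepted values (`agg`, `prefill`, `decode`), and the `agg` default are taken from the diff below.

```python
import argparse
from enum import Enum


class DisaggregationMode(str, Enum):
    """Hypothetical enum mirroring the values documented in this commit."""

    AGG = "agg"          # aggregated: prefill and decode in one worker (default)
    PREFILL = "prefill"  # prefill-only worker
    DECODE = "decode"    # decode-only worker


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser only; the real CLI defines many more arguments.
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument(
        "--disaggregation-mode",
        choices=[mode.value for mode in DisaggregationMode],
        default=DisaggregationMode.AGG.value,
        help="Worker role for disaggregated serving",
    )
    return parser


def parse_mode(argv):
    """Parse argv and return the selected mode as an enum member."""
    namespace = build_parser().parse_args(argv)
    return DisaggregationMode(namespace.disaggregation_mode)


if __name__ == "__main__":
    print(parse_mode(["--disaggregation-mode", "prefill"]))
```

One design point this illustrates: with a single `choices`-restricted flag, the parser rejects unknown values by itself, and a worker cannot be given two conflicting roles the way two independent boolean flags would allow.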
......@@ -55,4 +55,5 @@ spec:
args:
- --model
- Qwen/Qwen3-0.6B
- --is-prefill-worker
- --disaggregation-mode
- prefill
......@@ -194,7 +194,7 @@ The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml`
- Check deployment mode and request routing configuration
### KV Cache metrics only showing decode workers:
**Important Limitation**: In disaggregated mode, prefill workers (`--is-prefill-worker`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
**Important Limitation**: In disaggregated mode, prefill workers (`--disaggregation-mode prefill`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
**Why this happens:**
- Prefill workers transfer KV cache to decode workers via NIXL
......
......@@ -159,7 +159,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo
vLLM workers are configured through command-line arguments. Key parameters include:
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--disaggregation-mode <mode>`: Worker role for disaggregated serving. Accepted values: `prefill`, `decode`, `agg` (default)
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
- `--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig.
- `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled)
......
......@@ -46,7 +46,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m dynamo.vllm \
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
--model openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--is-prefill-worker \
--disaggregation-mode prefill \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
......
......@@ -84,7 +84,7 @@ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager \
--is-decode-worker
--disaggregation-mode decode
```
**Node 2**: Run prefill worker
......@@ -94,7 +94,7 @@ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager \
--is-prefill-worker
--disaggregation-mode prefill
```
### Multi-node Tensor/Pipeline Parallelism
......
......@@ -506,7 +506,8 @@ spec:
- "fp8"
- "--max-num-seqs"
- "1" # Prefill workers use batch size 1
- --is-prefill-worker
- --disaggregation-mode
- prefill
VLLMDecodeWorker:
envFromSecret: hf-token-secret
......@@ -553,7 +554,8 @@ spec:
- "fp8"
- "--max-num-seqs"
- "1024" # Decode workers handle high concurrency
- --is-decode-worker
- --disaggregation-mode
- decode
```
**Critical RDMA settings:**
......
......@@ -47,7 +47,7 @@ vLLM supports all multimodal deployment patterns. See [Architecture Patterns](RE
| Processor | `--multimodal-processor` | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | Media encoding |
| PD Worker | `--multimodal-worker` | Prefill + Decode |
| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only |
| Prefill Worker | `--multimodal-worker --disaggregation-mode prefill` | Prefill only |
| Decode Worker | `--multimodal-decode-worker` | Decode only |
## Use the Latest Release
......
......@@ -107,7 +107,7 @@ FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--is-prefill-worker \
--disaggregation-mode prefill \
--connector nixl flexkv
```
......
......@@ -88,7 +88,7 @@ This will:
- **Purpose**: Handles prompt processing (prefill phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for KV offloading and use NIXL for KV transfer between prefill and decode workers.
- **Flag**: `--is-prefill-worker`
- **Flag**: `--disaggregation-mode prefill`
## Architecture
......
......@@ -219,7 +219,7 @@ Key customization points include:
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
- **Worker Specialization**: Add `--disaggregation-mode prefill` flag for disaggregated prefill workers
## Additional Resources
......
......@@ -186,7 +186,8 @@ Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MO
```yaml
args:
- --is-prefill-worker # For disaggregated prefill workers
- --disaggregation-mode
- prefill # For disaggregated prefill workers
```
### Image Pull Secret Configuration
......
......@@ -50,13 +50,13 @@ python -m dynamo.mocker \
# Launch prefill worker
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--is-prefill-worker \
--disaggregation-mode prefill \
--bootstrap-ports 50100
# Launch decode worker (in another terminal)
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--is-decode-worker
--disaggregation-mode decode
```
### Multiple Workers in One Process
......@@ -88,8 +88,7 @@ python -m dynamo.mocker \
| `--planner-profile-data` | None | Path to NPZ file with timing data |
| `--num-workers` | 1 | Workers per process |
| `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
| `--is-prefill-worker` | False | Prefill-only mode |
| `--is-decode-worker` | False | Decode-only mode |
| `--disaggregation-mode` | `agg` | Worker mode: `agg` (aggregated), `prefill`, or `decode` |
| `--durable-kv-events` | False | Enable durable KV events via JetStream (disables local indexer) |
| `--bootstrap-ports` | None | Ports for P/D rendezvous |
......
......@@ -105,7 +105,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
--is-prefill-worker \
--disaggregation-mode prefill \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
```
......
......@@ -35,7 +35,8 @@ spec:
- "1.0"
- --planner-profile-data
- /workspace/tests/planner/profiling_results/H200_TP1P_TP1D
- --is-prefill-worker
- --disaggregation-mode
- prefill
decode:
envFromSecret: hf-token-secret
componentType: worker
......@@ -58,4 +59,5 @@ spec:
- "1.0"
- --planner-profile-data
- /workspace/tests/planner/profiling_results/H200_TP1P_TP1D
- --is-decode-worker
\ No newline at end of file
- --disaggregation-mode
- decode
\ No newline at end of file
......@@ -24,7 +24,7 @@ High-performance deployment with separated prefill and decode workers.
**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`)
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via NIXL transfer backend
### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
......@@ -33,7 +33,7 @@ Advanced disaggregated deployment with KV cache routing capabilities.
**Architecture:**
- `Frontend`: HTTP API server with KV-aware routing
- `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`)
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
## CRD Structure
......@@ -85,7 +85,7 @@ extraPodSpec:
**Common vLLM Flags:**
- `--enable-prompt-embeds`: Enable prompt embeddings feature
- `--enable-multimodal`: Enable multimodal (vision) support
- `--is-prefill-worker`: Prefill-only mode for disaggregated serving
- `--disaggregation-mode prefill`: Prefill-only mode for disaggregated serving
- `--connector [nixl|lmcache|kvbm|none]`: KV transfer backend selection
## Prerequisites
......@@ -198,7 +198,7 @@ spec:
vLLM workers are configured through command-line arguments. Key parameters include:
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--disaggregation-mode prefill`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the full list of configuration options.
......
......@@ -71,10 +71,10 @@ spec:
- Qwen/Qwen3-0.6B
- --tensor-parallel-size
- "2"
- --is-prefill-worker
- --disaggregation-mode
- prefill
- --connector
- none
- --kv-transfer-config
- '{"kv_connector": "NixlConnector", "kv_role": "kv_both", "engine_id":
"vllm-disagg-prefill-engine-0abc123"}'
......@@ -32,7 +32,8 @@ spec:
args:
- --model
- Qwen/Qwen3-0.6B
- --is-decode-worker
- --disaggregation-mode
- decode
VllmPrefillWorker:
envFromSecret: hf-token-secret
componentType: worker
......@@ -52,4 +53,5 @@ spec:
args:
- --model
- Qwen/Qwen3-0.6B
- --is-prefill-worker
- --disaggregation-mode
- prefill
......@@ -31,6 +31,8 @@ spec:
args:
- --model
- Qwen/Qwen3-8B
- --disaggregation-mode
- decode
- --max-model-len
- "32000"
- --enforce-eager
......@@ -59,7 +61,8 @@ spec:
args:
- --model
- Qwen/Qwen3-8B
- --is-prefill-worker
- --disaggregation-mode
- prefill
- --max-model-len
- "32000"
- --enforce-eager
......
......@@ -31,6 +31,8 @@ spec:
args:
- --model
- Qwen/Qwen3-8B
- --disaggregation-mode
- decode
- --max-model-len
- "32000"
- --enforce-eager
......@@ -59,7 +61,8 @@ spec:
args:
- --model
- Qwen/Qwen3-8B
- --is-prefill-worker
- --disaggregation-mode
- prefill
- --max-model-len
- "32000"
- --enforce-eager
......
......@@ -33,6 +33,8 @@ spec:
args:
- --model
- Qwen/Qwen3-8B
- --disaggregation-mode
- decode
- --gpu-memory-utilization
- "0.23"
- --max-model-len
......@@ -65,7 +67,8 @@ spec:
args:
- --model
- Qwen/Qwen3-8B
- --is-prefill-worker
- --disaggregation-mode
- prefill
- --gpu-memory-utilization
- "0.23"
- --max-model-len
......
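The substitution applied throughout this diff (each `--is-prefill-worker` or `--is-decode-worker` entry becomes the two tokens `--disaggregation-mode` and a mode value) can be sketched as a small helper for updating argument lists in your own manifests or launch scripts. This is an illustrative assumption, not a tool shipped with the commit: `LEGACY_FLAGS` and `migrate_args` are hypothetical names, and the diff does not say whether the old flags are still accepted.

```python
from __future__ import annotations

# Hypothetical mapping from the removed boolean flags to the new mode values,
# as seen in the before/after lines of this diff.
LEGACY_FLAGS = {
    "--is-prefill-worker": "prefill",
    "--is-decode-worker": "decode",
}


def migrate_args(args: list[str]) -> list[str]:
    """Return a copy of args with legacy role flags rewritten.

    Each legacy flag is replaced by two tokens: the new flag and its value.
    All other arguments are passed through unchanged and in order.
    """
    migrated: list[str] = []
    for arg in args:
        mode = LEGACY_FLAGS.get(arg)
        if mode is None:
            migrated.append(arg)
        else:
            migrated.extend(["--disaggregation-mode", mode])
    return migrated


if __name__ == "__main__":
    old = ["--model", "Qwen/Qwen3-0.6B", "--is-prefill-worker"]
    print(migrate_args(old))
```

Note that the helper does not add `--disaggregation-mode decode` to argument lists that had no role flag at all; several decode workers in the manifests above gained that entry as a new line, which still has to be added by hand.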