feat: add --disaggregation-mode enum to vLLM backend (#6483)

Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Unverified commit 7893f268, authored Feb 23, 2026 by Alec; committed by GitHub Feb 23, 2026
20 changed files
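Every hunk below replaces the boolean flags `--is-prefill-worker` and `--is-decode-worker` with a single `--disaggregation-mode` option that accepts `agg` (default), `prefill`, or `decode`. As a rough illustration of that pattern, here is a minimal sketch of an enum-valued CLI option built with Python's standard `argparse`. It is not the code from this commit: `DisaggregationMode`, `build_parser`, and `parse_mode` are hypothetical names chosen for the example, and the actual backend parser may be structured differently.

```python
import argparse
from enum import Enum


class DisaggregationMode(Enum):
    """Hypothetical enum mirroring the three values named in the docs."""
    AGG = "agg"
    PREFILL = "prefill"
    DECODE = "decode"


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser only; the program name is made up.
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument(
        "--disaggregation-mode",
        choices=[mode.value for mode in DisaggregationMode],
        default=DisaggregationMode.AGG.value,
        help="Worker role: agg (default), prefill, or decode",
    )
    return parser


def parse_mode(argv: list[str]) -> DisaggregationMode:
    # argparse rejects values outside `choices`, so the enum lookup is safe.
    args = build_parser().parse_args(argv)
    return DisaggregationMode(args.disaggregation_mode)


if __name__ == "__main__":
    print(parse_mode(["--disaggregation-mode", "prefill"]))
```

A single option with a closed set of values makes the three roles mutually exclusive by construction, whereas two independent boolean flags could both be set at once.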
--- a/deploy/discovery/dgd.yaml
+++ b/deploy/discovery/dgd.yaml
@@ -55,4 +55,5 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-0.6B
-            - --is-prefill-worker
+            - --disaggregation-mode
+            - prefill
--- a/deploy/observability/grafana_dashboards/DASHBOARD_METRICS.md
+++ b/deploy/observability/grafana_dashboards/DASHBOARD_METRICS.md
@@ -194,7 +194,7 @@ The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml`
 - Check deployment mode and request routing configuration
 ### KV Cache metrics only showing decode workers:
-**Important Limitation**: In disaggregated mode, prefill workers (`--is-prefill-worker`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
+**Important Limitation**: In disaggregated mode, prefill workers (`--disaggregation-mode prefill`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
 **Why this happens:**
 - Prefill workers transfer KV cache to decode workers via NIXL

--- a/docs/pages/backends/vllm/README.md
+++ b/docs/pages/backends/vllm/README.md
@@ -159,7 +159,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo
 vLLM workers are configured through command-line arguments. Key parameters include:
 - `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
+- `--disaggregation-mode <mode>`: Worker role for disaggregated serving. Accepted values: `prefill`, `decode`, `agg` (default)
 - `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
 - `--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig.
 - `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled)

--- a/docs/pages/backends/vllm/gpt-oss.md
+++ b/docs/pages/backends/vllm/gpt-oss.md
@@ -46,7 +46,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3  python -m dynamo.vllm \
 CUDA_VISIBLE_DEVICES=4,5,6,7  python -m dynamo.vllm \
  --model openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
-  --is-prefill-worker \
+  --disaggregation-mode prefill \
  --dyn-reasoning-parser gpt_oss \
  --dyn-tool-call-parser harmony
 ```

--- a/docs/pages/backends/vllm/multi-node.md
+++ b/docs/pages/backends/vllm/multi-node.md
@@ -84,7 +84,7 @@ python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager \
-  --is-decode-worker
+  --disaggregation-mode decode
 ```
 **Node 2**: Run prefill worker
@@ -94,7 +94,7 @@ python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager \
-  --is-prefill-worker
+  --disaggregation-mode prefill
 ```
 ### Multi-node Tensor/Pipeline Parallelism

--- a/docs/pages/features/disaggregated-serving/README.md
+++ b/docs/pages/features/disaggregated-serving/README.md
@@ -506,7 +506,8 @@ spec:
            - "fp8"
            - "--max-num-seqs"
            - "1"               # Prefill workers use batch size 1
-            - --is-prefill-worker
+            - --disaggregation-mode
+            - prefill
    VLLMDecodeWorker:
      envFromSecret: hf-token-secret
@@ -553,7 +554,8 @@ spec:
            - "fp8"
            - "--max-num-seqs"
            - "1024"            # Decode workers handle high concurrency
-            - --is-decode-worker
+            - --disaggregation-mode
+            - decode
 ```
 **Critical RDMA settings:**

--- a/docs/pages/features/multimodal/multimodal-vllm.md
+++ b/docs/pages/features/multimodal/multimodal-vllm.md
@@ -47,7 +47,7 @@ vLLM supports all multimodal deployment patterns. See [Architecture Patterns](RE
 | Processor | `--multimodal-processor` | HTTP entry, tokenization |
 | Encode Worker | `--multimodal-encode-worker` | Media encoding |
 | PD Worker | `--multimodal-worker` | Prefill + Decode |
-| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only |
+| Prefill Worker | `--multimodal-worker --disaggregation-mode prefill` | Prefill only |
 | Decode Worker | `--multimodal-decode-worker` | Decode only |
 ## Use the Latest Release

--- a/docs/pages/integrations/flexkv-integration.md
+++ b/docs/pages/integrations/flexkv-integration.md
@@ -107,7 +107,7 @@ FLEXKV_CPU_CACHE_GB=32 \
 CUDA_VISIBLE_DEVICES=1 \
  python -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B \
-  --is-prefill-worker \
+  --disaggregation-mode prefill \
  --connector nixl flexkv
 ```

--- a/docs/pages/integrations/lmcache-integration.md
+++ b/docs/pages/integrations/lmcache-integration.md
@@ -88,7 +88,7 @@ This will:
 - **Purpose**: Handles prompt processing (prefill phase)
 - **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
 - **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for KV offloading and use NIXL for KV transfer between prefill and decode workers.
- **Flag**: `--is-prefill-worker`
+- **Flag**: `--disaggregation-mode prefill`
 ## Architecture

--- a/docs/pages/kubernetes/README.md
+++ b/docs/pages/kubernetes/README.md
@@ -219,7 +219,7 @@ Key customization points include:
 - **Resource Allocation**: Configure GPU requirements under `resources.limits`
 - **Scaling**: Set `replicas` for number of worker instances
 - **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
+- **Worker Specialization**: Add `--disaggregation-mode prefill` flag for disaggregated prefill workers
 ## Additional Resources

--- a/docs/pages/kubernetes/deployment/create-deployment.md
+++ b/docs/pages/kubernetes/deployment/create-deployment.md
@@ -186,7 +186,8 @@ Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MO
 ```yaml
   args:
-     - --is-prefill-worker  # For disaggregated prefill workers
+     - --disaggregation-mode
+     - prefill  # For disaggregated prefill workers
 ```
 ### Image Pull Secret Configuration

--- a/docs/pages/mocker/mocker.md
+++ b/docs/pages/mocker/mocker.md
@@ -50,13 +50,13 @@ python -m dynamo.mocker \
 # Launch prefill worker
 python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
-    --is-prefill-worker \
+    --disaggregation-mode prefill \
    --bootstrap-ports 50100
 # Launch decode worker (in another terminal)
 python -m dynamo.mocker \
    --model-path Qwen/Qwen3-0.6B \
-    --is-decode-worker
+    --disaggregation-mode decode
 ```
 ### Multiple Workers in One Process
@@ -88,8 +88,7 @@ python -m dynamo.mocker \
 | `--planner-profile-data` | None | Path to NPZ file with timing data |
 | `--num-workers` | 1 | Workers per process |
 | `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
-| `--is-prefill-worker` | False | Prefill-only mode |
+| `--disaggregation-mode` | `agg` | Worker mode: `agg` (aggregated), `prefill`, or `decode` |
-| `--is-decode-worker` | False | Decode-only mode |
 | `--durable-kv-events` | False | Enable durable KV events via JetStream (disables local indexer) |
 | `--bootstrap-ports` | None | Ports for P/D rendezvous |

--- a/docs/pages/observability/tracing.md
+++ b/docs/pages/observability/tracing.md
@@ -105,7 +105,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --enforce-eager \
    --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" \
-    --is-prefill-worker \
+    --disaggregation-mode prefill \
    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
 ```

--- a/examples/backends/mocker/deploy/disagg.yaml
+++ b/examples/backends/mocker/deploy/disagg.yaml
@@ -35,7 +35,8 @@ spec:
            - "1.0"
            - --planner-profile-data
            - /workspace/tests/planner/profiling_results/H200_TP1P_TP1D
-            - --is-prefill-worker
+            - --disaggregation-mode
+            - prefill
    decode:
      envFromSecret: hf-token-secret
      componentType: worker
@@ -58,4 +59,5 @@ spec:
            - "1.0"
            - --planner-profile-data
            - /workspace/tests/planner/profiling_results/H200_TP1P_TP1D
-            - --is-decode-worker
\ No newline at end of file
+            - --disaggregation-mode
+            - decode
\ No newline at end of file
--- a/examples/backends/vllm/deploy/README.md
+++ b/examples/backends/vllm/deploy/README.md
@@ -24,7 +24,7 @@ High-performance deployment with separated prefill and decode workers.
 **Architecture:**
 - `Frontend`: HTTP API server coordinating between workers
 - `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`)
+- `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
 - Communication via NIXL transfer backend
 ### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
@@ -33,7 +33,7 @@ Advanced disaggregated deployment with KV cache routing capabilities.
 **Architecture:**
 - `Frontend`: HTTP API server with KV-aware routing
 - `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`)
+- `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
 ## CRD Structure
@@ -85,7 +85,7 @@ extraPodSpec:
 **Common vLLM Flags:**
 - `--enable-prompt-embeds`: Enable prompt embeddings feature
 - `--enable-multimodal`: Enable multimodal (vision) support
- `--is-prefill-worker`: Prefill-only mode for disaggregated serving
+- `--disaggregation-mode prefill`: Prefill-only mode for disaggregated serving
 - `--connector [nixl|lmcache|kvbm|none]`: KV transfer backend selection
 ## Prerequisites
@@ -198,7 +198,7 @@ spec:
 vLLM workers are configured through command-line arguments. Key parameters include:
 - `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
+- `--disaggregation-mode prefill`: Enable prefill-only mode for disaggregated serving
 - `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
 See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the full list of configuration options.

--- a/examples/backends/vllm/deploy/disagg-multinode.yaml
+++ b/examples/backends/vllm/deploy/disagg-multinode.yaml
@@ -71,10 +71,10 @@ spec:
          - Qwen/Qwen3-0.6B
          - --tensor-parallel-size
          - "2"
-          - --is-prefill-worker
+          - --disaggregation-mode
+          - prefill
          - --connector
          - none
          - --kv-transfer-config
          - '{"kv_connector": "NixlConnector", "kv_role": "kv_both", "engine_id":
            "vllm-disagg-prefill-engine-0abc123"}'
--- a/examples/backends/vllm/deploy/disagg.yaml
+++ b/examples/backends/vllm/deploy/disagg.yaml
@@ -32,7 +32,8 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-0.6B
-            - --is-decode-worker
+            - --disaggregation-mode
+            - decode
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
@@ -52,4 +53,5 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-0.6B
-            - --is-prefill-worker
+            - --disaggregation-mode
+            - prefill
--- a/examples/backends/vllm/deploy/disagg_kvbm.yaml
+++ b/examples/backends/vllm/deploy/disagg_kvbm.yaml
@@ -31,6 +31,8 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-8B
+            - --disaggregation-mode
+            - decode
            - --max-model-len
            - "32000"
            - --enforce-eager
@@ -59,7 +61,8 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-8B
-            - --is-prefill-worker
+            - --disaggregation-mode
+            - prefill
            - --max-model-len
            - "32000"
            - --enforce-eager

--- a/examples/backends/vllm/deploy/disagg_kvbm_2p2d.yaml
+++ b/examples/backends/vllm/deploy/disagg_kvbm_2p2d.yaml
@@ -31,6 +31,8 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-8B
+            - --disaggregation-mode
+            - decode
            - --max-model-len
            - "32000"
            - --enforce-eager
@@ -59,7 +61,8 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-8B
-            - --is-prefill-worker
+            - --disaggregation-mode
+            - prefill
            - --max-model-len
            - "32000"
            - --enforce-eager

--- a/examples/backends/vllm/deploy/disagg_kvbm_tp2.yaml
+++ b/examples/backends/vllm/deploy/disagg_kvbm_tp2.yaml
@@ -33,6 +33,8 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-8B
+            - --disaggregation-mode
+            - decode
            - --gpu-memory-utilization
            - "0.23"
            - --max-model-len
@@ -65,7 +67,8 @@ spec:
          args:
            - --model
            - Qwen/Qwen3-8B
-            - --is-prefill-worker
+            - --disaggregation-mode
+            - prefill
            - --gpu-memory-utilization
            - "0.23"
            - --max-model-len