- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- The aggregated config uses CUDA graphs for optimized inference
- KV cache uses FP8 dtype for memory efficiency
- The `vllm/disagg` config splits 8 GPUs into two prefill workers (TP=2 each, 4 GPUs) and one decode worker (TP=4, 4 GPUs), with KV cache transferred via NixlConnector; all workers must be co-located on one node
- `--max-model-len 8192` is set in `vllm/disagg/deploy.yaml` for A100 40 GB compatibility; remove this flag or raise its value on H100/H200
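
As a rough illustration of the `storageClassName` note, the field usually sits in the `spec` of a PersistentVolumeClaim. The fragment below is only a sketch: the claim name, access mode, storage class, and size are hypothetical placeholders and are not taken from `model-cache/model-cache.yaml`, so match them against the actual file and your cluster.

```yaml
# Hypothetical sketch; the real model-cache/model-cache.yaml may differ.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache            # assumed name, for illustration only
spec:
  accessModes:
    - ReadWriteMany            # assumed; shared access is typical for a model cache
  storageClassName: my-storage-class   # replace with a class available in your cluster
  resources:
    requests:
      storage: 100Gi           # assumed size; depends on the model weights
```

Running `kubectl get storageclass` lists the storage classes available in the cluster.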