@@ -14,7 +14,7 @@ Dynamo commits after `1b3eed4b6a0e735d4ecec6681f4c0b89f2112167` (Sep 18, 2025) a
## Hardware
The two deployment recipes are for 8xH200 and 16xH200. It should also work for other GPU SKUs. Change the TDP and DEP size accordingly to match the GPU capacity.
The two deployment recipes are for 16x H200 (disagg-8gpu) and 32x H200 (disagg-16gpu). The folder names refer to GPUs per worker type (8 or 16), with separate prefill and decode workers each using that many GPUs. The recipes should also work on other GPU SKUs; adjust the TP and EP sizes to match the GPU capacity.
NCCL errors that appear when sending requests to the engines are usually caused by an out-of-memory (OOM) error. Try reducing `--mem-fraction-static` on both the prefill and decode engines.
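
The GPU totals quoted for the two recipes follow from the folder naming: each folder name gives the GPUs per worker type, and there is one prefill and one decode worker group. A minimal sketch of that arithmetic is shown below. The helper name `disagg_gpu_budget` is hypothetical and not part of the recipes, and the default of 8 GPUs per node is an assumption inferred from the "four Hopper nodes (32 GPUs total)" description.

```python
# Hypothetical helper for illustration only; it is not part of the recipes.
def disagg_gpu_budget(gpus_per_worker_type: int, gpus_per_node: int = 8) -> dict:
    """Return the GPU and node totals for a disaggregated deployment.

    gpus_per_worker_type: GPUs used by prefill, and separately by decode
        (the number in the folder name, e.g. 8 for disagg-8gpu).
    gpus_per_node: assumed node size; 8 matches "four nodes, 32 GPUs".
    """
    total_gpus = 2 * gpus_per_worker_type  # one prefill group + one decode group
    nodes = -(-total_gpus // gpus_per_node)  # ceiling division
    return {
        "prefill_gpus": gpus_per_worker_type,
        "decode_gpus": gpus_per_worker_type,
        "total_gpus": total_gpus,
        "nodes": nodes,
    }


if __name__ == "__main__":
    for folder, per_worker in (("disagg-8gpu", 8), ("disagg-16gpu", 16)):
        print(folder, disagg_gpu_budget(per_worker))
```

On a different GPU SKU, the same calculation applies with a different `gpus_per_node`; the TP and EP sizes still need to be chosen to fit the per-GPU memory.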
### DeepSeek-R1 with vLLM — Disaggregated on 8x Hopper
### DeepSeek-R1 with vLLM — Disaggregated on 32x Hopper
This recipe deploys DeepSeek-R1 using vLLM in a disaggregated prefill/decode setup on a single Hopper node with 8 GPUs.
This recipe deploys DeepSeek-R1 using vLLM in a disaggregated prefill/decode setup across four Hopper nodes (32 GPUs total: 16 for prefill, 16 for decode).
- Model cache PVC + download job: `recipes/deepseek-r1/model-cache/`
The manifest runs separate prefill and decode workers, each mounting the shared model cache, with settings tuned for Hopper.
The manifest runs separate prefill and decode workers across multiple nodes, each mounting the shared model cache, with settings tuned for Hopper GPUs.
Test the deployment locally by port-forwarding and sending a request: