Unverified commit 330e649c, authored by ishandhanani, committed by GitHub

chore: multinode dsr1 doc fix (#1814)

parent 427d5471
@@ -4,10 +4,9 @@
 SGLang allows you to deploy multi-node sized models by adding in the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate an example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires 4 nodes of 8xH100 GPUs.
-**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes as you will need the NATS/ETCD endpoints to be accessible by all other nodes.
+**Step 1**: Use the provided helper script to generate commands to start NATS/ETCD on your head prefill node. This script will also give you environment variables to export on every other node. You will need the IP addresses of your head prefill and head decode nodes to run this script.
 ```bash
-# node 1
-docker compose -f lib/runtime/docker-compose.yml up -d
+./utils/gen_env_vars.sh
 ```
 **Step 2**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs prefill and the model in TP16:
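As a rough pre-flight sketch for Steps 1 and 2, the snippet below sets the variables that the worker commands in the following hunks reference and checks that the TP size matches the hardware described above. The IP addresses are placeholders, and whether `gen_env_vars.sh` emits exactly these variables is an assumption; `NATS_SERVER` and `ETCD_ENDPOINTS` are the names the previous version of this document exported by hand.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical addresses: replace with the real IPs of your head nodes.
export HEAD_PREFILL_NODE_IP="10.0.0.1"
export HEAD_DECODE_NODE_IP="10.0.0.3"

# Assumption: NATS and etcd run on the head prefill node. The variable names
# and the etcd port 2379 come from the pre-change version of this document.
export NATS_SERVER="nats://${HEAD_PREFILL_NODE_IP}"
export ETCD_ENDPOINTS="${HEAD_PREFILL_NODE_IP}:2379"

GPUS_PER_NODE=8   # 8xH100 per node in this example
NNODES=2          # passed as --nnodes to each worker group
TP=16             # passed as --tp

# In this example the TP size equals the GPU count of one worker group.
if [ $((NNODES * GPUS_PER_NODE)) -ne "${TP}" ]; then
  echo "tp=${TP} does not match ${NNODES} nodes x ${GPUS_PER_NODE} GPUs" >&2
  exit 1
fi

# Rendezvous addresses used by the prefill and decode worker groups.
PREFILL_INIT_ADDR="${HEAD_PREFILL_NODE_IP}:29500"
DECODE_INIT_ADDR="${HEAD_DECODE_NODE_IP}:29500"
echo "prefill --dist-init-addr ${PREFILL_INIT_ADDR}"
echo "decode  --dist-init-addr ${DECODE_INIT_ADDR}"
```

Both nodes of a worker group would use the same `--dist-init-addr` and differ only in `--node-rank`, which is the pattern the per-node commands follow.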
@@ -22,7 +21,7 @@ python3 components/worker.py \
   --served-model-name deepseek-ai/DeepSeek-R1 \
   --tp 16 \
   --dp-size 16 \
-  --dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
+  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
   --nnodes 2 \
   --node-rank 0 \
   --enable-dp-attention \
@@ -30,22 +29,18 @@ python3 components/worker.py \
   --skip-tokenizer-init \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl \
-  --mem-fraction-static 0.82 \
+  --disaggregation-bootstrap-port 30001 \
+  --mem-fraction-static 0.82
 ```
 Node 2: Run the remaining 8 shards of the prefill worker
 ```bash
-# nats and etcd endpoints
-export NATS_SERVER="nats://<node-1-ip>"
-export ETCD_ENDPOINTS="<node-1-ip>:2379"
-# worker
 python3 components/worker.py \
   --model-path /model/ \
   --served-model-name deepseek-ai/DeepSeek-R1 \
   --tp 16 \
   --dp-size 16 \
-  --dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
+  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
   --nnodes 2 \
   --node-rank 1 \
   --enable-dp-attention \
@@ -53,22 +48,18 @@ python3 components/worker.py \
   --skip-tokenizer-init \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl \
+  --disaggregation-bootstrap-port 30001 \
   --mem-fraction-static 0.82
 ```
 Node 3: Run the first 8 shards of the decode worker
 ```bash
-# nats and etcd endpoints
-export NATS_SERVER="nats://<node-1-ip>"
-export ETCD_ENDPOINTS="<node-1-ip>:2379"
-# worker
 python3 components/decode_worker.py \
   --model-path /model/ \
   --served-model-name deepseek-ai/DeepSeek-R1 \
   --tp 16 \
   --dp-size 16 \
-  --dist-init-addr HEAD_DECODE_NODE_IP:29500 \
+  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
   --nnodes 2 \
   --node-rank 0 \
   --enable-dp-attention \
@@ -76,22 +67,18 @@ python3 components/decode_worker.py \
   --skip-tokenizer-init \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl \
+  --disaggregation-bootstrap-port 30001 \
   --mem-fraction-static 0.82
 ```
 Node 4: Run the remaining 8 shards of the decode worker
 ```bash
-# nats and etcd endpoints
-export NATS_SERVER="nats://<node-1-ip>"
-export ETCD_ENDPOINTS="<node-1-ip>:2379"
-# worker
 python3 components/decode_worker.py \
   --model-path /model/ \
   --served-model-name deepseek-ai/DeepSeek-R1 \
   --tp 16 \
   --dp-size 16 \
-  --dist-init-addr HEAD_DECODE_NODE_IP:29500 \
+  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
   --nnodes 2 \
   --node-rank 1 \
   --enable-dp-attention \
@@ -99,6 +86,7 @@ python3 components/decode_worker.py \
   --skip-tokenizer-init \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl \
+  --disaggregation-bootstrap-port 30001 \
   --mem-fraction-static 0.82
 ```
@@ -106,7 +94,7 @@ python3 components/decode_worker.py \
 SGLang typically requires a warmup period to ensure the DeepGEMM kernels are loaded. We recommend running a few warmup requests and ensuring that the DeepGEMM kernels load in.
 ```bash
-curl <node-1-ip>:8000/v1/chat/completions \
+curl ${HEAD_PREFILL_NODE_IP}:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
...