Unverified commit c29f78c1, authored by dagil-nvidia, committed by GitHub

docs: add host and bootstrap port to disagg multinode example (#5309)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 5f8d90a3
@@ -97,6 +97,7 @@ Ensure the following ports are accessible between nodes:
 - **2379**: etcd client port
 - **4222**: NATS client port
 - **8000**: Frontend HTTP port (only needed on frontend node)
+- **${DISAGG_BOOTSTRAP_PORT}**: SGLang disaggregation bootstrap port (set in Step 1; must be reachable across nodes)
 - **High-speed interconnect**: For optimal NIXL performance (InfiniBand, RoCE, or high-bandwidth Ethernet)
 ### 4. Hardware Setup
@@ -128,6 +129,9 @@ export INFRA_NODE_IP=<INFRA_NODE_IP>
 export ETCD_ENDPOINTS=http://${INFRA_NODE_IP}:2379
 export NATS_SERVER=nats://${INFRA_NODE_IP}:4222
 export DYN_LOG=debug # Enable debug logging to see routing decisions
+# Use a fixed, reachable port for the disaggregation bootstrap server
+# Pick any free port and ensure it's open between nodes
+export DISAGG_BOOTSTRAP_PORT=32963
 ```
 ### Step 2: Launch Replica 1 (Node 1)
@@ -141,8 +145,10 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
   --tp 1 \
+  --host 0.0.0.0 \
   --trust-remote-code \
   --skip-tokenizer-init \
+  --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl &
@@ -151,8 +157,10 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
   --tp 1 \
+  --host 0.0.0.0 \
   --trust-remote-code \
   --skip-tokenizer-init \
+  --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl
 ```
@@ -161,6 +169,8 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
 >
 > - `CUDA_VISIBLE_DEVICES`: Controls which GPU each worker uses (0 and 1 for different GPUs)
 > - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
+> - `--host 0.0.0.0`: Exposes the SGLang bootstrap server on all interfaces so other nodes can reach it
+> - `--disaggregation-bootstrap-port`: Uses the fixed port you set in `DISAGG_BOOTSTRAP_PORT`; ensure this port is open between nodes
 > - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token generation)
 > - `--disaggregation-transfer-backend nixl`: Enables high-speed GPU-to-GPU transfers
 > - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend handles tokenization
@@ -176,8 +186,10 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
   --tp 1 \
+  --host 0.0.0.0 \
   --trust-remote-code \
   --skip-tokenizer-init \
+  --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl &
@@ -187,8 +199,10 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
   --tp 1 \
+  --host 0.0.0.0 \
   --trust-remote-code \
   --skip-tokenizer-init \
+  --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend nixl
 ```
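
The change requires the etcd, NATS, and bootstrap ports to be reachable between nodes. One way to check this before launching the workers is a small TCP probe such as the sketch below. It is an illustration only: `check_port` and `report_ports` are hypothetical helper names, not part of Dynamo or SGLang, and the probe assumes `bash` and the coreutils `timeout` command are available.

```shell
#!/usr/bin/env bash
# Sketch only: check_port and report_ports are hypothetical helpers,
# not part of Dynamo or SGLang.

# Returns 0 if a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait.
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Prints one status line per port for the given host.
report_ports() {
  local host="$1" port
  shift
  for port in "$@"; do
    if check_port "$host" "$port"; then
      echo "${host}:${port} reachable"
    else
      echo "${host}:${port} NOT reachable"
    fi
  done
}

# Example, run from a worker node after Step 1 has exported the variables:
#   report_ports "${INFRA_NODE_IP}" 2379 4222
# PEER_NODE_IP is an assumed variable holding the other worker node's address:
#   report_ports "${PEER_NODE_IP}" "${DISAGG_BOOTSTRAP_PORT}"
```

A probe only succeeds when a process is already listening on the target port, so the check against `DISAGG_BOOTSTRAP_PORT` is meaningful only after the corresponding worker has started.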