Commit c29f78c1, authored by dagil-nvidia, committed by GitHub

docs: add host and bootstrap port to disagg multinode example (#5309)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 5f8d90a3
......@@ -97,6 +97,7 @@ Ensure the following ports are accessible between nodes:
- **2379**: etcd client port
- **4222**: NATS client port
- **8000**: Frontend HTTP port (only needed on frontend node)
- **${DISAGG_BOOTSTRAP_PORT}**: SGLang disaggregation bootstrap port (set in Step 1; must be reachable across nodes)
- **High-speed interconnect**: For optimal NIXL performance (InfiniBand, RoCE, or high-bandwidth Ethernet)
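
Before launching workers, it can help to confirm that the ports listed above are actually reachable from each node. The following is a minimal sketch, not part of the original guide: `check_port` is a hypothetical helper name, and it assumes `bash` and the coreutils `timeout` command are available on the node.

```shell
# Hypothetical helper (not from the guide): check_port HOST PORT
# Prints "open HOST:PORT" if a TCP connection succeeds, otherwise
# prints "closed HOST:PORT" and returns a non-zero status.
check_port() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device attempts a TCP connect;
  # timeout bounds the wait so a filtered port does not hang the check.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open ${host}:${port}"
  else
    echo "closed ${host}:${port}"
    return 1
  fi
}
```

For example, once the variables from Step 1 are exported, running `check_port "${INFRA_NODE_IP}" 2379` from a worker node checks the etcd port, and running `check_port <NODE_IP> "${DISAGG_BOOTSTRAP_PORT}"` against a node with a running worker checks the bootstrap port. A "closed" result only means the TCP connect failed; it does not distinguish a firewall rule from a service that has not started yet.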
### 4. Hardware Setup
......@@ -128,6 +129,9 @@ export INFRA_NODE_IP=<INFRA_NODE_IP>
export ETCD_ENDPOINTS=http://${INFRA_NODE_IP}:2379
export NATS_SERVER=nats://${INFRA_NODE_IP}:4222
export DYN_LOG=debug # Enable debug logging to see routing decisions
# Use a fixed, reachable port for the disaggregation bootstrap server
# Pick any free port and ensure it's open between nodes
export DISAGG_BOOTSTRAP_PORT=32963
```
### Step 2: Launch Replica 1 (Node 1)
......@@ -141,8 +145,10 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
......@@ -151,8 +157,10 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl
```
......@@ -161,6 +169,8 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
>
> - `CUDA_VISIBLE_DEVICES`: Controls which GPU each worker uses (0 and 1 for different GPUs)
> - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
> - `--host 0.0.0.0`: Exposes the SGLang bootstrap server on all interfaces so other nodes can reach it
> - `--disaggregation-bootstrap-port`: Uses the fixed port you set in `DISAGG_BOOTSTRAP_PORT`; ensure this port is open between nodes
> - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token generation)
> - `--disaggregation-transfer-backend nixl`: Enables high-speed GPU-to-GPU transfers
> - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend handles tokenization
......@@ -176,8 +186,10 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
......@@ -187,8 +199,10 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl
```
......