Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.
### Example architectures
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `dynamo-run` to start the ingress and uses `python3` to start the workers. You can easily take each command and run it in a separate terminal.
#### Aggregated
```bash
cd /workspace/examples/sglang
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
./launch/agg.sh
```
#### Aggregated with router
#### Aggregated serving with KV Routing
> [!NOTE]
> The current implementation of `examples/sglang/components/worker.py` publishes _placeholder_ engine metrics to keep the Dynamo KV-router happy. Real-time metrics will be surfaced directly from the SGLang engine once the following pull requests are merged:
In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
4. On the head prefill node, start `nats-server` and `etcd` using the following commands
4. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script also tells you which environment variables to export on each node to make deployment easier.
6. Configure each configuration file to use the correct `dist-init-addr` and `node-rank`
Each container contains the configuration file in `configs/dsr1-wideep.yaml`. For our example, we will make the following changes:
On the prefill head node, `vim` into the configs and change the following section of the `SGLangWorker`:
```yaml
SGLangWorker:
  ...
  dist-init-addr: HEAD_PREFILL_NODE_IP
  nnodes: 2
  node-rank: 0
  ...
```
```bash
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --skip-tokenizer-init \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --disaggregation-bootstrap-port 30001 \
  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
  --nnodes 4 \
  --node-rank 0 \
  --tp-size 32 \
  --dp-size 32 \
  --enable-dp-attention \
  --decode-log-interval 1 \
  --enable-deepep-moe \
  --page-size 1 \
  --trust-remote-code \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-radix-cache \
  --watchdog-timeout 1000000 \
  --enable-two-batch-overlap \
  --deepep-mode normal \
  --mem-fraction-static 0.85 \
  --deepep-config /configs/deepep.json \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm dynamic \
  --eplb-algorithm deepseek
```
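Within one worker group, the multi-node flags are the same on every node except `--node-rank`, while `--dist-init-addr` always points at the head node. The following is a minimal Python sketch of that pattern and is not part of the example repository: `rank_flags` is a hypothetical helper, the IP address is a placeholder, and port `29500` is the one used in this guide.

```python
import shlex


def rank_flags(head_ip: str, node_rank: int, nnodes: int, port: int = 29500) -> str:
    """Return the multi-node flags for one node of a worker group (hypothetical helper)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    flags = [
        "--dist-init-addr", f"{head_ip}:{port}",  # always the head node's address
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),  # the only value that differs per node
    ]
    return shlex.join(flags)


if __name__ == "__main__":
    # Placeholder IP; 4 prefill nodes as in this example.
    for rank in range(4):
        print(rank_flags("10.0.0.1", rank, nnodes=4))
```

Printing the flags per rank makes it easy to paste the right values into the launch command on each node.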
On the other prefill nodes (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1, 2, and 3.
On the other prefill node (since this example has 2 prefill nodes), change the following section of the `SGLangWorker`:
```yaml
SGLangWorker:
  ...
  dist-init-addr: HEAD_PREFILL_NODE_IP
  nnodes: 2
  node-rank: 1
  ...
```
On the decode head node, `vim` into the configs and change the following section of the `SGLangDecodeWorker`:
```yaml
SGLangDecodeWorker:
  ...
  dist-init-addr: HEAD_DECODE_NODE_IP
  nnodes: 4
  node-rank: 0
  ...
```
7. Run the decode worker on the head decode node
```bash
python3 components/decode_worker_inc.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --skip-tokenizer-init \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --disaggregation-bootstrap-port 30001 \
  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
  --nnodes 9 \
  --node-rank 0 \
  --tp-size 72 \
  --dp-size 72 \
  --enable-dp-attention \
  --decode-log-interval 1 \
  --enable-deepep-moe \
  --page-size 1 \
  --trust-remote-code \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-radix-cache \
  --watchdog-timeout 1000000 \
  --enable-two-batch-overlap \
  --deepep-mode low_latency \
  --mem-fraction-static 0.835 \
  --ep-num-redundant-experts 32 \
  --cuda-graph-bs 256
```
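The flag values in this guide are consistent with one tensor-parallel rank per GPU: `--nnodes 4` with `--tp-size 32` for prefill and `--nnodes 9` with `--tp-size 72` for decode both work out to 8 GPUs per node. That 8-GPU figure is an inference from these numbers, not something stated for this deployment. A small sanity check, with `tp_matches_topology` as a hypothetical helper, might look like this:

```python
def tp_matches_topology(nnodes: int, tp_size: int, gpus_per_node: int = 8) -> bool:
    """Check that tp_size equals the total GPU count (assumes one TP rank per GPU)."""
    return tp_size == nnodes * gpus_per_node


if __name__ == "__main__":
    # Values taken from the prefill and decode commands in this guide.
    print("prefill:", tp_matches_topology(nnodes=4, tp_size=32))
    print("decode:", tp_matches_topology(nnodes=9, tp_size=72))
```

Running a check like this before launch can catch a mismatched `--nnodes` or `--tp-size` early, before the workers wait on a rendezvous that never completes.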
On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8.
On the other decode nodes (this example has 4 decode nodes), change the following section of the `SGLangDecodeWorker`:
```yaml
SGLangDecodeWorker:
  ...
  dist-init-addr: HEAD_DECODE_NODE_IP
  nnodes: 4
  # depending on the node, this will be 1, 2, or 3
  node-rank: 1
```
8. Run the warmup script to warm up the model
DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
```bash
./warmup.sh HEAD_PREFILL_NODE_IP
```
7. Start up the workers using the following commands
On prefill head node
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml
```
On prefill child node
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangWorker
```
On all decode nodes
```bash
dynamo serve graphs.disagg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangDecodeWorker
```
## Benchmarking
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to uncomment the labeled flags in the `configs/dsr1.yaml` file inside of the container.
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
prefill:
We currently provide 2 different ways to perform an end-to-end benchmark that includes our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single-batch workloads in the future.
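For a sense of scale, the two stress-test batches from the SGLang repro instructions quoted above can be turned into token counts with simple arithmetic. This is a back-of-the-envelope sketch: the `Batch` class is hypothetical and only multiplies the request counts and sequence lengths given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    """One single-batch workload: request count plus input/output sequence lengths."""
    name: str
    requests: int
    isl: int  # input sequence length, in tokens
    osl: int  # output sequence length, in tokens

    @property
    def input_tokens(self) -> int:
        return self.requests * self.isl

    @property
    def output_tokens(self) -> int:
        return self.requests * self.osl


PREFILL_STRESS = Batch("prefill stress test", requests=8192, isl=4096, osl=5)
DECODE_STRESS = Batch("decode stress test", requests=40000, isl=2000, osl=100)

if __name__ == "__main__":
    for batch in (PREFILL_STRESS, DECODE_STRESS):
        print(f"{batch.name}: {batch.input_tokens:,} input, {batch.output_tokens:,} output tokens")
```

The prefill batch is dominated by input tokens, while the decode batch generates roughly a hundred times more output tokens, which is why each one stresses a different worker.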
1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
SGLang allows you to deploy models that span multiple nodes by adding the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate an example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires 4 nodes of 8xH100 GPUs.
**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes, because the NATS/ETCD endpoints must be accessible from all other nodes.
```bash
...
docker compose -f lib/runtime/docker-compose.yml up -d
```
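Before starting workers on the other nodes, it can help to confirm that the head node's endpoints are reachable through the firewall. The sketch below is a plain TCP connectivity check using Python's standard library and is not part of Dynamo. Port `2379` is the etcd port used in the exports in this guide, while `4222` is the NATS default client port and is an assumption about your setup. `HEAD_NODE_IP` is a placeholder variable name.

```python
import os
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    head = os.environ.get("HEAD_NODE_IP", "127.0.0.1")  # placeholder variable name
    for name, port in (("etcd", 2379), ("nats", 4222)):  # 4222 is an assumed default
        status = "ok" if is_reachable(head, port) else "unreachable"
        print(f"{name} {head}:{port} {status}")
```

A successful TCP connection only shows that the port is open; it does not verify that the service behind it is healthy.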
**Step 2**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs prefill and the model in TP16:
Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
```yaml
# configs/prefill-1.yaml
Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1
  endpoint: dynamo.SGLangWorker.generate
  port: 8000
SGLangWorker:
  model-path: deepseek-ai/DeepSeek-R1
  served-model-name: deepseek-ai/DeepSeek-R1
  tp: 16
  trust-remote-code: true
  skip-tokenizer-init: true
  dist-init-addr: <node-1-ip>:29500
  disaggregation-bootstrap-port: 30001
  disaggregation-mode: prefill
  disaggregation-transfer-backend: nixl
  nnodes: 2
  node-rank: 0
  mem-fraction-static: 0.82
  ServiceArgs:
    workers: 1
    resources:
      gpu: 8
```
Run this with:
```bash
cd examples/sglang
dynamo serve graphs.agg:Frontend -f configs/prefill-1.yaml
```
```bash
# run ingress
dynamo run in=http out=dyn &
# run prefill worker
python3 components/worker_inc.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dp-size 16 \
  --dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
  --nnodes 2 \
  --node-rank 0 \
  --enable-dp-attention \
  --trust-remote-code \
  --skip-tokenizer-init \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --mem-fraction-static 0.82
```
Node 2: Run the remaining 8 shards of the prefill worker and the decode worker
```yaml
# configs/prefill-2.yaml
SGLangWorker:
  model-path: deepseek-ai/DeepSeek-R1
  served-model-name: deepseek-ai/DeepSeek-R1
  tp: 16
  trust-remote-code: true
  skip-tokenizer-init: true
  mem-fraction-static: 0.82
  dist-init-addr: <node-1-ip>:29500
  disaggregation-bootstrap-port: 30001
  disaggregation-mode: prefill
  disaggregation-transfer-backend: nixl
  nnodes: 2
  node-rank: 1
  ServiceArgs:
    workers: 1
    resources:
      gpu: 8
```
On all other nodes, we need to export the NATS and ETCD endpoints. Run this with:
```bash
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
cd examples/sglang
dynamo serve graphs.disagg:Frontend -f configs/prefill-2.yaml --service-name SGLangWorker
```
Node 2: Run the remaining 8 shards of the prefill worker
```bash
# nats and etcd endpoints
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
# worker
python3 components/worker_inc.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dp-size 16 \
  --dist-init-addr HEAD_PREFILL_NODE_IP:29500 \
  --nnodes 2 \
  --node-rank 1 \
  --enable-dp-attention \
  --trust-remote-code \
  --skip-tokenizer-init \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --mem-fraction-static 0.82
```
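Every non-head node needs the same two environment variables, both derived from the head node's address. As an illustration only, the snippet below builds them in Python so they can be merged into the environment of a launched worker. `dynamo_env` and `child_env` are hypothetical helpers, the IP address is a placeholder, and the `nats://<ip>` and `<ip>:2379` formats are copied from the exports in this guide.

```python
import os
from typing import Dict


def dynamo_env(head_ip: str, etcd_port: int = 2379) -> Dict[str, str]:
    """Build the NATS/etcd variables that point a node at the head node."""
    return {
        "NATS_SERVER": f"nats://{head_ip}",
        "ETCD_ENDPOINTS": f"{head_ip}:{etcd_port}",
    }


def child_env(head_ip: str) -> Dict[str, str]:
    """Return a copy of the current environment with the endpoint variables added."""
    env = dict(os.environ)
    env.update(dynamo_env(head_ip))
    return env


if __name__ == "__main__":
    # Print shell exports for a placeholder head-node IP.
    for key, value in dynamo_env("10.0.0.1").items():
        print(f'export {key}="{value}"')
```

A launcher script could pass `child_env(head_ip)` as the `env` argument of `subprocess.run` instead of exporting the variables by hand on each node.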
Node 3: Run the first 8 shards of the decode worker
```yaml
# configs/decode-1.yaml
SGLangDecodeWorker:
  model-path: deepseek-ai/DeepSeek-R1
  served-model-name: deepseek-ai/DeepSeek-R1
  tp: 16
  trust-remote-code: true
  skip-tokenizer-init: true
  mem-fraction-static: 0.80
  dist-init-addr: 2:29500
  disaggregation-mode: decode
  disaggregation-transfer-backend: nixl
  disaggregation-bootstrap-port: 30001
  nnodes: 2
  node-rank: 0
  ServiceArgs:
    workers: 1
    resources:
      gpu: 8
```
Run this with:
```bash
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
cd examples/sglang
dynamo serve graphs.disagg:Frontend -f configs/decode-1.yaml --service-name SGLangDecodeWorker
```
```bash
# nats and etcd endpoints
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
# worker
python3 components/decode_worker_inc.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dp-size 16 \
  --dist-init-addr HEAD_DECODE_NODE_IP:29500 \
  --nnodes 2 \
  --node-rank 0 \
  --enable-dp-attention \
  --trust-remote-code \
  --skip-tokenizer-init \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --mem-fraction-static 0.82
```
Node 4: Run the remaining 8 shards of the decode worker
```yaml
# configs/decode-2.yaml
SGLangDecodeWorker:
  model-path: deepseek-ai/DeepSeek-R1
  served-model-name: deepseek-ai/DeepSeek-R1
  tp: 16
  trust-remote-code: true
  skip-tokenizer-init: true
  mem-fraction-static: 0.80
  dist-init-addr: 2:29500
  disaggregation-mode: decode
  disaggregation-transfer-backend: nixl
  disaggregation-bootstrap-port: 30001
  disable-cuda-graph: true
  nnodes: 2
  node-rank: 1
  ServiceArgs:
    workers: 1
    resources:
      gpu: 8
```
Run this with:
```bash
# nats and etcd endpoints
export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
cd examples/sglang
# worker
dynamo serve graphs.disagg:Frontend -f configs/decode-2.yaml --service-name SGLangDecodeWorker