Unverified commit 5a158552, authored by ishandhanani, committed by GitHub

fix: remove old docs and unify model paths (#5179)

parent 7b5cdc42
...@@ -261,23 +261,6 @@ curl localhost:8000/v1/chat/completions \ ...@@ -261,23 +261,6 @@ curl localhost:8000/v1/chat/completions \
}' }'
``` ```
## Advanced Examples
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Run a multi-node sized model
- **[Run a multi-node model](multinode-examples.md)**
### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1-FP8 on H100s](dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1-FP8 on GB200s](dsr1-wideep-gb200.md)**
### Hierarchical Cache (HiCache)
- **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**
### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
- **[Run a multimodal model with EPD Disaggregation](../../multimodal/sglang.md)**
## Deployment ## Deployment
We currently provide deployment examples for Kubernetes and SLURM. We currently provide deployment examples for Kubernetes and SLURM.
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running DeepSeek-R1 Disaggregated with WideEP on GB200s
Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large-scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. We provide a sample configuration that demonstrates WideEP and P/D disaggregation. To run the exact configuration shown in the blog post, you can view the commands created by the SGLang team [here](https://github.com/sgl-project/sglang/issues/7227). In this example, we will run 1 prefill worker on 2 GB200 nodes and 1 decode worker on 2 GB200 nodes (4 GPUs per node, 8 GPUs per worker).
## Instructions
1. Build the Dynamo container for ARM64 (GB200) using the `build.sh` script.
> [!Note]
> Please ensure that you are building this on an ARM64 machine. The build script will automatically configure the correct platform and build arguments for SGLang on ARM64/GB200.
```bash
cd $DYNAMO_ROOT
./container/build.sh \
--framework SGLANG \
--platform linux/arm64 \
--tag dynamo-wideep-gb200:latest
```
2. You can run this container on each 4xGB200 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1).
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-wideep-gb200:latest
```
In each container, you should be in the `/sgl-workspace/dynamo/examples/backends/sglang` directory.
3. Run the ingress and prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# run prefill worker
DYN_SKIP_SGLANG_LOG_FORMATTING=1 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
--disaggregation-bootstrap-port 30001 \
--nnodes 2 \
--node-rank 0 \
--tp-size 8 \
--dp-size 8 \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 6144 \
--context-length 2716 \
--disable-radix-cache \
--moe-a2a-backend deepep \
--load-balance-method round_robin \
--deepep-mode normal \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--disable-cuda-graph \
--chunked-prefill-size 16384 \
--max-total-tokens 32768 \
--mem-fraction-static 0.82 \
--log-level debug \
--disaggregation-transfer-backend nixl
```
On the other prefill node (this example has 2 total prefill nodes), run the same command but change `--node-rank` to 1.
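Rather than editing `--node-rank` by hand on each node, the rank can be derived from the node's position in an ordered host list. The following is only a sketch: `node_rank_of` and `PREFILL_HOSTS` are hypothetical names introduced for illustration and are not part of Dynamo or SGLang.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the zero-based index of a host within a
# space-separated host list, or return 1 if the host is not listed.
node_rank_of() {
  local hosts="$1" me="$2" rank=0 h
  for h in $hosts; do
    if [ "$h" = "$me" ]; then
      echo "$rank"
      return 0
    fi
    rank=$((rank + 1))
  done
  return 1
}

# Example usage (PREFILL_HOSTS is an assumed variable that lists the
# prefill nodes in order, head node first):
#   PREFILL_HOSTS="gb200-a gb200-b"
#   NODE_RANK=$(node_rank_of "$PREFILL_HOSTS" "$(hostname)")
#   python3 -m dynamo.sglang ... --nnodes 2 --node-rank "$NODE_RANK" ...
```

The head node must come first in the list so that it receives rank 0, matching the address used in `--dist-init-addr`.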
> [!IMPORTANT]
> If you encounter random CPU recv timeout issues during the warm-up phase in multi-GPU or multi-node setups, they are likely caused by DeepGEMM kernel compilation overhead.
> To avoid these non-deterministic timeouts, it's strongly recommended to precompile the DeepGEMM kernels before launching the SGLang engine. This ensures all kernels are cached and ready, preventing long initialization delays or distributed timeout errors. To precompile and use cached kernels, please execute the following commands:
```bash
# 1. Precompile DeepGEMM kernels
export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
python3 -m sglang.compile_deep_gemm <ServerArgs>
# 2. Launch the engine with the same cache directory
export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
python3 -m dynamo.frontend <ServerArgs>
```
> [!NOTE]
> There's a known issue where the compile request may fail due to missing bootstrap information, but the kernels are still successfully cached.
> Using a gradual warm-up phase and enabling caching for FlashInfer (similar to DeepGEMM) can further improve stability and reduce startup time.
> See https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174 for more details.
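
Before launching, it can help to confirm that the precompile step actually left something in the cache directory. The check below is a minimal sketch: `dg_cache_ready` is a hypothetical helper, and treating "directory exists and is non-empty" as "cache populated" is an assumption, since the layout of the DeepGEMM cache is not described here.

```bash
#!/usr/bin/env bash
# Hypothetical pre-launch check: succeed only if the given directory
# exists and contains at least one entry. This does not validate the
# cached kernels themselves.
dg_cache_ready() {
  local dir="$1"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Example usage:
#   export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
#   if ! dg_cache_ready "$SGLANG_DG_CACHE_DIR"; then
#     echo "DeepGEMM cache is empty; precompile before launching" >&2
#   fi
```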
4. Run the decode worker on the head decode node
```bash
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode decode \
--dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
--disaggregation-bootstrap-port 30001 \
--nnodes 2 \
--node-rank 0 \
--tp-size 8 \
--dp-size 8 \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 36864 \
--context-length 2716 \
--disable-radix-cache \
--moe-a2a-backend deepep \
--prefill-round-robin-balance \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--cuda-graph-max-bs 256 \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--chunked-prefill-size 36864 \
--mem-fraction-static 0.82 \
--log-level debug \
--disaggregation-transfer-backend nixl
```
On the other decode nodes (this example has 2 total decode nodes), run the same command but change `--node-rank` to 1.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running DeepSeek-R1 Disaggregated with WideEP on H100s
Dynamo supports SGLang's implementation of wide expert parallelism and large-scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-05-05-large-scale-ep/) for more details. We provide a sample configuration that demonstrates WideEP and P/D disaggregation. To run the exact configuration shown in the blog post, you can view the commands created by the SGLang team [here](https://github.com/sgl-project/sglang/issues/6017). In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 4 H100 nodes (8 GPUs per node, 32 GPUs per worker, 64 GPUs total).
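The commands in this guide run on 8xH100 nodes with `--nnodes 4` and `--tp-size 32`, so each worker's `--tp-size` equals the node count times the GPUs per node. A small sanity check along those lines might look like the sketch below; `check_tp_size` is a hypothetical helper, and the rule it encodes is taken only from the configurations shown here, not from a general SGLang requirement.

```bash
#!/usr/bin/env bash
# Hypothetical sanity check: succeed when tp_size equals
# nnodes * gpus_per_node, as in the configurations shown in this guide.
check_tp_size() {
  local nnodes="$1" gpus_per_node="$2" tp_size="$3"
  [ $((nnodes * gpus_per_node)) -eq "$tp_size" ]
}

# H100 example from this guide: 4 nodes x 8 GPUs per node, --tp-size 32.
if check_tp_size 4 8 32; then
  echo "tp-size matches the GPU count of one worker"
fi
```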
## Instructions
1. Build the Dynamo container for AMD64/x86_64 (H100) using the `build.sh` script.
> [!Note]
> Please ensure that you are building this on an AMD64 (x86_64) machine. The build script will automatically configure the correct platform for SGLang.
```bash
cd $DYNAMO_ROOT
./container/build.sh \
--framework SGLANG \
--tag dynamo-wideep:latest
```
2. You can run this container on each 8xH100 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1).
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-wideep:latest
```
In each container, you should be in the `/sgl-workspace/dynamo/examples/backends/sglang` directory.
3. Run the ingress and prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# run prefill worker
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--disaggregation-bootstrap-port 30001 \
--dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
--nnodes 4 \
--node-rank 0 \
--tp-size 32 \
--dp-size 32 \
--enable-dp-attention \
--decode-log-interval 1000 \
--moe-a2a-backend deepep \
--load-balance-method round_robin \
--page-size 1 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode normal \
--mem-fraction-static 0.85 \
--deepep-config /configs/deepep.json \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm dynamic \
--eplb-algorithm deepseek
```
On the other prefill nodes (this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1, 2, and 3.
> [!IMPORTANT]
> If you encounter random CPU recv timeout issues during the warm-up phase in multi-GPU or multi-node setups, they are likely caused by DeepGEMM kernel compilation overhead.
> To avoid these non-deterministic timeouts, it's strongly recommended to precompile the DeepGEMM kernels before launching the SGLang engine. This ensures all kernels are cached and ready, preventing long initialization delays or distributed timeout errors. To precompile and use cached kernels, please execute the following commands:
```bash
# 1. Precompile DeepGEMM kernels
export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
python3 -m sglang.compile_deep_gemm <ServerArgs>
# 2. Launch the engine with the same cache directory
export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
python3 -m dynamo.frontend <ServerArgs>
```
> [!NOTE]
> There's a known issue where the compile request may fail due to missing bootstrap information, but the kernels are still successfully cached.
> Using a gradual warm-up phase and enabling caching for FlashInfer (similar to DeepGEMM) can further improve stability and reduce startup time.
> See https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174 for more details.
4. Run the decode worker on the head decode node
```bash
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--host 0.0.0.0 \
--dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
--nnodes 4 \
--node-rank 0 \
--tp-size 32 \
--dp-size 32 \
--enable-dp-attention \
--decode-log-interval 1000 \
--moe-a2a-backend deepep \
--prefill-round-robin-balance \
--page-size 1 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode low_latency \
--mem-fraction-static 0.835 \
--ep-num-redundant-experts 32 \
--cuda-graph-bs 128
```
On the other decode nodes (this example has 4 total decode nodes), run the same command but change `--node-rank` to 1, 2, and 3.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Multinode Examples
## Multi-node sized models
SGLang allows you to deploy multi-node sized models by adding the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate an example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires 4 nodes of 8xH100 GPUs.
**Prerequisite**: Building the Dynamo container.
```bash
cd $DYNAMO_ROOT
./container/build.sh \
--framework SGLANG \
--tag dynamo-wideep:latest
```
You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
**Step 1**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs both the prefill and decode workers in TP16:
Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# run prefill worker
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 0 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--load-balance-method round_robin \
--host 0.0.0.0 \
--mem-fraction-static 0.82
```
Node 2: Run the remaining 8 shards of the prefill worker
```bash
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 1 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--host 0.0.0.0 \
--load-balance-method round_robin \
--mem-fraction-static 0.82
```
Node 3: Run the first 8 shards of the decode worker
```bash
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 0 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--host 0.0.0.0 \
--prefill-round-robin-balance \
--mem-fraction-static 0.82 \
--cuda-graph-max-bs 8
```
Node 4: Run the remaining 8 shards of the decode worker
```bash
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 1 \
--enable-dp-attention \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--host 0.0.0.0 \
--prefill-round-robin-balance \
--mem-fraction-static 0.82 \
--cuda-graph-max-bs 8
```
**Step 2**: Run inference
SGLang typically requires a warmup period to ensure the DeepGEMM kernels are loaded. We recommend running a few warmup requests and ensuring that the DeepGEMM kernels load in.
```bash
curl ${HEAD_PREFILL_NODE_IP}:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of the tennis world, where champions rise and fall with each Grand Slam, lies the legend of the Golden Racket of Wimbledon. Once wielded by the greatest players of antiquity, this mythical racket is said to bestow unparalleled precision, grace, and longevity upon its rightful owner. For centuries, it remained hidden, its location lost to all but the most dedicated scholars of the sport. You are Roger Federer, the Swiss maestro whose elegant play and sportsmanship have already cemented your place among the legends, but whose quest for perfection remains unquenched even as time marches on. Recent dreams have brought you visions of this ancient artifact, along with fragments of a map that seems to lead to its resting place. Your journey will take you through the hallowed grounds of tennis history, from the clay courts of Roland Garros to the hidden training grounds of forgotten champions, and finally to a secret chamber beneath Centre Court itself. Character Background: Develop a detailed background for Roger Federer in this quest. Describe his motivations for seeking the Golden Racket, his tennis skills and personal weaknesses, and any connections to the legends of the sport that came before him. Is he driven by a desire to extend his career, to secure his legacy as the greatest of all time, or perhaps by something more personal? What price might he be willing to pay to claim this artifact, and what challenges from rivals past and present might stand in his way?"
}
],
"stream":false,
"max_tokens": 30
}'
```
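The warmup recommendation above can be scripted as a short loop of small requests. This is a sketch under stated assumptions: `warmup_payload` and `send_warmups` are hypothetical helpers, the prompt text and the default of 3 requests are arbitrary choices, and the endpoint is the same `/v1/chat/completions` route used in the example request.

```bash
#!/usr/bin/env bash
# Build a small chat-completions payload for a warmup request.
warmup_payload() {
  local model="$1"
  printf '{"model": "%s", "messages": [{"role": "user", "content": "warmup"}], "stream": false, "max_tokens": 8}' "$model"
}

# Send N warmup requests to the frontend and discard the responses.
# N defaults to 3, which is an arbitrary choice for illustration.
send_warmups() {
  local endpoint="$1" model="$2" n="${3:-3}" i
  for i in $(seq 1 "$n"); do
    curl -s -o /dev/null "$endpoint/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d "$(warmup_payload "$model")"
  done
}

# Example usage:
#   send_warmups "${HEAD_PREFILL_NODE_IP}:8000" deepseek-ai/DeepSeek-R1 3
```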
...@@ -52,9 +52,6 @@ ...@@ -52,9 +52,6 @@
backends/trtllm/gpt-oss.md backends/trtllm/gpt-oss.md
backends/trtllm/prometheus.md backends/trtllm/prometheus.md
backends/sglang/multinode-examples.md
backends/sglang/dsr1-wideep-gb200.md
backends/sglang/dsr1-wideep-h100.md
backends/sglang/expert-distribution-eplb.md backends/sglang/expert-distribution-eplb.md
backends/sglang/gpt-oss.md backends/sglang/gpt-oss.md
backends/sglang/profiling.md backends/sglang/profiling.md
......
...@@ -58,11 +58,10 @@ DYNAMO_PID=$! ...@@ -58,11 +58,10 @@ DYNAMO_PID=$!
# harnesses can set one simple pair for disaggregated deployments. # harnesses can set one simple pair for disaggregated deployments.
OTEL_SERVICE_NAME=dynamo-worker-prefill DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \ OTEL_SERVICE_NAME=dynamo-worker-prefill DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \
python3 -m dynamo.sglang \ python3 -m dynamo.sglang \
--model-path silence09/DeepSeek-R1-Small-2layers \ --model-path Qwen/Qwen3-0.6B \
--served-model-name silence09/DeepSeek-R1-Small-2layers \ --served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \ --page-size 16 \
--tp 2 --dp-size 2 --enable-dp-attention \ --tp 1 \
--load-balance-method round_robin \
--trust-remote-code \ --trust-remote-code \
--disaggregation-mode prefill \ --disaggregation-mode prefill \
--disaggregation-bootstrap-port 12345 \ --disaggregation-bootstrap-port 12345 \
...@@ -75,12 +74,11 @@ PREFILL_PID=$! ...@@ -75,12 +74,11 @@ PREFILL_PID=$!
# run decode worker # run decode worker
OTEL_SERVICE_NAME=dynamo-worker-decode DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \ OTEL_SERVICE_NAME=dynamo-worker-decode DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
CUDA_VISIBLE_DEVICES=2,3 python3 -m dynamo.sglang \ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path silence09/DeepSeek-R1-Small-2layers \ --model-path Qwen/Qwen3-0.6B \
--served-model-name silence09/DeepSeek-R1-Small-2layers \ --served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \ --page-size 16 \
--prefill-round-robin-balance \ --tp 1 \
--tp 2 --dp-size 2 --enable-dp-attention \
--trust-remote-code \ --trust-remote-code \
--disaggregation-mode decode \ --disaggregation-mode decode \
--disaggregation-bootstrap-port 12345 \ --disaggregation-bootstrap-port 12345 \
......
...@@ -59,8 +59,8 @@ DYNAMO_PID=$! ...@@ -59,8 +59,8 @@ DYNAMO_PID=$!
# run prefill worker # run prefill worker
OTEL_SERVICE_NAME=dynamo-worker-prefill-1 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \ OTEL_SERVICE_NAME=dynamo-worker-prefill-1 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \
python3 -m dynamo.sglang \ python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --model-path Qwen/Qwen3-0.6B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --served-model-name Qwen/Qwen3-0.6B \
--page-size 64 \ --page-size 64 \
--tp 1 \ --tp 1 \
--trust-remote-code \ --trust-remote-code \
...@@ -75,8 +75,8 @@ PREFILL_PID1=$! ...@@ -75,8 +75,8 @@ PREFILL_PID1=$!
# run prefill worker # run prefill worker
OTEL_SERVICE_NAME=dynamo-worker-prefill-2 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \ OTEL_SERVICE_NAME=dynamo-worker-prefill-2 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --model-path Qwen/Qwen3-0.6B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --served-model-name Qwen/Qwen3-0.6B \
--page-size 64 \ --page-size 64 \
--tp 1 \ --tp 1 \
--trust-remote-code \ --trust-remote-code \
...@@ -91,8 +91,8 @@ PREFILL_PID2=$! ...@@ -91,8 +91,8 @@ PREFILL_PID2=$!
# run decode worker # run decode worker
OTEL_SERVICE_NAME=dynamo-worker-decode-1 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT3:-8083} \ OTEL_SERVICE_NAME=dynamo-worker-decode-1 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT3:-8083} \
CUDA_VISIBLE_DEVICES=3 python3 -m dynamo.sglang \ CUDA_VISIBLE_DEVICES=3 python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --model-path Qwen/Qwen3-0.6B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --served-model-name Qwen/Qwen3-0.6B \
--page-size 64 \ --page-size 64 \
--tp 1 \ --tp 1 \
--trust-remote-code \ --trust-remote-code \
...@@ -107,8 +107,8 @@ DECODE_PID1=$! ...@@ -107,8 +107,8 @@ DECODE_PID1=$!
# run decode worker # run decode worker
OTEL_SERVICE_NAME=dynamo-worker-decode-2 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT4:-8084} \ OTEL_SERVICE_NAME=dynamo-worker-decode-2 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT4:-8084} \
CUDA_VISIBLE_DEVICES=2 python3 -m dynamo.sglang \ CUDA_VISIBLE_DEVICES=2 python3 -m dynamo.sglang \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --model-path Qwen/Qwen3-0.6B \
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --served-model-name Qwen/Qwen3-0.6B \
--page-size 64 \ --page-size 64 \
--tp 1 \ --tp 1 \
--trust-remote-code \ --trust-remote-code \
......
...@@ -17,7 +17,7 @@ For this example, we will make some assumptions about your SLURM cluster: ...@@ -17,7 +17,7 @@ For this example, we will make some assumptions about your SLURM cluster:
If your cluster supports similar container based plugins, you may be able to If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead. modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as 3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../../../../docs/backends/sglang/dsr1-wideep-gb200.md#instructions). described [here](../../../../docs/backends/sglang/README.md#using-docker-containers).
This is the image that can be passed to the `--container-image` argument in later steps. This is the image that can be passed to the `--container-image` argument in later steps.
## Scripts Overview ## Scripts Overview
......