chore(sglang): readme and instruction fixes (#1761)

Commit 8bfc61ac (unverified), authored Jul 03, 2025 by ishandhanani, committed by GitHub Jul 03, 2025
7 changed files
--- a/examples/sglang/README.md
+++ b/examples/sglang/README.md
@@ -95,7 +95,7 @@ that get spawned depend upon the chosen graph.
 #### Aggregated

 ```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/agg.sh
 ```

@@ -108,8 +108,7 @@ cd /workspace/examples/sglang
 > After these are in, the TODOs in `worker.py` will be resolved and the placeholder logic removed.

 ```bash
-cd /workspace/examples/sglang
-export PYTHONPATH=$PYTHONPATH:/workspace/examples/sglang/utils
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/agg_router.sh
 ```

@@ -133,7 +132,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead
 > Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922)

 ```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/disagg.sh
 ```

@@ -143,7 +142,7 @@ SGLang also supports DP attention for MoE models. We provide an example config f

 ```bash
 # note this will require 4 GPUs
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/disagg_dp_attn.sh
 ```

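The README hunks above replace the hard-coded `/workspace` prefix with `$DYNAMO_ROOT`, so the variable has to be set before the `cd` commands will work. A minimal sketch of one way to set it follows. The `/workspace` fallback is an assumption taken from the removed lines, not something this commit defines, and `SGLANG_EXAMPLES` is a hypothetical name used only for illustration.

```shell
# Assumption: fall back to /workspace (the path used before this commit)
# when DYNAMO_ROOT is not already set in the environment.
: "${DYNAMO_ROOT:=/workspace}"
export DYNAMO_ROOT

# Hypothetical convenience variable for the directory the README cds into.
SGLANG_EXAMPLES="$DYNAMO_ROOT/examples/sglang"
echo "$SGLANG_EXAMPLES"
```

With the variable exported, `cd "$SGLANG_EXAMPLES"` should behave like the updated `cd $DYNAMO_ROOT/examples/sglang` lines, provided the repository is actually checked out at that location.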

--- a/examples/sglang/dsr1-wideep.md
+++ b/examples/sglang/dsr1-wideep.md
@@ -73,7 +73,7 @@ In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang`
 # run ingress
 dynamo run in=http out=dyn &
 # run prefill worker
-python3 components/worker_inc.py \
+python3 components/worker.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --skip-tokenizer-init \
@@ -108,7 +108,7 @@ On the other prefill node (since this example has 4 total prefill nodes), run th
 7. Run the decode worker on the head decode node

 ```bash
-python3 components/decode_worker_inc.py \
+python3 components/decode_worker.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --skip-tokenizer-init \
@@ -138,14 +138,6 @@ python3 components/decode_worker_inc.py \

 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8

-8. Run the warmup script to warm up the model
-
-DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
-
-```bash
-./warmup.sh HEAD_PREFILL_NODE_IP
-```
-
 ## Benchmarking

 In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:

--- a/examples/sglang/launch/agg.sh
+++ b/examples/sglang/launch/agg.sh
@@ -15,12 +15,9 @@ trap cleanup EXIT INT TERM
 python3 utils/clear_namespace.py --namespace dynamo

 # run ingress
-dynamo run in=http out=dyn &
+dynamo run in=http out=dyn --http-port=8000 &
 DYNAMO_PID=$!

-# run ingress
-dynamo run in=http out=dyn &
-
 # run worker
 python3 components/worker.py \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \

--- a/examples/sglang/launch/agg_router.sh
+++ b/examples/sglang/launch/agg_router.sh
@@ -15,7 +15,7 @@ trap cleanup EXIT INT TERM
 python3 utils/clear_namespace.py --namespace dynamo

 # run ingress
-dynamo run in=http out=dyn --router-mode kv &
+dynamo run in=http out=dyn --router-mode kv --http-port=8000 &
 DYNAMO_PID=$!

 # run worker

--- a/examples/sglang/launch/disagg.sh
+++ b/examples/sglang/launch/disagg.sh
@@ -15,7 +15,7 @@ trap cleanup EXIT INT TERM
 python3 utils/clear_namespace.py --namespace dynamo

 # run ingress
-dynamo run in=http out=dyn &
+dynamo run in=http out=dyn --http-port=8000 &
 DYNAMO_PID=$!

 # run prefill worker

--- a/examples/sglang/launch/disagg_dp_attn.sh
+++ b/examples/sglang/launch/disagg_dp_attn.sh
@@ -15,7 +15,7 @@ trap cleanup EXIT INT TERM
 python3 utils/clear_namespace.py --namespace dynamo

 # run ingress
-dynamo run in=http out=dyn &
+dynamo run in=http out=dyn --http-port=8000 &
 DYNAMO_PID=$!

 # run prefill worker
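The four launch scripts above now pass `--http-port=8000` to the ingress explicitly. A small sketch of how a client-side script might build the matching base URL is shown here. `ingress_url` is a hypothetical helper, not part of the repository, and the `localhost` default is an assumption about where the script runs.

```shell
# Hypothetical helper: print the base URL of the HTTP ingress.
# The default port mirrors the --http-port=8000 flag added in this commit;
# the default host (localhost) is an assumption.
ingress_url() {
  host="${1:-localhost}"
  port="${2:-8000}"
  printf 'http://%s:%s\n' "$host" "$port"
}

ingress_url                # default host and port
ingress_url 10.0.0.5 9000  # explicit host and port
```

Once a launch script is running, a request such as `curl -s "$(ingress_url)/v1/models"` could serve as a quick check that the ingress is listening. That assumes the ingress exposes an OpenAI-style `/v1/models` route, which this diff does not show.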

--- a/examples/sglang/multinode-examples.md
+++ b/examples/sglang/multinode-examples.md
@@ -17,7 +17,7 @@ Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
 # run ingress
 dynamo run in=http out=dyn &
 # run prefill worker
-python3 components/worker_inc.py \
+python3 components/worker.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --tp 16 \
@@ -40,7 +40,7 @@ export NATS_SERVER="nats://<node-1-ip>"
 export ETCD_ENDPOINTS="<node-1-ip>:2379"

 # worker
-python3 components/worker_inc.py \
+python3 components/worker.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --tp 16 \
@@ -63,7 +63,7 @@ export NATS_SERVER="nats://<node-1-ip>"
 export ETCD_ENDPOINTS="<node-1-ip>:2379"

 # worker
-python3 components/decode_worker_inc.py \
+python3 components/decode_worker.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --tp 16 \
@@ -86,7 +86,7 @@ export NATS_SERVER="nats://<node-1-ip>"
 export ETCD_ENDPOINTS="<node-1-ip>:2379"

 # worker
-python3 components/decode_worker_inc.py \
+python3 components/decode_worker.py \
  --model-path /model/ \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --tp 16 \