Unverified commit 8bfc61ac, authored by ishandhanani, committed by GitHub

chore(sglang): readme and instruction fixes (#1761)

parent 6901c7c0
......@@ -95,7 +95,7 @@ that get spawned depend upon the chosen graph.
#### Aggregated
```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
./launch/agg.sh
```
......@@ -108,8 +108,7 @@ cd /workspace/examples/sglang
> After these are in, the TODOs in `worker.py` will be resolved and the placeholder logic removed.
```bash
-cd /workspace/examples/sglang
-export PYTHONPATH=$PYTHONPATH:/workspace/examples/sglang/utils
+cd $DYNAMO_ROOT/examples/sglang
./launch/agg_router.sh
```
......@@ -133,7 +132,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead
> Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922)
```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
./launch/disagg.sh
```
......@@ -143,7 +142,7 @@ SGLang also supports DP attention for MoE models. We provide an example config f
```bash
# note this will require 4 GPUs
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
./launch/disagg_dp_attn.sh
```
......
......@@ -73,7 +73,7 @@ In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang`
# run ingress
dynamo run in=http out=dyn &
# run prefill worker
-python3 components/worker_inc.py \
+python3 components/worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
......@@ -108,7 +108,7 @@ On the other prefill node (since this example has 4 total prefill nodes), run th
7. Run the decode worker on the head decode node
```bash
-python3 components/decode_worker_inc.py \
+python3 components/decode_worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
......@@ -138,14 +138,6 @@ python3 components/decode_worker_inc.py \
On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
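As a rough illustration of the rank assignment described above, the following shell sketch prints the `--node-rank` value for each non-head decode node. The function name `decode_node_ranks` is hypothetical and not part of the repository, and the sketch assumes the head decode node uses rank 0.

```shell
# Hypothetical helper (not part of the repository): print the --node-rank
# value for every non-head decode node. Assumes the head node is rank 0.
decode_node_ranks() {
  total_nodes=$1
  rank=1
  while [ "$rank" -lt "$total_nodes" ]; do
    echo "$rank"
    rank=$((rank + 1))
  done
}

# This example has 9 total decode nodes, so the other nodes use ranks 1 to 8.
for rank in $(decode_node_ranks 9); do
  echo "decode node $rank: --node-rank $rank"
done
```

The loop only prints the values; the decode worker command itself still has to be run on each node with the matching rank.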
-8. Run the warmup script to warm up the model
-DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
-```bash
-./warmup.sh HEAD_PREFILL_NODE_IP
-```
## Benchmarking
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
......
......@@ -15,12 +15,9 @@ trap cleanup EXIT INT TERM
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
-dynamo run in=http out=dyn &
+dynamo run in=http out=dyn --http-port=8000 &
DYNAMO_PID=$!
-# run ingress
-dynamo run in=http out=dyn &
# run worker
python3 components/worker.py \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
......
......@@ -15,7 +15,7 @@ trap cleanup EXIT INT TERM
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
-dynamo run in=http out=dyn --router-mode kv &
+dynamo run in=http out=dyn --router-mode kv --http-port=8000 &
DYNAMO_PID=$!
# run worker
......
......@@ -15,7 +15,7 @@ trap cleanup EXIT INT TERM
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
-dynamo run in=http out=dyn &
+dynamo run in=http out=dyn --http-port=8000 &
DYNAMO_PID=$!
# run prefill worker
......
......@@ -15,7 +15,7 @@ trap cleanup EXIT INT TERM
python3 utils/clear_namespace.py --namespace dynamo
# run ingress
-dynamo run in=http out=dyn &
+dynamo run in=http out=dyn --http-port=8000 &
DYNAMO_PID=$!
# run prefill worker
......
......@@ -17,7 +17,7 @@ Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
# run ingress
dynamo run in=http out=dyn &
# run prefill worker
-python3 components/worker_inc.py \
+python3 components/worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
......@@ -40,7 +40,7 @@ export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
# worker
-python3 components/worker_inc.py \
+python3 components/worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
......@@ -63,7 +63,7 @@ export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
# worker
-python3 components/decode_worker_inc.py \
+python3 components/decode_worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
......@@ -86,7 +86,7 @@ export NATS_SERVER="nats://<node-1-ip>"
export ETCD_ENDPOINTS="<node-1-ip>:2379"
# worker
-python3 components/decode_worker_inc.py \
+python3 components/decode_worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
......