3. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
3. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
3. Run the ingress and prefill worker
```bash
./utils/gen_env_vars.sh
```
4. Run the ingress and prefill worker
```bash
```bash
# run ingress
# run ingress
...
@@ -87,7 +81,7 @@ python3 -m dynamo.sglang \
...
@@ -87,7 +81,7 @@ python3 -m dynamo.sglang \
On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
**Step 1**: Use the provided helper script to generate commands to start NATS/ETCD on your head prefill node. This script will also give you environment variables to export on each other node. You will need the IP addresses of your head prefill and head decode node to run this script.
**Step 1**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs prefill and the model in TP16:
```bash
./utils/gen_env_vars.sh
```
**Step 2**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs prefill and the model in TP16:
Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
```bash
```bash
...
@@ -104,7 +99,7 @@ python3 -m dynamo.sglang \
...
@@ -104,7 +99,7 @@ python3 -m dynamo.sglang \
--mem-fraction-static 0.82
--mem-fraction-static 0.82
```
```
**Step 3**: Run inference
**Step 2**: Run inference
SGLang typically requires a warmup period to ensure the DeepGEMM kernels are loaded. We recommend running a few warmup requests and ensuring that the DeepGEMM kernels load in.
SGLang typically requires a warmup period to ensure the DeepGEMM kernels are loaded. We recommend running a few warmup requests and ensuring that the DeepGEMM kernels load in.