README.md 9.21 KB
Newer Older
1
2
3
4
5
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---

6
7
# Running SGLang with Dynamo

8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Dynamo SGLang Integration](#dynamo-sglang-integration)
- [Installation](#installation)
- [Quick Start](#quick-start)
27
28
- [Single Node Examples](#run-single-node-examples)
- [Multi-Node and Advanced Examples](#advanced-examples)
29
30
31
32
33
34
35
36
37
- [Deploy on SLURM or Kubernetes](#deployment)

## Feature Support Matrix

### Core Dynamo Features

| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ |  |
38
39
40
41
42
| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ |  |
| [**Multimodal Support**](../../features/multimodal/multimodal-sglang.md) | ✅ |  |
| [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67


## Dynamo SGLang Integration

Dynamo SGLang integrates SGLang engines into Dynamo's distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang's engine arguments.

### Argument Handling

Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine arguments work identically**. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.

#### Dynamo-Specific Arguments

| Argument | Description | Default | SGLang Equivalent |
|----------|-------------|---------|-------------------|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
| `--custom-jinja-template` | Use custom chat template for that model (takes precedence over default chat template in model repo) | `None` | `--chat-template` |

#### Tokenizer Behavior

- **Default (`--use-sglang-tokenizer` not set)**: Dynamo handles tokenization/detokenization via our blazing fast frontend and passes `input_ids` to SGLang
- **With `--use-sglang-tokenizer`**: SGLang handles tokenization/detokenization, Dynamo passes raw prompts

68
69
> [!NOTE]
> When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
70
71
72
73
74
75
76
77
78
79
80
81

### Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

#### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |

82
83
> [!WARNING]
> ⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
84
85
86
87
88
89
90
91

For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.

## Installation

### Install latest release
We suggest using uv to install the latest release of ai-dynamo[sglang]. You can install it with `curl -LsSf https://astral.sh/uv/install.sh | sh`

92
<Accordion title="Expand for instructions">
93
94
95
96
97
98
```bash
# create a virtual env
uv venv --python 3.12 --seed
# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"
```
99
</Accordion>
100
101
102

### Install editable version for development

103
<Accordion title="Expand for instructions">
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
This requires having rust installed. We also recommend having a proper installation of the cuda toolkit as sglang requires `nvcc` to be available.

```bash
# create a virtual env
uv venv --python 3.12 --seed
# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME
# installs sglang supported version along with dynamo
# include the prerelease flag to install flashinfer rc versions
uv pip install -e .
# install any sglang version >= 0.5.3.post2
uv pip install "sglang[all]==0.5.3.post2"
```
120
</Accordion>
121
122
123

### Using docker containers

124
<Accordion title="Expand for instructions">
125
126
127
128
We are in the process of shipping pre-built docker containers that contain installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can quickly build the container from source with the following command.

```bash
cd $DYNAMO_ROOT
129
python container/render.py --framework sglang --output-short-filename
130
docker build -f container/rendered.Dockerfile -t dynamo:latest-sglang .
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
```

And then run it using

```bash
docker run \
    --gpus all \
    -it \
    --rm \
    --network host \
    --shm-size=10G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    --cap-add CAP_SYS_PTRACE \
    --ipc host \
147
    dynamo:latest-sglang
148
```
149
</Accordion>
150
151
152
153
154

## Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

155
### Start Infrastructure Services (Local Development Only)
156

157
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
158
159
160
161
162

```bash
docker compose -f deploy/docker-compose.yml up -d
```

163
164
165
166
167
168
169
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)

> [!TIP]
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
170
>
171
> Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196


### Aggregated Serving

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh
```

### Aggregated Serving with KV Routing

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_router.sh
```

### Aggregated Serving for Embedding Models

Here's an example that uses the [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model.

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_embed.sh
```

197
<Accordion title="Send the following request to verify your deployment:">
198
199
200
201
202
203
204
205
```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": "Hello, world!"
  }'
```
206
</Accordion>
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262

### Disaggregated serving

See [SGLang Disaggregation](sglang-disaggregation.md) to learn more about how sglang and dynamo handle disaggregated serving.


```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg.sh
```

### Disaggregated Serving with KV Aware Prefill Routing

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_router.sh
```

### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention

You can use this configuration to test out disaggregated serving with dp attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

```bash
# note this will require 4 GPUs
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_dp_attn.sh
```

### Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
    }
    ],
    "stream": true,
    "max_tokens": 30
  }'
```

## Deployment

We currently provide deployment examples for Kubernetes and SLURM.

## Kubernetes
- **[Deploying Dynamo with SGLang on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)**

## SLURM
263
- **[Deploying Dynamo with SGLang on SLURM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/slurm-jobs/README.md)**