<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Running SGLang with Dynamo

## Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding tag with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents

- [Feature Support Matrix](#feature-support-matrix)
- [Dynamo SGLang Integration](#dynamo-sglang-integration)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Multi-Node and Advanced Examples](#advanced-examples)
- [Deploy on SLURM or Kubernetes](#deployment)

## Feature Support Matrix

### Core Dynamo Features

| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ |  |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ |  |
| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ |  |
| [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |

## Dynamo SGLang Integration

Dynamo SGLang integrates SGLang engines into Dynamo's distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang's engine arguments.

### Argument Handling

Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine arguments work identically**. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.
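
As a minimal sketch of what that looks like, the helper below prints a worker command line built from the SGLang arguments named above. It assumes the worker is started as `python3 -m dynamo.sglang`, which this section does not state explicitly. The function name `build_worker_cmd` is hypothetical, and the model and tensor-parallel size are illustrative values.

```shell
# Hypothetical helper: prints a worker launch command instead of running it.
# Assumption: the module is invoked as `python3 -m dynamo.sglang`.
build_worker_cmd() {
  # $1: model path or Hugging Face model ID
  # $2: tensor-parallel size, passed to SGLang's --tp
  printf 'python3 -m dynamo.sglang --model-path %s --tp %s --trust-remote-code\n' "$1" "$2"
}

build_worker_cmd "Qwen/Qwen3-0.6B" 1
# prints: python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --tp 1 --trust-remote-code
```

Printing the command first makes it easy to review the flags before launching a worker on GPUs.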

#### Dynamo-Specific Arguments

| Argument | Description | Default | SGLang Equivalent |
|----------|-------------|---------|-------------------|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault_tolerance/request_migration.md). | `0` (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
| `--custom-jinja-template` | Use custom chat template for that model (takes precedence over default chat template in model repo) | `None` | `--chat-template` |
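
To make the `--endpoint` format concrete, here is a small sketch that splits a `dyn://namespace.component.endpoint` string into its three parts. The function `parse_dyn_endpoint` and the example value `dyn://dynamo.backend.generate` are hypothetical and chosen only for illustration; Dynamo's own parsing may apply stricter rules.

```shell
# Hypothetical helper: split dyn://namespace.component.endpoint into its parts.
# Prints "namespace component endpoint" or returns 1 if the shape is wrong.
parse_dyn_endpoint() {
  case "$1" in
    dyn://*.*.*) ;;  # needs the scheme and at least two dots
    *) echo "invalid endpoint: $1" >&2; return 1 ;;
  esac
  rest="${1#dyn://}"        # strip the scheme
  namespace="${rest%%.*}"   # text before the first dot
  rest="${rest#*.}"
  component="${rest%%.*}"   # text before the second dot
  endpoint="${rest#*.}"     # everything after the second dot
  printf '%s %s %s\n' "$namespace" "$component" "$endpoint"
}

parse_dyn_endpoint "dyn://dynamo.backend.generate"
# prints: dynamo backend generate
```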

#### Tokenizer Behavior

- **Default (`--use-sglang-tokenizer` not set)**: Dynamo's frontend handles tokenization and detokenization and passes `input_ids` to SGLang.
- **With `--use-sglang-tokenizer`**: SGLang handles tokenization and detokenization, and Dynamo passes raw prompts.

> [!NOTE]
> When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.

### Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

#### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |

> [!WARNING]
> ⚠️ The SGLang backend does not currently support cancellation during the remote prefill phase in disaggregated mode.

For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
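
One way to observe cancellation is to have the client hang up partway through a streaming response. The sketch below defines a hypothetical helper, `cancel_after`, that uses curl's `--max-time` option to drop the connection after a given number of seconds. It assumes the frontend listens on `localhost:8000` and serves `Qwen/Qwen3-0.6B`, as in the test request later in this README; the prompt and `max_tokens` value are illustrative.

```shell
# Hypothetical helper: start a streaming request, then disconnect after $1 seconds.
# Assumptions: frontend on localhost:8000, model Qwen/Qwen3-0.6B already deployed.
cancel_after() {
  curl --silent --no-buffer --max-time "$1" localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-0.6B",
      "messages": [{"role": "user", "content": "Write a very long story"}],
      "stream": true,
      "max_tokens": 2000
    }'
}
```

Calling `cancel_after 2` against a running deployment streams tokens for about two seconds, after which curl closes the connection and exits with code 28 (operation timed out). The disconnect is what should trigger the cancellation described above.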

## Installation

### Install latest release

We suggest using uv to install the latest release of `ai-dynamo[sglang]`. If uv is not installed yet, you can install it with `curl -LsSf https://astral.sh/uv/install.sh | sh`.

<details>
<summary>Expand for instructions</summary>

```bash
# create a virtual env
uv venv --python 3.12 --seed
# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"
```

</details>

### Install editable version for development

<details>
<summary>Expand for instructions</summary>

This requires Rust to be installed. We also recommend a working installation of the CUDA toolkit, because SGLang requires `nvcc` to be available.

```bash
# create a virtual env
uv venv --python 3.12 --seed
# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME
# installs sglang supported version along with dynamo
# include the prerelease flag to install flashinfer rc versions
uv pip install -e .
# install any sglang version >= 0.5.4.post1
uv pip install "sglang[all]==0.5.4.post1"
```

</details>

### Using Docker containers

<details>
<summary>Expand for instructions</summary>

We are in the process of shipping pre-built Docker containers that include DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can build the container from source with the following command:

```bash
cd $DYNAMO_ROOT
docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-sglang \
  --no-cache \
  .
```

Then run it with:

```bash
docker run \
    --gpus all \
    -it \
    --rm \
    --network host \
    --shm-size=10G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    --cap-add CAP_SYS_PTRACE \
    --ipc host \
    dynamo-sglang:latest
```

</details>

## Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start both services with [Docker Compose](../../../deploy/docker-compose.yml) from the repository root:

```bash
docker compose -f deploy/docker-compose.yml up -d
```

> [!TIP]
> Each example is a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) and the LLM engine (written in Python) in a single terminal. You can also take each command from a script and run it in its own terminal.
>
> Additionally, because we use SGLang's argument parser, you can pass any argument that SGLang supports to the worker.

### Aggregated Serving

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg.sh
```

### Aggregated Serving with KV Routing

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg_router.sh
```

### Aggregated Serving for Embedding Models

Here's an example that uses the [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg_embed.sh
```

<details>
<summary>Send the following request to verify your deployment:</summary>

```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": "Hello, world!"
  }'
```

</details>

### Disaggregated Serving

See [SGLang Disaggregation](sglang-disaggregation.md) to learn more about how SGLang and Dynamo handle disaggregated serving.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg.sh
```

### Disaggregated Serving with KV-Aware Prefill Routing

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_router.sh
```

### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention

You can use this configuration to test disaggregated serving with DP attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

```bash
# note this will require 4 GPUs
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_dp_attn.sh
```

### Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
    }
    ],
    "stream": true,
    "max_tokens": 30
  }'
```
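
Because the request sets `"stream": true`, the response arrives as a sequence of `data:` lines rather than a single JSON body. The sketch below shows one way to pull just the generated text out of such a stream. It assumes the usual OpenAI-style chunk shape (a `delta` object with a `content` string, terminated by `data: [DONE]`); the helper name `extract_deltas` and the sample chunks are hypothetical, and the `sed` pattern is a simplification that does not handle escaped quotes inside the content.

```shell
# Hypothetical helper: print only the text deltas from an SSE stream on stdin.
# Simplification: breaks on escaped quotes inside "content"; use a JSON parser
# for anything beyond a quick check.
extract_deltas() {
  sed -n 's/^data: .*"content": *"\([^"]*\)".*/\1/p' | tr -d '\n'
  echo
}

# Sample chunks in the assumed OpenAI-style streaming shape.
printf '%s\n' \
  'data: {"choices":[{"delta":{"role":"assistant","content":""}}]}' \
  'data: {"choices":[{"delta":{"content":"Roger"}}]}' \
  'data: {"choices":[{"delta":{"content":" Federer"}}]}' \
  'data: [DONE]' | extract_deltas
# prints: Roger Federer
```

Against a live deployment, the same helper can be placed at the end of the request above by adding `--silent --no-buffer` to the curl command and piping its output into `extract_deltas`.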

## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Run a multi-node sized model

- **[Run a multi-node model](multinode-examples.md)**

### Large scale P/D disaggregation with WideEP

- **[Run DeepSeek-R1-FP8 on H100s](dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1-FP8 on GB200s](dsr1-wideep-gb200.md)**

### Hierarchical Cache (HiCache)

- **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**

### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL

- **[Run a multimodal model with EPD Disaggregation](multimodal_epd.md)**

## Deployment

We currently provide deployment examples for Kubernetes and SLURM.

### Kubernetes

- **[Deploying Dynamo with SGLang on Kubernetes](../../../components/backends/sglang/deploy/README.md)**

### SLURM

- **[Deploying Dynamo with SGLang on SLURM](../../../components/backends/sglang/slurm_jobs/README.md)**