<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
![Dynamo banner](./docs/images/frontpage-banner.png)

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)

| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |

# NVIDIA Dynamo

High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

## The Era of Multi-GPU, Multi-Node

<p align="center">
  <img src="./docs/images/frontpage-gpu-vertical.png" alt="Multi Node Multi-GPU topology" width="600" />
</p>

Large language models are quickly outgrowing the memory and compute budget of any single GPU. Tensor-parallelism solves the capacity problem by spreading each layer across many GPUs—and sometimes many servers—but it creates a new one: how do you coordinate those shards, route requests, and share KV cache fast enough to feel like one accelerator? This orchestration gap is exactly what NVIDIA Dynamo is built to close.

Dynamo is designed to be inference-engine agnostic (it supports TensorRT-LLM, vLLM, SGLang, and others) and provides LLM-specific capabilities such as:

- **Disaggregated prefill & decode inference** – Maximizes GPU throughput and facilitates a trade-off between throughput and latency.
- **Dynamic GPU scheduling** – Optimizes performance based on fluctuating demand.
- **LLM-aware request routing** – Eliminates unnecessary KV cache re-computation.
- **Accelerated data transfer** – Reduces inference response time using NIXL.
- **KV cache offloading** – Leverages multiple memory hierarchies for higher system throughput.
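
As a rough illustration of what LLM-aware routing means, the toy sketch below sends each request to the worker that already holds the longest matching token prefix, so less of the prompt has to be prefilled again. It is not Dynamo's router: the `Worker` record, `shared_prefix_len`, and `pick_worker` are hypothetical names, and the sketch assumes every worker can report the token prefixes currently in its KV cache.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker record: a name, cached token prefixes, and current load."""
    name: str
    cached_prefixes: list = field(default_factory=list)
    active_requests: int = 0


def shared_prefix_len(a, b):
    """Count how many leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(workers, prompt_tokens):
    """Prefer the longest cached prefix; break ties by the lower load."""
    def score(worker):
        best = max(
            (shared_prefix_len(p, prompt_tokens) for p in worker.cached_prefixes),
            default=0,
        )
        return (best, -worker.active_requests)

    return max(workers, key=score)


workers = [
    Worker("a", [(1, 2, 3, 4)], active_requests=2),
    Worker("b", [(1, 2)]),
    Worker("c"),
]
print(pick_worker(workers, (1, 2, 3, 9)).name)  # a
```

Ranking strictly by overlap and only then by load is the simplest possible policy; a real router would likely weigh the two against each other.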

<p align="center">
  <img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and its capabilities, see its README:
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and developed with a transparent, open-source-first approach.

# Installation

The following examples require a few system-level packages.
We recommend Ubuntu 24.04 with an x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md) for details.

## 1. Initial setup

The Dynamo team recommends the `uv` Python package manager, although any Python package manager works. Install `uv`:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### Install etcd and NATS (required)

To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynamo locally, these need to be available.

- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [NATS](https://nats.io/) needs JetStream enabled: `nats-server -js`.

To quickly set up etcd and NATS, you can also run:
```
# At the root of the repository:
docker compose -f deploy/docker-compose.yml up -d
```
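
Before starting Dynamo, it can help to confirm that both services are actually reachable. The following is a minimal sketch that only tests whether a TCP connection succeeds. It assumes the services listen on their usual default client ports on localhost (2379 for etcd, 4222 for NATS); the `port_open` helper and the `SERVICES` table are illustrative names, so adjust the ports if your setup maps them differently.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Assumed default client ports; change these if your deployment differs.
SERVICES = {"etcd": 2379, "nats": 4222}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "reachable" if port_open("127.0.0.1", port) else "NOT reachable"
        print(f"{name:5s} 127.0.0.1:{port} {state}")
```

A successful connection only shows that something is listening on the port; it does not verify that JetStream is enabled on the NATS server.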

## 2. Select an engine

We publish Python wheels specialized for each of our supported engines: vllm, sglang, trtllm, and llama.cpp. The examples that follow use SGLang; continue reading for other engines.

```
uv venv venv
source venv/bin/activate
uv pip install pip

# Choose one
uv pip install "ai-dynamo[sglang]"  # replace with [vllm], [trtllm], etc.
```
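
To confirm that the wheel landed in the active virtual environment, you can query the installed distribution metadata from Python. This is a small sketch using only the standard library; `installed_version` is an illustrative helper name, and the only project-specific detail is the `ai-dynamo` distribution name used in the install command above.

```python
from importlib import metadata


def installed_version(dist_name: str = "ai-dynamo"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"ai-dynamo: {version or 'not installed in this environment'}")
```

Run it with the same interpreter that `uv pip install` targeted; a `None` result usually means a different environment is active.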

## 3. Run Dynamo

### Running an LLM API server

Dynamo provides a simple way to spin up a local set of inference components including:

- **OpenAI-Compatible Frontend** – High-performance, OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-Aware Router** – Routes and load-balances traffic across a set of workers.
- **Workers** – A set of pre-configured LLM serving engines.

```
# Start an OpenAI-compatible HTTP server, a pre-processor (prompt templating and tokenization), and a router:
python -m dynamo.frontend --http-port 8080

# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init
```
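
Because the frontend discovers workers after they start, a script may need to wait until the model is actually being served before it sends requests. The sketch below polls a model listing until the model appears. It assumes the frontend exposes the OpenAI-style `GET /v1/models` endpoint on the port chosen above; that endpoint is an assumption here, not something this README states, and `list_models` and `wait_for_model` are hypothetical helper names.

```python
import json
import time
import urllib.error
import urllib.request


def list_models(base_url="http://localhost:8080"):
    """Return model ids from GET /v1/models, or [] if the server is not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            body = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return []
    return [m.get("id") for m in body.get("data", [])]


def wait_for_model(model, fetch=list_models, timeout_s=300.0, poll_s=2.0):
    """Poll fetch() until `model` is listed; return False once the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if model in fetch():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, `wait_for_model("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")` blocks until the worker started above is listed or five minutes have passed. The `fetch` parameter exists so the polling logic can be exercised without a running server.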

#### Send a Request

```bash
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "stream": false,
    "max_tokens": 300
  }' | jq
```

Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
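
The same request can be sent from Python with only the standard library. The sketch below builds the JSON body used in the curl example and, for streaming, parses the reply as server-sent events. It assumes the frontend follows the OpenAI streaming convention (`data: {json}` lines ending with `data: [DONE]`); the helper names `build_payload`, `iter_stream_text`, and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches --http-port 8080 above
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"


def build_payload(prompt, model=MODEL, stream=False, max_tokens=300):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
        "max_tokens": max_tokens,
    }


def iter_stream_text(lines):
    """Yield text fragments from the lines of a streamed reply.

    Assumes OpenAI-style chunks: 'data: {json}' lines, ended by 'data: [DONE]'.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def chat(prompt, stream=False):
    """POST to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, stream=stream)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        if not stream:
            return json.load(resp)["choices"][0]["message"]["content"]
        parts = []
        for text in iter_stream_text(resp):
            print(text, end="", flush=True)  # show tokens as they arrive
            parts.append(text)
        print()
        return "".join(parts)
```

Calling `chat("Hello, how are you?", stream=True)` prints fragments as they arrive and returns the assembled text; it requires the frontend and a worker to be running.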

### Deploying Dynamo

- Follow the [Quickstart Guide](docs/guides/dynamo_deploy/README.md) to deploy on Kubernetes.
- Check out [Backends](components/backends) to deploy various workflow configurations, such as SGLang with a router or vLLM with disaggregated serving.
- Run some [Examples](examples) to learn how to build components in Dynamo and to explore various integrations.

# Engines

Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, NATS and etcd need to be installed, along with a Dynamo frontend (`python -m dynamo.frontend [--interactive]`).

## vLLM

```
uv pip install "ai-dynamo[vllm]"
```

Run the backend/worker like this:
```
python -m dynamo.vllm --help
```

vLLM attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory, pass `--context-length <value>`.

To specify which GPUs to use, set the `CUDA_VISIBLE_DEVICES` environment variable.
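
When workers are launched from a script, the variable can be set per process instead of for the whole shell. The sketch below is one way to do that with the standard library; `env_with_gpus` and `run_worker` are hypothetical helpers, and the GPU indices in the usage note are placeholders.

```python
import os
import subprocess
import sys


def env_with_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_worker(module, args, gpu_ids):
    """Run `python -m <module> <args...>` restricted to the given GPUs."""
    cmd = [sys.executable, "-m", module, *args]
    return subprocess.run(cmd, env=env_with_gpus(gpu_ids), check=False)
```

For example, `run_worker("dynamo.vllm", ["--help"], gpu_ids=[0, 1])` runs the help command with only GPUs 0 and 1 visible to that process. Copying the environment rather than mutating `os.environ` lets several workers run with different GPU sets from the same script.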

## SGLang

```
# Install libnuma
apt install -y libnuma-dev

uv pip install "ai-dynamo[sglang]"
```

Run the backend/worker like this:
```
python -m dynamo.sglang.worker --help
```

You can pass any SGLang flags directly to this worker; see the [SGLang server arguments](https://docs.sglang.ai/backend/server_arguments.html) documentation, which also explains how to use multiple GPUs.

## TensorRT-LLM

We recommend using the [NGC PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) to run the TensorRT-LLM engine.

> [!Note]
> Ensure that you select a PyTorch container image version that matches the version of TensorRT-LLM you are using.
> For example, if you are using `tensorrt-llm==1.0.0rc4`, use the PyTorch container image version `25.05`.
> To find the correct PyTorch container version for your desired `tensorrt-llm` release, visit the [TensorRT-LLM Dockerfile.multi](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi) on GitHub. Switch to the branch that matches your `tensorrt-llm` version, and look for the `BASE_TAG` line to identify the recommended PyTorch container tag.

> [!Important]
> Launch the container with the following additional settings: `--shm-size=1g --ulimit memlock=-1`

### Install prerequisites
```
# Optional step: Only required for Blackwell and Grace Hopper
pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

sudo apt-get -y install libopenmpi-dev
```

> [!Tip]
> You can learn more about these prerequisites and known issues with the pip-based TensorRT-LLM installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).

### After installing the prerequisites above, install Dynamo
```
uv pip install "ai-dynamo[trtllm]"
```

Run the backend/worker like this:
```
python -m dynamo.trtllm --help
```

To specify which GPUs to use, set the `CUDA_VISIBLE_DEVICES` environment variable.

# Developing Locally

## 1. Install libraries

**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```

**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)

```
brew install cmake protobuf

# Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.


## 2. Install Rust

```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```

## 3. Create a Python virtual environment

```
uv venv dynamo
source dynamo/bin/activate
```

## 4. Install build tools

```
uv pip install pip maturin
```

[Maturin](https://github.com/PyO3/maturin) is the build tool for the Rust/Python bindings.

## 5. Build the Rust bindings

```
cd lib/bindings/python
maturin develop --uv
```

## 6. Install the wheel

```
cd $PROJECT_ROOT
uv pip install .

# For development, use
export PYTHONPATH="${PYTHONPATH}:$(pwd)/components/frontend/src:$(pwd)/components/planner/src:$(pwd)/components/backends/vllm/src:$(pwd)/components/backends/sglang/src:$(pwd)/components/backends/trtllm/src:$(pwd)/components/backends/llama_cpp/src:$(pwd)/components/backends/mocker/src"
```
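
If you drive development tools from Python rather than a shell, the same search path can be assembled programmatically. The sketch below simply mirrors the directories from the `export` line above; `COMPONENT_SRC_DIRS` and `dev_pythonpath` are illustrative names, not part of Dynamo.

```python
import os
from pathlib import Path

# Source directories from the export line above, relative to the repo root.
COMPONENT_SRC_DIRS = [
    "components/frontend/src",
    "components/planner/src",
    "components/backends/vllm/src",
    "components/backends/sglang/src",
    "components/backends/trtllm/src",
    "components/backends/llama_cpp/src",
    "components/backends/mocker/src",
]


def dev_pythonpath(repo_root, existing=""):
    """Return a PYTHONPATH value with every component source dir appended."""
    root = Path(repo_root)
    parts = [existing] if existing else []
    parts += [str(root / rel) for rel in COMPONENT_SRC_DIRS]
    return os.pathsep.join(parts)
```

A typical use is `env = dict(os.environ, PYTHONPATH=dev_pythonpath(".", os.environ.get("PYTHONPATH", "")))`, passed to `subprocess.run` when launching a component.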

> [!Note]
> An editable install (`-e`) does not work because the `dynamo` package is split over multiple directories, one per backend.

You should now be able to run `python -m dynamo.frontend`.

Remember that NATS and etcd must be running (see the initial setup above).

Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
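
Since `DYN_LOG` follows the `RUST_LOG` syntax, a value is a comma-separated list of directives: a bare level sets the default, and `target=level` overrides it for one module. The helper below is a small sketch for composing such a value; `build_log_filter` is an illustrative name, and the target `my_crate::my_module` is a hypothetical placeholder, not a real Dynamo module.

```python
def build_log_filter(default_level, per_target=None):
    """Join a default level and per-target overrides into a RUST_LOG-style string."""
    directives = [default_level]
    for target, level in (per_target or {}).items():
        directives.append(f"{target}={level}")
    return ",".join(directives)


# Hypothetical target name, for illustration only.
print(build_log_filter("info", {"my_crate::my_module": "debug"}))
# info,my_crate::my_module=debug
```

Set the resulting string in the environment of the process you launch, for example via `os.environ["DYN_LOG"]` before starting the frontend.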

If you use VS Code or Cursor, the repository includes a `.devcontainer` folder built on [Microsoft's Dev Containers extension](https://code.visualstudio.com/docs/devcontainers/containers). See the [README](.devcontainer/README.md) for instructions.