---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Quickstart
---

This guide covers running Dynamo **using the CLI on your local machine or VM**.

> [!IMPORTANT]
> **Looking to deploy on Kubernetes instead?**
> See the [Kubernetes Installation Guide](../kubernetes/installation-guide.md)
> and [Kubernetes Quickstart](../kubernetes/README.md) for cluster deployments.

## Install Dynamo

**Option A: Containers (Recommended)**

Containers have all dependencies pre-installed. No setup required.

```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0

# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0

# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
```

> [!TIP]
> To run frontend and worker in the same container, either:
>
> - Run processes in background with `&` (see Run Dynamo section below), or
> - Open a second terminal and use `docker exec -it <container_id> bash`

See [Release Artifacts](../reference/release-artifacts.md#container-images) for available
versions, and the backend guides for run instructions: [SGLang](../backends/sglang/README.md) |
[TensorRT-LLM](../backends/trtllm/README.md) | [vLLM](../backends/vllm/README.md)
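
If you plan to open a second shell in the same container, giving the container a fixed name saves looking up its ID for `docker exec`. The sketch below is a hypothetical wrapper (`dynamo_container` and its `DRY_RUN` switch are illustrative, not part of Dynamo) that builds the same `docker run` command shown above and adds Docker's standard `--name` flag. The `1.0.0` default tag only mirrors the examples above; check Release Artifacts for current versions.

```bash
# Hypothetical wrapper (not part of Dynamo): start a Dynamo runtime container
# for one backend under a predictable name.
#   usage: dynamo_container <sglang|tensorrtllm|vllm> [version] [name]
dynamo_container() {
  local backend="$1"
  local version="${2:-1.0.0}"          # mirrors the tag used in the examples above
  local name="${3:-dynamo-$backend}"

  case "$backend" in
    sglang|tensorrtllm|vllm) ;;
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac

  local -a cmd
  cmd=(docker run --gpus all --network host --rm -it --name "$name"
       "nvcr.io/nvidia/ai-dynamo/${backend}-runtime:${version}")

  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "${cmd[*]}"                   # print the command instead of running it
  else
    "${cmd[@]}"
  fi
}
```

For example, run `dynamo_container vllm` in one terminal, then `docker exec -it dynamo-vllm bash` in another.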

**Option B: Install from PyPI**

```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtual environment
uv venv venv
source venv/bin/activate
# uv environments don't include pip by default; the TensorRT-LLM steps below use it
uv pip install pip
```

Install system dependencies and the Dynamo wheel for your chosen backend:

**SGLang**

```bash
sudo apt install python3-dev
uv pip install --prerelease=allow "ai-dynamo[sglang]"
```

> [!NOTE]
> For CUDA 13 (B300/GB300), the container is recommended. See
> [SGLang install docs](https://docs.sglang.io/get_started/install.html) for details.

**TensorRT-LLM**

```bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```

> [!NOTE]
> TensorRT-LLM requires `pip` due to a transitive Git URL dependency that
> `uv` doesn't resolve. We recommend using the TensorRT-LLM container for
> broader compatibility. See the [TRT-LLM backend guide](../backends/trtllm/README.md)
> for details.

**vLLM**

```bash
sudo apt install python3-dev libxcb1
uv pip install --prerelease=allow "ai-dynamo[vllm]"
```
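
Before moving on, you can confirm that the wheel installed the modules you will launch with `python3 -m` in the Run Dynamo section (`dynamo.frontend` plus `dynamo.sglang`, `dynamo.trtllm`, or `dynamo.vllm`). A minimal sketch: `module_available` is a hypothetical helper, not part of Dynamo, that asks Python's standard `importlib.util.find_spec` whether a module can be found in the active environment without running it.

```bash
# Hypothetical helper (not part of Dynamo): exit 0 if `python3 -m <module>`
# could find <module> in the current environment, 1 otherwise.
module_available() {
  python3 - "$1" <<'EOF'
import importlib.util
import sys

try:
    found = importlib.util.find_spec(sys.argv[1]) is not None
except ModuleNotFoundError:
    # Raised for dotted names whose parent package (e.g. `dynamo`) is missing.
    found = False
sys.exit(0 if found else 1)
EOF
}

# Check the frontend plus the worker module for your backend.
for module in dynamo.frontend dynamo.vllm; do
  if module_available "$module"; then
    echo "found:   $module"
  else
    echo "missing: $module"
  fi
done
```

Replace `dynamo.vllm` in the loop with the worker module for your backend.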

## Run Dynamo

> [!TIP]
> **(Optional)** Before running Dynamo, verify your system configuration:
> `python3 deploy/sanity_check.py`

Start the frontend, then start a worker for your chosen backend.

> [!TIP]
> To run everything from a single terminal (useful in containers), append `> logfile.log 2>&1 &`
> to each command to run it in the background. Example: `python3 -m dynamo.frontend --discovery-backend file > dynamo.frontend.log 2>&1 &`
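
When several processes run in the background of one shell, it helps to record their PIDs so you can stop them cleanly. A minimal sketch, assuming Bash: `run_bg` and `stop_all` are hypothetical helpers, not Dynamo commands, and the log file names are arbitrary.

```bash
# Hypothetical helpers (not part of Dynamo) for single-terminal runs.
BG_PIDS=()

# run_bg <logfile> <command...>: start a command in the background,
# redirect its stdout and stderr to <logfile>, and remember its PID.
run_bg() {
  local log="$1"
  shift
  "$@" > "$log" 2>&1 &
  BG_PIDS+=("$!")
  echo "started pid $! ($*), logging to $log"
}

# stop_all: send SIGTERM to every process started by run_bg and wait for it to exit.
stop_all() {
  local pid
  for pid in "${BG_PIDS[@]}"; do
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
  done
  BG_PIDS=()
}

# Example (uncomment to use):
# trap stop_all EXIT
# run_bg dynamo.frontend.log python3 -m dynamo.frontend --discovery-backend file
# run_bg dynamo.worker.log python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
# tail -f dynamo.frontend.log dynamo.worker.log
```

Registering `stop_all` with `trap ... EXIT` means the frontend and worker are shut down when the script or shell exits, instead of lingering in the background.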

```bash
# Start the OpenAI-compatible frontend (default port is 8000)
# --discovery-backend file avoids needing etcd (frontend and workers must share a disk)
python3 -m dynamo.frontend --discovery-backend file
```

In another terminal (or same terminal if using background mode), start a worker:

**SGLang**

```bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

**TensorRT-LLM**

```bash
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

**vLLM**

```bash
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
  --kv-events-config '{"enable_kv_cache_events": false}'
```

> [!NOTE]
> For dependency-free local development, disable KV event publishing (avoids NATS):
>
> - **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`
> - **SGLang:** No flag needed (KV events disabled by default)
> - **TensorRT-LLM:** No flag needed (KV events disabled by default)
>
> **TensorRT-LLM only:** The warning `Cannot connect to ModelExpress server/transport error. Using direct download.`
> is expected and can be safely ignored.

> [!NOTE]
> **Deprecation notice:** vLLM currently enables KV event publishing automatically when prefix caching is active. In a future release, KV events will be disabled by default for all backends. To prepare, set `--kv-events-config` explicitly.
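
The frontend starts before the worker, so requests sent while the model is still loading are likely to fail. One way to wait is to poll the model list before sending traffic. This sketch assumes the frontend serves the standard OpenAI `GET /v1/models` route on port 8000 and that the model ID appears verbatim in that response; `wait_for_model` is a hypothetical helper, not part of Dynamo.

```bash
# Hypothetical helper (not part of Dynamo): poll GET <base_url>/v1/models
# until the response mentions <model>, or give up after <timeout> seconds.
#   usage: wait_for_model <base_url> <model> [timeout_seconds]
wait_for_model() {
  local base_url="$1" model="$2" timeout="${3:-300}"
  local deadline=$((SECONDS + timeout))

  while [ "$SECONDS" -lt "$deadline" ]; do
    # -s: no progress output, -f: treat HTTP errors as failures
    if curl -sf "$base_url/v1/models" 2>/dev/null | grep -qF "$model"; then
      echo "model $model is registered"
      return 0
    fi
    sleep 2
  done
  echo "timed out after ${timeout}s waiting for $model" >&2
  return 1
}

# Example:
# wait_for_model http://localhost:8000 Qwen/Qwen3-0.6B
```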

## Test Your Deployment

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 50}'
```
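
The response uses the OpenAI chat completion format, so the generated text is in `choices[0].message.content`. To print only that text, you can pipe the response through a small filter. A sketch: `extract_reply` is a hypothetical helper that relies only on Python's standard `json` module, so it needs no extra tools such as `jq`.

```bash
# Hypothetical helper (not part of Dynamo): read an OpenAI-style chat
# completion JSON document on stdin and print the first choice's message text.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example:
# curl -s localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "Qwen/Qwen3-0.6B",
#        "messages": [{"role": "user", "content": "Hello!"}],
#        "max_tokens": 50}' | extract_reply
```

If the response is an error object rather than a completion, the filter exits with a non-zero status and a Python traceback, which makes failures easy to spot in scripts.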