..
   SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
   All rights reserved.
   SPDX-License-Identifier: Apache-2.0

This guide covers running Dynamo **using the CLI on your local machine or VM**.

.. important::

   **Looking to deploy on Kubernetes instead?**
   See the `Kubernetes Installation Guide <../kubernetes/installation_guide.html>`_
   and `Kubernetes Quickstart <../kubernetes/README.html>`_ for cluster deployments.

**Install Dynamo**

**Option A: Containers (Recommended)**

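Containers ship with all dependencies pre-installed, so no further setup is needed beyond Docker
with GPU access (the ``--gpus all`` flag below relies on the NVIDIA Container Toolkit).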

.. code-block:: bash

   # SGLang
   docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1

   # TensorRT-LLM
   docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1

   # vLLM
   docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1

.. tip::

   To run frontend and worker in the same container, either:

   - Run processes in the background with ``&`` (see the **Run Dynamo** section below), or
   - Open a second terminal and attach with ``docker exec -it <container_id> bash``

See `Release Artifacts <../reference/release-artifacts.html#container-images>`_ for available
versions, and the backend guides for run instructions: `SGLang <../backends/sglang/README.html>`_ |
`TensorRT-LLM <../backends/trtllm/README.html>`_ | `vLLM <../backends/vllm/README.html>`_

**Option B: Install from PyPI**

.. code-block:: bash

   # Install uv (recommended Python package manager)
   curl -LsSf https://astral.sh/uv/install.sh | sh

   # Create and activate a virtual environment
   uv venv venv
   source venv/bin/activate
   # Install pip into the venv (the TensorRT-LLM instructions below call pip directly)
   uv pip install pip

Install system dependencies and the Dynamo wheel for your chosen backend:

**SGLang**

.. code-block:: bash

   sudo apt install python3-dev
   uv pip install --prerelease=allow "ai-dynamo[sglang]"

.. note::

   For CUDA 13 (B300/GB300), the container is recommended. See
   `SGLang install docs <https://docs.sglang.ai/start/install.html>`_ for details.

**TensorRT-LLM**

.. code-block:: bash

   sudo apt install python3-dev
   pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
   pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"

.. note::

   Install TensorRT-LLM with ``pip`` rather than ``uv``: it has a transitive Git URL
   dependency that ``uv`` does not resolve. We recommend using the TensorRT-LLM container for
   broader compatibility. See the `TRT-LLM backend guide <../backends/trtllm/README.html>`_
   for details.

**vLLM**

.. code-block:: bash

   sudo apt install python3-dev libxcb1
   uv pip install --prerelease=allow "ai-dynamo[vllm]"

**Run Dynamo**

.. tip::

   **(Optional)** Before running Dynamo, verify your system configuration:
   ``python3 deploy/sanity_check.py``

Start the frontend, then start a worker for your chosen backend.

.. tip::

   To run everything in a single terminal (useful in containers), append ``> logfile.log 2>&1 &``
   to each command so that it runs in the background and writes its output to a log file.
   Example: ``python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &``

.. code-block:: bash

   # Start the OpenAI-compatible frontend (default port is 8000)
   # --store-kv file avoids needing etcd (frontend and workers must share a disk)
   python3 -m dynamo.frontend --store-kv file

In another terminal (or the same one, if the frontend is running in the background), start a worker:

**SGLang**

.. code-block:: bash

   python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file

**TensorRT-LLM**

.. code-block:: bash

   python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file

**vLLM**

.. code-block:: bash

   python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
     --kv-events-config '{"enable_kv_cache_events": false}'

.. note::

   For dependency-free local development, disable KV event publishing (avoids NATS):

   - **vLLM:** Add ``--kv-events-config '{"enable_kv_cache_events": false}'``
   - **SGLang:** No flag needed (KV events disabled by default)
   - **TensorRT-LLM:** No flag needed (KV events disabled by default)

   **TensorRT-LLM only:** The warning ``Cannot connect to ModelExpress server/transport error. Using direct download.``
   is expected and can be safely ignored.
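
The frontend and worker commands above can also be launched together from a single terminal.
The following is a minimal sketch, not an official script: the function names, the PID
variables, and the ``dynamo.worker.log`` file name are illustrative assumptions, while the
commands and flags are the ones shown in this guide (vLLM is used as the example backend).

.. code-block:: bash

   # Sketch only: helper names and log file names are illustrative, not part of Dynamo.

   # Start the frontend in the background and remember its PID.
   start_frontend() {
     python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &
     FRONTEND_PID=$!
   }

   # Start a vLLM worker in the background (swap in the SGLang or TensorRT-LLM
   # command from this guide to use a different backend).
   start_vllm_worker() {
     python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
       --kv-events-config '{"enable_kv_cache_events": false}' > dynamo.worker.log 2>&1 &
     WORKER_PID=$!
   }

   # Send SIGTERM to whichever of the two processes were started, then reap them.
   stop_all() {
     for pid in "${WORKER_PID:-}" "${FRONTEND_PID:-}"; do
       if [ -n "$pid" ]; then
         kill "$pid" 2>/dev/null || true
         wait "$pid" 2>/dev/null || true
       fi
     done
   }

After pasting these definitions into an interactive shell, run ``start_frontend`` and
``start_vllm_worker``, follow progress with ``tail -f dynamo.worker.log``, and call ``stop_all``
when finished. Keeping the PIDs in variables avoids having to look the processes up later.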

**Test Your Deployment**

.. code-block:: bash

   curl localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 50}'
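
A worker needs some time to load the model, so the request above can fail if it is sent too
early. The sketch below polls until the model is listed and then extracts the reply text. It
rests on assumptions not confirmed by this guide: that the frontend serves an OpenAI-style
``/v1/models`` listing, and that responses follow the OpenAI chat completion layout
(``choices[0].message.content``). The helper names ``wait_for_model`` and ``extract_reply``
are hypothetical.

.. code-block:: bash

   # Sketch only: helper names are illustrative; the /v1/models route is assumed.

   # Poll the model listing until MODEL appears, or give up after ATTEMPTS tries.
   # Usage: wait_for_model MODEL [URL] [ATTEMPTS] [DELAY_SECONDS]
   wait_for_model() {
     model="$1"
     url="${2:-http://localhost:8000/v1/models}"
     attempts="${3:-60}"
     delay="${4:-5}"
     i=0
     while [ "$i" -lt "$attempts" ]; do
       # grep -F matches the model name as a fixed string in the response body
       if curl -sf "$url" 2>/dev/null | grep -qF "$model"; then
         return 0
       fi
       i=$((i + 1))
       sleep "$delay"
     done
     return 1
   }

   # Read a chat completion response on stdin and print the assistant's reply.
   extract_reply() {
     python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
   }

   # Example (run once the frontend and a worker are up):
   #   wait_for_model "Qwen/Qwen3-0.6B" && \
   #     curl -s localhost:8000/v1/chat/completions \
   #       -H "Content-Type: application/json" \
   #       -d '{"model": "Qwen/Qwen3-0.6B",
   #            "messages": [{"role": "user", "content": "Hello!"}],
   #            "max_tokens": 50}' | extract_reply

``wait_for_model`` returns a non-zero status if the model never appears, so it can guard the
request in a script without sending it to a frontend that has no worker registered yet.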