Commit e8c7bbf3 (unverified), authored by dagil-nvidia, committed by GitHub

docs: refactor Dynamo readme.md and quick_start_local.rst (#5649)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 7d5ed665
@@ -44,34 +44,35 @@ Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provide
- **Accelerated Data Transfer** – Reduces inference response time using NIXL
- **KV Cache Offloading** – Leverages multiple memory hierarchies for higher throughput
<p align="center">
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>
Built in Rust for performance and Python for extensibility, Dynamo is fully open-source with an OSS-first development approach.
## Framework Support Matrix
| Feature | [vLLM](docs/backends/vllm/README.md) | [SGLang](docs/backends/sglang/README.md) | [TensorRT-LLM](docs/backends/trtllm/README.md) |
| -------------------------------------------------------------------- | :--: | :----: | :----------: |
| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
| [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |
## Backend Feature Support
| | [SGLang](docs/backends/sglang/README.md) | [TensorRT-LLM](docs/backends/trtllm/README.md) | [vLLM](docs/backends/vllm/README.md) |
|---|:----:|:----------:|:--:|
| **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | 🚧 | ✅ | ✅ |
| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
| [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |
> **[Full Feature Matrix →](docs/reference/feature-matrix.md)** – Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
## Dynamo Architecture
<p align="center">
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>
> **[Architecture Deep Dive →](docs/design_docs/architecture.md)**
## Latest News
- [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)
- [12/02] [Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo](https://www.marktechpost.com/2025/12/02/nvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems/)
- [12/01] [InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https://www.infoq.com/news/2025/12/nvidia-dynamo-kubernetes/)
- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
- [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)
## Get Started
@@ -79,62 +80,81 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
|------|----------|------|--------------|
| [**Local Quick Start**](#local-quick-start) | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
| [**Kubernetes Deployment**](#kubernetes-deployment) | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
| [**Building from Source**](#building-from-source) | Contributors and development | ~15 min | Ubuntu, Rust, Python |
## Contributing
Want to help shape the future of distributed LLM inference? We welcome contributors at all levels, from doc fixes to new features.
- **[Contributing Guide](CONTRIBUTING.md)** – How to get started
- **[Report a Bug](https://github.com/ai-dynamo/dynamo/issues/new?template=bug_report.yml)** – Found an issue?
- **[Feature Request](https://github.com/ai-dynamo/dynamo/issues/new?template=feature_request.yml)** – Have an idea?
Want to help shape the future of distributed LLM inference? See the **[Contributing Guide](CONTRIBUTING.md)**.
# Local Quick Start
The following examples require a few system level packages.
Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/reference/support-matrix.md](docs/reference/support-matrix.md)
## 1. Initial Setup
The Dynamo team recommends the `uv` Python package manager, although any way works. Install uv:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### Install Python Development Headers
Backend engines require Python development headers for JIT compilation. Install them with:
```bash
sudo apt install python3-dev
```
## 2. Select an Engine
We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
```
uv venv venv
source venv/bin/activate
uv pip install pip
# Choose one
uv pip install "ai-dynamo[sglang]" #replace with [vllm], [trtllm], etc.
```
## 3. Run Dynamo
### Sanity Check (Optional)
Before trying out Dynamo, you can verify your system configuration and dependencies:
```bash
python3 deploy/sanity_check.py
```
This is a quick check for system resources, development tools, LLM frameworks, and Dynamo components.
### Running an LLM API Server
## Install Dynamo
### Option A: Containers (Recommended)
Containers have all dependencies pre-installed. No setup required.
```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1
# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1
# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
```
> **Tip:** To run frontend and worker in the same container, either run processes in background with `&` (see below), or open a second terminal and use `docker exec -it <container_id> bash`.
See [Release Artifacts](docs/reference/release-artifacts.md#container-images) for available versions.
### Option B: Install from PyPI
The Dynamo team recommends the `uv` Python package manager, although any way works.
```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
```
Install system dependencies and the Dynamo wheel for your chosen backend:
**SGLang**
```bash
sudo apt install python3-dev
uv pip install "ai-dynamo[sglang]"
```
> **Note:** For CUDA 13 (B300/GB300), the container is recommended. See [SGLang install docs](https://docs.sglang.ai/start/install.html) for details.
**TensorRT-LLM**
```bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```
> **Note:** TensorRT-LLM requires `pip` due to a transitive Git URL dependency that `uv` doesn't resolve. We recommend using the [TensorRT-LLM container](docs/reference/release-artifacts.md#container-images) for broader compatibility.
**vLLM**
```bash
sudo apt install python3-dev libxcb1
uv pip install "ai-dynamo[vllm]"
```
## Run Dynamo
> **Tip (Optional):** Before running Dynamo, verify your system configuration with `python3 deploy/sanity_check.py`
Dynamo provides a simple way to spin up a local set of inference components including:
@@ -142,17 +162,38 @@ Dynamo provides a simple way to spin up a local set of inference components incl
- **Basic and Kv Aware Router** – Route and load balance traffic to a set of workers.
- **Workers** – Set of pre-configured LLM serving engines.
Start the frontend:
> **Tip:** To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run processes in background. Example: `python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &`
```bash
# Start an OpenAI compatible HTTP server with prompt templating, tokenization, and routing.
# For local dev: --store-kv file avoids etcd (workers and frontend must share a disk)
python3 -m dynamo.frontend --http-port 8000 --store-kv file
# Start the SGLang engine. You can run several of these for the same or different models.
# The frontend will discover them automatically.
python3 -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --store-kv file
```
In another terminal (or same terminal if using background mode), start a worker for your chosen backend:
```bash
# SGLang
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
# TensorRT-LLM
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file
# vLLM (note: uses --model, not --model-path)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
--kv-events-config '{"enable_kv_cache_events": false}'
```
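The backend-specific differences in these commands (the model flag name and vLLM's extra KV-events flag) can be kept in one place with a small helper. The sketch below only assembles the command lines shown above; `worker_command`, `MODEL_FLAG`, and the `local_dev` parameter are hypothetical names for illustration and are not part of Dynamo.

```python
import json
import shlex

# Flag used to pass the model name, per backend, as in the commands above.
MODEL_FLAG = {"sglang": "--model-path", "trtllm": "--model-path", "vllm": "--model"}


def worker_command(backend: str, model: str, local_dev: bool = True) -> list[str]:
    """Assemble the argv for a Dynamo worker (hypothetical helper)."""
    if backend not in MODEL_FLAG:
        raise ValueError(f"unknown backend: {backend!r}")
    cmd = ["python3", "-m", f"dynamo.{backend}", MODEL_FLAG[backend], model]
    if local_dev:
        # A file-based store avoids the etcd dependency for local runs.
        cmd += ["--store-kv", "file"]
        if backend == "vllm":
            # Disabling KV event publishing avoids the NATS dependency.
            cmd += ["--kv-events-config", json.dumps({"enable_kv_cache_events": False})]
    return cmd


if __name__ == "__main__":
    for name in MODEL_FLAG:
        print(shlex.join(worker_command(name, "Qwen/Qwen3-0.6B")))
```

The returned list can be passed directly to `subprocess.Popen`, which avoids shell quoting issues with the JSON argument.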
> **Note:** vLLM workers publish KV cache events by default, which requires NATS. For dependency-free local development with vLLM, add `--kv-events-config '{"enable_kv_cache_events": false}'`. This keeps local prefix caching enabled while disabling event publishing. See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.
> **Note:** For dependency-free local development, disable KV event publishing (avoids NATS):
> - **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`
> - **SGLang:** No flag needed (KV events disabled by default)
> - **TensorRT-LLM:** No flag needed (KV events disabled by default)
>
> **TensorRT-LLM only:** The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected and can be safely ignored.
>
> See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.
#### Send a Request
@@ -172,13 +213,6 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
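For reference, the same request can be sent from Python with only the standard library. This is a minimal sketch, not a Dynamo client API: `build_request`, `iter_deltas`, and `stream_chat` are hypothetical helpers, and the streaming parser assumes the usual OpenAI-style server-sent-event framing (`data: {...}` lines ending with `data: [DONE]`), which this README does not spell out.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Frontend address, matching --http-port 8000 in the commands above.
URL = "http://localhost:8000/v1/chat/completions"


def build_request(model: str, prompt: str, stream: bool = True) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
        "stream": stream,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


def stream_chat(model: str, prompt: str) -> None:
    """Print the reply as it arrives; needs a running frontend and worker."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        for piece in iter_deltas(resp):
            print(piece, end="", flush=True)
    print()
```

With a frontend and worker running, `stream_chat("Qwen/Qwen3-0.6B", "Hello!")` prints tokens as they are produced, which mirrors `curl -N` with `"stream": true`.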
### What's Next?
- **Scale up**: Deploy on Kubernetes with [Recipes](recipes/)
- **Add features**: Enable [KV-aware routing](docs/router/kv_cache_routing.md), [disaggregated serving](docs/design_docs/disagg_serving.md)
- **Benchmark**: Use [AIPerf](docs/benchmarks/benchmarking.md) to measure performance
- **Try other engines**: [vLLM](docs/backends/vllm/), [SGLang](docs/backends/sglang/), [TensorRT-LLM](docs/backends/trtllm/)
# Kubernetes Deployment
For production deployments on Kubernetes clusters with multiple GPUs.
@@ -206,60 +240,6 @@ See [recipes/README.md](recipes/README.md) for the full list and deployment inst
- [Amazon EKS](examples/deployments/EKS/)
- [Google GKE](examples/deployments/GKE/)
# Concepts
## Engines
Dynamo is inference engine agnostic. Install the wheel for your chosen engine and run with `python3 -m dynamo.<engine> --help`.
| Engine | Install | Docs | Best For |
|--------|---------|------|----------|
| vLLM | `uv pip install ai-dynamo[vllm]` | [Guide](docs/backends/vllm/) | Broadest feature coverage |
| SGLang | `uv pip install ai-dynamo[sglang]` | [Guide](docs/backends/sglang/) | High-throughput serving |
| TensorRT-LLM | `pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo[trtllm]` | [Guide](docs/backends/trtllm/) | Maximum performance |
> **Note:** TensorRT-LLM requires `pip` (not `uv`) due to URL-based dependencies. See the [TRT-LLM guide](docs/backends/trtllm/) for container setup and prerequisites.
Use `CUDA_VISIBLE_DEVICES` to specify which GPUs to use. Engine-specific options (context length, multi-GPU, etc.) are documented in each backend guide.
## Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|------------|------|------|-------|
| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
| **Local Development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
| **KV-Aware Routing** | — | ✅ Required | Prefix caching enabled by default requires NATS |
For local development without external dependencies, pass `--store-kv file` (avoids etcd) to both the frontend and workers. vLLM users should also pass `--kv-events-config '{"enable_kv_cache_events": false}'` to disable KV event publishing (avoids NATS) while keeping local prefix caching enabled; SGLang and TRT-LLM don't require this flag.
For distributed non-Kubernetes deployments or KV-aware routing:
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.
To quickly setup both: `docker compose -f deploy/docker-compose.yml up -d`
# Advanced Topics
## Benchmarking
Dynamo provides comprehensive benchmarking tools:
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
## Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate without running the server:
```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```
This writes to `docs/frontends/openapi.json`.
# Building from Source
For contributors who want to build Dynamo from source rather than installing from PyPI.
@@ -347,13 +327,64 @@ cd $PROJECT_ROOT
uv pip install -e .
```
You should now be able to run `python3 -m dynamo.frontend`.
For local development, pass `--store-kv file` to avoid external dependencies (see Service Discovery and Messaging section).
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
If you use vscode or cursor, we have a .devcontainer folder built on [Microsofts Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions see the [ReadMe](.devcontainer/README.md) for more details.
## 8. Run the Frontend
```bash
python3 -m dynamo.frontend
```
## 9. Configure for Local Development
- Pass `--store-kv file` to avoid external dependencies (see [Service Discovery and Messaging](#service-discovery-and-messaging))
- Set `DYN_LOG` to adjust the logging level (e.g., `export DYN_LOG=debug`). Uses the same syntax as `RUST_LOG`
> **Note:** VSCode and Cursor users can use the `.devcontainer` folder for a pre-configured dev environment. See the [devcontainer README](.devcontainer/README.md) for details.
# Advanced Topics
## Benchmarking
Dynamo provides comprehensive benchmarking tools:
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
## Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate without running the server:
```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```
This writes to `docs/frontends/openapi.json`.
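Once generated, the spec can be inspected programmatically. The sketch below assumes only the standard OpenAPI 3 layout (a top-level `paths` object mapping each path to its HTTP methods); `list_operations` and `load_spec` are hypothetical helper names, not part of Dynamo.

```python
import json
from pathlib import Path

# HTTP method keys allowed in an OpenAPI 3 Path Item Object.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_operations(spec: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs from an OpenAPI 3 document, sorted by path."""
    ops = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            if key.lower() in HTTP_METHODS:  # skip keys such as "parameters"
                ops.append((key.upper(), path))
    return sorted(ops, key=lambda op: (op[1], op[0]))


def load_spec(path: str = "docs/frontends/openapi.json") -> dict:
    """Load the spec file written by the generator command above."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Running `list_operations(load_spec())` from the repository root after generating the file gives a quick overview of the endpoints the frontend exposes.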
## Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs/kubernetes/service_discovery.md)) handle service discovery. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|------------|------|------|-------|
| **Local Development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.
For Slurm or other distributed deployments (and KV-aware routing):
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.
To quickly setup both: `docker compose -f deploy/docker-compose.yml up -d`
See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LLM on Slurm](examples/basics/multinode/trtllm/README.md) for deployment examples.
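Before starting workers that depend on etcd or NATS, it can help to confirm that both services accept connections. This sketch assumes the conventional default client ports (2379 for etcd, 4222 for NATS); check `deploy/docker-compose.yml` for the ports actually published. `is_reachable` and `check_services` are hypothetical helpers.

```python
import socket
from typing import Optional

# Conventional default client ports (an assumption; your setup may differ).
DEFAULT_PORTS = {"etcd": 2379, "nats": 4222}


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost",
                   ports: Optional[dict[str, int]] = None) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: is_reachable(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'not reachable'}")
```

A successful TCP connect only shows that something is listening; it does not verify that NATS was started with JetStream enabled.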
## More News
- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
- [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)
<!-- Reference links for Feature Compatibility Matrix -->
[disagg]: docs/design_docs/disagg_serving.md
......
Pip (PyPI)
----------
Install a pre-built wheel from PyPI.
.. code-block:: bash
# Create a virtual environment and activate it
uv venv venv
source venv/bin/activate
# Install Dynamo from PyPI (choose one backend extra)
uv pip install "ai-dynamo[sglang]==my-tag" # or [vllm], [trtllm]
Pip from source
---------------
Install directly from a local checkout for development.
.. code-block:: bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Create a virtual environment and activate it
uv venv venv
source venv/bin/activate
uv pip install ".[sglang]" # or [vllm], [trtllm]
Docker
------
Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`).
.. code-block:: bash
# Run a container (mount your workspace if needed)
docker run --rm -it \
--gpus all \
--network host \
nvcr.io/nvidia/ai-dynamo/sglang-runtime:my-tag # or vllm, tensorrtllm
Get started with Dynamo locally in just a few commands:
**1. Install Dynamo**
.. code-block:: bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install Dynamo
uv venv venv
source venv/bin/activate
# Use prerelease flag to install RC versions of flashinfer and/or other dependencies
uv pip install --prerelease=allow "ai-dynamo[sglang]" # or [vllm], [trtllm]
**2. Start etcd/NATS**
.. code-block:: bash
# Fetch and start etcd and NATS using Docker Compose
VERSION=$(uv pip show ai-dynamo | grep Version | cut -d' ' -f2)
curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/refs/tags/v${VERSION}/deploy/docker-compose.yml
docker compose -f docker-compose.yml up -d
**3. Run Dynamo**
.. code-block:: bash
# Start the OpenAI compatible frontend (default port is 8000)
python -m dynamo.frontend
# In another terminal, start an SGLang worker
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
**4. Test your deployment**
..
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
This guide covers running Dynamo **using the CLI on your local machine or VM**.
.. important::
**Looking to deploy on Kubernetes instead?**
See the `Kubernetes Installation Guide <../kubernetes/installation_guide.html>`_
and `Kubernetes Quickstart <../kubernetes/README.html>`_ for cluster deployments.
**Install Dynamo**
**Option A: Containers (Recommended)**
Containers have all dependencies pre-installed. No setup required.
.. code-block:: bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1
# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1
# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
.. tip::
To run frontend and worker in the same container, either:
- Run processes in background with ``&`` (see Run Dynamo section below), or
- Open a second terminal and use ``docker exec -it <container_id> bash``
See `Release Artifacts <../reference/release-artifacts.html#container-images>`_ for available
versions and backend guides for run instructions: `SGLang <../backends/sglang/README.html>`_ |
`TensorRT-LLM <../backends/trtllm/README.html>`_ | `vLLM <../backends/vllm/README.html>`_
**Option B: Install from PyPI**
.. code-block:: bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
Install system dependencies and the Dynamo wheel for your chosen backend:
**SGLang**
.. code-block:: bash
sudo apt install python3-dev
uv pip install --prerelease=allow "ai-dynamo[sglang]"
.. note::
For CUDA 13 (B300/GB300), the container is recommended. See
`SGLang install docs <https://docs.sglang.ai/start/install.html>`_ for details.
**TensorRT-LLM**
.. code-block:: bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
.. note::
TensorRT-LLM requires ``pip`` due to a transitive Git URL dependency that
``uv`` doesn't resolve. We recommend using the TensorRT-LLM container for
broader compatibility. See the `TRT-LLM backend guide <../backends/trtllm/README.html>`_
for details.
**vLLM**
.. code-block:: bash
sudo apt install python3-dev libxcb1
uv pip install --prerelease=allow "ai-dynamo[vllm]"
**Run Dynamo**
.. tip::
**(Optional)** Before running Dynamo, verify your system configuration:
``python3 deploy/sanity_check.py``
Start the frontend, then start a worker for your chosen backend.
.. tip::
To run in a single terminal (useful in containers), append ``> logfile.log 2>&1 &``
to run processes in background. Example: ``python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &``
.. code-block:: bash
# Start the OpenAI compatible frontend (default port is 8000)
# --store-kv file avoids needing etcd (frontend and workers must share a disk)
python3 -m dynamo.frontend --store-kv file
In another terminal (or same terminal if using background mode), start a worker:
**SGLang**
.. code-block:: bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
**TensorRT-LLM**
.. code-block:: bash
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file
**vLLM**
.. code-block:: bash
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
--kv-events-config '{"enable_kv_cache_events": false}'
.. note::
For dependency-free local development, disable KV event publishing (avoids NATS):
- **vLLM:** Add ``--kv-events-config '{"enable_kv_cache_events": false}'``
- **SGLang:** No flag needed (KV events disabled by default)
- **TensorRT-LLM:** No flag needed (KV events disabled by default)
**TensorRT-LLM only:** The warning ``Cannot connect to ModelExpress server/transport error. Using direct download.``
is expected and can be safely ignored.
**Test Your Deployment**
.. code-block:: bash
@@ -41,5 +148,3 @@ Get started with Dynamo locally in just a few commands:
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50}'
..
Installation Page (left sidebar target)
..
Installation
============
.. include:: ../_includes/install.rst
@@ -41,7 +41,6 @@ Quickstart
:caption: Getting Started
Quickstart <self>
Installation <_sections/installation>
Support Matrix <reference/support-matrix.md>
Feature Matrix <reference/feature-matrix.md>
Release Artifacts <reference/release-artifacts.md>
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Dynamo Feature Compatibility Matrices
This document provides a comprehensive compatibility matrix for key Dynamo features across the supported backends.
......