docs: refactor Dynamo readme.md and quick_start_local.rst (#5649)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Commit e8c7bbf3 (unverified) authored Jan 30, 2026 by dagil-nvidia, committed by GitHub Jan 30, 2026
6 changed files
--- a/README.md
+++ b/README.md
@@ -44,34 +44,35 @@ Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provide
 - **Accelerated Data Transfer** – Reduces inference response time using NIXL
 - **KV Cache Offloading** – Leverages multiple memory hierarchies for higher throughput

-<p align="center">
-  <img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
-</p>
-
 Built in Rust for performance and Python for extensibility, Dynamo is fully open-source with an OSS-first development approach.

-## Framework Support Matrix
+## Backend Feature Support

-| Feature                                                              | [vLLM](docs/backends/vllm/README.md) | [SGLang](docs/backends/sglang/README.md) | [TensorRT-LLM](docs/backends/trtllm/README.md) |
-| -------------------------------------------------------------------- | :--: | :----: | :----------: |
-| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md)      | ✅   | ✅     | ✅           |
-| [**KV-Aware Routing**](docs/router/kv_cache_routing.md)              | ✅   | ✅     | ✅           |
-| [**SLA-Based Planner**](docs/planner/sla_planner.md)                 | ✅   | ✅     | ✅           |
-| [**KVBM**](docs/kvbm/kvbm_architecture.md)                           | ✅   | 🚧     | ✅           |
-| [**Multimodal**](docs/multimodal/index.md)                           | ✅   | ✅     | ✅           |
-| [**Tool Calling**](docs/agents/tool-calling.md)                      | ✅   | ✅     | ✅           |
+| | [SGLang](docs/backends/sglang/README.md) | [TensorRT-LLM](docs/backends/trtllm/README.md) | [vLLM](docs/backends/vllm/README.md) |
+|---|:----:|:----------:|:--:|
+| **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
+| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
+| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
+| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
+| [**KVBM**](docs/kvbm/kvbm_architecture.md) | 🚧 | ✅ | ✅ |
+| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
+| [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |

 > **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.

+## Dynamo Architecture
+
+<p align="center">
+  <img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
+</p>
+
+> **[Architecture Deep Dive →](docs/design_docs/architecture.md)**
+
 ## Latest News

 - [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)
 - [12/02] [Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo](https://www.marktechpost.com/2025/12/02/nvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems/)
 - [12/01] [InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https://www.infoq.com/news/2025/12/nvidia-dynamo-kubernetes/)
-- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
-- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
-- [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
-- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)

 ## Get Started

@@ -79,62 +80,81 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
 |------|----------|------|--------------|
 | [**Local Quick Start**](#local-quick-start) | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
 | [**Kubernetes Deployment**](#kubernetes-deployment) | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
+| [**Building from Source**](#building-from-source) | Contributors and development | ~15 min | Ubuntu, Rust, Python |

-## Contributing
-
-Want to help shape the future of distributed LLM inference? We welcome contributors at all levels—from doc fixes to new features.
-
-- **[Contributing Guide](CONTRIBUTING.md)** – How to get started
-- **[Report a Bug](https://github.com/ai-dynamo/dynamo/issues/new?template=bug_report.yml)** – Found an issue?
-- **[Feature Request](https://github.com/ai-dynamo/dynamo/issues/new?template=feature_request.yml)** – Have an idea?
+Want to help shape the future of distributed LLM inference? See the **[Contributing Guide](CONTRIBUTING.md)**.

 # Local Quick Start

 The following examples require a few system level packages.
 Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/reference/support-matrix.md](docs/reference/support-matrix.md)

-## 1. Initial Setup
+## Install Dynamo

-The Dynamo team recommends the `uv` Python package manager, although any way works. Install uv:
+### Option A: Containers (Recommended)

-```
-curl -LsSf https://astral.sh/uv/install.sh | sh
-```
+Containers have all dependencies pre-installed. No setup required.

-### Install Python Development Headers
+```bash
+# SGLang
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1

-Backend engines require Python development headers for JIT compilation. Install them with:
+# TensorRT-LLM
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1

-```bash
-sudo apt install python3-dev
+# vLLM
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
 ```

-## 2. Select an Engine
+> **Tip:** To run the frontend and a worker in the same container, either run the processes in the background with `&` (see below), or open a second terminal and use `docker exec -it <container_id> bash`.

-We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
+See [Release Artifacts](docs/reference/release-artifacts.md#container-images) for available versions.

-```
+### Option B: Install from PyPI
+
+The Dynamo team recommends the `uv` Python package manager, although other installers also work (TensorRT-LLM requires `pip`, as noted below).
+
+```bash
+# Install uv (recommended Python package manager)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Create virtual environment
 uv venv venv
 source venv/bin/activate
 uv pip install pip
+```

-# Choose one
-uv pip install "ai-dynamo[sglang]"  #replace with [vllm], [trtllm], etc.
+Install system dependencies and the Dynamo wheel for your chosen backend:
+
+**SGLang**
+
+```bash
+sudo apt install python3-dev
+uv pip install "ai-dynamo[sglang]"
 ```

-## 3. Run Dynamo
+> **Note:** For CUDA 13 (B300/GB300), the container is recommended. See [SGLang install docs](https://docs.sglang.ai/start/install.html) for details.

-### Sanity Check (Optional)
+**TensorRT-LLM**

-Before trying out Dynamo, you can verify your system configuration and dependencies:
+```bash
+sudo apt install python3-dev
+pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
+pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
+```
+
+> **Note:** TensorRT-LLM requires `pip` due to a transitive Git URL dependency that `uv` doesn't resolve. We recommend using the [TensorRT-LLM container](docs/reference/release-artifacts.md#container-images) for broader compatibility.
+
+**vLLM**

 ```bash
-python3 deploy/sanity_check.py
+sudo apt install python3-dev libxcb1
+uv pip install "ai-dynamo[vllm]"
 ```

-This is a quick check for system resources, development tools, LLM frameworks, and Dynamo components.
+## Run Dynamo

-### Running an LLM API Server
+> **Tip (Optional):** Before running Dynamo, verify your system configuration with `python3 deploy/sanity_check.py`.

 Dynamo provides a simple way to spin up a local set of inference components including:

@@ -142,17 +162,38 @@ Dynamo provides a simple way to spin up a local set of inference components incl
 - **Basic and Kv Aware Router** – Route and load balance traffic to a set of workers.
 - **Workers** – Set of pre-configured LLM serving engines.

+Start the frontend:
+
+> **Tip:** To run everything in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to a command to run it in the background. Example: `python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &`
+
 ```bash
 # Start an OpenAI compatible HTTP server with prompt templating, tokenization, and routing.
 # For local dev: --store-kv file avoids etcd (workers and frontend must share a disk)
 python3 -m dynamo.frontend --http-port 8000 --store-kv file
+```
+
+In another terminal (or the same terminal if using background mode), start a worker for your chosen backend:

-# Start the SGLang engine. You can run several of these for the same or different models.
-# The frontend will discover them automatically.
-python3 -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --store-kv file
+```bash
+# SGLang
+python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
+
+# TensorRT-LLM
+python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file
+
+# vLLM (note: uses --model, not --model-path)
+python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
+  --kv-events-config '{"enable_kv_cache_events": false}'
 ```

-> **Note:** vLLM workers publish KV cache events by default, which requires NATS. For dependency-free local development with vLLM, add `--kv-events-config '{"enable_kv_cache_events": false}'`. This keeps local prefix caching enabled while disabling event publishing. See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.
+> **Note:** For dependency-free local development, disable KV event publishing (avoids NATS):
+> - **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`
+> - **SGLang:** No flag needed (KV events disabled by default)
+> - **TensorRT-LLM:** No flag needed (KV events disabled by default)
+>
+> **TensorRT-LLM only:** The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected and can be safely ignored.
+>
+> See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.

 #### Send a Request

@@ -172,13 +213,6 @@ curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"

 Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.

-### What's Next?
-
-- **Scale up**: Deploy on Kubernetes with [Recipes](recipes/)
-- **Add features**: Enable [KV-aware routing](docs/router/kv_cache_routing.md), [disaggregated serving](docs/design_docs/disagg_serving.md)
-- **Benchmark**: Use [AIPerf](docs/benchmarks/benchmarking.md) to measure performance
-- **Try other engines**: [vLLM](docs/backends/vllm/), [SGLang](docs/backends/sglang/), [TensorRT-LLM](docs/backends/trtllm/)
-
 # Kubernetes Deployment

 For production deployments on Kubernetes clusters with multiple GPUs.
@@ -206,60 +240,6 @@ See [recipes/README.md](recipes/README.md) for the full list and deployment inst
 - [Amazon EKS](examples/deployments/EKS/)
 - [Google GKE](examples/deployments/GKE/)

-# Concepts
-
-## Engines
-
-Dynamo is inference engine agnostic. Install the wheel for your chosen engine and run with `python3 -m dynamo.<engine> --help`.
-
-| Engine | Install | Docs | Best For |
-|--------|---------|------|----------|
-| vLLM | `uv pip install ai-dynamo[vllm]` | [Guide](docs/backends/vllm/) | Broadest feature coverage |
-| SGLang | `uv pip install ai-dynamo[sglang]` | [Guide](docs/backends/sglang/) | High-throughput serving |
-| TensorRT-LLM | `pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo[trtllm]` | [Guide](docs/backends/trtllm/) | Maximum performance |
-
-> **Note:** TensorRT-LLM requires `pip` (not `uv`) due to URL-based dependencies. See the [TRT-LLM guide](docs/backends/trtllm/) for container setup and prerequisites.
-
-Use `CUDA_VISIBLE_DEVICES` to specify which GPUs to use. Engine-specific options (context length, multi-GPU, etc.) are documented in each backend guide.
-
-## Service Discovery and Messaging
-
-Dynamo uses TCP for inter-component communication. External services are optional for most deployments:
-
-| Deployment | etcd | NATS | Notes |
-|------------|------|------|-------|
-| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
-| **Local Development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
-| **KV-Aware Routing** | — | ✅ Required | Prefix caching enabled by default requires NATS |
-
-For local development without external dependencies, pass `--store-kv file` (avoids etcd) to both the frontend and workers. vLLM users should also pass `--kv-events-config '{"enable_kv_cache_events": false}'` to disable KV event publishing (avoids NATS) while keeping local prefix caching enabled; SGLang and TRT-LLM don't require this flag.
-
-For distributed non-Kubernetes deployments or KV-aware routing:
-
-- [etcd](https://etcd.io/) can be run directly as `./etcd`.
-- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.
-
-To quickly setup both: `docker compose -f deploy/docker-compose.yml up -d`
-
-# Advanced Topics
-
-## Benchmarking
-
-Dynamo provides comprehensive benchmarking tools:
-
-- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
-- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
-
-## Frontend OpenAPI Specification
-
-The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate without running the server:
-
-```bash
-cargo run -p dynamo-llm --bin generate-frontend-openapi
-```
-
-This writes to `docs/frontends/openapi.json`.
-
 # Building from Source

 For contributors who want to build Dynamo from source rather than installing from PyPI.
@@ -347,13 +327,64 @@ cd $PROJECT_ROOT
 uv pip install -e .
 ```

-You should now be able to run `python3 -m dynamo.frontend`.
+## 8. Run the Frontend
+
+```bash
+python3 -m dynamo.frontend
+```
+
+## 9. Configure for Local Development

-For local development, pass `--store-kv file` to avoid external dependencies (see Service Discovery and Messaging section).
+- Pass `--store-kv file` to avoid external dependencies (see [Service Discovery and Messaging](#service-discovery-and-messaging))
+- Set `DYN_LOG` to adjust the logging level (e.g., `export DYN_LOG=debug`); it uses the same syntax as `RUST_LOG`.
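
Because `DYN_LOG` follows `RUST_LOG` syntax, it accepts a comma-separated list of directives: a default level plus optional per-target overrides. A sketch; the target name `dynamo_runtime` is an assumption, so substitute the module paths that appear in your own log lines:

```bash
# Default to info everywhere, but emit debug output for the (assumed) dynamo_runtime target
export DYN_LOG=info,dynamo_runtime=debug

# Processes started from this shell inherit the filter
echo "DYN_LOG=${DYN_LOG}"
```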

-Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
+> **Note:** VSCode and Cursor users can use the `.devcontainer` folder for a pre-configured dev environment. See the [devcontainer README](.devcontainer/README.md) for details.

-If you use vscode or cursor, we have a .devcontainer folder built on [Microsofts Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions see the [ReadMe](.devcontainer/README.md) for more details.
+# Advanced Topics
+
+## Benchmarking
+
+Dynamo provides comprehensive benchmarking tools:
+
+- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
+- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
+
+## Frontend OpenAPI Specification
+
+The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate the spec without running the server:
+
+```bash
+cargo run -p dynamo-llm --bin generate-frontend-openapi
+```
+
+This writes to `docs/frontends/openapi.json`.
+
+## Service Discovery and Messaging
+
+Dynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs/kubernetes/service_discovery.md)) handle service discovery. External services are optional for most deployments:
+
+| Deployment | etcd | NATS | Notes |
+|------------|------|------|-------|
+| **Local Development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
+| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
+
+> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.
+
+For Slurm or other distributed deployments (and KV-aware routing):
+
+- [etcd](https://etcd.io/) can be run directly as `./etcd`.
+- [NATS](https://nats.io/) needs JetStream enabled: `nats-server -js`.
+
+To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`
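
The stack started by that compose file is managed with standard Docker Compose subcommands. For example, run from the repository root so the relative `deploy/docker-compose.yml` path resolves:

```bash
# Show the status of the services defined in the compose file (etcd and NATS)
docker compose -f deploy/docker-compose.yml ps

# Stop and remove those containers when you no longer need them
docker compose -f deploy/docker-compose.yml down
```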
+
+See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LLM on Slurm](examples/basics/multinode/trtllm/README.md) for deployment examples.
+
+## More News
+
+- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
+- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
+- [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
+- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)

 <!-- Reference links for Feature Compatibility Matrix -->
 [disagg]: docs/design_docs/disagg_serving.md

--- a/docs/_includes/install.rst
+++ b/docs/_includes/install.rst
-Pip (PyPI)
----------
-
-Install a pre-built wheel from PyPI.
-
-.. code-block:: bash
-
-   # Create a virtual environment and activate it
-   uv venv venv
-   source venv/bin/activate
-
-   # Install Dynamo from PyPI (choose one backend extra)
-   uv pip install "ai-dynamo[sglang]==my-tag"  # or [vllm], [trtllm]
-
-
-Pip from source
---------------
-
-Install directly from a local checkout for development.
-
-.. code-block:: bash
-
-   # Clone the repository
-   git clone https://github.com/ai-dynamo/dynamo.git
-   cd dynamo
-
-   # Create a virtual environment and activate it
-   uv venv venv
-   source venv/bin/activate
-   uv pip install ".[sglang]"  # or [vllm], [trtllm]
-
-
-Docker
------
-
-Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`).
-
-.. code-block:: bash
-
-   # Run a container (mount your workspace if needed)
-   docker run --rm -it \
-     --gpus all \
-     --network host \
-     nvcr.io/nvidia/ai-dynamo/sglang-runtime:my-tag  # or vllm, tensorrtllm
--- a/docs/_includes/quick_start_local.rst
+++ b/docs/_includes/quick_start_local.rst
-Get started with Dynamo locally in just a few commands:
+..
+   SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
+   All rights reserved.
+   SPDX-License-Identifier: Apache-2.0

-**1. Install Dynamo**
+This guide covers running Dynamo **using the CLI on your local machine or VM**.
+
+.. important::
+
+   **Looking to deploy on Kubernetes instead?**
+   See the `Kubernetes Installation Guide <../kubernetes/installation_guide.html>`_
+   and `Kubernetes Quickstart <../kubernetes/README.html>`_ for cluster deployments.
+
+**Install Dynamo**
+
+**Option A: Containers (Recommended)**
+
+Containers have all dependencies pre-installed. No setup required.
+
+.. code-block:: bash
+
+   # SGLang
+   docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1
+
+   # TensorRT-LLM
+   docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1
+
+   # vLLM
+   docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
+
+.. tip::
+
+   To run the frontend and a worker in the same container, either:
+
+   - Run the processes in the background with ``&`` (see the Run Dynamo section below), or
+   - Open a second terminal and use ``docker exec -it <container_id> bash``
+
+See `Release Artifacts <../reference/release-artifacts.html#container-images>`_ for available
+versions, and the backend guides for run instructions: `SGLang <../backends/sglang/README.html>`_ |
+`TensorRT-LLM <../backends/trtllm/README.html>`_ | `vLLM <../backends/vllm/README.html>`_
+
+**Option B: Install from PyPI**

 .. code-block:: bash

   # Install uv (recommended Python package manager)
   curl -LsSf https://astral.sh/uv/install.sh | sh

-   # Create virtual environment and install Dynamo
+   # Create virtual environment
   uv venv venv
   source venv/bin/activate
-   # Use prerelease flag to install RC versions of flashinfer and/or other dependencies
-   uv pip install --prerelease=allow "ai-dynamo[sglang]"  # or [vllm], [trtllm]
+   uv pip install pip

-**2. Start etcd/NATS**
+Install system dependencies and the Dynamo wheel for your chosen backend:
+
+**SGLang**

 .. code-block:: bash

-   # Fetch and start etcd and NATS using Docker Compose
-   VERSION=$(uv pip show ai-dynamo | grep Version | cut -d' ' -f2)
-   curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/refs/tags/v${VERSION}/deploy/docker-compose.yml
-   docker compose -f docker-compose.yml up -d
+   sudo apt install python3-dev
+   uv pip install --prerelease=allow "ai-dynamo[sglang]"
+
+.. note::

-**3. Run Dynamo**
+   For CUDA 13 (B300/GB300), the container is recommended. See
+   `SGLang install docs <https://docs.sglang.ai/start/install.html>`_ for details.
+
+**TensorRT-LLM**
+
+.. code-block:: bash
+
+   sudo apt install python3-dev
+   pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
+   pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
+
+.. note::
+
+   TensorRT-LLM requires ``pip`` due to a transitive Git URL dependency that
+   ``uv`` doesn't resolve. We recommend using the TensorRT-LLM container for
+   broader compatibility. See the `TRT-LLM backend guide <../backends/trtllm/README.html>`_
+   for details.
+
+**vLLM**
+
+.. code-block:: bash
+
+   sudo apt install python3-dev libxcb1
+   uv pip install --prerelease=allow "ai-dynamo[vllm]"
+
+**Run Dynamo**
+
+.. tip::
+
+   **(Optional)** Before running Dynamo, verify your system configuration:
+   ``python3 deploy/sanity_check.py``
+
+Start the frontend, then start a worker for your chosen backend.
+
+.. tip::
+
+   To run everything in a single terminal (useful in containers), append ``> logfile.log 2>&1 &``
+   to a command to run it in the background. Example: ``python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &``

 .. code-block:: bash

   # Start the OpenAI compatible frontend (default port is 8000)
-   python -m dynamo.frontend
+   # --store-kv file avoids needing etcd (frontend and workers must share a disk)
+   python3 -m dynamo.frontend --store-kv file

-   # In another terminal, start an SGLang worker
-   python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
+In another terminal (or the same terminal if using background mode), start a worker:

-**4. Test your deployment**
+**SGLang**
+
+.. code-block:: bash
+
+   python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
+
+**TensorRT-LLM**
+
+.. code-block:: bash
+
+   python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file
+
+**vLLM**
+
+.. code-block:: bash
+
+   python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
+     --kv-events-config '{"enable_kv_cache_events": false}'
+
+.. note::
+
+   For dependency-free local development, disable KV event publishing (avoids NATS):
+
+   - **vLLM:** Add ``--kv-events-config '{"enable_kv_cache_events": false}'``
+   - **SGLang:** No flag needed (KV events disabled by default)
+   - **TensorRT-LLM:** No flag needed (KV events disabled by default)
+
+   **TensorRT-LLM only:** The warning ``Cannot connect to ModelExpress server/transport error. Using direct download.``
+   is expected and can be safely ignored.
+
+**Test Your Deployment**

 .. code-block:: bash

@@ -41,5 +148,3 @@ Get started with Dynamo locally in just a few commands:
     -d '{"model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 50}'
-
-
--- a/docs/_sections/installation.rst
+++ b/docs/_sections/installation.rst
-..
-    Installation Page (left sidebar target)
-..
-
-Installation
-============
-
-.. include:: ../_includes/install.rst
-
-
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -41,7 +41,6 @@ Quickstart
   :caption: Getting Started

   Quickstart <self>
-   Installation <_sections/installation>
   Support Matrix <reference/support-matrix.md>
   Feature Matrix <reference/feature-matrix.md>
   Release Artifacts <reference/release-artifacts.md>

--- a/docs/reference/feature-matrix.md
+++ b/docs/reference/feature-matrix.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 # Dynamo Feature Compatibility Matrices

 This document provides a comprehensive compatibility matrix for key Dynamo features across the supported backends.