Commit 2c3066bd authored by dagil-nvidia, committed by GitHub

docs: full migration of docs/ to fern format in fern/ (#6050)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent d59b9d72
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
This directory contains the documentation source files for NVIDIA Dynamo.
## Prerequisites
- Python 3.11 or later
- [uv](https://docs.astral.sh/uv/) package manager
## Build Instructions
### Option 1: Dedicated Docs Environment (Recommended)
This approach builds the docs without requiring the full project dependencies (including `ai-dynamo-runtime`):
```bash
# One-time setup: Create docs environment and install dependencies
uv venv .venv-docs
uv pip install --python .venv-docs --group docs
# Generate documentation
uv run --python .venv-docs --no-project docs/generate_docs.py
```
The generated HTML will be available in `docs/build/html/`.
### Option 2: Using Full Development Environment
If you already have the full project dependencies installed (i.e., you're actively developing the codebase), you can use `uv run` directly:
```bash
uv run --group docs docs/generate_docs.py
```
This will use your existing project environment and add the docs dependencies.
### Option 3: Using Docker
Build the docs in a Docker container with all dependencies isolated:
```bash
docker build -f container/Dockerfile.docs -t dynamo-docs .
```
The documentation will be built inside the container. To extract the built docs:
```bash
# Run the container and copy the output
docker run --rm -v "$(pwd)/docs/build:/workspace/dynamo/docs/build" dynamo-docs
# Or create a container to copy files from
docker create --name temp-docs dynamo-docs
docker cp temp-docs:/workspace/dynamo/docs/build ./docs/build
docker rm temp-docs
```
This approach is ideal for CI/CD pipelines or when you want complete isolation from your local environment.
## Directory Structure
- `docs/` - Documentation source files (Markdown and reStructuredText)
- `docs/conf.py` - Sphinx configuration
- `docs/_static/` - Static assets (CSS, JS, images)
- `docs/_extensions/` - Custom Sphinx extensions
- `docs/build/` - Generated documentation output (not tracked in git)
## Redirect Creation
When moving or renaming files, a redirect must be created.
Redirect entries should be added to the `redirects` dictionary in `conf.py`. For detailed information on redirect syntax, see the [sphinx-reredirects usage documentation](https://documatt.com/sphinx-reredirects/usage/#introduction).
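As a rough illustration, an entry in `redirects` maps the old document name (without its file extension) to the new location, written relative to the old page. The sketch below is hypothetical: the helper name `redirect_target` and the example paths are assumptions made for illustration, not names from this repository, so check the linked sphinx-reredirects documentation before relying on it.

```python
import posixpath


def redirect_target(old_docname: str, new_docname: str) -> str:
    """Return new_docname as an .html URL relative to the old page's directory.

    Both arguments are Sphinx docnames: paths relative to the docs root,
    without a file extension.
    """
    old_dir = posixpath.dirname(old_docname)
    # Anchor both paths at "/" so the result does not depend on the current directory.
    relative = posixpath.relpath("/" + new_docname, start="/" + old_dir)
    return relative + ".html"


# Hypothetical entry for conf.py: "guides/old-page" was moved to "features/new-page".
redirects = {
    "guides/old-page": redirect_target("guides/old-page", "features/new-page"),
}
```

Computing the target instead of typing it by hand avoids off-by-one mistakes in the number of `../` segments when a page moves between directory levels.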
## Dependency Management
Documentation dependencies are defined in `pyproject.toml` under the `[dependency-groups]` section:
```toml
[dependency-groups]
docs = [
"sphinx>=8.1",
"nvidia-sphinx-theme>=0.0.8",
# ... other doc dependencies
]
```
## Troubleshooting
### Build Warnings
The build process treats warnings as errors. Common issues:
- **Missing toctree entries**: Documents must be referenced in a table of contents
- **Non-consecutive headers**: Don't skip header levels (e.g., H1 → H3)
- **Broken links**: Ensure all internal and external links are valid
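Skipped header levels can be caught before running the full build. The following is a minimal sketch, not part of the repository's tooling: the function name `find_skipped_levels` is hypothetical, and it only understands ATX-style (`#`) Markdown headings outside fenced code blocks.

```python
import re

# An ATX heading: one to six '#' characters followed by whitespace and text.
HEADING = re.compile(r"^(#{1,6})\s+\S")


def find_skipped_levels(markdown: str) -> list:
    """Return (line_number, previous_level, level) for each heading that skips a level."""
    problems = []
    previous = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' comments inside code fences
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if previous and level > previous + 1:
            problems.append((number, previous, level))
        previous = level
    return problems
```

Each reported tuple gives the line of the offending heading together with the levels involved, for example an H3 that directly follows an H1.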
### Missing Dependencies
If you encounter import errors, ensure the docs dependencies are installed:
```bash
uv pip install --python .venv-docs --group docs
```
## Viewing the Documentation
After building, open `docs/build/html/index.html` in your browser, or use Python's built-in HTTP server:
```bash
cd docs/build/html
python -m http.server 8000
# Then visit http://localhost:8000 in your browser
```
...@@ -3,8 +3,6 @@ ...@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Tool Calling with Dynamo
You can connect Dynamo to external tools and services using function calling (also known as tool calling). By providing a list of available functions, Dynamo can choose You can connect Dynamo to external tools and services using function calling (also known as tool calling). By providing a list of available functions, Dynamo can choose
to output function arguments for the relevant function(s) which you can execute to augment the prompt with relevant external information. to output function arguments for the relevant function(s) which you can execute to augment the prompt with relevant external information.
......
...@@ -11,13 +11,14 @@ The relaxed registration comes with some performance overheads, but simplifies t ...@@ -11,13 +11,14 @@ The relaxed registration comes with some performance overheads, but simplifies t
Especially for larger data transfer operations, such as between models in a multi-model graph, the overhead would be marginal. Especially for larger data transfer operations, such as between models in a multi-model graph, the overhead would be marginal.
The `dynamo.nixl_connect` library can be imported by any Dynamo container hosted application. The `dynamo.nixl_connect` library can be imported by any Dynamo container hosted application.
> [!NOTE] > [!Note]
> Dynamo NIXL Connect will pick the best available method of data transfer available to it. > Dynamo NIXL Connect will pick the best available method of data transfer available to it.
> The available methods depend on the hardware and software configuration of the machines and network running the graph. > The available methods depend on the hardware and software configuration of the machines and network running the graph.
> GPU Direct RDMA operations require that both ends of the operation have: > GPU Direct RDMA operations require that both ends of the operation have:
> - NIC and GPU capable of performing RDMA operations > - NIC and GPU capable of performing RDMA operations
> - Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations > - Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations
> - Network that supports InfiniBand or RoCE > - Network that supports InfiniBand or RoCE
>
> With any of the above not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be utilized to ensure basic functionality. > With any of the above not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be utilized to ensure basic functionality.
> For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document. > For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document.
...@@ -85,12 +86,12 @@ flowchart LR ...@@ -85,12 +86,12 @@ flowchart LR
e2@{ animate: true; } e2@{ animate: true; }
``` ```
> [!NOTE] > [!Note]
> When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods. > When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods.
### Multimodal Example ### Multimodal Example
In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md): In the case of the [Dynamo Multimodal Disaggregated Example](../../features/multimodal/multimodal-vllm.md):
1. The HTTP frontend accepts a text prompt and a URL to an image. 1. The HTTP frontend accepts a text prompt and a URL to an image.
...@@ -134,17 +135,17 @@ flowchart LR ...@@ -134,17 +135,17 @@ flowchart LR
o2@{ animate: true; } o2@{ animate: true; }
``` ```
> [!NOTE] > [!Note]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library. > In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
> The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath. > The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath.
#### Code Examples #### Code Examples
See [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) or [MultimodalDecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) from our Multimodal example, See [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) or [MultimodalDecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) from our Multimodal example,
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable-operation.md), for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable-operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data. sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data.
See [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) from our Multimodal example, See [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) from our Multimodal example,
for how the resulting embeddings are registered with the NIXL subsystem by creating a [`Descriptor`](descriptor.md), for how the resulting embeddings are registered with the NIXL subsystem by creating a [`Descriptor`](descriptor.md),
a [`WriteOperation`](write-operation.md) is created using the metadata provided by the requesting worker, a [`WriteOperation`](write-operation.md) is created using the metadata provided by the requesting worker,
and the worker awaits for the data transfer to complete for yielding a response. and the worker awaits for the data transfer to complete for yielding a response.
...@@ -165,5 +166,5 @@ and the worker awaits for the data transfer to complete for yielding a response. ...@@ -165,5 +166,5 @@ and the worker awaits for the data transfer to complete for yielding a response.
- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo) - [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl) - [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal) - [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal.md)
- [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect) - [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
...@@ -15,7 +15,7 @@ The connector provides two methods of moving data between workers: ...@@ -15,7 +15,7 @@ The connector provides two methods of moving data between workers:
- Preparing local memory to be read by a remote worker. - Preparing local memory to be read by a remote worker.
In both cases, local memory is registered with the NIXL-based I/O subsystem via the [`Descriptor`](descriptor.md) class and provided to the connector. In both cases, local memory is registered with the NIXL-based I/O subsystem via the [`Descriptor`](#descriptor) class and provided to the connector.
When RDMA is available, the connector then configures the RDMA subsystem to expose the memory for the requested operation and returns an operation control object; When RDMA is available, the connector then configures the RDMA subsystem to expose the memory for the requested operation and returns an operation control object;
otherwise the connector will select the best available RDMA alternative. otherwise the connector will select the best available RDMA alternative.
The operation control object, either a [`ReadableOperation`](readable-operation.md) or a [`WritableOperation`](writable-operation.md), The operation control object, either a [`ReadableOperation`](readable-operation.md) or a [`WritableOperation`](writable-operation.md),
...@@ -24,7 +24,7 @@ provides NIXL metadata ([RdmaMetadata](rdma-metadata.md)) via its `.metadata()` ...@@ -24,7 +24,7 @@ provides NIXL metadata ([RdmaMetadata](rdma-metadata.md)) via its `.metadata()`
The NIXL metadata must be provided to the remote worker expected to complete the operation. The NIXL metadata must be provided to the remote worker expected to complete the operation.
The metadata contains required information (identifiers, keys, etc.) which enables the remote worker to interact with the provided memory. The metadata contains required information (identifiers, keys, etc.) which enables the remote worker to interact with the provided memory.
> [!WARNING] > [!Warning]
> NIXL metadata contains a worker's address as well as security keys to access specific registered memory descriptors. > NIXL metadata contains a worker's address as well as security keys to access specific registered memory descriptors.
> This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly. > This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
...@@ -37,7 +37,7 @@ The metadata contains required information (identifiers, keys, etc.) which enabl ...@@ -37,7 +37,7 @@ The metadata contains required information (identifiers, keys, etc.) which enabl
self.connector = dynamo.nixl_connect.Connector() self.connector = dynamo.nixl_connect.Connector()
``` ```
> [!TIP] > [!Tip]
> See [`ReadOperation`](read-operation.md#example-usage), [`ReadableOperation`](readable-operation.md#example-usage), > See [`ReadOperation`](read-operation.md#example-usage), [`ReadableOperation`](readable-operation.md#example-usage),
> [`WritableOperation`](writable-operation.md#example-usage), and [`WriteOperation`](write-operation.md#example-usage) > [`WritableOperation`](writable-operation.md#example-usage), and [`WriteOperation`](write-operation.md#example-usage)
> for additional examples. > for additional examples.
......
...@@ -9,13 +9,13 @@ A Pydantic type intended to provide JSON serialized NIXL metadata about a [`Read ...@@ -9,13 +9,13 @@ A Pydantic type intended to provide JSON serialized NIXL metadata about a [`Read
NIXL metadata contains detailed information about a worker process and how to access memory regions registered with the corresponding agent. NIXL metadata contains detailed information about a worker process and how to access memory regions registered with the corresponding agent.
This data is required to perform data transfers using the NIXL-based I/O subsystem. This data is required to perform data transfers using the NIXL-based I/O subsystem.
> [!WARNING] > [!Warning]
> NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions. > NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions.
> This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly. > This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
Use the respective class's `.metadata()` method to generate an `RdmaMetadata` object for an operation. Use the respective class's `.metadata()` method to generate an `RdmaMetadata` object for an operation.
> [!TIP] > [!Tip]
> Classes using `RdmaMetadata` objects must be paired correctly. > Classes using `RdmaMetadata` objects must be paired correctly.
> [`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and > [`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and
> [`WritableOperation`](writable-operation.md) with [`WriteOperation`](write-operation.md). > [`WritableOperation`](writable-operation.md) with [`WriteOperation`](write-operation.md).
......
...@@ -24,8 +24,8 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -24,8 +24,8 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
- [Dynamo SGLang Integration](#dynamo-sglang-integration) - [Dynamo SGLang Integration](#dynamo-sglang-integration)
- [Installation](#installation) - [Installation](#installation)
- [Quick Start](#quick-start) - [Quick Start](#quick-start)
- [Aggregated Serving](#aggregated-serving) - [Single Node Examples](#run-single-node-examples)
- [Disaggregated Serving](#disaggregated-serving) - [Multi-Node and Advanced Examples](#advanced-examples)
- [Deploy on SLURM or Kubernetes](#deployment) - [Deploy on SLURM or Kubernetes](#deployment)
## Feature Support Matrix ## Feature Support Matrix
...@@ -35,11 +35,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -35,11 +35,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | SGLang | Notes | | Feature | SGLang | Notes |
|---------|--------|-------| |---------|--------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | | | [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) | | [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ | | | [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ | | | [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ | |
| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | | | [**Multimodal Support**](../../features/multimodal/multimodal-sglang.md) | ✅ | |
| [**KVBM**](../../kvbm/kvbm-architecture.md) | ❌ | Planned | | [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
## Dynamo SGLang Integration ## Dynamo SGLang Integration
...@@ -55,7 +55,6 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu ...@@ -55,7 +55,6 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu
| Argument | Description | Default | SGLang Equivalent | | Argument | Description | Default | SGLang Equivalent |
|----------|-------------|---------|-------------------| |----------|-------------|---------|-------------------|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A | | `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault-tolerance/request-migration.md). | `0` (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` | | `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` | | `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A | | `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
...@@ -90,23 +89,18 @@ For more details, see the [Request Cancellation Architecture](../../fault-tolera ...@@ -90,23 +89,18 @@ For more details, see the [Request Cancellation Architecture](../../fault-tolera
### Install latest release ### Install latest release
We suggest using uv to install the latest release of ai-dynamo[sglang]. You can install it with `curl -LsSf https://astral.sh/uv/install.sh | sh` We suggest using uv to install the latest release of ai-dynamo[sglang]. You can install it with `curl -LsSf https://astral.sh/uv/install.sh | sh`
<details> <Accordion title="Expand for instructions">
<summary>Expand for instructions</summary>
```bash ```bash
# create a virtual env # create a virtual env
uv venv --python 3.12 --seed uv venv --python 3.12 --seed
# install the latest release (which comes bundled with a stable sglang version) # install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]" uv pip install "ai-dynamo[sglang]"
``` ```
</Accordion>
</details>
### Install editable version for development ### Install editable version for development
<details> <Accordion title="Expand for instructions">
<summary>Expand for instructions</summary>
This requires having rust installed. We also recommend having a proper installation of the cuda toolkit as sglang requires `nvcc` to be available. This requires having rust installed. We also recommend having a proper installation of the cuda toolkit as sglang requires `nvcc` to be available.
```bash ```bash
...@@ -123,14 +117,11 @@ uv pip install -e . ...@@ -123,14 +117,11 @@ uv pip install -e .
# install any sglang version >= 0.5.3.post2 # install any sglang version >= 0.5.3.post2
uv pip install "sglang[all]==0.5.3.post2" uv pip install "sglang[all]==0.5.3.post2"
``` ```
</Accordion>
</details>
### Using docker containers ### Using docker containers
<details> <Accordion title="Expand for instructions">
<summary>Expand for instructions</summary>
We are in the process of shipping pre-built docker containers that contain installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can quickly build the container from source with the following command. We are in the process of shipping pre-built docker containers that contain installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can quickly build the container from source with the following command.
```bash ```bash
...@@ -156,8 +147,7 @@ docker run \ ...@@ -156,8 +147,7 @@ docker run \
--ipc host \ --ipc host \
dynamo-sglang:latest dynamo-sglang:latest
``` ```
</Accordion>
</details>
## Quick Start ## Quick Start
...@@ -178,6 +168,7 @@ docker compose -f deploy/docker-compose.yml up -d ...@@ -178,6 +168,7 @@ docker compose -f deploy/docker-compose.yml up -d
> [!TIP] > [!TIP]
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals. > Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
>
> Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker! > Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
...@@ -204,9 +195,7 @@ cd $DYNAMO_HOME/examples/backends/sglang ...@@ -204,9 +195,7 @@ cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_embed.sh ./launch/agg_embed.sh
``` ```
<details> <Accordion title="Send the following request to verify your deployment:">
<summary>Send the following request to verify your deployment:</summary>
```bash ```bash
curl localhost:8000/v1/embeddings \ curl localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
...@@ -215,8 +204,7 @@ curl localhost:8000/v1/embeddings \ ...@@ -215,8 +204,7 @@ curl localhost:8000/v1/embeddings \
"input": "Hello, world!" "input": "Hello, world!"
}' }'
``` ```
</Accordion>
</details>
### Disaggregated serving ### Disaggregated serving
...@@ -273,4 +261,4 @@ We currently provide deployment examples for Kubernetes and SLURM. ...@@ -273,4 +261,4 @@ We currently provide deployment examples for Kubernetes and SLURM.
- **[Deploying Dynamo with SGLang on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)** - **[Deploying Dynamo with SGLang on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)**
## SLURM ## SLURM
- **[Deploying Dynamo with SGLang on SLURM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/slurm_jobs/README.md)** - **[Deploying Dynamo with SGLang on SLURM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/slurm-jobs/README.md)**
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Running Diffusion LMs with SGLang
Diffusion Language Models (Diffusion LMs) are a class of generative models that use diffusion processes for text generation. This guide shows how to deploy diffusion models like LLaDA2.0 using SGLang as the backend with Dynamo. Unlike autoregressive models, Diffusion LMs iteratively refine the generated text through a diffusion process.
## Launch the Deployment
### Using the Launch Script (Recommended)
The easiest way to start the diffusion LM service is using the provided launch script:
```bash
bash examples/backends/sglang/launch/diffusion_llada.sh
```
### Manual Launch Steps
If you prefer to launch components manually:
**Start frontend**
```bash
python -m dynamo.frontend --http-port 8001 &
```
**Run diffusion worker**
```bash
export CUDA_VISIBLE_DEVICES=0,1
python -m dynamo.sglang \
--model-path inclusionAI/LLaDA2.0-mini-preview \
--tp-size 2 \
--skip-tokenizer-init \
--trust-remote-code \
--endpoint dyn://dynamo.backend.generate \
--enable-metrics \
--disable-cuda-graph \
--disable-overlap-schedule \
--attention-backend triton \
--dllm-algorithm LowConfidence
```
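The `--endpoint` value above follows the `dyn://namespace.component.endpoint` format. As a sanity check before launching, the string can be split into its three parts. The sketch below is purely illustrative: `DynEndpoint` and `parse_dyn_endpoint` are hypothetical names, not part of Dynamo, and Dynamo's own validation rules may differ.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(uri: str) -> DynEndpoint:
    """Split a dyn://namespace.component.endpoint string into its three parts."""
    prefix = "dyn://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a {prefix} URI, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


print(parse_dyn_endpoint("dyn://dynamo.backend.generate"))
```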
## Diffusion Algorithms
The diffusion worker uses the **LowConfidence** algorithm for the iterative refinement process. This algorithm refines tokens with low confidence scores, progressively replacing masked tokens with the model's predictions until confidence thresholds are met.
For more details on diffusion algorithms and configuration options, refer to the [SGLang Diffusion Language Models documentation](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/diffusion_language_models.md).
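To make the idea of confidence-thresholded refinement concrete, here is a toy sketch of the loop described above. It is not SGLang's implementation: the function names, the `threshold` and `max_steps` parameters, the fallback of committing the single most confident position, and the stand-in predictor are all assumptions chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (token, confidence in [0, 1])


def iterative_unmask(
    length: int,
    predict: Callable[[List[Optional[str]], int], Prediction],
    threshold: float = 0.9,
    max_steps: int = 16,
) -> List[Optional[str]]:
    """Fill masked positions step by step, committing only confident predictions."""
    tokens: List[Optional[str]] = [None] * length  # None marks a masked position
    for _ in range(max_steps):
        masked = [i for i, tok in enumerate(tokens) if tok is None]
        if not masked:
            break
        # Score every masked position against the current partial sequence.
        preds = {i: predict(tokens, i) for i in masked}
        accepted = [i for i in masked if preds[i][1] >= threshold]
        if not accepted:
            # Assumed fallback: commit the most confident position so each step progresses.
            accepted = [max(masked, key=lambda i: preds[i][1])]
        for i in accepted:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens: List[Optional[str]], position: int) -> Prediction:
    """Stand-in for a model: confident only when the left neighbour is already known."""
    left_known = position == 0 or tokens[position - 1] is not None
    return (f"tok{position}", 1.0 if left_known else 0.5)


print(iterative_unmask(4, toy_predict))
```

With this toy predictor, one position is committed per step from left to right. A real model scores all masked positions jointly and can commit several at once. Positions that are still masked when `max_steps` runs out remain `None`.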
## Testing the Deployment
Once deployed, you can test the service using curl:
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "inclusionAI/LLaDA2.0-mini-preview",
"messages": [
{
"role": "user",
"content": "Hello! How are you?"
}
],
"temperature": 0.7,
"max_tokens": 512
}'
```
Or use the completions endpoint:
```bash
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "inclusionAI/LLaDA2.0-mini-preview",
"prompt": "Once upon a time",
"max_tokens": 256
}'
```
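The same chat request can be issued from Python using only the standard library. This is a minimal sketch under the assumptions of the curl example above (frontend on port 8001, same model name); the helper names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str = "http://localhost:8001",
    model: str = "inclusionAI/LLaDA2.0-mini-preview",
    prompt: str = "Hello! How are you?",
) -> urllib.request.Request:
    """Build the same POST request as the chat completions curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

With the deployment running, `send(build_chat_request())` returns the decoded response. If the frontend follows the usual OpenAI-compatible response shape, the generated text should be under `choices[0]["message"]["content"]`, but verify this against an actual response.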
...@@ -5,6 +5,9 @@ ...@@ -5,6 +5,9 @@
# Profiling SGLang Workers in Dynamo # Profiling SGLang Workers in Dynamo
> [!NOTE]
> **See also**: [Profiler Component Overview](../../components/profiler/README.md) for SLA-driven profiling and deployment optimization.
Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them. Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters. These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
......
...@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
## Table of Contents ## Table of Contents
- [Feature Support Matrix](#feature-support-matrix) - [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#tensorrt-llm-quick-start) - [Quick Start](#quick-start)
- [Single Node Examples](#single-node-examples) - [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples) - [Advanced Examples](#advanced-examples)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving) - [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
...@@ -31,7 +31,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -31,7 +31,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
- [Benchmarking](#benchmarking) - [Benchmarking](#benchmarking)
- [Multimodal Support](#multimodal-support) - [Multimodal Support](#multimodal-support)
- [Logits Processing](#logits-processing) - [Logits Processing](#logits-processing)
- [DP Rank Routing](#dp-rank-routing-attention-data-parallelism)
- [Performance Sweep](#performance-sweep) - [Performance Sweep](#performance-sweep)
- [Known Issues and Mitigations](#known-issues-and-mitigations)
## Feature Support Matrix ## Feature Support Matrix
...@@ -40,11 +42,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -40,11 +42,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | TensorRT-LLM | Notes | | Feature | TensorRT-LLM | Notes |
|---------|--------------|-------| |---------|--------------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | | | [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | Not supported yet | | [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ | | | [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ | | | [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ | |
| [**Load Based Planner**](../../planner/load-planner.md) | 🚧 | Planned | | [**Load Based Planner**](../../components/planner/README.md) | 🚧 | Planned |
| [**KVBM**](../../kvbm/kvbm-architecture.md) | ✅ | | | [**KVBM**](../../components/kvbm/README.md) | ✅ | |
### Large Scale P/D and WideEP Features ### Large Scale P/D and WideEP Features
...@@ -97,10 +99,10 @@ apt-get update && apt-get -y install git git-lfs ...@@ -97,10 +99,10 @@ apt-get update && apt-get -y install git git-lfs
## Single Node Examples ## Single Node Examples
> [!WARNING] > [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals. > Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv-cache-routing.md). For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
### Aggregated ### Aggregated
```bash ```bash
...@@ -123,7 +125,7 @@ cd $DYNAMO_HOME/examples/backends/trtllm ...@@ -123,7 +125,7 @@ cd $DYNAMO_HOME/examples/backends/trtllm
### Disaggregated with KV Routing ### Disaggregated with KV Routing
> [!WARNING] > [!IMPORTANT]
> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse. > In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
```bash ```bash
...@@ -152,10 +154,10 @@ Below we provide a selected list of advanced examples. Please open up an issue i ...@@ -152,10 +154,10 @@ Below we provide a selected list of advanced examples. Please open up an issue i
### Multinode Deployment ### Multinode Deployment
For comprehensive instructions on multinode serving, see the [multinode-examples.md](multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on the single node. For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
### Speculative Decoding ### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](llama4-plus-eagle.md)** - **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4-plus-eagle.md)**
### Kubernetes Deployment ### Kubernetes Deployment
...@@ -170,26 +172,16 @@ NOTE: To send a request to a multi-node deployment, target the node which is run ...@@ -170,26 +172,16 @@ NOTE: To send a request to a multi-node deployment, target the node which is run
### Benchmarking ### Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/llm/perf.sh) `model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
## KV Cache Transfer in Disaggregated Serving ## KV Cache Transfer in Disaggregated Serving
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](kv-cache-transfer.md). Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
## Request Migration ## Request Migration
You can enable [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker: Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
```bash
# For decode and aggregated workers
python3 -m dynamo.trtllm ... --migration-limit=3
```
> [!WARNING]
> **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.
## Request Cancellation ## Request Cancellation
...@@ -213,11 +205,11 @@ NOTE: To send a request to a multi-node deployment, target the node which is run ...@@ -213,11 +205,11 @@ NOTE: To send a request to a multi-node deployment, target the node which is run
## Benchmarking ## Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/llm/perf.sh) `model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
## Multimodal support ## Multimodal support
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md). Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).
## Logits Processing ## Logits Processing
...@@ -276,12 +268,67 @@ sampling_params.logits_processor = create_trtllm_adapters(processors) ...@@ -276,12 +268,67 @@ sampling_params.logits_processor = create_trtllm_adapters(processors)
- Processors must modify logits in-place and not return a new tensor. - Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init). - If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
## DP Rank Routing (Attention Data Parallelism)
TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
### Dynamo vs TRT-LLM Internal Routing
- **Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
- **TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
### Enabling DP Rank Routing
```bash
# Worker with attention DP
# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
--model-path <MODEL_PATH> \
--tensor-parallel-size 2 \
--enable-attention-dp \
--publish-events-and-metrics
# Frontend with KV routing
python3 -m dynamo.frontend --router-mode kv
```
The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
> [!NOTE]
> Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
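
To make the `(worker_id, dp_rank)` idea concrete, here is a minimal sketch of how such routing targets could be enumerated and how one might be chosen by KV cache overlap. The function names, the worker IDs, and the overlap table are hypothetical and for illustration only; they are not Dynamo's router API, and the actual router's scoring is not shown here.

```python
from itertools import product


def routing_targets(worker_ids, attention_dp_size):
    """Enumerate one routing target per (worker_id, dp_rank) pair."""
    return list(product(worker_ids, range(attention_dp_size)))


def pick_target(targets, overlap_blocks):
    """Pick the target with the most cached-prefix overlap.

    overlap_blocks maps (worker_id, dp_rank) -> number of matching KV blocks.
    Targets missing from the map count as zero overlap; ties go to the
    first target in enumeration order.
    """
    return max(targets, key=lambda t: overlap_blocks.get(t, 0))


# Two hypothetical workers, each started with --tensor-parallel-size 2 and
# --enable-attention-dp, so each exposes 2 attention DP ranks.
targets = routing_targets(worker_ids=[101, 102], attention_dp_size=2)

# Hypothetical overlap scores for one incoming request.
overlap = {(101, 1): 3, (102, 0): 7}
best = pick_target(targets, overlap)

print(targets)
print(best)
```

With two workers and two DP ranks each, the sketch yields four routing targets, which is why per-rank KV events matter: each rank holds its own KV cache, so overlap has to be tracked per rank rather than per worker.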
## Performance Sweep ## Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance. For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance-sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
## Dynamo KV Block Manager Integration ## Dynamo KV Block Manager Integration
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests. Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: [Running KVBM in TensorRT-LLM](../../kvbm/trtllm-setup.md) . Here is the instruction: [Running KVBM in TensorRT-LLM](../../components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
## Known Issues and Mitigations
### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
**Symptoms:**
- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually timeout
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to timeout, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.
**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
```yaml
cache_transceiver_config:
backend: DEFAULT
max_tokens_in_buffer: 65536 # Must exceed max ISL
```
For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)
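
The mitigation above can be checked before deployment. The following is a minimal sketch, assuming the engine configuration has already been loaded into a Python dictionary (for example with a YAML parser); the helper name `check_transceiver_buffer` is hypothetical, and treating a missing `max_tokens_in_buffer` as unsafe is an assumption of this sketch, not documented TensorRT-LLM behavior.

```python
def check_transceiver_buffer(engine_config, max_isl):
    """Return True when max_tokens_in_buffer exceeds the largest expected ISL."""
    transceiver = engine_config.get("cache_transceiver_config", {})
    limit = transceiver.get("max_tokens_in_buffer")
    if limit is None:
        # Assumption of this sketch: no explicit limit is treated as unsafe.
        return False
    return limit > max_isl


# Mirrors the YAML fragment shown in the mitigation.
prefill_cfg = {
    "cache_transceiver_config": {
        "backend": "DEFAULT",
        "max_tokens_in_buffer": 65536,
    }
}

ok_for_32k = check_transceiver_buffer(prefill_cfg, max_isl=32000)
ok_for_64k = check_transceiver_buffer(prefill_cfg, max_isl=65536)
print(ok_for_32k, ok_for_64k)
```

The comparison is strict because the mitigation says the buffer must exceed the maximum ISL, so a buffer exactly equal to the maximum ISL fails the check. Run it against both the prefill and decode configurations.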
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU. This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers. VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
> [!NOTE] > [!Note]
> - Ensure that required services such as `nats` and `etcd` are running before starting. > - Ensure that required services such as `nats` and `etcd` are running before starting.
> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication. > - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA. > - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
......
...@@ -16,43 +16,11 @@ By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX ...@@ -16,43 +16,11 @@ By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX
### Specify Backends for NIXL ### Specify Backends for NIXL
NIXL supports multiple communication backends that can be configured via environment variables. By default, UCX is used if no backends are explicitly specified. TensorRT-LLM supports two NIXL communication backends: UCX and LIBFABRIC. By default, UCX is used if no backend is explicitly specified. Dynamo currently only supports the UCX backend, as LIBFABRIC support is still a work in progress. Please do not change the NIXL backend in the Dynamo runtime image.
**Environment Variable Format:**
```bash
DYN_KVBM_NIXL_BACKEND_<BACKEND>=<value>
```
**Supported Backends:**
- `UCX` - Unified Communication X (default)
- `GDS` - GPU Direct Storage
**Examples:**
```bash
# Enable UCX backend (default behavior)
export DYN_KVBM_NIXL_BACKEND_UCX=true
# Enable GDS backend
export DYN_KVBM_NIXL_BACKEND_GDS=true
# Enable multiple backends
export DYN_KVBM_NIXL_BACKEND_UCX=true
export DYN_KVBM_NIXL_BACKEND_GDS=true
# Explicitly disable a backend
export DYN_KVBM_NIXL_BACKEND_GDS=false
```
**Valid Values:**
- `true`, `1`, `on`, `yes` - Enable the backend
- `false`, `0`, `off`, `no` - Disable the backend
> [!NOTE]
> If no `DYN_KVBM_NIXL_BACKEND_*` environment variables are set, UCX is used as the default backend.
## Alternative Method: UCX ## Alternative Method: UCX
TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file. TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
> [!NOTE] > [!Note]
> The environment variable `TRTLLM_USE_UCX_KV_CACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration. > The environment variable `TRTLLM_USE_UCX_KVCACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.
...@@ -5,7 +5,7 @@ ...@@ -5,7 +5,7 @@
# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM # Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](multinode/multinode-examples.md) to set up the environment for the following scenarios: This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios:
- **Aggregated Serving:** - **Aggregated Serving:**
Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving. Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
...@@ -34,7 +34,7 @@ export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" ...@@ -34,7 +34,7 @@ export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
``` ```
See [this](multinode/multinode-examples.md#setup) section from multinode guide to learn more about the above options. See [this](./multinode/multinode-examples.md#setup) section from multinode guide to learn more about the above options.
## Aggregated Serving ## Aggregated Serving
...@@ -56,7 +56,7 @@ export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4 ...@@ -56,7 +56,7 @@ export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4
## Example Request ## Example Request
See [here](multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment. See [here](./multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment.
``` ```
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
......
...@@ -38,7 +38,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl ...@@ -38,7 +38,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
If your cluster supports similar container based plugins, you may be able to If your cluster supports similar container based plugins, you may be able to
modify the script to use that instead. modify the script to use that instead.
3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as 3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
described [here](https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container). described [here](../README.md#build-container).
This is the image that can be set to the `IMAGE` environment variable in later steps. This is the image that can be set to the `IMAGE` environment variable in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 8 nodes below as a reference command to have enough capacity will allocate 8 nodes below as a reference command to have enough capacity
...@@ -77,7 +77,7 @@ following environment variables based: ...@@ -77,7 +77,7 @@ following environment variables based:
```bash ```bash
# NOTE: IMAGE must be set manually for now # NOTE: IMAGE must be set manually for now
# To build an image, see the steps here: # To build an image, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container # ../README.md#build-container
export IMAGE="<dynamo_trtllm_image>" export IMAGE="<dynamo_trtllm_image>"
# MOUNTS are the host:container path pairs that are mounted into the containers # MOUNTS are the host:container path pairs that are mounted into the containers
...@@ -149,7 +149,7 @@ Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode) ...@@ -149,7 +149,7 @@ Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
following the setup above, follow these steps below to launch a **disaggregated** following the setup above, follow these steps below to launch a **disaggregated**
deployment across 8 nodes: deployment across 8 nodes:
> [!TIP] > [!Tip]
> Make sure you have a fresh environment and don't still have the aggregated > Make sure you have a fresh environment and don't still have the aggregated
> example above deployed on the same set of nodes. > example above deployed on the same set of nodes.
...@@ -176,7 +176,7 @@ deployment across 8 nodes: ...@@ -176,7 +176,7 @@ deployment across 8 nodes:
./srun_disaggregated.sh ./srun_disaggregated.sh
``` ```
> [!TIP] > [!Tip]
> To launch multiple replicas of the configured prefill/decode workers, you can set > To launch multiple replicas of the configured prefill/decode workers, you can set
> NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1). > NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
......
...@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
## Table of Contents ## Table of Contents
- [Feature Support Matrix](#feature-support-matrix) - [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#vllm-quick-start) - [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples) - [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples) - [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment) - [Deploy on Kubernetes](#kubernetes-deployment)
...@@ -36,13 +36,13 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -36,13 +36,13 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | vLLM | Notes | | Feature | vLLM | Notes |
|---------|------|-------| |---------|------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | | | [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | WIP | | [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | WIP |
| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ | | | [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ | | | [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ | |
| [**Load Based Planner**](../../planner/load-planner.md) | 🚧 | WIP | | [**Load Based Planner**](../../components/planner/README.md) | 🚧 | WIP |
| [**KVBM**](../../kvbm/kvbm-architecture.md) | ✅ | | | [**KVBM**](../../components/kvbm/README.md) | ✅ | |
| [**LMCache**](LMCache-Integration.md) | ✅ | | | [**LMCache**](../../integrations/lmcache-integration.md) | ✅ | |
| [**Prompt Embeddings**](prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag | | [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
### Large Scale P/D and WideEP Features ### Large Scale P/D and WideEP Features
...@@ -87,7 +87,7 @@ This includes the specific commit [vllm-project/vllm#19790](https://github.com/v ...@@ -87,7 +87,7 @@ This includes the specific commit [vllm-project/vllm#19790](https://github.com/v
## Run Single Node Examples ## Run Single Node Examples
> [!WARNING] > [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility. > Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
### Aggregated Serving ### Aggregated Serving
...@@ -144,7 +144,9 @@ Below we provide a selected list of advanced deployments. Please open up an issu ...@@ -144,7 +144,9 @@ Below we provide a selected list of advanced deployments. Please open up an issu
Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node. Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy. This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.
**Guide:** [Speculative Decoding Quickstart](speculative-decoding.md) **Guide:** [Speculative Decoding Quickstart](../../features/speculative-decoding/speculative-decoding-vllm.md)
> **See also:** [Speculative Decoding Feature Overview](../../features/speculative-decoding/README.md) for cross-backend documentation.
### Kubernetes Deployment ### Kubernetes Deployment
...@@ -177,17 +179,11 @@ When using KV-aware routing, ensure deterministic hashing across processes to av ...@@ -177,17 +179,11 @@ When using KV-aware routing, ensure deterministic hashing across processes to av
```bash ```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256 vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
``` ```
See the high-level notes in [KV Cache Routing](../../router/kv-cache-routing.md) on deterministic event IDs. See the high-level notes in [Router Design](../../design-docs/router-design.md#deterministic-event-ids) on deterministic event IDs.
## Request Migration ## Request Migration
You can enable [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker: Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
```bash
python3 -m dynamo.vllm ... --migration-limit=3
```
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.
## Request Cancellation ## Request Cancellation
......