"docs/pages/vscode:/vscode.git/clone" did not exist on "cb39311b74bb5e5c6db54df482aa273f3c4be5eb"
Unverified commit 7ca6a562, authored by Jonathan Tong, committed by GitHub

docs: update Fern docs for main branch (#5706)


Signed-off-by: Jont828 <jt572@cornell.edu>
parent 704c1dad
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Distributed Runtime"
--- ---
# Dynamo Distributed Runtime
## Overview ## Overview
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., Python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure: Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., Python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Architecture Flow"
--- ---
# Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm). Color-coded flows indicate different types of operations: This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm). Color-coded flows indicate different types of operations:
## 🔵 Main Request Flow (Blue) ## 🔵 Main Request Flow (Blue)
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Event Plane Architecture"
--- ---
# Event Plane Architecture
This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS. This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.
## Overview ## Overview
...@@ -15,9 +16,8 @@ Dynamo's coordination layer adapts to the deployment environment: ...@@ -15,9 +16,8 @@ Dynamo's coordination layer adapts to the deployment environment:
| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP | | **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP | | **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
<Note> > [!NOTE]
The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically. > The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
</Note>
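The default described in the note can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the runtime's actual code: the helper name `resolve_discovery_backend` is hypothetical, and only the variable name `DYN_DISCOVERY_BACKEND` and the values `kv_store` and `kubernetes` come from the text above.

```python
import os

# Documented default: the runtime falls back to the etcd-backed kv_store.
DEFAULT_BACKEND = "kv_store"


def resolve_discovery_backend(env=None):
    """Return the discovery backend name for the given environment mapping.

    Hypothetical helper for illustration; it mirrors the documented rule that
    kv_store is used unless DYN_DISCOVERY_BACKEND is set explicitly.
    """
    env = os.environ if env is None else env
    return env.get("DYN_DISCOVERY_BACKEND", DEFAULT_BACKEND)


if __name__ == "__main__":
    # Without the variable set, this prints the default backend name.
    print(resolve_discovery_backend())
```

Passing a plain dictionary instead of `os.environ` makes the rule easy to check: an empty mapping yields `kv_store`, while a mapping that sets `DYN_DISCOVERY_BACKEND=kubernetes` yields `kubernetes`.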
``` ```
┌─────────────────────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────────────────────┐
...@@ -51,9 +51,8 @@ The operator explicitly sets: ...@@ -51,9 +51,8 @@ The operator explicitly sets:
DYN_DISCOVERY_BACKEND=kubernetes DYN_DISCOVERY_BACKEND=kubernetes
``` ```
<Warning> > [!WARNING]
This must be explicitly configured. The runtime defaults to `kv_store` in all environments. > This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
</Warning>
### How It Works ### How It Works
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Writing Python Workers in Dynamo"
--- ---
# Writing Python Workers in Dynamo
This guide explains how to create your own Python worker in Dynamo. This guide explains how to create your own Python worker in Dynamo.
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
...@@ -115,9 +116,8 @@ A Python worker may need to be shut down promptly, for example when the node run ...@@ -115,9 +116,8 @@ A Python worker may need to be shut down promptly, for example when the node run
In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault-tolerance/request-migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed. In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault-tolerance/request-migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
<Warning> > [!WARNING]
We will replace the `GeneratorExit` exception with a new Dynamo exception. Expect a minor breaking code change in the near future. > We will replace the `GeneratorExit` exception with a new Dynamo exception. Expect a minor breaking code change in the near future.
</Warning>
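As a rough, framework-free sketch of the mechanism described above: an async generator that raises `GeneratorExit` mid-stream ends the stream early, and the consumer can tell the response is incomplete. The class `SketchHandler`, its `shutting_down` flag, the request shape, and the `collect` helper are all assumptions made for illustration. They are not the Dynamo `RequestHandler` API.

```python
import asyncio


class SketchHandler:
    """Hypothetical stand-in for a worker handler (not the Dynamo API)."""

    def __init__(self):
        # Assumed flag that a shutdown signal handler would flip.
        self.shutting_down = False

    async def generate(self, request):
        for token in request["tokens"]:
            if self.shutting_down:
                # Close the stream immediately to mark the response incomplete.
                raise GeneratorExit("worker shutting down")
            yield {"token": token}


async def collect(handler, request, stop_after):
    """Consume the stream, flipping the shutdown flag after `stop_after` chunks.

    Returns (tokens_received, completed).
    """
    received = []
    try:
        async for chunk in handler.generate(request):
            received.append(chunk["token"])
            if len(received) == stop_after:
                handler.shutting_down = True
    except GeneratorExit:
        # The stream ended early; a frontend could migrate the request here.
        return received, False
    return received, True


if __name__ == "__main__":
    print(asyncio.run(collect(SketchHandler(), {"tokens": ["a", "b", "c"]}, 2)))
```

In this sketch the consumer keeps the tokens it already received, which is the information a frontend would need in order to resume the partially completed request on another worker.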
Here's an example of how to implement this in your `RequestHandler`: Here's an example of how to implement this in your `RequestHandler`:
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Runtime"
--- ---
# Dynamo Runtime
<h4>A Datacenter Scale Distributed Inference Serving Framework</h4> <h4>A Datacenter Scale Distributed Inference Serving Framework</h4>
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Fault Tolerance"
--- ---
# Fault Tolerance
Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability. Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability.
## Overview ## Overview
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Graceful Shutdown"
--- ---
# Graceful Shutdown
This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up. This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
## Overview ## Overview
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Request Cancellation Architecture"
--- ---
# Request Cancellation Architecture
This document describes how Dynamo implements request cancellation to cancel in-flight requests between Dynamo workers. Request cancellation allows in-flight requests to terminate early, saving computational resources that would otherwise be spent on responses that are no longer needed. This document describes how Dynamo implements request cancellation to cancel in-flight requests between Dynamo workers. Request cancellation allows in-flight requests to terminate early, saving computational resources that would otherwise be spent on responses that are no longer needed.
## AsyncEngineContext Trait ## AsyncEngineContext Trait
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Request Migration Architecture"
--- ---
# Request Migration Architecture
This document describes how Dynamo implements request migration to handle worker failures gracefully during LLM text generation. Request migration allows in-progress requests to continue on different workers when the original worker becomes unavailable, providing fault tolerance and improved user experience. This document describes how Dynamo implements request migration to handle worker failures gracefully during LLM text generation. Request migration allows in-progress requests to continue on different workers when the original worker becomes unavailable, providing fault tolerance and improved user experience.
## Overview ## Overview
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Request Rejection (Load Shedding)"
--- ---
# Request Rejection (Load Shedding)
This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions. This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
## Overview ## Overview
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Fault Tolerance Testing"
--- ---
# Fault Tolerance Testing
This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios. This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios.
## Overview ## Overview
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "KServe gRPC frontend"
--- ---
# KServe gRPC frontend
## Motivation ## Motivation
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that comply with the KServe v2 API, and it has gained wide adoption. To let Triton users quickly explore Dynamo's benefits, Dynamo provides a KServe gRPC frontend. [KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that comply with the KServe v2 API, and it has gained wide adoption. To let Triton users quickly explore Dynamo's benefits, Dynamo provides a KServe gRPC frontend.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Examples"
--- ---
Explore practical examples to get started with NVIDIA Dynamo. # Dynamo Examples
## Quick Start Examples This directory contains practical examples demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases.
The [examples directory](https://github.com/ai-dynamo/dynamo/tree/main/examples) in the Dynamo repository contains ready-to-run examples for various use cases. > **Want to see a specific example?**
> Open a [GitHub issue](https://github.com/ai-dynamo/dynamo/issues) to request an example you'd like to see, or [open a pull request](https://github.com/ai-dynamo/dynamo/pulls) if you'd like to contribute your own!
### Backend Examples ## Basics & Tutorials
| Backend | Description | Link | Learn fundamental Dynamo concepts through these introductory examples:
|---------|-------------|------|
| **vLLM** | Run inference with vLLM backend | [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm) |
| **SGLang** | Run inference with SGLang backend | [examples/backends/sglang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang) |
| **TensorRT-LLM** | Run inference with TensorRT-LLM backend | [examples/backends/trtllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm) |
### Deployment Examples - **[Quickstart](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/quickstart/README.md)** - Simple aggregated serving example with vLLM backend
- **[Disaggregated Serving](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/disaggregated_serving/README.md)** - Prefill/decode separation for enhanced performance and scalability
- **[Multi-node](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/multinode/README.md)** - Distributed inference across multiple nodes and GPUs
| Example | Description | Link | ## Framework Support
|---------|-------------|------|
| **Basic Deployment** | Simple single-node deployment | [examples/basics](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics) |
| **Kubernetes** | Deploy on Kubernetes | [examples/deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/deployments) |
| **Multimodal** | Vision and multimodal models | [examples/multimodal](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal) |
### Custom Backend Examples These examples show how Dynamo broadly works using major inference engines.
Learn how to create custom backends: If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Examples Backends](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/) directory:
- **[vLLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/trtllm/)** – TensorRT-LLM workflows and optimizations
| Example | Description | Link | ## Deployment Examples
|---------|-------------|------|
| **Custom Backend** | Build your own backend | [examples/custom_backend](https://github.com/ai-dynamo/dynamo/tree/main/examples/custom_backend) |
## Running Examples Platform-specific deployment guides for production environments:
Most examples can be run directly after installing Dynamo: - **[Amazon EKS](https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/EKS/)** - Deploy Dynamo on Amazon Elastic Kubernetes Service
- **[Azure AKS](https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/AKS/)** - Deploy Dynamo on Azure Kubernetes Service
- **[Amazon ECS](https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/ECS/)** - Deploy Dynamo on Amazon Elastic Container Service
- **[Router Standalone](https://github.com/ai-dynamo/dynamo/blob/main/examples/deployments/router_standalone/)** - Standalone router deployment patterns
- **Google GKE** - _Coming soon_
```bash ## Runtime Examples
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Navigate to an example Low-level runtime examples for developers using Python/Rust bindings:
cd examples/backends/sglang
# Follow the README in each example directory - **[Hello World](https://github.com/ai-dynamo/dynamo/blob/main/examples/custom_backend/hello_world/README.md)** - Minimal Dynamo runtime service demonstrating basic concepts
```
## Getting Started
1. **Choose your deployment pattern**: Start with the [Quickstart](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/quickstart/README.md) for a simple local deployment, or explore [Disaggregated Serving](https://github.com/ai-dynamo/dynamo/blob/main/examples/basics/disaggregated_serving/README.md) for advanced architectures.
2. **Set up prerequisites**: Most examples require etcd and NATS services. You can start them using:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
3. **Follow the example**: Each directory contains detailed setup instructions and configuration files specific to that deployment pattern.
## Prerequisites
Before running any examples, ensure you have:
- **Docker & Docker Compose** - For containerized services
- **CUDA-compatible GPU** - For LLM inference (except hello_world, which is non-GPU aware)
- **Python 3.9+** - For client scripts and utilities
### For Kubernetes Deployments
If you're running Kubernetes/cloud deployment examples (EKS, AKS, GKE), you'll also need:
| Tool | Minimum Version | Installation |
|------|-----------------|--------------|
| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
See the [Kubernetes Installation Guide](../kubernetes/installation-guide.md#prerequisites) for detailed setup instructions and pre-deployment checks.
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Installation"
--- ---
# Installation
## Pip (PyPI) ## Pip (PyPI)
Install a pre-built wheel from PyPI. Install a pre-built wheel from PyPI.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Welcome to NVIDIA Dynamo"
--- ---
The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale. # Welcome to NVIDIA Dynamo
<Tip> The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale.
**Discover the Latest Developments!**
This guide is a snapshot of a specific point in time. For the latest information, examples, and Release Assets, see the [Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo/releases/latest). > [!TIP]
</Tip> > **Discover the Latest Developments!**
>
> This guide is a snapshot of a specific point in time. For the latest information, examples, and Release Assets, see the [Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo/releases/latest).
## Quickstart ## Quickstart
...@@ -94,4 +94,4 @@ curl localhost:8000/v1/chat/completions \ ...@@ -94,4 +94,4 @@ curl localhost:8000/v1/chat/completions \
- **GitHub Issues**: [Report bugs or request features](https://github.com/ai-dynamo/dynamo/issues) - **GitHub Issues**: [Report bugs or request features](https://github.com/ai-dynamo/dynamo/issues)
- **Discussions**: [Ask questions and share ideas](https://github.com/ai-dynamo/dynamo/discussions) - **Discussions**: [Ask questions and share ideas](https://github.com/ai-dynamo/dynamo/discussions)
- **Reference**: [CLI Reference](../reference/cli.md) | [Glossary](../reference/glossary.md) | [Support Matrix](./support-matrix.md) - **Reference**: [CLI Reference](../reference/cli.md) | [Glossary](../reference/glossary.md) | [Support Matrix](../reference/support-matrix.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Support Matrix"
---
This document provides the support matrix for Dynamo, including hardware, software and build instructions.
## Hardware Compatibility
| **CPU Architecture** | **Status** |
| :------------------- | :----------- |
| **x86_64** | Supported |
| **ARM64** | Supported |
### GPU Compatibility
If you are using a **GPU**, the following GPU models and architectures are supported:
| **GPU Architecture** | **Status** |
| :----------------------------------- | :--------- |
| **NVIDIA Blackwell Architecture** | Supported |
| **NVIDIA Hopper Architecture** | Supported |
| **NVIDIA Ada Lovelace Architecture** | Supported |
| **NVIDIA Ampere Architecture** | Supported |
## Platform Architecture Compatibility
**Dynamo** is compatible with the following platforms:
| **Operating System** | **Version** | **Architecture** | **Status** |
| :------------------- | :---------- | :--------------- | :----------- |
| **Ubuntu** | 22.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | ARM64 | Supported |
| **CentOS Stream** | 9 | x86_64 | Experimental |
<Note>
Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04).
Compatibility with other Linux distributions is expected but has not been officially verified yet.
</Note>
<Error>
KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
</Error>
## Software Compatibility
### Runtime Dependency
| **Python Package** | **Version** | glibc version | CUDA Version |
| :----------------- | :---------- | :------------------------------------ | :----------- |
| ai-dynamo | 0.8.0 | >=2.28 | |
| ai-dynamo-runtime | 0.8.0 | >=2.28 (Python 3.12 has known issues) | |
| NIXL | 0.8.0 | >=2.27 | >=11.8 |
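A quick way to compare a host against the glibc column above is sketched below. It uses the standard-library call `platform.libc_ver()`. The helper names `parse_version` and `glibc_ok` are hypothetical, and the `(2, 28)` threshold is taken from the `ai-dynamo` row of the table. Treat this as a convenience check under those assumptions, not an official compatibility test.

```python
import platform

# Minimum glibc for the ai-dynamo wheels, per the runtime dependency table.
MIN_GLIBC = (2, 28)


def parse_version(text):
    """Turn a dotted version such as '2.35' into (2, 35); skip non-numeric parts."""
    parts = []
    for piece in text.split("."):
        if piece.isdigit():
            parts.append(int(piece))
    return tuple(parts)


def glibc_ok(lib, version, minimum=MIN_GLIBC):
    """Return True when the C library is glibc and its version is at least `minimum`."""
    return lib == "glibc" and parse_version(version) >= minimum


if __name__ == "__main__":
    # libc_ver() returns empty strings when the C library cannot be determined.
    lib, version = platform.libc_ver()
    status = "ok" if glibc_ok(lib, version) else "not verified"
    print(lib or "unknown libc", version or "?", status)
```

On systems where the C library cannot be detected, or is not glibc, the check reports "not verified" rather than guessing.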
### Build Dependency
The following table shows the dependency versions included with each Dynamo release:
| **Dependency** | **main (ToT)** | **v0.8.0 (unreleased)** | **v0.7.1** | **v0.7.0.post1** | **v0.7.0** |
| :------------- | :------------- | :---------------------- | :--------- | :--------------- | :--------- |
| SGLang | 0.5.8 | 0.5.7 | 0.5.3.post4| 0.5.3.post4 | 0.5.3.post4|
| TensorRT-LLM | 1.2.0rc6 | 1.2.0rc6 | 1.2.0rc3 | 1.2.0rc3 | 1.2.0rc2 |
| vLLM | 0.13.0 | 0.12.0 | 0.11.0 | 0.11.0 | 0.11.0 |
| NIXL | 0.8.0 | 0.8.0 | 0.8.0 | 0.8.0 | 0.8.0 |
<Note>
**main (ToT)** reflects the current development branch. **v0.8.0** is the upcoming release (planned for January 14, 2025) and not yet available.
</Note>
<Warning>
Specific versions of TensorRT-LLM supported by Dynamo are subject to change. TensorRT-LLM currently does not support Python 3.11, so installing `ai-dynamo[trtllm]` on Python 3.11 will fail.
</Warning>
### CUDA Support by Framework
| **Dynamo Version** | **SGLang** | **TensorRT-LLM** | **vLLM** |
| :------------------- | :-----------------------| :-----------------------| :-----------------------|
| **Dynamo 0.7.1** | CUDA 12.8 | CUDA 13.0 | CUDA 12.9 |
## Cloud Service Provider Compatibility
### AWS
| **Host Operating System** | **Version** | **Architecture** | **Status** |
| :------------------------ | :---------- | :--------------- | :--------- |
| **Amazon Linux** | 2023 | x86_64 | Supported¹ |
<Error>
There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
</Error>
## Build Support
**Dynamo** currently provides build support in the following ways:
- **Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
- [ai-dynamo](https://pypi.org/project/ai-dynamo/)
- [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/)
- **New as of Dynamo v0.7.0:** [kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
- **Dynamo Runtime Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Runtime for each of the LLM inference frameworks on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
- [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime)
- [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime)
- [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)
- **Dynamo Kubernetes Operator Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Operator on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
- [kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo:
- [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds)
- [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform)
- [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph)
- **Rust Crates**:
- [dynamo-runtime](https://crates.io/crates/dynamo-runtime/)
- [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai/)
- [dynamo-parsers](https://crates.io/crates/dynamo-parsers/)
- [dynamo-llm](https://crates.io/crates/dynamo-llm/)
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "JailedStream Implementation"
--- ---
# JailedStream Implementation
## Overview ## Overview
The `JailedStream` is a standalone implementation for handling "jail" detection in token streams. It provides a clean, builder-based API for accumulating tokens when certain sequences are detected, then releasing them as a single chunk when the jail ends. The `JailedStream` is a standalone implementation for handling "jail" detection in token streams. It provides a clean, builder-based API for accumulating tokens when certain sequences are detected, then releasing them as a single chunk when the jail ends.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Dynamo Request Planes User Guide"
--- ---
# Dynamo Request Planes User Guide
## Overview ## Overview
Dynamo supports multiple transport mechanisms for its request plane (the communication layer between services). You can choose from three different request plane modes based on your deployment requirements: Dynamo supports multiple transport mechanisms for its request plane (the communication layer between services). You can choose from three different request plane modes based on your deployment requirements:
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "Deploying Dynamo on Kubernetes"
--- ---
# Deploying Dynamo on Kubernetes
[Link to installation](../getting-started/installation.md) [Link to installation](../getting-started/installation.md)
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides. High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
......
--- ---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: "API Reference"
--- ---
<Info> # API Reference
This documentation is automatically generated from source code.
Do not edit this file directly. > [!IMPORTANT]
</Info> > This documentation is automatically generated from source code.
> Do not edit this file directly.
## Packages
- [nvidia.com/v1alpha1](#nvidiacomv1alpha1)
@@ -423,6 +423,7 @@ _Appears in:_
| `services` _object (keys:string, values:[DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec))_ | Services are the services to deploy as part of this deployment. | | MaxProperties: 25 <br />Optional: \{\} <br /> |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs are environment variables applied to all services in the deployment unless<br />overridden by service-specific configuration. | | Optional: \{\} <br /> |
| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm"). | | Enum: [sglang vllm trtllm] <br /> |
| `restart` _[Restart](#restart)_ | Restart specifies the restart policy for the graph deployment. | | Optional: \{\} <br /> |
#### DynamoGraphDeploymentStatus
@@ -441,6 +442,7 @@ _Appears in:_
| `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.<br />The slice is merged by type on patch updates. | | |
| `services` _object (keys:string, values:[ServiceReplicaStatus](#servicereplicastatus))_ | Services contains per-service replica status information.<br />The map key is the service name from spec.services. | | |
| `restart` _[RestartStatus](#restartstatus)_ | Restart contains the status of the restart of the graph deployment. | | |
#### DynamoModel
@@ -808,6 +810,88 @@ _Appears in:_
| `useAsCompilationCache` _boolean_ | UseAsCompilationCache indicates this volume should be used as a compilation cache.<br />When true, backend-specific environment variables will be set and default mount points may be used. | false | |
#### Restart
_Appears in:_
- [DynamoGraphDeploymentSpec](#dynamographdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `id` _string_ | ID is an arbitrary string that triggers a restart when changed.<br />Any modification to this value will initiate a restart of the graph deployment according to the strategy. | | MinLength: 1 <br />Required: \{\} <br /> |
| `strategy` _[RestartStrategy](#restartstrategy)_ | Strategy specifies the restart strategy for the graph deployment. | | Optional: \{\} <br /> |
#### RestartPhase
_Underlying type:_ _string_
_Appears in:_
- [RestartStatus](#restartstatus)
| Field | Description |
| --- | --- |
| `Pending` | |
| `Restarting` | |
| `Completed` | |
| `Failed` | |
#### RestartStatus
RestartStatus contains the status of the restart of the graph deployment.
_Appears in:_
- [DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `observedID` _string_ | ObservedID is the restart ID that has been observed and is being processed.<br />Matches the Restart.ID field in the spec. | | |
| `phase` _[RestartPhase](#restartphase)_ | Phase is the phase of the restart. | | |
| `inProgress` _string array_ | InProgress contains the names of the services that are currently being restarted. | | |
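As an illustration of how a client might read these fields, the sketch below compares the restart ID it requested against `observedID` and then inspects `phase`. The helper names are hypothetical, and treating `Completed` and `Failed` as the terminal phases is an assumption based only on the phase names listed above, not on documented operator behavior.

```python
from typing import Any, Mapping, Optional

# Assumption: these two phases mean the restart is no longer progressing.
TERMINAL_PHASES = {"Completed", "Failed"}


def restart_phase(requested_id: str,
                  restart_status: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return the phase reported for requested_id, or None if the status
    does not (yet) refer to that restart ID."""
    if not restart_status or restart_status.get("observedID") != requested_id:
        return None
    return restart_status.get("phase")


def restart_is_terminal(requested_id: str,
                        restart_status: Optional[Mapping[str, Any]]) -> bool:
    """True when the status refers to requested_id and its phase is terminal."""
    return restart_phase(requested_id, restart_status) in TERMINAL_PHASES


# Example status payload; the service name is a placeholder.
status = {"observedID": "r2", "phase": "Restarting", "inProgress": ["Frontend"]}
print(restart_phase("r2", status), restart_is_terminal("r2", status))
```

Checking `observedID` first avoids misreading a stale `phase` left over from an earlier restart request.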
#### RestartStrategy
_Appears in:_
- [Restart](#restart)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `type` _[RestartStrategyType](#restartstrategytype)_ | Type specifies the restart strategy type. | Sequential | Enum: [Sequential Parallel] <br /> |
| `order` _string array_ | Order specifies the order in which the services should be restarted. | | Optional: \{\} <br /> |
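To make the `Restart` and `RestartStrategy` fields concrete, here is a minimal sketch that assembles a `restart` fragment for a `DynamoGraphDeploymentSpec` and applies the validations listed in the tables (`id` MinLength 1, `type` limited to `Sequential` or `Parallel`, defaulting to `Sequential`). The function name `build_restart` and the service names in the usage line are hypothetical; the operator performs its own validation, so this only mirrors the documented constraints client-side.

```python
from typing import Any, Dict, Optional, Sequence

# Enum values taken from the RestartStrategy validation column.
STRATEGY_TYPES = ("Sequential", "Parallel")


def build_restart(restart_id: str,
                  strategy_type: str = "Sequential",
                  order: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Build a {"restart": ...} fragment to merge into a deployment spec."""
    if len(restart_id) < 1:
        raise ValueError("restart id must have at least 1 character")
    if strategy_type not in STRATEGY_TYPES:
        raise ValueError(f"strategy type must be one of {STRATEGY_TYPES}")
    strategy: Dict[str, Any] = {"type": strategy_type}
    if order:
        # `order` is optional; include it only when the caller supplies one.
        strategy["order"] = list(order)
    return {"restart": {"id": restart_id, "strategy": strategy}}


# Placeholder service names; use the keys from spec.services in practice.
fragment = build_restart("2025-01-01T00:00:00Z",
                         order=["Frontend", "DecodeWorker"])
print(fragment)
```

Because any change to `id` triggers a restart, a timestamp or counter is a convenient value, though any non-empty string works.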
#### RestartStrategyType
_Underlying type:_ _string_
_Appears in:_
- [RestartStrategy](#restartstrategy)
| Field | Description |
| --- | --- |
| `Sequential` | |
| `Parallel` | |
# Operator Default Values Injection
The Dynamo operator automatically applies default values to various fields when they are not explicitly specified in your deployments. These defaults include:
@@ -941,9 +1025,8 @@ Worker components receive the following probe configurations:
- **Timeout**: 5 seconds
- **Failure Threshold**: 720 (allows up to 2 hours for startup: 10s × 720 = 7200s)
> [!NOTE]
> **For larger models (typically >70B parameters) or slower storage systems, you may need to increase the `failureThreshold` to allow more time for model loading. Calculate the required threshold based on your expected startup time: `failureThreshold = (expected_startup_seconds / period)`. Override the startup probe in your component specification if the default 2-hour window is insufficient.**
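The threshold formula in the note can be worked through with a small helper. This is a sketch: the function name is hypothetical, the 10-second default period is taken from the `10s × 720 = 7200s` arithmetic above, and rounding up to a whole number of probe attempts is an added assumption so the window is never shorter than the expected startup time.

```python
import math


def startup_failure_threshold(expected_startup_seconds: float,
                              period_seconds: float = 10) -> int:
    """Number of probe failures to tolerate so the startup window covers
    expected_startup_seconds at the given probe period."""
    if expected_startup_seconds <= 0 or period_seconds <= 0:
        raise ValueError("startup time and period must be positive")
    # Round up so a partial period still counts as a full probe attempt.
    return math.ceil(expected_startup_seconds / period_seconds)


# The default 2-hour window, then a 3-hour window for a slower model load.
print(startup_failure_threshold(2 * 3600))
print(startup_failure_threshold(3 * 3600))
```

With the default period, a 3-hour startup budget yields a threshold of 1080, compared with the default of 720 for 2 hours.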
### Multinode Deployment Probe Modifications
......