Unverified Commit f9050aae authored by Jonathan Tong's avatar Jonathan Tong Committed by GitHub
Browse files

docs: migrate existing docs to fern (#5445)


Signed-off-by: default avatarJont828 <jt572@cornell.edu>
Signed-off-by: default avatarNeal Vaidya <nealv@nvidia.com>
Co-authored-by: default avatarNeal Vaidya <nealv@nvidia.com>
parent f238d23a
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Distributed Runtime"
---
## Overview
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolate between different model deployments.
- `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each dynamo components typically are deployed with its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple workers:
- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the `Processor`.
- `Processor`: When a new request arrives, `Processor` applies the chat template and performs the tokenization.
Then, it routes the request to the `Worker`.
- `Worker` components (e.g., `VllmDecodeWorker`, `SGLangDecodeWorker`, `TrtllmWorker`): Perform the actual computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-agg`). Then, under their namespace, they have their own `Component`s: `Frontend` uses the `make_engine` function which handles HTTP serving and routing automatically, while worker components create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate`, `flush_cache`, or `clear_kv_blocks`. The `Frontend` component doesn't explicitly create endpoints - instead, the `make_engine` function handles the HTTP server and worker discovery. Worker components create their endpoints programmatically using the `component.endpoint()` method. Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("worker")`), and their `Endpoint`s are created using the `component.endpoint()` method.
## Initialization
In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic modes, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on etcd.
:::caution
The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
:::
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following two services:
- etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
- NATS (both static and dynamic mode): for messaging.
where etcd and NATS are two global services (there could be multiple etcd and NATS services for high availability).
For etcd, it also creates a primary lease and spin up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task failed, the primary lease is revoked or expired and the kv pairs stored with this lease_id is removed.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and is not registered in etcd. It provides the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't be registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` as the service identifier and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
- NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
- etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
## Calling Endpoints
Dynamo uses `Client` object to call an endpoint. When a `Client` objected is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject of the available `Endpoint`s.
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
- `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and create a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
## Examples
We provide native rust and python (through binding) examples for basic usage of `DistributedRuntime`:
- Rust: `/lib/runtime/examples/`
- Python: We also provide complete examples of using `DistributedRuntime`. Please refer to the engines in `components/src/dynamo` for full implementation details.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Architecture Flow"
---
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm). Color-coded flows indicate different types of operations:
## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
## 🟠 Decision and Allocation Flow (Orange)
The system's intelligent routing and resource allocation:
4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing
5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill
5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory
6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
## 🟢 Prefill Worker Flow (Green)
The dedicated prefill processing pipeline:
7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens
10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
## 🟣 Completion Flow (Purple)
The response generation and delivery:
11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker
12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data
13. **Response (S13)**: The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
## 🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support:
### ETCD Connections (Gray, dotted)
- **Frontend, Processor, Planner**: Service discovery and registration
- **Decode Worker, PrefillWorker**: NIXL metadata storage for GPU communication setup
### NATS Connections (Teal, dotted)
- **PrefillQueue**: JetStream consumer group for reliable work distribution
- **Processor**: Load balancing across workers
### Planning Connections (Gold, dotted)
- **Frontend → Planner**: Metrics collection for auto-scaling decisions
- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker
## Technical Implementation Details
### NIXL (NVIDIA Interchange Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
- Block-based transfers (64–128 tokens per block) for efficient batching
### Disaggregated KV Cache:
- Each Decode Worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—all transfers are direct worker-to-worker
- Pre-allocated blocks ensure deterministic memory layout and performance
```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
graph TD
%% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"]
S1[["<b>1 DISCOVERY</b>"]]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S2[["<b>2 REQUEST</b>"]]
%% Processing Layer
Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"]
S3[["<b>3 VALIDATE</b>"]]
%% Infrastructure - Positioned strategically to minimize crossings
subgraph INF["<b>Infrastructure Layer</b>"]
ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")]
NATS[("<b>NATS</b><br/><i>Message Broker</i>")]
Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"]
end
%% Worker Layer - Main processing
subgraph WL["<b>Worker Layer</b>"]
%% VllmWorker section
VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"]
S4[["<b>4 QUERY</b>"]]
S5[["<b>5 DISAGG DECISION</b>"]]
S5a[["<b>5a ALLOCATE</b>"]]
S12[["<b>12 DECODE</b>"]]
S6[["<b>6 QUEUE</b>"]]
S13[["<b>13 RESPONSE</b>"]]
%% Storage positioned near workers
LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")]
%% Prefill System - Right side to minimize crossings
subgraph PS["<b>Prefill System</b>"]
PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"]
PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"]
S7[["<b>7 NATS PULL</b>"]]
S8[["<b>8 LOAD METADATA</b>"]]
S9[["<b>9 PREFILL</b>"]]
S10[["<b>10 NIXL TRANSFER</b>"]]
S11[["<b>11 NOTIFY</b>"]]
end
end
%% Main Request Flow (Blue) - Clean vertical flow
Client -.-> S1
S1 -->|HTTP API Call| Frontend
Frontend -.-> S2
S2 -->|Process & Validate| Processor
Processor -.-> S3
S3 -->|Route to Worker| VllmWorker
%% VllmWorker Internal Flow (Orange)
VllmWorker -.-> S4
S4 -->|Query Prefix Cache Hit| S5
S5 -->|Prefill Length & Queue Check| S5a
S5a -->|Continue to Decode| S12
%% Allocation & Queuing (Orange) - Minimize crossings
S5a -->|Allocate KV Cache Blocks| LocalKVCache
VllmWorker --> S6
S6 -->|Put RemotePrefillRequest| PrefillQueue
%% Prefill Worker Flow (Green) - Self-contained within PS
PrefillQueue -.-> S7
S7 -->|Consumer Group Pull| PrefillWorker
PrefillWorker -.-> S8
PrefillWorker -.-> S9
S9 -->|Execute Prefill| S10
S10 -->|Direct GPU Transfer| LocalKVCache
PrefillWorker --> S11
%% Return Flow (Purple) - Clean return path
S11 -->|Completion Notification| S12
S12 -->|Decode from KV Cache| S13
S13 -->|Post-process Response| Processor
Processor -->|HTTP Response| Frontend
Frontend -->|Final Response| Client
%% Infrastructure Connections - Organized to avoid crossings
%% ETCD Connections - Grouped by proximity
Frontend -.->|Service Discovery| ETCD
Processor -.->|Service Discovery| ETCD
VllmWorker -.->|NIXL Metadata| ETCD
PrefillWorker -.->|NIXL Metadata| ETCD
S8 -.->|Load NIXL Metadata| ETCD
Planner -.->|Service Discovery| ETCD
%% NATS Connections - Direct to queue system
PrefillQueue -.->|JetStream| NATS
Processor -.->|Load Balancing| NATS
%% Planning Connections - Strategic positioning
Frontend -.->|Metrics| Planner
Planner -.->|Auto-scaling| VllmWorker
Planner -.->|Auto-scaling| PrefillWorker
%% Styling - Each component with unique colors
classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px
classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px
classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
class Client client
class Frontend frontend
class Processor processor
class VllmWorker worker
class PrefillQueue prefillQueue
class PrefillWorker prefillWorker
class Planner planner
class LocalKVCache storage
class ETCD etcd
class NATS nats
class PS prefillBox
class INF infraLayer
class WL workerLayer
%% Flow Colors - Different line styles to reduce visual clutter
%% Main Request Flow - Blue (solid)
linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 1 stroke:#1565C0,stroke-width:4px
linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 3 stroke:#1565C0,stroke-width:4px
linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 5 stroke:#1565C0,stroke-width:4px
%% Decision & Allocation Flow - Orange (mixed)
linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 7 stroke:#E65100,stroke-width:4px
linkStyle 8 stroke:#E65100,stroke-width:4px
linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
%% KV Cache & Queue - Orange (solid)
linkStyle 10 stroke:#E65100,stroke-width:4px
linkStyle 11 stroke:#E65100,stroke-width:4px
linkStyle 12 stroke:#E65100,stroke-width:4px
%% Prefill Worker Flow - Green (mixed)
linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 14 stroke:#2E7D32,stroke-width:4px
linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 17 stroke:#2E7D32,stroke-width:4px
linkStyle 18 stroke:#2E7D32,stroke-width:4px
linkStyle 19 stroke:#2E7D32,stroke-width:4px
%% Completion Flow - Purple (mixed)
linkStyle 20 stroke:#6A1B9A,stroke-width:4px
linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 22 stroke:#6A1B9A,stroke-width:4px
linkStyle 23 stroke:#6A1B9A,stroke-width:4px
linkStyle 24 stroke:#6A1B9A,stroke-width:4px
%% Infrastructure Flows - Lighter and dotted to reduce visual noise
%% ETCD Connections - Gray (dotted, thinner)
linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
%% NATS Connections - Teal (dotted, thinner)
linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
%% Planning Connections - Gold (dotted, thinner)
linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Event Plane Architecture"
---
This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.
## Overview
Dynamo's coordination layer adapts to the deployment environment:
| Deployment | Service Discovery | KV Events | Request Plane |
|------------|-------------------|-----------|---------------|
| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
<Note>
The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
</Note>
```
┌─────────────────────────────────────────────────────────────────────┐
│ Coordination Layer │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Service Discovery │ │ NATS │ │
│ │ │ │ (Optional) │ │
│ │ • K8s: CRDs + API │ │ • KV Cache Events │ │
│ │ • Bare metal: etcd │ │ • Router Replica Sync │ │
│ │ │ │ • JetStream Persistence │ │
│ └─────────────────────────┘ └─────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
│ │
┌──────────┴──────────┐ ┌─────────┴──────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Frontend │ │ Planner │ │ Worker │
└─────────┘ └─────────┘ └─────────┘
```
## Kubernetes-Native Service Discovery
When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.
### Configuration
The operator explicitly sets:
```bash
DYN_DISCOVERY_BACKEND=kubernetes
```
<Warning>
This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
</Warning>
### How It Works
1. **DynamoWorkerMetadata CRD**: Workers register their endpoints by creating/updating DynamoWorkerMetadata custom resources
2. **EndpointSlices**: Used to signal readiness status to the system
3. **K8s API Watches**: Components watch for CRD changes to discover available endpoints
### Benefits
- No external etcd cluster required
- Native integration with Kubernetes lifecycle
- Automatic cleanup when pods terminate
- Works with standard K8s RBAC
### Environment Variables (Injected by Operator)
| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Current pod name |
| `POD_NAMESPACE` | Current namespace |
| `POD_UID` | Pod unique identifier |
---
## etcd Architecture (Default for All Deployments)
When `DYN_DISCOVERY_BACKEND=kv_store` (the global default), etcd is used for service discovery.
### Connection Configuration
etcd connection is configured via environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` |
| `ETCD_AUTH_USERNAME` | Basic auth username | None |
| `ETCD_AUTH_PASSWORD` | Basic auth password | None |
| `ETCD_AUTH_CA` | CA certificate path (TLS) | None |
| `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None |
| `ETCD_AUTH_CLIENT_KEY` | Client key path | None |
Example:
```bash
export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379
```
### Lease Management
Each `DistributedRuntime` maintains a primary lease with etcd:
```
┌────────────────────┐ ┌──────────────┐
│ DistributedRuntime │◄────────│ Primary Lease │
│ │ │ TTL: 10s │
│ • Namespace │ └───────┬───────┘
│ • Components │ │
│ • Endpoints │ │ Keep-Alive
│ │ │ Heartbeat
└────────────────────┘ ▼
┌──────────────┐
│ etcd │
└──────────────┘
```
**Lease Lifecycle:**
1. **Creation**: Lease created during `DistributedRuntime` initialization
2. **Keep-Alive**: Background task sends heartbeats at 50% of remaining TTL
3. **Expiration**: If heartbeats stop, lease expires after TTL (10 seconds default)
4. **Cleanup**: All keys associated with the lease are automatically deleted
**Automatic Recovery:**
- Reconnection with exponential backoff (50ms to 5s)
- Deadline-based retry logic
- Cancellation token propagation
### Service Discovery
Endpoints are registered in etcd for dynamic discovery:
**Key Format:**
```
/services/{namespace}/{component}/{endpoint}/{instance_id}
```
**Example:**
```
/services/vllm-agg/backend/generate/694d98147d54be25
```
**Registration Data:**
```json
{
"namespace": "vllm-agg",
"component": "backend",
"endpoint": "generate",
"instance_id": 7587888160958628000,
"transport": {
"tcp": "192.168.1.10:9999"
}
}
```
### Discovery Queries
The discovery system supports multiple query patterns:
| Query Type | Pattern | Use Case |
|------------|---------|----------|
| `AllEndpoints` | `/services/` | List all services |
| `NamespacedEndpoints` | `/services/{namespace}/` | Filter by namespace |
| `ComponentEndpoints` | `/services/{namespace}/{component}/` | Filter by component |
| `Endpoint` | `/services/{namespace}/{component}/{endpoint}/` | Specific endpoint |
### Watch Functionality
Clients watch etcd prefixes for real-time updates:
```python
# Client watches for endpoint changes
watcher = etcd.watch_prefix("/services/vllm-agg/backend/generate/")
for event in watcher:
if event.type == "PUT":
# New endpoint registered
add_endpoint(event.value)
elif event.type == "DELETE":
# Endpoint removed (worker died)
remove_endpoint(event.key)
```
**Watch Features:**
- Initial state retrieval with `get_and_watch_prefix()`
- Automatic reconnection on stream failure
- Revision tracking for no-event-loss guarantees
- Event types: `PUT` (create/update) and `DELETE`
### Distributed Locks
etcd provides distributed locking for coordination:
**Lock Types:**
| Type | Key Pattern | Behavior |
|------|-------------|----------|
| Write Lock | `v1/{prefix}/writer` | Exclusive (no readers/writers) |
| Read Lock | `v1/{prefix}/readers/{id}` | Shared (multiple readers) |
**Operations:**
```rust
// Non-blocking write lock
let lock = client.try_write_lock("my_resource").await?;
// Blocking read lock with polling (100ms intervals)
let lock = client.read_lock_with_wait("my_resource").await?;
```
## NATS Architecture
### When NATS is Used
NATS is used for:
1. **KV Cache Events**: Real-time KV cache state updates for routing
2. **Router Replica Sync**: Synchronizing router state across replicas
3. **Legacy Request Plane**: NATS-based request transport (optional)
### Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| `NATS_SERVER` | NATS server URL | `nats://localhost:4222` |
### Disabling NATS
For deployments without KV-aware routing:
```bash
# Disable NATS and KV events
python -m dynamo.frontend --no-kv-events
```
This enables "approximate mode" for KV routing without event persistence.
### Event Publishing
Components publish events to NATS subjects:
```rust
pub trait EventPublisher {
async fn publish(&self, event: &str, data: &[u8]) -> Result<()>;
async fn publish_serialized<T: Serialize>(&self, event: &str, data: &T) -> Result<()>;
}
```
**Subject Naming:**
```
{base_subject}.{event_name}
```
Example:
```
vllm-agg.backend.kv_cache_update
```
### Event Subscription
Components subscribe to events:
```rust
pub trait EventSubscriber {
async fn subscribe(&self, topic: &str) -> Result<Subscriber>;
async fn subscribe_typed<T: DeserializeOwned>(&self, topic: &str) -> Result<TypedSubscriber<T>>;
}
```
### JetStream Persistence
For durable event delivery, NATS JetStream provides:
- Message persistence
- Replay from offset
- Consumer groups for load balancing
- Acknowledgment tracking
## Key-Value Store Abstraction
Dynamo provides a unified KV store interface supporting multiple backends:
### Supported Backends
| Backend | Use Case | Configuration |
|---------|----------|---------------|
| `EtcdStore` | Production deployments | `ETCD_ENDPOINTS` |
| `MemoryStore` | Testing, development | Default |
| `NatsStore` | NATS-only deployments | `NATS_SERVER` |
| `FileStore` | Local persistence | File path |
### Store Interface
```rust
pub trait KvStore {
async fn get(&self, bucket: &str, key: &str) -> Result<Option<Vec<u8>>>;
async fn put(&self, bucket: &str, key: &str, value: &[u8]) -> Result<()>;
async fn delete(&self, bucket: &str, key: &str) -> Result<()>;
async fn watch(&self, bucket: &str) -> Result<WatchStream>;
}
```
### Buckets
Data is organized into logical buckets:
| Bucket | Purpose |
|--------|---------|
| `v1/instances` | Endpoint instance registry |
| `v1/mdc` | Model deployment cards |
## Typed Prefix Watcher
For type-safe watching of etcd prefixes:
```rust
// Watch and maintain HashMap of deserialized values
let watcher = watch_prefix_with_extraction::<DiscoveryInstance>(
&etcd_client,
"/services/vllm-agg/",
lease_id_extractor,
value_extractor,
).await?;
// Receive updates via watch channel
let instances = watcher.borrow();
```
**Key Extractors:**
| Extractor | Description |
|-----------|-------------|
| `lease_id()` | Use lease ID as key |
| `key_string()` | Extract key with prefix stripping |
| `full_key_string()` | Use full etcd key |
## Reliability Features
### Connection Resilience
**etcd Reconnection:**
- Exponential backoff: 50ms to 5s
- Deadline-based retry logic
- Mutex ensures single concurrent reconnect
**NATS Reconnection:**
- Built-in reconnection in NATS client
- Configurable max reconnect attempts
- Buffering during disconnection
### Lease-Based Cleanup
When a worker crashes or loses connectivity:
1. Keep-alive heartbeats stop
2. Lease expires after TTL (10 seconds)
3. All registered endpoints automatically deleted
4. Clients receive DELETE watch events
5. Traffic reroutes to healthy workers
### Transaction Safety
etcd transactions ensure atomic operations:
```rust
// Atomic create-if-not-exists
let txn = Txn::new()
.when([Compare::create_revision(key, CompareOp::Equal, 0)])
.and_then([Op::put(key, value, options)]);
etcd_client.txn(txn).await?;
```
This prevents race conditions in concurrent service registration.
## Operational Modes
### Kubernetes Mode (Requires Explicit Configuration)
Native Kubernetes service discovery:
```bash
# Operator explicitly sets this (not auto-detected):
export DYN_DISCOVERY_BACKEND=kubernetes
# Workers register via K8s CRDs
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
# Frontend discovers workers via K8s API
python -m dynamo.frontend
```
No etcd or NATS required for basic operation when using K8s discovery.
### KV Store Mode (Global Default)
Full service discovery with etcd:
```bash
# This is the default - no configuration needed
# export DYN_DISCOVERY_BACKEND=kv_store # (implicit)
# Workers register with etcd
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
# Frontend discovers workers via etcd
python -m dynamo.frontend
```
### KV-Aware Routing (Optional)
Enable NATS for KV cache event tracking:
```bash
# Default: KV events enabled (requires NATS)
python -m dynamo.frontend --router-mode kv
# Disable KV events for prediction-based routing (no NATS)
python -m dynamo.frontend --router-mode kv --no-kv-events
```
With `--no-kv-events`:
- Router predicts cache state based on routing decisions
- TTL-based expiration and LRU pruning
- No NATS infrastructure required
## Best Practices
### 1. Use Kubernetes Discovery on K8s
The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required when using the operator.
### 2. For Bare Metal: Deploy etcd Cluster
For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.
### 3. Configure Appropriate TTLs (etcd mode)
Balance between detection speed and overhead:
- **Short TTL (5s)**: Faster failure detection, more keep-alive traffic
- **Long TTL (30s)**: Less overhead, slower detection
### 4. KV Routing Without NATS
For simpler deployments without NATS:
```bash
# Use prediction-based KV routing
python -m dynamo.frontend --router-mode kv --no-kv-events
```
This provides KV-aware routing with reduced accuracy but no NATS dependency.
## Related Documentation
- [Distributed Runtime](distributed-runtime.md) - Runtime architecture
- [Request Plane](../guides/request-plane.md) - Request transport configuration
- [Fault Tolerance](../fault-tolerance/request-cancellation.md) - Failure handling
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Writing Python Workers in Dynamo"
---
This guide explains how to create your own Python worker in Dynamo.
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
The Python file must do three things:
1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler
```
from dynamo.llm import ModelInput, ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker
# 1. Decorate a function to get the runtime
#
@dynamo_worker()
async def worker(runtime: DistributedRuntime):
# 2. Register ourselves on the network
#
component = runtime.namespace("namespace").component("component")
model_path = "Qwen/Qwen3-0.6B" # or "/data/models/Qwen3-0.6B"
model_input = ModelInput.Tokens # or ModelInput.Text if engine handles pre-processing
model_type = ModelType.Chat # or ModelType.Chat | ModelType.Completions if model can be deployed on chat and completions endpoints
endpoint = component.endpoint("endpoint")
# Optional last param to register_llm is model_name. If not present derives it from model_path
await register_llm(model_input, model_type, endpoint, model_path)
# Initialize your engine here
# engine = ...
# 3. Attach request handler
#
await endpoint.serve_endpoint(RequestHandler(engine).generate)
class RequestHandler:
def __init__(self, engine):
...
async def generate(self, request):
# Call the engine
# yield result dict
...
if __name__ == "__main__":
uvloop.install()
asyncio.run(worker())
```
The `model_path` can be:
- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a HuggingFace repo - any folder containing safetensor files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
The `model_input` can be:
- ModelInput.Tokens. Your engine expects pre-processed input (token IDs). Dynamo handles tokenization and pre-processing.
- ModelInput.Text. Your engine expects raw text input and handles its own tokenization and pre-processing.
The `model_type` can be:
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat).
- ModelType.Completions. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions).
`register_llm` can also take the following kwargs:
- `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault-tolerance/request-migration.md). Defaults to 0.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
See `examples/backends` for full code examples.
## Component names
A worker needs three names to register itself: namespace.component.endpoint
* *Namespace*: A pipeline. Usually a model. e.g "llama_8b". Just a name.
* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.
If you run two models, that is two pipelines. An exception would be if doing speculative decoding. The draft model is part of the pipeline of a bigger model.
If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
Example 1: Data parallel load balanced, one model one pipeline two instances.
```
Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 0
Node 2: namespace: qwen3-32b, component: backend, endpoint: generate model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 2
```
Example 2: Two models, two pipelines.
```
Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B
Node 2: namespace: llama3-1-8b, component: backend, endpoint: generat, model: /data/Llama-3.1-8B-Instruct/
```
Example 3: Different endpoints.
The KV metrics publisher in VLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vllm it will also expose `llama3-1-8b.backend.load_metrics`.
Example 4: Multiple component in a pipeline.
In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
## Migrate Ongoing Requests
A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline.
In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault-tolerance/request-migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
<Warning>
We will update the `GeneratorExit` exception to a new Dynamo exception. Please expect minor code breaking change in the near future.
</Warning>
Here's an example of how to implement this in your `RequestHandler`:
```python
class RequestHandler:
async def generate(self, request):
"""Generate response, with support for request migration"""
for result in self.engine.generate_streaming(request):
# Check if we need to migrate before yielding each token
if is_shutting_down():
# Raising GeneratorExit closes the stream and triggers migration
raise GeneratorExit("Worker shutting down, migrating request")
yield result
```
When `GeneratorExit` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.
For more information about how request migration works, see the [Request Migration Architecture](../fault-tolerance/request-migration.md) documentation.
## Request Cancellation
Your Python worker's request handler can optionally support request cancellation by accepting a `context` argument after the `request` argument. This context object allows you to check for cancellation signals and respond appropriately:
```python
class RequestHandler:
async def generate(self, request, context):
"""Generate response with cancellation support"""
for result in self.engine.generate_streaming(request):
# Check if the request has been cancelled
if context.is_stopped():
# Stop processing and clean up
break
yield result
```
The context parameter is optional - if your generate method doesn't include it in its signature, Dynamo will call your method without the context argument.
For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../fault-tolerance/request-cancellation.md) documentation.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Runtime"
---
<h4>A Datacenter Scale Distributed Inference Serving Framework</h4>
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
Rust implementation of the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
## Prerequisites
### Install Rust and Cargo using [rustup](https://rustup.rs/):
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Build
```
cargo build
cargo test
```
### Start Dependencies
#### Docker Compose
The simplest way to deploy the pre-requisite services is using
[docker-compose](https://docs.docker.com/compose/install/linux/),
defined in [deploy/docker-compose.yml](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml).
```
# At the root of the repository:
docker compose -f deploy/docker-compose.yml up -d
```
This will deploy a [NATS.io](https://nats.io/) server and an [etcd](https://etcd.io/)
server used to communicate between and discover components at runtime.
#### Local (alternate)
To deploy the pre-requisite services locally instead of using `docker-compose`
above, you can manually launch each:
- [NATS.io](https://docs.nats.io/running-a-nats-service/introduction/installation) server with [Jetstream](https://docs.nats.io/nats-concepts/jetstream)
- example: `nats-server -js --trace`
- [etcd](https://etcd.io) server
- follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
### Run Examples
When developing or running examples, any process or user that shared your core-services (`etcd` and `nats.io`) will
be operating within your distributed runtime.
The current examples use a hard-coded `namespace`. We will address the `namespace` collisions later.
All examples require the `etcd` and `nats.io` pre-requisites to be running and available.
#### Rust `hello_world`
With two terminals open, in one window:
```
cd examples/hello_world
cargo run --bin server
```
In the second terminal, execute:
```
cd examples/hello_world
cargo run --bin client
```
which should yield some output similar to:
```
Finished `dev` profile [unoptimized + debuginfo] target(s) in 6.25s
Running `target/debug/client`
Annotated { data: Some("h"), id: None, event: None, comment: None }
Annotated { data: Some("e"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("o"), id: None, event: None, comment: None }
Annotated { data: Some(" "), id: None, event: None, comment: None }
Annotated { data: Some("w"), id: None, event: None, comment: None }
Annotated { data: Some("o"), id: None, event: None, comment: None }
Annotated { data: Some("r"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("d"), id: None, event: None, comment: None }
```
#### Python
See the [README.md](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/lib/bindings/python/README.md) for details
The Python and Rust `hello_world` client and server examples are interchangeable,
so you can start the Python `server.py` and talk to it from the Rust `client`.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Fault Tolerance"
---
Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability.
## Overview
Fault tolerance in Dynamo operates at multiple levels:
| Layer | Mechanism | Purpose |
|-------|-----------|---------|
| **Request** | Migration, Cancellation | Handle in-flight request failures |
| **Worker** | Health Checks, Graceful Shutdown | Detect and recover from worker failures |
| **System** | Load Shedding, Request Rejection | Prevent system overload |
| **Infrastructure** | etcd HA, NATS resilience | Handle infrastructure component failures |
## Key Features
### Request Migration
When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers. The migration system:
- Preserves partial generation state (accumulated tokens)
- Transparently continues generation on a new worker
- Maintains seamless token flow to clients
See [Request Migration](request-migration.md) for details.
### Request Cancellation
Dynamo supports canceling in-flight requests to free computational resources:
- Graceful stop signals for clean termination
- Kill signals for immediate termination
- Hierarchical cancellation propagation through request chains
See [Request Cancellation](request-cancellation.md) for details.
### Graceful Shutdown
Workers handle shutdown signals (SIGTERM/SIGINT) gracefully:
- Immediately stop accepting new requests
- Optionally drain in-flight requests before terminating
- Clean up resources (engines, connections, temp files)
See [Graceful Shutdown](graceful-shutdown.md) for details.
### Request Rejection (Load Shedding)
When workers are overloaded, Dynamo rejects new requests to prevent cascading failures:
- Configurable busy thresholds based on KV cache utilization
- Real-time worker load monitoring
- HTTP 503 responses with retry guidance
See [Request Rejection](request-rejection.md) for details.
### Health Checks
Dynamo provides multiple health check mechanisms:
- **HTTP Endpoints**: `/health` and `/live` endpoints for orchestration
- **Canary Health Checks**: Active monitoring via periodic test requests
- **Engine Monitoring**: Automatic shutdown on engine failure detection
See [Health Checks](../observability/health-checks.md) for details.
## Configuration Quick Reference
| Feature | Environment Variable | Default |
|---------|---------------------|---------|
| Worker health port | `DYN_SYSTEM_PORT` | `9090` |
| Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` (K8s: `true`) |
| Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds |
| Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds |
| Decode blocks threshold | `--active-decode-blocks-threshold` | None (disabled) |
| Prefill tokens threshold | `--active-prefill-tokens-threshold` | None (disabled) |
## Failure Scenarios and Recovery
### Worker Pod Restart
1. Worker receives SIGTERM from Kubernetes
2. Endpoints are immediately invalidated (no new requests)
3. In-flight requests complete or migrate (based on configuration)
4. Resources are cleaned up
5. Pod restarts with fresh state
### Worker Crash (Unexpected)
1. etcd lease expires (TTL-based detection)
2. Client discovers endpoint removal via etcd watch
3. New requests route to remaining healthy workers
4. In-flight requests on crashed worker are migrated (if enabled)
### Network Partition
1. Worker loses connectivity to etcd/NATS
2. Lease keep-alive fails, lease eventually expires
3. Worker is removed from service discovery
4. Traffic reroutes to reachable workers
### GPU Failure
1. Engine health check detects GPU error (XID, OOM, etc.)
2. Worker initiates graceful shutdown
3. Runtime is shut down, engine cleaned up
4. Process exits with code 1 for pod restart
## Testing Fault Tolerance
Dynamo includes a comprehensive testing framework for validating fault tolerance:
- Request cancellation tests
- Migration tests with worker failures
- etcd HA failover tests
- Hardware fault injection (GPU XID, network partitions)
See [Fault Tolerance Testing](testing.md) for details.
## Related Documentation
- [Observability](../observability/README.md) - Metrics and monitoring
- [Distributed Runtime](../design-docs/distributed-runtime.md) - Service discovery architecture
- [Event Plane](../design-docs/event-plane.md) - etcd and NATS coordination
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Graceful Shutdown"
---
This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
## Overview
Graceful shutdown in Dynamo ensures that:
1. **No new requests are accepted** - Endpoints are immediately invalidated
2. **In-flight requests complete** - Existing requests finish processing (configurable)
3. **Resources are cleaned up** - Engines, connections, and temporary files are released
4. **Pods restart cleanly** - Exit codes signal Kubernetes for proper restart behavior
## Signal Handling
All Dynamo components handle Unix signals for graceful shutdown:
| Signal | Trigger | Behavior |
|--------|---------|----------|
| `SIGTERM` | Kubernetes pod termination | Graceful shutdown initiated |
| `SIGINT` | Ctrl+C / manual interrupt | Graceful shutdown initiated |
### Implementation
Each component registers signal handlers at startup:
```python
def signal_handler():
asyncio.create_task(graceful_shutdown(runtime))
for sig in (signal.SIGTERM, signal.SIGINT):
loop.add_signal_handler(sig, signal_handler)
```
The `graceful_shutdown()` function:
1. Logs the shutdown signal
2. Calls `runtime.shutdown()` to invalidate endpoints
3. Waits for in-flight requests (based on configuration)
4. Returns to allow cleanup to proceed
## Endpoint Draining
When `runtime.shutdown()` is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter when serving the endpoint.
### Configuration
When registering an endpoint, the `graceful_shutdown` parameter controls draining behavior:
```python
generate_endpoint.serve_endpoint(
handler.generate,
graceful_shutdown=True, # Wait for all requests to finish
metrics_labels=[("model", model_name)],
health_check_payload=health_check_payload,
)
```
| `graceful_shutdown` | Behavior |
|---------------------|----------|
| `True` | Wait for all in-flight requests to complete before returning |
| `False` | Return immediately without waiting for requests |
### Component-Specific Behavior
| Component | Default Behavior | Rationale |
|-----------|------------------|-----------|
| **Frontend** | N/A (HTTP server) | HTTP server handles its own shutdown |
| **Prefill Workers** | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation |
| **Decode Workers** | Conditional | If migration is enabled (`migration_limit > 0`), shutdown immediately to allow migration; otherwise wait |
| **Router** | `graceful_shutdown=True` | Ensure routing decisions complete |
### Decode Worker Migration Integration
Decode workers use conditional draining based on whether request migration is supported:
```python
generate_endpoint.serve_endpoint(
handler.generate,
graceful_shutdown=config.migration_limit <= 0, # If no migration, wait for requests
...
)
```
When `migration_limit > 0`:
- Worker shuts down immediately (`graceful_shutdown=False`)
- In-flight requests are migrated to healthy workers
- No request loss occurs
When `migration_limit <= 0`:
- Worker waits for in-flight requests (`graceful_shutdown=True`)
- Migration is not available
- Requests complete on the shutting-down worker
## Resource Cleanup
After endpoint draining, components clean up their resources in `finally` blocks:
### vLLM Worker Cleanup
```python
finally:
logger.debug("Cleaning up worker")
handler.cleanup()
```
The handler's `cleanup()` method:
- Removes temporary directories (LoRA adapters, etc.)
- Releases engine resources
### SGLang Worker Cleanup
```python
def cleanup(self) -> None:
# Cancel pending consume tasks
for task in self._consume_tasks:
if not task.done():
task.cancel()
self._consume_tasks.clear()
# Shutdown engine
self.engine.shutdown()
```
### TensorRT-LLM Worker Cleanup
```python
async def cleanup(self):
if self._llm:
try:
self._llm.shutdown()
except Exception as e:
logging.error(f"Error during cleanup: {e}")
finally:
self._llm = None
```
## Error-Initiated Shutdown
Workers can initiate graceful shutdown when fatal errors occur:
### Engine Health Monitoring (vLLM)
The `VllmEngineMonitor` continuously checks engine health:
```python
async def _check_engine_health(self):
while True:
try:
await self.engine_client.check_health()
await asyncio.sleep(HEALTH_CHECK_INTERVAL) # 2 seconds
except EngineDeadError as e:
logger.error(f"Health check failed: {e}")
self._shutdown_engine()
self.runtime.shutdown()
os._exit(1)
```
Configuration:
- `HEALTH_CHECK_INTERVAL`: 2 seconds between checks
- `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds max for engine shutdown
### Fatal Error Handling (TensorRT-LLM)
```python
async def _initiate_shutdown(self, error: Exception):
logging.warning(f"Initiating graceful shutdown due to: {error}")
try:
if self.runtime:
self.runtime.shutdown()
if self.engine:
await self.engine.cleanup()
except Exception as cleanup_error:
logging.error(f"Error during graceful shutdown: {cleanup_error}")
finally:
logging.critical("Forcing process exit for restart")
os._exit(1)
```
## Kubernetes Integration
### Pod Termination Flow
1. Kubernetes sends `SIGTERM` to the pod
2. Dynamo initiates graceful shutdown
3. Pod has `terminationGracePeriodSeconds` to complete (default: 30s)
4. If not terminated, Kubernetes sends `SIGKILL`
### Recommended Configuration
For production deployments, configure adequate termination grace period:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
services:
VllmWorker:
extraPodSpec:
terminationGracePeriodSeconds: 60 # Allow time for request draining
```
### Health Check Integration
Kubernetes uses health endpoints to determine pod readiness:
- **During shutdown**: Endpoints become unavailable
- **Readiness probe fails**: Traffic stops routing to the pod
- **Graceful draining**: Existing requests complete
## Best Practices
### 1. Set Appropriate Grace Periods
Match `terminationGracePeriodSeconds` to your expected request completion time:
- Short requests (< 10s): 30s grace period
- Long generation (> 30s): 120s+ grace period
### 2. Enable Request Migration for Decode Workers
If using disaggregated serving, enable migration for decode workers:
```python
--migration-limit 3 # Allow up to 3 migration attempts
```
This allows immediate shutdown while preserving request state.
### 3. Monitor Shutdown Metrics
Track shutdown behavior via logs:
```
INFO Received shutdown signal, shutting down DistributedRuntime
INFO DistributedRuntime shutdown complete
DEBUG Cleaning up worker
```
### 4. Handle Cleanup Errors
Ensure cleanup methods handle errors gracefully:
```python
def cleanup(self):
for resource in self.resources:
try:
resource.cleanup()
except Exception as e:
logger.warning(f"Cleanup failed: {e}")
# Continue with other resources
```
## Related Documentation
- [Request Migration](request-migration.md) - How requests migrate during shutdown
- [Request Cancellation](request-cancellation.md) - Canceling in-flight requests
- [Health Checks](../observability/health-checks.md) - Liveness and readiness probes
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Request Cancellation Architecture"
---
This document describes how Dynamo implements request cancellation to cancel in-flight requests between Dynamo workers. Request cancellation allows in-flight requests to terminate early, saving computational resources that would otherwise be spent on responses that are no longer needed.
## AsyncEngineContext Trait
At the core of Dynamo's request cancellation system is the `AsyncEngineContext` trait. This trait is associated with every request stream and provides lifecycle management for async operations, including stream identification, graceful shutdown capabilities, and immediate termination capabilities.
### Key Methods
#### Identification
- **`id()`**: Returns the unique identifier for the stream. This ID is set by the user for request identification, and the same ID can be used for sub-requests to associate them with the original user request.
#### Status Checking
- **`is_stopped()`**: Returns `true` if graceful cancellation has been requested via `stop_generating()`. This represents a signal to the worker that the request has been cancelled and it should return early.
- **`is_killed()`**: Returns `true` if a hard stop has been issued via `kill()`. This typically indicates that the network connection between client and server has been cut or an immediate termination is required.
#### Async Status Monitoring
- **`stopped()`**: An async method that completes when the context becomes stopped. If already stopped, returns immediately.
- **`killed()`**: An async method that completes when the context becomes killed. If already killed, returns immediately.
#### Cancellation Control
- **`stop_generating()`**: The recommended method for cancelling a request. This informs the engine to stop producing results for the stream gracefully. This method is idempotent and does not invalidate results currently in the stream.
- **`stop()`**: Alias for `stop_generating()`.
- **`kill()`**: Extends `stop_generating()` but also indicates a preference to terminate without draining remaining items in the stream. This is implementation-specific and may not be supported by all engines.
#### Child Request Management
- **`link_child(child: Arc<dyn AsyncEngineContext>)`**: Links a child `AsyncEngineContext` to this context. When `stop_generating()`, `stop()`, or `kill()` is called on the parent context, the same method is automatically called on all linked child contexts in the order they were linked. This is especially useful in disaggregated serving scenarios where a frontend receives cancellation notification and needs to cancel requests to workers, and the worker can then cancel its sub-requests (e.g., remote prefill operations).
### Thread Safety
The `AsyncEngineContext` trait ensures thread-safety with `Send + Sync` bounds, allowing safe concurrent access across multiple threads and async tasks.
## Python Bindings
The `AsyncEngineContext` functionality is exposed to Python through the `Context` class, which provides a largely one-to-one mapping from Rust methods to Python methods.
### Python Context Class
The Python `Context` class wraps the Rust `AsyncEngineContext` and exposes the following methods:
- **`id()`**: Returns the unique identifier for the context
- **`is_stopped()`**: Synchronous method equivalent to the Rust `is_stopped()`
- **`is_killed()`**: Synchronous method equivalent to the Rust `is_killed()`
- **`stop_generating()`**: Issues a stop generating signal, equivalent to the Rust method
- **`async_killed_or_stopped()`**: An async method that completes when the context becomes either killed or stopped, whichever happens first. This combines the functionality of the Rust `killed()` and `stopped()` async methods using `tokio::select!`.
For a working example of request cancellation, see the [cancellation demo](https://github.com/ai-dynamo/dynamo/tree/main/examples/custom_backend/cancellation/README.md).
### Context Usage in Python
The context is available optionally in both incoming and outgoing request scenarios:
#### Incoming Requests
For incoming requests, the generate method may optionally accept a `context` argument after the `request` argument. If the `context` parameter is specified in the method signature, it will receive the context object of the incoming request. Request handlers can:
- Check for cancellation synchronously using `context.is_stopped()` before beginning expensive operations
- Listen for cancellation asynchronously using `await context.async_killed_or_stopped()`
Example:
```python
async def generate(self, request, context):
for i in range(1000):
# Check for cancellation before expensive work
if context.is_stopped():
raise asyncio.CancelledError
# Perform work...
await expensive_computation()
yield result
```
#### Outgoing Requests
For outgoing requests, Python scripts may optionally provide a context object to outgoing runtime endpoint client router operations (such as `generate`, `round_robin`, `random`, `direct` methods) as a keyword argument. The script can cancel the outgoing request via the provided context object.
This is especially useful when child outgoing requests need to be cancelled when the parent incoming request is cancelled. In such cases, the script can simply pass the incoming context object to the outgoing request, automatically linking the cancellation behavior.
Example:
```python
async def generate(self, request, context):
# Forward the incoming context to outgoing request
# If the incoming request is cancelled, the outgoing request will be too
stream = await self.client.generate(request, context=context)
async for response in stream:
yield response
```
This design enables seamless cancellation propagation through multi-tier request chains, ensuring that when a client cancels a request, all associated sub-requests are automatically cancelled, saving computational resources across the entire request pipeline.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Request Migration Architecture"
---
This document describes how Dynamo implements request migration to handle worker failures gracefully during LLM text generation. Request migration allows in-progress requests to continue on different workers when the original worker becomes unavailable, providing fault tolerance and improved user experience.
## Overview
Request migration is implemented through a Migration operator that sits in the LLM processing pipeline between the Backend operator and the service backend. When a worker fails during request processing, the migration system preserves the partial generation state and recreates the request on a new worker to continue from where the previous worker left off.
## Architecture Components
### Migrator
The migration system is integrated into the LLM processing pipeline between the frontend preprocessing and the actual service backends. This positioning allows it to intercept all communication flows and manage failure scenarios transparently.
Key responsibilities:
- Intercepts all requests and responses flowing through the pipeline
- Detects worker failure scenarios through error pattern matching
- Manages retry logic with configurable migration limits
- Tracks partial response state for seamless continuation
### Migration Limit Configuration
Each model can be configured with a migration limit parameter that specifies the maximum number of times a request can be migrated to another worker:
- Default behavior: no migration allowed
- Can be set independently for different engine types
- Applicable to LLM worker nodes that perform inference
- Allows engines to override user-specified limits for compatibility
## Token State Tracking and Request Migration
The core of the migration system is the ability to preserve and continue partial generations through token state management. This ensures that when a worker fails mid-generation, the new worker can seamlessly continue from the exact point of failure.
### Token Accumulation Process
When a request is being processed and responses are flowing back from a worker, the migration system tracks every token that has been successfully generated:
1. **Initial Request State**: The system starts with the original preprocessed request containing the initial prompt tokens.
2. **Response Tracking**: As each response arrives from the worker, the migration system extracts the newly generated tokens and appends them to the request's token sequence. This creates accumulates all tokens that have been generated.
3. **Token Count Management**: The system also updates the remaining token budget to reflect the number of tokens already generated, ensuring that the total generation stays within the originally requested limits.
### Migration Trigger Scenarios
The migration system handles two distinct failure scenarios:
#### 1. New Request Migration (Initial Connection Failure)
**Scenario**: Worker is unreachable when creating the initial connection.
**Error Pattern**: Communication system reports chosen worker instance is unavailable.
**Migration Process**:
- Detects connection failure during initial stream setup
- Decrements migration retry count
- Attempts to create a new stream with the original request
- No partial state to preserve since generation hasn't started
#### 2. Ongoing Request Migration (Mid-Stream Disconnection)
**Scenario**: Connection lost during active generation after partial responses have been received.
**Error Pattern**: Stream termination detected before generation completion.
**Migration Process**:
1. **Failure Detection**: The system detects the stream disconnection through error monitoring.
2. **State Preservation**: At this point, the request's token sequence contains both the original prompt tokens and all successfully generated tokens from the failed worker.
3. **New Stream Creation**: A fresh stream is created with the accumulated request state, ensuring the new worker has complete context.
4. **Continuation**: The new worker receives the request with the full token context and continues generation from the exact point where the previous worker left off.
### Seamless Token Flow and Request State Evolution
From the client's perspective, the token stream appears continuous and uninterrupted. The client receives tokens from the first worker until failure occurs, then seamlessly continues receiving tokens from the backup worker without any indication of the underlying migration.
The request state evolves dynamically during processing. Initially, the request contains only the original prompt tokens. As generation proceeds, each successfully generated token is appended to the request's token sequence, creating a growing record of the complete conversation context.
When a migration occurs, this accumulated state is transferred to the new worker, which uses it to reconstruct the complete context. The new worker then continues generation as if it had been processing the request from the beginning, but starting from the current position in the sequence.
The migration is transparent because:
1. No tokens are lost or duplicated during the transition
2. The new worker has complete context via the accumulated token sequence
3. Generation continues from the exact failure point
4. Response streaming maintains consistent format and timing
This token accumulation mechanism ensures that migrations are truly seamless, preserving all computational work and maintaining generation quality across worker transitions.
## Benefits
1. **Fault Tolerance**: System continues operating during individual worker failures
2. **Resource Efficiency**: Partial generations are preserved rather than restarted
3. **Seamless User Experience**: Users experience no interruption during worker failures
4. **Configurable Behavior**: Migration limits allow tuning based on deployment requirements
5. **No Token Loss**: Complete preservation of generation state across migrations
## Design Considerations
The migration system is designed with several important architectural considerations:
**Engine Compatibility**: Different LLM engines may have varying capabilities for handling migrated requests. The system allows engines to override migration settings to ensure compatibility and correctness.
**Multi-Model Support**: Since a frontend may serve multiple models simultaneously, migration limits can be configured at the engine level, providing flexibility for different model types with varying reliability characteristics.
**State Management**: The system carefully tracks not only token sequences but also metadata such as remaining token budgets, stop conditions, and sampling parameters to ensure complete state preservation.
**Error Handling**: The migration system distinguishes between different types of failures and applies appropriate recovery strategies for each scenario.
## Monitoring and Metrics
The migration system exposes Prometheus metrics to monitor migration activity. These metrics are available on the frontend's `/metrics` endpoint (default port 8000):
- `dynamo_frontend_model_migration_total`: Counter tracking the total number of request migrations
- Labels:
- `model`: The model name being served
- `migration_type`: Either `new_request` (initial connection failure) or `ongoing_request` (mid-stream disconnection)
**Example metrics output:**
```
dynamo_frontend_model_migration_total{migration_type="ongoing_request",model="Qwen/Qwen3-0.6B"} 3
dynamo_frontend_model_migration_total{migration_type="new_request",model="Qwen/Qwen3-0.6B"} 1
```
These metrics can be used to:
- Monitor worker reliability and failure patterns
- Alert on excessive migration rates indicating infrastructure issues
- Track the effectiveness of fault tolerance mechanisms
For more information on Dynamo metrics, see the [Metrics documentation](../observability/metrics.md).
## Operational Impact
Request migration fundamentally changes how the system handles failures, moving from a "fail-fast" approach to a "graceful degradation" model. This architectural shift enables higher availability and better resource utilization while maintaining the same external API contract for clients.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Request Rejection (Load Shedding)"
---
This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
## Overview
Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:
- Cascading failures from resource exhaustion
- Degraded latency for all requests
- Out-of-memory conditions on GPU workers
When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.
## Architecture
```
┌─────────────────┐
│ Worker Monitor │
│ (Background) │
└────────┬────────┘
│ Updates busy list
┌──────────┐ ┌──────────┐ ┌─────────────────────┐ ┌──────────┐
│ Client │───▶│ Frontend │───▶│ Push Router │───▶│ Worker │
└──────────┘ └──────────┘ │ (checks busy list) │ └──────────┘
└─────────────────────┘
│ If all workers busy
┌─────────────────────┐
│ HTTP 503 Error │
│ "All workers busy" │
└─────────────────────┘
```
## Configuration
### Frontend Arguments
Configure busy thresholds when starting the frontend:
```bash
python -m dynamo.frontend \
--active-decode-blocks-threshold 0.85 \
--active-prefill-tokens-threshold 10000
```
| Argument | Type | Description |
|----------|------|-------------|
| `--active-decode-blocks-threshold` | float (0.0-1.0) | KV cache block utilization threshold |
| `--active-prefill-tokens-threshold` | int | Prefill token count threshold |
### Dynamic Configuration via API
Thresholds can be adjusted at runtime via the `/busy_threshold` endpoint:
#### Set Thresholds
```bash
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"active_decode_blocks_threshold": 0.85,
"active_prefill_tokens_threshold": 10000
}'
```
#### Get Current Thresholds
```bash
curl http://localhost:8000/busy_threshold
```
Response:
```json
{
"thresholds": [
{
"model": "Qwen/Qwen3-0.6B",
"active_decode_blocks_threshold": 0.85,
"active_prefill_tokens_threshold": 10000
}
]
}
```
## Busy Detection Logic
Workers are marked as "busy" based on a dual-threshold system. A worker is considered busy when **either** threshold is exceeded.
### KV Cache Block Threshold
Monitors the percentage of KV cache blocks in use:
```
busy = active_decode_blocks / kv_total_blocks > threshold
```
Example: With `active_decode_blocks_threshold=0.85`, a worker using 87% of its KV cache blocks is marked busy.
### Prefill Token Threshold
Monitors the number of tokens currently being prefilled:
```
busy = active_prefill_tokens > threshold
```
Example: With `active_prefill_tokens_threshold=10000`, a worker prefilling 12,000 tokens is marked busy.
### Data-Parallel Rank Aggregation
For workers with multiple data-parallel ranks (tensor parallelism), the worker is only marked busy if **ALL** ranks are busy:
```python
def is_busy(worker):
return all(rank.is_busy() for rank in worker.dp_ranks)
```
This prevents false positives when only some ranks are temporarily loaded.
## Worker Load Monitoring
The `KvWorkerMonitor` runs as a background task that:
1. Subscribes to KV cache metrics events from workers
2. Maintains load state for each worker instance
3. Recalculates busy instances when metrics change
4. Updates the router with the current busy list
### Metrics Collected
Workers publish these metrics for monitoring:
| Metric | Description |
|--------|-------------|
| `active_decode_blocks` | Number of KV cache blocks currently in use |
| `kv_total_blocks` | Total KV cache blocks available |
| `active_prefill_tokens` | Number of tokens currently being prefilled |
## Rejection Behavior
### Request Flow
1. Request arrives at frontend
2. Push router checks if busy threshold is configured
3. If configured, router retrieves list of free (non-busy) instances
4. If no free instances exist (but instances are registered):
- Request is rejected with `PipelineError::ServiceOverloaded`
- HTTP 503 response is returned to client
### Error Response
When requests are rejected, clients receive:
```http
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
{
"message": "Service temporarily unavailable: All workers are busy, please retry later",
"type": "service_unavailable",
"code": 503
}
```
### Client Retry Strategy
Clients should implement exponential backoff when receiving 503 responses:
```python
import time
import random
def send_with_retry(request, max_retries=5):
for attempt in range(max_retries):
response = send_request(request)
if response.status_code != 503:
return response
# Exponential backoff with jitter
wait_time = min(60, (2 ** attempt) + random.uniform(0, 1))
time.sleep(wait_time)
raise Exception("Max retries exceeded")
```
## Monitoring
### Prometheus Metrics
Track rejection behavior with these metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `dynamo_tasks_rejected_total` | Counter | Total number of rejected tasks |
| `dynamo_queued_requests` | Gauge | Requests waiting in HTTP queue |
### Example Prometheus Queries
```promql
# Rejection rate over 5 minutes
rate(dynamo_tasks_rejected_total[5m])
# Percentage of requests rejected
sum(rate(dynamo_tasks_rejected_total[5m])) /
sum(rate(dynamo_tasks_issued_total[5m])) * 100
```
### Grafana Alerting
Example alert for high rejection rate:
```yaml
alert: HighRequestRejectionRate
expr: |
sum(rate(dynamo_tasks_rejected_total[5m])) /
sum(rate(dynamo_tasks_issued_total[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High request rejection rate"
description: "More than 10% of requests are being rejected"
```
## Tuning Thresholds
### Conservative Settings (Latency-Focused)
For applications prioritizing low latency:
```bash
--active-decode-blocks-threshold 0.70
--active-prefill-tokens-threshold 5000
```
- Rejects earlier, before workers become fully loaded
- Maintains lower queue depths
- Better tail latencies
### Aggressive Settings (Throughput-Focused)
For applications prioritizing throughput:
```bash
--active-decode-blocks-threshold 0.95
--active-prefill-tokens-threshold 20000
```
- Allows higher worker utilization
- May increase latency variability
- Better overall throughput
### Disabled (No Rejection)
To disable request rejection entirely:
```bash
# Simply don't set the threshold arguments
python -m dynamo.frontend
```
Without thresholds configured, all requests are accepted regardless of worker load.
## Best Practices
### 1. Start Conservative, Then Tune
Begin with conservative thresholds and increase based on observed behavior:
```bash
# Start here
--active-decode-blocks-threshold 0.75
# Increase if rejection rate is too high
--active-decode-blocks-threshold 0.85
```
### 2. Monitor Before Enabling
Observe worker load patterns before setting thresholds:
```bash
# Watch KV cache utilization
watch -n 1 'curl -s localhost:8000/metrics | grep kv_blocks'
```
### 3. Use Both Thresholds for Disaggregated Serving
In disaggregated deployments:
- Use `active_prefill_tokens_threshold` for prefill workers
- Use `active_decode_blocks_threshold` for decode workers
### 4. Coordinate with Autoscaling
If using Kubernetes HPA, ensure rejection thresholds trigger before autoscaling:
```yaml
# HPA triggers at 70% utilization
# Rejection at 85% provides buffer
--active-decode-blocks-threshold 0.85
```
## Related Documentation
- [Health Checks](../observability/health-checks.md) - Worker health monitoring
- [Metrics](../observability/metrics.md) - Available Prometheus metrics
- [Request Migration](request-migration.md) - Handling failed requests
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Fault Tolerance Testing"
---
This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios.
## Overview
Dynamo's fault tolerance test suite is located in `tests/fault_tolerance/` and includes:
| Test Category | Location | Purpose |
|---------------|----------|---------|
| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
| Migration | `migration/` | Request migration when workers fail |
| etcd HA | `etcd_ha/` | etcd failover and recovery |
| Hardware | `hardware/` | GPU and network fault injection |
| Deployment | `deploy/` | End-to-end deployment testing |
## Test Directory Structure
```
tests/fault_tolerance/
├── cancellation/
│ ├── test_vllm.py
│ ├── test_trtllm.py
│ ├── test_sglang.py
│ └── utils.py
├── migration/
│ ├── test_vllm.py
│ ├── test_trtllm.py
│ ├── test_sglang.py
│ └── utils.py
├── etcd_ha/
│ ├── test_vllm.py
│ ├── test_trtllm.py
│ ├── test_sglang.py
│ └── utils.py
├── hardware/
│ └── fault_injection_service/
│ ├── api_service/
│ └── agents/
├── deploy/
│ ├── test_deployment.py
│ ├── scenarios.py
│ ├── base_checker.py
│ └── ...
└── client.py
```
## Request Cancellation Tests
Test that in-flight requests can be properly canceled.
### Running Cancellation Tests
```bash
# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v
# Run for specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v
```
### Cancellation Test Utilities
The `cancellation/utils.py` module provides:
#### CancellableRequest
Thread-safe request cancellation via TCP socket manipulation:
```python
from tests.fault_tolerance.cancellation.utils import CancellableRequest
request = CancellableRequest()
# Send request in separate thread
thread = Thread(target=send_request, args=(request,))
thread.start()
# Cancel after some time
time.sleep(1)
request.cancel() # Closes underlying socket
```
#### send_completion_request / send_chat_completion_request
Send cancellable completion requests:
```python
from tests.fault_tolerance.cancellation.utils import (
send_completion_request,
send_chat_completion_request
)
# Non-streaming
response = send_completion_request(
base_url="http://localhost:8000",
model="Qwen/Qwen3-0.6B",
prompt="Hello, world!",
max_tokens=100
)
# Streaming with cancellation
responses = send_chat_completion_request(
base_url="http://localhost:8000",
model="Qwen/Qwen3-0.6B",
messages=[{"role": "user", "content": "Hello!"}],
stream=True,
cancellable_request=request
)
```
#### poll_for_pattern
Wait for specific patterns in logs:
```python
from tests.fault_tolerance.cancellation.utils import poll_for_pattern
# Wait for cancellation confirmation
found = poll_for_pattern(
log_file="/var/log/dynamo/worker.log",
pattern="Request cancelled",
timeout=30,
interval=0.5
)
```
## Migration Tests
Test that requests migrate to healthy workers when failures occur.
### Running Migration Tests
```bash
# Run all migration tests
pytest tests/fault_tolerance/migration/ -v
# Run for specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v
```
### Migration Test Utilities
The `migration/utils.py` module provides:
- Frontend wrapper with configurable request planes
- Long-running request spawning for migration scenarios
- Health check disabling for controlled testing
### Example Migration Test
```python
def test_migration_on_worker_failure():
# Start deployment with 2 workers
deployment = start_deployment(workers=2)
# Send long-running request
request_thread = spawn_long_request(max_tokens=1000)
# Kill one worker mid-generation
kill_worker(deployment.workers[0])
# Verify request completes on remaining worker
response = request_thread.join()
assert response.status_code == 200
assert len(response.tokens) > 0
```
## etcd HA Tests
Test system behavior during etcd failures and recovery.
### Running etcd HA Tests
```bash
pytest tests/fault_tolerance/etcd_ha/ -v
```
### Test Scenarios
- **Leader failover**: etcd leader node fails, cluster elects new leader
- **Network partition**: etcd node becomes unreachable
- **Recovery**: System recovers after etcd becomes available
## Hardware Fault Injection
The fault injection service enables testing under simulated hardware failures.
### Fault Injection Service
Located at `tests/fault_tolerance/hardware/fault_injection_service/`, this FastAPI service orchestrates fault injection:
```bash
# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main
```
### Supported Fault Types
#### GPU Faults
| Fault Type | Description |
|------------|-------------|
| `XID_ERROR` | Simulate GPU XID error (various codes) |
| `THROTTLE` | GPU thermal throttling |
| `MEMORY_PRESSURE` | GPU memory exhaustion |
| `OVERHEAT` | GPU overheating condition |
| `COMPUTE_OVERLOAD` | GPU compute saturation |
#### Network Faults
| Fault Type | Description |
|------------|-------------|
| `FRONTEND_WORKER` | Partition between frontend and workers |
| `WORKER_NATS` | Partition between workers and NATS |
| `WORKER_WORKER` | Partition between workers |
| `CUSTOM` | Custom network partition |
### Fault Injection API
#### Inject GPU Fault
```bash
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
-H "Content-Type: application/json" \
-d '{
"target_pod": "vllm-worker-0",
"fault_type": "XID_ERROR",
"severity": "HIGH"
}'
```
#### Inject Specific XID Error
```bash
# Inject XID 79 (GPU memory page fault)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
-H "Content-Type: application/json" \
-d '{"target_pod": "vllm-worker-0"}'
```
Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120
#### Inject Network Partition
```bash
curl -X POST http://localhost:8080/api/v1/faults/network/inject \
-H "Content-Type: application/json" \
-d '{
"partition_type": "FRONTEND_WORKER",
"duration_seconds": 30
}'
```
#### Recover from Fault
```bash
curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
```
#### List Active Faults
```bash
curl http://localhost:8080/api/v1/faults
```
### GPU Fault Injector Agent
The GPU fault injector runs as a DaemonSet on worker nodes:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: gpu-fault-injector
spec:
selector:
matchLabels:
app: gpu-fault-injector
template:
spec:
containers:
- name: agent
image: dynamo/gpu-fault-injector:latest
securityContext:
privileged: true
volumeMounts:
- name: dev
mountPath: /dev
```
The agent injects fake XID messages via `/dev/kmsg` to trigger NVSentinel detection.
## Deployment Testing Framework
The `deploy/` directory contains an end-to-end testing framework.
### Test Phases
Tests run through three phases:
| Phase | Description |
|-------|-------------|
| `STANDARD` | Baseline performance under normal conditions |
| `OVERFLOW` | System behavior during fault/overload |
| `RECOVERY` | System recovery after fault resolution |
### Scenario Configuration
Define test scenarios in `scenarios.py`:
```python
from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure
scenario = Scenario(
name="worker_failure_migration",
backend="vllm",
load=Load(
clients=10,
requests_per_client=100,
max_tokens=256
),
failure=Failure(
type="pod_kill",
target="vllm-worker-0",
trigger_after_requests=50
)
)
```
### Running Deployment Tests
```bash
# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v
# Run specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
```
### Validation Checkers
The framework includes pluggable validators:
```python
from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext
class MigrationChecker(BaseChecker):
def check(self, context: ValidationContext) -> bool:
# Verify migrations occurred
migrations = context.metrics.get("migrations_total", 0)
return migrations > 0
```
### Results Parsing
Parse test results for analysis:
```python
from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test
results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")
```
## Client Utilities
The `client.py` module provides shared client functionality:
### Multi-Threaded Load Generation
```python
from tests.fault_tolerance.client import client
# Generate load with multiple clients
results = client(
base_url="http://localhost:8000",
num_clients=10,
requests_per_client=100,
model="Qwen/Qwen3-0.6B",
max_tokens=256,
log_dir="/tmp/test_logs"
)
```
### Request Options
| Parameter | Description |
|-----------|-------------|
| `base_url` | Frontend URL |
| `num_clients` | Number of concurrent clients |
| `requests_per_client` | Requests per client |
| `model` | Model name |
| `max_tokens` | Max tokens per request |
| `log_dir` | Directory for client logs |
| `endpoint` | `completions` or `chat/completions` |
## Running the Full Test Suite
### Prerequisites
1. Kubernetes cluster with GPU nodes
2. Dynamo deployment
3. etcd cluster (for HA tests)
4. Fault injection service (for hardware tests)
### Environment Setup
```bash
export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000
```
### Run All Tests
```bash
# Install test dependencies
pip install pytest pytest-asyncio
# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short
# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"
```
### Test Markers
| Marker | Description |
|--------|-------------|
| `slow` | Long-running tests (> 5 minutes) |
| `gpu` | Requires GPU resources |
| `k8s` | Requires Kubernetes cluster |
| `etcd_ha` | Requires multi-node etcd |
## Best Practices
### 1. Isolate Test Environments
Run fault tolerance tests in dedicated namespaces:
```bash
kubectl create namespace dynamo-fault-test
```
### 2. Clean Up After Tests
Ensure fault injection is recovered:
```bash
# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
```
### 3. Collect Logs
Preserve logs for debugging:
```bash
pytest tests/fault_tolerance/ -v \
--log-dir=/tmp/fault_test_logs \
--capture=no
```
### 4. Monitor During Tests
Watch system state during tests:
```bash
# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test
# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
```
## Related Documentation
- [Request Migration](request-migration.md) - Migration implementation details
- [Request Cancellation](request-cancellation.md) - Cancellation implementation
- [Health Checks](../observability/health-checks.md) - Health monitoring
- [Metrics](../observability/metrics.md) - Available metrics for monitoring
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KServe gRPC frontend"
---
## Motivation
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry standard protocol for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend.
This documentation assumes readers are familiar with the usage of KServe v2 API and focuses on explaining the Dynamo parts that work together to support KServe API and how users may migrate existing KServe deployment to Dynamo.
## Supported Endpoints
* `ModelInfer` endpoint: KServe Standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1)
* `ModelStreamInfer` endpoint: Triton extension endpoint that provide bi-directional streaming version of the inference RPC to allow a sequence of inference requests/responses to be sent over a GRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92)
* `ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1)
* `ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)
## Starting the Frontend
To start the KServe frontend, run the below command
```
python -m dynamo.frontend --kserve-grpc-server
```
## Registering a Backend
Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_llm()` API will be used. Currently the frontend support serving of the following model type and model input combination:
* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor based inference
The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail:
### OpenAI Completions
Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor based messages and OpenAI message and imposes specific structure of the KServe request message.
#### Model Metadata / Config
The metadata and config endpoint will report the registered backend to have the below, note that this is not the exact response.
```
{
name: $MODEL_NAME,
version: 1,
platform: "dynamo",
backend: "dynamo", # model config specific
inputs: [
{
name: "text_input",
datatype: "BYTES",
shape: [1]
},
{
name: "streaming",
datatype: "BOOL",
shape: [1],
optional: true
}
]
outputs: [
{
name: "text_output",
datatype: "BYTES",
shape: [-1]
},
{
name: "finish_reason",
datatype: "BYTES",
shape: [-1],
optional: true
}
]
}
```
#### Inference
On receiving inference request, the following conversion will be performed:
* `text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request
* `streaming`: the element will be converted to `stream` field in OpenAI Completion request
On receiving model response, the following conversion will be performed:
* `text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice.
* `finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice.
### Tensor
This combination is used when the user is migrating an existing KServe based backend into Dynamo ecosystem.
#### Model Metadata / Config
When registering the backend, the backend must provide the model's metadata as tensor based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata:
* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for model config endpoint, the rest of the fields will be set to their default values.
* [triton_model_config](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which will supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for example.
#### Inference
When receiving inference request, the backend will receive [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo.
## Python Bindings
The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for example.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Examples"
---
Explore practical examples to get started with NVIDIA Dynamo.
## Quick Start Examples
The [examples directory](https://github.com/ai-dynamo/dynamo/tree/main/examples) in the Dynamo repository contains ready-to-run examples for various use cases.
### Backend Examples
| Backend | Description | Link |
|---------|-------------|------|
| **vLLM** | Run inference with vLLM backend | [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm) |
| **SGLang** | Run inference with SGLang backend | [examples/backends/sglang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang) |
| **TensorRT-LLM** | Run inference with TensorRT-LLM backend | [examples/backends/trtllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm) |
### Deployment Examples
| Example | Description | Link |
|---------|-------------|------|
| **Basic Deployment** | Simple single-node deployment | [examples/basics](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics) |
| **Kubernetes** | Deploy on Kubernetes | [examples/deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/deployments) |
| **Multimodal** | Vision and multimodal models | [examples/multimodal](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal) |
### Custom Backend Examples
Learn how to create custom backends:
| Example | Description | Link |
|---------|-------------|------|
| **Custom Backend** | Build your own backend | [examples/custom_backend](https://github.com/ai-dynamo/dynamo/tree/main/examples/custom_backend) |
## Running Examples
Most examples can be run directly after installing Dynamo:
```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Navigate to an example
cd examples/backends/sglang
# Follow the README in each example directory
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Installation"
---
## Pip (PyPI)
Install a pre-built wheel from PyPI.
```bash
# Create a virtual environment and activate it
uv venv venv
source venv/bin/activate
# Install Dynamo from PyPI (choose one backend extra)
uv pip install "ai-dynamo[sglang]" # or [vllm], [trtllm]
```
## Pip from source
Install directly from a local checkout for development.
```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Create a virtual environment and activate it
uv venv venv
source venv/bin/activate
uv pip install ".[sglang]" # or [vllm], [trtllm]
```
## Docker
Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`).
```bash
# Run a container (mount your workspace if needed)
docker run --rm -it \
--gpus all \
--network host \
nvcr.io/nvidia/ai-dynamo/sglang-runtime:latest # or vllm, tensorrtllm
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Welcome to NVIDIA Dynamo"
---
The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale.
<Tip>
**Discover the Latest Developments!**
This guide is a snapshot of a specific point in time. For the latest information, examples, and Release Assets, see the [Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo/releases/latest).
</Tip>
## Quickstart
Get started with Dynamo locally in just a few commands:
### 1. Install Dynamo
```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install Dynamo
uv venv venv
source venv/bin/activate
# Use prerelease flag to install RC versions of flashinfer and/or other dependencies
uv pip install --prerelease=allow "ai-dynamo[sglang]" # or [vllm], [trtllm]
```
### 2. Start etcd/NATS
```bash
# Fetch and start etcd and NATS using Docker Compose
VERSION=$(uv pip show ai-dynamo | grep Version | cut -d' ' -f2)
curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/refs/tags/v${VERSION}/deploy/docker-compose.yml
docker compose -f docker-compose.yml up -d
```
### 3. Run Dynamo
```bash
# Start the OpenAI compatible frontend (default port is 8000)
python -m dynamo.frontend
# In another terminal, start an SGLang worker
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
```
### 4. Test Your Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50}'
```
## Key Features
| Feature | Description |
|---------|-------------|
| **Multi-Backend Support** | vLLM, SGLang, and TensorRT-LLM backends |
| **Disaggregated Serving** | Separate prefill and decode for optimal performance |
| **KV Cache Routing** | Intelligent request routing based on KV cache state |
| **Kubernetes Native** | Full operator and Helm chart support |
| **Observability** | Prometheus metrics, Grafana dashboards, and tracing |
## Documentation Overview
### Backends
- [vLLM Backend](../backends/vllm/README.md) - High-throughput serving with vLLM
- [SGLang Backend](../backends/sglang/README.md) - Fast inference with SGLang
- [TensorRT-LLM Backend](../backends/trtllm/README.md) - Optimized inference with TensorRT-LLM
### Kubernetes Deployment
- [Installation Guide updated](../kubernetes/installation-guide.md) - Deploy Dynamo on Kubernetes
- [Operator Guide](../kubernetes/dynamo-operator.md) - Using the Dynamo Operator
- [Autoscaling](../kubernetes/autoscaling.md) - Automatic scaling configuration
### Architecture
- [System Architecture](../design-docs/architecture.md) - Overall system design
- [Disaggregated Serving](../design-docs/disagg-serving.md) - P/D separation architecture
- [Distributed Runtime](../design-docs/distributed-runtime.md) - Runtime internals
### Performance & Tuning
- [Performance Tuning](../performance/tuning.md) - Optimize your deployment
- [Benchmarking](../benchmarks/benchmarking.md) - Measure and compare performance
- [AI Configurator](../performance/aiconfigurator.md) - Automated configuration
## Getting Help
- **GitHub Issues**: [Report bugs or request features](https://github.com/ai-dynamo/dynamo/issues)
- **Discussions**: [Ask questions and share ideas](https://github.com/ai-dynamo/dynamo/discussions)
- **Reference**: [CLI Reference](../reference/cli.md) | [Glossary](../reference/glossary.md) | [Support Matrix](./support-matrix.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Support Matrix"
---
This document provides the support matrix for Dynamo, including hardware, software and build instructions.
## Hardware Compatibility
| **CPU Architecture** | **Status** |
| :------------------- | :----------- |
| **x86_64** | Supported |
| **ARM64** | Supported |
### GPU Compatibility
If you are using a **GPU**, the following GPU models and architectures are supported:
| **GPU Architecture** | **Status** |
| :----------------------------------- | :--------- |
| **NVIDIA Blackwell Architecture** | Supported |
| **NVIDIA Hopper Architecture** | Supported |
| **NVIDIA Ada Lovelace Architecture** | Supported |
| **NVIDIA Ampere Architecture** | Supported |
## Platform Architecture Compatibility
**Dynamo** is compatible with the following platforms:
| **Operating System** | **Version** | **Architecture** | **Status** |
| :------------------- | :---------- | :--------------- | :----------- |
| **Ubuntu** | 22.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | ARM64 | Supported |
| **CentOS Stream** | 9 | x86_64 | Experimental |
<Note>
Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04).
Compatibility with other Linux distributions is expected but has not been officially verified yet.
</Note>
<Error>
KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
</Error>
## Software Compatibility
### Runtime Dependency
| **Python Package** | **Version** | glibc version | CUDA Version |
| :----------------- | :---------- | :------------------------------------ | :----------- |
| ai-dynamo | 0.8.0 | >=2.28 | |
| ai-dynamo-runtime | 0.8.0 | >=2.28 (Python 3.12 has known issues) | |
| NIXL | 0.8.0 | >=2.27 | >=11.8 |
### Build Dependency
The following table shows the dependency versions included with each Dynamo release:
| **Dependency** | **main (ToT)** | **v0.8.0 (unreleased)** | **v0.7.1** | **v0.7.0.post1** | **v0.7.0** |
| :------------- | :------------- | :---------------------- | :--------- | :--------------- | :--------- |
| SGLang | 0.5.7 | 0.5.7 | 0.5.3.post4| 0.5.3.post4 | 0.5.3.post4|
| TensorRT-LLM | 1.2.0rc6 | 1.2.0rc6 | 1.2.0rc3 | 1.2.0rc3 | 1.2.0rc2 |
| vLLM | 0.13.0 | 0.12.0 | 0.11.0 | 0.11.0 | 0.11.0 |
| NIXL | 0.8.0 | 0.8.0 | 0.8.0 | 0.8.0 | 0.8.0 |
<Note>
**main (ToT)** reflects the current development branch. **v0.8.0** is the upcoming release (planned for January 14, 2025) and not yet available.
</Note>
<Warning>
Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently TensorRT-LLM does not support Python 3.11 so installation of the ai-dynamo[trtllm] will fail.
</Warning>
### CUDA Support by Framework
| **Dynamo Version** | **SGLang** | **TensorRT-LLM** | **vLLM** |
| :------------------- | :-----------------------| :-----------------------| :-----------------------|
| **Dynamo 0.7.1** | CUDA 12.8 | CUDA 13.0 | CUDA 12.9 |
## Cloud Service Provider Compatibility
### AWS
| **Host Operating System** | **Version** | **Architecture** | **Status** |
| :------------------------ | :---------- | :--------------- | :--------- |
| **Amazon Linux** | 2023 | x86_64 | Supported¹ |
<Error>
There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
</Error>
## Build Support
**Dynamo** currently provides build support in the following ways:
- **Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
- [ai-dynamo](https://pypi.org/project/ai-dynamo/)
- [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/)
- **New as of Dynamo v0.7.0:** [kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
- **Dynamo Runtime Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Runtime for each of the LLM inference frameworks on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
- [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime)
- [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime)
- [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)
- **Dynamo Kubernetes Operator Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Operator on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
- [kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo:
- [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds)
- [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform)
- [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph)
- **Rust Crates**:
- [dynamo-runtime](https://crates.io/crates/dynamo-runtime/)
- [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai/)
- [dynamo-parsers](https://crates.io/crates/dynamo-parsers/)
- [dynamo-llm](https://crates.io/crates/dynamo-llm/)
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "JailedStream Implementation"
---
## Overview
The `JailedStream` is a standalone implementation for handling "jail" detection in token streams. It provides a clean, builder-based API for accumulating tokens when certain sequences are detected, then releasing them as a single chunk when the jail ends.
## Key Features
- **Builder Pattern**: Clean configuration API using the builder pattern
- **Configurable Sequences**: Support for multiple start/end jail sequences
- **Tool Call Parsing**: Integrated tool call detection and parsing
- **Stream Macro**: Uses `async-stream::stream!` for clean async implementation
- **Standalone**: Completely independent of existing code
- **Annotations**: Preserves annotations for observability
## Implementation
### Location
- Main implementation: `lib/llm/src/protocols/openai/chat_completions/jail.rs`
- Examples: `lib/llm/src/protocols/openai/chat_completions/jail_example.rs`
### Usage
```rust
use crate::protocols::openai::chat_completions::jail::JailedStream;
use dynamo_runtime::engine::{AsyncEngineContextProvider, ResponseStream};
// Get your ResponseStream with context
let response_stream: Pin<Box<ResponseStream<_>>> = get_stream_from_engine();
// Extract context BEFORE passing to apply
let context = response_stream.context();
// Apply jail transformation (ResponseStream implements Stream)
let jail = JailedStream::builder()
.tool_call_parser("nemotron_deci")
.build();
let jailed_stream = jail.apply(response_stream);
// Re-wrap with context when needed for engine consumption
let final_stream = ResponseStream::new(Box::pin(jailed_stream), context);
```
### Advanced Configuration
```rust
// With custom jail sequences
let jail = JailedStream::builder()
.jail_start_sequence("<TOOLCALL>")
.jail_end_sequence("</TOOLCALL>")
.tool_call_parser("nemotron_deci")
.build();
// With multiple sequences
let jail = JailedStream::builder()
.jail_start_sequences(vec!["<TOOLCALL>", "<FUNCTION>"])
.jail_end_sequences(vec!["</TOOLCALL>", "</FUNCTION>"])
.tool_call_parser("harmony")
.build();
```
## How It Works
1. **Detection**: When a jail start sequence (or tool call start) is detected, the stream enters "jail" mode
2. **Accumulation**: While jailed, tokens are accumulated in memory instead of being yielded
3. **Annotations**: Empty chunks with annotations are sent downstream for observability
4. **Release**: When a jail end sequence is detected OR the stream ends:
- Accumulated content is parsed for tool calls
- A single chunk with the parsed content is yielded
5. **Pass-through**: Non-jailed content passes through unchanged
## Testing
The implementation includes comprehensive tests:
- `test_jailed_stream_with_start_end_sequences`: Tests explicit jail sequences
- `test_jailed_stream_with_tool_calls`: Tests tool call detection and parsing
- `test_jailed_stream_no_jailing`: Tests normal pass-through behavior
Run tests with:
```bash
cargo test -p dynamo-llm jail --lib
```
## Benefits
1. **Standalone**: No modifications to existing code required
2. **Clean API**: Builder pattern makes configuration intuitive
3. **Flexible**: Supports multiple jail detection strategies
4. **Maintainable**: Uses `stream!` macro for cleaner async code
5. **Testable**: Comprehensive test suite with shared utilities
6. **Efficient**: No unnecessary boxing or context handling in the library
7. **Composable**: Can chain multiple stream transformers before re-adding context
## Performance Optimizations
- **No Boxing in Library**: Returns `impl Stream` instead of `Pin<Box<ResponseStream>>`
- **Stack Pinning**: Uses `tokio::pin!()` instead of `Box::pin()` for better performance
- **No Context Overhead**: JailedStream doesn't manage AsyncEngineContext
- **Lazy Evaluation**: Only processes what's needed
- **Efficient State Management**: Minimal cloning, only when entering jail state
## Integration Options
To replace the existing `apply_tool_calling_jail_internal` function:
```rust
// In preprocessor.rs
pub fn apply_tool_calling_jail_with_parser(
&self,
stream: ManyOut<Annotated<NvCreateChatCompletionStreamResponse>>,
) -> ManyOut<Annotated<NvCreateChatCompletionStreamResponse>> {
let jail = JailedStream::builder()
.tool_call_parser(self.tool_call_parser.clone())
.build();
jail.apply(stream)
}
```
## Future Enhancements
- Add support for regex patterns for jail sequences
- Add metrics/telemetry for jail detection
- Support for partial sequence matching across chunk boundaries
- Configurable accumulation limits
- Support for nested jails
\ No newline at end of file
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Request Planes User Guide"
---
## Overview
Dynamo supports multiple transport mechanisms for its request plane (the communication layer between services). You can choose from three different request plane modes based on your deployment requirements:
- **TCP** (default): Direct TCP connection for optimal performance
- **NATS**: Message broker-based request plane
- **HTTP**: HTTP/2-based request plane
This guide explains how to configure and use request plane in your Dynamo deployment.
## What is a Request Plane?
The request plane is the transport layer that handles communication between Dynamo services (e.g., frontend to backend, worker to worker). Different request planes offer different trade-offs:
| Request Plane | Suitable For | Characteristics |
|--------------|----------|-----------------|
| **NATS** | Production deployments with KV routing | Requires NATS infrastructure, provides pub/sub patterns, highest flexibility |
| **TCP** | Low-latency direct communication | Direct connections, minimal overhead |
| **HTTP** | Standard deployments, debugging | HTTP/2 protocol, easier observability with standard tools, widely compatible |
## Request Plane vs KV Event Plane
Dynamo has **two independent communication planes**:
- **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`.
- **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing.
**Note:** if you are using `tcp` or `http` request plane and choose to use NATS for KV events, you must still configure NATS server using `NATS_SERVER` environment variable, e.g. `NATS_SERVER=nats://nats-hostname:port`.
Because they are independent, you can mix them.
For example, a deployment with TCP request plane can use different KV event planes:
- **JetStream KV events**: requests use TCP, KV routing still uses NATS JetStream + object store for persistence.
- **NATS Core KV events (local indexer)**: requests use TCP, KV events use NATS Core pub/sub and persistence lives on workers.
- **no KV events**: requests use TCP and KV routing runs without events (no NATS required, but no event-backed persistence).
## Configuration
### Environment Variable
Set the request plane mode using the `DYN_REQUEST_PLANE` environment variable:
```bash
export DYN_REQUEST_PLANE=<mode>
```
Where `<mode>` is one of:
- `tcp` (default)
- `nats`
- `http`
The value is case-insensitive.
### Default Behavior
If `DYN_REQUEST_PLANE` is not set or contains an invalid value, Dynamo defaults to `tcp`.
## Usage Examples
### Using TCP (Default)
TCP is the default request plane and provides direct, low-latency communication between services.
**Configuration:**
```bash
# TCP is the default, so no need to set DYN_REQUEST_PLANE explicitly
# But you can explicitly set it if desired:
export DYN_REQUEST_PLANE=tcp
# Optional: Configure TCP server host and port
export DYN_TCP_RPC_HOST=0.0.0.0 # Default host
# export DYN_TCP_RPC_PORT=9999 # Optional: specify a fixed port
# Run your Dynamo service
DYN_REQUEST_PLANE=tcp python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
**Note:** By default, TCP uses an OS-assigned free port (port 0). This is ideal for environments where multiple services may run on the same machine or when you want to avoid port conflicts. If you need a specific port (e.g., for firewall rules), set `DYN_TCP_RPC_PORT` explicitly.
**When to use TCP:**
- Simple deployments with direct service-to-service communication (e.g. frontend to backend)
- Minimal infrastructure requirements (**no NATS needed unless you enable KV-event-backed routing/replica sync**)
- Low-latency requirements
**TCP Configuration Options:**
Additional TCP-specific environment variables:
- `DYN_TCP_RPC_HOST`: Server host address (default: auto-detected)
- `DYN_TCP_RPC_PORT`: Server port. If not set, the OS assigns a free port automatically (recommended for most deployments). Set explicitly only if you need a specific port for firewall rules.
- `DYN_TCP_MAX_MESSAGE_SIZE`: Maximum message size for TCP client (default: 32MB)
- `DYN_TCP_REQUEST_TIMEOUT`: Request timeout for TCP client (default: 10 seconds)
- `DYN_TCP_POOL_SIZE`: Connection pool size for TCP client (default: 50)
- `DYN_TCP_CONNECT_TIMEOUT`: Connect timeout for TCP client (default: 3 seconds)
- `DYN_TCP_CHANNEL_BUFFER`: Request channel buffer size for TCP client (default: 100)
### Using HTTP
HTTP/2 provides a standards-based request plane that's easy to debug and widely compatible.
**Configuration:**
```bash
# Optional: Configure HTTP server host and port
export DYN_HTTP_RPC_HOST=0.0.0.0 # Default host
export DYN_HTTP_RPC_PORT=8888 # Default port
export DYN_HTTP_RPC_ROOT_PATH=/v1/rpc # Default path
# Run your Dynamo service
DYN_REQUEST_PLANE=http python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=http python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
**When to use HTTP:**
- Standard deployments requiring HTTP compatibility
- Debugging scenarios (use curl, browser tools, etc.)
- Integration with HTTP-based infrastructure
- Load balancers and proxies that work with HTTP
**HTTP Configuration Options:**
Additional HTTP-specific environment variables:
- `DYN_HTTP_RPC_HOST`: Server host address (default: auto-detected)
- `DYN_HTTP_RPC_PORT`: Server port (default: 8888)
- `DYN_HTTP_RPC_ROOT_PATH`: Root path for RPC endpoints (default: /v1/rpc)
`DYN_HTTP2_*`: Various HTTP/2 client configuration options
- `DYN_HTTP2_MAX_FRAME_SIZE`: Maximum frame size for HTTP client (default: 1MB)
- `DYN_HTTP2_MAX_CONCURRENT_STREAMS`: Maximum concurrent streams for HTTP client (default: 1000)
- `DYN_HTTP2_POOL_MAX_IDLE_PER_HOST`: Maximum idle connections per host for HTTP client (default: 100)
- `DYN_HTTP2_POOL_IDLE_TIMEOUT_SECS`: Idle timeout for HTTP client (default: 90 seconds)
- `DYN_HTTP2_KEEP_ALIVE_INTERVAL_SECS`: Keep-alive interval for HTTP client (default: 30 seconds)
- `DYN_HTTP2_KEEP_ALIVE_TIMEOUT_SECS`: Keep-alive timeout for HTTP client (default: 10 seconds)
- `DYN_HTTP2_ADAPTIVE_WINDOW`: Enable adaptive flow control (default: true)
### Using NATS
NATS provides durable jetstream messaging for request plane and can be used for KV events (and router replica sync).
**Prerequisites:**
- NATS server must be running and accessible
- Configure NATS connection via standard Dynamo NATS environment variables
```bash
# Explicitly set to NATS
export DYN_REQUEST_PLANE=nats
# Run your Dynamo service
DYN_REQUEST_PLANE=nats python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
**When to use NATS:**
- Production deployments with service discovery
- Currently KV based routing require NATS. If you want to completely disable NATS, KV based routing won't be available
- Need for message replay and persistence features
Limitations:
- NATS does not support payloads beyond 16MB (use TCP for larger payloads)
## Complete Example
Here's a complete example showing how to launch a Dynamo deployment with different request planes:
See [`examples/backends/vllm/launch/agg_request_planes.sh`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/agg_request_planes.sh) for a complete working example that demonstrates launching Dynamo with TCP, HTTP, or NATS request planes.
## Real-World Example
The Dynamo repository includes a complete example demonstrating all three request planes:
**Location:** `examples/backends/vllm/launch/agg_request_planes.sh`
```bash
cd examples/backends/vllm/launch
# Run with TCP
./agg_request_planes.sh --tcp
# Run with HTTP
./agg_request_planes.sh --http
# Run with NATS
./agg_request_planes.sh --nats
```
## Architecture Details
### Network Manager
The request plane implementation is centralized in the Network Manager (`lib/runtime/src/pipeline/network/manager.rs`), which:
1. Reads the `DYN_REQUEST_PLANE` environment variable at startup
2. Creates the appropriate server and client implementations
3. Provides a transport-agnostic interface to the rest of the codebase
4. Manages all network configuration and lifecycle
### Transport Abstraction
All request plane implementations conform to common trait interfaces:
- `RequestPlaneServer`: Server-side interface for receiving requests
- `RequestPlaneClient`: Client-side interface for sending requests
This abstraction means your application code doesn't need to change when switching request planes.
### Configuration Loading
Request plane configuration is loaded from environment variables at startup and cached globally. The configuration hierarchy is:
1. **Mode Selection**: `DYN_REQUEST_PLANE` (defaults to `tcp`)
2. **Transport-Specific Config**: Mode-specific environment variables (e.g., `DYN_TCP_*`, `DYN_HTTP2_*`)
## Migration Guide
### From NATS to TCP
1. Stop your Dynamo services
2. Set environment variable `DYN_REQUEST_PLANE=tcp`
3. Optionally configure TCP-specific settings (e.g., `DYN_TCP_RPC_HOST`). Note: `DYN_TCP_RPC_PORT` is optional; if not set, an OS-assigned free port is used automatically.
4. Restart your services
### From NATS to HTTP
1. Stop your Dynamo services
2. Set environment variable `DYN_REQUEST_PLANE=http`
3. Optionally configure HTTP-specific settings (`DYN_HTTP_RPC_PORT`, etc.)
4. Restart your services
### Testing the Migration
After switching request planes, verify your deployment:
```bash
# Test with a simple request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
## Troubleshooting
### Issue: Services Can't Communicate
**Symptoms:** Requests timeout or fail to reach the backend
**Solutions:**
- Verify all services use the same `DYN_REQUEST_PLANE` setting
- Check that server ports are not blocked by k8s network policies or firewalls
- For TCP/HTTP: Ensure host/port configurations are correct and accessible
- For NATS: Verify NATS server is running and accessible
### Issue: "Invalid request plane mode" Error
**Symptoms:** Service fails to start with configuration error
**Solutions:**
- Check `DYN_REQUEST_PLANE` spelling (valid values: `nats`, `tcp`, `http`)
- Value is case-insensitive but must be one of the three options
- If not set, defaults to `tcp`
### Issue: Port Conflicts
**Symptoms:** Server fails to start due to "address already in use"
**Solutions:**
- TCP: By default, TCP uses an OS-assigned free port, so port conflicts should be rare. If you explicitly set `DYN_TCP_RPC_PORT` to a specific port and get conflicts, either change the port or remove the setting to use automatic port assignment.
- HTTP default port: 8888 (adjust environment variable `DYN_HTTP_RPC_PORT`)
## Performance Considerations
### Latency
- **TCP**: Lowest latency due to direct connections and binary serialization
- **HTTP**: Moderate latency with HTTP/2 overhead
- **NATS**: Moderate latency due to nats jet stream persistence
### Resource Usage
- **TCP**: Minimal infrastructure (no additional services required)
- **HTTP**: Minimal infrastructure (no additional services required)
- **NATS**: Requires running NATS server (additional memory/CPU)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Deploying Dynamo on Kubernetes"
---
[Link to installation](../getting-started/installation.md)
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
## Important Terminology
**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created.
- Used for: Resource isolation, RBAC, organizing deployments
- Example: `dynamo-system`, `team-a-namespace`
**Dynamo Namespace**: The logical namespace used by Dynamo components for [service discovery](service-discovery.md).
- Used for: Runtime component communication, service discovery
- Specified in: `.spec.services.<ServiceName>.dynamoNamespace` field
- Example: `my-llm`, `production-model`, `dynamo-dev`
These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
## Prerequisites
Before you begin, ensure you have the following tools installed:
| Tool | Minimum Version | Installation Guide |
|------|-----------------|-------------------|
| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
Verify your installation:
```bash
kubectl version --client # Should show v1.24+
helm version # Should show v3.0+
```
For detailed installation instructions, see the [Prerequisites section](installation-guide.md#prerequisites) in the Installation Guide.
## Pre-deployment Checks
Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
```bash
./deploy/pre-deployment/pre-deployment-check.sh
```
This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for more details.
## 1. Install Platform First
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
**For Shared/Multi-Tenant Clusters:**
If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```
For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](installation-guide.md)**.
## 2. Choose Your Backend
Each backend has deployment examples and configuration options:
| Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
| **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## 3. Deploy Your First Model
```bash
export NAMESPACE=dynamo-system
kubectl create namespace ${NAMESPACE}
# to pull model from HF
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="$HF_TOKEN" \
-n ${NAMESPACE};
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
# Test it
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla-planner-quickstart.md).
## Understanding Dynamo's Custom Resources
Dynamo provides two main Kubernetes Custom Resources for deploying models:
### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
- Model name and backend framework
- SLA targets (latency requirements)
- GPU type (optional)
Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
- SLA-driven configuration generation
- Automated resource optimization
- Users who want simplicity over control
**Note**: DGDR generates a DGD spec which you can then use to deploy.
### DynamoGraphDeployment (DGD) - Direct Configuration
A lower-level interface that defines your complete inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
Use this when you need fine-grained control or have already completed profiling.
Refer to the [API Reference and Documentation](api-reference.md) for more details.
## 📖 API Reference & Documentation
For detailed technical specifications of Dynamo's Kubernetes resources:
- **[API Reference](api-reference.md)** - Complete CRD field specifications for all Dynamo resources
- **[Create Deployment](deployment/create-deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
- **[Operator Guide](dynamo-operator.md)** - Dynamo operator configuration and management
### Choosing Your Architecture Pattern
When creating a deployment, select the architecture pattern that best fits your use case:
- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
### Frontend and Worker Components
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
- Provides OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via [service discovery](service-discovery.md) (Kubernetes-native by default)
- Routes requests and handles load balancing
- Validates and preprocesses requests
### Customizing Your Deployment
Example structure:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-llm
spec:
services:
Frontend:
dynamoNamespace: my-llm
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: your-image
VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
dynamoNamespace: dynamo-dev
componentType: worker
replicas: 1
envFromSecret: hf-token-secret # for HuggingFace models
resources:
limits:
gpu: "1"
extraPodSpec:
mainContainer:
image: your-image
command: ["/bin/sh", "-c"]
args:
- python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
```
Worker command examples per backend:
```yaml
# vLLM worker
args:
- python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
# SGLang worker
args:
- >-
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--tp 1
--trust-remote-code
# TensorRT-LLM worker
args:
- python3 -m dynamo.trtllm
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
```
Key customization points include:
- **Model Configuration**: Specify model in the args command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
## Additional Resources
- **[Examples](../getting-started/examples.md)** - Complete working examples
- **[Create Custom Deployments](deployment/create-deployment.md)** - Build your own CRDs
- **[Managing Models with DynamoModel](deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models
- **[Operator Documentation](dynamo-operator.md)** - How the platform works
- **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
- **[Logging](observability/logging.md)** - For logging setup
- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
- **[Grove](grove.md)** - For grove details and custom installation
- **[Monitoring](observability/metrics.md)** - For monitoring setup
- **[Model Caching with Fluid](model-caching-with-fluid.md)** - For model caching with Fluid
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "API Reference"
---
<Info>
This documentation is automatically generated from source code.
Do not edit this file directly.
</Info>
## Packages
- [nvidia.com/v1alpha1](#nvidiacomv1alpha1)
## nvidia.com/v1alpha1
Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
This package defines the DynamoGraphDeploymentRequest (DGDR) custom resource, which provides
a high-level, SLA-driven interface for deploying machine learning models on Dynamo.
Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
### Resource Types
- [DynamoComponentDeployment](#dynamocomponentdeployment)
- [DynamoGraphDeployment](#dynamographdeployment)
- [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest)
- [DynamoGraphDeploymentScalingAdapter](#dynamographdeploymentscalingadapter)
- [DynamoModel](#dynamomodel)
#### Autoscaling
Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter
with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md
for migration guidance. This field will be removed in a future API version.
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `enabled` _boolean_ | Deprecated: This field is ignored. | | |
| `minReplicas` _integer_ | Deprecated: This field is ignored. | | |
| `maxReplicas` _integer_ | Deprecated: This field is ignored. | | |
| `behavior` _[HorizontalPodAutoscalerBehavior](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#horizontalpodautoscalerbehavior-v2-autoscaling)_ | Deprecated: This field is ignored. | | |
| `metrics` _[MetricSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#metricspec-v2-autoscaling) array_ | Deprecated: This field is ignored. | | |
#### ComponentKind
_Underlying type:_ _string_
ComponentKind represents the type of underlying Kubernetes resource.
_Validation:_
- Enum: [PodClique PodCliqueScalingGroup Deployment LeaderWorkerSet]
_Appears in:_
- [ServiceReplicaStatus](#servicereplicastatus)
| Field | Description |
| --- | --- |
| `PodClique` | ComponentKindPodClique represents a PodClique resource.<br /> |
| `PodCliqueScalingGroup` | ComponentKindPodCliqueScalingGroup represents a PodCliqueScalingGroup resource.<br /> |
| `Deployment` | ComponentKindDeployment represents a Deployment resource.<br /> |
| `LeaderWorkerSet` | ComponentKindLeaderWorkerSet represents a LeaderWorkerSet resource.<br /> |
#### ConfigMapKeySelector
ConfigMapKeySelector selects a specific key from a ConfigMap.
Used to reference external configuration data stored in ConfigMaps.
_Appears in:_
- [ProfilingConfigSpec](#profilingconfigspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name of the ConfigMap containing the desired data. | | Required: \{\} <br /> |
| `key` _string_ | Key in the ConfigMap to select. If not specified, defaults to "disagg.yaml". | disagg.yaml | |
#### DeploymentOverridesSpec
DeploymentOverridesSpec allows users to customize metadata for auto-created DynamoGraphDeployments.
When autoApply is enabled, these overrides are applied to the generated DGD resource.
_Appears in:_
- [DynamoGraphDeploymentRequestSpec](#dynamographdeploymentrequestspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the desired name for the created DynamoGraphDeployment.<br />If not specified, defaults to the DGDR name. | | Optional: \{\} <br /> |
| `namespace` _string_ | Namespace is the desired namespace for the created DynamoGraphDeployment.<br />If not specified, defaults to the DGDR namespace. | | Optional: \{\} <br /> |
| `labels` _object (keys:string, values:string)_ | Labels are additional labels to add to the DynamoGraphDeployment metadata.<br />These are merged with auto-generated labels from the profiling process. | | Optional: \{\} <br /> |
| `annotations` _object (keys:string, values:string)_ | Annotations are additional annotations to add to the DynamoGraphDeployment metadata. | | Optional: \{\} <br /> |
| `workersImage` _string_ | WorkersImage specifies the container image to use for DynamoGraphDeployment worker components.<br />This image is used for both temporary DGDs created during online profiling and the final DGD.<br />If omitted, the image from the base config file (e.g., disagg.yaml) is used.<br />Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" | | Optional: \{\} <br /> |
#### DeploymentStatus
DeploymentStatus tracks the state of an auto-created DynamoGraphDeployment.
This status is populated when autoApply is enabled and a DGD is created.
_Appears in:_
- [DynamoGraphDeploymentRequestStatus](#dynamographdeploymentrequeststatus)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the name of the created DynamoGraphDeployment. | | |
| `namespace` _string_ | Namespace is the namespace of the created DynamoGraphDeployment. | | |
| `state` _string_ | State is the current state of the DynamoGraphDeployment.<br />This value is mirrored from the DGD's status.state field. | | |
| `created` _boolean_ | Created indicates whether the DGD has been successfully created.<br />Used to prevent recreation if the DGD is manually deleted by users. | | |
#### DynamoComponentDeployment
DynamoComponentDeployment is the Schema for the dynamocomponentdeployments API
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoComponentDeployment` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)_ | Spec defines the desired state for this Dynamo component deployment. | | |
#### DynamoComponentDeploymentSharedSpec
_Appears in:_
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
- [DynamoGraphDeploymentSpec](#dynamographdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). | | |
| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. | | |
| `serviceName` _string_ | The name of the component | | |
| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). | | |
| `subComponentType` _string_ | SubComponentType indicates the sub-role of this component (for example, "prefill"). | | |
| `dynamoNamespace` _string_ | DynamoNamespace is deprecated and will be removed in a future version.<br />The DGD Kubernetes namespace and DynamoGraphDeployment name are used to construct the Dynamo namespace for each component | | Optional: \{\} <br /> |
| `globalDynamoNamespace` _boolean_ | GlobalDynamoNamespace indicates that the Component will be placed in the global Dynamo namespace | | |
| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter<br />with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md<br />for migration guidance. This field will be removed in a future API version. | | |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | |
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | |
| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | |
| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | |
| `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery | | |
| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | |
| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | |
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br /> |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. | | |
#### DynamoComponentDeploymentSpec
DynamoComponentDeploymentSpec defines the desired state of DynamoComponentDeployment
_Appears in:_
- [DynamoComponentDeployment](#dynamocomponentdeployment)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm") | | Enum: [sglang vllm trtllm] <br /> |
| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). | | |
| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. | | |
| `serviceName` _string_ | The name of the component | | |
| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). | | |
| `subComponentType` _string_ | SubComponentType indicates the sub-role of this component (for example, "prefill"). | | |
| `dynamoNamespace` _string_ | DynamoNamespace is deprecated and will be removed in a future version.<br />The DGD Kubernetes namespace and DynamoGraphDeployment name are used to construct the Dynamo namespace for each component | | Optional: \{\} <br /> |
| `globalDynamoNamespace` _boolean_ | GlobalDynamoNamespace indicates that the Component will be placed in the global Dynamo namespace | | |
| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter<br />with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md<br />for migration guidance. This field will be removed in a future API version. | | |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | |
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | |
| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | |
| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | |
| `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery | | |
| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | |
| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | |
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br /> |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. | | |
#### DynamoGraphDeployment
DynamoGraphDeployment is the Schema for the dynamographdeployments API.
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoGraphDeployment` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoGraphDeploymentSpec](#dynamographdeploymentspec)_ | Spec defines the desired state for this graph deployment. | | |
| `status` _[DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)_ | Status reflects the current observed state of this graph deployment. | | |
#### DynamoGraphDeploymentRequest
DynamoGraphDeploymentRequest is the Schema for the dynamographdeploymentrequests API.
It serves as the primary interface for users to request model deployments with
specific performance and resource constraints, enabling SLA-driven deployments.
Lifecycle:
1. Initial → Pending: Validates spec and prepares for profiling
2. Pending → Profiling: Creates and runs profiling job (online or AIC)
3. Profiling → Ready/Deploying: Generates DGD spec after profiling completes
4. Deploying → Ready: When autoApply=true, monitors DGD until Ready
5. Ready: Terminal state when DGD is operational or spec is available
6. DeploymentDeleted: Terminal state when auto-created DGD is manually deleted
The spec becomes immutable once profiling starts. Users must delete and recreate
the DGDR to modify configuration after this point.
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoGraphDeploymentRequest` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoGraphDeploymentRequestSpec](#dynamographdeploymentrequestspec)_ | Spec defines the desired state for this deployment request. | | |
| `status` _[DynamoGraphDeploymentRequestStatus](#dynamographdeploymentrequeststatus)_ | Status reflects the current observed state of this deployment request. | | |
#### DynamoGraphDeploymentRequestSpec
DynamoGraphDeploymentRequestSpec defines the desired state of a DynamoGraphDeploymentRequest.
This CRD serves as the primary interface for users to request model deployments with
specific performance constraints and resource requirements, enabling SLA-driven deployments.
_Appears in:_
- [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `model` _string_ | Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").<br />This is a high-level identifier for easy reference in kubectl output and logs.<br />The controller automatically sets this value in profilingConfig.config.deployment.model. | | Required: \{\} <br /> |
| `backend` _string_ | Backend specifies the inference backend for profiling.<br />The controller automatically sets this value in profilingConfig.config.engine.backend.<br />Profiling runs on real GPUs or via AIC simulation to collect performance data. | | Enum: [vllm sglang trtllm] <br />Required: \{\} <br /> |
| `useMocker` _boolean_ | UseMocker indicates whether to deploy a mocker DynamoGraphDeployment instead of<br />a real backend deployment. When true, the deployment uses simulated engines that<br />don't require GPUs, using the profiling data to simulate realistic timing behavior.<br />Mocker is available in all backend images and useful for large-scale experiments.<br />Profiling still runs against the real backend (specified above) to collect performance data. | false | |
| `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br />resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br />any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,<br />num_gpus_per_node) with values detected from the cluster.<br />Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\} <br /> |
| `profilingConfig` _[ProfilingConfigSpec](#profilingconfigspec)_ | ProfilingConfig provides the complete configuration for the profiling job.<br />This configuration is passed directly to the profiler.<br />The structure matches the profile_sla config format exactly (see ProfilingConfigSpec for schema).<br />Note: deployment.model and engine.backend are automatically set from the high-level<br />modelName and backend fields and should not be specified in this config. | | Required: \{\} <br /> |
| `autoApply` _boolean_ | AutoApply indicates whether to automatically create a DynamoGraphDeployment<br />after profiling completes. If false, only the spec is generated and stored in status.<br />Users can then manually create a DGD using the generated spec. | false | |
| `deploymentOverrides` _[DeploymentOverridesSpec](#deploymentoverridesspec)_ | DeploymentOverrides allows customizing metadata for the auto-created DGD.<br />Only applicable when AutoApply is true. | | Optional: \{\} <br /> |
#### DynamoGraphDeploymentRequestStatus
DynamoGraphDeploymentRequestStatus represents the observed state of a DynamoGraphDeploymentRequest.
The controller updates this status as the DGDR progresses through its lifecycle.
_Appears in:_
- [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `state` _string_ | State is a high-level textual status of the deployment request lifecycle.<br />Possible values: "", "Pending", "Profiling", "Deploying", "Ready", "DeploymentDeleted", "Failed"<br />Empty string ("") represents the initial state before initialization. | | |
| `backend` _string_ | Backend is extracted from profilingConfig.config.engine.backend for display purposes.<br />This field is populated by the controller and shown in kubectl output. | | Optional: \{\} <br /> |
| `observedGeneration` _integer_ | ObservedGeneration reflects the generation of the most recently observed spec.<br />Used to detect spec changes and enforce immutability after profiling starts. | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the deployment request.<br />Standard condition types include: Validation, Profiling, SpecGenerated, DeploymentReady.<br />Conditions are merged by type on patch updates. | | |
| `profilingResults` _string_ | ProfilingResults contains a reference to the ConfigMap holding profiling data.<br />Format: "configmap/\<name\>" | | Optional: \{\} <br /> |
| `generatedDeployment` _[RawExtension](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#rawextension-runtime-pkg)_ | GeneratedDeployment contains the full generated DynamoGraphDeployment specification<br />including metadata, based on profiling results. Users can extract this to create<br />a DGD manually, or it's used automatically when autoApply is true.<br />Stored as RawExtension to preserve all fields including metadata.<br />For mocker backends, this contains the mocker DGD spec. | | EmbeddedResource: \{\} <br />Optional: \{\} <br /> |
| `deployment` _[DeploymentStatus](#deploymentstatus)_ | Deployment tracks the auto-created DGD when AutoApply is true.<br />Contains name, namespace, state, and creation status of the managed DGD. | | Optional: \{\} <br /> |
#### DynamoGraphDeploymentScalingAdapter
DynamoGraphDeploymentScalingAdapter provides a scaling interface for individual services
within a DynamoGraphDeployment. It implements the Kubernetes scale
subresource, enabling integration with HPA, KEDA, and custom autoscalers.
The adapter acts as an intermediary between autoscalers and the DGD,
ensuring that only the adapter controller modifies the DGD's service replicas.
This prevents conflicts when multiple autoscaling mechanisms are in play.
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoGraphDeploymentScalingAdapter` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoGraphDeploymentScalingAdapterSpec](#dynamographdeploymentscalingadapterspec)_ | | | |
| `status` _[DynamoGraphDeploymentScalingAdapterStatus](#dynamographdeploymentscalingadapterstatus)_ | | | |
#### DynamoGraphDeploymentScalingAdapterSpec
DynamoGraphDeploymentScalingAdapterSpec defines the desired state of DynamoGraphDeploymentScalingAdapter
_Appears in:_
- [DynamoGraphDeploymentScalingAdapter](#dynamographdeploymentscalingadapter)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `replicas` _integer_ | Replicas is the desired number of replicas for the target service.<br />This field is modified by external autoscalers (HPA/KEDA/Planner) or manually by users. | | Minimum: 0 <br />Required: \{\} <br /> |
| `dgdRef` _[DynamoGraphDeploymentServiceRef](#dynamographdeploymentserviceref)_ | DGDRef references the DynamoGraphDeployment and the specific service to scale. | | Required: \{\} <br /> |
#### DynamoGraphDeploymentScalingAdapterStatus
DynamoGraphDeploymentScalingAdapterStatus defines the observed state of DynamoGraphDeploymentScalingAdapter
_Appears in:_
- [DynamoGraphDeploymentScalingAdapter](#dynamographdeploymentscalingadapter)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `replicas` _integer_ | Replicas is the current number of replicas for the target service.<br />This is synced from the DGD's service replicas and is required for the scale subresource. | | |
| `selector` _string_ | Selector is a label selector string for the pods managed by this adapter.<br />Required for HPA compatibility via the scale subresource. | | |
| `lastScaleTime` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#time-v1-meta)_ | LastScaleTime is the last time the adapter scaled the target service. | | |
#### DynamoGraphDeploymentServiceRef
DynamoGraphDeploymentServiceRef identifies a specific service within a DynamoGraphDeployment
_Appears in:_
- [DynamoGraphDeploymentScalingAdapterSpec](#dynamographdeploymentscalingadapterspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name of the DynamoGraphDeployment | | MinLength: 1 <br />Required: \{\} <br /> |
| `serviceName` _string_ | ServiceName is the key name of the service within the DGD's spec.services map to scale | | MinLength: 1 <br />Required: \{\} <br /> |
#### DynamoGraphDeploymentSpec
DynamoGraphDeploymentSpec defines the desired state of DynamoGraphDeployment.
_Appears in:_
- [DynamoGraphDeployment](#dynamographdeployment)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `pvcs` _[PVC](#pvc) array_ | PVCs defines a list of persistent volume claims that can be referenced by components.<br />Each PVC must have a unique name that can be referenced in component specifications. | | MaxItems: 100 <br />Optional: \{\} <br /> |
| `services` _object (keys:string, values:[DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec))_ | Services are the services to deploy as part of this deployment. | | MaxProperties: 25 <br />Optional: \{\} <br /> |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs are environment variables applied to all services in the deployment unless<br />overridden by service-specific configuration. | | Optional: \{\} <br /> |
| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm"). | | Enum: [sglang vllm trtllm] <br /> |
#### DynamoGraphDeploymentStatus
DynamoGraphDeploymentStatus defines the observed state of DynamoGraphDeployment.
_Appears in:_
- [DynamoGraphDeployment](#dynamographdeployment)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.<br />The slice is merged by type on patch updates. | | |
| `services` _object (keys:string, values:[ServiceReplicaStatus](#servicereplicastatus))_ | Services contains per-service replica status information.<br />The map key is the service name from spec.services. | | |
#### DynamoModel
DynamoModel is the Schema for the dynamo models API
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoModel` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoModelSpec](#dynamomodelspec)_ | | | |
| `status` _[DynamoModelStatus](#dynamomodelstatus)_ | | | |
#### DynamoModelSpec
DynamoModelSpec defines the desired state of DynamoModel
_Appears in:_
- [DynamoModel](#dynamomodel)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `modelName` _string_ | ModelName is the full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct-lora") | | Required: \{\} <br /> |
| `baseModelName` _string_ | BaseModelName is the base model identifier that matches the service label<br />This is used to discover endpoints via headless services | | Required: \{\} <br /> |
| `modelType` _string_ | ModelType specifies the type of model (e.g., "base", "lora", "adapter") | base | Enum: [base lora adapter] <br /> |
| `source` _[ModelSource](#modelsource)_ | Source specifies the model source location (only applicable for lora model type) | | |
#### DynamoModelStatus
DynamoModelStatus defines the observed state of DynamoModel
_Appears in:_
- [DynamoModel](#dynamomodel)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `endpoints` _[EndpointInfo](#endpointinfo) array_ | Endpoints is the current list of all endpoints for this model | | |
| `readyEndpoints` _integer_ | ReadyEndpoints is the count of endpoints that are ready | | |
| `totalEndpoints` _integer_ | TotalEndpoints is the total count of endpoints | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions represents the latest available observations of the model's state | | |
#### EndpointInfo
EndpointInfo represents a single endpoint (pod) serving the model
_Appears in:_
- [DynamoModelStatus](#dynamomodelstatus)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `address` _string_ | Address is the full address of the endpoint (e.g., "http://10.0.1.5:9090") | | |
| `podName` _string_ | PodName is the name of the pod serving this endpoint | | |
| `ready` _boolean_ | Ready indicates whether the endpoint is ready to serve traffic<br />For LoRA models: true if the POST /loras request succeeded with a 2xx status code<br />For base models: always false (no probing performed) | | |
#### ExtraPodMetadata
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `annotations` _object (keys:string, values:string)_ | | | |
| `labels` _object (keys:string, values:string)_ | | | |
#### ExtraPodSpec
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `mainContainer` _[Container](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#container-v1-core)_ | | | |
#### IngressSpec
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `enabled` _boolean_ | Enabled exposes the component through an ingress or virtual service when true. | | |
| `host` _string_ | Host is the base host name to route external traffic to this component. | | |
| `useVirtualService` _boolean_ | UseVirtualService indicates whether to configure a service-mesh VirtualService instead of a standard Ingress. | | |
| `virtualServiceGateway` _string_ | VirtualServiceGateway optionally specifies the gateway name to attach the VirtualService to. | | |
| `hostPrefix` _string_ | HostPrefix is an optional prefix added before the host. | | |
| `annotations` _object (keys:string, values:string)_ | Annotations to set on the generated Ingress/VirtualService resources. | | |
| `labels` _object (keys:string, values:string)_ | Labels to set on the generated Ingress/VirtualService resources. | | |
| `tls` _[IngressTLSSpec](#ingresstlsspec)_ | TLS holds the TLS configuration used by the Ingress/VirtualService. | | |
| `hostSuffix` _string_ | HostSuffix is an optional suffix appended after the host. | | |
| `ingressControllerClassName` _string_ | IngressControllerClassName selects the ingress controller class (e.g., "nginx"). | | |
#### IngressTLSSpec
_Appears in:_
- [IngressSpec](#ingressspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `secretName` _string_ | SecretName is the name of a Kubernetes Secret containing the TLS certificate and key. | | |
#### ModelReference
ModelReference identifies a model served by this component
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the base model identifier (e.g., "llama-3-70b-instruct-v1") | | Required: \{\} <br /> |
| `revision` _string_ | Revision is the model revision/version (optional) | | |
#### ModelSource
ModelSource defines the source location of a model
_Appears in:_
- [DynamoModelSpec](#dynamomodelspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `uri` _string_ | URI is the model source URI<br />Supported formats:<br />- S3: s3://bucket/path/to/model<br />- HuggingFace: hf://org/model@revision_sha | | Required: \{\} <br /> |
#### MultinodeSpec
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `nodeCount` _integer_ | Indicates the number of nodes to deploy for multinode components.<br />Total number of GPUs is NumberOfNodes * GPU limit.<br />Must be greater than 1. | 2 | Minimum: 2 <br /> |
#### PVC
_Appears in:_
- [DynamoGraphDeploymentSpec](#dynamographdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `create` _boolean_ | Create indicates to create a new PVC | | |
| `name` _string_ | Name is the name of the PVC | | Required: \{\} <br /> |
| `storageClass` _string_ | StorageClass to be used for PVC creation. Required when create is true. | | |
| `size` _[Quantity](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#quantity-resource-api)_ | Size of the volume in Gi, used during PVC creation. Required when create is true. | | |
| `volumeAccessMode` _[PersistentVolumeAccessMode](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#persistentvolumeaccessmode-v1-core)_ | VolumeAccessMode is the volume access mode of the PVC. Required when create is true. | | |
#### ProfilingConfigSpec
ProfilingConfigSpec defines configuration for the profiling process.
This structure maps directly to the profile_sla.py config format.
See benchmarks/profiler/utils/profiler_argparse.py for the complete schema.
_Appears in:_
- [DynamoGraphDeploymentRequestSpec](#dynamographdeploymentrequestspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `config` _[JSON](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#json-v1-apiextensions-k8s-io)_ | Config is the profiling configuration as arbitrary JSON/YAML. This will be passed directly to the profiler.<br />The profiler will validate the configuration and report any errors. | | Optional: \{\} <br />Type: object <br /> |
| `configMapRef` _[ConfigMapKeySelector](#configmapkeyselector)_ | ConfigMapRef is an optional reference to a ConfigMap containing the DynamoGraphDeployment<br />base config file (disagg.yaml). This is separate from the profiling config above.<br />The path to this config will be set as engine.config in the profiling config. | | Optional: \{\} <br /> |
| `profilerImage` _string_ | ProfilerImage specifies the container image to use for profiling jobs.<br />This image contains the profiler code and dependencies needed for SLA-based profiling.<br />Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" | | Required: \{\} <br /> |
| `outputPVC` _string_ | OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.<br />If specified, all profiling artifacts (logs, plots, configs, raw data) will be written<br />to this PVC instead of an ephemeral emptyDir volume. This allows users to access<br />complete profiling results after the job completes by mounting the PVC.<br />The PVC must exist in the same namespace as the DGDR.<br />If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.<br />Note: ConfigMaps are still created regardless of this setting for planner integration. | | Optional: \{\} <br /> |
| `resources` _[ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourcerequirements-v1-core)_ | Resources specifies the compute resource requirements for the profiling job container.<br />If not specified, no resource requests or limits are set. | | Optional: \{\} <br /> |
| `tolerations` _[Toleration](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#toleration-v1-core) array_ | Tolerations allows the profiling job to be scheduled on nodes with matching taints.<br />For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint. | | Optional: \{\} <br /> |
#### ResourceItem
_Appears in:_
- [Resources](#resources)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `cpu` _string_ | CPU specifies the CPU resource request/limit (e.g., "1000m", "2") | | |
| `memory` _string_ | Memory specifies the memory resource request/limit (e.g., "4Gi", "8Gi") | | |
| `gpu` _string_ | GPU indicates the number of GPUs to request.<br />Total number of GPUs is NumberOfNodes * GPU in case of multinode deployment. | | |
| `gpuType` _string_ | GPUType can specify a custom GPU type, e.g. "gpu.intel.com/xe"<br />By default if not specified, the GPU type is "nvidia.com/gpu" | | |
| `custom` _object (keys:string, values:string)_ | Custom specifies additional custom resource requests/limits | | |
#### Resources
Resources defines requested and limits for a component, including CPU, memory,
GPUs/devices, and any runtime-specific resources.
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `requests` _[ResourceItem](#resourceitem)_ | Requests specifies the minimum resources required by the component | | |
| `limits` _[ResourceItem](#resourceitem)_ | Limits specifies the maximum resources allowed for the component | | |
| `claims` _[ResourceClaim](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourceclaim-v1-core) array_ | Claims specifies resource claims for dynamic resource allocation | | |
#### ScalingAdapter
ScalingAdapter configures whether a service uses the DynamoGraphDeploymentScalingAdapter
for replica management. When enabled, the DGDSA owns the replicas field and
external autoscalers (HPA, KEDA, Planner) can control scaling via the Scale subresource.
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `enabled` _boolean_ | Enabled indicates whether the ScalingAdapter should be enabled for this service.<br />When true, a DGDSA is created and owns the replicas field.<br />When false (default), no DGDSA is created and replicas can be modified directly in the DGD. | false | |
#### ServiceReplicaStatus
ServiceReplicaStatus contains replica information for a single service.
_Appears in:_
- [DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `componentKind` _[ComponentKind](#componentkind)_ | ComponentKind is the underlying resource kind (e.g., "PodClique", "PodCliqueScalingGroup", "Deployment", "LeaderWorkerSet"). | | Enum: [PodClique PodCliqueScalingGroup Deployment LeaderWorkerSet] <br /> |
| `componentName` _string_ | ComponentName is the name of the underlying resource. | | |
| `replicas` _integer_ | Replicas is the total number of non-terminated replicas.<br />Required for all component kinds. | | Minimum: 0 <br /> |
| `updatedReplicas` _integer_ | UpdatedReplicas is the number of replicas at the current/desired revision.<br />Required for all component kinds. | | Minimum: 0 <br /> |
| `readyReplicas` _integer_ | ReadyReplicas is the number of ready replicas.<br />Populated for PodClique, Deployment, and LeaderWorkerSet.<br />Not available for PodCliqueScalingGroup.<br />When nil, the field is omitted from the API response. | | Minimum: 0 <br /> |
| `availableReplicas` _integer_ | AvailableReplicas is the number of available replicas.<br />For Deployment: replicas ready for >= minReadySeconds.<br />For PodCliqueScalingGroup: replicas where all constituent PodCliques have >= MinAvailable ready pods.<br />Not available for PodClique or LeaderWorkerSet.<br />When nil, the field is omitted from the API response. | | Minimum: 0 <br /> |
#### SharedMemorySpec
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `disabled` _boolean_ | | | |
| `size` _[Quantity](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#quantity-resource-api)_ | | | |
#### VolumeMount
VolumeMount references a PVC defined at the top level for volumes to be mounted by the component
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name references a PVC name defined in the top-level PVCs map | | Required: \{\} <br /> |
| `mountPoint` _string_ | MountPoint specifies where to mount the volume.<br />If useAsCompilationCache is true and mountPoint is not specified,<br />a backend-specific default will be used. | | |
| `useAsCompilationCache` _boolean_ | UseAsCompilationCache indicates this volume should be used as a compilation cache.<br />When true, backend-specific environment variables will be set and default mount points may be used. | false | |
# Operator Default Values Injection
The Dynamo operator automatically applies default values to various fields when they are not explicitly specified in your deployments. These defaults include:
- **Health Probes**: Startup, liveness, and readiness probes are configured differently for frontend, worker, and planner components. For example, worker components receive a startup probe with a 2-hour timeout (720 failures × 10 seconds) to accommodate long model loading times.
- **Security Context**: All components receive `fsGroup: 1000` by default to ensure proper file permissions for mounted volumes. This can be overridden via the `extraPodSpec.securityContext` field.
- **Shared Memory**: All components receive an 8Gi shared memory volume mounted at `/dev/shm` by default (can be disabled or resized via the `sharedMemory` field).
- **Environment Variables**: Components automatically receive environment variables like `DYN_NAMESPACE`, `DYN_PARENT_DGD_K8S_NAME`, `DYNAMO_PORT`, and backend-specific variables.
- **Pod Configuration**: Default `terminationGracePeriodSeconds` of 60 seconds and `restartPolicy: Always`.
- **Autoscaling**: When enabled without explicit metrics, defaults to CPU-based autoscaling with 80% target utilization.
- **Backend-Specific Behavior**: For multinode deployments, probes are automatically modified or removed for worker nodes depending on the backend framework (VLLM, SGLang, or TensorRT-LLM).
## Pod Specification Defaults
All components receive the following pod-level defaults unless overridden:
- **`terminationGracePeriodSeconds`**: `60` seconds
- **`restartPolicy`**: `Always`
## Security Context
The operator automatically applies default security context settings to all components to ensure proper file permissions, particularly for mounted volumes:
- **`fsGroup`**: `1000` - Sets the group ownership of mounted volumes and any files created in those volumes
This default ensures that non-root containers can write to mounted volumes (like model caches or persistent storage) without permission issues. The `fsGroup` setting is particularly important for:
- Model downloads and caching
- Compilation cache directories
- Persistent volume claims (PVCs)
- SSH key generation in multinode deployments
### Overriding Security Context
To override the default security context, specify your own `securityContext` in the `extraPodSpec` of your component:
```yaml
services:
YourWorker:
extraPodSpec:
securityContext:
fsGroup: 2000 # Custom group ID
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
```
**Important**: When you provide *any* `securityContext` object in `extraPodSpec`, the operator will not inject any defaults. This gives you complete control over the security context, including the ability to run as root (by omitting `runAsNonRoot` or setting it to `false`).
### OpenShift and Security Context Constraints
In OpenShift environments with Security Context Constraints (SCCs), you may need to omit explicit UID/GID values to allow OpenShift's admission controllers to assign them dynamically:
```yaml
services:
YourWorker:
extraPodSpec:
securityContext:
# Omit fsGroup to let OpenShift assign it based on SCC
# OpenShift will inject the appropriate UID range
```
Alternatively, if you want to keep the default `fsGroup: 1000` behavior and are certain your cluster allows it, you don't need to specify anything - the operator defaults will work.
## Shared Memory Configuration
Shared memory is enabled by default for all components:
- **Enabled**: `true` (unless explicitly disabled via `sharedMemory.disabled`)
- **Size**: `8Gi`
- **Mount Path**: `/dev/shm`
- **Volume Type**: `emptyDir` with `memory` medium
To disable shared memory or customize the size, use the `sharedMemory` field in your component specification.
## Health Probes by Component Type
The operator applies different default health probes based on the component type.
### Frontend Components
Frontend components receive the following probe configurations:
**Liveness Probe:**
- **Type**: HTTP GET
- **Path**: `/health`
- **Port**: `http` (8000)
- **Initial Delay**: 60 seconds
- **Period**: 60 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 10
**Readiness Probe:**
- **Type**: Exec command
- **Command**: `curl -s http://localhost:${DYNAMO_PORT}/health | jq -e ".status == \"healthy\""`
- **Initial Delay**: 60 seconds
- **Period**: 60 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 10
### Worker Components
Worker components receive the following probe configurations:
**Liveness Probe:**
- **Type**: HTTP GET
- **Path**: `/live`
- **Port**: `system` (9090)
- **Period**: 5 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 1
**Readiness Probe:**
- **Type**: HTTP GET
- **Path**: `/health`
- **Port**: `system` (9090)
- **Period**: 10 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 60
**Startup Probe:**
- **Type**: HTTP GET
- **Path**: `/live`
- **Port**: `system` (9090)
- **Period**: 10 seconds
- **Timeout**: 5 seconds
- **Failure Threshold**: 720 (allows up to 2 hours for startup: 10s × 720 = 7200s)
<Note>
**For larger models (typically >70B parameters) or slower storage systems, you may need to increase the `failureThreshold` to allow more time for model loading. Calculate the required threshold based on your expected startup time: `failureThreshold = (expected_startup_seconds / period)`. Override the startup probe in your component specification if the default 2-hour window is insufficient.**
</Note>
### Multinode Deployment Probe Modifications
For multinode deployments, the operator modifies probes based on the backend framework and node role:
#### VLLM Backend
The operator automatically selects between two deployment modes based on parallelism configuration:
**Tensor/Pipeline Parallel Mode** (when `world_size > GPUs_per_node`):
- Uses Ray for distributed execution (`--distributed-executor-backend ray`)
- **Leader nodes**: Starts Ray head and runs vLLM; all probes remain active
- **Worker nodes**: Run Ray agents only; all probes (liveness, readiness, startup) are removed
**Data Parallel Mode** (when `world_size × data_parallel_size > GPUs_per_node`):
- **Worker nodes**: All probes (liveness, readiness, startup) are removed
- **Leader nodes**: All probes remain active
#### SGLang Backend
- **Worker nodes**: All probes (liveness, readiness, startup) are removed
#### TensorRT-LLM Backend
- **Leader nodes**: All probes remain unchanged
- **Worker nodes**:
- Liveness and startup probes are removed
- Readiness probe is replaced with a TCP socket check on SSH port (2222):
- **Initial Delay**: 20 seconds
- **Period**: 20 seconds
- **Timeout**: 5 seconds
- **Failure Threshold**: 10
## Environment Variables
The operator automatically injects environment variables based on component type and configuration:
### All Components
- **`DYN_NAMESPACE`**: The Dynamo namespace for the component
- **`DYN_PARENT_DGD_K8S_NAME`**: The parent DynamoGraphDeployment Kubernetes resource name
- **`DYN_PARENT_DGD_K8S_NAMESPACE`**: The parent DynamoGraphDeployment Kubernetes namespace
### Frontend Components
- **`DYNAMO_PORT`**: `8000`
- **`DYN_HTTP_PORT`**: `8000`
### Worker Components
- **`DYN_SYSTEM_PORT`**: `9090` (automatically enables the system metrics server)
- **`DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS`**: `["generate"]`
- **`DYN_SYSTEM_ENABLED`**: `true` (needed for runtime images 0.6.1 and older)
### Planner Components
- **`PLANNER_PROMETHEUS_PORT`**: `9085`
### VLLM Backend (with compilation cache)
When a volume mount is configured with `useAsCompilationCache: true`:
- **`VLLM_CACHE_ROOT`**: Set to the mount point of the cache volume
## Service Account
Planner components automatically receive the following service account:
- **`serviceAccountName`**: `planner-serviceaccount`
## Image Pull Secrets
The operator automatically discovers and injects image pull secrets for container images. When a component specifies a container image, the operator:
1. Scans all Kubernetes secrets of type `kubernetes.io/dockerconfigjson` in the component's namespace
2. Extracts the docker registry server URLs from each secret's authentication configuration
3. Matches the container image's registry host against the discovered registry URLs
4. Automatically injects matching secrets as `imagePullSecrets` in the pod specification
This eliminates the need to manually specify image pull secrets for each component. The operator maintains an internal index of docker secrets and their associated registries, refreshing this index periodically.
**To disable automatic image pull secret discovery** for a specific component, add the following annotation:
```yaml
annotations:
nvidia.com/disable-image-pull-secret-discovery: "true"
```
## Autoscaling Defaults
When autoscaling is enabled but no metrics are specified, the operator applies:
- **Default Metric**: CPU utilization
- **Target Average Utilization**: `80%`
## Port Configurations
Default container ports are configured based on component type:
### Frontend Components
- **Port**: 8000
- **Protocol**: TCP
- **Name**: `http`
### Worker Components
- **Port**: 9090
- **Protocol**: TCP
- **Name**: `system`
### Planner Components
- **Port**: 9085
- **Protocol**: TCP
- **Name**: `metrics`
## Backend-Specific Configurations
### VLLM
- **Ray Head Port**: 6379 (for Ray cluster coordination in multinode TP/PP deployments)
- **Data Parallel RPC Port**: 13445 (for data parallel multinode deployments)
### SGLang
- **Distribution Init Port**: 29500 (for multinode deployments)
### TensorRT-LLM
- **SSH Port**: 2222 (for multinode MPI communication)
- **OpenMPI Environment**: `OMPI_MCA_orte_keep_fqdn_hostnames=1`
## Implementation Reference
For users who want to understand the implementation details or contribute to the operator, the default values described in this document are set in the following source files:
- **Health Probes, Security Context & Pod Specifications**: [`internal/dynamo/graph.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/graph.go) - Contains the main logic for applying default probes, security context, environment variables, shared memory, and pod configurations
- **Component-Specific Defaults**:
- [`internal/dynamo/component_frontend.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_frontend.go)
- [`internal/dynamo/component_worker.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_worker.go)
- [`internal/dynamo/component_planner.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_planner.go)
- **Image Pull Secrets**: [`internal/secrets/docker.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/secrets/docker.go) - Implements the docker secret indexer and automatic discovery
- **Backend-Specific Behavior**:
- [`internal/dynamo/backend_vllm.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_vllm.go)
- [`internal/dynamo/backend_sglang.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_sglang.go)
- [`internal/dynamo/backend_trtllm.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_trtllm.go)
- **Constants & Annotations**: [`internal/consts/consts.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/consts/consts.go) - Defines annotation keys and other constants
## Notes
- All these defaults can be overridden by explicitly specifying values in your DynamoComponentDeployment or DynamoGraphDeployment resources
- User-specified probes (via `livenessProbe`, `readinessProbe`, or `startupProbe` fields) take precedence over operator defaults
- For security context, if you provide *any* `securityContext` in `extraPodSpec`, no defaults will be injected, giving you full control
- For multinode deployments, some defaults are modified or removed as described above to accommodate distributed execution patterns
- The `extraPodSpec.mainContainer` field can be used to override probe configurations set by the operator
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment