Unverified Commit 0d597e7c authored by Anish's avatar Anish Committed by GitHub
Browse files

docs: Documentation audit and updates for 0.8 (#5380)


Signed-off-by: default avatarDan Gil <dagil@nvidia.com>
Co-authored-by: default avatarClaude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: default avatarDan Gil <dagil@nvidia.com>
parent 553acb48
Fault Tolerance
===============
.. toctree::
:maxdepth: 1
Overview <../fault_tolerance/README.md>
Request Migration <../fault_tolerance/request_migration.md>
Request Cancellation <../fault_tolerance/request_cancellation.md>
Graceful Shutdown <../fault_tolerance/graceful_shutdown.md>
Request Rejection <../fault_tolerance/request_rejection.md>
Testing <../fault_tolerance/testing.md>
......@@ -163,14 +163,19 @@ docker run \
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
### Start NATS and ETCD in the background
### Start Infrastructure Services (Local Development Only)
Start using [Docker Compose](../../../deploy/docker-compose.yml)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
> [!TIP]
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
>
......
......@@ -70,14 +70,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start NATS and ETCD in the background
### Start Infrastructure Services (Local Development Only)
Start using [Docker Compose](../../../deploy/docker-compose.yml)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
### Build container
```bash
......
......@@ -56,14 +56,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start NATS and ETCD in the background
### Start Infrastructure Services (Local Development Only)
Start using [Docker Compose](../../../deploy/docker-compose.yml)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
### Pull or build container
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
......
......@@ -19,56 +19,81 @@ limitations under the License.
## Overview
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., Python bindings can be found in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS). `DistributedRuntime` follows a hierarchical structure:
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It manages connections to discovery backends (K8s API or etcd) and optional messaging (NATS for KV events), and handles lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolate between different model deployments.
- `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each dynamo components typically are deployed with its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple workers:
For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:
- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the `Processor`.
- `Processor`: When a new request arrives, `Processor` applies the chat template and performs the tokenization.
Then, it routes the request to the `Worker`.
- `Worker` components (e.g., `VllmDecodeWorker`, `SGLangDecodeWorker`, `TrtllmWorker`): Perform the actual computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
- `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-agg`). Then, under their namespace, they have their own `Component`s: `Frontend` uses the `make_engine` function which handles HTTP serving and routing automatically, while worker components create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate`, `flush_cache`, or `clear_kv_blocks`. The `Frontend` component doesn't explicitly create endpoints - instead, the `make_engine` function handles the HTTP server and worker discovery. Worker components create their endpoints programmatically using the `component.endpoint()` method. Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("worker")`), and their `Endpoint`s are created using the `component.endpoint()` method.
Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:
- `Frontend` uses the `make_engine` function which handles HTTP serving, request preprocessing, and worker discovery automatically
- Worker components register with names like `backend`, `prefill`, `decode`, or `encoder` depending on their role
- Workers register endpoints like `generate`, `clear_kv_blocks`, or `load_metrics`
Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("backend")`), and their `Endpoint`s are created using the `component.endpoint()` method.
## Initialization
In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic modes, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on etcd.
In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are multiple modes for `DistributedRuntime` initialization based on the deployment environment.
```{caution}
The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
The hierarchy and naming may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
```
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following services:
- etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
- NATS (optional): for KV event messaging and router replica sync. NATS is enabled by default but can be disabled via the `enable_nats` parameter (e.g., using `--no-kv-events` flag). When NATS is disabled, the system operates in approximate mode without KV event persistence. Also legacy nats based request_plane is supported.
### Service Discovery Backends
The `DistributedRuntime` supports two service discovery backends, configured via `DYN_DISCOVERY_BACKEND`:
- **KV Store Discovery** (`DYN_DISCOVERY_BACKEND=kv_store`): Uses etcd for service discovery. **This is the global default** for all deployments unless explicitly overridden.
- **Kubernetes Discovery** (`DYN_DISCOVERY_BACKEND=kubernetes`): Uses native Kubernetes resources (DynamoWorkerMetadata CRD, EndpointSlices) for service discovery. **Must be explicitly set.** The Dynamo operator automatically sets this environment variable for Kubernetes deployments. **No etcd required.**
etcd and NATS are global services (there could be multiple instances for high availability).
> **Note:** There is no automatic detection of the deployment environment. The runtime always defaults to `kv_store`. For Kubernetes deployments, the operator injects `DYN_DISCOVERY_BACKEND=kubernetes` into pod environments.
For etcd, it also creates a primary lease and spin up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task failed, the primary lease is revoked or expired and the kv pairs stored with this lease_id is removed.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and is not registered in etcd. It provides the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't be registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` as the service identifier and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
- NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
- etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
When using Kubernetes discovery, the KV store backend automatically switches to in-memory storage since etcd is not needed.
### Runtime Initialization
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections based on the discovery backend:
- **Kubernetes mode**: Uses K8s API for service registration via DynamoWorkerMetadata CRD. No external dependencies required.
- **KV Store mode**: Connects to etcd for service discovery. Creates a primary lease with a background keep-alive task. All objects registered under this `DistributedRuntime` use this lease_id to maintain their lifecycle.
- **NATS** (optional): Used for KV event messaging when using KV-aware routing. Can be disabled via `--no-kv-events` flag, which enables prediction-based routing without event persistence.
- **Request Plane**: TCP by default. Can be configured to use HTTP or NATS via `DYN_REQUEST_PLANE` environment variable.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism. They provide the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, it registers a service in the internal registry of the `DistributedRuntime`, which tracks all services and endpoints.
- `Endpoint`: When an Endpoint object is created and started, it performs registration based on the discovery backend:
- **Kubernetes mode**: Endpoint information is stored in DynamoWorkerMetadata CRD resources, which are watched by other components for discovery.
- **KV Store mode**: Endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id`.
## Calling Endpoints
Dynamo uses `Client` object to call an endpoint. When a `Client` objected is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject of the available `Endpoint`s.
Dynamo uses a `Client` object to call an endpoint. When a `Client` is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then watches for endpoint changes:
- **Kubernetes mode**: Watches DynamoWorkerMetadata CRD resources for endpoint updates.
- **KV Store mode**: Sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`.
The watcher continuously updates the `Client` with information about available `Endpoint`s.
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
- `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
- `direct`: direct the request to a specific endpoint by specifying the instance ID
After selecting which endpoint to hit, the `Client` sends the request using the configured request plane (TCP by default). The request plane handles the actual transport:
After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and create a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
- **TCP** (default): Direct TCP connection with connection pooling
- **HTTP**: HTTP/2-based transport
- **NATS**: Message broker-based transport (legacy)
## Examples
......
......@@ -17,15 +17,17 @@ limitations under the License.
# Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations:
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations.
> **Note**: The "Processor" shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment—the Frontend handles both HTTP serving and request preprocessing via the `make_engine` function.
## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
3. **Validate (S3)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
4. **Route (S3)**: Frontend routes the validated request to appropriate Decode Worker
## 🟠 Decision and Allocation Flow (Orange)
The system's intelligent routing and resource allocation:
......@@ -48,18 +50,22 @@ The response generation and delivery:
11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker
12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data
13. **Response (S13)**: The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
13. **Response (S13)**: The generated response flows back through the Frontend for post-processing (detokenization) and delivery to the Client
## 🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support:
### ETCD Connections (Gray, dotted)
- **Frontend, Processor, Planner**: Service discovery and registration
- **Decode Worker, PrefillWorker**: NIXL metadata storage for GPU communication setup
### Service Discovery
- **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- **On bare metal**: Uses etcd for service discovery and endpoint registration.
### Request Plane
- **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
- **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.
### NATS Connections (Teal, dotted)
- **PrefillQueue**: JetStream consumer group for reliable work distribution
- **Processor**: Load balancing across workers
### NATS Connections (Optional, for KV routing)
- **PrefillQueue**: JetStream consumer group for reliable work distribution in disaggregated serving
- **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
### Planning Connections (Gold, dotted)
- **Frontend → Planner**: Metrics collection for auto-scaling decisions
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Event Plane Architecture
This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.
## Overview
Dynamo's coordination layer adapts to the deployment environment:
| Deployment | Service Discovery | KV Events | Request Plane |
|------------|-------------------|-----------|---------------|
| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
> **Note:** The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
```
┌─────────────────────────────────────────────────────────────────────┐
│ Coordination Layer │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Service Discovery │ │ NATS │ │
│ │ │ │ (Optional) │ │
│ │ • K8s: CRDs + API │ │ • KV Cache Events │ │
│ │ • Bare metal: etcd │ │ • Router Replica Sync │ │
│ │ │ │ • JetStream Persistence │ │
│ └─────────────────────────┘ └─────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
│ │
┌──────────┴──────────┐ ┌─────────┴──────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Frontend │ │ Planner │ │ Worker │
└─────────┘ └─────────┘ └─────────┘
```
## Kubernetes-Native Service Discovery
When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.
### Configuration
The operator explicitly sets:
```bash
DYN_DISCOVERY_BACKEND=kubernetes
```
> **Important:** This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
### How It Works
1. **DynamoWorkerMetadata CRD**: Workers register their endpoints by creating/updating DynamoWorkerMetadata custom resources
2. **EndpointSlices**: Used to signal readiness status to the system
3. **K8s API Watches**: Components watch for CRD changes to discover available endpoints
### Benefits
- No external etcd cluster required
- Native integration with Kubernetes lifecycle
- Automatic cleanup when pods terminate
- Works with standard K8s RBAC
### Environment Variables (Injected by Operator)
| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Current pod name |
| `POD_NAMESPACE` | Current namespace |
| `POD_UID` | Pod unique identifier |
---
## etcd Architecture (Default for All Deployments)
When `DYN_DISCOVERY_BACKEND=kv_store` (the global default), etcd is used for service discovery.
### Connection Configuration
etcd connection is configured via environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` |
| `ETCD_AUTH_USERNAME` | Basic auth username | None |
| `ETCD_AUTH_PASSWORD` | Basic auth password | None |
| `ETCD_AUTH_CA` | CA certificate path (TLS) | None |
| `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None |
| `ETCD_AUTH_CLIENT_KEY` | Client key path | None |
Example:
```bash
export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379
```
### Lease Management
Each `DistributedRuntime` maintains a primary lease with etcd:
```
┌────────────────────┐ ┌──────────────┐
│ DistributedRuntime │◄────────│ Primary Lease │
│ │ │ TTL: 10s │
│ • Namespace │ └───────┬───────┘
│ • Components │ │
│ • Endpoints │ │ Keep-Alive
│ │ │ Heartbeat
└────────────────────┘ ▼
┌──────────────┐
│ etcd │
└──────────────┘
```
**Lease Lifecycle:**
1. **Creation**: Lease created during `DistributedRuntime` initialization
2. **Keep-Alive**: Background task sends heartbeats at 50% of remaining TTL
3. **Expiration**: If heartbeats stop, lease expires after TTL (10 seconds default)
4. **Cleanup**: All keys associated with the lease are automatically deleted
**Automatic Recovery:**
- Reconnection with exponential backoff (50ms to 5s)
- Deadline-based retry logic
- Cancellation token propagation
### Service Discovery
Endpoints are registered in etcd for dynamic discovery:
**Key Format:**
```
/services/{namespace}/{component}/{endpoint}/{instance_id}
```
**Example:**
```
/services/vllm-agg/backend/generate/694d98147d54be25
```
**Registration Data:**
```json
{
"namespace": "vllm-agg",
"component": "backend",
"endpoint": "generate",
"instance_id": 7587888160958628000,
"transport": {
"tcp": "192.168.1.10:9999"
}
}
```
### Discovery Queries
The discovery system supports multiple query patterns:
| Query Type | Pattern | Use Case |
|------------|---------|----------|
| `AllEndpoints` | `/services/` | List all services |
| `NamespacedEndpoints` | `/services/{namespace}/` | Filter by namespace |
| `ComponentEndpoints` | `/services/{namespace}/{component}/` | Filter by component |
| `Endpoint` | `/services/{namespace}/{component}/{endpoint}/` | Specific endpoint |
### Watch Functionality
Clients watch etcd prefixes for real-time updates:
```python
# Client watches for endpoint changes
watcher = etcd.watch_prefix("/services/vllm-agg/backend/generate/")
for event in watcher:
if event.type == "PUT":
# New endpoint registered
add_endpoint(event.value)
elif event.type == "DELETE":
# Endpoint removed (worker died)
remove_endpoint(event.key)
```
**Watch Features:**
- Initial state retrieval with `get_and_watch_prefix()`
- Automatic reconnection on stream failure
- Revision tracking for no-event-loss guarantees
- Event types: `PUT` (create/update) and `DELETE`
### Distributed Locks
etcd provides distributed locking for coordination:
**Lock Types:**
| Type | Key Pattern | Behavior |
|------|-------------|----------|
| Write Lock | `v1/{prefix}/writer` | Exclusive (no readers/writers) |
| Read Lock | `v1/{prefix}/readers/{id}` | Shared (multiple readers) |
**Operations:**
```rust
// Non-blocking write lock
let lock = client.try_write_lock("my_resource").await?;
// Blocking read lock with polling (100ms intervals)
let lock = client.read_lock_with_wait("my_resource").await?;
```
## NATS Architecture
### When NATS is Used
NATS is used for:
1. **KV Cache Events**: Real-time KV cache state updates for routing
2. **Router Replica Sync**: Synchronizing router state across replicas
3. **Legacy Request Plane**: NATS-based request transport (optional)
### Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| `NATS_SERVER` | NATS server URL | `nats://localhost:4222` |
### Disabling NATS
For deployments without KV-aware routing:
```bash
# Disable NATS and KV events
python -m dynamo.frontend --no-kv-events
```
This enables "approximate mode" for KV routing without event persistence.
### Event Publishing
Components publish events to NATS subjects:
```rust
pub trait EventPublisher {
async fn publish(&self, event: &str, data: &[u8]) -> Result<()>;
async fn publish_serialized<T: Serialize>(&self, event: &str, data: &T) -> Result<()>;
}
```
**Subject Naming:**
```
{base_subject}.{event_name}
```
Example:
```
vllm-agg.backend.kv_cache_update
```
### Event Subscription
Components subscribe to events:
```rust
pub trait EventSubscriber {
async fn subscribe(&self, topic: &str) -> Result<Subscriber>;
async fn subscribe_typed<T: DeserializeOwned>(&self, topic: &str) -> Result<TypedSubscriber<T>>;
}
```
### JetStream Persistence
For durable event delivery, NATS JetStream provides:
- Message persistence
- Replay from offset
- Consumer groups for load balancing
- Acknowledgment tracking
## Key-Value Store Abstraction
Dynamo provides a unified KV store interface supporting multiple backends:
### Supported Backends
| Backend | Use Case | Configuration |
|---------|----------|---------------|
| `EtcdStore` | Production deployments | `ETCD_ENDPOINTS` |
| `MemoryStore` | Testing, development | Default |
| `NatsStore` | NATS-only deployments | `NATS_SERVER` |
| `FileStore` | Local persistence | File path |
### Store Interface
```rust
pub trait KvStore {
async fn get(&self, bucket: &str, key: &str) -> Result<Option<Vec<u8>>>;
async fn put(&self, bucket: &str, key: &str, value: &[u8]) -> Result<()>;
async fn delete(&self, bucket: &str, key: &str) -> Result<()>;
async fn watch(&self, bucket: &str) -> Result<WatchStream>;
}
```
### Buckets
Data is organized into logical buckets:
| Bucket | Purpose |
|--------|---------|
| `v1/instances` | Endpoint instance registry |
| `v1/mdc` | Model deployment cards |
## Typed Prefix Watcher
For type-safe watching of etcd prefixes:
```rust
// Watch and maintain HashMap of deserialized values
let watcher = watch_prefix_with_extraction::<DiscoveryInstance>(
&etcd_client,
"/services/vllm-agg/",
lease_id_extractor,
value_extractor,
).await?;
// Receive updates via watch channel
let instances = watcher.borrow();
```
**Key Extractors:**
| Extractor | Description |
|-----------|-------------|
| `lease_id()` | Use lease ID as key |
| `key_string()` | Extract key with prefix stripping |
| `full_key_string()` | Use full etcd key |
## Reliability Features
### Connection Resilience
**etcd Reconnection:**
- Exponential backoff: 50ms to 5s
- Deadline-based retry logic
- Mutex ensures single concurrent reconnect
**NATS Reconnection:**
- Built-in reconnection in NATS client
- Configurable max reconnect attempts
- Buffering during disconnection
### Lease-Based Cleanup
When a worker crashes or loses connectivity:
1. Keep-alive heartbeats stop
2. Lease expires after TTL (10 seconds)
3. All registered endpoints automatically deleted
4. Clients receive DELETE watch events
5. Traffic reroutes to healthy workers
### Transaction Safety
etcd transactions ensure atomic operations:
```rust
// Atomic create-if-not-exists
let txn = Txn::new()
.when([Compare::create_revision(key, CompareOp::Equal, 0)])
.and_then([Op::put(key, value, options)]);
etcd_client.txn(txn).await?;
```
This prevents race conditions in concurrent service registration.
## Operational Modes
### Kubernetes Mode (Requires Explicit Configuration)
Native Kubernetes service discovery:
```bash
# Operator explicitly sets this (not auto-detected):
export DYN_DISCOVERY_BACKEND=kubernetes
# Workers register via K8s CRDs
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
# Frontend discovers workers via K8s API
python -m dynamo.frontend
```
No etcd or NATS required for basic operation when using K8s discovery.
### KV Store Mode (Global Default)
Full service discovery with etcd:
```bash
# This is the default - no configuration needed
# export DYN_DISCOVERY_BACKEND=kv_store # (implicit)
# Workers register with etcd
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
# Frontend discovers workers via etcd
python -m dynamo.frontend
```
### KV-Aware Routing (Optional)
Enable NATS for KV cache event tracking:
```bash
# Default: KV events enabled (requires NATS)
python -m dynamo.frontend --router-mode kv
# Disable KV events for prediction-based routing (no NATS)
python -m dynamo.frontend --router-mode kv --no-kv-events
```
With `--no-kv-events`:
- Router predicts cache state based on routing decisions
- TTL-based expiration and LRU pruning
- No NATS infrastructure required
## Best Practices
### 1. Use Kubernetes Discovery on K8s
The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required when using the operator.
### 2. For Bare Metal: Deploy etcd Cluster
For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.
### 3. Configure Appropriate TTLs (etcd mode)
Balance between detection speed and overhead:
- **Short TTL (5s)**: Faster failure detection, more keep-alive traffic
- **Long TTL (30s)**: Less overhead, slower detection
### 4. KV Routing Without NATS
For simpler deployments without NATS:
```bash
# Use prediction-based KV routing
python -m dynamo.frontend --router-mode kv --no-kv-events
```
This provides KV-aware routing with reduced accuracy but no NATS dependency.
## Related Documentation
- [Distributed Runtime](distributed_runtime.md) - Runtime architecture
- [Request Plane](request_plane.md) - Request transport configuration
- [Fault Tolerance](../fault_tolerance/README.md) - Failure handling
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Fault Tolerance
Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability.
## Overview
Fault tolerance in Dynamo operates at multiple levels:
| Layer | Mechanism | Purpose |
|-------|-----------|---------|
| **Request** | Migration, Cancellation | Handle in-flight request failures |
| **Worker** | Health Checks, Graceful Shutdown | Detect and recover from worker failures |
| **System** | Load Shedding, Request Rejection | Prevent system overload |
| **Infrastructure** | etcd HA, NATS resilience | Handle infrastructure component failures |
## Key Features
### Request Migration
When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers. The migration system:
- Preserves partial generation state (accumulated tokens)
- Transparently continues generation on a new worker
- Maintains seamless token flow to clients
See [Request Migration](request_migration.md) for details.
### Request Cancellation
Dynamo supports canceling in-flight requests to free computational resources:
- Graceful stop signals for clean termination
- Kill signals for immediate termination
- Hierarchical cancellation propagation through request chains
See [Request Cancellation](request_cancellation.md) for details.
### Graceful Shutdown
Workers handle shutdown signals (SIGTERM/SIGINT) gracefully:
- Immediately stop accepting new requests
- Optionally drain in-flight requests before terminating
- Clean up resources (engines, connections, temp files)
See [Graceful Shutdown](graceful_shutdown.md) for details.
### Request Rejection (Load Shedding)
When workers are overloaded, Dynamo rejects new requests to prevent cascading failures:
- Configurable busy thresholds based on KV cache utilization
- Real-time worker load monitoring
- HTTP 503 responses with retry guidance
See [Request Rejection](request_rejection.md) for details.
### Health Checks
Dynamo provides multiple health check mechanisms:
- **HTTP Endpoints**: `/health` and `/live` endpoints for orchestration
- **Canary Health Checks**: Active monitoring via periodic test requests
- **Engine Monitoring**: Automatic shutdown on engine failure detection
See [Health Checks](../observability/health-checks.md) for details.
## Configuration Quick Reference
| Feature | Environment Variable | Default |
|---------|---------------------|---------|
| Worker health port | `DYN_SYSTEM_PORT` | `9090` |
| Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` (K8s: `true`) |
| Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds |
| Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds |
| Decode blocks threshold | `--active-decode-blocks-threshold` | None (disabled) |
| Prefill tokens threshold | `--active-prefill-tokens-threshold` | None (disabled) |
## Failure Scenarios and Recovery
### Worker Pod Restart
1. Worker receives SIGTERM from Kubernetes
2. Endpoints are immediately invalidated (no new requests)
3. In-flight requests complete or migrate (based on configuration)
4. Resources are cleaned up
5. Pod restarts with fresh state
### Worker Crash (Unexpected)
1. etcd lease expires (TTL-based detection)
2. Client discovers endpoint removal via etcd watch
3. New requests route to remaining healthy workers
4. In-flight requests on crashed worker are migrated (if enabled)
### Network Partition
1. Worker loses connectivity to etcd/NATS
2. Lease keep-alive fails, lease eventually expires
3. Worker is removed from service discovery
4. Traffic reroutes to reachable workers
### GPU Failure
1. Engine health check detects GPU error (XID, OOM, etc.)
2. Worker initiates graceful shutdown
3. Runtime is shut down, engine cleaned up
4. Process exits with code 1 for pod restart
## Testing Fault Tolerance
Dynamo includes a comprehensive testing framework for validating fault tolerance:
- Request cancellation tests
- Migration tests with worker failures
- etcd HA failover tests
- Hardware fault injection (GPU XID, network partitions)
See [Fault Tolerance Testing](testing.md) for details.
## Related Documentation
- [Observability](../observability/README.md) - Metrics and monitoring
- [Distributed Runtime](../design_docs/distributed_runtime.md) - Service discovery architecture
- [Event Plane](../design_docs/event_plane.md) - etcd and NATS coordination
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Graceful Shutdown
This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
## Overview
Graceful shutdown in Dynamo ensures that:
1. **No new requests are accepted** - Endpoints are immediately invalidated
2. **In-flight requests complete** - Existing requests finish processing (configurable)
3. **Resources are cleaned up** - Engines, connections, and temporary files are released
4. **Pods restart cleanly** - Exit codes signal Kubernetes for proper restart behavior
## Signal Handling
All Dynamo components handle Unix signals for graceful shutdown:
| Signal | Trigger | Behavior |
|--------|---------|----------|
| `SIGTERM` | Kubernetes pod termination | Graceful shutdown initiated |
| `SIGINT` | Ctrl+C / manual interrupt | Graceful shutdown initiated |
### Implementation
Each component registers signal handlers at startup:
```python
def signal_handler():
asyncio.create_task(graceful_shutdown(runtime))
for sig in (signal.SIGTERM, signal.SIGINT):
loop.add_signal_handler(sig, signal_handler)
```
The `graceful_shutdown()` function:
1. Logs the shutdown signal
2. Calls `runtime.shutdown()` to invalidate endpoints
3. Waits for in-flight requests (based on configuration)
4. Returns to allow cleanup to proceed
## Endpoint Draining
When `runtime.shutdown()` is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter when serving the endpoint.
### Configuration
When registering an endpoint, the `graceful_shutdown` parameter controls draining behavior:
```python
generate_endpoint.serve_endpoint(
handler.generate,
graceful_shutdown=True, # Wait for all requests to finish
metrics_labels=[("model", model_name)],
health_check_payload=health_check_payload,
)
```
| `graceful_shutdown` | Behavior |
|---------------------|----------|
| `True` | Wait for all in-flight requests to complete before returning |
| `False` | Return immediately without waiting for requests |
### Component-Specific Behavior
| Component | Default Behavior | Rationale |
|-----------|------------------|-----------|
| **Frontend** | N/A (HTTP server) | HTTP server handles its own shutdown |
| **Prefill Workers** | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation |
| **Decode Workers** | Conditional | If migration is enabled (`migration_limit > 0`), shutdown immediately to allow migration; otherwise wait |
| **Router** | `graceful_shutdown=True` | Ensure routing decisions complete |
### Decode Worker Migration Integration
Decode workers use conditional draining based on whether request migration is supported:
```python
generate_endpoint.serve_endpoint(
handler.generate,
graceful_shutdown=config.migration_limit <= 0, # If no migration, wait for requests
...
)
```
When `migration_limit > 0`:
- Worker shuts down immediately (`graceful_shutdown=False`)
- In-flight requests are migrated to healthy workers
- No request loss occurs
When `migration_limit <= 0`:
- Worker waits for in-flight requests (`graceful_shutdown=True`)
- Migration is not available
- Requests complete on the shutting-down worker
## Resource Cleanup
After endpoint draining, components clean up their resources in `finally` blocks:
### vLLM Worker Cleanup
```python
finally:
logger.debug("Cleaning up worker")
handler.cleanup()
```
The handler's `cleanup()` method:
- Removes temporary directories (LoRA adapters, etc.)
- Releases engine resources
### SGLang Worker Cleanup
```python
def cleanup(self) -> None:
# Cancel pending consume tasks
for task in self._consume_tasks:
if not task.done():
task.cancel()
self._consume_tasks.clear()
# Shutdown engine
self.engine.shutdown()
```
### TensorRT-LLM Worker Cleanup
```python
async def cleanup(self):
if self._llm:
try:
self._llm.shutdown()
except Exception as e:
logging.error(f"Error during cleanup: {e}")
finally:
self._llm = None
```
## Error-Initiated Shutdown
Workers can initiate graceful shutdown when fatal errors occur:
### Engine Health Monitoring (vLLM)
The `VllmEngineMonitor` continuously checks engine health:
```python
async def _check_engine_health(self):
while True:
try:
await self.engine_client.check_health()
await asyncio.sleep(HEALTH_CHECK_INTERVAL) # 2 seconds
except EngineDeadError as e:
logger.error(f"Health check failed: {e}")
self._shutdown_engine()
self.runtime.shutdown()
os._exit(1)
```
Configuration:
- `HEALTH_CHECK_INTERVAL`: 2 seconds between checks
- `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds max for engine shutdown
### Fatal Error Handling (TensorRT-LLM)
```python
async def _initiate_shutdown(self, error: Exception):
logging.warning(f"Initiating graceful shutdown due to: {error}")
try:
if self.runtime:
self.runtime.shutdown()
if self.engine:
await self.engine.cleanup()
except Exception as cleanup_error:
logging.error(f"Error during graceful shutdown: {cleanup_error}")
finally:
logging.critical("Forcing process exit for restart")
os._exit(1)
```
## Kubernetes Integration
### Pod Termination Flow
1. Kubernetes sends `SIGTERM` to the pod
2. Dynamo initiates graceful shutdown
3. Pod has `terminationGracePeriodSeconds` to complete (default: 30s)
4. If not terminated, Kubernetes sends `SIGKILL`
### Recommended Configuration
For production deployments, configure adequate termination grace period:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
services:
VllmWorker:
extraPodSpec:
terminationGracePeriodSeconds: 60 # Allow time for request draining
```
### Health Check Integration
Kubernetes uses health endpoints to determine pod readiness:
- **During shutdown**: Endpoints become unavailable
- **Readiness probe fails**: Traffic stops routing to the pod
- **Graceful draining**: Existing requests complete
## Best Practices
### 1. Set Appropriate Grace Periods
Match `terminationGracePeriodSeconds` to your expected request completion time:
- Short requests (< 10s): 30s grace period
- Long generation (> 30s): 120s+ grace period
### 2. Enable Request Migration for Decode Workers
If using disaggregated serving, enable migration for decode workers:
```python
--migration-limit 3 # Allow up to 3 migration attempts
```
This allows immediate shutdown while preserving request state.
### 3. Monitor Shutdown Metrics
Track shutdown behavior via logs:
```
INFO Received shutdown signal, shutting down DistributedRuntime
INFO DistributedRuntime shutdown complete
DEBUG Cleaning up worker
```
### 4. Handle Cleanup Errors
Ensure cleanup methods handle errors gracefully:
```python
def cleanup(self):
for resource in self.resources:
try:
resource.cleanup()
except Exception as e:
logger.warning(f"Cleanup failed: {e}")
# Continue with other resources
```
## Related Documentation
- [Request Migration](request_migration.md) - How requests migrate during shutdown
- [Request Cancellation](request_cancellation.md) - Canceling in-flight requests
- [Health Checks](../observability/health-checks.md) - Liveness and readiness probes
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Request Rejection (Load Shedding)
This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
## Overview
Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:
- Cascading failures from resource exhaustion
- Degraded latency for all requests
- Out-of-memory conditions on GPU workers
When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.
## Architecture
```
┌─────────────────┐
│ Worker Monitor │
│ (Background) │
└────────┬────────┘
│ Updates busy list
┌──────────┐ ┌──────────┐ ┌─────────────────────┐ ┌──────────┐
│ Client │───▶│ Frontend │───▶│ Push Router │───▶│ Worker │
└──────────┘ └──────────┘ │ (checks busy list) │ └──────────┘
└─────────────────────┘
│ If all workers busy
┌─────────────────────┐
│ HTTP 503 Error │
│ "All workers busy" │
└─────────────────────┘
```
## Configuration
### Frontend Arguments
Configure busy thresholds when starting the frontend:
```bash
python -m dynamo.frontend \
--active-decode-blocks-threshold 0.85 \
--active-prefill-tokens-threshold 10000
```
| Argument | Type | Description |
|----------|------|-------------|
| `--active-decode-blocks-threshold` | float (0.0-1.0) | KV cache block utilization threshold |
| `--active-prefill-tokens-threshold` | int | Prefill token count threshold |
### Dynamic Configuration via API
Thresholds can be adjusted at runtime via the `/busy_threshold` endpoint:
#### Set Thresholds
```bash
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"active_decode_blocks_threshold": 0.85,
"active_prefill_tokens_threshold": 10000
}'
```
#### Get Current Thresholds
```bash
curl http://localhost:8000/busy_threshold
```
Response:
```json
{
"thresholds": [
{
"model": "Qwen/Qwen3-0.6B",
"active_decode_blocks_threshold": 0.85,
"active_prefill_tokens_threshold": 10000
}
]
}
```
## Busy Detection Logic
Workers are marked as "busy" based on a dual-threshold system. A worker is considered busy when **either** threshold is exceeded.
### KV Cache Block Threshold
Monitors the percentage of KV cache blocks in use:
```
busy = active_decode_blocks / kv_total_blocks > threshold
```
Example: With `active_decode_blocks_threshold=0.85`, a worker using 87% of its KV cache blocks is marked busy.
### Prefill Token Threshold
Monitors the number of tokens currently being prefilled:
```
busy = active_prefill_tokens > threshold
```
Example: With `active_prefill_tokens_threshold=10000`, a worker prefilling 12,000 tokens is marked busy.
### Data-Parallel Rank Aggregation
For workers with multiple data-parallel ranks (tensor parallelism), the worker is only marked busy if **ALL** ranks are busy:
```python
def is_busy(worker):
return all(rank.is_busy() for rank in worker.dp_ranks)
```
This prevents false positives when only some ranks are temporarily loaded.
## Worker Load Monitoring
The `KvWorkerMonitor` runs as a background task that:
1. Subscribes to KV cache metrics events from workers
2. Maintains load state for each worker instance
3. Recalculates busy instances when metrics change
4. Updates the router with the current busy list
### Metrics Collected
Workers publish these metrics for monitoring:
| Metric | Description |
|--------|-------------|
| `active_decode_blocks` | Number of KV cache blocks currently in use |
| `kv_total_blocks` | Total KV cache blocks available |
| `active_prefill_tokens` | Number of tokens currently being prefilled |
## Rejection Behavior
### Request Flow
1. Request arrives at frontend
2. Push router checks if busy threshold is configured
3. If configured, router retrieves list of free (non-busy) instances
4. If no free instances exist (but instances are registered):
- Request is rejected with `PipelineError::ServiceOverloaded`
- HTTP 503 response is returned to client
### Error Response
When requests are rejected, clients receive:
```http
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
{
"message": "Service temporarily unavailable: All workers are busy, please retry later",
"type": "service_unavailable",
"code": 503
}
```
### Client Retry Strategy
Clients should implement exponential backoff when receiving 503 responses:
```python
import time
import random
def send_with_retry(request, max_retries=5):
for attempt in range(max_retries):
response = send_request(request)
if response.status_code != 503:
return response
# Exponential backoff with jitter
wait_time = min(60, (2 ** attempt) + random.uniform(0, 1))
time.sleep(wait_time)
raise Exception("Max retries exceeded")
```
## Monitoring
### Prometheus Metrics
Track rejection behavior with these metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `dynamo_tasks_rejected_total` | Counter | Total number of rejected tasks |
| `dynamo_queued_requests` | Gauge | Requests waiting in HTTP queue |
### Example Prometheus Queries
```promql
# Rejection rate over 5 minutes
rate(dynamo_tasks_rejected_total[5m])
# Percentage of requests rejected
sum(rate(dynamo_tasks_rejected_total[5m])) /
sum(rate(dynamo_tasks_issued_total[5m])) * 100
```
### Grafana Alerting
Example alert for high rejection rate:
```yaml
alert: HighRequestRejectionRate
expr: |
sum(rate(dynamo_tasks_rejected_total[5m])) /
sum(rate(dynamo_tasks_issued_total[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High request rejection rate"
description: "More than 10% of requests are being rejected"
```
## Tuning Thresholds
### Conservative Settings (Latency-Focused)
For applications prioritizing low latency:
```bash
--active-decode-blocks-threshold 0.70
--active-prefill-tokens-threshold 5000
```
- Rejects earlier, before workers become fully loaded
- Maintains lower queue depths
- Better tail latencies
### Aggressive Settings (Throughput-Focused)
For applications prioritizing throughput:
```bash
--active-decode-blocks-threshold 0.95
--active-prefill-tokens-threshold 20000
```
- Allows higher worker utilization
- May increase latency variability
- Better overall throughput
### Disabled (No Rejection)
To disable request rejection entirely:
```bash
# Simply don't set the threshold arguments
python -m dynamo.frontend
```
Without thresholds configured, all requests are accepted regardless of worker load.
## Best Practices
### 1. Start Conservative, Then Tune
Begin with conservative thresholds and increase based on observed behavior:
```bash
# Start here
--active-decode-blocks-threshold 0.75
# Increase if rejection rate is too high
--active-decode-blocks-threshold 0.85
```
### 2. Monitor Before Enabling
Observe worker load patterns before setting thresholds:
```bash
# Watch KV cache utilization
watch -n 1 'curl -s localhost:8000/metrics | grep kv_blocks'
```
### 3. Use Both Thresholds for Disaggregated Serving
In disaggregated deployments:
- Use `active_prefill_tokens_threshold` for prefill workers
- Use `active_decode_blocks_threshold` for decode workers
### 4. Coordinate with Autoscaling
If using Kubernetes HPA, ensure rejection thresholds trigger before autoscaling:
```yaml
# HPA triggers at 70% utilization
# Rejection at 85% provides buffer
--active-decode-blocks-threshold 0.85
```
## Related Documentation
- [Health Checks](../observability/health-checks.md) - Worker health monitoring
- [Metrics](../observability/metrics.md) - Available Prometheus metrics
- [Request Migration](request_migration.md) - Handling failed requests
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Fault Tolerance Testing
This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios.
## Overview
Dynamo's fault tolerance test suite is located in `tests/fault_tolerance/` and includes:
| Test Category | Location | Purpose |
|---------------|----------|---------|
| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
| Migration | `migration/` | Request migration when workers fail |
| etcd HA | `etcd_ha/` | etcd failover and recovery |
| Hardware | `hardware/` | GPU and network fault injection |
| Deployment | `deploy/` | End-to-end deployment testing |
## Test Directory Structure
```
tests/fault_tolerance/
├── cancellation/
│ ├── test_vllm.py
│ ├── test_trtllm.py
│ ├── test_sglang.py
│ └── utils.py
├── migration/
│ ├── test_vllm.py
│ ├── test_trtllm.py
│ ├── test_sglang.py
│ └── utils.py
├── etcd_ha/
│ ├── test_vllm.py
│ ├── test_trtllm.py
│ ├── test_sglang.py
│ └── utils.py
├── hardware/
│ └── fault_injection_service/
│ ├── api_service/
│ └── agents/
├── deploy/
│ ├── test_deployment.py
│ ├── scenarios.py
│ ├── base_checker.py
│ └── ...
└── client.py
```
## Request Cancellation Tests
Test that in-flight requests can be properly canceled.
### Running Cancellation Tests
```bash
# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v
# Run for specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v
```
### Cancellation Test Utilities
The `cancellation/utils.py` module provides:
#### CancellableRequest
Thread-safe request cancellation via TCP socket manipulation:
```python
from tests.fault_tolerance.cancellation.utils import CancellableRequest
request = CancellableRequest()
# Send request in separate thread
thread = Thread(target=send_request, args=(request,))
thread.start()
# Cancel after some time
time.sleep(1)
request.cancel() # Closes underlying socket
```
#### send_completion_request / send_chat_completion_request
Send cancellable completion requests:
```python
from tests.fault_tolerance.cancellation.utils import (
send_completion_request,
send_chat_completion_request
)
# Non-streaming
response = send_completion_request(
base_url="http://localhost:8000",
model="Qwen/Qwen3-0.6B",
prompt="Hello, world!",
max_tokens=100
)
# Streaming with cancellation
responses = send_chat_completion_request(
base_url="http://localhost:8000",
model="Qwen/Qwen3-0.6B",
messages=[{"role": "user", "content": "Hello!"}],
stream=True,
cancellable_request=request
)
```
#### poll_for_pattern
Wait for specific patterns in logs:
```python
from tests.fault_tolerance.cancellation.utils import poll_for_pattern
# Wait for cancellation confirmation
found = poll_for_pattern(
log_file="/var/log/dynamo/worker.log",
pattern="Request cancelled",
timeout=30,
interval=0.5
)
```
## Migration Tests
Test that requests migrate to healthy workers when failures occur.
### Running Migration Tests
```bash
# Run all migration tests
pytest tests/fault_tolerance/migration/ -v
# Run for specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v
```
### Migration Test Utilities
The `migration/utils.py` module provides:
- Frontend wrapper with configurable request planes
- Long-running request spawning for migration scenarios
- Health check disabling for controlled testing
### Example Migration Test
```python
def test_migration_on_worker_failure():
# Start deployment with 2 workers
deployment = start_deployment(workers=2)
# Send long-running request
request_thread = spawn_long_request(max_tokens=1000)
# Kill one worker mid-generation
kill_worker(deployment.workers[0])
# Verify request completes on remaining worker
response = request_thread.join()
assert response.status_code == 200
assert len(response.tokens) > 0
```
## etcd HA Tests
Test system behavior during etcd failures and recovery.
### Running etcd HA Tests
```bash
pytest tests/fault_tolerance/etcd_ha/ -v
```
### Test Scenarios
- **Leader failover**: etcd leader node fails, cluster elects new leader
- **Network partition**: etcd node becomes unreachable
- **Recovery**: System recovers after etcd becomes available
## Hardware Fault Injection
The fault injection service enables testing under simulated hardware failures.
### Fault Injection Service
Located at `tests/fault_tolerance/hardware/fault_injection_service/`, this FastAPI service orchestrates fault injection:
```bash
# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main
```
### Supported Fault Types
#### GPU Faults
| Fault Type | Description |
|------------|-------------|
| `XID_ERROR` | Simulate GPU XID error (various codes) |
| `THROTTLE` | GPU thermal throttling |
| `MEMORY_PRESSURE` | GPU memory exhaustion |
| `OVERHEAT` | GPU overheating condition |
| `COMPUTE_OVERLOAD` | GPU compute saturation |
#### Network Faults
| Fault Type | Description |
|------------|-------------|
| `FRONTEND_WORKER` | Partition between frontend and workers |
| `WORKER_NATS` | Partition between workers and NATS |
| `WORKER_WORKER` | Partition between workers |
| `CUSTOM` | Custom network partition |
### Fault Injection API
#### Inject GPU Fault
```bash
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
-H "Content-Type: application/json" \
-d '{
"target_pod": "vllm-worker-0",
"fault_type": "XID_ERROR",
"severity": "HIGH"
}'
```
#### Inject Specific XID Error
```bash
# Inject XID 79 (GPU memory page fault)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
-H "Content-Type: application/json" \
-d '{"target_pod": "vllm-worker-0"}'
```
Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120
#### Inject Network Partition
```bash
curl -X POST http://localhost:8080/api/v1/faults/network/inject \
-H "Content-Type: application/json" \
-d '{
"partition_type": "FRONTEND_WORKER",
"duration_seconds": 30
}'
```
#### Recover from Fault
```bash
curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
```
#### List Active Faults
```bash
curl http://localhost:8080/api/v1/faults
```
### GPU Fault Injector Agent
The GPU fault injector runs as a DaemonSet on worker nodes:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: gpu-fault-injector
spec:
selector:
matchLabels:
app: gpu-fault-injector
template:
spec:
containers:
- name: agent
image: dynamo/gpu-fault-injector:latest
securityContext:
privileged: true
volumeMounts:
- name: dev
mountPath: /dev
```
The agent injects fake XID messages via `/dev/kmsg` to trigger NVSentinel detection.
## Deployment Testing Framework
The `deploy/` directory contains an end-to-end testing framework.
### Test Phases
Tests run through three phases:
| Phase | Description |
|-------|-------------|
| `STANDARD` | Baseline performance under normal conditions |
| `OVERFLOW` | System behavior during fault/overload |
| `RECOVERY` | System recovery after fault resolution |
### Scenario Configuration
Define test scenarios in `scenarios.py`:
```python
from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure
scenario = Scenario(
name="worker_failure_migration",
backend="vllm",
load=Load(
clients=10,
requests_per_client=100,
max_tokens=256
),
failure=Failure(
type="pod_kill",
target="vllm-worker-0",
trigger_after_requests=50
)
)
```
### Running Deployment Tests
```bash
# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v
# Run specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
```
### Validation Checkers
The framework includes pluggable validators:
```python
from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext
class MigrationChecker(BaseChecker):
def check(self, context: ValidationContext) -> bool:
# Verify migrations occurred
migrations = context.metrics.get("migrations_total", 0)
return migrations > 0
```
### Results Parsing
Parse test results for analysis:
```python
from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test
results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")
```
## Client Utilities
The `client.py` module provides shared client functionality:
### Multi-Threaded Load Generation
```python
from tests.fault_tolerance.client import client
# Generate load with multiple clients
results = client(
base_url="http://localhost:8000",
num_clients=10,
requests_per_client=100,
model="Qwen/Qwen3-0.6B",
max_tokens=256,
log_dir="/tmp/test_logs"
)
```
### Request Options
| Parameter | Description |
|-----------|-------------|
| `base_url` | Frontend URL |
| `num_clients` | Number of concurrent clients |
| `requests_per_client` | Requests per client |
| `model` | Model name |
| `max_tokens` | Max tokens per request |
| `log_dir` | Directory for client logs |
| `endpoint` | `completions` or `chat/completions` |
## Running the Full Test Suite
### Prerequisites
1. Kubernetes cluster with GPU nodes
2. Dynamo deployment
3. etcd cluster (for HA tests)
4. Fault injection service (for hardware tests)
### Environment Setup
```bash
export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000
```
### Run All Tests
```bash
# Install test dependencies
pip install pytest pytest-asyncio
# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short
# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"
```
### Test Markers
| Marker | Description |
|--------|-------------|
| `slow` | Long-running tests (> 5 minutes) |
| `gpu` | Requires GPU resources |
| `k8s` | Requires Kubernetes cluster |
| `etcd_ha` | Requires multi-node etcd |
## Best Practices
### 1. Isolate Test Environments
Run fault tolerance tests in dedicated namespaces:
```bash
kubectl create namespace dynamo-fault-test
```
### 2. Clean Up After Tests
Ensure fault injection is recovered:
```bash
# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
```
### 3. Collect Logs
Preserve logs for debugging:
```bash
pytest tests/fault_tolerance/ -v \
--log-dir=/tmp/fault_test_logs \
--capture=no
```
### 4. Monitor During Tests
Watch system state during tests:
```bash
# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test
# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
```
## Related Documentation
- [Request Migration](request_migration.md) - Migration implementation details
- [Request Cancellation](request_cancellation.md) - Cancellation implementation
- [Health Checks](../observability/health-checks.md) - Health monitoring
- [Metrics](../observability/metrics.md) - Available metrics for monitoring
......@@ -37,14 +37,19 @@
kvbm/vllm-setup.md
kvbm/trtllm-setup.md
agents/tool-calling.md
guides/jail_stream_readme.md
guides/request_plane.md
development/jail_stream.md
router/kv_cache_routing.md
router/kv_events.md
planner/load_planner.md
fault_tolerance/README.md
fault_tolerance/request_migration.md
fault_tolerance/request_cancellation.md
fault_tolerance/graceful_shutdown.md
fault_tolerance/request_rejection.md
fault_tolerance/testing.md
design_docs/request_plane.md
design_docs/event_plane.md
backends/trtllm/multinode/multinode-examples.md
backends/trtllm/llama4_plus_eagle.md
......@@ -81,5 +86,5 @@
.. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
have some outdated names/references and need a refresh.
.. TODO: Add an OpenAI frontend doc and then add top-level Frontends section
to index.rst pointing to both OpenAI HTTP and KServe GRPC docs.
.. TODO: Add an OpenAI frontend doc to complement the KServe GRPC doc
in the Frontends section.
......@@ -64,6 +64,7 @@ Quickstart
Tuning Disaggregated Performance <performance/tuning.md>
Writing Python Workers in Dynamo <development/backend-guide.md>
Observability (Local) <_sections/observability>
Fault Tolerance <_sections/fault_tolerance>
Glossary <reference/glossary.md>
.. toctree::
......@@ -71,6 +72,7 @@ Quickstart
:caption: Components
Backends <_sections/backends>
Frontends <_sections/frontends>
Router <router/README>
Planner <planner/planner_intro>
KVBM <kvbm/kvbm_intro>
......@@ -83,3 +85,5 @@ Quickstart
Architecture Flow <design_docs/dynamo_flow.md>
Disaggregated Serving <design_docs/disagg_serving.md>
Distributed Runtime <design_docs/distributed_runtime.md>
Request Plane <design_docs/request_plane.md>
Event Plane <design_docs/event_plane.md>
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment