docs: Documentation audit and updates for 0.8 (#5380)

Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Dan Gil <dagil@nvidia.com>

docs: Documentation audit and updates for 0.8 (#5380)
Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Dan Gil <dagil@nvidia.com>
0d597e7c · Anish · GitHub · 553acb48 · 0d597e7c · 0d597e7c
Unverified Commit 0d597e7c authored Jan 14, 2026 by Anish Committed by GitHub Jan 15, 2026
15 changed files
--- a/docs/_sections/fault_tolerance.rst
+++ b/docs/_sections/fault_tolerance.rst
+Fault Tolerance
+===============
+
+.. toctree::
+   :maxdepth: 1
+
+   Overview <../fault_tolerance/README.md>
+   Request Migration <../fault_tolerance/request_migration.md>
+   Request Cancellation <../fault_tolerance/request_cancellation.md>
+   Graceful Shutdown <../fault_tolerance/graceful_shutdown.md>
+   Request Rejection <../fault_tolerance/request_rejection.md>
+   Testing <../fault_tolerance/testing.md>
--- a/docs/backends/sglang/README.md
+++ b/docs/backends/sglang/README.md
@@ -163,14 +163,19 @@ docker run \

 Below we provide a guide that lets you run all of our common deployment patterns on a single node.

-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)

-Start using [Docker Compose](../../../deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):

 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```

+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
+> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
+
 > [!TIP]
 > Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
 >

--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -70,14 +70,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

 Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)

-Start using [Docker Compose](../../../deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):

 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```

+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
+> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
+
 ### Build container

 ```bash

--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -56,14 +56,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

 Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)

-Start using [Docker Compose](../../../deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):

 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```

+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
+> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
+
 ### Pull or build container

 We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:

--- a/docs/design_docs/distributed_runtime.md
+++ b/docs/design_docs/distributed_runtime.md
@@ -19,56 +19,81 @@ limitations under the License.

 ## Overview

-Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
+Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., Python bindings can be found in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS). `DistributedRuntime` follows a hierarchical structure:

- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
+- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It manages connections to discovery backends (K8s API or etcd) and optional messaging (NATS for KV events), and handles lifecycle with cancellation tokens.
 - `Namespace`: A `Namespace` is a logical grouping of components that isolate between different model deployments.
 - `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
 - `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.

 While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each dynamo components typically are deployed with its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.

-For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple workers:
+For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:

- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the `Processor`.
- `Processor`: When a new request arrives, `Processor` applies the chat template and performs the tokenization.
-Then, it routes the request to the `Worker`.
- `Worker` components (e.g., `VllmDecodeWorker`, `SGLangDecodeWorker`, `TrtllmWorker`): Perform the actual computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
+- `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
+- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM).

-Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-agg`). Then, under their namespace, they have their own `Component`s: `Frontend` uses the `make_engine` function which handles HTTP serving and routing automatically, while worker components create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate`, `flush_cache`, or `clear_kv_blocks`. The `Frontend` component doesn't explicitly create endpoints - instead, the `make_engine` function handles the HTTP server and worker discovery. Worker components create their endpoints programmatically using the `component.endpoint()` method. Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("worker")`), and their `Endpoint`s are created using the `component.endpoint()` method.
+Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:
+
+- `Frontend` uses the `make_engine` function which handles HTTP serving, request preprocessing, and worker discovery automatically
+- Worker components register with names like `backend`, `prefill`, `decode`, or `encoder` depending on their role
+- Workers register endpoints like `generate`, `clear_kv_blocks`, or `load_metrics`
+
+Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("backend")`), and their `Endpoint`s are created using the `component.endpoint()` method.

 ## Initialization

-In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic modes, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on etcd.
+In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are multiple modes for `DistributedRuntime` initialization based on the deployment environment.

 ```{caution}
-The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
+The hierarchy and naming may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
 ```

- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following services:
-    - etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
-    - NATS (optional): for KV event messaging and router replica sync. NATS is enabled by default but can be disabled via the `enable_nats` parameter (e.g., using `--no-kv-events` flag). When NATS is disabled, the system operates in approximate mode without KV event persistence. Also legacy nats based request_plane is supported.
+### Service Discovery Backends
+
+The `DistributedRuntime` supports two service discovery backends, configured via `DYN_DISCOVERY_BACKEND`:
+
+- **KV Store Discovery** (`DYN_DISCOVERY_BACKEND=kv_store`): Uses etcd for service discovery. **This is the global default** for all deployments unless explicitly overridden.
+
+- **Kubernetes Discovery** (`DYN_DISCOVERY_BACKEND=kubernetes`): Uses native Kubernetes resources (DynamoWorkerMetadata CRD, EndpointSlices) for service discovery. **Must be explicitly set.** The Dynamo operator automatically sets this environment variable for Kubernetes deployments. **No etcd required.**

-  etcd and NATS are global services (there could be multiple instances for high availability).
+> **Note:** There is no automatic detection of the deployment environment. The runtime always defaults to `kv_store`. For Kubernetes deployments, the operator injects `DYN_DISCOVERY_BACKEND=kubernetes` into pod environments.

-  For etcd, it also creates a primary lease and spin up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task failed, the primary lease is revoked or expired and the kv pairs stored with this lease_id is removed.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and is not registered in etcd. It provides the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't be registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` as the service identifier and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
-  - NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
-  - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
+When using Kubernetes discovery, the KV store backend automatically switches to in-memory storage since etcd is not needed.
+
+### Runtime Initialization
+
+- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections based on the discovery backend:
+    - **Kubernetes mode**: Uses K8s API for service registration via DynamoWorkerMetadata CRD. No external dependencies required.
+    - **KV Store mode**: Connects to etcd for service discovery. Creates a primary lease with a background keep-alive task. All objects registered under this `DistributedRuntime` use this lease_id to maintain their lifecycle.
+    - **NATS** (optional): Used for KV event messaging when using KV-aware routing. Can be disabled via `--no-kv-events` flag, which enables prediction-based routing without event persistence.
+    - **Request Plane**: TCP by default. Can be configured to use HTTP or NATS via `DYN_REQUEST_PLANE` environment variable.
+- `Namespace`: `Namespace`s are primarily a logical grouping mechanism. They provide the root path for all components under this `Namespace`.
+- `Component`: When a `Component` object is created, it registers a service in the internal registry of the `DistributedRuntime`, which tracks all services and endpoints.
+- `Endpoint`: When an Endpoint object is created and started, it performs registration based on the discovery backend:
+  - **Kubernetes mode**: Endpoint information is stored in DynamoWorkerMetadata CRD resources, which are watched by other components for discovery.
+  - **KV Store mode**: Endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id`.

 ## Calling Endpoints

-Dynamo uses `Client` object to call an endpoint. When a `Client` objected is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject of the available `Endpoint`s.
+Dynamo uses a `Client` object to call an endpoint. When a `Client` is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then watches for endpoint changes:
+
+- **Kubernetes mode**: Watches DynamoWorkerMetadata CRD resources for endpoint updates.
+- **KV Store mode**: Sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`.
+
+The watcher continuously updates the `Client` with information about available `Endpoint`s.

 The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:

 - `random`: randomly select an endpoint to hit
 - `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
+- `direct`: direct the request to a specific endpoint by specifying the instance ID
+
+After selecting which endpoint to hit, the `Client` sends the request using the configured request plane (TCP by default). The request plane handles the actual transport:

-After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and create a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
+- **TCP** (default): Direct TCP connection with connection pooling
+- **HTTP**: HTTP/2-based transport
+- **NATS**: Message broker-based transport (legacy)

 ## Examples


--- a/docs/design_docs/dynamo_flow.md
+++ b/docs/design_docs/dynamo_flow.md
@@ -17,15 +17,17 @@ limitations under the License.

 # Dynamo Architecture Flow

-This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations:
+This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations.
+
+> **Note**: The "Processor" shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment—the Frontend handles both HTTP serving and request preprocessing via the `make_engine` function.

 ## 🔵 Main Request Flow (Blue)
 The primary user journey through the system:

 1. **Discovery (S1)**: Client discovers the service endpoint
 2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
-3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
-4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
+3. **Validate (S3)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
+4. **Route (S3)**: Frontend routes the validated request to appropriate Decode Worker

 ## 🟠 Decision and Allocation Flow (Orange)
 The system's intelligent routing and resource allocation:
@@ -48,18 +50,22 @@ The response generation and delivery:

 11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker
 12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data
-13. **Response (S13)**: The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
+13. **Response (S13)**: The generated response flows back through the Frontend for post-processing (detokenization) and delivery to the Client

 ## 🔗 Infrastructure Connections (Dotted lines)
 Coordination and messaging support:

-### ETCD Connections (Gray, dotted)
- **Frontend, Processor, Planner**: Service discovery and registration
- **Decode Worker, PrefillWorker**: NIXL metadata storage for GPU communication setup
+### Service Discovery
+- **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
+- **On bare metal**: Uses etcd for service discovery and endpoint registration.
+
+### Request Plane
+- **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
+- **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.

-### NATS Connections (Teal, dotted)
- **PrefillQueue**: JetStream consumer group for reliable work distribution
- **Processor**: Load balancing across workers
+### NATS Connections (Optional, for KV routing)
+- **PrefillQueue**: JetStream consumer group for reliable work distribution in disaggregated serving
+- **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)

 ### Planning Connections (Gold, dotted)
 - **Frontend → Planner**: Metrics collection for auto-scaling decisions

--- a/docs/design_docs/event_plane.md
+++ b/docs/design_docs/event_plane.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Event Plane Architecture
+
+This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.
+
+## Overview
+
+Dynamo's coordination layer adapts to the deployment environment:
+
+| Deployment | Service Discovery | KV Events | Request Plane |
+|------------|-------------------|-----------|---------------|
+| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
+| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
+
+> **Note:** The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                    Coordination Layer                                │
+│                                                                      │
+│  ┌─────────────────────────┐    ┌─────────────────────────────────┐ │
+│  │   Service Discovery     │    │            NATS                 │ │
+│  │                         │    │         (Optional)              │ │
+│  │  • K8s: CRDs + API      │    │  • KV Cache Events              │ │
+│  │  • Bare metal: etcd     │    │  • Router Replica Sync          │ │
+│  │                         │    │  • JetStream Persistence        │ │
+│  └─────────────────────────┘    └─────────────────────────────────┘ │
+│                                                                      │
+└─────────────────────────────────────────────────────────────────────┘
+                    │                          │
+         ┌──────────┴──────────┐    ┌─────────┴──────────┐
+         ▼                     ▼    ▼                    ▼
+    ┌─────────┐          ┌─────────┐              ┌─────────┐
+    │Frontend │          │ Planner │              │ Worker  │
+    └─────────┘          └─────────┘              └─────────┘
+```
+
+## Kubernetes-Native Service Discovery
+
+When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.
+
+### Configuration
+
+The operator explicitly sets:
+```bash
+DYN_DISCOVERY_BACKEND=kubernetes
+```
+
+> **Important:** This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
+
+### How It Works
+
+1. **DynamoWorkerMetadata CRD**: Workers register their endpoints by creating/updating DynamoWorkerMetadata custom resources
+2. **EndpointSlices**: Used to signal readiness status to the system
+3. **K8s API Watches**: Components watch for CRD changes to discover available endpoints
+
+### Benefits
+
+- No external etcd cluster required
+- Native integration with Kubernetes lifecycle
+- Automatic cleanup when pods terminate
+- Works with standard K8s RBAC
+
+### Environment Variables (Injected by Operator)
+
+| Variable | Description |
+|----------|-------------|
+| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
+| `POD_NAME` | Current pod name |
+| `POD_NAMESPACE` | Current namespace |
+| `POD_UID` | Pod unique identifier |
+
+---
+
+## etcd Architecture (Default for All Deployments)
+
+When `DYN_DISCOVERY_BACKEND=kv_store` (the global default), etcd is used for service discovery.
+
+### Connection Configuration
+
+etcd connection is configured via environment variables:
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` |
+| `ETCD_AUTH_USERNAME` | Basic auth username | None |
+| `ETCD_AUTH_PASSWORD` | Basic auth password | None |
+| `ETCD_AUTH_CA` | CA certificate path (TLS) | None |
+| `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None |
+| `ETCD_AUTH_CLIENT_KEY` | Client key path | None |
+
+Example:
+```bash
+export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379
+```
+
+### Lease Management
+
+Each `DistributedRuntime` maintains a primary lease with etcd:
+
+```
+┌────────────────────┐         ┌──────────────┐
+│ DistributedRuntime │◄────────│ Primary Lease │
+│                    │         │  TTL: 10s     │
+│  • Namespace       │         └───────┬───────┘
+│  • Components      │                 │
+│  • Endpoints       │                 │ Keep-Alive
+│                    │                 │ Heartbeat
+└────────────────────┘                 ▼
+                               ┌──────────────┐
+                               │     etcd     │
+                               └──────────────┘
+```
+
+**Lease Lifecycle:**
+
+1. **Creation**: Lease created during `DistributedRuntime` initialization
+2. **Keep-Alive**: Background task sends heartbeats at 50% of remaining TTL
+3. **Expiration**: If heartbeats stop, lease expires after TTL (10 seconds default)
+4. **Cleanup**: All keys associated with the lease are automatically deleted
+
+**Automatic Recovery:**
+
+- Reconnection with exponential backoff (50ms to 5s)
+- Deadline-based retry logic
+- Cancellation token propagation
+
+### Service Discovery
+
+Endpoints are registered in etcd for dynamic discovery:
+
+**Key Format:**
+```
+/services/{namespace}/{component}/{endpoint}/{instance_id}
+```
+
+**Example:**
+```
+/services/vllm-agg/backend/generate/694d98147d54be25
+```
+
+**Registration Data:**
+```json
+{
+  "namespace": "vllm-agg",
+  "component": "backend",
+  "endpoint": "generate",
+  "instance_id": 7587888160958628000,
+  "transport": {
+    "tcp": "192.168.1.10:9999"
+  }
+}
+```
+
+### Discovery Queries
+
+The discovery system supports multiple query patterns:
+
+| Query Type | Pattern | Use Case |
+|------------|---------|----------|
+| `AllEndpoints` | `/services/` | List all services |
+| `NamespacedEndpoints` | `/services/{namespace}/` | Filter by namespace |
+| `ComponentEndpoints` | `/services/{namespace}/{component}/` | Filter by component |
+| `Endpoint` | `/services/{namespace}/{component}/{endpoint}/` | Specific endpoint |
+
+### Watch Functionality
+
+Clients watch etcd prefixes for real-time updates:
+
+```python
+# Client watches for endpoint changes
+watcher = etcd.watch_prefix("/services/vllm-agg/backend/generate/")
+
+for event in watcher:
+    if event.type == "PUT":
+        # New endpoint registered
+        add_endpoint(event.value)
+    elif event.type == "DELETE":
+        # Endpoint removed (worker died)
+        remove_endpoint(event.key)
+```
+
+**Watch Features:**
+
+- Initial state retrieval with `get_and_watch_prefix()`
+- Automatic reconnection on stream failure
+- Revision tracking for no-event-loss guarantees
+- Event types: `PUT` (create/update) and `DELETE`
+
+### Distributed Locks
+
+etcd provides distributed locking for coordination:
+
+**Lock Types:**
+
+| Type | Key Pattern | Behavior |
+|------|-------------|----------|
+| Write Lock | `v1/{prefix}/writer` | Exclusive (no readers/writers) |
+| Read Lock | `v1/{prefix}/readers/{id}` | Shared (multiple readers) |
+
+**Operations:**
+
+```rust
+// Non-blocking write lock
+let lock = client.try_write_lock("my_resource").await?;
+
+// Blocking read lock with polling (100ms intervals)
+let lock = client.read_lock_with_wait("my_resource").await?;
+```
+
+## NATS Architecture
+
+### When NATS is Used
+
+NATS is used for:
+
+1. **KV Cache Events**: Real-time KV cache state updates for routing
+2. **Router Replica Sync**: Synchronizing router state across replicas
+3. **Legacy Request Plane**: NATS-based request transport (optional)
+
+### Configuration
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `NATS_SERVER` | NATS server URL | `nats://localhost:4222` |
+
+### Disabling NATS
+
+For deployments without KV-aware routing:
+
+```bash
+# Disable NATS and KV events
+python -m dynamo.frontend --no-kv-events
+```
+
+This enables "approximate mode" for KV routing without event persistence.
+
+### Event Publishing
+
+Components publish events to NATS subjects:
+
+```rust
+pub trait EventPublisher {
+    async fn publish(&self, event: &str, data: &[u8]) -> Result<()>;
+    async fn publish_serialized<T: Serialize>(&self, event: &str, data: &T) -> Result<()>;
+}
+```
+
+**Subject Naming:**
+```
+{base_subject}.{event_name}
+```
+
+Example:
+```
+vllm-agg.backend.kv_cache_update
+```
+
+### Event Subscription
+
+Components subscribe to events:
+
+```rust
+pub trait EventSubscriber {
+    async fn subscribe(&self, topic: &str) -> Result<Subscriber>;
+    async fn subscribe_typed<T: DeserializeOwned>(&self, topic: &str) -> Result<TypedSubscriber<T>>;
+}
+```
+
+### JetStream Persistence
+
+For durable event delivery, NATS JetStream provides:
+
+- Message persistence
+- Replay from offset
+- Consumer groups for load balancing
+- Acknowledgment tracking
+
+## Key-Value Store Abstraction
+
+Dynamo provides a unified KV store interface supporting multiple backends:
+
+### Supported Backends
+
+| Backend | Use Case | Configuration |
+|---------|----------|---------------|
+| `EtcdStore` | Production deployments | `ETCD_ENDPOINTS` |
+| `MemoryStore` | Testing, development | Default |
+| `NatsStore` | NATS-only deployments | `NATS_SERVER` |
+| `FileStore` | Local persistence | File path |
+
+### Store Interface
+
+```rust
+pub trait KvStore {
+    async fn get(&self, bucket: &str, key: &str) -> Result<Option<Vec<u8>>>;
+    async fn put(&self, bucket: &str, key: &str, value: &[u8]) -> Result<()>;
+    async fn delete(&self, bucket: &str, key: &str) -> Result<()>;
+    async fn watch(&self, bucket: &str) -> Result<WatchStream>;
+}
+```
+
+### Buckets
+
+Data is organized into logical buckets:
+
+| Bucket | Purpose |
+|--------|---------|
+| `v1/instances` | Endpoint instance registry |
+| `v1/mdc` | Model deployment cards |
+
+## Typed Prefix Watcher
+
+For type-safe watching of etcd prefixes:
+
+```rust
+// Watch and maintain HashMap of deserialized values
+let watcher = watch_prefix_with_extraction::<DiscoveryInstance>(
+    &etcd_client,
+    "/services/vllm-agg/",
+    lease_id_extractor,
+    value_extractor,
+).await?;
+
+// Receive updates via watch channel
+let instances = watcher.borrow();
+```
+
+**Key Extractors:**
+
+| Extractor | Description |
+|-----------|-------------|
+| `lease_id()` | Use lease ID as key |
+| `key_string()` | Extract key with prefix stripping |
+| `full_key_string()` | Use full etcd key |
+
+## Reliability Features
+
+### Connection Resilience
+
+**etcd Reconnection:**
+- Exponential backoff: 50ms to 5s
+- Deadline-based retry logic
+- Mutex ensures single concurrent reconnect
+
+**NATS Reconnection:**
+- Built-in reconnection in NATS client
+- Configurable max reconnect attempts
+- Buffering during disconnection
+
+### Lease-Based Cleanup
+
+When a worker crashes or loses connectivity:
+
+1. Keep-alive heartbeats stop
+2. Lease expires after TTL (10 seconds)
+3. All registered endpoints automatically deleted
+4. Clients receive DELETE watch events
+5. Traffic reroutes to healthy workers
+
+### Transaction Safety
+
+etcd transactions ensure atomic operations:
+
+```rust
+// Atomic create-if-not-exists
+let txn = Txn::new()
+    .when([Compare::create_revision(key, CompareOp::Equal, 0)])
+    .and_then([Op::put(key, value, options)]);
+
+etcd_client.txn(txn).await?;
+```
+
+This prevents race conditions in concurrent service registration.
+
+## Operational Modes
+
+### Kubernetes Mode (Requires Explicit Configuration)
+
+Native Kubernetes service discovery:
+
+```bash
+# Operator explicitly sets this (not auto-detected):
+export DYN_DISCOVERY_BACKEND=kubernetes
+
+# Workers register via K8s CRDs
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+
+# Frontend discovers workers via K8s API
+python -m dynamo.frontend
+```
+
+No etcd or NATS required for basic operation when using K8s discovery.
+
+### KV Store Mode (Global Default)
+
+Full service discovery with etcd:
+
+```bash
+# This is the default - no configuration needed
+# export DYN_DISCOVERY_BACKEND=kv_store  # (implicit)
+
+# Workers register with etcd
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+
+# Frontend discovers workers via etcd
+python -m dynamo.frontend
+```
+
+### KV-Aware Routing (Optional)
+
+Enable NATS for KV cache event tracking:
+
+```bash
+# Default: KV events enabled (requires NATS)
+python -m dynamo.frontend --router-mode kv
+
+# Disable KV events for prediction-based routing (no NATS)
+python -m dynamo.frontend --router-mode kv --no-kv-events
+```
+
+With `--no-kv-events`:
+- Router predicts cache state based on routing decisions
+- TTL-based expiration and LRU pruning
+- No NATS infrastructure required
+
+## Best Practices
+
+### 1. Use Kubernetes Discovery on K8s
+
+The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required when using the operator.
+
+### 2. For Bare Metal: Deploy etcd Cluster
+
+For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.
+
+### 3. Configure Appropriate TTLs (etcd mode)
+
+Balance between detection speed and overhead:
+
+- **Short TTL (5s)**: Faster failure detection, more keep-alive traffic
+- **Long TTL (30s)**: Less overhead, slower detection
+
+### 4. KV Routing Without NATS
+
+For simpler deployments without NATS:
+
+```bash
+# Use prediction-based KV routing
+python -m dynamo.frontend --router-mode kv --no-kv-events
+```
+
+This provides KV-aware routing with reduced accuracy but no NATS dependency.
+
+## Related Documentation
+
+- [Distributed Runtime](distributed_runtime.md) - Runtime architecture
+- [Request Plane](request_plane.md) - Request transport configuration
+- [Fault Tolerance](../fault_tolerance/README.md) - Failure handling
--- a/docs/guides/request_plane.md
+++ b/docs/guides/request_plane.md
--- a/docs/guides/jail_stream_readme.md
+++ b/docs/guides/jail_stream_readme.md
--- a/docs/fault_tolerance/README.md
+++ b/docs/fault_tolerance/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Fault Tolerance
+
+Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability.
+
+## Overview
+
+Fault tolerance in Dynamo operates at multiple levels:
+
+| Layer | Mechanism | Purpose |
+|-------|-----------|---------|
+| **Request** | Migration, Cancellation | Handle in-flight request failures |
+| **Worker** | Health Checks, Graceful Shutdown | Detect and recover from worker failures |
+| **System** | Load Shedding, Request Rejection | Prevent system overload |
+| **Infrastructure** | etcd HA, NATS resilience | Handle infrastructure component failures |
+
+## Key Features
+
+### Request Migration
+
+When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers. The migration system:
+
+- Preserves partial generation state (accumulated tokens)
+- Transparently continues generation on a new worker
+- Maintains seamless token flow to clients
+
+See [Request Migration](request_migration.md) for details.
+
+### Request Cancellation
+
+Dynamo supports canceling in-flight requests to free computational resources:
+
+- Graceful stop signals for clean termination
+- Kill signals for immediate termination
+- Hierarchical cancellation propagation through request chains
+
+See [Request Cancellation](request_cancellation.md) for details.
+
+### Graceful Shutdown
+
+Workers handle shutdown signals (SIGTERM/SIGINT) gracefully:
+
+- Immediately stop accepting new requests
+- Optionally drain in-flight requests before terminating
+- Clean up resources (engines, connections, temp files)
+
+See [Graceful Shutdown](graceful_shutdown.md) for details.
+
+### Request Rejection (Load Shedding)
+
+When workers are overloaded, Dynamo rejects new requests to prevent cascading failures:
+
+- Configurable busy thresholds based on KV cache utilization
+- Real-time worker load monitoring
+- HTTP 503 responses with retry guidance
+
+See [Request Rejection](request_rejection.md) for details.
+
+### Health Checks
+
+Dynamo provides multiple health check mechanisms:
+
+- **HTTP Endpoints**: `/health` and `/live` endpoints for orchestration
+- **Canary Health Checks**: Active monitoring via periodic test requests
+- **Engine Monitoring**: Automatic shutdown on engine failure detection
+
+See [Health Checks](../observability/health-checks.md) for details.
+
+## Configuration Quick Reference
+
+| Feature | Environment Variable | Default |
+|---------|---------------------|---------|
+| Worker health port | `DYN_SYSTEM_PORT` | `9090` |
+| Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` (K8s: `true`) |
+| Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds |
+| Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds |
+| Decode blocks threshold | `--active-decode-blocks-threshold` | None (disabled) |
+| Prefill tokens threshold | `--active-prefill-tokens-threshold` | None (disabled) |
+
+## Failure Scenarios and Recovery
+
+### Worker Pod Restart
+
+1. Worker receives SIGTERM from Kubernetes
+2. Endpoints are immediately invalidated (no new requests)
+3. In-flight requests complete or migrate (based on configuration)
+4. Resources are cleaned up
+5. Pod restarts with fresh state
+
+### Worker Crash (Unexpected)
+
+1. etcd lease expires (TTL-based detection)
+2. Client discovers endpoint removal via etcd watch
+3. New requests route to remaining healthy workers
+4. In-flight requests on crashed worker are migrated (if enabled)
+
+### Network Partition
+
+1. Worker loses connectivity to etcd/NATS
+2. Lease keep-alive fails, lease eventually expires
+3. Worker is removed from service discovery
+4. Traffic reroutes to reachable workers
+
+### GPU Failure
+
+1. Engine health check detects GPU error (XID, OOM, etc.)
+2. Worker initiates graceful shutdown
+3. Runtime is shut down, engine cleaned up
+4. Process exits with code 1 for pod restart
+
+## Testing Fault Tolerance
+
+Dynamo includes a comprehensive testing framework for validating fault tolerance:
+
+- Request cancellation tests
+- Migration tests with worker failures
+- etcd HA failover tests
+- Hardware fault injection (GPU XID, network partitions)
+
+See [Fault Tolerance Testing](testing.md) for details.
+
+## Related Documentation
+
+- [Observability](../observability/README.md) - Metrics and monitoring
+- [Distributed Runtime](../design_docs/distributed_runtime.md) - Service discovery architecture
+- [Event Plane](../design_docs/event_plane.md) - etcd and NATS coordination
--- a/docs/fault_tolerance/graceful_shutdown.md
+++ b/docs/fault_tolerance/graceful_shutdown.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Graceful Shutdown
+
+This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
+
+## Overview
+
+Graceful shutdown in Dynamo ensures that:
+
+1. **No new requests are accepted** - Endpoints are immediately invalidated
+2. **In-flight requests complete** - Existing requests finish processing (configurable)
+3. **Resources are cleaned up** - Engines, connections, and temporary files are released
+4. **Pods restart cleanly** - Exit codes signal Kubernetes for proper restart behavior
+
+## Signal Handling
+
+All Dynamo components handle Unix signals for graceful shutdown:
+
+| Signal | Trigger | Behavior |
+|--------|---------|----------|
+| `SIGTERM` | Kubernetes pod termination | Graceful shutdown initiated |
+| `SIGINT` | Ctrl+C / manual interrupt | Graceful shutdown initiated |
+
+### Implementation
+
+Each component registers signal handlers at startup:
+
+```python
+def signal_handler():
+    asyncio.create_task(graceful_shutdown(runtime))
+
+for sig in (signal.SIGTERM, signal.SIGINT):
+    loop.add_signal_handler(sig, signal_handler)
+```
+
+The `graceful_shutdown()` function:
+1. Logs the shutdown signal
+2. Calls `runtime.shutdown()` to invalidate endpoints
+3. Waits for in-flight requests (based on configuration)
+4. Returns to allow cleanup to proceed
+
+## Endpoint Draining
+
+When `runtime.shutdown()` is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter when serving the endpoint.
+
+### Configuration
+
+When registering an endpoint, the `graceful_shutdown` parameter controls draining behavior:
+
+```python
+generate_endpoint.serve_endpoint(
+    handler.generate,
+    graceful_shutdown=True,  # Wait for all requests to finish
+    metrics_labels=[("model", model_name)],
+    health_check_payload=health_check_payload,
+)
+```
+
+| `graceful_shutdown` | Behavior |
+|---------------------|----------|
+| `True` | Wait for all in-flight requests to complete before returning |
+| `False` | Return immediately without waiting for requests |
+
+### Component-Specific Behavior
+
+| Component | Default Behavior | Rationale |
+|-----------|------------------|-----------|
+| **Frontend** | N/A (HTTP server) | HTTP server handles its own shutdown |
+| **Prefill Workers** | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation |
+| **Decode Workers** | Conditional | If migration is enabled (`migration_limit > 0`), shutdown immediately to allow migration; otherwise wait |
+| **Router** | `graceful_shutdown=True` | Ensure routing decisions complete |
+
+### Decode Worker Migration Integration
+
+Decode workers use conditional draining based on whether request migration is supported:
+
+```python
+generate_endpoint.serve_endpoint(
+    handler.generate,
+    graceful_shutdown=config.migration_limit <= 0,  # If no migration, wait for requests
+    ...
+)
+```
+
+When `migration_limit > 0`:
+- Worker shuts down immediately (`graceful_shutdown=False`)
+- In-flight requests are migrated to healthy workers
+- No request loss occurs
+
+When `migration_limit <= 0`:
+- Worker waits for in-flight requests (`graceful_shutdown=True`)
+- Migration is not available
+- Requests complete on the shutting-down worker
+
+## Resource Cleanup
+
+After endpoint draining, components clean up their resources in `finally` blocks:
+
+### vLLM Worker Cleanup
+
+```python
+finally:
+    logger.debug("Cleaning up worker")
+    handler.cleanup()
+```
+
+The handler's `cleanup()` method:
+- Removes temporary directories (LoRA adapters, etc.)
+- Releases engine resources
+
+### SGLang Worker Cleanup
+
+```python
+def cleanup(self) -> None:
+    # Cancel pending consume tasks
+    for task in self._consume_tasks:
+        if not task.done():
+            task.cancel()
+    self._consume_tasks.clear()
+
+    # Shutdown engine
+    self.engine.shutdown()
+```
+
+### TensorRT-LLM Worker Cleanup
+
+```python
+async def cleanup(self):
+    if self._llm:
+        try:
+            self._llm.shutdown()
+        except Exception as e:
+            logging.error(f"Error during cleanup: {e}")
+        finally:
+            self._llm = None
+```
+
+## Error-Initiated Shutdown
+
+Workers can initiate graceful shutdown when fatal errors occur:
+
+### Engine Health Monitoring (vLLM)
+
+The `VllmEngineMonitor` continuously checks engine health:
+
+```python
+async def _check_engine_health(self):
+    while True:
+        try:
+            await self.engine_client.check_health()
+            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
+        except EngineDeadError as e:
+            logger.error(f"Health check failed: {e}")
+            self._shutdown_engine()
+            self.runtime.shutdown()
+            os._exit(1)
+```
+
+Configuration:
+- `HEALTH_CHECK_INTERVAL`: 2 seconds between checks
+- `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds max for engine shutdown
+
+### Fatal Error Handling (TensorRT-LLM)
+
+```python
+async def _initiate_shutdown(self, error: Exception):
+    logging.warning(f"Initiating graceful shutdown due to: {error}")
+
+    try:
+        if self.runtime:
+            self.runtime.shutdown()
+        if self.engine:
+            await self.engine.cleanup()
+    except Exception as cleanup_error:
+        logging.error(f"Error during graceful shutdown: {cleanup_error}")
+    finally:
+        logging.critical("Forcing process exit for restart")
+        os._exit(1)
+```
+
+## Kubernetes Integration
+
+### Pod Termination Flow
+
+1. Kubernetes sends `SIGTERM` to the pod
+2. Dynamo initiates graceful shutdown
+3. Pod has `terminationGracePeriodSeconds` to complete (default: 30s)
+4. If not terminated, Kubernetes sends `SIGKILL`
+
+### Recommended Configuration
+
+For production deployments, configure adequate termination grace period:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+spec:
+  services:
+    VllmWorker:
+      extraPodSpec:
+        terminationGracePeriodSeconds: 60  # Allow time for request draining
+```
+
+### Health Check Integration
+
+Kubernetes uses health endpoints to determine pod readiness:
+
+- **During shutdown**: Endpoints become unavailable
+- **Readiness probe fails**: Traffic stops routing to the pod
+- **Graceful draining**: Existing requests complete
+
+## Best Practices
+
+### 1. Set Appropriate Grace Periods
+
+Match `terminationGracePeriodSeconds` to your expected request completion time:
+- Short requests (< 10s): 30s grace period
+- Long generation (> 30s): 120s+ grace period
+
+### 2. Enable Request Migration for Decode Workers
+
+If using disaggregated serving, enable migration for decode workers:
+
+```python
+--migration-limit 3  # Allow up to 3 migration attempts
+```
+
+This allows immediate shutdown while preserving request state.
+
+### 3. Monitor Shutdown Metrics
+
+Track shutdown behavior via logs:
+
+```
+INFO  Received shutdown signal, shutting down DistributedRuntime
+INFO  DistributedRuntime shutdown complete
+DEBUG Cleaning up worker
+```
+
+### 4. Handle Cleanup Errors
+
+Ensure cleanup methods handle errors gracefully:
+
+```python
+def cleanup(self):
+    for resource in self.resources:
+        try:
+            resource.cleanup()
+        except Exception as e:
+            logger.warning(f"Cleanup failed: {e}")
+            # Continue with other resources
+```
+
+## Related Documentation
+
+- [Request Migration](request_migration.md) - How requests migrate during shutdown
+- [Request Cancellation](request_cancellation.md) - Canceling in-flight requests
+- [Health Checks](../observability/health-checks.md) - Liveness and readiness probes
--- a/docs/fault_tolerance/request_rejection.md
+++ b/docs/fault_tolerance/request_rejection.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Request Rejection (Load Shedding)
+
+This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
+
+## Overview
+
+Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:
+
+- Cascading failures from resource exhaustion
+- Degraded latency for all requests
+- Out-of-memory conditions on GPU workers
+
+When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.
+
+## Architecture
+
+```
+                                    ┌─────────────────┐
+                                    │  Worker Monitor │
+                                    │  (Background)   │
+                                    └────────┬────────┘
+                                             │ Updates busy list
+                                             ▼
+┌──────────┐    ┌──────────┐    ┌─────────────────────┐    ┌──────────┐
+│  Client  │───▶│ Frontend │───▶│    Push Router      │───▶│  Worker  │
+└──────────┘    └──────────┘    │ (checks busy list)  │    └──────────┘
+                                └─────────────────────┘
+                                         │
+                                         │ If all workers busy
+                                         ▼
+                                ┌─────────────────────┐
+                                │   HTTP 503 Error    │
+                                │ "All workers busy"  │
+                                └─────────────────────┘
+```
+
+## Configuration
+
+### Frontend Arguments
+
+Configure busy thresholds when starting the frontend:
+
+```bash
+python -m dynamo.frontend \
+    --active-decode-blocks-threshold 0.85 \
+    --active-prefill-tokens-threshold 10000
+```
+
+| Argument | Type | Description |
+|----------|------|-------------|
+| `--active-decode-blocks-threshold` | float (0.0-1.0) | KV cache block utilization threshold |
+| `--active-prefill-tokens-threshold` | int | Prefill token count threshold |
+
+### Dynamic Configuration via API
+
+Thresholds can be adjusted at runtime via the `/busy_threshold` endpoint:
+
+#### Set Thresholds
+
+```bash
+curl -X POST http://localhost:8000/busy_threshold \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "active_decode_blocks_threshold": 0.85,
+    "active_prefill_tokens_threshold": 10000
+  }'
+```
+
+#### Get Current Thresholds
+
+```bash
+curl http://localhost:8000/busy_threshold
+```
+
+Response:
+```json
+{
+  "thresholds": [
+    {
+      "model": "Qwen/Qwen3-0.6B",
+      "active_decode_blocks_threshold": 0.85,
+      "active_prefill_tokens_threshold": 10000
+    }
+  ]
+}
+```
+
+## Busy Detection Logic
+
+Workers are marked as "busy" based on a dual-threshold system. A worker is considered busy when **either** threshold is exceeded.
+
+### KV Cache Block Threshold
+
+Monitors the percentage of KV cache blocks in use:
+
+```
+busy = active_decode_blocks / kv_total_blocks > threshold
+```
+
+Example: With `active_decode_blocks_threshold=0.85`, a worker using 87% of its KV cache blocks is marked busy.
+
+### Prefill Token Threshold
+
+Monitors the number of tokens currently being prefilled:
+
+```
+busy = active_prefill_tokens > threshold
+```
+
+Example: With `active_prefill_tokens_threshold=10000`, a worker prefilling 12,000 tokens is marked busy.
+
+### Data-Parallel Rank Aggregation
+
+For workers with multiple data-parallel ranks (tensor parallelism), the worker is only marked busy if **ALL** ranks are busy:
+
+```python
+def is_busy(worker):
+    return all(rank.is_busy() for rank in worker.dp_ranks)
+```
+
+This prevents false positives when only some ranks are temporarily loaded.
+
+## Worker Load Monitoring
+
+The `KvWorkerMonitor` runs as a background task that:
+
+1. Subscribes to KV cache metrics events from workers
+2. Maintains load state for each worker instance
+3. Recalculates busy instances when metrics change
+4. Updates the router with the current busy list
+
+### Metrics Collected
+
+Workers publish these metrics for monitoring:
+
+| Metric | Description |
+|--------|-------------|
+| `active_decode_blocks` | Number of KV cache blocks currently in use |
+| `kv_total_blocks` | Total KV cache blocks available |
+| `active_prefill_tokens` | Number of tokens currently being prefilled |
+
+## Rejection Behavior
+
+### Request Flow
+
+1. Request arrives at frontend
+2. Push router checks if busy threshold is configured
+3. If configured, router retrieves list of free (non-busy) instances
+4. If no free instances exist (but instances are registered):
+   - Request is rejected with `PipelineError::ServiceOverloaded`
+   - HTTP 503 response is returned to client
+
+### Error Response
+
+When requests are rejected, clients receive:
+
+```http
+HTTP/1.1 503 Service Unavailable
+Content-Type: application/json
+
+{
+  "message": "Service temporarily unavailable: All workers are busy, please retry later",
+  "type": "service_unavailable",
+  "code": 503
+}
+```
+
+### Client Retry Strategy
+
+Clients should implement exponential backoff when receiving 503 responses:
+
+```python
+import time
+import random
+
+def send_with_retry(request, max_retries=5):
+    for attempt in range(max_retries):
+        response = send_request(request)
+        if response.status_code != 503:
+            return response
+
+        # Exponential backoff with jitter
+        wait_time = min(60, (2 ** attempt) + random.uniform(0, 1))
+        time.sleep(wait_time)
+
+    raise Exception("Max retries exceeded")
+```
+
+## Monitoring
+
+### Prometheus Metrics
+
+Track rejection behavior with these metrics:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `dynamo_tasks_rejected_total` | Counter | Total number of rejected tasks |
+| `dynamo_queued_requests` | Gauge | Requests waiting in HTTP queue |
+
+### Example Prometheus Queries
+
+```promql
+# Rejection rate over 5 minutes
+rate(dynamo_tasks_rejected_total[5m])
+
+# Percentage of requests rejected
+sum(rate(dynamo_tasks_rejected_total[5m])) /
+sum(rate(dynamo_tasks_issued_total[5m])) * 100
+```
+
+### Grafana Alerting
+
+Example alert for high rejection rate:
+
+```yaml
+alert: HighRequestRejectionRate
+expr: |
+  sum(rate(dynamo_tasks_rejected_total[5m])) /
+  sum(rate(dynamo_tasks_issued_total[5m])) > 0.1
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "High request rejection rate"
+  description: "More than 10% of requests are being rejected"
+```
+
+## Tuning Thresholds
+
+### Conservative Settings (Latency-Focused)
+
+For applications prioritizing low latency:
+
+```bash
+--active-decode-blocks-threshold 0.70
+--active-prefill-tokens-threshold 5000
+```
+
+- Rejects earlier, before workers become fully loaded
+- Maintains lower queue depths
+- Better tail latencies
+
+### Aggressive Settings (Throughput-Focused)
+
+For applications prioritizing throughput:
+
+```bash
+--active-decode-blocks-threshold 0.95
+--active-prefill-tokens-threshold 20000
+```
+
+- Allows higher worker utilization
+- May increase latency variability
+- Better overall throughput
+
+### Disabled (No Rejection)
+
+To disable request rejection entirely:
+
+```bash
+# Simply don't set the threshold arguments
+python -m dynamo.frontend
+```
+
+Without thresholds configured, all requests are accepted regardless of worker load.
+
+## Best Practices
+
+### 1. Start Conservative, Then Tune
+
+Begin with conservative thresholds and increase based on observed behavior:
+
+```bash
+# Start here
+--active-decode-blocks-threshold 0.75
+
+# Increase if rejection rate is too high
+--active-decode-blocks-threshold 0.85
+```
+
+### 2. Monitor Before Enabling
+
+Observe worker load patterns before setting thresholds:
+
+```bash
+# Watch KV cache utilization
+watch -n 1 'curl -s localhost:8000/metrics | grep kv_blocks'
+```
+
+### 3. Use Both Thresholds for Disaggregated Serving
+
+In disaggregated deployments:
+- Use `active_prefill_tokens_threshold` for prefill workers
+- Use `active_decode_blocks_threshold` for decode workers
+
+### 4. Coordinate with Autoscaling
+
+If using Kubernetes HPA, ensure rejection thresholds trigger before autoscaling:
+
+```yaml
+# HPA triggers at 70% utilization
+# Rejection at 85% provides buffer
+--active-decode-blocks-threshold 0.85
+```
+
+## Related Documentation
+
+- [Health Checks](../observability/health-checks.md) - Worker health monitoring
+- [Metrics](../observability/metrics.md) - Available Prometheus metrics
+- [Request Migration](request_migration.md) - Handling failed requests
--- a/docs/fault_tolerance/testing.md
+++ b/docs/fault_tolerance/testing.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Fault Tolerance Testing
+
+This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios.
+
+## Overview
+
+Dynamo's fault tolerance test suite is located in `tests/fault_tolerance/` and includes:
+
+| Test Category | Location | Purpose |
+|---------------|----------|---------|
+| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
+| Migration | `migration/` | Request migration when workers fail |
+| etcd HA | `etcd_ha/` | etcd failover and recovery |
+| Hardware | `hardware/` | GPU and network fault injection |
+| Deployment | `deploy/` | End-to-end deployment testing |
+
+## Test Directory Structure
+
+```
+tests/fault_tolerance/
+├── cancellation/
+│   ├── test_vllm.py
+│   ├── test_trtllm.py
+│   ├── test_sglang.py
+│   └── utils.py
+├── migration/
+│   ├── test_vllm.py
+│   ├── test_trtllm.py
+│   ├── test_sglang.py
+│   └── utils.py
+├── etcd_ha/
+│   ├── test_vllm.py
+│   ├── test_trtllm.py
+│   ├── test_sglang.py
+│   └── utils.py
+├── hardware/
+│   └── fault_injection_service/
+│       ├── api_service/
+│       └── agents/
+├── deploy/
+│   ├── test_deployment.py
+│   ├── scenarios.py
+│   ├── base_checker.py
+│   └── ...
+└── client.py
+```
+
+## Request Cancellation Tests
+
+Test that in-flight requests can be properly canceled.
+
+### Running Cancellation Tests
+
+```bash
+# Run all cancellation tests
+pytest tests/fault_tolerance/cancellation/ -v
+
+# Run for specific backend
+pytest tests/fault_tolerance/cancellation/test_vllm.py -v
+```
+
+### Cancellation Test Utilities
+
+The `cancellation/utils.py` module provides:
+
+#### CancellableRequest
+
+Thread-safe request cancellation via TCP socket manipulation:
+
+```python
+from tests.fault_tolerance.cancellation.utils import CancellableRequest
+
+request = CancellableRequest()
+
+# Send request in separate thread
+thread = Thread(target=send_request, args=(request,))
+thread.start()
+
+# Cancel after some time
+time.sleep(1)
+request.cancel()  # Closes underlying socket
+```
+
+#### send_completion_request / send_chat_completion_request
+
+Send cancellable completion requests:
+
+```python
+from tests.fault_tolerance.cancellation.utils import (
+    send_completion_request,
+    send_chat_completion_request
+)
+
+# Non-streaming
+response = send_completion_request(
+    base_url="http://localhost:8000",
+    model="Qwen/Qwen3-0.6B",
+    prompt="Hello, world!",
+    max_tokens=100
+)
+
+# Streaming with cancellation
+responses = send_chat_completion_request(
+    base_url="http://localhost:8000",
+    model="Qwen/Qwen3-0.6B",
+    messages=[{"role": "user", "content": "Hello!"}],
+    stream=True,
+    cancellable_request=request
+)
+```
+
+#### poll_for_pattern
+
+Wait for specific patterns in logs:
+
+```python
+from tests.fault_tolerance.cancellation.utils import poll_for_pattern
+
+# Wait for cancellation confirmation
+found = poll_for_pattern(
+    log_file="/var/log/dynamo/worker.log",
+    pattern="Request cancelled",
+    timeout=30,
+    interval=0.5
+)
+```
+
+## Migration Tests
+
+Test that requests migrate to healthy workers when failures occur.
+
+### Running Migration Tests
+
+```bash
+# Run all migration tests
+pytest tests/fault_tolerance/migration/ -v
+
+# Run for specific backend
+pytest tests/fault_tolerance/migration/test_vllm.py -v
+```
+
+### Migration Test Utilities
+
+The `migration/utils.py` module provides:
+
+- Frontend wrapper with configurable request planes
+- Long-running request spawning for migration scenarios
+- Health check disabling for controlled testing
+
+### Example Migration Test
+
+```python
+def test_migration_on_worker_failure():
+    # Start deployment with 2 workers
+    deployment = start_deployment(workers=2)
+
+    # Send long-running request
+    request_thread = spawn_long_request(max_tokens=1000)
+
+    # Kill one worker mid-generation
+    kill_worker(deployment.workers[0])
+
+    # Verify request completes on remaining worker
+    response = request_thread.join()
+    assert response.status_code == 200
+    assert len(response.tokens) > 0
+```
+
+## etcd HA Tests
+
+Test system behavior during etcd failures and recovery.
+
+### Running etcd HA Tests
+
+```bash
+pytest tests/fault_tolerance/etcd_ha/ -v
+```
+
+### Test Scenarios
+
+- **Leader failover**: etcd leader node fails, cluster elects new leader
+- **Network partition**: etcd node becomes unreachable
+- **Recovery**: System recovers after etcd becomes available
+
+## Hardware Fault Injection
+
+The fault injection service enables testing under simulated hardware failures.
+
+### Fault Injection Service
+
+Located at `tests/fault_tolerance/hardware/fault_injection_service/`, this FastAPI service orchestrates fault injection:
+
+```bash
+# Start the fault injection service
+cd tests/fault_tolerance/hardware/fault_injection_service
+python -m api_service.main
+```
+
+### Supported Fault Types
+
+#### GPU Faults
+
+| Fault Type | Description |
+|------------|-------------|
+| `XID_ERROR` | Simulate GPU XID error (various codes) |
+| `THROTTLE` | GPU thermal throttling |
+| `MEMORY_PRESSURE` | GPU memory exhaustion |
+| `OVERHEAT` | GPU overheating condition |
+| `COMPUTE_OVERLOAD` | GPU compute saturation |
+
+#### Network Faults
+
+| Fault Type | Description |
+|------------|-------------|
+| `FRONTEND_WORKER` | Partition between frontend and workers |
+| `WORKER_NATS` | Partition between workers and NATS |
+| `WORKER_WORKER` | Partition between workers |
+| `CUSTOM` | Custom network partition |
+
+### Fault Injection API
+
+#### Inject GPU Fault
+
+```bash
+curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
+  -H "Content-Type: application/json" \
+  -d '{
+    "target_pod": "vllm-worker-0",
+    "fault_type": "XID_ERROR",
+    "severity": "HIGH"
+  }'
+```
+
+#### Inject Specific XID Error
+
+```bash
+# Inject XID 79 (GPU memory page fault)
+curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
+  -H "Content-Type: application/json" \
+  -d '{"target_pod": "vllm-worker-0"}'
+```
+
+Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120
+
+#### Inject Network Partition
+
+```bash
+curl -X POST http://localhost:8080/api/v1/faults/network/inject \
+  -H "Content-Type: application/json" \
+  -d '{
+    "partition_type": "FRONTEND_WORKER",
+    "duration_seconds": 30
+  }'
+```
+
+#### Recover from Fault
+
+```bash
+curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
+```
+
+#### List Active Faults
+
+```bash
+curl http://localhost:8080/api/v1/faults
+```
+
+### GPU Fault Injector Agent
+
+The GPU fault injector runs as a DaemonSet on worker nodes:
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: gpu-fault-injector
+spec:
+  selector:
+    matchLabels:
+      app: gpu-fault-injector
+  template:
+    spec:
+      containers:
+      - name: agent
+        image: dynamo/gpu-fault-injector:latest
+        securityContext:
+          privileged: true
+        volumeMounts:
+        - name: dev
+          mountPath: /dev
+```
+
+The agent injects fake XID messages via `/dev/kmsg` to trigger NVSentinel detection.
+
+## Deployment Testing Framework
+
+The `deploy/` directory contains an end-to-end testing framework.
+
+### Test Phases
+
+Tests run through three phases:
+
+| Phase | Description |
+|-------|-------------|
+| `STANDARD` | Baseline performance under normal conditions |
+| `OVERFLOW` | System behavior during fault/overload |
+| `RECOVERY` | System recovery after fault resolution |
+
+### Scenario Configuration
+
+Define test scenarios in `scenarios.py`:
+
+```python
+from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure
+
+scenario = Scenario(
+    name="worker_failure_migration",
+    backend="vllm",
+    load=Load(
+        clients=10,
+        requests_per_client=100,
+        max_tokens=256
+    ),
+    failure=Failure(
+        type="pod_kill",
+        target="vllm-worker-0",
+        trigger_after_requests=50
+    )
+)
+```
+
+### Running Deployment Tests
+
+```bash
+# Run all deployment tests
+pytest tests/fault_tolerance/deploy/test_deployment.py -v
+
+# Run specific scenario
+pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
+```
+
+### Validation Checkers
+
+The framework includes pluggable validators:
+
+```python
+from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext
+
+class MigrationChecker(BaseChecker):
+    def check(self, context: ValidationContext) -> bool:
+        # Verify migrations occurred
+        migrations = context.metrics.get("migrations_total", 0)
+        return migrations > 0
+```
+
+### Results Parsing
+
+Parse test results for analysis:
+
+```python
+from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test
+
+results = process_overflow_recovery_test(log_dir="/path/to/logs")
+print(f"Success rate: {results['success_rate']}")
+print(f"P99 latency: {results['p99_latency_ms']}ms")
+```
+
+## Client Utilities
+
+The `client.py` module provides shared client functionality:
+
+### Multi-Threaded Load Generation
+
+```python
+from tests.fault_tolerance.client import client
+
+# Generate load with multiple clients
+results = client(
+    base_url="http://localhost:8000",
+    num_clients=10,
+    requests_per_client=100,
+    model="Qwen/Qwen3-0.6B",
+    max_tokens=256,
+    log_dir="/tmp/test_logs"
+)
+```
+
+### Request Options
+
+| Parameter | Description |
+|-----------|-------------|
+| `base_url` | Frontend URL |
+| `num_clients` | Number of concurrent clients |
+| `requests_per_client` | Requests per client |
+| `model` | Model name |
+| `max_tokens` | Max tokens per request |
+| `log_dir` | Directory for client logs |
+| `endpoint` | `completions` or `chat/completions` |
+
+## Running the Full Test Suite
+
+### Prerequisites
+
+1. Kubernetes cluster with GPU nodes
+2. Dynamo deployment
+3. etcd cluster (for HA tests)
+4. Fault injection service (for hardware tests)
+
+### Environment Setup
+
+```bash
+export KUBECONFIG=/path/to/kubeconfig
+export DYNAMO_NAMESPACE=dynamo-test
+export FRONTEND_URL=http://localhost:8000
+```
+
+### Run All Tests
+
+```bash
+# Install test dependencies
+pip install pytest pytest-asyncio
+
+# Run all fault tolerance tests
+pytest tests/fault_tolerance/ -v --tb=short
+
+# Run with specific markers
+pytest tests/fault_tolerance/ -v -m "not slow"
+```
+
+### Test Markers
+
+| Marker | Description |
+|--------|-------------|
+| `slow` | Long-running tests (> 5 minutes) |
+| `gpu` | Requires GPU resources |
+| `k8s` | Requires Kubernetes cluster |
+| `etcd_ha` | Requires multi-node etcd |
+
+## Best Practices
+
+### 1. Isolate Test Environments
+
+Run fault tolerance tests in dedicated namespaces:
+
+```bash
+kubectl create namespace dynamo-fault-test
+```
+
+### 2. Clean Up After Tests
+
+Ensure fault injection is recovered:
+
+```bash
+# List and recover all active faults
+curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
+  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
+```
+
+### 3. Collect Logs
+
+Preserve logs for debugging:
+
+```bash
+pytest tests/fault_tolerance/ -v \
+  --log-dir=/tmp/fault_test_logs \
+  --capture=no
+```
+
+### 4. Monitor During Tests
+
+Watch system state during tests:
+
+```bash
+# Terminal 1: Watch pods
+watch kubectl get pods -n dynamo-test
+
+# Terminal 2: Watch metrics
+watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
+```
+
+## Related Documentation
+
+- [Request Migration](request_migration.md) - Migration implementation details
+- [Request Cancellation](request_cancellation.md) - Cancellation implementation
+- [Health Checks](../observability/health-checks.md) - Health monitoring
+- [Metrics](../observability/metrics.md) - Available metrics for monitoring
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -37,14 +37,19 @@
   kvbm/vllm-setup.md
   kvbm/trtllm-setup.md
   agents/tool-calling.md
-   guides/jail_stream_readme.md
-   guides/request_plane.md
+   development/jail_stream.md

   router/kv_cache_routing.md
   router/kv_events.md
   planner/load_planner.md
+   fault_tolerance/README.md
   fault_tolerance/request_migration.md
   fault_tolerance/request_cancellation.md
+   fault_tolerance/graceful_shutdown.md
+   fault_tolerance/request_rejection.md
+   fault_tolerance/testing.md
+   design_docs/request_plane.md
+   design_docs/event_plane.md

   backends/trtllm/multinode/multinode-examples.md
   backends/trtllm/llama4_plus_eagle.md
@@ -81,5 +86,5 @@

 ..   TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
     have some outdated names/references and need a refresh.
-..   TODO: Add an OpenAI frontend doc and then add top-level Frontends section
-     to index.rst pointing to both OpenAI HTTP and KServe GRPC docs.
+..   TODO: Add an OpenAI frontend doc to complement the KServe GRPC doc
+     in the Frontends section.
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -64,6 +64,7 @@ Quickstart
   Tuning Disaggregated Performance <performance/tuning.md>
   Writing Python Workers in Dynamo <development/backend-guide.md>
   Observability (Local) <_sections/observability>
+   Fault Tolerance <_sections/fault_tolerance>
   Glossary <reference/glossary.md>

 .. toctree::
@@ -71,6 +72,7 @@ Quickstart
   :caption: Components

   Backends <_sections/backends>
+   Frontends <_sections/frontends>
   Router <router/README>
   Planner <planner/planner_intro>
   KVBM <kvbm/kvbm_intro>
@@ -83,3 +85,5 @@ Quickstart
   Architecture Flow <design_docs/dynamo_flow.md>
   Disaggregated Serving <design_docs/disagg_serving.md>
   Distributed Runtime <design_docs/distributed_runtime.md>
+   Request Plane <design_docs/request_plane.md>
+   Event Plane <design_docs/event_plane.md>