docs: migrate existing docs to fern (#5445)

Signed-off-by: Jont828 <jt572@cornell.edu> Signed-off-by: Neal Vaidya <nealv@nvidia.com> Co-authored-by: Neal Vaidya <nealv@nvidia.com>

Unverified Commit f9050aae authored Jan 26, 2026 by Jonathan Tong Committed by GitHub Jan 26, 2026
20 changed files
--- a/fern/pages/design-docs/distributed-runtime.md
+++ b/fern/pages/design-docs/distributed-runtime.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Dynamo Distributed Runtime"
+---
+
+## Overview
+
+Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., the Python bindings in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
+
+- `DistributedRuntime`: The highest-level object, which exposes the distributed runtime interface. It maintains connections to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
+- `Namespace`: A logical grouping of components that isolates different model deployments from one another.
+- `Component`: A discoverable object within a `Namespace` that represents a logical unit of workers.
+- `Endpoint`: A network-accessible service that provides a specific function.
+
+In theory, each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (the same applies to `Component`s within a `Namespace` and `Endpoint`s within a `Component`). In practice, each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. The components share the same namespace so that they can discover each other.
+
+For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple workers:
+
+- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the `Processor`.
+- `Processor`: When a new request arrives, `Processor` applies the chat template and performs the tokenization. Then, it routes the request to the `Worker`.
+- `Worker` components (e.g., `VllmDecodeWorker`, `SGLangDecodeWorker`, `TrtllmWorker`): Perform the actual computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
+
+Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`, and within it they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-agg`). Under that namespace, each has its own `Component`s. `Frontend` does not explicitly create endpoints; instead, it uses the `make_engine` function, which handles HTTP serving, routing, and worker discovery automatically. Worker components create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate`, `flush_cache`, or `clear_kv_blocks`. In summary: `DistributedRuntime`s are initialized in the respective main functions, `Namespace`s are configured in the deployment YAML, `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("worker")`), and `Endpoint`s are created using the `component.endpoint()` method.
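The hierarchy described above can be illustrated with a small, self-contained model. This is not the Dynamo Python binding: the classes `ToyRuntime`, `ToyNamespace`, `ToyComponent`, and `ToyEndpoint` are hypothetical stand-ins that only mirror the `runtime.namespace(...).component(...).endpoint(...)` call chain and the rule that names are unique within their parent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEndpoint:
    path: str  # "namespace/component/endpoint"


@dataclass
class ToyComponent:
    path: str
    endpoints: dict = field(default_factory=dict)

    def endpoint(self, name: str) -> ToyEndpoint:
        # Endpoint names are unique within a component: reuse if present.
        return self.endpoints.setdefault(name, ToyEndpoint(f"{self.path}/{name}"))


@dataclass
class ToyNamespace:
    name: str
    components: dict = field(default_factory=dict)

    def component(self, name: str) -> ToyComponent:
        # Component names are unique within a namespace.
        return self.components.setdefault(name, ToyComponent(f"{self.name}/{name}"))


@dataclass
class ToyRuntime:
    namespaces: dict = field(default_factory=dict)

    def namespace(self, name: str) -> ToyNamespace:
        # Namespace names are unique within a runtime.
        return self.namespaces.setdefault(name, ToyNamespace(name))


runtime = ToyRuntime()
generate = runtime.namespace("dynamo").component("worker").endpoint("generate")
print(generate.path)  # dynamo/worker/generate
```

Asking for the same name twice returns the same object, which is the toy equivalent of the uniqueness constraint.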
+
+## Initialization
+
+In this section, we explain what happens under the hood when `DistributedRuntime`, `Namespace`, `Component`, and `Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic mode, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is essentially dynamic mode without registration and discovery, and hence does not rely on etcd.
+
+:::caution
+The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts should remain the same.
+:::
+
+- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following two services:
+    - etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
+    - NATS (both static and dynamic mode): for messaging.
+
+  etcd and NATS are two global services (there could be multiple etcd and NATS instances for high availability).
+
+  For etcd, it also creates a primary lease and spins up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this `lease_id` to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task fails, the primary lease is revoked or expires, and the key-value pairs stored with this `lease_id` are removed.
+- `Namespace`: A `Namespace` is primarily a logical grouping mechanism and is not registered in etcd. It provides the root path for all components under it.
+- `Component`: When a `Component` object is created, similar to `Namespace`, it is not registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` as the service identifier and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
+- `Endpoint`: When an `Endpoint` object is created and started, it performs two key registrations:
+  - NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the pattern `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
+  - etcd Registration: The endpoint information is stored in etcd at a path following the pattern `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (e.g., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by the primary `lease_id` of their respective `DistributedRuntime`s.
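As an illustration of the two naming patterns above, the following sketch builds a NATS subject and an etcd path from their parts. The helper names `nats_subject` and `etcd_key` are hypothetical, and rendering the lease ID in hexadecimal in the etcd path is an assumption (the pattern only says `{lease_id}`); the actual format may differ, as the caution above notes.

```python
def nats_subject(namespace: str, service: str, endpoint: str, lease_id: int) -> str:
    """Subject pattern: {namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}."""
    return f"{namespace}.{service}.{endpoint}-{lease_id:x}"


def etcd_key(namespace: str, component: str, endpoint: str, lease_id: int) -> str:
    """Path pattern: /services/{namespace}/{component}/{endpoint}-{lease_id}.

    Assumption: the lease ID is written in hex here, mirroring the NATS subject.
    """
    return f"/services/{namespace}/{component}/{endpoint}-{lease_id:x}"


# Two workers of the same type differ only in the lease ID of their runtime.
key_a = etcd_key("dynamo", "worker", "generate", 0x1A2B)
key_b = etcd_key("dynamo", "worker", "generate", 0x3C4D)
print(nats_subject("dynamo", "worker", "generate", 0x1A2B))
print(key_a)
print(key_b)
```

Both keys share the prefix `/services/dynamo/worker/generate`, which is what lets a single prefix watch see every instance of the endpoint.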
+
+## Calling Endpoints
+
+Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the names of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with information about the available `Endpoint`s, including their `lease_id` and NATS subject.
+
+The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
+
+- `random`: randomly select an endpoint to hit
+- `round_robin`: select endpoints in round-robin order
+- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
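The three strategies can be sketched in a few lines of Python. This is a toy model, not the logic in `push_router.rs`: `ToyRouter` and its method names are hypothetical, and the list of lease IDs is fixed here, whereas the real `Client` keeps it up to date through the etcd watcher.

```python
import itertools
import random


class ToyRouter:
    def __init__(self, lease_ids):
        self.lease_ids = list(lease_ids)
        self._cycle = itertools.cycle(self.lease_ids)

    def pick_random(self) -> int:
        # `random`: any available endpoint, chosen uniformly.
        return random.choice(self.lease_ids)

    def pick_round_robin(self) -> int:
        # `round_robin`: walk the endpoints in order, wrapping around.
        return next(self._cycle)

    def pick_direct(self, lease_id: int) -> int:
        # `direct`: the caller names the target endpoint by lease ID.
        if lease_id not in self.lease_ids:
            raise KeyError(f"no endpoint with lease_id {lease_id}")
        return lease_id


router = ToyRouter([101, 102, 103])
print([router.pick_round_robin() for _ in range(4)])  # [101, 102, 103, 101]
```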
+
+After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
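The request/response pattern above (request over the message bus, response streamed back over a direct TCP connection) can be sketched with `asyncio` alone. Everything here is a stand-in: the NATS publish is replaced by a direct function call, and `toy_worker`, `call_endpoint`, and the `reply_to` field are hypothetical names, not Dynamo's wire format.

```python
import asyncio
import json


async def toy_worker(request: dict) -> None:
    """Stand-in for an Endpoint: connect back to the client and stream chunks."""
    host, port = request["reply_to"]
    _, writer = await asyncio.open_connection(host, port)
    for token in request["prompt"].split():
        # Each response chunk is serialized and sent as soon as it is ready.
        writer.write((json.dumps({"token": token}) + "\n").encode())
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def call_endpoint(prompt: str) -> list:
    chunks = []
    done = asyncio.Event()

    async def on_connect(reader, writer):
        async for line in reader:  # ends when the worker closes the stream
            chunks.append(json.loads(line)["token"])
        writer.close()
        done.set()

    # The client listens first, then advertises its address in the request.
    server = await asyncio.start_server(on_connect, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]
    # In Dynamo this request would be sent to the endpoint's NATS subject.
    task = asyncio.create_task(toy_worker({"prompt": prompt, "reply_to": [host, port]}))
    await done.wait()
    await task
    server.close()
    return chunks


print(asyncio.run(call_endpoint("hello from the worker")))
```

The point of the split is that the message bus only carries the small request, while the potentially long response stream uses a dedicated connection.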
+
+## Examples
+
+We provide native Rust and Python (through bindings) examples for basic usage of `DistributedRuntime`:
+
+- Rust: `/lib/runtime/examples/`
+- Python: Complete examples of using `DistributedRuntime` can be found in the engines in `components/src/dynamo`.
+
+
--- a/fern/pages/design-docs/dynamo-flow.md
+++ b/fern/pages/design-docs/dynamo-flow.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Dynamo Architecture Flow"
+---
+
+This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm). Color-coded flows indicate different types of operations:
+
+## 🔵 Main Request Flow (Blue)
+The primary user journey through the system:
+
+1. **Discovery (S1)**: Client discovers the service endpoint
+2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
+3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
+4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
+
+## 🟠 Decision and Allocation Flow (Orange)
+The system's intelligent routing and resource allocation:
+
+4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing
+5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill
+5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory
+6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
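The disaggregation decision in step 5 can be sketched as a simple predicate. The rule and both thresholds below are illustrative assumptions, not values taken from Dynamo: the source only states that the decision depends on prefill length and queue size, and subtracting the prefix cache hit from step 4 is an inference.

```python
def needs_remote_prefill(
    prompt_len: int,
    prefix_hit_len: int,
    queue_size: int,
    max_local_prefill_len: int = 512,  # hypothetical threshold
    max_queue_size: int = 4,           # hypothetical threshold
) -> bool:
    """Return True when the request should go to a remote prefill worker."""
    # Tokens already covered by a prefix cache hit need no prefill work.
    uncached_len = prompt_len - prefix_hit_len
    # Prefill remotely only for long prompts, and only if the queue has room.
    return uncached_len > max_local_prefill_len and queue_size < max_queue_size


print(needs_remote_prefill(prompt_len=2000, prefix_hit_len=0, queue_size=0))
print(needs_remote_prefill(prompt_len=2000, prefix_hit_len=1800, queue_size=0))
```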
+
+## 🟢 Prefill Worker Flow (Green)
+The dedicated prefill processing pipeline:
+
+7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
+8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
+9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens
+10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
+
+## 🟣 Completion Flow (Purple)
+The response generation and delivery:
+
+11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker
+12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data
+13. **Response (S13)**: The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
+
+## 🔗 Infrastructure Connections (Dotted lines)
+Coordination and messaging support:
+
+### ETCD Connections (Gray, dotted)
+- **Frontend, Processor, Planner**: Service discovery and registration
+- **Decode Worker, PrefillWorker**: NIXL metadata storage for GPU communication setup
+
+### NATS Connections (Teal, dotted)
+- **PrefillQueue**: JetStream consumer group for reliable work distribution
+- **Processor**: Load balancing across workers
+
+### Planning Connections (Gold, dotted)
+- **Frontend → Planner**: Metrics collection for auto-scaling decisions
+- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker
+
+## Technical Implementation Details
+
+### NIXL (NVIDIA Inference Xfer Library):
+- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
+- Decode Worker publishes GPU metadata to ETCD for coordination
+- PrefillWorker loads metadata to establish direct communication channels
+- Block-based transfers (64–128 tokens per block) for efficient batching
+
+### Disaggregated KV Cache:
+- Each Decode Worker maintains local KV cache in its GPU memory
+- No shared storage bottlenecks—all transfers are direct worker-to-worker
+- Pre-allocated blocks ensure deterministic memory layout and performance
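To make the block-based layout concrete, this small sketch computes how many KV cache blocks a Decode Worker would need to pre-allocate for a prompt. The function name `blocks_needed` is hypothetical, and the default block size of 64 tokens is just one value from the 64 to 128 range mentioned above.

```python
def blocks_needed(num_tokens: int, block_size: int = 64) -> int:
    """Number of fixed-size blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


print(blocks_needed(1000, block_size=64))   # 16
print(blocks_needed(1000, block_size=128))  # 8
```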
+
+```mermaid
+%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
+graph TD
+    %% Top Layer - Client & Frontend
+    Client["<b>HTTP Client</b>"]
+    S1[["<b>1 DISCOVERY</b>"]]
+    Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
+    S2[["<b>2 REQUEST</b>"]]
+
+    %% Processing Layer
+    Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"]
+    S3[["<b>3 VALIDATE</b>"]]
+
+    %% Infrastructure - Positioned strategically to minimize crossings
+    subgraph INF["<b>Infrastructure Layer</b>"]
+        ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")]
+        NATS[("<b>NATS</b><br/><i>Message Broker</i>")]
+        Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"]
+    end
+
+    %% Worker Layer - Main processing
+    subgraph WL["<b>Worker Layer</b>"]
+        %% VllmWorker section
+        VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"]
+        S4[["<b>4 QUERY</b>"]]
+        S5[["<b>5 DISAGG DECISION</b>"]]
+        S5a[["<b>5a ALLOCATE</b>"]]
+        S12[["<b>12 DECODE</b>"]]
+        S6[["<b>6 QUEUE</b>"]]
+        S13[["<b>13 RESPONSE</b>"]]
+
+        %% Storage positioned near workers
+        LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")]
+
+        %% Prefill System - Right side to minimize crossings
+        subgraph PS["<b>Prefill System</b>"]
+            PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"]
+            PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"]
+            S7[["<b>7 NATS PULL</b>"]]
+            S8[["<b>8 LOAD METADATA</b>"]]
+            S9[["<b>9 PREFILL</b>"]]
+            S10[["<b>10 NIXL TRANSFER</b>"]]
+            S11[["<b>11 NOTIFY</b>"]]
+        end
+    end
+
+    %% Main Request Flow (Blue) - Clean vertical flow
+    Client -.-> S1
+    S1 -->|HTTP API Call| Frontend
+    Frontend -.-> S2
+    S2 -->|Process & Validate| Processor
+    Processor -.-> S3
+    S3 -->|Route to Worker| VllmWorker
+
+    %% VllmWorker Internal Flow (Orange)
+    VllmWorker -.-> S4
+    S4 -->|Query Prefix Cache Hit| S5
+    S5 -->|Prefill Length & Queue Check| S5a
+    S5a -->|Continue to Decode| S12
+
+    %% Allocation & Queuing (Orange) - Minimize crossings
+    S5a -->|Allocate KV Cache Blocks| LocalKVCache
+    VllmWorker --> S6
+    S6 -->|Put RemotePrefillRequest| PrefillQueue
+
+    %% Prefill Worker Flow (Green) - Self-contained within PS
+    PrefillQueue -.-> S7
+    S7 -->|Consumer Group Pull| PrefillWorker
+    PrefillWorker -.-> S8
+    PrefillWorker -.-> S9
+    S9 -->|Execute Prefill| S10
+    S10 -->|Direct GPU Transfer| LocalKVCache
+    PrefillWorker --> S11
+
+    %% Return Flow (Purple) - Clean return path
+    S11 -->|Completion Notification| S12
+    S12 -->|Decode from KV Cache| S13
+    S13 -->|Post-process Response| Processor
+    Processor -->|HTTP Response| Frontend
+    Frontend -->|Final Response| Client
+
+    %% Infrastructure Connections - Organized to avoid crossings
+    %% ETCD Connections - Grouped by proximity
+    Frontend -.->|Service Discovery| ETCD
+    Processor -.->|Service Discovery| ETCD
+    VllmWorker -.->|NIXL Metadata| ETCD
+    PrefillWorker -.->|NIXL Metadata| ETCD
+    S8 -.->|Load NIXL Metadata| ETCD
+    Planner -.->|Service Discovery| ETCD
+
+    %% NATS Connections - Direct to queue system
+    PrefillQueue -.->|JetStream| NATS
+    Processor -.->|Load Balancing| NATS
+
+    %% Planning Connections - Strategic positioning
+    Frontend -.->|Metrics| Planner
+    Planner -.->|Auto-scaling| VllmWorker
+    Planner -.->|Auto-scaling| PrefillWorker
+
+    %% Styling - Each component with unique colors
+    classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
+    classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
+    classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
+    classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
+    classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px
+    classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
+    classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
+    classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
+    classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
+    classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px
+    classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
+    classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
+    classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
+
+
+    class Client client
+    class Frontend frontend
+    class Processor processor
+    class VllmWorker worker
+    class PrefillQueue prefillQueue
+    class PrefillWorker prefillWorker
+    class Planner planner
+    class LocalKVCache storage
+    class ETCD etcd
+    class NATS nats
+    class PS prefillBox
+    class INF infraLayer
+    class WL workerLayer
+
+
+
+    %% Flow Colors - Different line styles to reduce visual clutter
+    %% Main Request Flow - Blue (solid)
+    linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 1 stroke:#1565C0,stroke-width:4px
+    linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 3 stroke:#1565C0,stroke-width:4px
+    linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 5 stroke:#1565C0,stroke-width:4px
+
+    %% Decision & Allocation Flow - Orange (mixed)
+    linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 7 stroke:#E65100,stroke-width:4px
+    linkStyle 8 stroke:#E65100,stroke-width:4px
+    linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
+
+    %% KV Cache & Queue - Orange (solid)
+    linkStyle 10 stroke:#E65100,stroke-width:4px
+    linkStyle 11 stroke:#E65100,stroke-width:4px
+    linkStyle 12 stroke:#E65100,stroke-width:4px
+
+    %% Prefill Worker Flow - Green (mixed)
+    linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 14 stroke:#2E7D32,stroke-width:4px
+    linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 17 stroke:#2E7D32,stroke-width:4px
+    linkStyle 18 stroke:#2E7D32,stroke-width:4px
+    linkStyle 19 stroke:#2E7D32,stroke-width:4px
+
+    %% Completion Flow - Purple (mixed)
+    linkStyle 20 stroke:#6A1B9A,stroke-width:4px
+    linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
+    linkStyle 22 stroke:#6A1B9A,stroke-width:4px
+    linkStyle 23 stroke:#6A1B9A,stroke-width:4px
+    linkStyle 24 stroke:#6A1B9A,stroke-width:4px
+
+    %% Infrastructure Flows - Lighter and dotted to reduce visual noise
+    %% ETCD Connections - Gray (dotted, thinner)
+    linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
+
+    %% NATS Connections - Teal (dotted, thinner)
+    linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
+
+    %% Planning Connections - Gold (dotted, thinner)
+    linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
+    linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
+```
--- a/fern/pages/design-docs/event-plane.md
+++ b/fern/pages/design-docs/event-plane.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Event Plane Architecture"
+---
+
+This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.
+
+## Overview
+
+Dynamo's coordination layer adapts to the deployment environment:
+
+| Deployment | Service Discovery | KV Events | Request Plane |
+|------------|-------------------|-----------|---------------|
+| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
+| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
+
+<Note>
+The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
+</Note>
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                    Coordination Layer                                │
+│                                                                      │
+│  ┌─────────────────────────┐    ┌─────────────────────────────────┐ │
+│  │   Service Discovery     │    │            NATS                 │ │
+│  │                         │    │         (Optional)              │ │
+│  │  • K8s: CRDs + API      │    │  • KV Cache Events              │ │
+│  │  • Bare metal: etcd     │    │  • Router Replica Sync          │ │
+│  │                         │    │  • JetStream Persistence        │ │
+│  └─────────────────────────┘    └─────────────────────────────────┘ │
+│                                                                      │
+└─────────────────────────────────────────────────────────────────────┘
+                    │                          │
+         ┌──────────┴──────────┐    ┌─────────┴──────────┐
+         ▼                     ▼    ▼                    ▼
+    ┌─────────┐          ┌─────────┐              ┌─────────┐
+    │Frontend │          │ Planner │              │ Worker  │
+    └─────────┘          └─────────┘              └─────────┘
+```
+
+## Kubernetes-Native Service Discovery
+
+When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.
+
+### Configuration
+
+The operator explicitly sets:
+```bash
+DYN_DISCOVERY_BACKEND=kubernetes
+```
+
+<Warning>
+This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
+</Warning>
+
+### How It Works
+
+1. **DynamoWorkerMetadata CRD**: Workers register their endpoints by creating/updating DynamoWorkerMetadata custom resources
+2. **EndpointSlices**: Used to signal readiness status to the system
+3. **K8s API Watches**: Components watch for CRD changes to discover available endpoints
+
+### Benefits
+
+- No external etcd cluster required
+- Native integration with Kubernetes lifecycle
+- Automatic cleanup when pods terminate
+- Works with standard K8s RBAC
+
+### Environment Variables (Injected by Operator)
+
+| Variable | Description |
+|----------|-------------|
+| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
+| `POD_NAME` | Current pod name |
+| `POD_NAMESPACE` | Current namespace |
+| `POD_UID` | Pod unique identifier |
+
+---
+
+## etcd Architecture (Default for All Deployments)
+
+When `DYN_DISCOVERY_BACKEND=kv_store` (the global default), etcd is used for service discovery.
+
+### Connection Configuration
+
+etcd connection is configured via environment variables:
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` |
+| `ETCD_AUTH_USERNAME` | Basic auth username | None |
+| `ETCD_AUTH_PASSWORD` | Basic auth password | None |
+| `ETCD_AUTH_CA` | CA certificate path (TLS) | None |
+| `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None |
+| `ETCD_AUTH_CLIENT_KEY` | Client key path | None |
+
+Example:
+```bash
+export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379
+```
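A minimal sketch of how a process might read these settings, using only the variable names and defaults listed in this document. The helper names `discovery_backend` and `etcd_endpoints` are hypothetical and are not part of the Dynamo API.

```python
import os


def discovery_backend(env=os.environ) -> str:
    # The runtime defaults to kv_store (etcd) unless explicitly overridden.
    return env.get("DYN_DISCOVERY_BACKEND", "kv_store")


def etcd_endpoints(env=os.environ) -> list:
    # ETCD_ENDPOINTS is a comma-separated list of URLs.
    raw = env.get("ETCD_ENDPOINTS", "http://localhost:2379")
    return [url.strip() for url in raw.split(",") if url.strip()]


example_env = {"ETCD_ENDPOINTS": "http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379"}
print(discovery_backend(example_env))  # kv_store
print(etcd_endpoints(example_env))
```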
+
+### Lease Management
+
+Each `DistributedRuntime` maintains a primary lease with etcd:
+
+```
+┌────────────────────┐         ┌──────────────┐
+│ DistributedRuntime │◄────────│ Primary Lease │
+│                    │         │  TTL: 10s     │
+│  • Namespace       │         └───────┬───────┘
+│  • Components      │                 │
+│  • Endpoints       │                 │ Keep-Alive
+│                    │                 │ Heartbeat
+└────────────────────┘                 ▼
+                               ┌──────────────┐
+                               │     etcd     │
+                               └──────────────┘
+```
+
+**Lease Lifecycle:**
+
+1. **Creation**: Lease created during `DistributedRuntime` initialization
+2. **Keep-Alive**: Background task sends heartbeats at 50% of remaining TTL
+3. **Expiration**: If heartbeats stop, lease expires after TTL (10 seconds default)
+4. **Cleanup**: All keys associated with the lease are automatically deleted
+
+**Automatic Recovery:**
+
+- Reconnection with exponential backoff (50ms to 5s)
+- Deadline-based retry logic
+- Cancellation token propagation
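The timing rules above can be sketched as follows. The 50 ms and 5 s bounds and the 50% heartbeat rule come from this section; the doubling factor and the function names are assumptions made for illustration.

```python
import itertools


def backoff_delays(initial: float = 0.05, maximum: float = 5.0, factor: float = 2.0):
    """Yield reconnect delays in seconds, growing from `initial` up to `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)  # assumption: delay doubles each retry


def next_heartbeat_in(remaining_ttl: float) -> float:
    """Keep-alive is sent at 50% of the remaining lease TTL."""
    return remaining_ttl * 0.5


print(list(itertools.islice(backoff_delays(), 9)))
print(next_heartbeat_in(10.0))  # 5.0
```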
+
+### Service Discovery
+
+Endpoints are registered in etcd for dynamic discovery:
+
+**Key Format:**
+```
+/services/{namespace}/{component}/{endpoint}/{instance_id}
+```
+
+**Example:**
+```
+/services/vllm-agg/backend/generate/694d98147d54be25
+```
+
+**Registration Data:**
+```json
+{
+  "namespace": "vllm-agg",
+  "component": "backend",
+  "endpoint": "generate",
+  "instance_id": 7587888160958628000,
+  "transport": {
+    "tcp": "192.168.1.10:9999"
+  }
+}
+```
+
+### Discovery Queries
+
+The discovery system supports multiple query patterns:
+
+| Query Type | Pattern | Use Case |
+|------------|---------|----------|
+| `AllEndpoints` | `/services/` | List all services |
+| `NamespacedEndpoints` | `/services/{namespace}/` | Filter by namespace |
+| `ComponentEndpoints` | `/services/{namespace}/{component}/` | Filter by component |
+| `Endpoint` | `/services/{namespace}/{component}/{endpoint}/` | Specific endpoint |
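The key format and the four query patterns can be expressed as two small helpers. Both function names are hypothetical, and rendering `instance_id` in hexadecimal is inferred from the example key shown earlier rather than stated by the source.

```python
def instance_key(namespace: str, component: str, endpoint: str, instance_id: int) -> str:
    """Key format: /services/{namespace}/{component}/{endpoint}/{instance_id}."""
    return f"/services/{namespace}/{component}/{endpoint}/{instance_id:x}"


def query_prefix(namespace=None, component=None, endpoint=None) -> str:
    """Build the etcd prefix for a query, stopping at the first unspecified level."""
    parts = ["/services"]
    for part in (namespace, component, endpoint):
        if part is None:
            break
        parts.append(part)
    return "/".join(parts) + "/"


key = instance_key("vllm-agg", "backend", "generate", int("694d98147d54be25", 16))
print(key)
print(query_prefix())                       # AllEndpoints
print(query_prefix("vllm-agg"))             # NamespacedEndpoints
print(query_prefix("vllm-agg", "backend"))  # ComponentEndpoints
print(key.startswith(query_prefix("vllm-agg", "backend", "generate")))
```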
+
+### Watch Functionality
+
+Clients watch etcd prefixes for real-time updates:
+
+```python
+# Client watches for endpoint changes
+watcher = etcd.watch_prefix("/services/vllm-agg/backend/generate/")
+
+for event in watcher:
+    if event.type == "PUT":
+        # New endpoint registered
+        add_endpoint(event.value)
+    elif event.type == "DELETE":
+        # Endpoint removed (worker died)
+        remove_endpoint(event.key)
+```
+
+**Watch Features:**
+
+- Initial state retrieval with `get_and_watch_prefix()`
+- Automatic reconnection on stream failure
+- Revision tracking for no-event-loss guarantees
+- Event types: `PUT` (create/update) and `DELETE`
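The watch loop shown above is schematic. As a runnable counterpart that needs no etcd server, the sketch below applies a sequence of `PUT` and `DELETE` events to an in-memory endpoint table. `WatchEvent` and `apply_events` are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass
class WatchEvent:
    type: str          # "PUT" (create/update) or "DELETE"
    key: str
    value: bytes = b""


def apply_events(endpoints: dict, events) -> dict:
    """Fold watch events into the client's view of available endpoints."""
    for event in events:
        if event.type == "PUT":
            endpoints[event.key] = event.value
        elif event.type == "DELETE":
            # Deleting an unknown key is harmless (e.g. after a reconnect).
            endpoints.pop(event.key, None)
    return endpoints


prefix = "/services/vllm-agg/backend/generate/"
view = apply_events({}, [
    WatchEvent("PUT", prefix + "a1", b"worker-a"),
    WatchEvent("PUT", prefix + "b2", b"worker-b"),
    WatchEvent("DELETE", prefix + "a1"),
])
print(sorted(view))
```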
+
+### Distributed Locks
+
+etcd provides distributed locking for coordination:
+
+**Lock Types:**
+
+| Type | Key Pattern | Behavior |
+|------|-------------|----------|
+| Write Lock | `v1/{prefix}/writer` | Exclusive (no readers/writers) |
+| Read Lock | `v1/{prefix}/readers/{id}` | Shared (multiple readers) |
+
+**Operations:**
+
+```rust
+// Non-blocking write lock
+let lock = client.try_write_lock("my_resource").await?;
+
+// Blocking read lock with polling (100ms intervals)
+let lock = client.read_lock_with_wait("my_resource").await?;
+```
+
+## NATS Architecture
+
+### When NATS is Used
+
+NATS is used for:
+
+1. **KV Cache Events**: Real-time KV cache state updates for routing
+2. **Router Replica Sync**: Synchronizing router state across replicas
+3. **Legacy Request Plane**: NATS-based request transport (optional)
+
+### Configuration
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `NATS_SERVER` | NATS server URL | `nats://localhost:4222` |
+
+### Disabling NATS
+
+For deployments that do not need KV cache events:
+
+```bash
+# Disable NATS and KV events
+python -m dynamo.frontend --no-kv-events
+```
+
+This enables "approximate mode" for KV routing without event persistence.
+
+### Event Publishing
+
+Components publish events to NATS subjects:
+
+```rust
+pub trait EventPublisher {
+    async fn publish(&self, event: &str, data: &[u8]) -> Result<()>;
+    async fn publish_serialized<T: Serialize>(&self, event: &str, data: &T) -> Result<()>;
+}
+```
+
+**Subject Naming:**
+```
+{base_subject}.{event_name}
+```
+
+Example:
+```
+vllm-agg.backend.kv_cache_update
+```
+
+### Event Subscription
+
+Components subscribe to events:
+
+```rust
+pub trait EventSubscriber {
+    async fn subscribe(&self, topic: &str) -> Result<Subscriber>;
+    async fn subscribe_typed<T: DeserializeOwned>(&self, topic: &str) -> Result<TypedSubscriber<T>>;
+}
+```
+
+### JetStream Persistence
+
+For durable event delivery, NATS JetStream provides:
+
+- Message persistence
+- Replay from offset
+- Consumer groups for load balancing
+- Acknowledgment tracking
+
+## Key-Value Store Abstraction
+
+Dynamo provides a unified KV store interface supporting multiple backends:
+
+### Supported Backends
+
+| Backend | Use Case | Configuration |
+|---------|----------|---------------|
+| `EtcdStore` | Production deployments | `ETCD_ENDPOINTS` |
+| `MemoryStore` | Testing, development | Default |
+| `NatsStore` | NATS-only deployments | `NATS_SERVER` |
+| `FileStore` | Local persistence | File path |
+
+### Store Interface
+
+```rust
+pub trait KvStore {
+    async fn get(&self, bucket: &str, key: &str) -> Result<Option<Vec<u8>>>;
+    async fn put(&self, bucket: &str, key: &str, value: &[u8]) -> Result<()>;
+    async fn delete(&self, bucket: &str, key: &str) -> Result<()>;
+    async fn watch(&self, bucket: &str) -> Result<WatchStream>;
+}
+```
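To show the shape of the bucketed interface, here is a Python analogue of an in-memory backend. It is a sketch only: `ToyMemoryStore` is a hypothetical class, not Dynamo's `MemoryStore`, and `watch` is omitted to keep the example short.

```python
from collections import defaultdict
from typing import Optional


class ToyMemoryStore:
    def __init__(self):
        # bucket name -> {key -> value}
        self._buckets = defaultdict(dict)

    def get(self, bucket: str, key: str) -> Optional[bytes]:
        return self._buckets[bucket].get(key)

    def put(self, bucket: str, key: str, value: bytes) -> None:
        self._buckets[bucket][key] = value

    def delete(self, bucket: str, key: str) -> None:
        self._buckets[bucket].pop(key, None)


store = ToyMemoryStore()
store.put("v1/instances", "generate/1a2b", b"192.168.1.10:9999")
print(store.get("v1/instances", "generate/1a2b"))
print(store.get("v1/mdc", "generate/1a2b"))  # None: buckets are independent
```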
+
+### Buckets
+
+Data is organized into logical buckets:
+
+| Bucket | Purpose |
+|--------|---------|
+| `v1/instances` | Endpoint instance registry |
+| `v1/mdc` | Model deployment cards |
+
+## Typed Prefix Watcher
+
+For type-safe watching of etcd prefixes:
+
+```rust
+// Watch and maintain HashMap of deserialized values
+let watcher = watch_prefix_with_extraction::<DiscoveryInstance>(
+    &etcd_client,
+    "/services/vllm-agg/",
+    lease_id_extractor,
+    value_extractor,
+).await?;
+
+// Receive updates via watch channel
+let instances = watcher.borrow();
+```
+
+**Key Extractors:**
+
+| Extractor | Description |
+|-----------|-------------|
+| `lease_id()` | Use lease ID as key |
+| `key_string()` | Extract key with prefix stripping |
+| `full_key_string()` | Use full etcd key |
+
+## Reliability Features
+
+### Connection Resilience
+
+**etcd Reconnection:**
+- Exponential backoff: 50ms to 5s
+- Deadline-based retry logic
+- Mutex ensures single concurrent reconnect
+
+**NATS Reconnection:**
+- Built-in reconnection in NATS client
+- Configurable max reconnect attempts
+- Buffering during disconnection
+
+### Lease-Based Cleanup
+
+When a worker crashes or loses connectivity:
+
+1. Keep-alive heartbeats stop
+2. Lease expires after TTL (10 seconds)
+3. All registered endpoints automatically deleted
+4. Clients receive DELETE watch events
+5. Traffic reroutes to healthy workers
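The cleanup sequence can be simulated without etcd. In this sketch time is passed in explicitly so the behavior is deterministic; `ToyLeaseStore` and its methods are hypothetical and only model the idea that keys live exactly as long as their lease.

```python
class ToyLeaseStore:
    def __init__(self):
        self.keys = {}      # key -> lease_id
        self.deadline = {}  # lease_id -> expiry time

    def grant(self, lease_id: int, ttl: float, now: float) -> None:
        self.deadline[lease_id] = now + ttl

    def keep_alive(self, lease_id: int, ttl: float, now: float) -> None:
        # A heartbeat pushes the expiry time forward by one TTL.
        self.deadline[lease_id] = now + ttl

    def put(self, key: str, lease_id: int) -> None:
        self.keys[key] = lease_id

    def expire(self, now: float) -> list:
        """Drop expired leases and return the keys deleted with them."""
        dead = {lease for lease, t in self.deadline.items() if t <= now}
        removed = [key for key, lease in self.keys.items() if lease in dead]
        for key in removed:
            del self.keys[key]
        for lease in dead:
            del self.deadline[lease]
        return removed


store = ToyLeaseStore()
store.grant(lease_id=1, ttl=10, now=0)
store.put("/services/vllm-agg/backend/generate/1", lease_id=1)
store.keep_alive(lease_id=1, ttl=10, now=5)  # last heartbeat before the crash
print(store.expire(now=12))  # [] (lease still valid until t=15)
print(store.expire(now=15))  # the endpoint key is removed
```

The returned list corresponds to the `DELETE` watch events that clients would receive.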
+
+### Transaction Safety
+
+etcd transactions ensure atomic operations:
+
+```rust
+// Atomic create-if-not-exists
+let txn = Txn::new()
+    .when([Compare::create_revision(key, CompareOp::Equal, 0)])
+    .and_then([Op::put(key, value, options)]);
+
+etcd_client.txn(txn).await?;
+```
+
+This prevents race conditions in concurrent service registration.
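The same create-if-not-exists guarantee can be illustrated in plain Python, where a lock stands in for the etcd transaction. `TinyStore` and `race` are hypothetical names for this sketch; it only demonstrates why the compare and the put must happen as one atomic step.

```python
import threading


class TinyStore:
    """In-process stand-in for an atomic compare-then-put (illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data = {}

    def create_if_absent(self, key: str, value: str) -> bool:
        # The existence check and the write happen under one lock,
        # so two concurrent callers can never both succeed.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def race(store: TinyStore, key: str, contenders: int = 8) -> list:
    """Have several threads try to register the same key; return who won."""
    results = [False] * contenders

    def attempt(i: int) -> None:
        results[i] = store.create_if_absent(key, f"worker-{i}")

    threads = [threading.Thread(target=attempt, args=(i,)) for i in range(contenders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

However the threads interleave, exactly one entry in the result of `race` is `True`.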
+
+## Operational Modes
+
+### Kubernetes Mode (Requires Explicit Configuration)
+
+Native Kubernetes service discovery:
+
+```bash
+# Operator explicitly sets this (not auto-detected):
+export DYN_DISCOVERY_BACKEND=kubernetes
+
+# Workers register via K8s CRDs
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+
+# Frontend discovers workers via K8s API
+python -m dynamo.frontend
+```
+
+No etcd or NATS required for basic operation when using K8s discovery.
+
+### KV Store Mode (Global Default)
+
+Full service discovery with etcd:
+
+```bash
+# This is the default - no configuration needed
+# export DYN_DISCOVERY_BACKEND=kv_store  # (implicit)
+
+# Workers register with etcd
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+
+# Frontend discovers workers via etcd
+python -m dynamo.frontend
+```
+
+### KV-Aware Routing (Optional)
+
+Enable NATS for KV cache event tracking:
+
+```bash
+# Default: KV events enabled (requires NATS)
+python -m dynamo.frontend --router-mode kv
+
+# Disable KV events for prediction-based routing (no NATS)
+python -m dynamo.frontend --router-mode kv --no-kv-events
+```
+
+With `--no-kv-events`:
+- Router predicts cache state based on routing decisions
+- TTL-based expiration and LRU pruning
+- No NATS infrastructure required
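A rough sketch of prediction-based tracking with TTL expiration and LRU pruning is shown below. It is not the router's implementation: `PredictedBlockCache`, its parameters, and the use of string block hashes are assumptions made for illustration.

```python
from collections import OrderedDict


class PredictedBlockCache:
    """Hypothetical per-worker model of which KV blocks are probably cached."""

    def __init__(self, capacity: int, ttl_s: float) -> None:
        self.capacity = capacity
        self.ttl_s = ttl_s
        self._blocks = OrderedDict()  # block hash -> time it was last routed

    def record_routed(self, block_hashes, now: float) -> None:
        """Assume every block of a routed request ends up cached on the worker."""
        for h in block_hashes:
            self._blocks.pop(h, None)
            self._blocks[h] = now  # most recently used goes last
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # LRU pruning

    def expire(self, now: float) -> None:
        """TTL-based expiration of predictions that are too old to trust."""
        stale = [h for h, t in self._blocks.items() if now - t >= self.ttl_s]
        for h in stale:
            del self._blocks[h]

    def overlap(self, block_hashes, now: float) -> int:
        """Count the leading blocks of a request predicted to be cached."""
        self.expire(now)
        matched = 0
        for h in block_hashes:
            if h not in self._blocks:
                break
            matched += 1
        return matched
```

A router could keep one such cache per worker and prefer the worker with the largest `overlap` for an incoming request, accepting that the prediction drifts from the real cache state over time.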
+
+## Best Practices
+
+### 1. Use Kubernetes Discovery on K8s
+
+The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required when using the operator.
+
+### 2. For Bare Metal: Deploy etcd Cluster
+
+For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.
+
+### 3. Configure Appropriate TTLs (etcd mode)
+
+Balance between detection speed and overhead:
+
+- **Short TTL (5s)**: Faster failure detection, more keep-alive traffic
+- **Long TTL (30s)**: Less overhead, slower detection
+
+### 4. KV Routing Without NATS
+
+For simpler deployments without NATS:
+
+```bash
+# Use prediction-based KV routing
+python -m dynamo.frontend --router-mode kv --no-kv-events
+```
+
+This provides KV-aware routing with reduced accuracy but no NATS dependency.
+
+## Related Documentation
+
+- [Distributed Runtime](distributed-runtime.md) - Runtime architecture
+- [Request Plane](../guides/request-plane.md) - Request transport configuration
+- [Fault Tolerance](../fault-tolerance/request-cancellation.md) - Failure handling
--- a/fern/pages/development/backend-guide.md
+++ b/fern/pages/development/backend-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Writing Python Workers in Dynamo"
+---
+
+This guide explains how to create your own Python worker in Dynamo.
+
+The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
+
+The Python file must do three things:
+1. Decorate a function to get the runtime
+2. Register on the network
+3. Attach a request handler
+
+```python
+import asyncio
+
+import uvloop
+
+from dynamo.llm import ModelInput, ModelType, register_llm
+from dynamo.runtime import DistributedRuntime, dynamo_worker
+
+
+# 1. Decorate a function to get the runtime
+#
+@dynamo_worker()
+async def worker(runtime: DistributedRuntime):
+
+    # 2. Register ourselves on the network
+    #
+    component = runtime.namespace("namespace").component("component")
+    model_path = "Qwen/Qwen3-0.6B" # or "/data/models/Qwen3-0.6B"
+    model_input = ModelInput.Tokens # or ModelInput.Text if engine handles pre-processing
+    model_type = ModelType.Chat # or ModelType.Chat | ModelType.Completions if model can be deployed on chat and completions endpoints
+    endpoint = component.endpoint("endpoint")
+    # Optional last param to register_llm is model_name. If not present derives it from model_path
+    await register_llm(model_input, model_type, endpoint, model_path)
+
+    # Initialize your engine here
+    engine = ...  # placeholder: replace with your engine object
+
+    # 3. Attach request handler
+    #
+    await endpoint.serve_endpoint(RequestHandler(engine).generate)
+
+
+class RequestHandler:
+
+    def __init__(self, engine):
+        # Keep a reference to the engine for use in generate()
+        self.engine = engine
+
+    async def generate(self, request):
+        # Call the engine
+        # yield result dict
+        ...
+
+
+if __name__ == "__main__":
+    uvloop.install()
+    asyncio.run(worker())
+```
+
+
+The `model_path` can be:
+- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
+- The path to a checkout of a HuggingFace repo: any folder containing safetensors files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
+
+The `model_input` can be:
+- `ModelInput.Tokens`. Your engine expects pre-processed input (token IDs). Dynamo handles tokenization and pre-processing.
+- `ModelInput.Text`. Your engine expects raw text input and handles its own tokenization and pre-processing.
+
+The `model_type` can be:
+- `ModelType.Chat`. Your `generate` method receives a `request` and must return a response dict in the [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat) format.
+- `ModelType.Completions`. Your `generate` method receives a `request` and must return a response dict in the older [Completions](https://platform.openai.com/docs/api-reference/completions) format.
+
+`register_llm` can also take the following kwargs:
+- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name or the folder name.
+- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
+- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
+- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault-tolerance/request-migration.md). Defaults to 0.
+- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
+
+See `examples/backends` for full code examples.
+
+## Component names
+
+A worker needs three names to register itself: `namespace.component.endpoint`.
+
+* *Namespace*: A pipeline. Usually a model, e.g., "llama_8b". Just a name.
+* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
+* *Endpoint*: Like a URL. "generate", "load_metrics".
+* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.
+
+If you run two models, that is two pipelines. Speculative decoding is an exception: the draft model is part of the larger model's pipeline.
+
+If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
+
+Example 1: Data parallel load balanced, one model one pipeline two instances.
+```
+Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 0
+Node 2: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 2
+```
+
+Example 2: Two models, two pipelines.
+```
+Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B
+Node 2: namespace: llama3-1-8b, component: backend, endpoint: generate, model: /data/Llama-3.1-8B-Instruct/
+```
+
+Example 3: Different endpoints.
+
+The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above uses a patched vLLM, it also exposes `llama3-1-8b.backend.load_metrics`.
+
+Example 4: Multiple components in a pipeline.
+
+In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
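The naming scheme in this section can be sketched as a small registry that maps `namespace.component.endpoint` to instance IDs and spreads requests across them. This is an illustration only: `InstanceTable` is a hypothetical name, sequential IDs are an assumption, and round-robin is just one possible way for a router to spread traffic.

```python
from collections import defaultdict
from itertools import count


class InstanceTable:
    """Hypothetical registry: (namespace, component, endpoint) -> instance ids."""

    def __init__(self) -> None:
        self._instances = defaultdict(list)
        self._cursor = defaultdict(int)
        self._next_id = count(1)

    def register(self, path: str) -> int:
        """Register one process under a dotted path and return its unique id."""
        namespace, component, endpoint = path.split(".")
        instance_id = next(self._next_id)
        self._instances[(namespace, component, endpoint)].append(instance_id)
        return instance_id

    def pick(self, path: str) -> int:
        """Choose the next instance for a path in round-robin order."""
        key = tuple(path.split("."))
        ids = self._instances[key]
        if not ids:
            raise LookupError(f"no instances registered for {path}")
        chosen = ids[self._cursor[key] % len(ids)]
        self._cursor[key] += 1
        return chosen
```

Registering `qwen3-32b.backend.generate` twice yields two instances behind one name, as in Example 1, while `llama3-1-8b.backend.generate` stays a separate pipeline, as in Example 2.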
+
+## Migrate Ongoing Requests
+
+A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline.
+
+In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault-tolerance/request-migration.md) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
+
+<Warning>
+We will replace the `GeneratorExit` exception with a new Dynamo exception. Expect a minor breaking code change in the near future.
+</Warning>
+
+Here's an example of how to implement this in your `RequestHandler`:
+
+```python
+class RequestHandler:
+
+    async def generate(self, request):
+        """Generate response, with support for request migration"""
+        for result in self.engine.generate_streaming(request):
+            # Check if we need to migrate before yielding each token
+            if is_shutting_down():
+                # Raising GeneratorExit closes the stream and triggers migration
+                raise GeneratorExit("Worker shutting down, migrating request")
+
+            yield result
+```
+
+When `GeneratorExit` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.
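To show the idea from the frontend's point of view, the following self-contained simulation retries an interrupted stream on another worker while carrying the tokens produced so far. It does not use Dynamo APIs: `StreamIncomplete`, `worker_stream`, and `generate_with_migration` are hypothetical names, and a custom exception stands in for the early stream close.

```python
import asyncio


class StreamIncomplete(Exception):
    """Hypothetical stand-in for a response stream that closed early."""


async def worker_stream(produced, full_output, fail_after=None):
    """Toy worker: continue full_output after the tokens already produced."""
    already = len(produced)
    for i, token in enumerate(full_output[already:]):
        if fail_after is not None and i >= fail_after:
            raise StreamIncomplete("worker shutting down")
        yield token


async def generate_with_migration(full_output, migration_limit, fail_plan):
    """fail_plan[i] is how many tokens the i-th worker emits before failing."""
    produced = []
    attempts = 0
    while True:
        fail_after = fail_plan[attempts] if attempts < len(fail_plan) else None
        try:
            async for token in worker_stream(produced, full_output, fail_after):
                produced.append(token)
            return produced
        except StreamIncomplete:
            if attempts >= migration_limit:
                raise  # migration budget exhausted
            attempts += 1  # retry elsewhere, keeping accumulated tokens
```

With a `migration_limit` of 0 the first interruption surfaces as an error, which mirrors the default of no migration.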
+
+For more information about how request migration works, see the [Request Migration Architecture](../fault-tolerance/request-migration.md) documentation.
+
+## Request Cancellation
+
+Your Python worker's request handler can optionally support request cancellation by accepting a `context` argument after the `request` argument. This context object allows you to check for cancellation signals and respond appropriately:
+
+```python
+class RequestHandler:
+
+    async def generate(self, request, context):
+        """Generate response with cancellation support"""
+        for result in self.engine.generate_streaming(request):
+            # Check if the request has been cancelled
+            if context.is_stopped():
+                # Stop processing and clean up
+                break
+
+            yield result
+```
+
+The context parameter is optional - if your generate method doesn't include it in its signature, Dynamo will call your method without the context argument.
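One way a framework can support an optional argument like this is to inspect the handler's signature before calling it. The sketch below shows that general technique with the standard `inspect` module; it is an assumption for illustration, not a description of how Dynamo detects the parameter, and `accepts_context` and `call_handler` are hypothetical helpers.

```python
import inspect


def accepts_context(handler) -> bool:
    """Return True if the handler declares a parameter named 'context'."""
    return "context" in inspect.signature(handler).parameters


def call_handler(handler, request, context):
    """Call the handler with or without the context, based on its signature."""
    if accepts_context(handler):
        return handler(request, context)
    return handler(request)


class WithContext:
    async def generate(self, request, context):
        yield request


class WithoutContext:
    async def generate(self, request):
        yield request
```

For bound methods, `inspect.signature` already omits `self`, so both handler classes above can be checked directly through their `generate` attribute.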
+
+For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../fault-tolerance/request-cancellation.md) documentation.
--- a/fern/pages/development/runtime-guide.md
+++ b/fern/pages/development/runtime-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Dynamo Runtime"
+---
+
+<h4>A Datacenter Scale Distributed Inference Serving Framework</h4>
+
+[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+
+Rust implementation of the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
+
+## Prerequisites
+
+### Install Rust and Cargo using [rustup](https://rustup.rs/):
+
+```bash
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
+
+### Build
+
+```
+cargo build
+cargo test
+```
+
+### Start Dependencies
+
+#### Docker Compose
+
+The simplest way to deploy the pre-requisite services is using
+[docker-compose](https://docs.docker.com/compose/install/linux/),
+defined in [deploy/docker-compose.yml](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml).
+
+```
+# At the root of the repository:
+docker compose -f deploy/docker-compose.yml up -d
+```
+
+This will deploy a [NATS.io](https://nats.io/) server and an [etcd](https://etcd.io/)
+server used to communicate between and discover components at runtime.
+
+
+#### Local (alternate)
+
+To deploy the pre-requisite services locally instead of using `docker-compose`
+above, you can manually launch each:
+
+- [NATS.io](https://docs.nats.io/running-a-nats-service/introduction/installation) server with [JetStream](https://docs.nats.io/nats-concepts/jetstream)
+    - example: `nats-server -js --trace`
+- [etcd](https://etcd.io) server
+    - follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
+
+
+### Run Examples
+
+When developing or running examples, any process or user that shares your core services (`etcd` and `nats.io`) will
+be operating within your distributed runtime.
+
+The current examples use a hard-coded `namespace`. We will address the `namespace` collisions later.
+
+All examples require the `etcd` and `nats.io` pre-requisites to be running and available.
+
+#### Rust `hello_world`
+
+With two terminals open, in one window:
+
+```
+cd examples/hello_world
+cargo run --bin server
+```
+
+In the second terminal, execute:
+
+```
+cd examples/hello_world
+cargo run --bin client
+```
+
+which should yield some output similar to:
+```
+    Finished `dev` profile [unoptimized + debuginfo] target(s) in 6.25s
+     Running `target/debug/client`
+Annotated { data: Some("h"), id: None, event: None, comment: None }
+Annotated { data: Some("e"), id: None, event: None, comment: None }
+Annotated { data: Some("l"), id: None, event: None, comment: None }
+Annotated { data: Some("l"), id: None, event: None, comment: None }
+Annotated { data: Some("o"), id: None, event: None, comment: None }
+Annotated { data: Some(" "), id: None, event: None, comment: None }
+Annotated { data: Some("w"), id: None, event: None, comment: None }
+Annotated { data: Some("o"), id: None, event: None, comment: None }
+Annotated { data: Some("r"), id: None, event: None, comment: None }
+Annotated { data: Some("l"), id: None, event: None, comment: None }
+Annotated { data: Some("d"), id: None, event: None, comment: None }
+```
+
+#### Python
+
+See the [README.md](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/lib/bindings/python/README.md) for details.
+
+The Python and Rust `hello_world` client and server examples are interchangeable,
+so you can start the Python `server.py` and talk to it from the Rust `client`.
--- a/fern/pages/fault-tolerance/README.md
+++ b/fern/pages/fault-tolerance/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Fault Tolerance"
+---
+
+Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability.
+
+## Overview
+
+Fault tolerance in Dynamo operates at multiple levels:
+
+| Layer | Mechanism | Purpose |
+|-------|-----------|---------|
+| **Request** | Migration, Cancellation | Handle in-flight request failures |
+| **Worker** | Health Checks, Graceful Shutdown | Detect and recover from worker failures |
+| **System** | Load Shedding, Request Rejection | Prevent system overload |
+| **Infrastructure** | etcd HA, NATS resilience | Handle infrastructure component failures |
+
+## Key Features
+
+### Request Migration
+
+When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers. The migration system:
+
+- Preserves partial generation state (accumulated tokens)
+- Transparently continues generation on a new worker
+- Maintains seamless token flow to clients
+
+See [Request Migration](request-migration.md) for details.
+
+### Request Cancellation
+
+Dynamo supports canceling in-flight requests to free computational resources:
+
+- Graceful stop signals for clean termination
+- Kill signals for immediate termination
+- Hierarchical cancellation propagation through request chains
+
+See [Request Cancellation](request-cancellation.md) for details.
+
+### Graceful Shutdown
+
+Workers handle shutdown signals (SIGTERM/SIGINT) gracefully:
+
+- Immediately stop accepting new requests
+- Optionally drain in-flight requests before terminating
+- Clean up resources (engines, connections, temp files)
+
+See [Graceful Shutdown](graceful-shutdown.md) for details.
+
+### Request Rejection (Load Shedding)
+
+When workers are overloaded, Dynamo rejects new requests to prevent cascading failures:
+
+- Configurable busy thresholds based on KV cache utilization
+- Real-time worker load monitoring
+- HTTP 503 responses with retry guidance
+
+See [Request Rejection](request-rejection.md) for details.
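As a rough illustration of threshold-based load shedding, the sketch below rejects a request with HTTP 503 when every worker is above a busy threshold. The helper names, the interpretation of the threshold as a utilization fraction, and the worker bookkeeping are all assumptions; see the linked page for the actual configuration semantics.

```python
from typing import Optional


def is_busy(active_blocks: int, total_blocks: int, threshold: Optional[float]) -> bool:
    """A worker is busy when its KV block utilization exceeds the threshold."""
    if threshold is None:  # no threshold configured: never report busy
        return False
    return active_blocks / total_blocks > threshold


def admit(workers: dict, threshold: Optional[float]) -> tuple:
    """Return (http_status, worker_name); workers maps name -> (active, total)."""
    free = [
        name
        for name, (active, total) in sorted(workers.items())
        if not is_busy(active, total, threshold)
    ]
    if not free:
        return 503, None  # all workers overloaded: shed the request
    return 200, free[0]
```

Leaving the threshold unset disables rejection entirely in this sketch, matching the "None (disabled)" defaults in the quick reference below.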
+
+### Health Checks
+
+Dynamo provides multiple health check mechanisms:
+
+- **HTTP Endpoints**: `/health` and `/live` endpoints for orchestration
+- **Canary Health Checks**: Active monitoring via periodic test requests
+- **Engine Monitoring**: Automatic shutdown on engine failure detection
+
+See [Health Checks](../observability/health-checks.md) for details.
+
+## Configuration Quick Reference
+
+| Feature | Environment Variable | Default |
+|---------|---------------------|---------|
+| Worker health port | `DYN_SYSTEM_PORT` | `9090` |
+| Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` (K8s: `true`) |
+| Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds |
+| Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds |
+| Decode blocks threshold | `--active-decode-blocks-threshold` | None (disabled) |
+| Prefill tokens threshold | `--active-prefill-tokens-threshold` | None (disabled) |
+
+## Failure Scenarios and Recovery
+
+### Worker Pod Restart
+
+1. Worker receives SIGTERM from Kubernetes
+2. Endpoints are immediately invalidated (no new requests)
+3. In-flight requests complete or migrate (based on configuration)
+4. Resources are cleaned up
+5. Pod restarts with fresh state
+
+### Worker Crash (Unexpected)
+
+1. etcd lease expires (TTL-based detection)
+2. Client discovers endpoint removal via etcd watch
+3. New requests route to remaining healthy workers
+4. In-flight requests on crashed worker are migrated (if enabled)
+
+### Network Partition
+
+1. Worker loses connectivity to etcd/NATS
+2. Lease keep-alive fails, lease eventually expires
+3. Worker is removed from service discovery
+4. Traffic reroutes to reachable workers
+
+### GPU Failure
+
+1. Engine health check detects GPU error (XID, OOM, etc.)
+2. Worker initiates graceful shutdown
+3. Runtime is shut down, engine cleaned up
+4. Process exits with code 1 for pod restart
+
+## Testing Fault Tolerance
+
+Dynamo includes a comprehensive testing framework for validating fault tolerance:
+
+- Request cancellation tests
+- Migration tests with worker failures
+- etcd HA failover tests
+- Hardware fault injection (GPU XID, network partitions)
+
+See [Fault Tolerance Testing](testing.md) for details.
+
+## Related Documentation
+
+- [Observability](../observability/README.md) - Metrics and monitoring
+- [Distributed Runtime](../design-docs/distributed-runtime.md) - Service discovery architecture
+- [Event Plane](../design-docs/event-plane.md) - etcd and NATS coordination
--- a/fern/pages/fault-tolerance/graceful-shutdown.md
+++ b/fern/pages/fault-tolerance/graceful-shutdown.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Graceful Shutdown"
+---
+
+This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
+
+## Overview
+
+Graceful shutdown in Dynamo ensures that:
+
+1. **No new requests are accepted** - Endpoints are immediately invalidated
+2. **In-flight requests complete** - Existing requests finish processing (configurable)
+3. **Resources are cleaned up** - Engines, connections, and temporary files are released
+4. **Pods restart cleanly** - Exit codes signal Kubernetes for proper restart behavior
+
+## Signal Handling
+
+All Dynamo components handle Unix signals for graceful shutdown:
+
+| Signal | Trigger | Behavior |
+|--------|---------|----------|
+| `SIGTERM` | Kubernetes pod termination | Graceful shutdown initiated |
+| `SIGINT` | Ctrl+C / manual interrupt | Graceful shutdown initiated |
+
+### Implementation
+
+Each component registers signal handlers at startup:
+
+```python
+def signal_handler():
+    asyncio.create_task(graceful_shutdown(runtime))
+
+for sig in (signal.SIGTERM, signal.SIGINT):
+    loop.add_signal_handler(sig, signal_handler)
+```
+
+The `graceful_shutdown()` function:
+1. Logs the shutdown signal
+2. Calls `runtime.shutdown()` to invalidate endpoints
+3. Waits for in-flight requests (based on configuration)
+4. Returns to allow cleanup to proceed
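The four steps above can be sketched end to end with a stand-in runtime. This is a simplified illustration: `FakeRuntime`, the `in_flight` task set, and the `wait_for_requests` flag are hypothetical, and `loop.add_signal_handler` is only available on Unix event loops.

```python
import asyncio
import logging
import signal

logger = logging.getLogger("shutdown-sketch")


class FakeRuntime:
    """Hypothetical stand-in for the runtime; only shutdown() is modeled."""

    def __init__(self) -> None:
        self.shutdown_called = False

    def shutdown(self) -> None:
        self.shutdown_called = True


async def graceful_shutdown(runtime, in_flight, wait_for_requests=True) -> None:
    logger.info("Received shutdown signal")  # step 1: log
    runtime.shutdown()                       # step 2: invalidate endpoints
    if wait_for_requests and in_flight:      # step 3: wait for in-flight work
        await asyncio.gather(*in_flight, return_exceptions=True)
    # step 4: returning lets the caller's cleanup code run


def install_handlers(loop, runtime, in_flight) -> None:
    """Register SIGTERM/SIGINT handlers on a Unix event loop."""
    def signal_handler() -> None:
        loop.create_task(graceful_shutdown(runtime, in_flight))

    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, signal_handler)


async def demo() -> tuple:
    runtime, finished = FakeRuntime(), []

    async def request() -> None:
        await asyncio.sleep(0.01)
        finished.append(1)

    tasks = {asyncio.ensure_future(request()) for _ in range(3)}
    await graceful_shutdown(runtime, tasks)
    return runtime.shutdown_called, len(finished)
```

Running `asyncio.run(demo())` exercises the shutdown path directly, without sending a real signal.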
+
+## Endpoint Draining
+
+When `runtime.shutdown()` is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter when serving the endpoint.
+
+### Configuration
+
+When registering an endpoint, the `graceful_shutdown` parameter controls draining behavior:
+
+```python
+generate_endpoint.serve_endpoint(
+    handler.generate,
+    graceful_shutdown=True,  # Wait for all requests to finish
+    metrics_labels=[("model", model_name)],
+    health_check_payload=health_check_payload,
+)
+```
+
+| `graceful_shutdown` | Behavior |
+|---------------------|----------|
+| `True` | Wait for all in-flight requests to complete before returning |
+| `False` | Return immediately without waiting for requests |
+
+### Component-Specific Behavior
+
+| Component | Default Behavior | Rationale |
+|-----------|------------------|-----------|
+| **Frontend** | N/A (HTTP server) | HTTP server handles its own shutdown |
+| **Prefill Workers** | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation |
+| **Decode Workers** | Conditional | If migration is enabled (`migration_limit > 0`), shutdown immediately to allow migration; otherwise wait |
+| **Router** | `graceful_shutdown=True` | Ensure routing decisions complete |
+
+### Decode Worker Migration Integration
+
+Decode workers use conditional draining based on whether request migration is supported:
+
+```python
+generate_endpoint.serve_endpoint(
+    handler.generate,
+    graceful_shutdown=config.migration_limit <= 0,  # If no migration, wait for requests
+    ...
+)
+```
+
+When `migration_limit > 0`:
+- Worker shuts down immediately (`graceful_shutdown=False`)
+- In-flight requests are migrated to healthy workers
+- No request loss occurs
+
+When `migration_limit <= 0`:
+- Worker waits for in-flight requests (`graceful_shutdown=True`)
+- Migration is not available
+- Requests complete on the shutting-down worker
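The decision rule can be written as a small pure function, which makes the two cases easy to check. The helper names `drains_in_flight` and `shutdown_outcome` are hypothetical; only the `migration_limit <= 0` condition comes from the snippet above.

```python
def drains_in_flight(migration_limit: int) -> bool:
    """Mirror `graceful_shutdown=config.migration_limit <= 0`."""
    return migration_limit <= 0


def shutdown_outcome(migration_limit: int, in_flight_ids) -> dict:
    """Describe what happens to in-flight requests when the worker stops."""
    if drains_in_flight(migration_limit):
        # No migration available: requests finish on this worker.
        return {"completed_locally": list(in_flight_ids), "migrated": []}
    # Migration enabled: the worker stops at once and requests move elsewhere.
    return {"completed_locally": [], "migrated": list(in_flight_ids)}
```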
+
+## Resource Cleanup
+
+After endpoint draining, components clean up their resources in `finally` blocks:
+
+### vLLM Worker Cleanup
+
+```python
+finally:
+    logger.debug("Cleaning up worker")
+    handler.cleanup()
+```
+
+The handler's `cleanup()` method:
+- Removes temporary directories (LoRA adapters, etc.)
+- Releases engine resources
+
+### SGLang Worker Cleanup
+
+```python
+def cleanup(self) -> None:
+    # Cancel pending consume tasks
+    for task in self._consume_tasks:
+        if not task.done():
+            task.cancel()
+    self._consume_tasks.clear()
+
+    # Shutdown engine
+    self.engine.shutdown()
+```
+
+### TensorRT-LLM Worker Cleanup
+
+```python
+async def cleanup(self):
+    if self._llm:
+        try:
+            self._llm.shutdown()
+        except Exception as e:
+            logging.error(f"Error during cleanup: {e}")
+        finally:
+            self._llm = None
+```
+
+## Error-Initiated Shutdown
+
+Workers can initiate graceful shutdown when fatal errors occur:
+
+### Engine Health Monitoring (vLLM)
+
+The `VllmEngineMonitor` continuously checks engine health:
+
+```python
+async def _check_engine_health(self):
+    while True:
+        try:
+            await self.engine_client.check_health()
+            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
+        except EngineDeadError as e:
+            logger.error(f"Health check failed: {e}")
+            self._shutdown_engine()
+            self.runtime.shutdown()
+            os._exit(1)
+```
+
+Configuration:
+- `HEALTH_CHECK_INTERVAL`: 2 seconds between checks
+- `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds max for engine shutdown
+
+### Fatal Error Handling (TensorRT-LLM)
+
+```python
+async def _initiate_shutdown(self, error: Exception):
+    logging.warning(f"Initiating graceful shutdown due to: {error}")
+
+    try:
+        if self.runtime:
+            self.runtime.shutdown()
+        if self.engine:
+            await self.engine.cleanup()
+    except Exception as cleanup_error:
+        logging.error(f"Error during graceful shutdown: {cleanup_error}")
+    finally:
+        logging.critical("Forcing process exit for restart")
+        os._exit(1)
+```
+
+## Kubernetes Integration
+
+### Pod Termination Flow
+
+1. Kubernetes sends `SIGTERM` to the pod
+2. Dynamo initiates graceful shutdown
+3. Pod has `terminationGracePeriodSeconds` to complete (default: 30s)
+4. If not terminated, Kubernetes sends `SIGKILL`
+
+### Recommended Configuration
+
+For production deployments, configure adequate termination grace period:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+spec:
+  services:
+    VllmWorker:
+      extraPodSpec:
+        terminationGracePeriodSeconds: 60  # Allow time for request draining
+```
+
+### Health Check Integration
+
+Kubernetes uses health endpoints to determine pod readiness:
+
+- **During shutdown**: Endpoints become unavailable
+- **Readiness probe fails**: Traffic stops routing to the pod
+- **Graceful draining**: Existing requests complete
+
+## Best Practices
+
+### 1. Set Appropriate Grace Periods
+
+Match `terminationGracePeriodSeconds` to your expected request completion time:
+- Short requests (< 10s): 30s grace period
+- Long generation (> 30s): 120s+ grace period
+
+### 2. Enable Request Migration for Decode Workers
+
+If using disaggregated serving, enable migration for decode workers:
+
+```bash
+--migration-limit 3  # Allow up to 3 migration attempts
+```
+
+This allows immediate shutdown while preserving request state.
+
+### 3. Monitor Shutdown Metrics
+
+Track shutdown behavior via logs:
+
+```
+INFO  Received shutdown signal, shutting down DistributedRuntime
+INFO  DistributedRuntime shutdown complete
+DEBUG Cleaning up worker
+```
+
+### 4. Handle Cleanup Errors
+
+Ensure cleanup methods handle errors gracefully:
+
+```python
+def cleanup(self):
+    for resource in self.resources:
+        try:
+            resource.cleanup()
+        except Exception as e:
+            logger.warning(f"Cleanup failed: {e}")
+            # Continue with other resources
+```
+
+## Related Documentation
+
+- [Request Migration](request-migration.md) - How requests migrate during shutdown
+- [Request Cancellation](request-cancellation.md) - Canceling in-flight requests
+- [Health Checks](../observability/health-checks.md) - Liveness and readiness probes
--- a/fern/pages/fault-tolerance/request-cancellation.md
+++ b/fern/pages/fault-tolerance/request-cancellation.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Request Cancellation Architecture"
+---
+
+This document describes how Dynamo cancels in-flight requests between Dynamo workers. Cancellation lets a request terminate early, saving computational resources that would otherwise be spent on responses that are no longer needed.
+
+## AsyncEngineContext Trait
+
+At the core of Dynamo's request cancellation system is the `AsyncEngineContext` trait. This trait is associated with every request stream and provides lifecycle management for async operations, including stream identification, graceful shutdown capabilities, and immediate termination capabilities.
+
+### Key Methods
+
+#### Identification
+- **`id()`**: Returns the unique identifier for the stream. This ID is set by the user for request identification, and the same ID can be used for sub-requests to associate them with the original user request.
+
+#### Status Checking
+- **`is_stopped()`**: Returns `true` if graceful cancellation has been requested via `stop_generating()`. This represents a signal to the worker that the request has been cancelled and it should return early.
+- **`is_killed()`**: Returns `true` if a hard stop has been issued via `kill()`. This typically indicates that the network connection between client and server has been cut or an immediate termination is required.
+
+#### Async Status Monitoring
+- **`stopped()`**: An async method that completes when the context becomes stopped. If already stopped, returns immediately.
+- **`killed()`**: An async method that completes when the context becomes killed. If already killed, returns immediately.
+
+#### Cancellation Control
+- **`stop_generating()`**: The recommended method for cancelling a request. This informs the engine to stop producing results for the stream gracefully. This method is idempotent and does not invalidate results currently in the stream.
+- **`stop()`**: Alias for `stop_generating()`.
+- **`kill()`**: Extends `stop_generating()` but also indicates a preference to terminate without draining remaining items in the stream. This is implementation-specific and may not be supported by all engines.
+
+#### Child Request Management
+- **`link_child(child: Arc<dyn AsyncEngineContext>)`**: Links a child `AsyncEngineContext` to this context. When `stop_generating()`, `stop()`, or `kill()` is called on the parent context, the same method is automatically called on all linked child contexts in the order they were linked. This is especially useful in disaggregated serving scenarios where a frontend receives cancellation notification and needs to cancel requests to workers, and the worker can then cancel its sub-requests (e.g., remote prefill operations).
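The propagation behavior of `link_child` can be modeled in a few lines of Python. This is a sketch of the semantics described above, not the Rust implementation: `SketchContext` is a hypothetical class, and treating a killed context as also stopped is an assumption based on `kill()` extending `stop_generating()`.

```python
class SketchContext:
    """Hypothetical model of stop/kill propagation through linked children."""

    def __init__(self, request_id: str) -> None:
        self._id = request_id
        self._stopped = False
        self._killed = False
        self._children = []

    def id(self) -> str:
        return self._id

    def is_stopped(self) -> bool:
        return self._stopped

    def is_killed(self) -> bool:
        return self._killed

    def link_child(self, child: "SketchContext") -> None:
        self._children.append(child)

    def stop_generating(self) -> None:
        self._stopped = True  # idempotent: repeated calls change nothing
        for child in self._children:  # propagate in link order
            child.stop_generating()

    def kill(self) -> None:
        self._stopped = True  # assumption: kill implies stopped
        self._killed = True
        for child in self._children:
            child.kill()
```

Linking a frontend context to a worker context, and the worker context to a prefill context, means a single `stop_generating()` on the frontend reaches the whole chain.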
+
+### Thread Safety
+
+The `AsyncEngineContext` trait ensures thread-safety with `Send + Sync` bounds, allowing safe concurrent access across multiple threads and async tasks.
+
+## Python Bindings
+
+The `AsyncEngineContext` functionality is exposed to Python through the `Context` class, which provides a largely one-to-one mapping from Rust methods to Python methods.
+
+### Python Context Class
+
+The Python `Context` class wraps the Rust `AsyncEngineContext` and exposes the following methods:
+
+- **`id()`**: Returns the unique identifier for the context
+- **`is_stopped()`**: Synchronous method equivalent to the Rust `is_stopped()`
+- **`is_killed()`**: Synchronous method equivalent to the Rust `is_killed()`
+- **`stop_generating()`**: Issues a stop generating signal, equivalent to the Rust method
+- **`async_killed_or_stopped()`**: An async method that completes when the context becomes either killed or stopped, whichever happens first. This combines the functionality of the Rust `killed()` and `stopped()` async methods using `tokio::select!`.
+
+For a working example of request cancellation, see the [cancellation demo](https://github.com/ai-dynamo/dynamo/tree/main/examples/custom_backend/cancellation/README.md).
+
+### Context Usage in Python
+
+The context is available optionally in both incoming and outgoing request scenarios:
+
+#### Incoming Requests
+For incoming requests, the generate method may optionally accept a `context` argument after the `request` argument. If the `context` parameter is specified in the method signature, it will receive the context object of the incoming request. Request handlers can:
+
+- Check for cancellation synchronously using `context.is_stopped()` before beginning expensive operations
+- Listen for cancellation asynchronously using `await context.async_killed_or_stopped()`
+
+Example:
+```python
+import asyncio
+
+async def generate(self, request, context):
+    for i in range(1000):
+        # Check for cancellation before expensive work
+        if context.is_stopped():
+            raise asyncio.CancelledError
+
+        # Perform work (expensive_computation is a placeholder for the real work)
+        result = await expensive_computation()
+        yield result
+```
+
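The asynchronous option, `await context.async_killed_or_stopped()`, is typically raced against the work itself. The sketch below shows one way to do that with `asyncio.wait`. `FakeContext` is a hypothetical stand-in that only mimics the documented method names so the example runs without Dynamo installed.

```python
import asyncio


class FakeContext:
    """Hypothetical stand-in for the Python Context class, for illustration only."""

    def __init__(self) -> None:
        self._event = asyncio.Event()

    def stop_generating(self) -> None:
        self._event.set()

    def is_stopped(self) -> bool:
        return self._event.is_set()

    async def async_killed_or_stopped(self) -> None:
        await self._event.wait()


async def run_until_cancelled(context: FakeContext, work_seconds: float) -> str:
    # Race the (simulated) work against the cancellation signal.
    work = asyncio.create_task(asyncio.sleep(work_seconds))
    cancelled = asyncio.create_task(context.async_killed_or_stopped())
    done, pending = await asyncio.wait(
        {work, cancelled}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Let the cancelled tasks finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return "cancelled" if cancelled in done else "completed"


async def demo() -> tuple:
    ctx = FakeContext()
    # Simulate a client cancelling shortly after the request starts.
    asyncio.get_running_loop().call_later(0.01, ctx.stop_generating)
    first = await run_until_cancelled(ctx, work_seconds=5.0)
    second = await run_until_cancelled(FakeContext(), work_seconds=0.01)
    return first, second


print(asyncio.run(demo()))  # ('cancelled', 'completed')
```

Compared with polling `is_stopped()`, this pattern reacts to cancellation even while a single long operation is in progress.
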
+#### Outgoing Requests
+For outgoing requests, Python scripts may optionally pass a context object as a keyword argument to the runtime endpoint client's router operations (such as the `generate`, `round_robin`, `random`, and `direct` methods). The script can then cancel the outgoing request through that context object.
+
+This is especially useful when child outgoing requests need to be cancelled when the parent incoming request is cancelled. In such cases, the script can simply pass the incoming context object to the outgoing request, automatically linking the cancellation behavior.
+
+Example:
+```python
+async def generate(self, request, context):
+    # Forward the incoming context to outgoing request
+    # If the incoming request is cancelled, the outgoing request will be too
+    stream = await self.client.generate(request, context=context)
+    async for response in stream:
+        yield response
+```
+
+This design enables seamless cancellation propagation through multi-tier request chains, ensuring that when a client cancels a request, all associated sub-requests are automatically cancelled, saving computational resources across the entire request pipeline.
--- a/fern/pages/fault-tolerance/request-migration.md
+++ b/fern/pages/fault-tolerance/request-migration.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Request Migration Architecture"
+---
+
+This document describes how Dynamo implements request migration to handle worker failures gracefully during LLM text generation. Request migration allows in-progress requests to continue on different workers when the original worker becomes unavailable, providing fault tolerance and improved user experience.
+
+## Overview
+
+Request migration is implemented through a Migration operator that sits in the LLM processing pipeline between the Backend operator and the service backend. When a worker fails during request processing, the migration system preserves the partial generation state and recreates the request on a new worker to continue from where the previous worker left off.
+
+## Architecture Components
+
+### Migrator
+
+The migration system is integrated into the LLM processing pipeline between the frontend preprocessing and the actual service backends. This positioning allows it to intercept all communication flows and manage failure scenarios transparently.
+
+Key responsibilities:
+- Intercepts all requests and responses flowing through the pipeline
+- Detects worker failure scenarios through error pattern matching
+- Manages retry logic with configurable migration limits
+- Tracks partial response state for seamless continuation
+
+### Migration Limit Configuration
+
+Each model can be configured with a migration limit parameter that specifies the maximum number of times a request can be migrated to another worker:
+
+- Default behavior: no migration allowed
+- Can be set independently for different engine types
+- Applicable to LLM worker nodes that perform inference
+- Allows engines to override user-specified limits for compatibility
+
+## Token State Tracking and Request Migration
+
+The core of the migration system is the ability to preserve and continue partial generations through token state management. This ensures that when a worker fails mid-generation, the new worker can seamlessly continue from the exact point of failure.
+
+### Token Accumulation Process
+
+When a request is being processed and responses are flowing back from a worker, the migration system tracks every token that has been successfully generated:
+
+1. **Initial Request State**: The system starts with the original preprocessed request containing the initial prompt tokens.
+
+2. **Response Tracking**: As each response arrives from the worker, the migration system extracts the newly generated tokens and appends them to the request's token sequence. The request therefore accumulates every token generated so far.
+
+3. **Token Count Management**: The system also updates the remaining token budget to reflect the number of tokens already generated, ensuring that the total generation stays within the originally requested limits.
+
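The three steps above can be sketched as a small bookkeeping routine. This is an illustration only: `SketchRequest`, its field names, and `track_response` are hypothetical and do not reflect the actual request type used by the Migration operator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchRequest:
    """Hypothetical request shape; field names are illustrative."""
    token_ids: List[int]   # prompt tokens plus everything generated so far
    max_tokens: int        # remaining generation budget


def track_response(request: SketchRequest, new_token_ids: List[int]) -> None:
    # Append the new tokens so a replacement worker would see them as context.
    request.token_ids.extend(new_token_ids)
    # Shrink the budget so the total stays within the originally requested limit.
    request.max_tokens = max(0, request.max_tokens - len(new_token_ids))


request = SketchRequest(token_ids=[101, 102, 103], max_tokens=8)
track_response(request, [7, 8])   # first response carries two tokens
track_response(request, [9])      # second response carries one token
print(request.token_ids, request.max_tokens)  # [101, 102, 103, 7, 8, 9] 5
```
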
+### Migration Trigger Scenarios
+
+The migration system handles two distinct failure scenarios:
+
+#### 1. New Request Migration (Initial Connection Failure)
+
+**Scenario**: Worker is unreachable when creating the initial connection.
+
+**Error Pattern**: Communication system reports chosen worker instance is unavailable.
+
+**Migration Process**:
+- Detects connection failure during initial stream setup
+- Decrements migration retry count
+- Attempts to create a new stream with the original request
+- No partial state to preserve since generation hasn't started
+
+#### 2. Ongoing Request Migration (Mid-Stream Disconnection)
+
+**Scenario**: Connection lost during active generation after partial responses have been received.
+
+**Error Pattern**: Stream termination detected before generation completion.
+
+**Migration Process**:
+
+1. **Failure Detection**: The system detects the stream disconnection through error monitoring.
+
+2. **State Preservation**: At this point, the request's token sequence contains both the original prompt tokens and all successfully generated tokens from the failed worker.
+
+3. **New Stream Creation**: A fresh stream is created with the accumulated request state, ensuring the new worker has complete context.
+
+4. **Continuation**: The new worker receives the request with the full token context and continues generation from the exact point where the previous worker left off.
+
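Both scenarios can be modelled with one retry loop that re-opens a stream using whatever state has accumulated. The sketch below is a simplified, synchronous model under stated assumptions: `WorkerUnavailable`, `generate_with_migration`, and `make_flaky_open_stream` are hypothetical names, tokens are plain integers, and the first simulated worker fails after two tokens.

```python
from typing import Callable, Iterator, List


class WorkerUnavailable(Exception):
    """Hypothetical error standing in for a detected worker failure."""


def generate_with_migration(
    prompt: List[int],
    max_tokens: int,
    open_stream: Callable[[List[int], int], Iterator[int]],
    migration_limit: int,
) -> List[int]:
    tokens = list(prompt)          # prompt plus every token generated so far
    remaining = max_tokens         # token budget still available
    retries_left = migration_limit
    while remaining > 0:
        try:
            # Both failure modes surface here: opening the stream or reading it.
            for token in open_stream(list(tokens), remaining):
                tokens.append(token)
                remaining -= 1
            break                  # stream ended normally
        except WorkerUnavailable:
            if retries_left == 0:
                raise              # migration limit exhausted
            retries_left -= 1      # retry with the accumulated state
    return tokens[len(prompt):]


def make_flaky_open_stream() -> Callable[[List[int], int], Iterator[int]]:
    calls = {"count": 0}

    def open_stream(tokens: List[int], remaining: int) -> Iterator[int]:
        calls["count"] += 1
        fail_mid_stream = calls["count"] == 1   # only the first worker fails
        start = len(tokens)
        for offset in range(remaining):
            if fail_mid_stream and offset == 2:
                raise WorkerUnavailable("worker lost mid-stream")
            yield start + offset   # toy "token": its position in the sequence

    return open_stream


result = generate_with_migration([0, 1, 2], 5, make_flaky_open_stream(), migration_limit=1)
print(result)  # [3, 4, 5, 6, 7]
```

The output is the same sequence a single healthy worker would have produced, which is the property the continuation step relies on. With `migration_limit=0` the same failure propagates to the caller instead.
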
+### Seamless Token Flow and Request State Evolution
+
+From the client's perspective, the token stream appears continuous and uninterrupted. The client receives tokens from the first worker until failure occurs, then seamlessly continues receiving tokens from the backup worker without any indication of the underlying migration.
+
+The request state evolves dynamically during processing. Initially, the request contains only the original prompt tokens. As generation proceeds, each successfully generated token is appended to the request's token sequence, creating a growing record of the complete conversation context.
+
+When a migration occurs, this accumulated state is transferred to the new worker, which uses it to reconstruct the complete context. The new worker then continues generation as if it had been processing the request from the beginning, but starting from the current position in the sequence.
+
+The migration is transparent because:
+1. No tokens are lost or duplicated during the transition
+2. The new worker has complete context via the accumulated token sequence
+3. Generation continues from the exact failure point
+4. Response streaming maintains consistent format and timing
+
+This token accumulation mechanism ensures that migrations are truly seamless, preserving all computational work and maintaining generation quality across worker transitions.
+
+## Benefits
+
+1. **Fault Tolerance**: System continues operating during individual worker failures
+2. **Resource Efficiency**: Partial generations are preserved rather than restarted
+3. **Seamless User Experience**: Users experience no interruption during worker failures
+4. **Configurable Behavior**: Migration limits allow tuning based on deployment requirements
+5. **No Token Loss**: Complete preservation of generation state across migrations
+
+## Design Considerations
+
+The migration system is designed with several important architectural considerations:
+
+**Engine Compatibility**: Different LLM engines may have varying capabilities for handling migrated requests. The system allows engines to override migration settings to ensure compatibility and correctness.
+
+**Multi-Model Support**: Since a frontend may serve multiple models simultaneously, migration limits can be configured at the engine level, providing flexibility for different model types with varying reliability characteristics.
+
+**State Management**: The system carefully tracks not only token sequences but also metadata such as remaining token budgets, stop conditions, and sampling parameters to ensure complete state preservation.
+
+**Error Handling**: The migration system distinguishes between different types of failures and applies appropriate recovery strategies for each scenario.
+
+## Monitoring and Metrics
+
+The migration system exposes Prometheus metrics to monitor migration activity. These metrics are available on the frontend's `/metrics` endpoint (default port 8000):
+
+- `dynamo_frontend_model_migration_total`: Counter tracking the total number of request migrations
+  - Labels:
+    - `model`: The model name being served
+    - `migration_type`: Either `new_request` (initial connection failure) or `ongoing_request` (mid-stream disconnection)
+
+**Example metrics output:**
+```
+dynamo_frontend_model_migration_total{migration_type="ongoing_request",model="Qwen/Qwen3-0.6B"} 3
+dynamo_frontend_model_migration_total{migration_type="new_request",model="Qwen/Qwen3-0.6B"} 1
+```
+
+These metrics can be used to:
+- Monitor worker reliability and failure patterns
+- Alert on excessive migration rates indicating infrastructure issues
+- Track the effectiveness of fault tolerance mechanisms
+
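As a rough illustration of consuming this counter outside Prometheus, the following sketch parses the example output shown earlier and totals migrations by `migration_type`. The parsing is deliberately minimal and assumes label values contain no escaped quotes or closing braces; `migrations_by_type` is a hypothetical helper, not part of Dynamo.

```python
import re
from collections import defaultdict

SAMPLE = '''\
dynamo_frontend_model_migration_total{migration_type="ongoing_request",model="Qwen/Qwen3-0.6B"} 3
dynamo_frontend_model_migration_total{migration_type="new_request",model="Qwen/Qwen3-0.6B"} 1
'''

# Matches one sample line of the migration counter and captures labels and value.
LINE = re.compile(
    r'^dynamo_frontend_model_migration_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def migrations_by_type(metrics_text: str) -> dict:
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip comments and unrelated metrics
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get("migration_type", "unknown")] += float(match.group("value"))
    return dict(totals)


print(migrations_by_type(SAMPLE))  # {'ongoing_request': 3.0, 'new_request': 1.0}
```
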
+For more information on Dynamo metrics, see the [Metrics documentation](../observability/metrics.md).
+
+## Operational Impact
+
+Request migration fundamentally changes how the system handles failures, moving from a "fail-fast" approach to a "graceful degradation" model. This architectural shift enables higher availability and better resource utilization while maintaining the same external API contract for clients.
--- a/fern/pages/fault-tolerance/request-rejection.md
+++ b/fern/pages/fault-tolerance/request-rejection.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Request Rejection (Load Shedding)"
+---
+
+This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
+
+## Overview
+
+Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:
+
+- Cascading failures from resource exhaustion
+- Degraded latency for all requests
+- Out-of-memory conditions on GPU workers
+
+When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.
+
+## Architecture
+
+```
+                                    ┌─────────────────┐
+                                    │  Worker Monitor │
+                                    │  (Background)   │
+                                    └────────┬────────┘
+                                             │ Updates busy list
+                                             ▼
+┌──────────┐    ┌──────────┐    ┌─────────────────────┐    ┌──────────┐
+│  Client  │───▶│ Frontend │───▶│    Push Router      │───▶│  Worker  │
+└──────────┘    └──────────┘    │ (checks busy list)  │    └──────────┘
+                                └─────────────────────┘
+                                         │
+                                         │ If all workers busy
+                                         ▼
+                                ┌─────────────────────┐
+                                │   HTTP 503 Error    │
+                                │ "All workers busy"  │
+                                └─────────────────────┘
+```
+
+## Configuration
+
+### Frontend Arguments
+
+Configure busy thresholds when starting the frontend:
+
+```bash
+python -m dynamo.frontend \
+    --active-decode-blocks-threshold 0.85 \
+    --active-prefill-tokens-threshold 10000
+```
+
+| Argument | Type | Description |
+|----------|------|-------------|
+| `--active-decode-blocks-threshold` | float (0.0-1.0) | KV cache block utilization threshold |
+| `--active-prefill-tokens-threshold` | int | Prefill token count threshold |
+
+### Dynamic Configuration via API
+
+Thresholds can be adjusted at runtime via the `/busy_threshold` endpoint:
+
+#### Set Thresholds
+
+```bash
+curl -X POST http://localhost:8000/busy_threshold \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "active_decode_blocks_threshold": 0.85,
+    "active_prefill_tokens_threshold": 10000
+  }'
+```
+
+#### Get Current Thresholds
+
+```bash
+curl http://localhost:8000/busy_threshold
+```
+
+Response:
+```json
+{
+  "thresholds": [
+    {
+      "model": "Qwen/Qwen3-0.6B",
+      "active_decode_blocks_threshold": 0.85,
+      "active_prefill_tokens_threshold": 10000
+    }
+  ]
+}
+```
+
+## Busy Detection Logic
+
+Workers are marked as "busy" based on a dual-threshold system. A worker is considered busy when **either** threshold is exceeded.
+
+### KV Cache Block Threshold
+
+Monitors the percentage of KV cache blocks in use:
+
+```
+busy = active_decode_blocks / kv_total_blocks > threshold
+```
+
+Example: With `active_decode_blocks_threshold=0.85`, a worker using 87% of its KV cache blocks is marked busy.
+
+### Prefill Token Threshold
+
+Monitors the number of tokens currently being prefilled:
+
+```
+busy = active_prefill_tokens > threshold
+```
+
+Example: With `active_prefill_tokens_threshold=10000`, a worker prefilling 12,000 tokens is marked busy.
+
+### Data-Parallel Rank Aggregation
+
+For workers with multiple data-parallel (DP) ranks, the worker is marked busy only if **all** of its ranks are busy:
+
+```python
+def is_busy(worker):
+    return all(rank.is_busy() for rank in worker.dp_ranks)
+```
+
+This prevents false positives when only some ranks are temporarily loaded.
+
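Putting the two thresholds and the rank aggregation together, the check might look like the sketch below. The field names follow the metric names in this document, but `RankLoad`, `rank_is_busy`, and `worker_is_busy` are hypothetical, and treating a worker with no reported ranks as not busy is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RankLoad:
    """Per-rank load; fields mirror the published metric names."""
    active_decode_blocks: int
    kv_total_blocks: int
    active_prefill_tokens: int


def rank_is_busy(
    load: RankLoad,
    blocks_threshold: Optional[float] = None,
    tokens_threshold: Optional[int] = None,
) -> bool:
    # KV cache check: fraction of blocks in use exceeds the threshold.
    blocks_busy = (
        blocks_threshold is not None
        and load.kv_total_blocks > 0
        and load.active_decode_blocks / load.kv_total_blocks > blocks_threshold
    )
    # Prefill check: tokens being prefilled exceed the threshold.
    tokens_busy = (
        tokens_threshold is not None
        and load.active_prefill_tokens > tokens_threshold
    )
    # Either threshold being exceeded marks the rank busy.
    return blocks_busy or tokens_busy


def worker_is_busy(
    ranks: List[RankLoad],
    blocks_threshold: Optional[float] = None,
    tokens_threshold: Optional[int] = None,
) -> bool:
    # Assumption: a worker that has reported no ranks is treated as not busy.
    return bool(ranks) and all(
        rank_is_busy(rank, blocks_threshold, tokens_threshold) for rank in ranks
    )


hot = RankLoad(active_decode_blocks=87, kv_total_blocks=100, active_prefill_tokens=0)
cool = RankLoad(active_decode_blocks=40, kv_total_blocks=100, active_prefill_tokens=0)
print(worker_is_busy([hot, hot], blocks_threshold=0.85))   # True
print(worker_is_busy([hot, cool], blocks_threshold=0.85))  # False
```
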
+## Worker Load Monitoring
+
+The `KvWorkerMonitor` runs as a background task that:
+
+1. Subscribes to KV cache metrics events from workers
+2. Maintains load state for each worker instance
+3. Recalculates busy instances when metrics change
+4. Updates the router with the current busy list
+
+### Metrics Collected
+
+Workers publish these metrics for monitoring:
+
+| Metric | Description |
+|--------|-------------|
+| `active_decode_blocks` | Number of KV cache blocks currently in use |
+| `kv_total_blocks` | Total KV cache blocks available |
+| `active_prefill_tokens` | Number of tokens currently being prefilled |
+
+## Rejection Behavior
+
+### Request Flow
+
+1. Request arrives at frontend
+2. Push router checks if busy threshold is configured
+3. If configured, router retrieves list of free (non-busy) instances
+4. If no free instances exist (but instances are registered):
+   - Request is rejected with `PipelineError::ServiceOverloaded`
+   - HTTP 503 response is returned to client
+
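The request flow above can be sketched as a filter over registered instances. This is an illustration in Python rather than the router's Rust code: `ServiceOverloaded` stands in for `PipelineError::ServiceOverloaded`, and `select_free_instances` and its parameters are hypothetical.

```python
from typing import Iterable, List, Set


class ServiceOverloaded(Exception):
    """Stand-in for PipelineError::ServiceOverloaded (returned to clients as HTTP 503)."""


def select_free_instances(
    instances: Iterable[int],
    busy: Set[int],
    threshold_configured: bool,
) -> List[int]:
    registered = list(instances)
    if not threshold_configured:
        # Without thresholds, every registered instance is a candidate.
        return registered
    free = [instance for instance in registered if instance not in busy]
    if registered and not free:
        # Instances exist but all of them are busy: shed the request.
        raise ServiceOverloaded("All workers are busy, please retry later")
    return free


print(select_free_instances([1, 2, 3], busy={2}, threshold_configured=True))  # [1, 3]
```

The `registered and not free` condition mirrors step 4: rejection applies only when instances are registered, so an empty deployment is left to whatever handling applies when no workers exist.
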
+### Error Response
+
+When requests are rejected, clients receive:
+
+```http
+HTTP/1.1 503 Service Unavailable
+Content-Type: application/json
+
+{
+  "message": "Service temporarily unavailable: All workers are busy, please retry later",
+  "type": "service_unavailable",
+  "code": 503
+}
+```
+
+### Client Retry Strategy
+
+Clients should implement exponential backoff when receiving 503 responses:
+
+```python
+import random
+import time
+
+def send_with_retry(request, max_retries=5):
+    # send_request is a placeholder for your HTTP client call
+    for attempt in range(max_retries):
+        response = send_request(request)
+        if response.status_code != 503:
+            return response
+
+        # Exponential backoff with jitter, capped at 60 seconds;
+        # skip the wait after the final attempt
+        if attempt < max_retries - 1:
+            wait_time = min(60, (2 ** attempt) + random.uniform(0, 1))
+            time.sleep(wait_time)
+
+    raise RuntimeError("Max retries exceeded")
+```
+
+## Monitoring
+
+### Prometheus Metrics
+
+Track rejection behavior with these metrics:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `dynamo_tasks_rejected_total` | Counter | Total number of rejected tasks |
+| `dynamo_queued_requests` | Gauge | Requests waiting in HTTP queue |
+
+### Example Prometheus Queries
+
+```promql
+# Rejection rate over 5 minutes
+rate(dynamo_tasks_rejected_total[5m])
+
+# Percentage of requests rejected
+sum(rate(dynamo_tasks_rejected_total[5m])) /
+sum(rate(dynamo_tasks_issued_total[5m])) * 100
+```
+
+### Grafana Alerting
+
+Example alert for high rejection rate:
+
+```yaml
+alert: HighRequestRejectionRate
+expr: |
+  sum(rate(dynamo_tasks_rejected_total[5m])) /
+  sum(rate(dynamo_tasks_issued_total[5m])) > 0.1
+for: 5m
+labels:
+  severity: warning
+annotations:
+  summary: "High request rejection rate"
+  description: "More than 10% of requests are being rejected"
+```
+
+## Tuning Thresholds
+
+### Conservative Settings (Latency-Focused)
+
+For applications prioritizing low latency:
+
+```bash
+--active-decode-blocks-threshold 0.70
+--active-prefill-tokens-threshold 5000
+```
+
+- Rejects earlier, before workers become fully loaded
+- Maintains lower queue depths
+- Better tail latencies
+
+### Aggressive Settings (Throughput-Focused)
+
+For applications prioritizing throughput:
+
+```bash
+--active-decode-blocks-threshold 0.95
+--active-prefill-tokens-threshold 20000
+```
+
+- Allows higher worker utilization
+- May increase latency variability
+- Better overall throughput
+
+### Disabled (No Rejection)
+
+To disable request rejection entirely:
+
+```bash
+# Simply don't set the threshold arguments
+python -m dynamo.frontend
+```
+
+Without thresholds configured, all requests are accepted regardless of worker load.
+
+## Best Practices
+
+### 1. Start Conservative, Then Tune
+
+Begin with conservative thresholds and increase based on observed behavior:
+
+```bash
+# Start here
+--active-decode-blocks-threshold 0.75
+
+# Increase if rejection rate is too high
+--active-decode-blocks-threshold 0.85
+```
+
+### 2. Monitor Before Enabling
+
+Observe worker load patterns before setting thresholds:
+
+```bash
+# Watch KV cache utilization
+watch -n 1 'curl -s localhost:8000/metrics | grep kv_blocks'
+```
+
+### 3. Use Both Thresholds for Disaggregated Serving
+
+In disaggregated deployments:
+- Use `active_prefill_tokens_threshold` for prefill workers
+- Use `active_decode_blocks_threshold` for decode workers
+
+### 4. Coordinate with Autoscaling
+
+If using Kubernetes HPA, set the autoscaling trigger below the rejection threshold so that scale-out begins before requests start being rejected:
+
+```yaml
+# HPA triggers at 70% utilization
+# Rejection at 85% provides buffer
+--active-decode-blocks-threshold 0.85
+```
+
+## Related Documentation
+
+- [Health Checks](../observability/health-checks.md) - Worker health monitoring
+- [Metrics](../observability/metrics.md) - Available Prometheus metrics
+- [Request Migration](request-migration.md) - Handling failed requests
--- a/fern/pages/fault-tolerance/testing.md
+++ b/fern/pages/fault-tolerance/testing.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Fault Tolerance Testing"
+---
+
+This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios.
+
+## Overview
+
+Dynamo's fault tolerance test suite is located in `tests/fault_tolerance/` and includes:
+
+| Test Category | Location | Purpose |
+|---------------|----------|---------|
+| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
+| Migration | `migration/` | Request migration when workers fail |
+| etcd HA | `etcd_ha/` | etcd failover and recovery |
+| Hardware | `hardware/` | GPU and network fault injection |
+| Deployment | `deploy/` | End-to-end deployment testing |
+
+## Test Directory Structure
+
+```
+tests/fault_tolerance/
+├── cancellation/
+│   ├── test_vllm.py
+│   ├── test_trtllm.py
+│   ├── test_sglang.py
+│   └── utils.py
+├── migration/
+│   ├── test_vllm.py
+│   ├── test_trtllm.py
+│   ├── test_sglang.py
+│   └── utils.py
+├── etcd_ha/
+│   ├── test_vllm.py
+│   ├── test_trtllm.py
+│   ├── test_sglang.py
+│   └── utils.py
+├── hardware/
+│   └── fault_injection_service/
+│       ├── api_service/
+│       └── agents/
+├── deploy/
+│   ├── test_deployment.py
+│   ├── scenarios.py
+│   ├── base_checker.py
+│   └── ...
+└── client.py
+```
+
+## Request Cancellation Tests
+
+Test that in-flight requests can be properly canceled.
+
+### Running Cancellation Tests
+
+```bash
+# Run all cancellation tests
+pytest tests/fault_tolerance/cancellation/ -v
+
+# Run for specific backend
+pytest tests/fault_tolerance/cancellation/test_vllm.py -v
+```
+
+### Cancellation Test Utilities
+
+The `cancellation/utils.py` module provides:
+
+#### CancellableRequest
+
+Thread-safe request cancellation via TCP socket manipulation:
+
+```python
+import time
+from threading import Thread
+
+from tests.fault_tolerance.cancellation.utils import CancellableRequest
+
+request = CancellableRequest()
+
+# Send request in separate thread (send_request is a placeholder for your request function)
+thread = Thread(target=send_request, args=(request,))
+thread.start()
+
+# Cancel after some time
+time.sleep(1)
+request.cancel()  # Closes underlying socket
+```
+
+#### send_completion_request / send_chat_completion_request
+
+Send cancellable completion requests:
+
+```python
+from tests.fault_tolerance.cancellation.utils import (
+    CancellableRequest,
+    send_completion_request,
+    send_chat_completion_request
+)
+
+# Non-streaming
+response = send_completion_request(
+    base_url="http://localhost:8000",
+    model="Qwen/Qwen3-0.6B",
+    prompt="Hello, world!",
+    max_tokens=100
+)
+
+# Streaming with cancellation
+request = CancellableRequest()  # Handle used to cancel the request mid-stream
+responses = send_chat_completion_request(
+    base_url="http://localhost:8000",
+    model="Qwen/Qwen3-0.6B",
+    messages=[{"role": "user", "content": "Hello!"}],
+    stream=True,
+    cancellable_request=request
+)
+```
+
+#### poll_for_pattern
+
+Wait for specific patterns in logs:
+
+```python
+from tests.fault_tolerance.cancellation.utils import poll_for_pattern
+
+# Wait for cancellation confirmation
+found = poll_for_pattern(
+    log_file="/var/log/dynamo/worker.log",
+    pattern="Request cancelled",
+    timeout=30,
+    interval=0.5
+)
+```
+
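For readers unfamiliar with this style of helper, the sketch below shows how a log-polling function of this shape could work. It is not the implementation in `cancellation/utils.py`; `poll_for_pattern_sketch` is a hypothetical name, and re-reading the whole file on every poll is a simplification that suits small test logs.

```python
import re
import time
from pathlib import Path


def poll_for_pattern_sketch(
    log_file: str,
    pattern: str,
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Return True once `pattern` appears in `log_file`, or False after `timeout` seconds."""
    regex = re.compile(pattern)
    path = Path(log_file)
    deadline = time.monotonic() + timeout
    while True:
        # The log file may not exist yet if the worker has not started writing.
        if path.exists() and regex.search(path.read_text(errors="replace")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by system clock adjustments during a long test run.
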
+## Migration Tests
+
+Test that requests migrate to healthy workers when failures occur.
+
+### Running Migration Tests
+
+```bash
+# Run all migration tests
+pytest tests/fault_tolerance/migration/ -v
+
+# Run for specific backend
+pytest tests/fault_tolerance/migration/test_vllm.py -v
+```
+
+### Migration Test Utilities
+
+The `migration/utils.py` module provides:
+
+- Frontend wrapper with configurable request planes
+- Long-running request spawning for migration scenarios
+- Health check disabling for controlled testing
+
+### Example Migration Test
+
+```python
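+def test_migration_on_worker_failure():
+    # Illustrative outline: start_deployment, spawn_long_request, and kill_worker
+    # are placeholder helpers, not verified names from migration/utils.py
+    # Start deployment with 2 workers
+    deployment = start_deployment(workers=2)
+
+    # Send long-running request
+    request_thread = spawn_long_request(max_tokens=1000)
+
+    # Kill one worker mid-generation
+    kill_worker(deployment.workers[0])
+
+    # Verify request completes on remaining worker
+    # (join() is assumed to return the response, unlike threading.Thread.join)
+    response = request_thread.join()
+    assert response.status_code == 200
+    assert len(response.tokens) > 0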
+```
+
+## etcd HA Tests
+
+Test system behavior during etcd failures and recovery.
+
+### Running etcd HA Tests
+
+```bash
+pytest tests/fault_tolerance/etcd_ha/ -v
+```
+
+### Test Scenarios
+
+- **Leader failover**: etcd leader node fails, cluster elects new leader
+- **Network partition**: etcd node becomes unreachable
+- **Recovery**: System recovers after etcd becomes available
+
+## Hardware Fault Injection
+
+The fault injection service enables testing under simulated hardware failures.
+
+### Fault Injection Service
+
+Located at `tests/fault_tolerance/hardware/fault_injection_service/`, this FastAPI service orchestrates fault injection:
+
+```bash
+# Start the fault injection service
+cd tests/fault_tolerance/hardware/fault_injection_service
+python -m api_service.main
+```
+
+### Supported Fault Types
+
+#### GPU Faults
+
+| Fault Type | Description |
+|------------|-------------|
+| `XID_ERROR` | Simulate GPU XID error (various codes) |
+| `THROTTLE` | GPU thermal throttling |
+| `MEMORY_PRESSURE` | GPU memory exhaustion |
+| `OVERHEAT` | GPU overheating condition |
+| `COMPUTE_OVERLOAD` | GPU compute saturation |
+
+#### Network Faults
+
+| Fault Type | Description |
+|------------|-------------|
+| `FRONTEND_WORKER` | Partition between frontend and workers |
+| `WORKER_NATS` | Partition between workers and NATS |
+| `WORKER_WORKER` | Partition between workers |
+| `CUSTOM` | Custom network partition |
+
+### Fault Injection API
+
+#### Inject GPU Fault
+
+```bash
+curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
+  -H "Content-Type: application/json" \
+  -d '{
+    "target_pod": "vllm-worker-0",
+    "fault_type": "XID_ERROR",
+    "severity": "HIGH"
+  }'
+```
+
+#### Inject Specific XID Error
+
+```bash
+# Inject XID 79 (GPU has fallen off the bus)
+curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
+  -H "Content-Type: application/json" \
+  -d '{"target_pod": "vllm-worker-0"}'
+```
+
+Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120
+
+#### Inject Network Partition
+
+```bash
+curl -X POST http://localhost:8080/api/v1/faults/network/inject \
+  -H "Content-Type: application/json" \
+  -d '{
+    "partition_type": "FRONTEND_WORKER",
+    "duration_seconds": 30
+  }'
+```
+
+#### Recover from Fault
+
+```bash
+curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
+```
+
+#### List Active Faults
+
+```bash
+curl http://localhost:8080/api/v1/faults
+```
+
+### GPU Fault Injector Agent
+
+The GPU fault injector runs as a DaemonSet on worker nodes:
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: gpu-fault-injector
+spec:
+  selector:
+    matchLabels:
+      app: gpu-fault-injector
+  template:
+    spec:
+      containers:
+      - name: agent
+        image: dynamo/gpu-fault-injector:latest
+        securityContext:
+          privileged: true
+        volumeMounts:
+        - name: dev
+          mountPath: /dev
+```
+
+The agent injects fake XID messages via `/dev/kmsg` to trigger NVSentinel detection.
+
+## Deployment Testing Framework
+
+The `deploy/` directory contains an end-to-end testing framework.
+
+### Test Phases
+
+Tests run through three phases:
+
+| Phase | Description |
+|-------|-------------|
+| `STANDARD` | Baseline performance under normal conditions |
+| `OVERFLOW` | System behavior during fault/overload |
+| `RECOVERY` | System recovery after fault resolution |
+
+### Scenario Configuration
+
+Define test scenarios in `scenarios.py`:
+
+```python
+from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure
+
+scenario = Scenario(
+    name="worker_failure_migration",
+    backend="vllm",
+    load=Load(
+        clients=10,
+        requests_per_client=100,
+        max_tokens=256
+    ),
+    failure=Failure(
+        type="pod_kill",
+        target="vllm-worker-0",
+        trigger_after_requests=50
+    )
+)
+```
+
+### Running Deployment Tests
+
+```bash
+# Run all deployment tests
+pytest tests/fault_tolerance/deploy/test_deployment.py -v
+
+# Run specific scenario
+pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
+```
+
+### Validation Checkers
+
+The framework includes pluggable validators:
+
+```python
+from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext
+
+class MigrationChecker(BaseChecker):
+    def check(self, context: ValidationContext) -> bool:
+        # Verify migrations occurred
+        migrations = context.metrics.get("migrations_total", 0)
+        return migrations > 0
+```
+
+### Results Parsing
+
+Parse test results for analysis:
+
+```python
+from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test
+
+results = process_overflow_recovery_test(log_dir="/path/to/logs")
+print(f"Success rate: {results['success_rate']}")
+print(f"P99 latency: {results['p99_latency_ms']}ms")
+```
+
+## Client Utilities
+
+The `client.py` module provides shared client functionality:
+
+### Multi-Threaded Load Generation
+
+```python
+from tests.fault_tolerance.client import client
+
+# Generate load with multiple clients
+results = client(
+    base_url="http://localhost:8000",
+    num_clients=10,
+    requests_per_client=100,
+    model="Qwen/Qwen3-0.6B",
+    max_tokens=256,
+    log_dir="/tmp/test_logs"
+)
+```
+
+### Request Options
+
+| Parameter | Description |
+|-----------|-------------|
+| `base_url` | Frontend URL |
+| `num_clients` | Number of concurrent clients |
+| `requests_per_client` | Requests per client |
+| `model` | Model name |
+| `max_tokens` | Max tokens per request |
+| `log_dir` | Directory for client logs |
+| `endpoint` | `completions` or `chat/completions` |
+
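The general shape of multi-threaded load generation can be sketched with the standard library. This is not the `client` function from `client.py`: `run_load` and `fake_send` are hypothetical, and the send function is injected so the sketch runs without a live frontend.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_load(
    send: Callable[[int, int], bool],
    num_clients: int,
    requests_per_client: int,
) -> Dict[str, int]:
    def one_client(client_id: int) -> List[bool]:
        # Each client issues its requests sequentially; clients run concurrently.
        return [send(client_id, n) for n in range(requests_per_client)]

    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        per_client = list(pool.map(one_client, range(num_clients)))

    outcomes = [ok for results in per_client for ok in results]
    return {"total": len(outcomes), "succeeded": sum(outcomes)}


def fake_send(client_id: int, request_index: int) -> bool:
    # Pretend every tenth request fails, standing in for a real HTTP call.
    return request_index % 10 != 0


summary = run_load(fake_send, num_clients=4, requests_per_client=20)
print(summary)  # {'total': 80, 'succeeded': 72}
```
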
+## Running the Full Test Suite
+
+### Prerequisites
+
+1. Kubernetes cluster with GPU nodes
+2. Dynamo deployment
+3. etcd cluster (for HA tests)
+4. Fault injection service (for hardware tests)
+
+### Environment Setup
+
+```bash
+export KUBECONFIG=/path/to/kubeconfig
+export DYNAMO_NAMESPACE=dynamo-test
+export FRONTEND_URL=http://localhost:8000
+```
+
+### Run All Tests
+
+```bash
+# Install test dependencies
+pip install pytest pytest-asyncio
+
+# Run all fault tolerance tests
+pytest tests/fault_tolerance/ -v --tb=short
+
+# Run with specific markers
+pytest tests/fault_tolerance/ -v -m "not slow"
+```
+
+### Test Markers
+
+| Marker | Description |
+|--------|-------------|
+| `slow` | Long-running tests (> 5 minutes) |
+| `gpu` | Requires GPU resources |
+| `k8s` | Requires Kubernetes cluster |
+| `etcd_ha` | Requires multi-node etcd |
+
+## Best Practices
+
+### 1. Isolate Test Environments
+
+Run fault tolerance tests in dedicated namespaces:
+
+```bash
+kubectl create namespace dynamo-fault-test
+```
+
+### 2. Clean Up After Tests
+
+Ensure fault injection is recovered:
+
+```bash
+# List and recover all active faults
+curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
+  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
+```
+
+### 3. Collect Logs
+
+Preserve logs for debugging:
+
+```bash
+pytest tests/fault_tolerance/ -v \
+  --log-dir=/tmp/fault_test_logs \
+  --capture=no
+```
+
+### 4. Monitor During Tests
+
+Watch system state during tests:
+
+```bash
+# Terminal 1: Watch pods
+watch kubectl get pods -n dynamo-test
+
+# Terminal 2: Watch metrics
+watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
+```
+
+## Related Documentation
+
+- [Request Migration](request-migration.md) - Migration implementation details
+- [Request Cancellation](request-cancellation.md) - Cancellation implementation
+- [Health Checks](../observability/health-checks.md) - Health monitoring
+- [Metrics](../observability/metrics.md) - Available metrics for monitoring
--- a/fern/pages/frontends/kserve.md
+++ b/fern/pages/frontends/kserve.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "KServe gRPC frontend"
+---
+
+## Motivation
+
+[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that comply with the KServe v2 API, and it has been widely adopted. To let Triton users quickly explore the benefits of Dynamo, Dynamo provides a KServe gRPC frontend.
+
+This documentation assumes readers are familiar with the KServe v2 API. It focuses on the Dynamo parts that work together to support the KServe API and on how users can migrate an existing KServe deployment to Dynamo.
+
+## Supported Endpoints
+
+* `ModelInfer` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1)
+* `ModelStreamInfer` endpoint: Triton extension endpoint that provides a bi-directional streaming version of the inference RPC, allowing a sequence of inference requests/responses to be sent over a gRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92)
+* `ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1)
+* `ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)
+
+## Starting the Frontend
+
+To start the KServe frontend, run the following command:
+```
+python -m dynamo.frontend --kserve-grpc-server
+```
+
+## Registering a Backend
+
+As with the HTTP frontend, a registered backend is auto-discovered and added to the frontend's list of served models. To register a backend, use the same `register_llm()` API. Currently the frontend supports serving the following combinations of model type and model input:
+* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
+* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
+* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor based inference
+
+The first two combinations are backed by the OpenAI Completions API; see the [OpenAI Completions section](#openai-completions) for more detail. The last combination is the one most aligned with the KServe API, and users can replace an existing deployment with Dynamo once their backends implement an adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`; see the [Tensor section](#tensor) for more detail.
+
+### OpenAI Completions
+
+Most Dynamo features are tailored for LLM inference. The combinations backed by the OpenAI API can enable those features and are best suited for exploring them. However, this requires a specific conversion between generic tensor-based messages and OpenAI messages, and it imposes a specific structure on the KServe request message.
+
+#### Model Metadata / Config
+
+The metadata and config endpoints report the registered backend as having the properties below. Note that this is not the exact response.
+```
+{
+    name: $MODEL_NAME,
+    version: 1,
+    platform: "dynamo",
+    backend: "dynamo", # model config specific
+    inputs: [
+        {
+            name: "text_input",
+            datatype: "BYTES",
+            shape: [1]
+        },
+        {
+            name: "streaming",
+            datatype: "BOOL",
+            shape: [1],
+            optional: true
+        }
+    ],
+    outputs: [
+        {
+            name: "text_output",
+            datatype: "BYTES",
+            shape: [-1]
+        },
+        {
+            name: "finish_reason",
+            datatype: "BYTES",
+            shape: [-1],
+            optional: true
+        }
+    ]
+}
+```
+
+#### Inference
+
+On receiving an inference request, the following conversion is performed:
+* `text_input`: the element is expected to contain the user prompt string and is converted to the `prompt` field in the OpenAI Completion request
+* `streaming`: the element is converted to the `stream` field in the OpenAI Completion request
+
+On receiving a model response, the following conversion is performed:
+* `text_output`: each element corresponds to one choice in the OpenAI Completion response, and its content is set to the `text` of the choice.
+* `finish_reason`: each element corresponds to one choice in the OpenAI Completion response, and its content is set to the `finish_reason` of the choice.
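+
+The conversion described above can be sketched in Python as follows. This is only an illustration under stated assumptions: plain dictionaries stand in for the KServe protobuf messages and the OpenAI Completions messages, and the helper names `kserve_to_completion_request` and `completion_response_to_kserve` are hypothetical, not part of Dynamo. Emitting `finish_reason` only when every choice carries one is also an assumption of this sketch.
+
+```python
+# Hypothetical, simplified stand-ins: the real frontend works on protobuf
+# messages, not on plain dicts.
+
+def kserve_to_completion_request(model_name, inputs):
+    """Map KServe-style named input tensors to an OpenAI-Completions-like dict."""
+    request = {"model": model_name, "prompt": inputs["text_input"][0]}
+    # `streaming` is optional; only set `stream` when the tensor is present.
+    if "streaming" in inputs:
+        request["stream"] = bool(inputs["streaming"][0])
+    return request
+
+
+def completion_response_to_kserve(response):
+    """Map an OpenAI-Completions-like response to KServe-style output tensors."""
+    choices = response["choices"]
+    # One `text_output` element per choice.
+    outputs = {"text_output": [c["text"] for c in choices]}
+    # Assumption: emit `finish_reason` only when every choice carries one.
+    reasons = [c.get("finish_reason") for c in choices]
+    if all(r is not None for r in reasons):
+        outputs["finish_reason"] = reasons
+    return outputs
+
+
+req = kserve_to_completion_request(
+    "my-model", {"text_input": ["Hello"], "streaming": [False]}
+)
+out = completion_response_to_kserve(
+    {"choices": [{"text": " world", "finish_reason": "stop"}]}
+)
+print(req)
+print(out)
+```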
+
+### Tensor
+
+This combination is used when the user is migrating an existing KServe-based backend into the Dynamo ecosystem.
+
+#### Model Metadata / Config
+
+When registering the backend, the backend must provide the model's metadata, because a tensor-based deployment is generic and the frontend cannot make the assumptions it makes for an OpenAI Completions model. There are two ways to provide model metadata:
+* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): This is a Dynamo-defined structure for model metadata; the backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/tests/test_tensor.py). For metadata provided this way, the following fields are set to fixed values: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for the model config endpoint, the rest of the fields are set to their default values.
+* [triton_model_config](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): Users who already have a Triton model config and need the full config returned for client-side logic can set the config in `TensorModelConfig::triton_model_config`, which supersedes the other fields in `TensorModelConfig` and is used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message; see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for an example.
+
+#### Inference
+
+On receiving an inference request, the backend receives an [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs) and is expected to return an [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs). These are Dynamo's mappings of the `ModelInferRequest` / `ModelInferResponse` protobuf messages.
+
+## Python Bindings
+
+The frontend can be started via the Python bindings. This is useful when integrating Dynamo into an existing system that needs the frontend to run in the same process as other components. See [server.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for an example.
--- a/fern/pages/getting-started/examples.md
+++ b/fern/pages/getting-started/examples.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Examples"
+---
+
+Explore practical examples to get started with NVIDIA Dynamo.
+
+## Quick Start Examples
+
+The [examples directory](https://github.com/ai-dynamo/dynamo/tree/main/examples) in the Dynamo repository contains ready-to-run examples for various use cases.
+
+### Backend Examples
+
+| Backend | Description | Link |
+|---------|-------------|------|
+| **vLLM** | Run inference with vLLM backend | [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm) |
+| **SGLang** | Run inference with SGLang backend | [examples/backends/sglang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang) |
+| **TensorRT-LLM** | Run inference with TensorRT-LLM backend | [examples/backends/trtllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm) |
+
+### Deployment Examples
+
+| Example | Description | Link |
+|---------|-------------|------|
+| **Basic Deployment** | Simple single-node deployment | [examples/basics](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics) |
+| **Kubernetes** | Deploy on Kubernetes | [examples/deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/deployments) |
+| **Multimodal** | Vision and multimodal models | [examples/multimodal](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal) |
+
+### Custom Backend Examples
+
+Learn how to create custom backends:
+
+| Example | Description | Link |
+|---------|-------------|------|
+| **Custom Backend** | Build your own backend | [examples/custom_backend](https://github.com/ai-dynamo/dynamo/tree/main/examples/custom_backend) |
+
+## Running Examples
+
+Most examples can be run directly after installing Dynamo:
+
+```bash
+# Clone the repository
+git clone https://github.com/ai-dynamo/dynamo.git
+cd dynamo
+
+# Navigate to an example
+cd examples/backends/sglang
+
+# Follow the README in each example directory
+```
--- a/fern/pages/getting-started/installation.md
+++ b/fern/pages/getting-started/installation.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Installation"
+---
+
+## Pip (PyPI)
+
+Install a pre-built wheel from PyPI.
+
+```bash
+# Create a virtual environment and activate it
+uv venv venv
+source venv/bin/activate
+
+# Install Dynamo from PyPI (choose one backend extra)
+uv pip install "ai-dynamo[sglang]"  # or [vllm], [trtllm]
+```
+
+## Pip from source
+
+Install directly from a local checkout for development.
+
+```bash
+# Clone the repository
+git clone https://github.com/ai-dynamo/dynamo.git
+cd dynamo
+
+# Create a virtual environment and activate it
+uv venv venv
+source venv/bin/activate
+uv pip install ".[sglang]"  # or [vllm], [trtllm]
+```
+
+## Docker
+
+Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`).
+
+```bash
+# Run a container (mount your workspace if needed)
+docker run --rm -it \
+  --gpus all \
+  --network host \
+  nvcr.io/nvidia/ai-dynamo/sglang-runtime:latest  # or vllm, tensorrtllm
+```
--- a/fern/pages/getting-started/quickstart.md
+++ b/fern/pages/getting-started/quickstart.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Welcome to NVIDIA Dynamo"
+---
+
+The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale.
+
+<Tip>
+**Discover the Latest Developments!**
+
+This guide is a snapshot of a specific point in time. For the latest information, examples, and Release Assets, see the [Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo/releases/latest).
+</Tip>
+
+## Quickstart
+
+Get started with Dynamo locally in just a few commands:
+
+### 1. Install Dynamo
+
+```bash
+# Install uv (recommended Python package manager)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Create virtual environment and install Dynamo
+uv venv venv
+source venv/bin/activate
+# Use prerelease flag to install RC versions of flashinfer and/or other dependencies
+uv pip install --prerelease=allow "ai-dynamo[sglang]"  # or [vllm], [trtllm]
+```
+
+### 2. Start etcd/NATS
+
+```bash
+# Fetch and start etcd and NATS using Docker Compose
+VERSION=$(uv pip show ai-dynamo | grep Version | cut -d' ' -f2)
+curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/refs/tags/v${VERSION}/deploy/docker-compose.yml
+docker compose -f docker-compose.yml up -d
+```
+
+### 3. Run Dynamo
+
+```bash
+# Start the OpenAI compatible frontend (default port is 8000)
+python -m dynamo.frontend
+
+# In another terminal, start an SGLang worker
+python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
+```
+
+### 4. Test Your Deployment
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "Qwen/Qwen3-0.6B",
+       "messages": [{"role": "user", "content": "Hello!"}],
+       "max_tokens": 50}'
+```
+
+## Key Features
+
+| Feature | Description |
+|---------|-------------|
+| **Multi-Backend Support** | vLLM, SGLang, and TensorRT-LLM backends |
+| **Disaggregated Serving** | Separate prefill and decode for optimal performance |
+| **KV Cache Routing** | Intelligent request routing based on KV cache state |
+| **Kubernetes Native** | Full operator and Helm chart support |
+| **Observability** | Prometheus metrics, Grafana dashboards, and tracing |
+
+## Documentation Overview
+
+### Backends
+- [vLLM Backend](../backends/vllm/README.md) - High-throughput serving with vLLM
+- [SGLang Backend](../backends/sglang/README.md) - Fast inference with SGLang
+- [TensorRT-LLM Backend](../backends/trtllm/README.md) - Optimized inference with TensorRT-LLM
+
+### Kubernetes Deployment
+- [Installation Guide](../kubernetes/installation-guide.md) - Deploy Dynamo on Kubernetes
+- [Operator Guide](../kubernetes/dynamo-operator.md) - Using the Dynamo Operator
+- [Autoscaling](../kubernetes/autoscaling.md) - Automatic scaling configuration
+
+### Architecture
+- [System Architecture](../design-docs/architecture.md) - Overall system design
+- [Disaggregated Serving](../design-docs/disagg-serving.md) - P/D separation architecture
+- [Distributed Runtime](../design-docs/distributed-runtime.md) - Runtime internals
+
+### Performance & Tuning
+- [Performance Tuning](../performance/tuning.md) - Optimize your deployment
+- [Benchmarking](../benchmarks/benchmarking.md) - Measure and compare performance
+- [AI Configurator](../performance/aiconfigurator.md) - Automated configuration
+
+## Getting Help
+
+- **GitHub Issues**: [Report bugs or request features](https://github.com/ai-dynamo/dynamo/issues)
+- **Discussions**: [Ask questions and share ideas](https://github.com/ai-dynamo/dynamo/discussions)
+- **Reference**: [CLI Reference](../reference/cli.md) | [Glossary](../reference/glossary.md) | [Support Matrix](./support-matrix.md)
--- a/fern/pages/getting-started/support-matrix.md
+++ b/fern/pages/getting-started/support-matrix.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Dynamo Support Matrix"
+---
+
+This document provides the support matrix for Dynamo, including hardware, software and build instructions.
+
+## Hardware Compatibility
+
+| **CPU Architecture** | **Status**   |
+| :------------------- | :----------- |
+| **x86_64**           | Supported    |
+| **ARM64**            | Supported    |
+
+
+### GPU Compatibility
+
+If you are using a **GPU**, the following GPU models and architectures are supported:
+
+| **GPU Architecture**                 | **Status** |
+| :----------------------------------- | :--------- |
+| **NVIDIA Blackwell Architecture**    | Supported  |
+| **NVIDIA Hopper Architecture**       | Supported  |
+| **NVIDIA Ada Lovelace Architecture** | Supported  |
+| **NVIDIA Ampere Architecture**       | Supported  |
+
+## Platform Architecture Compatibility
+
+**Dynamo** is compatible with the following platforms:
+
+| **Operating System** | **Version** | **Architecture** | **Status**   |
+| :------------------- | :---------- | :--------------- | :----------- |
+| **Ubuntu**           | 22.04       | x86_64           | Supported    |
+| **Ubuntu**           | 24.04       | x86_64           | Supported    |
+| **Ubuntu**           | 24.04       | ARM64            | Supported    |
+| **CentOS Stream**    | 9           | x86_64           | Experimental |
+
+<Note>
+Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04).
+Compatibility with other Linux distributions is expected but has not been officially verified yet.
+</Note>
+
+<Error>
+KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
+</Error>
+
+## Software Compatibility
+
+### Runtime Dependency
+
+| **Python Package** | **Version** | glibc version                         | CUDA Version |
+| :----------------- | :---------- | :------------------------------------ | :----------- |
+| ai-dynamo          | 0.8.0       | >=2.28                                |              |
+| ai-dynamo-runtime  | 0.8.0       | >=2.28 (Python 3.12 has known issues) |              |
+| NIXL               | 0.8.0       | >=2.27                                | >=11.8       |
+
+### Build Dependency
+
+The following table shows the dependency versions included with each Dynamo release:
+
+| **Dependency** | **main (ToT)** | **v0.8.0 (unreleased)** | **v0.7.1** | **v0.7.0.post1** | **v0.7.0** |
+| :------------- | :------------- | :---------------------- | :--------- | :--------------- | :--------- |
+| SGLang         | 0.5.7          | 0.5.7                   | 0.5.3.post4| 0.5.3.post4      | 0.5.3.post4|
+| TensorRT-LLM   | 1.2.0rc6       | 1.2.0rc6                | 1.2.0rc3   | 1.2.0rc3         | 1.2.0rc2   |
+| vLLM           | 0.13.0         | 0.12.0                  | 0.11.0     | 0.11.0           | 0.11.0     |
+| NIXL           | 0.8.0          | 0.8.0                   | 0.8.0      | 0.8.0            | 0.8.0      |
+
+<Note>
+**main (ToT)** reflects the current development branch. **v0.8.0** is the upcoming release (planned for January 14, 2025) and not yet available.
+</Note>
+
+
+<Warning>
+Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently TensorRT-LLM does not support Python 3.11, so installing `ai-dynamo[trtllm]` on Python 3.11 will fail.
+</Warning>
+
+### CUDA Support by Framework
+| **Dynamo Version**   | **SGLang**              | **TensorRT-LLM**        | **vLLM**                |
+| :------------------- | :-----------------------| :-----------------------| :-----------------------|
+| **Dynamo 0.7.1**     | CUDA 12.8               | CUDA 13.0               | CUDA 12.9               |
+
+## Cloud Service Provider Compatibility
+
+### AWS
+
+| **Host Operating System** | **Version** | **Architecture** | **Status** |
+| :------------------------ | :---------- | :--------------- | :--------- |
+| **Amazon Linux**          | 2023        | x86_64           | Supported¹ |
+
+<Error>
+There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
+</Error>
+
+## Build Support
+
+**Dynamo** currently provides build support in the following ways:
+
+- **Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
+  - [ai-dynamo](https://pypi.org/project/ai-dynamo/)
+  - [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/)
+  - **New as of Dynamo v0.7.0:** [kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
+
+- **Dynamo Runtime Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Runtime for each of the LLM inference frameworks on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
+  - [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime)
+  - [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime)
+  - [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)
+
+- **Dynamo Kubernetes Operator Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Operator on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
+  - [kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
+
+- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo:
+  - [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds)
+  - [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform)
+  - [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph)
+
+- **Rust Crates**:
+  - [dynamo-runtime](https://crates.io/crates/dynamo-runtime/)
+  - [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai/)
+  - [dynamo-parsers](https://crates.io/crates/dynamo-parsers/)
+  - [dynamo-llm](https://crates.io/crates/dynamo-llm/)
+
+Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).
--- a/fern/pages/guides/jail-stream-readme.md
+++ b/fern/pages/guides/jail-stream-readme.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "JailedStream Implementation"
+---
+
+## Overview
+
+The `JailedStream` is a standalone implementation for handling "jail" detection in token streams. It provides a clean, builder-based API for accumulating tokens when certain sequences are detected, then releasing them as a single chunk when the jail ends.
+
+## Key Features
+
+- **Builder Pattern**: Clean configuration API using the builder pattern
+- **Configurable Sequences**: Support for multiple start/end jail sequences
+- **Tool Call Parsing**: Integrated tool call detection and parsing
+- **Stream Macro**: Uses `async-stream::stream!` for clean async implementation
+- **Standalone**: Completely independent of existing code
+- **Annotations**: Preserves annotations for observability
+
+## Implementation
+
+### Location
+- Main implementation: `lib/llm/src/protocols/openai/chat_completions/jail.rs`
+- Examples: `lib/llm/src/protocols/openai/chat_completions/jail_example.rs`
+
+### Usage
+
+```rust
+use crate::protocols::openai::chat_completions::jail::JailedStream;
+use dynamo_runtime::engine::{AsyncEngineContextProvider, ResponseStream};
+
+// Get your ResponseStream with context
+let response_stream: Pin<Box<ResponseStream<_>>> = get_stream_from_engine();
+
+// Extract context BEFORE passing to apply
+let context = response_stream.context();
+
+// Apply jail transformation (ResponseStream implements Stream)
+let jail = JailedStream::builder()
+    .tool_call_parser("nemotron_deci")
+    .build();
+
+let jailed_stream = jail.apply(response_stream);
+
+// Re-wrap with context when needed for engine consumption
+let final_stream = ResponseStream::new(Box::pin(jailed_stream), context);
+```
+
+### Advanced Configuration
+
+```rust
+// With custom jail sequences
+let jail = JailedStream::builder()
+    .jail_start_sequence("<TOOLCALL>")
+    .jail_end_sequence("</TOOLCALL>")
+    .tool_call_parser("nemotron_deci")
+    .build();
+
+// With multiple sequences
+let jail = JailedStream::builder()
+    .jail_start_sequences(vec!["<TOOLCALL>", "<FUNCTION>"])
+    .jail_end_sequences(vec!["</TOOLCALL>", "</FUNCTION>"])
+    .tool_call_parser("harmony")
+    .build();
+```
+
+## How It Works
+
+1. **Detection**: When a jail start sequence (or tool call start) is detected, the stream enters "jail" mode
+2. **Accumulation**: While jailed, tokens are accumulated in memory instead of being yielded
+3. **Annotations**: Empty chunks with annotations are sent downstream for observability
+4. **Release**: When a jail end sequence is detected OR the stream ends:
+   - Accumulated content is parsed for tool calls
+   - A single chunk with the parsed content is yielded
+5. **Pass-through**: Non-jailed content passes through unchanged
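+
+The steps above can be illustrated with a small Python generator. This is a deliberately simplified sketch, not the Rust implementation: the function name `jailed` is hypothetical, chunks are plain strings, tool call parsing and annotations are left out, and a start or end sequence is only detected when it appears whole inside the accumulated text (the chunk that contains the start sequence is buffered in full).
+
+```python
+def jailed(chunks, start="<TOOLCALL>", end="</TOOLCALL>"):
+    """Yield chunks unchanged until `start` is seen, then buffer until `end`."""
+    buffer = None  # None means "not jailed"
+    for chunk in chunks:
+        if buffer is None:
+            if start in chunk:
+                buffer = chunk   # detection: enter jail and begin accumulating
+            else:
+                yield chunk      # pass-through: non-jailed content is unchanged
+                continue
+        else:
+            buffer += chunk      # accumulation: hold tokens instead of yielding
+        if end in buffer:
+            yield buffer         # release: emit everything as a single chunk
+            buffer = None
+    if buffer is not None:
+        yield buffer             # stream ended while jailed: release the buffer
+
+
+stream = ["Hi ", "<TOOLCALL>{", '"a":1}', "</TOOLCALL>", " bye"]
+result = list(jailed(stream))
+print(result)
+```
+
+With this input, the four pieces between and including the start and end sequences come out as one chunk, while `"Hi "` and `" bye"` pass through untouched.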
+
+## Testing
+
+The implementation includes comprehensive tests:
+
+- `test_jailed_stream_with_start_end_sequences`: Tests explicit jail sequences
+- `test_jailed_stream_with_tool_calls`: Tests tool call detection and parsing
+- `test_jailed_stream_no_jailing`: Tests normal pass-through behavior
+
+Run tests with:
+```bash
+cargo test -p dynamo-llm jail --lib
+```
+
+## Benefits
+
+1. **Standalone**: No modifications to existing code required
+2. **Clean API**: Builder pattern makes configuration intuitive
+3. **Flexible**: Supports multiple jail detection strategies
+4. **Maintainable**: Uses `stream!` macro for cleaner async code
+5. **Testable**: Comprehensive test suite with shared utilities
+6. **Efficient**: No unnecessary boxing or context handling in the library
+7. **Composable**: Can chain multiple stream transformers before re-adding context
+
+## Performance Optimizations
+
+- **No Boxing in Library**: Returns `impl Stream` instead of `Pin<Box<ResponseStream>>`
+- **Stack Pinning**: Uses `tokio::pin!()` instead of `Box::pin()` for better performance
+- **No Context Overhead**: JailedStream doesn't manage AsyncEngineContext
+- **Lazy Evaluation**: Only processes what's needed
+- **Efficient State Management**: Minimal cloning, only when entering jail state
+
+## Integration Options
+
+To replace the existing `apply_tool_calling_jail_internal` function:
+
+```rust
+// In preprocessor.rs
+pub fn apply_tool_calling_jail_with_parser(
+    &self,
+    stream: ManyOut<Annotated<NvCreateChatCompletionStreamResponse>>,
+) -> ManyOut<Annotated<NvCreateChatCompletionStreamResponse>> {
+    let jail = JailedStream::builder()
+        .tool_call_parser(self.tool_call_parser.clone())
+        .build();
+
+    jail.apply(stream)
+}
+```
+
+## Future Enhancements
+
+- Add support for regex patterns for jail sequences
+- Add metrics/telemetry for jail detection
+- Support for partial sequence matching across chunk boundaries
+- Configurable accumulation limits
+- Support for nested jails
\ No newline at end of file
--- a/fern/pages/guides/request-plane.md
+++ b/fern/pages/guides/request-plane.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Dynamo Request Planes User Guide"
+---
+
+## Overview
+
+Dynamo supports multiple transport mechanisms for its request plane (the communication layer between services). You can choose from three different request plane modes based on your deployment requirements:
+
+- **TCP** (default): Direct TCP connection for optimal performance
+- **NATS**: Message broker-based request plane
+- **HTTP**: HTTP/2-based request plane
+
+This guide explains how to configure and use the request plane in your Dynamo deployment.
+
+## What is a Request Plane?
+
+The request plane is the transport layer that handles communication between Dynamo services (e.g., frontend to backend, worker to worker). Different request planes offer different trade-offs:
+
+| Request Plane | Suitable For | Characteristics |
+|--------------|----------|-----------------|
+| **NATS** | Production deployments with KV routing | Requires NATS infrastructure, provides pub/sub patterns, highest flexibility |
+| **TCP** | Low-latency direct communication | Direct connections, minimal overhead |
+| **HTTP** | Standard deployments, debugging | HTTP/2 protocol, easier observability with standard tools, widely compatible |
+
+## Request Plane vs KV Event Plane
+
+Dynamo has **two independent communication planes**:
+
+- **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`.
+- **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing.
+
+**Note:** If you use the `tcp` or `http` request plane and choose NATS for KV events, you must still configure the NATS server using the `NATS_SERVER` environment variable, e.g. `NATS_SERVER=nats://nats-hostname:port`.
+
+Because they are independent, you can mix them.
+
+For example, a deployment with TCP request plane can use different KV event planes:
+- **JetStream KV events**: requests use TCP, KV routing still uses NATS JetStream + object store for persistence.
+- **NATS Core KV events (local indexer)**: requests use TCP, KV events use NATS Core pub/sub and persistence lives on workers.
+- **no KV events**: requests use TCP and KV routing runs without events (no NATS required, but no event-backed persistence).
+
+## Configuration
+
+### Environment Variable
+
+Set the request plane mode using the `DYN_REQUEST_PLANE` environment variable:
+
+```bash
+export DYN_REQUEST_PLANE=<mode>
+```
+
+Where `<mode>` is one of:
+- `tcp` (default)
+- `nats`
+- `http`
+
+The value is case-insensitive.
+
+### Default Behavior
+
+If `DYN_REQUEST_PLANE` is not set or contains an invalid value, Dynamo defaults to `tcp`.
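+
+The selection rule (case-insensitive value, fallback to `tcp` when unset or invalid) can be expressed in a few lines of Python. This is a sketch of the documented behavior only; `resolve_request_plane` and `VALID_MODES` are hypothetical names and do not refer to Dynamo's actual implementation.
+
+```python
+import os
+
+VALID_MODES = ("tcp", "nats", "http")
+
+
+def resolve_request_plane(env=None):
+    """Return the request plane mode, falling back to 'tcp'."""
+    env = os.environ if env is None else env
+    # The value is case-insensitive, so normalize before comparing.
+    mode = env.get("DYN_REQUEST_PLANE", "").lower()
+    # Unset or unrecognized values fall back to the default.
+    return mode if mode in VALID_MODES else "tcp"
+
+
+print(resolve_request_plane({"DYN_REQUEST_PLANE": "HTTP"}))
+print(resolve_request_plane({"DYN_REQUEST_PLANE": "grpc"}))
+print(resolve_request_plane({}))
+```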
+
+## Usage Examples
+
+### Using TCP (Default)
+
+TCP is the default request plane and provides direct, low-latency communication between services.
+
+**Configuration:**
+
+```bash
+# TCP is the default, so no need to set DYN_REQUEST_PLANE explicitly
+# But you can explicitly set it if desired:
+export DYN_REQUEST_PLANE=tcp
+
+# Optional: Configure TCP server host and port
+export DYN_TCP_RPC_HOST=0.0.0.0  # Default host
+# export DYN_TCP_RPC_PORT=9999   # Optional: specify a fixed port
+
+# Run your Dynamo service
+DYN_REQUEST_PLANE=tcp python -m dynamo.frontend --http-port=8000 &
+DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+```
+
+**Note:** By default, TCP uses an OS-assigned free port (port 0). This is ideal for environments where multiple services may run on the same machine or when you want to avoid port conflicts. If you need a specific port (e.g., for firewall rules), set `DYN_TCP_RPC_PORT` explicitly.
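+
+To see what "OS-assigned free port" means in practice, the following Python snippet binds two listeners to port 0 and prints the ports the operating system picked. It illustrates the general socket behavior only, using the standard library; `bind_free_port` is a hypothetical helper and this is not how Dynamo itself is implemented.
+
+```python
+import socket
+
+
+def bind_free_port(host="127.0.0.1"):
+    """Bind a TCP listener to port 0 and return (socket, assigned_port)."""
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    sock.bind((host, 0))  # port 0 asks the OS to choose a free port
+    sock.listen()
+    return sock, sock.getsockname()[1]
+
+
+first, port_a = bind_free_port()
+second, port_b = bind_free_port()
+print(port_a, port_b)  # two different ports, both chosen by the OS
+first.close()
+second.close()
+```
+
+Because both listeners are open at the same time, the OS hands out two distinct ports, which is why this default avoids conflicts when several services share a machine.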
+
+**When to use TCP:**
+- Simple deployments with direct service-to-service communication (e.g. frontend to backend)
+- Minimal infrastructure requirements (**no NATS needed unless you enable KV-event-backed routing/replica sync**)
+- Low-latency requirements
+
+**TCP Configuration Options:**
+
+Additional TCP-specific environment variables:
+- `DYN_TCP_RPC_HOST`: Server host address (default: auto-detected)
+- `DYN_TCP_RPC_PORT`: Server port. If not set, the OS assigns a free port automatically (recommended for most deployments). Set explicitly only if you need a specific port for firewall rules.
+- `DYN_TCP_MAX_MESSAGE_SIZE`: Maximum message size for TCP client (default: 32MB)
+- `DYN_TCP_REQUEST_TIMEOUT`: Request timeout for TCP client (default: 10 seconds)
+- `DYN_TCP_POOL_SIZE`: Connection pool size for TCP client (default: 50)
+- `DYN_TCP_CONNECT_TIMEOUT`: Connect timeout for TCP client (default: 3 seconds)
+- `DYN_TCP_CHANNEL_BUFFER`: Request channel buffer size for TCP client (default: 100)
+
+### Using HTTP
+
+HTTP/2 provides a standards-based request plane that's easy to debug and widely compatible.
+
+**Configuration:**
+
+```bash
+# Optional: Configure HTTP server host and port
+export DYN_HTTP_RPC_HOST=0.0.0.0      # Default host
+export DYN_HTTP_RPC_PORT=8888         # Default port
+export DYN_HTTP_RPC_ROOT_PATH=/v1/rpc # Default path
+
+# Run your Dynamo service
+DYN_REQUEST_PLANE=http python -m dynamo.frontend --http-port=8000 &
+DYN_REQUEST_PLANE=http python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+```
+
+**When to use HTTP:**
+- Standard deployments requiring HTTP compatibility
+- Debugging scenarios (use curl, browser tools, etc.)
+- Integration with HTTP-based infrastructure
+- Load balancers and proxies that work with HTTP
+
+**HTTP Configuration Options:**
+
+Additional HTTP-specific environment variables:
+- `DYN_HTTP_RPC_HOST`: Server host address (default: auto-detected)
+- `DYN_HTTP_RPC_PORT`: Server port (default: 8888)
+- `DYN_HTTP_RPC_ROOT_PATH`: Root path for RPC endpoints (default: /v1/rpc)
+
+`DYN_HTTP2_*`: Various HTTP/2 client configuration options
+- `DYN_HTTP2_MAX_FRAME_SIZE`: Maximum frame size for HTTP client (default: 1MB)
+- `DYN_HTTP2_MAX_CONCURRENT_STREAMS`: Maximum concurrent streams for HTTP client (default: 1000)
+- `DYN_HTTP2_POOL_MAX_IDLE_PER_HOST`: Maximum idle connections per host for HTTP client (default: 100)
+- `DYN_HTTP2_POOL_IDLE_TIMEOUT_SECS`: Idle timeout for HTTP client (default: 90 seconds)
+- `DYN_HTTP2_KEEP_ALIVE_INTERVAL_SECS`: Keep-alive interval for HTTP client (default: 30 seconds)
+- `DYN_HTTP2_KEEP_ALIVE_TIMEOUT_SECS`: Keep-alive timeout for HTTP client (default: 10 seconds)
+- `DYN_HTTP2_ADAPTIVE_WINDOW`: Enable adaptive flow control (default: true)
+
+### Using NATS
+
+NATS provides durable JetStream messaging for the request plane and can also carry KV events (and router replica sync).
+
+**Prerequisites:**
+- NATS server must be running and accessible
+- Configure NATS connection via standard Dynamo NATS environment variables
+
+```bash
+# Explicitly set to NATS
+export DYN_REQUEST_PLANE=nats
+
+# Run your Dynamo service
+DYN_REQUEST_PLANE=nats python -m dynamo.frontend --http-port=8000 &
+DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
+```
+
+**When to use NATS:**
+- Production deployments with service discovery
+- KV-based routing currently requires NATS. If you disable NATS completely, KV-based routing is not available.
+- Need for message replay and persistence features
+
+**Limitations:**
+- NATS does not support payloads beyond 16MB (use TCP for larger payloads)
+
+## Complete Example
+
+See [`examples/backends/vllm/launch/agg_request_planes.sh`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/agg_request_planes.sh) for a complete working example that launches a Dynamo deployment with the TCP, HTTP, or NATS request plane.
+
+## Real-World Example
+
+The Dynamo repository includes a complete example demonstrating all three request planes:
+
+**Location:** `examples/backends/vllm/launch/agg_request_planes.sh`
+
+```bash
+cd examples/backends/vllm/launch
+
+# Run with TCP
+./agg_request_planes.sh --tcp
+
+# Run with HTTP
+./agg_request_planes.sh --http
+
+# Run with NATS
+./agg_request_planes.sh --nats
+```
+
+## Architecture Details
+
+### Network Manager
+
+The request plane implementation is centralized in the Network Manager (`lib/runtime/src/pipeline/network/manager.rs`), which:
+
+1. Reads the `DYN_REQUEST_PLANE` environment variable at startup
+2. Creates the appropriate server and client implementations
+3. Provides a transport-agnostic interface to the rest of the codebase
+4. Manages all network configuration and lifecycle
+
+### Transport Abstraction
+
+All request plane implementations conform to common trait interfaces:
+- `RequestPlaneServer`: Server-side interface for receiving requests
+- `RequestPlaneClient`: Client-side interface for sending requests
+
+This abstraction means your application code doesn't need to change when switching request planes.
+
+### Configuration Loading
+
+Request plane configuration is loaded from environment variables at startup and cached globally. The configuration hierarchy is:
+
+1. **Mode Selection**: `DYN_REQUEST_PLANE` (defaults to `tcp`)
+2. **Transport-Specific Config**: Mode-specific environment variables (e.g., `DYN_TCP_*`, `DYN_HTTP2_*`)
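+
+In a Kubernetes deployment, the same variables could plausibly be supplied through each service's `envs` list instead of shell exports. The fragment below is a sketch under that assumption (this guide does not show it). The service names are illustrative, and both services carry the same `DYN_REQUEST_PLANE` value because all services must agree on the request plane.
+
+```yaml
+# Sketch only: assumes the runtime reads these variables from the container environment.
+spec:
+  services:
+    Frontend:
+      envs:
+        - name: DYN_REQUEST_PLANE    # mode selection
+          value: http
+    VllmDecodeWorker:
+      envs:
+        - name: DYN_REQUEST_PLANE    # must match the Frontend
+          value: http
+        - name: DYN_HTTP_RPC_PORT    # transport-specific setting (optional)
+          value: "8888"
+```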
+
+## Migration Guide
+
+### From NATS to TCP
+
+1. Stop your Dynamo services
+2. Set environment variable `DYN_REQUEST_PLANE=tcp`
+3. Optionally configure TCP-specific settings (e.g., `DYN_TCP_RPC_HOST`). Note: `DYN_TCP_RPC_PORT` is optional; if not set, an OS-assigned free port is used automatically.
+4. Restart your services
+
+
+### From NATS to HTTP
+
+1. Stop your Dynamo services
+2. Set environment variable `DYN_REQUEST_PLANE=http`
+3. Optionally configure HTTP-specific settings (`DYN_HTTP_RPC_PORT`, etc.)
+4. Restart your services
+
+### Testing the Migration
+
+After switching request planes, verify your deployment:
+
+```bash
+# Test with a simple request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+```
+
+## Troubleshooting
+
+### Issue: Services Can't Communicate
+
+**Symptoms:** Requests time out or fail to reach the backend
+
+**Solutions:**
+- Verify all services use the same `DYN_REQUEST_PLANE` setting
+- Check that server ports are not blocked by Kubernetes network policies or firewalls
+- For TCP/HTTP: Ensure host/port configurations are correct and accessible
+- For NATS: Verify NATS server is running and accessible
+
+### Issue: "Invalid request plane mode" Error
+
+**Symptoms:** Service fails to start with configuration error
+
+**Solutions:**
+- Check `DYN_REQUEST_PLANE` spelling (valid values: `nats`, `tcp`, `http`)
+- Value is case-insensitive but must be one of the three options
+- If not set, defaults to `tcp`
+
+### Issue: Port Conflicts
+
+**Symptoms:** Server fails to start due to "address already in use"
+
+**Solutions:**
+- TCP: By default, TCP uses an OS-assigned free port, so port conflicts should be rare. If you explicitly set `DYN_TCP_RPC_PORT` to a specific port and get conflicts, either change the port or remove the setting to use automatic port assignment.
+- HTTP default port: 8888 (adjust environment variable `DYN_HTTP_RPC_PORT`)
+
+## Performance Considerations
+
+### Latency
+
+- **TCP**: Lowest latency due to direct connections and binary serialization
+- **HTTP**: Moderate latency with HTTP/2 overhead
+- **NATS**: Moderate latency due to NATS JetStream persistence
+
+
+### Resource Usage
+
+- **TCP**: Minimal infrastructure (no additional services required)
+- **HTTP**: Minimal infrastructure (no additional services required)
+- **NATS**: Requires running NATS server (additional memory/CPU)
--- a/fern/pages/kubernetes/README.md
+++ b/fern/pages/kubernetes/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Deploying Dynamo on Kubernetes"
+---
+
+[Link to installation](../getting-started/installation.md)
+
+High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
+
+## Important Terminology
+
+**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created.
+- Used for: Resource isolation, RBAC, organizing deployments
+- Example: `dynamo-system`, `team-a-namespace`
+
+**Dynamo Namespace**: The logical namespace used by Dynamo components for [service discovery](service-discovery.md).
+- Used for: Runtime component communication, service discovery
+- Specified in: `.spec.services.<ServiceName>.dynamoNamespace` field
+- Example: `my-llm`, `production-model`, `dynamo-dev`
+
+These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
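+
+As an illustration, the fragment below shows where each namespace appears in a `DynamoGraphDeployment` manifest. It is a sketch that reuses the example names from this section; `metadata.namespace` is the standard Kubernetes field, and the rest of the spec is omitted.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-llm
+  namespace: team-a-namespace             # Kubernetes namespace: isolation, RBAC
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: production-model   # Dynamo namespace: service discovery
+```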
+
+## Prerequisites
+
+Before you begin, ensure you have the following tools installed:
+
+| Tool | Minimum Version | Installation Guide |
+|------|-----------------|-------------------|
+| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
+| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
+
+Verify your installation:
+```bash
+kubectl version --client  # Should show v1.24+
+helm version              # Should show v3.0+
+```
+
+For detailed installation instructions, see the [Prerequisites section](installation-guide.md#prerequisites) in the Installation Guide.
+
+## Pre-deployment Checks
+
+Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
+
+```bash
+./deploy/pre-deployment/pre-deployment-check.sh
+```
+
+This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for more details.
+
+## 1. Install Platform First
+
+```bash
+# 1. Set environment
+export NAMESPACE=dynamo-system
+export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+
+# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
+helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
+
+# 3. Install Platform
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
+```
+
+**For Shared/Multi-Tenant Clusters:**
+
+If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
+```bash
+--set dynamo-operator.namespaceRestriction.enabled=true
+```
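+
+If you prefer a values file over `--set`, the same setting can be written as nested YAML and passed to `helm install` with `-f`. The file name `values-restricted.yaml` is only an example.
+
+```yaml
+# values-restricted.yaml (hypothetical file name)
+# Equivalent to: --set dynamo-operator.namespaceRestriction.enabled=true
+dynamo-operator:
+  namespaceRestriction:
+    enabled: true
+```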
+
+For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](installation-guide.md)**.
+
+## 2. Choose Your Backend
+
+Each backend has deployment examples and configuration options:
+
+| Backend      | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
+|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
+| **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)**       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
+| **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)**           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+
+## 3. Deploy Your First Model
+
+```bash
+export NAMESPACE=dynamo-system
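+# Skip this command if the namespace already exists (for example, created by the platform install)
+kubectl create namespace ${NAMESPACE}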
+
+# Create a secret with your Hugging Face token so the model can be pulled
+export HF_TOKEN=<Token-Here>
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="$HF_TOKEN" \
+  -n ${NAMESPACE}
+
+# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
+kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
+
+# Check status
+kubectl get dynamoGraphDeployment -n ${NAMESPACE}
+
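+# Test it (port-forward blocks, so run it in the background or in a separate terminal)
+kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE} &
+sleep 2  # give the port-forward a moment to start
+curl http://localhost:8000/v1/models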
+```
+
+For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla-planner-quickstart.md).
+
+## Understanding Dynamo's Custom Resources
+
+Dynamo provides two main Kubernetes Custom Resources for deploying models:
+
+### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
+
+The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
+- Model name and backend framework
+- SLA targets (latency requirements)
+- GPU type (optional)
+
+Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
+- SLA-driven configuration generation
+- Automated resource optimization
+- Users who want simplicity over control
+
+**Note**: DGDR generates a DGD spec which you can then use to deploy.
+
+### DynamoGraphDeployment (DGD) - Direct Configuration
+
+A lower-level interface that defines your complete inference pipeline:
+- Model configuration
+- Resource allocation (GPUs, memory)
+- Scaling policies
+- Frontend/backend connections
+
+Use this when you need fine-grained control or have already completed profiling.
+
+Refer to the [API Reference and Documentation](api-reference.md) for more details.
+
+## 📖 API Reference & Documentation
+
+For detailed technical specifications of Dynamo's Kubernetes resources:
+
+- **[API Reference](api-reference.md)** - Complete CRD field specifications for all Dynamo resources
+- **[Create Deployment](deployment/create-deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
+- **[Operator Guide](dynamo-operator.md)** - Dynamo operator configuration and management
+
+### Choosing Your Architecture Pattern
+
+When creating a deployment, select the architecture pattern that best fits your use case:
+
+- **Development / Testing** - Use `agg.yaml` as the base configuration
+- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
+- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
+
+### Frontend and Worker Components
+
+You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
+
+- Provides OpenAI-compatible `/v1/chat/completions` endpoint
+- Auto-discovers backend workers via [service discovery](service-discovery.md) (Kubernetes-native by default)
+- Routes requests and handles load balancing
+- Validates and preprocesses requests
+
+### Customizing Your Deployment
+
+Example structure:
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-llm
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-llm
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
+      dynamoNamespace: my-llm  # same Dynamo namespace as the Frontend so the components can discover each other
+      componentType: worker
+      replicas: 1
+      envFromSecret: hf-token-secret  # for HuggingFace models
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+          command: ["/bin/sh", "-c"]
+          args:
+            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
+```
+
+Worker command examples per backend:
+```yaml
+# vLLM worker
+args:
+  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
+
+# SGLang worker
+args:
+  - >-
+    python3 -m dynamo.sglang
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    --tp 1
+    --trust-remote-code
+
+# TensorRT-LLM worker
+args:
+  - >-
+    python3 -m dynamo.trtllm
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
+```
+
+Key customization points include:
+- **Model Configuration**: Specify the model in the worker's `args`
+- **Resource Allocation**: Configure GPU requirements under `resources.limits`
+- **Scaling**: Set `replicas` for number of worker instances
+- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
+- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
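+
+Three of these points can be sketched as a fragment of the spec. This is an illustration, not a complete manifest: it assumes the standard Kubernetes `name`/`value` form for entries in `envs`, and the replica count is arbitrary.
+
+```yaml
+spec:
+  services:
+    Frontend:
+      envs:
+        - name: DYN_ROUTER_MODE   # enable KV-cache routing
+          value: kv
+    VllmDecodeWorker:
+      replicas: 2                 # number of worker instances
+      resources:
+        limits:
+          gpu: "1"                # GPUs per worker
+```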
+
+## Additional Resources
+
+- **[Examples](../getting-started/examples.md)** - Complete working examples
+- **[Create Custom Deployments](deployment/create-deployment.md)** - Build your own CRDs
+- **[Managing Models with DynamoModel](deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models
+- **[Operator Documentation](dynamo-operator.md)** - How the platform works
+- **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
+- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
+- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
+- **[Logging](observability/logging.md)** - For logging setup
+- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
+- **[Grove](grove.md)** - For grove details and custom installation
+- **[Monitoring](observability/metrics.md)** - For monitoring setup
+- **[Model Caching with Fluid](model-caching-with-fluid.md)** - For model caching with Fluid
--- a/fern/pages/kubernetes/api-reference.md
+++ b/fern/pages/kubernetes/api-reference.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "API Reference"
+---
+
+<Info>
+This documentation is automatically generated from source code.
+Do not edit this file directly.
+</Info>
+
+## Packages
+- [nvidia.com/v1alpha1](#nvidiacomv1alpha1)
+
+
+## nvidia.com/v1alpha1
+
+Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
+
+This package defines the DynamoGraphDeploymentRequest (DGDR) custom resource, which provides
+a high-level, SLA-driven interface for deploying machine learning models on Dynamo.
+
+### Resource Types
+- [DynamoComponentDeployment](#dynamocomponentdeployment)
+- [DynamoGraphDeployment](#dynamographdeployment)
+- [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest)
+- [DynamoGraphDeploymentScalingAdapter](#dynamographdeploymentscalingadapter)
+- [DynamoModel](#dynamomodel)
+
+
+
+#### Autoscaling
+
+
+
+Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter
+with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md
+for migration guidance. This field will be removed in a future API version.
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `enabled` _boolean_ | Deprecated: This field is ignored. |  |  |
+| `minReplicas` _integer_ | Deprecated: This field is ignored. |  |  |
+| `maxReplicas` _integer_ | Deprecated: This field is ignored. |  |  |
+| `behavior` _[HorizontalPodAutoscalerBehavior](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#horizontalpodautoscalerbehavior-v2-autoscaling)_ | Deprecated: This field is ignored. |  |  |
+| `metrics` _[MetricSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#metricspec-v2-autoscaling) array_ | Deprecated: This field is ignored. |  |  |
+
+
+
+
+#### ComponentKind
+
+_Underlying type:_ _string_
+
+ComponentKind represents the type of underlying Kubernetes resource.
+
+_Validation:_
+- Enum: [PodClique PodCliqueScalingGroup Deployment LeaderWorkerSet]
+
+_Appears in:_
+- [ServiceReplicaStatus](#servicereplicastatus)
+
+| Field | Description |
+| --- | --- |
+| `PodClique` | ComponentKindPodClique represents a PodClique resource.<br /> |
+| `PodCliqueScalingGroup` | ComponentKindPodCliqueScalingGroup represents a PodCliqueScalingGroup resource.<br /> |
+| `Deployment` | ComponentKindDeployment represents a Deployment resource.<br /> |
+| `LeaderWorkerSet` | ComponentKindLeaderWorkerSet represents a LeaderWorkerSet resource.<br /> |
+
+
+#### ConfigMapKeySelector
+
+
+
+ConfigMapKeySelector selects a specific key from a ConfigMap.
+Used to reference external configuration data stored in ConfigMaps.
+
+
+
+_Appears in:_
+- [ProfilingConfigSpec](#profilingconfigspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `name` _string_ | Name of the ConfigMap containing the desired data. |  | Required: \{\} <br /> |
+| `key` _string_ | Key in the ConfigMap to select. If not specified, defaults to "disagg.yaml". | disagg.yaml |  |
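+
+For illustration, a selector that points at the default key might look like the fragment below. The ConfigMap name is hypothetical, and the parent field that holds the selector is not shown in this section.
+
+```yaml
+name: my-profiling-config   # hypothetical ConfigMap name (required)
+key: disagg.yaml            # optional; this is also the default
+```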
+
+
+#### DeploymentOverridesSpec
+
+
+
+DeploymentOverridesSpec allows users to customize metadata for auto-created DynamoGraphDeployments.
+When autoApply is enabled, these overrides are applied to the generated DGD resource.
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentRequestSpec](#dynamographdeploymentrequestspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `name` _string_ | Name is the desired name for the created DynamoGraphDeployment.<br />If not specified, defaults to the DGDR name. |  | Optional: \{\} <br /> |
+| `namespace` _string_ | Namespace is the desired namespace for the created DynamoGraphDeployment.<br />If not specified, defaults to the DGDR namespace. |  | Optional: \{\} <br /> |
+| `labels` _object (keys:string, values:string)_ | Labels are additional labels to add to the DynamoGraphDeployment metadata.<br />These are merged with auto-generated labels from the profiling process. |  | Optional: \{\} <br /> |
+| `annotations` _object (keys:string, values:string)_ | Annotations are additional annotations to add to the DynamoGraphDeployment metadata. |  | Optional: \{\} <br /> |
+| `workersImage` _string_ | WorkersImage specifies the container image to use for DynamoGraphDeployment worker components.<br />This image is used for both temporary DGDs created during online profiling and the final DGD.<br />If omitted, the image from the base config file (e.g., disagg.yaml) is used.<br />Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" |  | Optional: \{\} <br /> |
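+
+As a sketch, an overrides object that sets every field in the table could look like this. All values are illustrative except `workersImage`, which repeats the example from the table; where the object nests inside the DGDR spec is not shown in this section.
+
+```yaml
+name: my-llm-dgd                 # defaults to the DGDR name if omitted
+namespace: team-a-namespace      # defaults to the DGDR namespace if omitted
+labels:
+  team: team-a                   # merged with auto-generated labels
+annotations:
+  example.com/owner: team-a
+workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+```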
+
+
+#### DeploymentStatus
+
+
+
+DeploymentStatus tracks the state of an auto-created DynamoGraphDeployment.
+This status is populated when autoApply is enabled and a DGD is created.
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentRequestStatus](#dynamographdeploymentrequeststatus)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `name` _string_ | Name is the name of the created DynamoGraphDeployment. |  |  |
+| `namespace` _string_ | Namespace is the namespace of the created DynamoGraphDeployment. |  |  |
+| `state` _string_ | State is the current state of the DynamoGraphDeployment.<br />This value is mirrored from the DGD's status.state field. |  |  |
+| `created` _boolean_ | Created indicates whether the DGD has been successfully created.<br />Used to prevent recreation if the DGD is manually deleted by users. |  |  |
+
+
+
+
+#### DynamoComponentDeployment
+
+
+
+DynamoComponentDeployment is the Schema for the dynamocomponentdeployments API
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
+| `kind` _string_ | `DynamoComponentDeployment` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. |  |  |
+| `spec` _[DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)_ | Spec defines the desired state for this Dynamo component deployment. |  |  |
+
+
+#### DynamoComponentDeploymentSharedSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+- [DynamoGraphDeploymentSpec](#dynamographdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). |  |  |
+| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. |  |  |
+| `serviceName` _string_ | The name of the component |  |  |
+| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). |  |  |
+| `subComponentType` _string_ | SubComponentType indicates the sub-role of this component (for example, "prefill"). |  |  |
+| `dynamoNamespace` _string_ | DynamoNamespace is deprecated and will be removed in a future version.<br />The DGD Kubernetes namespace and DynamoGraphDeployment name are used to construct the Dynamo namespace for each component |  | Optional: \{\} <br /> |
+| `globalDynamoNamespace` _boolean_ | GlobalDynamoNamespace indicates that the Component will be placed in the global Dynamo namespace |  |  |
+| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. |  |  |
+| `autoscaling` _[Autoscaling](#autoscaling)_ | Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter<br />with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md<br />for migration guidance. This field will be removed in a future API version. |  |  |
+| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. |  |  |
+| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. |  |  |
+| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. |  |  |
+| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). |  |  |
+| `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery |  |  |
+| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). |  |  |
+| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. |  |  |
+| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br />It is a standard Kubernetes PodSpec. It also contains a MainContainer (standard Kubernetes Container) field<br />that allows overriding the main container configuration. |  |  |
+| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. |  |  |
+| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. |  |  |
+| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. |  | Minimum: 0 <br /> |
+| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. |  |  |
+| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. |  |  |
+
+
+#### DynamoComponentDeploymentSpec
+
+
+
+DynamoComponentDeploymentSpec defines the desired state of DynamoComponentDeployment
+
+
+
+_Appears in:_
+- [DynamoComponentDeployment](#dynamocomponentdeployment)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm") |  | Enum: [sglang vllm trtllm] <br /> |
+| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). |  |  |
+| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. |  |  |
+| `serviceName` _string_ | The name of the component |  |  |
+| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). |  |  |
+| `subComponentType` _string_ | SubComponentType indicates the sub-role of this component (for example, "prefill"). |  |  |
+| `dynamoNamespace` _string_ | DynamoNamespace is deprecated and will be removed in a future version.<br />The DGD Kubernetes namespace and DynamoGraphDeployment name are used to construct the Dynamo namespace for each component |  | Optional: \{\} <br /> |
+| `globalDynamoNamespace` _boolean_ | GlobalDynamoNamespace indicates that the Component will be placed in the global Dynamo namespace |  |  |
+| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. |  |  |
+| `autoscaling` _[Autoscaling](#autoscaling)_ | Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter<br />with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md<br />for migration guidance. This field will be removed in a future API version. |  |  |
+| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. |  |  |
+| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. |  |  |
+| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. |  |  |
+| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). |  |  |
+| `modelRef` _[ModelReference](#modelreference)_ | ModelRef references a model that this component serves<br />When specified, a headless service will be created for endpoint discovery |  |  |
+| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). |  |  |
+| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. |  |  |
+| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br />It is a standard Kubernetes PodSpec. It also contains a MainContainer (standard Kubernetes Container) field<br />that allows overriding the main container configuration. |  |  |
+| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. |  |  |
+| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. |  |  |
+| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br />When scalingAdapter is enabled, this field is managed by the<br />DynamoGraphDeploymentScalingAdapter and should not be modified directly. |  | Minimum: 0 <br /> |
+| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. |  |  |
+| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br />When enabled, replicas are managed via DGDSA and external autoscalers can scale<br />the service using the Scale subresource. When disabled, replicas can be modified directly. |  |  |
+
+
+#### DynamoGraphDeployment
+
+
+
+DynamoGraphDeployment is the Schema for the dynamographdeployments API.
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
+| `kind` _string_ | `DynamoGraphDeployment` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. |  |  |
+| `spec` _[DynamoGraphDeploymentSpec](#dynamographdeploymentspec)_ | Spec defines the desired state for this graph deployment. |  |  |
+| `status` _[DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)_ | Status reflects the current observed state of this graph deployment. |  |  |
+
+
+#### DynamoGraphDeploymentRequest
+
+
+
+DynamoGraphDeploymentRequest is the Schema for the dynamographdeploymentrequests API.
+It serves as the primary interface for users to request model deployments with
+specific performance and resource constraints, enabling SLA-driven deployments.
+
+Lifecycle:
+ 1. Initial → Pending: Validates spec and prepares for profiling
+ 2. Pending → Profiling: Creates and runs profiling job (online or AIC)
+ 3. Profiling → Ready/Deploying: Generates DGD spec after profiling completes
+ 4. Deploying → Ready: When autoApply=true, monitors DGD until Ready
+ 5. Ready: Terminal state when DGD is operational or spec is available
+ 6. DeploymentDeleted: Terminal state when auto-created DGD is manually deleted
+
+The spec becomes immutable once profiling starts. Users must delete and recreate
+the DGDR to modify configuration after this point.
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
+| `kind` _string_ | `DynamoGraphDeploymentRequest` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. |  |  |
+| `spec` _[DynamoGraphDeploymentRequestSpec](#dynamographdeploymentrequestspec)_ | Spec defines the desired state for this deployment request. |  |  |
+| `status` _[DynamoGraphDeploymentRequestStatus](#dynamographdeploymentrequeststatus)_ | Status reflects the current observed state of this deployment request. |  |  |
+
+
+#### DynamoGraphDeploymentRequestSpec
+
+
+
+DynamoGraphDeploymentRequestSpec defines the desired state of a DynamoGraphDeploymentRequest.
+This CRD serves as the primary interface for users to request model deployments with
+specific performance constraints and resource requirements, enabling SLA-driven deployments.
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `model` _string_ | Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").<br />This is a high-level identifier for easy reference in kubectl output and logs.<br />The controller automatically sets this value in profilingConfig.config.deployment.model. |  | Required: \{\} <br /> |
+| `backend` _string_ | Backend specifies the inference backend for profiling.<br />The controller automatically sets this value in profilingConfig.config.engine.backend.<br />Profiling runs on real GPUs or via AIC simulation to collect performance data. |  | Enum: [vllm sglang trtllm] <br />Required: \{\} <br /> |
+| `useMocker` _boolean_ | UseMocker indicates whether to deploy a mocker DynamoGraphDeployment instead of<br />a real backend deployment. When true, the deployment uses simulated engines that<br />don't require GPUs, using the profiling data to simulate realistic timing behavior.<br />Mocker is available in all backend images and useful for large-scale experiments.<br />Profiling still runs against the real backend (specified above) to collect performance data. | false |  |
+| `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br />resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br />any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,<br />num_gpus_per_node) with values detected from the cluster.<br />Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\} <br /> |
+| `profilingConfig` _[ProfilingConfigSpec](#profilingconfigspec)_ | ProfilingConfig provides the complete configuration for the profiling job.<br />This configuration is passed directly to the profiler.<br />The structure matches the profile_sla config format exactly (see ProfilingConfigSpec for schema).<br />Note: deployment.model and engine.backend are automatically set from the high-level<br />model and backend fields and should not be specified in this config. |  | Required: \{\} <br /> |
+| `autoApply` _boolean_ | AutoApply indicates whether to automatically create a DynamoGraphDeployment<br />after profiling completes. If false, only the spec is generated and stored in status.<br />Users can then manually create a DGD using the generated spec. | false |  |
+| `deploymentOverrides` _[DeploymentOverridesSpec](#deploymentoverridesspec)_ | DeploymentOverrides allows customizing metadata for the auto-created DGD.<br />Only applicable when AutoApply is true. |  | Optional: \{\} <br /> |
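+
+As an illustration of how these fields fit together, a minimal request might look like the sketch below. The resource name is hypothetical, the profiler image is the example value given under `ProfilingConfigSpec`, and `profilingConfig.config` is left empty because its contents follow the profiler's own schema, which is not reproduced here.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: qwen-sla-request   # hypothetical name
+spec:
+  model: Qwen/Qwen3-0.6B
+  backend: vllm            # one of: vllm, sglang, trtllm
+  autoApply: true          # create the DGD once profiling completes
+  profilingConfig:
+    profilerImage: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+    config: {}             # profiler settings go here (schema not shown)
+```
+
+With `autoApply: false` (the default), no DGD is created and the generated spec is only stored in `status.generatedDeployment`.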
+
+
+#### DynamoGraphDeploymentRequestStatus
+
+
+
+DynamoGraphDeploymentRequestStatus represents the observed state of a DynamoGraphDeploymentRequest.
+The controller updates this status as the DGDR progresses through its lifecycle.
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentRequest](#dynamographdeploymentrequest)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `state` _string_ | State is a high-level textual status of the deployment request lifecycle.<br />Possible values: "", "Pending", "Profiling", "Deploying", "Ready", "DeploymentDeleted", "Failed"<br />Empty string ("") represents the initial state before initialization. |  |  |
+| `backend` _string_ | Backend is extracted from profilingConfig.config.engine.backend for display purposes.<br />This field is populated by the controller and shown in kubectl output. |  | Optional: \{\} <br /> |
+| `observedGeneration` _integer_ | ObservedGeneration reflects the generation of the most recently observed spec.<br />Used to detect spec changes and enforce immutability after profiling starts. |  |  |
+| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the deployment request.<br />Standard condition types include: Validation, Profiling, SpecGenerated, DeploymentReady.<br />Conditions are merged by type on patch updates. |  |  |
+| `profilingResults` _string_ | ProfilingResults contains a reference to the ConfigMap holding profiling data.<br />Format: "configmap/\<name\>" |  | Optional: \{\} <br /> |
+| `generatedDeployment` _[RawExtension](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#rawextension-runtime-pkg)_ | GeneratedDeployment contains the full generated DynamoGraphDeployment specification<br />including metadata, based on profiling results. Users can extract this to create<br />a DGD manually, or it's used automatically when autoApply is true.<br />Stored as RawExtension to preserve all fields including metadata.<br />For mocker backends, this contains the mocker DGD spec. |  | EmbeddedResource: \{\} <br />Optional: \{\} <br /> |
+| `deployment` _[DeploymentStatus](#deploymentstatus)_ | Deployment tracks the auto-created DGD when AutoApply is true.<br />Contains name, namespace, state, and creation status of the managed DGD. |  | Optional: \{\} <br /> |
+
+
+#### DynamoGraphDeploymentScalingAdapter
+
+
+
+DynamoGraphDeploymentScalingAdapter provides a scaling interface for individual services
+within a DynamoGraphDeployment. It implements the Kubernetes scale
+subresource, enabling integration with HPA, KEDA, and custom autoscalers.
+
+The adapter acts as an intermediary between autoscalers and the DGD,
+ensuring that only the adapter controller modifies the DGD's service replicas.
+This prevents conflicts when multiple autoscaling mechanisms are in play.
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
+| `kind` _string_ | `DynamoGraphDeploymentScalingAdapter` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. |  |  |
+| `spec` _[DynamoGraphDeploymentScalingAdapterSpec](#dynamographdeploymentscalingadapterspec)_ |  |  |  |
+| `status` _[DynamoGraphDeploymentScalingAdapterStatus](#dynamographdeploymentscalingadapterstatus)_ |  |  |  |
+
+
+#### DynamoGraphDeploymentScalingAdapterSpec
+
+
+
+DynamoGraphDeploymentScalingAdapterSpec defines the desired state of DynamoGraphDeploymentScalingAdapter
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentScalingAdapter](#dynamographdeploymentscalingadapter)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `replicas` _integer_ | Replicas is the desired number of replicas for the target service.<br />This field is modified by external autoscalers (HPA/KEDA/Planner) or manually by users. |  | Minimum: 0 <br />Required: \{\} <br /> |
+| `dgdRef` _[DynamoGraphDeploymentServiceRef](#dynamographdeploymentserviceref)_ | DGDRef references the DynamoGraphDeployment and the specific service to scale. |  | Required: \{\} <br /> |
+
+
+#### DynamoGraphDeploymentScalingAdapterStatus
+
+
+
+DynamoGraphDeploymentScalingAdapterStatus defines the observed state of DynamoGraphDeploymentScalingAdapter
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentScalingAdapter](#dynamographdeploymentscalingadapter)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `replicas` _integer_ | Replicas is the current number of replicas for the target service.<br />This is synced from the DGD's service replicas and is required for the scale subresource. |  |  |
+| `selector` _string_ | Selector is a label selector string for the pods managed by this adapter.<br />Required for HPA compatibility via the scale subresource. |  |  |
+| `lastScaleTime` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#time-v1-meta)_ | LastScaleTime is the last time the adapter scaled the target service. |  |  |
+
+
+#### DynamoGraphDeploymentServiceRef
+
+
+
+DynamoGraphDeploymentServiceRef identifies a specific service within a DynamoGraphDeployment
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentScalingAdapterSpec](#dynamographdeploymentscalingadapterspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `name` _string_ | Name of the DynamoGraphDeployment |  | MinLength: 1 <br />Required: \{\} <br /> |
+| `serviceName` _string_ | ServiceName is the key name of the service within the DGD's spec.services map to scale |  | MinLength: 1 <br />Required: \{\} <br /> |
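+
+To make the relationship between the adapter and an autoscaler concrete, the sketch below shows an adapter that points at one service of a DGD, followed by a standard `autoscaling/v2` HorizontalPodAutoscaler that targets the adapter through its scale subresource. All resource names, the replica bounds, and the CPU metric are illustrative assumptions. The operator creates the adapter itself when `scalingAdapter.enabled` is true, so the first document shows the resource's shape rather than something you would necessarily author by hand.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentScalingAdapter
+metadata:
+  name: my-graph-worker    # hypothetical name
+spec:
+  replicas: 2
+  dgdRef:
+    name: my-graph         # hypothetical DGD name
+    serviceName: Worker    # key in the DGD's spec.services map
+---
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: my-graph-worker-hpa  # hypothetical name
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: my-graph-worker
+  minReplicas: 1             # illustrative bounds
+  maxReplicas: 8
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 80
+```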
+
+
+#### DynamoGraphDeploymentSpec
+
+
+
+DynamoGraphDeploymentSpec defines the desired state of DynamoGraphDeployment.
+
+
+
+_Appears in:_
+- [DynamoGraphDeployment](#dynamographdeployment)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `pvcs` _[PVC](#pvc) array_ | PVCs defines a list of persistent volume claims that can be referenced by components.<br />Each PVC must have a unique name that can be referenced in component specifications. |  | MaxItems: 100 <br />Optional: \{\} <br /> |
+| `services` _object (keys:string, values:[DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec))_ | Services are the services to deploy as part of this deployment. |  | MaxProperties: 25 <br />Optional: \{\} <br /> |
+| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs are environment variables applied to all services in the deployment unless<br />overridden by service-specific configuration. |  | Optional: \{\} <br /> |
+| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm"). |  | Enum: [sglang vllm trtllm] <br /> |
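+
+The fields above can be combined into a skeleton manifest like the following sketch. The deployment name, service names, and environment variable are hypothetical, and real services will usually need further fields from `DynamoComponentDeploymentSharedSpec` (images, resources, and so on) that are omitted here.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-graph           # hypothetical name
+spec:
+  backendFramework: vllm   # one of: sglang, vllm, trtllm
+  envs:
+    - name: EXAMPLE_FLAG   # hypothetical variable, applied to all services
+      value: "1"
+  services:
+    Frontend:              # hypothetical service key
+      replicas: 1
+    Worker:                # hypothetical service key
+      replicas: 2
+```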
+
+
+#### DynamoGraphDeploymentStatus
+
+
+
+DynamoGraphDeploymentStatus defines the observed state of DynamoGraphDeployment.
+
+
+
+_Appears in:_
+- [DynamoGraphDeployment](#dynamographdeployment)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. |  |  |
+| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.<br />The slice is merged by type on patch updates. |  |  |
+| `services` _object (keys:string, values:[ServiceReplicaStatus](#servicereplicastatus))_ | Services contains per-service replica status information.<br />The map key is the service name from spec.services. |  |  |
+
+
+#### DynamoModel
+
+
+
+DynamoModel is the Schema for the dynamo models API
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
+| `kind` _string_ | `DynamoModel` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. |  |  |
+| `spec` _[DynamoModelSpec](#dynamomodelspec)_ |  |  |  |
+| `status` _[DynamoModelStatus](#dynamomodelstatus)_ |  |  |  |
+
+
+#### DynamoModelSpec
+
+
+
+DynamoModelSpec defines the desired state of DynamoModel
+
+
+
+_Appears in:_
+- [DynamoModel](#dynamomodel)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `modelName` _string_ | ModelName is the full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct-lora") |  | Required: \{\} <br /> |
+| `baseModelName` _string_ | BaseModelName is the base model identifier that matches the service label<br />This is used to discover endpoints via headless services |  | Required: \{\} <br /> |
+| `modelType` _string_ | ModelType specifies the type of model (e.g., "base", "lora", "adapter") | base | Enum: [base lora adapter] <br /> |
+| `source` _[ModelSource](#modelsource)_ | Source specifies the model source location (only applicable for lora model type) |  |  |
+
+
+#### DynamoModelStatus
+
+
+
+DynamoModelStatus defines the observed state of DynamoModel
+
+
+
+_Appears in:_
+- [DynamoModel](#dynamomodel)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `endpoints` _[EndpointInfo](#endpointinfo) array_ | Endpoints is the current list of all endpoints for this model |  |  |
+| `readyEndpoints` _integer_ | ReadyEndpoints is the count of endpoints that are ready |  |  |
+| `totalEndpoints` _integer_ | TotalEndpoints is the total count of endpoints |  |  |
+| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions represents the latest available observations of the model's state |  |  |
+
+
+#### EndpointInfo
+
+
+
+EndpointInfo represents a single endpoint (pod) serving the model
+
+
+
+_Appears in:_
+- [DynamoModelStatus](#dynamomodelstatus)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `address` _string_ | Address is the full address of the endpoint (e.g., "http://10.0.1.5:9090") |  |  |
+| `podName` _string_ | PodName is the name of the pod serving this endpoint |  |  |
+| `ready` _boolean_ | Ready indicates whether the endpoint is ready to serve traffic<br />For LoRA models: true if the POST /loras request succeeded with a 2xx status code<br />For base models: always false (no probing performed) |  |  |
+
+
+#### ExtraPodMetadata
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `annotations` _object (keys:string, values:string)_ | Annotations to add to the created Pods. |  |  |
+| `labels` _object (keys:string, values:string)_ | Labels to add to the created Pods. |  |  |
+
+
+#### ExtraPodSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `mainContainer` _[Container](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#container-v1-core)_ |  |  |  |
+
+
+#### IngressSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `enabled` _boolean_ | Enabled exposes the component through an ingress or virtual service when true. |  |  |
+| `host` _string_ | Host is the base host name to route external traffic to this component. |  |  |
+| `useVirtualService` _boolean_ | UseVirtualService indicates whether to configure a service-mesh VirtualService instead of a standard Ingress. |  |  |
+| `virtualServiceGateway` _string_ | VirtualServiceGateway optionally specifies the gateway name to attach the VirtualService to. |  |  |
+| `hostPrefix` _string_ | HostPrefix is an optional prefix added before the host. |  |  |
+| `annotations` _object (keys:string, values:string)_ | Annotations to set on the generated Ingress/VirtualService resources. |  |  |
+| `labels` _object (keys:string, values:string)_ | Labels to set on the generated Ingress/VirtualService resources. |  |  |
+| `tls` _[IngressTLSSpec](#ingresstlsspec)_ | TLS holds the TLS configuration used by the Ingress/VirtualService. |  |  |
+| `hostSuffix` _string_ | HostSuffix is an optional suffix appended after the host. |  |  |
+| `ingressControllerClassName` _string_ | IngressControllerClassName selects the ingress controller class (e.g., "nginx"). |  |  |
+
+
+#### IngressTLSSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [IngressSpec](#ingressspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `secretName` _string_ | SecretName is the name of a Kubernetes Secret containing the TLS certificate and key. |  |  |
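+
+As a rough example of how `IngressSpec` and `IngressTLSSpec` combine on a component, the fragment below exposes a service through a standard Ingress with TLS. The service key, host name, and Secret name are hypothetical.
+
+```yaml
+services:
+  YourFrontend:
+    ingress:
+      enabled: true
+      host: dynamo.example.com           # hypothetical host
+      ingressControllerClassName: nginx
+      tls:
+        secretName: dynamo-tls           # hypothetical Secret holding cert and key
+```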
+
+
+
+
+#### ModelReference
+
+
+
+ModelReference identifies a model served by this component
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `name` _string_ | Name is the base model identifier (e.g., "llama-3-70b-instruct-v1") |  | Required: \{\} <br /> |
+| `revision` _string_ | Revision is the model revision/version (optional) |  |  |
+
+
+#### ModelSource
+
+
+
+ModelSource defines the source location of a model
+
+
+
+_Appears in:_
+- [DynamoModelSpec](#dynamomodelspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `uri` _string_ | URI is the model source URI<br />Supported formats:<br />- S3: s3://bucket/path/to/model<br />- HuggingFace: hf://org/model@revision_sha |  | Required: \{\} <br /> |
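+
+A LoRA model registration using these fields might look like the sketch below. The resource name and `baseModelName` value are hypothetical placeholders, and the URI reuses the S3 format listed above.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: llama-lora                       # hypothetical name
+spec:
+  modelName: meta-llama/Llama-3.3-70B-Instruct-lora
+  baseModelName: llama-3-70b-instruct-v1 # hypothetical; must match the service label
+  modelType: lora
+  source:
+    uri: s3://bucket/path/to/model
+```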
+
+
+#### MultinodeSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `nodeCount` _integer_ | Indicates the number of nodes to deploy for multinode components.<br />Total number of GPUs is nodeCount * GPU limit.<br />Must be greater than 1. | 2 | Minimum: 2 <br /> |
+
+
+#### PVC
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentSpec](#dynamographdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `create` _boolean_ | Create indicates to create a new PVC |  |  |
+| `name` _string_ | Name is the name of the PVC |  | Required: \{\} <br /> |
+| `storageClass` _string_ | StorageClass to be used for PVC creation. Required when create is true. |  |  |
+| `size` _[Quantity](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#quantity-resource-api)_ | Size of the volume in Gi, used during PVC creation. Required when create is true. |  |  |
+| `volumeAccessMode` _[PersistentVolumeAccessMode](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#persistentvolumeaccessmode-v1-core)_ | VolumeAccessMode is the volume access mode of the PVC. Required when create is true. |  |  |
+
+
+#### ProfilingConfigSpec
+
+
+
+ProfilingConfigSpec defines configuration for the profiling process.
+This structure maps directly to the profile_sla.py config format.
+See benchmarks/profiler/utils/profiler_argparse.py for the complete schema.
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentRequestSpec](#dynamographdeploymentrequestspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `config` _[JSON](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#json-v1-apiextensions-k8s-io)_ | Config is the profiling configuration as arbitrary JSON/YAML. This will be passed directly to the profiler.<br />The profiler will validate the configuration and report any errors. |  | Optional: \{\} <br />Type: object <br /> |
+| `configMapRef` _[ConfigMapKeySelector](#configmapkeyselector)_ | ConfigMapRef is an optional reference to a ConfigMap containing the DynamoGraphDeployment<br />base config file (disagg.yaml). This is separate from the profiling config above.<br />The path to this config will be set as engine.config in the profiling config. |  | Optional: \{\} <br /> |
+| `profilerImage` _string_ | ProfilerImage specifies the container image to use for profiling jobs.<br />This image contains the profiler code and dependencies needed for SLA-based profiling.<br />Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" |  | Required: \{\} <br /> |
+| `outputPVC` _string_ | OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.<br />If specified, all profiling artifacts (logs, plots, configs, raw data) will be written<br />to this PVC instead of an ephemeral emptyDir volume. This allows users to access<br />complete profiling results after the job completes by mounting the PVC.<br />The PVC must exist in the same namespace as the DGDR.<br />If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.<br />Note: ConfigMaps are still created regardless of this setting for planner integration. |  | Optional: \{\} <br /> |
+| `resources` _[ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourcerequirements-v1-core)_ | Resources specifies the compute resource requirements for the profiling job container.<br />If not specified, no resource requests or limits are set. |  | Optional: \{\} <br /> |
+| `tolerations` _[Toleration](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#toleration-v1-core) array_ | Tolerations allows the profiling job to be scheduled on nodes with matching taints.<br />For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint. |  | Optional: \{\} <br /> |
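+
+For illustration, a `profilingConfig` block that persists artifacts to a PVC and tolerates a GPU taint could be sketched as follows. The PVC name and resource figures are assumptions. The `resources` and `tolerations` entries use the standard Kubernetes `ResourceRequirements` and `Toleration` shapes.
+
+```yaml
+profilingConfig:
+  profilerImage: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+  outputPVC: profiling-output   # hypothetical PVC in the DGDR's namespace
+  resources:
+    requests:
+      cpu: "2"                  # illustrative values
+      memory: 8Gi
+  tolerations:
+    - key: nvidia.com/gpu
+      operator: Exists
+      effect: NoSchedule
+```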
+
+
+#### ResourceItem
+
+
+
+
+
+
+
+_Appears in:_
+- [Resources](#resources)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `cpu` _string_ | CPU specifies the CPU resource request/limit (e.g., "1000m", "2") |  |  |
+| `memory` _string_ | Memory specifies the memory resource request/limit (e.g., "4Gi", "8Gi") |  |  |
+| `gpu` _string_ | GPU indicates the number of GPUs to request.<br />Total number of GPUs is nodeCount * GPU in case of multinode deployment. |  |  |
+| `gpuType` _string_ | GPUType can specify a custom GPU type, e.g. "gpu.intel.com/xe"<br />By default if not specified, the GPU type is "nvidia.com/gpu" |  |  |
+| `custom` _object (keys:string, values:string)_ | Custom specifies additional custom resource requests/limits |  |  |
+
+
+#### Resources
+
+
+
+Resources defines requests and limits for a component, including CPU, memory,
+GPUs/devices, and any runtime-specific resources.
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `requests` _[ResourceItem](#resourceitem)_ | Requests specifies the minimum resources required by the component |  |  |
+| `limits` _[ResourceItem](#resourceitem)_ | Limits specifies the maximum resources allowed for the component |  |  |
+| `claims` _[ResourceClaim](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourceclaim-v1-core) array_ | Claims specifies resource claims for dynamic resource allocation |  |  |
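+
+The GPU arithmetic for multinode components can be illustrated with a short fragment. The service key is hypothetical, and the `resources` field name on the component is an assumption, since the field that holds this type is not shown in this part of the reference.
+
+```yaml
+services:
+  YourWorker:
+    multinode:
+      nodeCount: 2
+    resources:        # assumed field name for the Resources type
+      limits:
+        gpu: "8"      # per node; total GPUs = 2 nodes * 8 = 16
+```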
+
+
+#### ScalingAdapter
+
+
+
+ScalingAdapter configures whether a service uses the DynamoGraphDeploymentScalingAdapter
+for replica management. When enabled, the DGDSA owns the replicas field and
+external autoscalers (HPA, KEDA, Planner) can control scaling via the Scale subresource.
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `enabled` _boolean_ | Enabled indicates whether the ScalingAdapter should be enabled for this service.<br />When true, a DGDSA is created and owns the replicas field.<br />When false (default), no DGDSA is created and replicas can be modified directly in the DGD. | false |  |
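+
+Enabling the adapter for a single service is a one-line change, sketched here with a hypothetical service key:
+
+```yaml
+services:
+  YourWorker:
+    replicas: 2        # managed by the DGDSA once the adapter is enabled
+    scalingAdapter:
+      enabled: true
+```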
+
+
+#### ServiceReplicaStatus
+
+
+
+ServiceReplicaStatus contains replica information for a single service.
+
+
+
+_Appears in:_
+- [DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `componentKind` _[ComponentKind](#componentkind)_ | ComponentKind is the underlying resource kind (e.g., "PodClique", "PodCliqueScalingGroup", "Deployment", "LeaderWorkerSet"). |  | Enum: [PodClique PodCliqueScalingGroup Deployment LeaderWorkerSet] <br /> |
+| `componentName` _string_ | ComponentName is the name of the underlying resource. |  |  |
+| `replicas` _integer_ | Replicas is the total number of non-terminated replicas.<br />Required for all component kinds. |  | Minimum: 0 <br /> |
+| `updatedReplicas` _integer_ | UpdatedReplicas is the number of replicas at the current/desired revision.<br />Required for all component kinds. |  | Minimum: 0 <br /> |
+| `readyReplicas` _integer_ | ReadyReplicas is the number of ready replicas.<br />Populated for PodClique, Deployment, and LeaderWorkerSet.<br />Not available for PodCliqueScalingGroup.<br />When nil, the field is omitted from the API response. |  | Minimum: 0 <br /> |
+| `availableReplicas` _integer_ | AvailableReplicas is the number of available replicas.<br />For Deployment: replicas ready for >= minReadySeconds.<br />For PodCliqueScalingGroup: replicas where all constituent PodCliques have >= MinAvailable ready pods.<br />Not available for PodClique or LeaderWorkerSet.<br />When nil, the field is omitted from the API response. |  | Minimum: 0 <br /> |
+
+
+#### SharedMemorySpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
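+| `disabled` _boolean_ | Disabled turns off the shared memory (tmpfs) volume mounted at /dev/shm when true. |  |  |
+| `size` _[Quantity](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#quantity-resource-api)_ | Size is the size of the tmpfs mounted at /dev/shm. |  |  |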
+
+
+#### VolumeMount
+
+
+
+VolumeMount references a PVC defined at the top level for volumes to be mounted by the component
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `name` _string_ | Name references a PVC name defined in the top-level `pvcs` list |  | Required: \{\} <br /> |
+| `mountPoint` _string_ | MountPoint specifies where to mount the volume.<br />If useAsCompilationCache is true and mountPoint is not specified,<br />a backend-specific default will be used. |  |  |
+| `useAsCompilationCache` _boolean_ | UseAsCompilationCache indicates this volume should be used as a compilation cache.<br />When true, backend-specific environment variables will be set and default mount points may be used. | false |  |
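+
+The link between top-level `pvcs` and per-component `volumeMounts` can be sketched as follows. PVC names, the storage class, sizes, and the mount path are all hypothetical.
+
+```yaml
+spec:
+  pvcs:
+    - name: model-cache
+      create: true
+      storageClass: standard          # hypothetical storage class
+      size: 100Gi
+      volumeAccessMode: ReadWriteMany
+    - name: compile-cache             # assumed to exist already, so create is omitted
+  services:
+    YourWorker:
+      volumeMounts:
+        - name: model-cache           # must match a name under spec.pvcs
+          mountPoint: /models
+        - name: compile-cache
+          useAsCompilationCache: true # mountPoint omitted: backend-specific default
+```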
+
+
+# Operator Default Values Injection
+
+The Dynamo operator automatically applies default values to various fields when they are not explicitly specified in your deployments. These defaults include:
+
+- **Health Probes**: Startup, liveness, and readiness probes are configured differently for frontend, worker, and planner components. For example, worker components receive a startup probe with a 2-hour timeout (720 failures × 10 seconds) to accommodate long model loading times.
+
+- **Security Context**: All components receive `fsGroup: 1000` by default to ensure proper file permissions for mounted volumes. This can be overridden via the `extraPodSpec.securityContext` field.
+
+- **Shared Memory**: All components receive an 8Gi shared memory volume mounted at `/dev/shm` by default (can be disabled or resized via the `sharedMemory` field).
+
+- **Environment Variables**: Components automatically receive environment variables like `DYN_NAMESPACE`, `DYN_PARENT_DGD_K8S_NAME`, `DYNAMO_PORT`, and backend-specific variables.
+
+- **Pod Configuration**: Default `terminationGracePeriodSeconds` of 60 seconds and `restartPolicy: Always`.
+
+- **Autoscaling**: When enabled without explicit metrics, defaults to CPU-based autoscaling with 80% target utilization.
+
+- **Backend-Specific Behavior**: For multinode deployments, probes are automatically modified or removed for worker nodes depending on the backend framework (vLLM, SGLang, or TensorRT-LLM).
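+
+The 2-hour worker startup window mentioned above follows from the standard Kubernetes probe timing fields. The fragment below shows only those two fields. The probe handler (path and port) is omitted because it is not stated in this overview.
+
+```yaml
+startupProbe:
+  periodSeconds: 10      # one check every 10 seconds
+  failureThreshold: 720  # 720 * 10 s = 7200 s = 2 hours of allowed startup time
+```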
+
+## Pod Specification Defaults
+
+All components receive the following pod-level defaults unless overridden:
+
+- **`terminationGracePeriodSeconds`**: `60` seconds
+- **`restartPolicy`**: `Always`
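+
+Because `extraPodSpec` is a standard Kubernetes PodSpec, a pod-level default such as the grace period can presumably be overridden there. The sketch below assumes the operator honors the value, and the service key is hypothetical.
+
+```yaml
+services:
+  YourWorker:
+    extraPodSpec:
+      terminationGracePeriodSeconds: 120  # replaces the 60-second default
+```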
+
+## Security Context
+
+The operator automatically applies default security context settings to all components to ensure proper file permissions, particularly for mounted volumes:
+
+- **`fsGroup`**: `1000` - Sets the group ownership of mounted volumes and any files created in those volumes
+
+This default ensures that non-root containers can write to mounted volumes (like model caches or persistent storage) without permission issues. The `fsGroup` setting is particularly important for:
+- Model downloads and caching
+- Compilation cache directories
+- Persistent volume claims (PVCs)
+- SSH key generation in multinode deployments
+
+### Overriding Security Context
+
+To override the default security context, specify your own `securityContext` in the `extraPodSpec` of your component:
+
+```yaml
+services:
+  YourWorker:
+    extraPodSpec:
+      securityContext:
+        fsGroup: 2000  # Custom group ID
+        runAsUser: 1000
+        runAsGroup: 1000
+        runAsNonRoot: true
+```
+
+**Important**: When you provide *any* `securityContext` object in `extraPodSpec`, the operator will not inject any defaults. This gives you complete control over the security context, including the ability to run as root (by omitting `runAsNonRoot` or setting it to `false`).
+
+### OpenShift and Security Context Constraints
+
+In OpenShift environments with Security Context Constraints (SCCs), you may need to omit explicit UID/GID values to allow OpenShift's admission controllers to assign them dynamically:
+
+```yaml
+services:
+  YourWorker:
+    extraPodSpec:
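+      # An explicit empty object counts as a user-provided securityContext, so the
+      # operator skips its defaults (including fsGroup). A bare `securityContext:`
+      # key with only comments parses as null and would not count as an object.
+      # OpenShift then assigns fsGroup and the UID range based on the SCC.
+      securityContext: {}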
+```
+
+Alternatively, if you want to keep the default `fsGroup: 1000` behavior and are certain your cluster allows it, you do not need to specify anything; the operator defaults apply.
+
+## Shared Memory Configuration
+
+Shared memory is enabled by default for all components:
+
+- **Enabled**: `true` (unless explicitly disabled via `sharedMemory.disabled`)
+- **Size**: `8Gi`
+- **Mount Path**: `/dev/shm`
+- **Volume Type**: `emptyDir` with `memory` medium
+
+To disable shared memory or customize the size, use the `sharedMemory` field in your component specification.
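+
+A minimal sketch of both options follows. The `disabled` key comes from the defaults listed above; the `size` key name, the `16Gi` value, and the service names are assumptions for illustration, so check the CRD reference for the exact schema:
+
+```yaml
+services:
+  YourWorker:
+    sharedMemory:
+      size: 16Gi      # assumed key name; raises the default 8Gi size
+  AnotherWorker:
+    sharedMemory:
+      disabled: true  # turns off the default /dev/shm volume
+```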
+
+## Health Probes by Component Type
+
+The operator applies different default health probes based on the component type.
+
+### Frontend Components
+
+Frontend components receive the following probe configurations:
+
+**Liveness Probe:**
+- **Type**: HTTP GET
+- **Path**: `/health`
+- **Port**: `http` (8000)
+- **Initial Delay**: 60 seconds
+- **Period**: 60 seconds
+- **Timeout**: 30 seconds
+- **Failure Threshold**: 10
+
+**Readiness Probe:**
+- **Type**: Exec command
+- **Command**: `curl -s http://localhost:${DYNAMO_PORT}/health | jq -e ".status == \"healthy\""`
+- **Initial Delay**: 60 seconds
+- **Period**: 60 seconds
+- **Timeout**: 30 seconds
+- **Failure Threshold**: 10
+
+### Worker Components
+
+Worker components receive the following probe configurations:
+
+**Liveness Probe:**
+- **Type**: HTTP GET
+- **Path**: `/live`
+- **Port**: `system` (9090)
+- **Period**: 5 seconds
+- **Timeout**: 30 seconds
+- **Failure Threshold**: 1
+
+**Readiness Probe:**
+- **Type**: HTTP GET
+- **Path**: `/health`
+- **Port**: `system` (9090)
+- **Period**: 10 seconds
+- **Timeout**: 30 seconds
+- **Failure Threshold**: 60
+
+**Startup Probe:**
+- **Type**: HTTP GET
+- **Path**: `/live`
+- **Port**: `system` (9090)
+- **Period**: 10 seconds
+- **Timeout**: 5 seconds
+- **Failure Threshold**: 720 (allows up to 2 hours for startup: 10s × 720 = 7200s)
+
+<Note>
+**For larger models (typically >70B parameters) or slower storage systems, you may need to increase the `failureThreshold` to allow more time for model loading. Calculate the required threshold from your expected startup time, rounding up: `failureThreshold = ceil(expected_startup_seconds / period_seconds)`. Override the startup probe in your component specification if the default 2-hour window is insufficient.**
+</Note>
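+
+As a worked example of that calculation, the sketch below assumes a model that needs up to 3 hours (10,800 seconds) to load. With the default 10-second period, that gives a `failureThreshold` of 10800 / 10 = 1080. The service name `YourWorker` is a placeholder, and placing the probe under `extraPodSpec.mainContainer` is an assumption based on the override notes at the end of this document:
+
+```yaml
+services:
+  YourWorker:
+    extraPodSpec:
+      mainContainer:
+        startupProbe:
+          httpGet:
+            path: /live           # same path as the default startup probe
+            port: system          # named port 9090
+          periodSeconds: 10
+          timeoutSeconds: 5
+          failureThreshold: 1080  # 10s x 1080 = 10800s (3 hours)
+```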
+
+### Multinode Deployment Probe Modifications
+
+For multinode deployments, the operator modifies probes based on the backend framework and node role:
+
+#### vLLM Backend
+
+The operator automatically selects between two deployment modes based on parallelism configuration:
+
+**Tensor/Pipeline Parallel Mode** (when `world_size > GPUs_per_node`):
+- Uses Ray for distributed execution (`--distributed-executor-backend ray`)
+- **Leader nodes**: Start the Ray head and run vLLM; all probes remain active
+- **Worker nodes**: Run Ray agents only; all probes (liveness, readiness, startup) are removed
+
+**Data Parallel Mode** (when `world_size × data_parallel_size > GPUs_per_node`):
+- **Worker nodes**: All probes (liveness, readiness, startup) are removed
+- **Leader nodes**: All probes remain active
+
+#### SGLang Backend
+- **Worker nodes**: All probes (liveness, readiness, startup) are removed
+
+#### TensorRT-LLM Backend
+- **Leader nodes**: All probes remain unchanged
+- **Worker nodes**:
+  - Liveness and startup probes are removed
+  - Readiness probe is replaced with a TCP socket check on SSH port (2222):
+    - **Initial Delay**: 20 seconds
+    - **Period**: 20 seconds
+    - **Timeout**: 5 seconds
+    - **Failure Threshold**: 10
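+
+Expressed in standard Kubernetes probe syntax, the replacement readiness probe for TensorRT-LLM worker nodes would look roughly like this. It is a sketch reconstructed from the values listed above, not a copy of the operator's generated manifest:
+
+```yaml
+readinessProbe:
+  tcpSocket:
+    port: 2222            # SSH port checked on worker nodes
+  initialDelaySeconds: 20
+  periodSeconds: 20
+  timeoutSeconds: 5
+  failureThreshold: 10
+```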
+
+## Environment Variables
+
+The operator automatically injects environment variables based on component type and configuration:
+
+### All Components
+
+- **`DYN_NAMESPACE`**: The Dynamo namespace for the component
+- **`DYN_PARENT_DGD_K8S_NAME`**: The parent DynamoGraphDeployment Kubernetes resource name
+- **`DYN_PARENT_DGD_K8S_NAMESPACE`**: The parent DynamoGraphDeployment Kubernetes namespace
+
+### Frontend Components
+
+- **`DYNAMO_PORT`**: `8000`
+- **`DYN_HTTP_PORT`**: `8000`
+
+### Worker Components
+
+- **`DYN_SYSTEM_PORT`**: `9090` (automatically enables the system metrics server)
+- **`DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS`**: `["generate"]`
+- **`DYN_SYSTEM_ENABLED`**: `true` (needed for runtime images 0.6.1 and older)
+
+### Planner Components
+
+- **`PLANNER_PROMETHEUS_PORT`**: `9085`
+
+### vLLM Backend (with compilation cache)
+
+When a volume mount is configured with `useAsCompilationCache: true`:
+- **`VLLM_CACHE_ROOT`**: Set to the mount point of the cache volume
+
+## Service Account
+
+Planner components automatically receive the following service account:
+
+- **`serviceAccountName`**: `planner-serviceaccount`
+
+## Image Pull Secrets
+
+The operator automatically discovers and injects image pull secrets for container images. When a component specifies a container image, the operator:
+
+1. Scans all Kubernetes secrets of type `kubernetes.io/dockerconfigjson` in the component's namespace
+2. Extracts the docker registry server URLs from each secret's authentication configuration
+3. Matches the container image's registry host against the discovered registry URLs
+4. Automatically injects matching secrets as `imagePullSecrets` in the pod specification
+
+This eliminates the need to manually specify image pull secrets for each component. The operator maintains an internal index of docker secrets and their associated registries, refreshing this index periodically.
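+
+For reference, a secret that this discovery process would pick up is a standard Kubernetes secret of type `kubernetes.io/dockerconfigjson` in the same namespace as the component. The secret name, namespace, and registry host below are hypothetical:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: my-registry-credentials   # hypothetical name
+  namespace: my-dynamo-namespace  # must be the component's namespace
+type: kubernetes.io/dockerconfigjson
+data:
+  # Base64-encoded Docker config JSON. Its "auths" map is keyed by registry
+  # server (for example, "registry.example.com"), which is matched against
+  # the container image's registry host.
+  .dockerconfigjson: <base64-encoded-docker-config>
+```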
+
+**To disable automatic image pull secret discovery** for a specific component, add the following annotation:
+
+```yaml
+annotations:
+  nvidia.com/disable-image-pull-secret-discovery: "true"
+```
+
+## Autoscaling Defaults
+
+When autoscaling is enabled but no metrics are specified, the operator applies:
+
+- **Default Metric**: CPU utilization
+- **Target Average Utilization**: `80%`
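+
+In Kubernetes `autoscaling/v2` HorizontalPodAutoscaler terms, this default corresponds to a metrics stanza like the one below. This is a sketch of the equivalent configuration; the exact object the operator generates is not shown here:
+
+```yaml
+metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 80  # target average CPU utilization, in percent
+```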
+
+## Port Configurations
+
+Default container ports are configured based on component type:
+
+### Frontend Components
+- **Port**: 8000
+- **Protocol**: TCP
+- **Name**: `http`
+
+### Worker Components
+- **Port**: 9090
+- **Protocol**: TCP
+- **Name**: `system`
+
+### Planner Components
+- **Port**: 9085
+- **Protocol**: TCP
+- **Name**: `metrics`
+
+## Backend-Specific Configurations
+
+### vLLM
+- **Ray Head Port**: 6379 (for Ray cluster coordination in multinode TP/PP deployments)
+- **Data Parallel RPC Port**: 13445 (for data parallel multinode deployments)
+
+### SGLang
+- **Distribution Init Port**: 29500 (for multinode deployments)
+
+### TensorRT-LLM
+- **SSH Port**: 2222 (for multinode MPI communication)
+- **OpenMPI Environment**: `OMPI_MCA_orte_keep_fqdn_hostnames=1`
+
+## Implementation Reference
+
+For users who want to understand the implementation details or contribute to the operator, the default values described in this document are set in the following source files:
+
+- **Health Probes, Security Context & Pod Specifications**: [`internal/dynamo/graph.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/graph.go) - Contains the main logic for applying default probes, security context, environment variables, shared memory, and pod configurations
+- **Component-Specific Defaults**:
+  - [`internal/dynamo/component_frontend.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_frontend.go)
+  - [`internal/dynamo/component_worker.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_worker.go)
+  - [`internal/dynamo/component_planner.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_planner.go)
+- **Image Pull Secrets**: [`internal/secrets/docker.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/secrets/docker.go) - Implements the docker secret indexer and automatic discovery
+- **Backend-Specific Behavior**:
+  - [`internal/dynamo/backend_vllm.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_vllm.go)
+  - [`internal/dynamo/backend_sglang.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_sglang.go)
+  - [`internal/dynamo/backend_trtllm.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_trtllm.go)
+- **Constants & Annotations**: [`internal/consts/consts.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/consts/consts.go) - Defines annotation keys and other constants
+
+## Notes
+
+- All these defaults can be overridden by explicitly specifying values in your DynamoComponentDeployment or DynamoGraphDeployment resources
+- User-specified probes (via `livenessProbe`, `readinessProbe`, or `startupProbe` fields) take precedence over operator defaults
+- For security context, if you provide *any* `securityContext` in `extraPodSpec`, no defaults will be injected, giving you full control
+- For multinode deployments, some defaults are modified or removed as described above to accommodate distributed execution patterns
+- The `extraPodSpec.mainContainer` field can be used to override probe configurations set by the operator