docs: migrate existing docs to fern (#5445)

Signed-off-by: Jont828 <jt572@cornell.edu> Signed-off-by: Neal Vaidya <nealv@nvidia.com> Co-authored-by: Neal Vaidya <nealv@nvidia.com>

Unverified Commit f9050aae authored Jan 26, 2026 by Jonathan Tong Committed by GitHub Jan 26, 2026
20 changed files
--- a/fern/pages/kubernetes/autoscaling.md
+++ b/fern/pages/kubernetes/autoscaling.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Autoscaling"
+---
+
+This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`.
+
+## Example DGD
+
+All examples in this guide use the following DGD:
+
+```yaml
+# examples/backends/sglang/deploy/agg.yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: sglang-agg
+  namespace: default
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: sglang-agg
+      componentType: frontend
+      replicas: 1
+
+    decode:
+      dynamoNamespace: sglang-agg
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+```
+
+**Key identifiers:**
+- **DGD name**: `sglang-agg`
+- **Namespace**: `default`
+- **Services**: `Frontend`, `decode`
+- **dynamo_namespace label**: `default-sglang-agg` (used for metric filtering)
+
+## Overview
+
+Dynamo provides flexible autoscaling through the `DynamoGraphDeploymentScalingAdapter` (DGDSA) resource. When you deploy a DGD, the operator automatically creates one adapter per service (unless explicitly disabled). These adapters implement the Kubernetes [Scale subresource](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource), enabling integration with:
+
+| Autoscaler | Description | Best For |
+|------------|-------------|----------|
+| **KEDA** | Event-driven autoscaling (recommended) | Most use cases |
+| **Kubernetes HPA** | Native horizontal pod autoscaling | Simple CPU/memory-based scaling |
+| **Dynamo Planner** | LLM-aware autoscaling with SLA optimization | Production LLM workloads |
+| **Custom Controllers** | Any scale-subresource-compatible controller | Custom requirements |
+
+<Warning>
+**Deprecation Notice:** The `spec.services[X].autoscaling` field in DGD is **deprecated and ignored**. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you'll see a warning. Remove the field to silence the warning.
+</Warning>
+
+## Architecture
+
+```
+┌──────────────────────────────────┐          ┌─────────────────────────────────────┐
+│   DynamoGraphDeployment          │          │   Scaling Adapters (auto-created)   │
+│   "sglang-agg"                   │          │   (one per service)                 │
+├──────────────────────────────────┤          ├─────────────────────────────────────┤
+│                                  │          │                                     │
+│  spec.services:                  │          │  ┌─────────────────────────────┐    │      ┌──────────────────┐
+│                                  │          │  │ sglang-agg-frontend         │◄───┼──────│   Autoscalers    │
+│    ┌────────────────────────┐◄───┼──────────┼──│ spec.replicas: 1            │    │      │                  │
+│    │ Frontend: 1 replica    │    │          │  └─────────────────────────────┘    │      │  • KEDA          │
+│    └────────────────────────┘    │          │                                     │      │  • HPA           │
+│                                  │          │  ┌─────────────────────────────┐    │      │  • Planner       │
+│    ┌────────────────────────┐◄───┼──────────┼──│ sglang-agg-decode           │◄───┼──────│  • Custom        │
+│    │ decode:   1 replica    │    │          │  │ spec.replicas: 1            │    │      │                  │
+│    └────────────────────────┘    │          │  └─────────────────────────────┘    │      └──────────────────┘
+│                                  │          │                                     │
+└──────────────────────────────────┘          └─────────────────────────────────────┘
+```
+
+**How it works:**
+
+1. You deploy a DGD with services (Frontend, decode)
+2. The operator auto-creates one DGDSA per service
+3. Autoscalers (KEDA, HPA, Planner) target the adapters via the `/scale` subresource
+4. The adapter controller syncs replica changes to the DGD
+5. The DGD controller reconciles the underlying pods
+
+## Viewing Scaling Adapters
+
+After deploying the `sglang-agg` DGD, verify the auto-created adapters:
+
+```bash
+kubectl get dgdsa -n default
+
+# Example output:
+# NAME                  DGD         SERVICE    REPLICAS   AGE
+# sglang-agg-frontend   sglang-agg  Frontend   1          5m
+# sglang-agg-decode     sglang-agg  decode     1          5m
+```
+
+## Replica Ownership Model
+
+When DGDSA is enabled (the default), it becomes the **source of truth** for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.
+
+### How It Works
+
+1. **DGDSA owns replicas**: Autoscalers (HPA, KEDA, Planner) update the DGDSA's `spec.replicas`
+2. **DGDSA syncs to DGD**: The DGDSA controller writes the replica count to the DGD's service
+3. **Direct DGD edits blocked**: A validating webhook prevents users from directly editing `spec.services[X].replicas` in the DGD
+4. **Controllers allowed**: Only authorized controllers (operator, Planner) can modify DGD replicas
+
+### Manual Scaling with DGDSA Enabled
+
+When DGDSA is enabled, use `kubectl scale` on the adapter (not the DGD):
+
+```bash
+# ✅ Correct - scale via DGDSA
+kubectl scale dgdsa sglang-agg-decode --replicas=3
+
+# ❌ Blocked - direct DGD edit rejected by webhook
+kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
+# Error: spec.services[decode].replicas cannot be modified directly when scaling adapter is enabled;
+#        use 'kubectl scale dgdsa/sglang-agg-decode --replicas=3' or update the DynamoGraphDeploymentScalingAdapter instead
+```
+
+## Enabling DGDSA for a Service
+
+By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: sglang-agg
+spec:
+  services:
+    Frontend:
+      replicas: 2        # ← No DGDSA by default, direct edits allowed
+
+    decode:
+      replicas: 1
+      scalingAdapter:
+        enabled: true    # ← DGDSA created, managed via adapter
+```
+
+**When to enable DGDSA:**
+- You want to use HPA, KEDA, or Planner for autoscaling
+- You want a clear separation between "desired scale" (adapter) and "deployment config" (DGD)
+- You want protection against accidental direct replica edits
+
+**When to keep DGDSA disabled (default):**
+- You want simple, manual replica management
+- You don't need autoscaling for that service
+- You prefer direct DGD edits over adapter-based scaling
+
+## Autoscaling with Dynamo Planner
+
+The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.
+
+**When to use Planner:**
+- You want LLM-optimized autoscaling out of the box
+- You need coordinated scaling across prefill/decode services
+- You want SLA-driven scaling (e.g., target TTFT < 500ms)
+
+**How Planner works:**
+
+Planner is deployed as a service component within your DGD. It:
+1. Queries Prometheus for frontend metrics (request rate, latency, etc.)
+2. Uses profiling data to predict optimal replica counts
+3. Scales prefill/decode workers to meet SLA targets
+
+**Deployment:**
+
+The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../planner/sla-planner-quickstart.md) for complete instructions.
+
+Example configurations with Planner:
+- `examples/backends/vllm/deploy/disagg_planner.yaml`
+- `examples/backends/sglang/deploy/disagg_planner.yaml`
+- `examples/backends/trtllm/deploy/disagg_planner.yaml`
+
+For more details, see the [SLA Planner documentation](../planner/sla-planner.md).
+
+## Autoscaling with Kubernetes HPA
+
+The Horizontal Pod Autoscaler (HPA) is Kubernetes' native autoscaling solution.
+
+**When to use HPA:**
+- You have simple, predictable scaling requirements
+- You want to use standard Kubernetes tooling
+- You need CPU or memory-based scaling
+
+> **Note**: For custom metrics (like TTFT or queue depth), consider using [KEDA](#autoscaling-with-keda-recommended) instead, which is simpler to configure.
+
+### Basic HPA (CPU-based)
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: sglang-agg-frontend-hpa
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-frontend
+  minReplicas: 1
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
+  behavior:
+    scaleDown:
+      stabilizationWindowSeconds: 300
+    scaleUp:
+      stabilizationWindowSeconds: 0
+```
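+
+Memory-based scaling, mentioned above, can be added to the same HPA. The following is a minimal sketch, assuming the adapter's scale subresource exposes a pod selector so that resource metrics can be resolved; the 80% memory target is an illustrative value, not a recommendation. When several metrics are listed, HPA computes a desired replica count for each and uses the largest:
+
+```yaml
+# Sketch: the CPU-based HPA above, extended with a memory metric.
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: sglang-agg-frontend-hpa
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-frontend
+  minReplicas: 1
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
+  - type: Resource
+    resource:
+      name: memory
+      target:
+        type: Utilization
+        averageUtilization: 80   # Illustrative value (assumption)
+```
+
+Keeping both metrics in one HPA, rather than creating a second HPA for the same adapter, avoids two autoscalers competing over the same replica count.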
+
+### HPA with Dynamo Metrics
+
+Dynamo exports several metrics useful for autoscaling. These are available at the `/metrics` endpoint on each frontend pod.
+
+> **See also**: For a complete list of all Dynamo metrics, see the [Metrics Reference](../observability/metrics.md). For Prometheus and Grafana setup, see the [Prometheus and Grafana Setup Guide](../observability/prometheus-grafana.md).
+
+#### Available Dynamo Metrics
+
+| Metric | Type | Description | Good for scaling |
+|--------|------|-------------|------------------|
+| `dynamo_frontend_queued_requests` | Gauge | Requests waiting in HTTP queue | ✅ Workers |
+| `dynamo_frontend_inflight_requests` | Gauge | Concurrent requests to engine | ✅ All services |
+| `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
+| `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
+| `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
+| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |
+
+#### Metric Labels
+
+Dynamo metrics include these labels for filtering:
+
+| Label | Description | Example |
+|-------|-------------|---------|
+| `dynamo_namespace` | Unique DGD identifier (`{k8s-namespace}-{dynamoNamespace}`) | `default-sglang-agg` |
+| `model` | Model being served | `Qwen/Qwen3-0.6B` |
+
+> **Note**: When you have multiple DGDs in the same namespace, use `dynamo_namespace` to filter metrics for a specific DGD.
+
+#### Example: Scale Decode Service Based on TTFT
+
+Using HPA with Prometheus Adapter requires configuring external metrics.
+
+**Step 1: Configure Prometheus Adapter**
+
+Add this to your Helm values file (e.g., `prometheus-adapter-values.yaml`):
+
+```yaml
+# prometheus-adapter-values.yaml
+prometheus:
+  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
+  port: 9090
+
+rules:
+  external:
+  # TTFT p95 from frontend - used to scale decode
+  - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
+    resources:
+      overrides:
+        namespace: {resource: "namespace"}
+    name:
+      as: "dynamo_ttft_p95_seconds"
+    metricsQuery: |
+      histogram_quantile(0.95,
+        sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
+        by (le, namespace, dynamo_namespace)
+      )
+```
+
+**Step 2: Install Prometheus Adapter**
+
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+
+helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
+  -n monitoring --create-namespace \
+  -f prometheus-adapter-values.yaml
+```
+
+**Step 3: Verify the metric is available**
+
+```bash
+kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<your-namespace>/dynamo_ttft_p95_seconds" | jq
+```
+
+**Step 4: Create the HPA**
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: sglang-agg-decode-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-decode              # ← DGD name + service name (lowercase)
+  minReplicas: 1
+  maxReplicas: 10
+  metrics:
+  - type: External
+    external:
+      metric:
+        name: dynamo_ttft_p95_seconds
+        selector:
+          matchLabels:
+            dynamo_namespace: "default-sglang-agg"  # ← {namespace}-{dynamoNamespace}
+      target:
+        type: Value
+        value: "500m"  # Scale up when TTFT p95 > 500ms
+  behavior:
+    scaleDown:
+      stabilizationWindowSeconds: 60    # Wait 1 min before scaling down
+      policies:
+      - type: Pods
+        value: 1
+        periodSeconds: 30
+    scaleUp:
+      stabilizationWindowSeconds: 0      # Scale up immediately
+      policies:
+      - type: Pods
+        value: 2
+        periodSeconds: 30
+```
+
+**How it works:**
+1. Frontend pods export `dynamo_frontend_time_to_first_token_seconds` histogram
+2. Prometheus Adapter calculates p95 TTFT per `dynamo_namespace`
+3. HPA monitors this metric filtered by `dynamo_namespace: "default-sglang-agg"`
+4. When TTFT p95 > 500ms, HPA scales up the `sglang-agg-decode` adapter
+5. Adapter controller syncs the replica count to the DGD's `decode` service
+6. More decode workers are created, reducing TTFT
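+
+As a rough worked example of step 4, the scaling rule documented for the Kubernetes HPA is shown below. The replica count and observed TTFT are hypothetical numbers chosen for illustration. HPA also applies a tolerance and the `behavior` policies before acting, and TTFT does not necessarily fall in proportion to the number of workers:
+
+```latex
+\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
+
+% Hypothetical numbers: 2 decode replicas, observed TTFT p95 of 0.75 s, target of 0.5 s
+\left\lceil 2 \times \frac{0.75}{0.5} \right\rceil = \lceil 3 \rceil = 3
+```
+
+In this sketch HPA would request 3 replicas on the `sglang-agg-decode` adapter, which adds one pod and therefore stays within the `scaleUp` policy of 2 pods per 30 seconds.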
+
+#### Example: Scale Based on Queue Depth
+
+Add this rule to your `prometheus-adapter-values.yaml` (alongside the TTFT rule):
+
+```yaml
+# Add to rules.external in prometheus-adapter-values.yaml
+- seriesQuery: 'dynamo_frontend_queued_requests{namespace!=""}'
+  resources:
+    overrides:
+      namespace: {resource: "namespace"}
+  name:
+    as: "dynamo_queued_requests"
+  metricsQuery: |
+    sum(<<.Series>>{<<.LabelMatchers>>}) by (namespace, dynamo_namespace)
+```
+
+Then create the HPA:
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: sglang-agg-decode-queue-hpa
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-decode
+  minReplicas: 1
+  maxReplicas: 10
+  metrics:
+  - type: External
+    external:
+      metric:
+        name: dynamo_queued_requests
+        selector:
+          matchLabels:
+            dynamo_namespace: "default-sglang-agg"
+      target:
+        type: Value
+        value: "10"  # Scale up when queue > 10 requests
+```
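+
+With `type: Value`, the summed queue depth is compared against the target directly. HPA also supports an `AverageValue` target for external metrics, which divides the metric by the current replica count before comparing. The fragment below sketches only the `metrics` section, assuming a budget of roughly 10 queued requests per decode replica (an illustrative number):
+
+```yaml
+# Sketch: replaces the metrics section of the HPA above.
+  metrics:
+  - type: External
+    external:
+      metric:
+        name: dynamo_queued_requests
+        selector:
+          matchLabels:
+            dynamo_namespace: "default-sglang-agg"
+      target:
+        type: AverageValue
+        averageValue: "10"  # About 10 queued requests per replica (assumption)
+```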
+
+## Autoscaling with KEDA (Recommended)
+
+KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.
+
+**Advantages over HPA + Prometheus Adapter:**
+- No Prometheus Adapter configuration needed
+- PromQL queries are defined in the ScaledObject itself (declarative, per-deployment)
+- Easy to update - just `kubectl apply` the ScaledObject
+- Can scale to zero when idle
+- Supports multiple triggers per object
+
+**When to use KEDA:**
+- You want simpler configuration (no Prometheus Adapter to manage)
+- You need event-driven scaling (e.g., queue depth or Kafka)
+- You want to scale to zero when idle
+
+### Installing KEDA
+
+```bash
+# Add KEDA Helm repo
+helm repo add kedacore https://kedacore.github.io/charts
+helm repo update
+
+# Install KEDA
+helm install keda kedacore/keda \
+  --namespace keda \
+  --create-namespace
+
+# Verify installation
+kubectl get pods -n keda
+```
+
+> **Note**: If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts.
+
+### Example: Scale Decode Based on TTFT
+
+Using the `sglang-agg` DGD from `examples/backends/sglang/deploy/agg.yaml`:
+
+```yaml
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: sglang-agg-decode-scaler
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-decode
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  pollingInterval: 15      # Check metrics every 15 seconds
+  cooldownPeriod: 60       # Wait 60s after the last active trigger before scaling to zero; 1-to-N scale-down follows the HPA behavior
+  triggers:
+  - type: prometheus
+    metadata:
+      # Update this URL to match your Prometheus service
+      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
+      metricName: dynamo_ttft_p95
+      query: |
+        histogram_quantile(0.95,
+          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
+          by (le)
+        )
+      threshold: "0.5"              # Scale up when TTFT p95 > 500ms (0.5 seconds)
+      activationThreshold: "0.1"    # Start scaling when TTFT > 100ms
+```
+
+Apply it:
+
+```bash
+kubectl apply -f sglang-agg-decode-scaler.yaml
+```
+
+### Verify KEDA Scaling
+
+```bash
+# Check ScaledObject status
+kubectl get scaledobject -n default
+
+# KEDA creates an HPA under the hood - you can see it
+kubectl get hpa -n default
+
+# Example output:
+# NAME                                REFERENCE                                              TARGETS      MINPODS   MAXPODS   REPLICAS
+# keda-hpa-sglang-agg-decode-scaler   DynamoGraphDeploymentScalingAdapter/sglang-agg-decode  45m/500m     1         10        1
+
+# Get detailed status
+kubectl describe scaledobject sglang-agg-decode-scaler -n default
+```
+
+### Example: Scale Based on Queue Depth
+
+```yaml
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: sglang-agg-decode-queue-scaler
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-decode
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  pollingInterval: 15
+  cooldownPeriod: 60
+  triggers:
+  - type: prometheus
+    metadata:
+      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
+      metricName: dynamo_queued_requests
+      query: |
+        sum(dynamo_frontend_queued_requests{dynamo_namespace="default-sglang-agg"})
+      threshold: "10"    # Scale up when queue > 10 requests
+```
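+
+Because KEDA supports multiple triggers per object, the TTFT and queue-depth queries can also live in a single ScaledObject instead of two objects that target the same adapter. The following is a sketch that reuses the queries and thresholds from the examples above; the object name is hypothetical. Each trigger is exposed to the underlying HPA as its own metric, and the largest resulting replica count is used:
+
+```yaml
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: sglang-agg-decode-combined-scaler   # Hypothetical name
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-decode
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  triggers:
+  # Trigger 1: TTFT p95
+  - type: prometheus
+    metadata:
+      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
+      query: |
+        histogram_quantile(0.95,
+          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
+          by (le)
+        )
+      threshold: "0.5"
+  # Trigger 2: queue depth
+  - type: prometheus
+    metadata:
+      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
+      query: |
+        sum(dynamo_frontend_queued_requests{dynamo_namespace="default-sglang-agg"})
+      threshold: "10"
+```
+
+Use this in place of the single-trigger ScaledObjects rather than alongside them, so that only one autoscaler manages the `sglang-agg-decode` adapter.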
+
+### How KEDA Works
+
+KEDA creates and manages an HPA under the hood:
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│  You create: ScaledObject                                            │
+│    - scaleTargetRef: sglang-agg-decode                               │
+│    - triggers: prometheus query                                      │
+└──────────────────────────────────────────────────────────────────────┘
+                                │
+                                ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  KEDA Operator automatically creates: HPA                            │
+│    - name: keda-hpa-sglang-agg-decode-scaler                         │
+│    - scaleTargetRef: sglang-agg-decode                               │
+│    - metrics: External (from KEDA metrics server)                    │
+└──────────────────────────────────────────────────────────────────────┘
+                                │
+                                ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  DynamoGraphDeploymentScalingAdapter: sglang-agg-decode              │
+│    - spec.replicas: updated by HPA                                   │
+└──────────────────────────────────────────────────────────────────────┘
+                                │
+                                ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  DynamoGraphDeployment: sglang-agg                                   │
+│    - spec.services.decode.replicas: synced from adapter              │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+## Mixed Autoscaling
+
+For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:
+
+```yaml
+---
+# HPA for Frontend (CPU-based)
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: sglang-agg-frontend-hpa
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-frontend
+  minReplicas: 1
+  maxReplicas: 5
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
+
+---
+# KEDA for Decode (TTFT-based)
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: sglang-agg-decode-scaler
+  namespace: default
+spec:
+  scaleTargetRef:
+    apiVersion: nvidia.com/v1alpha1
+    kind: DynamoGraphDeploymentScalingAdapter
+    name: sglang-agg-decode
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  triggers:
+  - type: prometheus
+    metadata:
+      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
+      query: |
+        histogram_quantile(0.95,
+          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
+          by (le)
+        )
+      threshold: "0.5"
+```
+
+## Manual Scaling
+
+### With DGDSA Enabled (Default)
+
+When DGDSA is enabled (the default), scale via the adapter:
+
+```bash
+kubectl scale dgdsa sglang-agg-decode -n default --replicas=3
+```
+
+Verify the scaling:
+
+```bash
+kubectl get dgdsa sglang-agg-decode -n default
+
+# Output:
+# NAME                DGD         SERVICE   REPLICAS   AGE
+# sglang-agg-decode   sglang-agg  decode    3          10m
+```
+
+> **Note**: If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
+
+### With DGDSA Disabled
+
+If you've disabled the scaling adapter for a service, edit the DGD directly:
+
+```bash
+kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
+```
+
+Or edit the YAML (no `scalingAdapter.enabled: true` means direct edits are allowed):
+
+```yaml
+spec:
+  services:
+    decode:
+      replicas: 3
+      # No scalingAdapter.enabled means replicas can be edited directly
+```
+
+## Best Practices
+
+### 1. Choose One Autoscaler Per Service
+
+Avoid configuring multiple autoscalers for the same service:
+
+| Configuration | Status |
+|---------------|--------|
+| HPA for frontend, Planner for prefill/decode | ✅ Good |
+| KEDA for all services | ✅ Good |
+| Planner only (default) | ✅ Good |
+| HPA + Planner both targeting decode | ❌ Bad - they will fight |
+
+### 2. Use Appropriate Metrics
+
+| Service Type | Recommended Metrics | Dynamo Metric |
+|--------------|---------------------|---------------|
+| Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
+| Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
+| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
+
+### 3. Configure Stabilization Windows
+
+Prevent thrashing with appropriate stabilization:
+
+```yaml
+# HPA
+behavior:
+  scaleDown:
+    stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
+  scaleUp:
+    stabilizationWindowSeconds: 0    # Scale up immediately
+
+# KEDA
+spec:
+  cooldownPeriod: 300
+```
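+
+In KEDA, `cooldownPeriod` governs the wait before scaling back to zero, while scale-down between 1 and N replicas is handled by the HPA that KEDA creates. The following is a sketch, assuming you want the same 5-minute window as the HPA example, that passes `behavior` through the ScaledObject's `advanced.horizontalPodAutoscalerConfig` field:
+
+```yaml
+# KEDA ScaledObject (fragment)
+spec:
+  cooldownPeriod: 300
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleDown:
+          stabilizationWindowSeconds: 300  # Applied by the KEDA-managed HPA
+```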
+
+### 4. Set Sensible Min/Max Replicas
+
+Always configure minimum and maximum replicas in your HPA/KEDA to prevent:
+- Scaling to zero (unless intentional)
+- Unbounded scaling that exhausts cluster resources
+
+## Troubleshooting
+
+### Adapters Not Created
+
+```bash
+# Check DGD status
+kubectl describe dgd sglang-agg -n default
+
+# Check operator logs
+kubectl logs -n dynamo-system deployment/dynamo-operator
+```
+
+### Scaling Not Working
+
+```bash
+# Check adapter status
+kubectl describe dgdsa sglang-agg-decode -n default
+
+# Check HPA/KEDA status
+kubectl describe hpa sglang-agg-decode-hpa -n default
+kubectl describe scaledobject sglang-agg-decode-scaler -n default
+
+# Verify metrics are available in Kubernetes metrics API
+kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
+```
+
+### Metrics Not Available
+
+If HPA/KEDA shows `<unknown>` for metrics:
+
+```bash
+# Check that the frontend is exporting Dynamo metrics
+kubectl port-forward -n default svc/sglang-agg-frontend 8000:8000
+curl http://localhost:8000/metrics | grep dynamo_frontend
+
+# Example output:
+# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2
+# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5
+
+# Verify Prometheus is scraping the metrics
+kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
+# Then query: dynamo_frontend_time_to_first_token_seconds_bucket
+
+# Check KEDA operator logs
+kubectl logs -n keda deployment/keda-operator
+```
+
+### Rapid Scaling Up and Down
+
+If you see unstable scaling:
+
+1. Check if multiple autoscalers are targeting the same adapter
+2. Increase `cooldownPeriod` in KEDA ScaledObject
+3. Increase `stabilizationWindowSeconds` in HPA behavior
+
+## References
+
+- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
+- [KEDA Documentation](https://keda.sh/)
+- [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
+- [Planner Documentation](../planner/sla-planner.md)
+- [Dynamo Metrics Reference](../observability/metrics.md)
+- [Prometheus and Grafana Setup](../observability/prometheus-grafana.md)
+
--- a/fern/pages/kubernetes/deployment/create-deployment.md
+++ b/fern/pages/kubernetes/deployment/create-deployment.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Creating Kubernetes Deployments"
+---
+
+The scripts in the `examples/backends/<backend>/launch` folder, such as [agg.sh](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/agg.sh), demonstrate how to serve your models locally.
+The corresponding YAML files, such as [agg.yaml](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml), show how to create a Kubernetes deployment for your inference graph.
+
+This guide explains how to create your own deployment files.
+
+## Step 1: Choose Your Architecture Pattern
+
+Before choosing a template, understand the different architecture patterns:
+
+### Aggregated Serving (agg.yaml)
+
+**Pattern**: Prefill and decode on the same GPU in a single process.
+
+**Recommended for**:
+- Small to medium models (under 70B parameters)
+- Development and testing
+- Low to moderate traffic
+- Cases where simplicity matters more than maximum throughput
+
+**Tradeoffs**:
+- Simpler setup and debugging
+- Lower operational complexity
+- GPU utilization may not be optimal (prefill and decode compete for resources)
+- Lower throughput ceiling compared to disaggregated
+
+**Example**: [`agg.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml)
+
+### Aggregated + Router (agg_router.yaml)
+
+**Pattern**: Load balancer routing across multiple aggregated worker instances.
+
+**Recommended for**:
+- Medium traffic requiring high availability
+- Workloads that need horizontal scaling
+- Load balancing without the complexity of disaggregation
+
+**Tradeoffs**:
+- Better scalability than plain aggregated
+- High availability through multiple replicas
+- Still has the GPU underutilization issues of aggregated serving
+- More complex than plain aggregated but simpler than disaggregated
+
+**Example**: [`agg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
+
+### Disaggregated Serving (disagg_router.yaml)
+
+**Pattern**: Separate prefill and decode workers with specialized optimization.
+
+**Recommended for**:
+- Production-style deployments
+- High throughput requirements
+- Large models (70B+ parameters)
+- Maximum GPU utilization needed
+
+**Tradeoffs**:
+- Maximum performance and throughput
+- Better GPU utilization (prefill and decode specialized)
+- Independent scaling of prefill and decode
+- More complex setup and debugging
+- Requires understanding of prefill/decode separation
+
+**Example**: [`disagg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/disagg_router.yaml)
+
+### Quick Selection Guide
+
+Select the architecture pattern that best fits your use case and use it as your template.
+
+For example, when using the `vLLM` backend:
+
+- **Development / Testing**: Use [`agg.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml) as the base configuration.
+
+- **Production with Load Balancing**: Use [`agg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
+
+- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
+
+
+## Step 2: Customize the Template
+
+You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
+The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.
+
+It serves the following roles:
+1. OpenAI-Compatible HTTP Server
+   * Provides `/v1/chat/completions` endpoint
+   * Handles HTTP request/response formatting
+   * Supports streaming responses
+   * Validates incoming requests
+
+2. Service Discovery and Routing
+   * Auto-discovers backend workers via etcd
+   * Routes requests to the appropriate Processor/Worker components
+   * Handles load balancing between multiple workers
+
+3. Request Preprocessing
+   * Initial request validation
+   * Model name verification
+   * Request format standardization
+
+Next, pick a worker and specialize its configuration. For example:
+
+```yaml
+VllmWorker:         # vLLM-specific config
+  enforce-eager: true
+  enable-prefix-caching: true
+
+SglangWorker:       # SGLang-specific config
+  router-mode: kv
+  disagg-mode: true
+
+TrtllmWorker:       # TensorRT-LLM-specific config
+  engine-config: ./engine.yaml
+  kv-cache-transfer: ucx
+```
+
+Here's a template structure based on the examples:
+
+```yaml
+    YourWorker:
+      dynamoNamespace: your-namespace
+      componentType: worker
+      replicas: N
+      envFromSecret: your-secrets  # e.g., hf-token-secret
+      # Health checks for worker initialization
+      readinessProbe:
+        exec:
+          command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
+      resources:
+        requests:
+          gpu: "1"  # GPU allocation
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags
+```
+
+Consult the corresponding `.sh` file. Each Python command that launches a component goes into your YAML spec under `extraPodSpec` -> `mainContainer` -> `args`.
+
+The Frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
+Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
+If you are a Dynamo contributor, see the [dynamo run guide](../../reference/cli.md) for details on how to run this command.
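+
+As a minimal sketch, the Frontend launch command above could sit in a service entry like the following. The field layout mirrors the worker template; `your-namespace` and `your-image` are placeholders, and the two flags are the optional ones from the launch command, not required settings.
+
+```yaml
+    Frontend:
+      dynamoNamespace: your-namespace   # placeholder
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: your-image             # placeholder
+          command:
+            - /bin/sh
+            - -c
+          args:
+            # Optional flags from the launch command; drop them to use defaults
+            - python3 -m dynamo.frontend --http-port 8000 --router-mode kv
+```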
+
+
+## Step 3: Key Customization Points
+
+### Model Configuration
+
+```yaml
+   args:
+     - "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flag"
+```
+
+### Resource Allocation
+
+```yaml
+   resources:
+     requests:
+       cpu: "N"
+       memory: "NGi"
+       gpu: "N"
+```
+
+### Scaling
+
+```yaml
+   replicas: N  # Number of worker instances
+```
+
+### Routing Mode
+```yaml
+   args:
+     - --router-mode
+     - kv  # Enable KV-cache routing
+```
+
+### Worker Specialization
+
+```yaml
+   args:
+     - --is-prefill-worker  # For disaggregated prefill workers
+```
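+
+Taken together, the customization points above might combine into a single worker entry like this sketch. All concrete values (replica count, CPU, memory, namespace, image) are illustrative placeholders, not recommendations. `--router-mode kv` is left out on the assumption that it is passed to the Frontend, as in the Frontend launch command.
+
+```yaml
+    YourWorker:
+      dynamoNamespace: your-namespace   # placeholder
+      componentType: worker
+      replicas: 2                       # Scaling: two worker instances
+      resources:
+        requests:
+          cpu: "8"                      # illustrative value
+          memory: "32Gi"                # illustrative value
+          gpu: "1"
+      extraPodSpec:
+        mainContainer:
+          image: your-image             # placeholder
+          command:
+            - /bin/sh
+            - -c
+          args:
+            # Model configuration plus worker specialization in one command
+            - python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --is-prefill-worker
+```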
+
+### Image Pull Secret Configuration
+
+#### Automatic Discovery and Injection
+
+By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace and matches their registry hostnames to the container image URLs, automatically injecting the appropriate secrets into the pod's `imagePullSecrets`.
+
+**Disabling Automatic Discovery:**
+To disable this behavior for a component and manually control image pull secrets:
+
+```yaml
+    YourWorker:
+      dynamoNamespace: your-namespace
+      componentType: worker
+      annotations:
+        nvidia.com/disable-image-pull-secret-discovery: "true"
+```
+
+When disabled, you can manually specify secrets as you would for a normal pod spec via:
+```yaml
+    YourWorker:
+      dynamoNamespace: your-namespace
+      componentType: worker
+      annotations:
+        nvidia.com/disable-image-pull-secret-discovery: "true"
+      extraPodSpec:
+        imagePullSecrets:
+          - name: my-registry-secret
+          - name: another-secret
+        mainContainer:
+          image: your-image
+```
+
+When left enabled, automatic discovery eliminates the need to manually configure image pull secrets for each deployment.
+
+## Step 6: Deploy LoRA Adapters (Optional)
+
+After your base model deployment is running, you can deploy LoRA adapters using the `DynamoModel` custom resource. This lets you serve fine-tuned variants of your models without modifying the base deployment.
+
+To add a LoRA adapter to your deployment, link it using `modelRef` in your worker configuration:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  services:
+    Worker:
+      modelRef:
+        name: Qwen/Qwen3-0.6B  # Base model identifier
+      componentType: worker
+      # ... rest of worker config
+```
+
+Then create a `DynamoModel` resource for your LoRA:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: my-lora
+spec:
+  modelName: my-custom-lora
+  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name above
+  modelType: lora
+  source:
+    uri: s3://my-bucket/loras/my-lora
+```
+
+**For complete details on managing models and LoRA adapters, see:**
+📖 **[Managing Models with DynamoModel Guide](dynamomodel-guide.md)**
--- a/fern/pages/kubernetes/deployment/dynamomodel-guide.md
+++ b/fern/pages/kubernetes/deployment/dynamomodel-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Managing Models with DynamoModel"
+---
+
+## Overview
+
+`DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to:
+
+- **Deploy LoRA adapters** on top of running base models
+- **Track model endpoints** and their readiness across your cluster
+- **Manage model lifecycle** declaratively with Kubernetes
+
+DynamoModel works alongside `DynamoGraphDeployment` (DGD) or `DynamoComponentDeployment` (DCD) resources. While DGD/DCD deploy the inference infrastructure (pods, services), DynamoModel handles model-specific operations like loading LoRA adapters.
+
+## Quick Start
+
+### Prerequisites
+
+Before creating a DynamoModel, you need:
+
+1. A running `DynamoGraphDeployment` or `DynamoComponentDeployment`
+2. Components configured with `modelRef` pointing to your base model
+3. Pods that are ready and serving your base model
+
+For complete setup including DGD configuration, see [Integration with DynamoGraphDeployment](#integration-with-dynamographdeployment).
+
+### Deploy a LoRA Adapter
+
+**1. Create your DynamoModel:**
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: my-lora
+  namespace: dynamo-system
+spec:
+  modelName: my-custom-lora
+  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in your DGD
+  modelType: lora
+  source:
+    uri: s3://my-bucket/loras/my-lora
+```
+
+**2. Apply and verify:**
+
+```bash
+# Apply the DynamoModel
+kubectl apply -f my-lora.yaml
+
+# Check status
+kubectl get dynamomodel my-lora
+```
+
+**Expected output:**
+```
+NAME      TOTAL   READY   AGE
+my-lora   2       2       30s
+```
+
+That's it! The operator automatically discovers endpoints and loads the LoRA.
+
+For detailed status monitoring, see [Monitoring & Operations](#monitoring--operations).
+
+## Understanding DynamoModel
+
+### Model Types
+
+DynamoModel supports three model types:
+
+| Type | Description | Use Case |
+|------|-------------|----------|
+| **`base`** | Reference to an existing base model | Tracking endpoints for a base model (default) |
+| **`lora`** | LoRA adapter that extends a base model | Deploy fine-tuned adapters on existing models |
+| **`adapter`** | Generic model adapter | Future extensibility for other adapter types |
+
+Most users will use **`lora`** to deploy fine-tuned models on top of their base model deployments.
+
+### How It Works
+
+When you create a DynamoModel, the operator:
+
+1. **Discovers endpoints**: Finds all pods running your `baseModelName` (by matching `modelRef.name` in DGD/DCD)
+2. **Creates service**: Automatically creates a Kubernetes Service to track these pods
+3. **Loads LoRA**: Calls the LoRA load API on each endpoint (for `lora` type)
+4. **Updates status**: Reports which endpoints are ready
+
+**Key linkage:**
+```yaml
+# DGD modelRef.name ↔ DynamoModel baseModelName must match
+Worker:
+  modelRef:
+    name: Qwen/Qwen3-0.6B
+---
+spec:
+  baseModelName: Qwen/Qwen3-0.6B
+```
+
+## Configuration Overview
+
+DynamoModel requires just a few key fields to deploy a model or adapter:
+
+| Field | Required | Purpose | Example |
+|-------|----------|---------|---------|
+| `modelName` | Yes | Model identifier | `my-custom-lora` |
+| `baseModelName` | Yes | Links to DGD modelRef | `Qwen/Qwen3-0.6B` |
+| `modelType` | No | Type: base/lora/adapter | `lora` (default: `base`) |
+| `source.uri` | For LoRA | Model location | `s3://bucket/path` or `hf://org/model` |
+
+**Example minimal LoRA configuration:**
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: my-lora
+spec:
+  modelName: my-custom-lora
+  baseModelName: Qwen/Qwen3-0.6B
+  modelType: lora
+  source:
+    uri: s3://my-bucket/my-lora
+```
+
+**For complete field specifications, validation rules, and all options, see:**
+📖 [DynamoModel API Reference](../api-reference.md#dynamomodel)
+
+### Status Summary
+
+The status shows discovered endpoints and their readiness:
+
+```bash
+kubectl get dynamomodel my-lora
+```
+
+**Key status fields:**
+- `totalEndpoints` / `readyEndpoints`: Counts of discovered vs ready endpoints
+- `endpoints[]`: List with addresses, pod names, and ready status
+- `conditions`: Standard Kubernetes conditions (EndpointsReady, ServicesFound)
+
+For detailed status usage, see the [Monitoring & Operations](#monitoring--operations) section below.
+
+## Common Use Cases
+
+### Use Case 1: S3-Hosted LoRA Adapter
+
+Deploy a LoRA adapter stored in an S3 bucket.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: customer-support-lora
+  namespace: production
+spec:
+  modelName: customer-support-adapter-v1
+  baseModelName: meta-llama/Llama-3.3-70B-Instruct
+  modelType: lora
+  source:
+    uri: s3://my-models-bucket/loras/customer-support/v1
+```
+
+**Prerequisites:**
+- S3 bucket accessible from your pods (IAM role or credentials)
+- Base model `meta-llama/Llama-3.3-70B-Instruct` running via DGD/DCD
+
+**Verification:**
+```bash
+# Check LoRA is loaded
+kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.readyEndpoints}'
+# Should output: 2 (or your number of replicas)
+
+# View which pods are serving
+kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.endpoints[*].podName}'
+```
+
+### Use Case 2: HuggingFace-Hosted LoRA
+
+Deploy a LoRA adapter from HuggingFace Hub.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: multilingual-lora
+  namespace: dynamo-system
+spec:
+  modelName: multilingual-adapter
+  baseModelName: Qwen/Qwen3-0.6B
+  modelType: lora
+  source:
+    uri: hf://myorg/qwen-multilingual-lora@v1.0.0  # Optional: @revision
+```
+
+**Prerequisites:**
+- HuggingFace Hub accessible from your pods
+- If private repo: HF token configured as secret and mounted in pods
+- Base model `Qwen/Qwen3-0.6B` running via DGD/DCD
+
+**With HuggingFace token:**
+```yaml
+# In your DGD/DCD
+spec:
+  services:
+    worker:
+      envFromSecret: hf-token-secret  # Provides HF_TOKEN env var
+      modelRef:
+        name: Qwen/Qwen3-0.6B
+      # ... rest of config
+```
+
+### Use Case 3: Multiple LoRAs on Same Base Model
+
+Deploy multiple LoRA adapters on the same base model deployment.
+
+```yaml
+---
+# LoRA for customer support
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: support-lora
+spec:
+  modelName: support-adapter
+  baseModelName: Qwen/Qwen3-0.6B
+  modelType: lora
+  source:
+    uri: s3://models/support-lora
+
+---
+# LoRA for code generation
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: code-lora
+spec:
+  modelName: code-adapter
+  baseModelName: Qwen/Qwen3-0.6B  # Same base model
+  modelType: lora
+  source:
+    uri: s3://models/code-lora
+```
+
+Both LoRAs will be loaded on all pods serving `Qwen/Qwen3-0.6B`. Your application can then route requests to the appropriate adapter.
+
+## Monitoring & Operations
+
+### Checking Status
+
+**Quick status check:**
+```bash
+kubectl get dynamomodel
+```
+
+**Example output:**
+```
+NAME              TOTAL   READY   AGE
+my-lora           2       2       5m
+customer-lora     4       3       2h
+```
+
+**Detailed status:**
+```bash
+kubectl describe dynamomodel my-lora
+```
+
+**Example output:**
+```
+Name:         my-lora
+Namespace:    dynamo-system
+Spec:
+  Model Name:       my-custom-lora
+  Base Model Name:  Qwen/Qwen3-0.6B
+  Model Type:       lora
+  Source:
+    Uri:  s3://my-bucket/my-lora
+Status:
+  Ready Endpoints:  2
+  Total Endpoints:  2
+  Endpoints:
+    Address:   http://10.0.1.5:9090
+    Pod Name:  worker-0
+    Ready:     true
+    Address:   http://10.0.1.6:9090
+    Pod Name:  worker-1
+    Ready:     true
+  Conditions:
+    Type:     EndpointsReady
+    Status:   True
+    Reason:   EndpointsDiscovered
+Events:
+  Type    Reason              Message
+  ----    ------              -------
+  Normal  EndpointsReady      Discovered 2 ready endpoints for base model Qwen/Qwen3-0.6B
+```
+
+### Understanding Readiness
+
+An endpoint is **ready** when:
+1. The pod is running and healthy
+2. The LoRA load API call succeeded
+
+**Condition states:**
+- `EndpointsReady=True`: All endpoints are ready (full availability)
+- `EndpointsReady=False, Reason=NotReady`: Not all endpoints ready (check message for counts)
+- `EndpointsReady=False, Reason=NoEndpoints`: No endpoints found
+
+When `readyEndpoints < totalEndpoints`, the operator automatically retries loading every 30 seconds.
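+
+For illustration, a DynamoModel in this partially ready state might report a status like the sketch below. The field names follow the status fields described in this guide. The addresses and pod names are made-up values, and the exact layout of a real status object may differ.
+
+```yaml
+status:
+  totalEndpoints: 2
+  readyEndpoints: 1          # fewer than totalEndpoints, so the operator keeps retrying
+  endpoints:
+  - address: http://10.0.1.5:9090
+    podName: worker-0
+    ready: true
+  - address: http://10.0.1.6:9090
+    podName: worker-1
+    ready: false             # LoRA load has not yet succeeded on this endpoint
+  conditions:
+  - type: EndpointsReady
+    status: "False"
+    reason: NotReady
+```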
+
+### Viewing Endpoints
+
+**Get endpoint addresses:**
+```bash
+kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].address}' | tr ' ' '\n'
+```
+
+**Output:**
+```
+http://10.0.1.5:9090
+http://10.0.1.6:9090
+```
+
+**Get endpoint pod names:**
+```bash
+kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].podName}' | tr ' ' '\n'
+```
+
+**Check readiness of each endpoint:**
+```bash
+kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | {podName, ready}'
+```
+
+**Output:**
+```json
+{
+  "podName": "worker-0",
+  "ready": true
+}
+{
+  "podName": "worker-1",
+  "ready": true
+}
+```
+
+### Updating a Model
+
+To update a LoRA (e.g., deploy a new version):
+
+```bash
+# Edit the source URI
+kubectl edit dynamomodel my-lora
+
+# Or apply an updated YAML
+kubectl apply -f my-lora-v2.yaml
+```
+
+The operator will detect the change and reload the LoRA on all endpoints.
+
+### Deleting a Model
+
+```bash
+kubectl delete dynamomodel my-lora
+```
+
+For LoRA models, the operator will:
+1. Unload the LoRA from all endpoints
+2. Clean up associated resources
+3. Remove the DynamoModel CR
+
+The base model deployment (DGD/DCD) continues running normally.
+
+## Troubleshooting
+
+### No Endpoints Found
+
+**Symptom:**
+```yaml
+status:
+  totalEndpoints: 0
+  readyEndpoints: 0
+  conditions:
+  - type: EndpointsReady
+    status: "False"
+    reason: NoEndpoints
+    message: "No endpoint slices found for base model Qwen/Qwen3-0.6B"
+```
+
+**Common Causes:**
+
+1. **Base model deployment not running**
+   ```bash
+   # Check if pods exist
+   kubectl get pods -l nvidia.com/dynamo-component-type=worker
+   ```
+   **Solution:** Deploy your DGD/DCD first, wait for pods to be ready.
+
+2. **`baseModelName` mismatch**
+   ```bash
+   # Check modelRef in your DGD
+   kubectl get dynamographdeployment my-deployment -o yaml | grep -A2 modelRef
+   ```
+   **Solution:** Ensure `baseModelName` in DynamoModel exactly matches `modelRef.name` in DGD.
+
+3. **Pods not ready**
+   ```bash
+   # Check pod status
+   kubectl get pods -l nvidia.com/dynamo-component-type=worker
+   ```
+   **Solution:** Wait for pods to reach `Running` and `Ready` state.
+
+4. **Wrong namespace**
+   **Solution:** Ensure DynamoModel is in the same namespace as your DGD/DCD.
+
+### LoRA Load Failures
+
+**Symptom:**
+```yaml
+status:
+  totalEndpoints: 2
+  readyEndpoints: 0  # ← No endpoints ready despite pods existing
+  conditions:
+  - type: EndpointsReady
+    status: "False"
+    reason: NoReadyEndpoints
+```
+
+**Common Causes:**
+
+1. **Source URI not accessible**
+   ```bash
+   # Check operator logs
+   kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f | grep "Failed to load"
+   ```
+   **Solution:**
+   - For S3: Verify bucket permissions, IAM role, credentials
+   - For HuggingFace: Verify token is valid, repo exists and is accessible
+
+2. **Invalid LoRA format**
+   **Solution:** Ensure your LoRA weights are in the format expected by your backend framework (vLLM, SGLang, etc.)
+
+3. **Endpoint API errors**
+   ```bash
+   # Check operator logs for HTTP errors
+   kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "error"
+   ```
+   **Solution:** Check the backend framework's logs in the worker pods:
+   ```bash
+   kubectl logs worker-0
+   ```
+
+4. **Out of memory**
+   **Solution:** LoRA adapters require additional memory. Increase memory limits in your DGD:
+   ```yaml
+   resources:
+     limits:
+       memory: "32Gi"  # Increase if needed
+   ```
+
+### Status Shows Not Ready
+
+**Symptom:**
+Some endpoints remain not ready for extended periods.
+
+**Diagnosis:**
+```bash
+# Check which endpoints are not ready
+kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | select(.ready == false)'
+
+# View operator logs for that specific pod
+kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "worker-0"
+
+# Check the worker pod logs
+kubectl logs worker-0 | tail -50
+```
+
+**Common Causes:**
+
+1. **Network issues**: Pod can't reach S3/HuggingFace
+2. **Resource constraints**: Pod is OOMing or being throttled
+3. **API endpoint not responding**: Backend framework isn't serving the LoRA API
+
+**When to wait vs investigate:**
+- **Wait**: If readyEndpoints is increasing over time (LoRAs loading progressively)
+- **Investigate**: If stuck at same readyEndpoints for >5 minutes
+
+### Viewing Events and Logs
+
+**Check events:**
+```bash
+kubectl describe dynamomodel my-lora | tail -20
+```
+
+**View operator logs:**
+```bash
+# Follow logs
+kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f
+
+# Filter for specific model
+kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "my-lora"
+```
+
+**Common events and messages:**
+
+| Event/Message | Meaning | Action |
+|---------------|---------|--------|
+| `EndpointsReady` | All endpoints are ready | ✅ Good - full service availability |
+| `NotReady` | Not all endpoints ready | ⚠️ Check readyEndpoints count - operator will retry |
+| `PartialEndpointFailure` | Some endpoints failed to load | Check logs for errors |
+| `NoEndpointsFound` | No pods discovered | Verify DGD running and modelRef matches |
+| `EndpointDiscoveryFailed` | Can't query endpoints | Check operator RBAC permissions |
+| `Successfully reconciled` | Reconciliation complete | ✅ Good |
+
+## Integration with DynamoGraphDeployment
+
+This section shows the complete end-to-end workflow for deploying base models and LoRA adapters together.
+
+DynamoModel and DynamoGraphDeployment work together to provide complete model deployment:
+
+- **DGD**: Deploys the infrastructure (pods, services, resources)
+- **DynamoModel**: Manages model-specific operations (LoRA loading)
+
+### Linking Models to Components
+
+The connection is established through the `modelRef` field in your DGD:
+
+**Complete example:**
+
+```yaml
+---
+# 1. Deploy the base model infrastructure
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  backendFramework: vllm
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      dynamoNamespace: my-app
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
+
+    Worker:
+      # This modelRef creates the link to DynamoModel
+      modelRef:
+        name: Qwen/Qwen3-0.6B  # ← Key linking field
+
+      componentType: worker
+      replicas: 2
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
+          args:
+            - --model
+            - Qwen/Qwen3-0.6B
+            - --tensor-parallel-size
+            - "1"
+
+---
+# 2. Deploy LoRA adapters on top
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: my-lora
+spec:
+  modelName: my-custom-lora
+  baseModelName: Qwen/Qwen3-0.6B  # ← Must match modelRef.name above
+  modelType: lora
+  source:
+    uri: s3://my-bucket/loras/my-lora
+```
+
+### Deployment Workflow
+
+**Recommended order:**
+
+```bash
+# 1. Deploy base model infrastructure
+kubectl apply -f my-deployment.yaml
+
+# 2. Wait for pods to be ready
+kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-component-type=worker --timeout=5m
+
+# 3. Deploy LoRA adapters
+kubectl apply -f my-lora.yaml
+
+# 4. Verify LoRA is loaded
+kubectl get dynamomodel my-lora
+```
+
+**What happens behind the scenes:**
+
+| Step | DGD | DynamoModel |
+|------|-----|-------------|
+| 1 | Creates pods with modelRef | - |
+| 2 | Pods become running and ready | - |
+| 3 | - | CR created, discovers endpoints via auto-created Service |
+| 4 | - | Calls LoRA load API on each endpoint |
+| 5 | - | All endpoints ready ✓ |
+
+The operator automatically handles all service discovery - you don't configure services, labels, or selectors manually.
+
+## API Reference
+
+For complete field specifications, validation rules, and detailed type definitions, see:
+
+**📖 [Dynamo CRD API Reference](../api-reference.md#dynamomodel)**
+
+## Summary
+
+DynamoModel provides declarative model management for Dynamo deployments:
+
+✅ **Simple**: 2-step deployment of LoRA adapters
+✅ **Automatic**: Endpoint discovery and loading handled by operator
+✅ **Observable**: Rich status reporting and conditions
+✅ **Integrated**: Works seamlessly with DynamoGraphDeployment
+
+**Next Steps:**
+- Try the [Quick Start](#quick-start) example
+- Explore [Common Use Cases](#common-use-cases)
+- Check the [API Reference](../api-reference.md#dynamomodel) for advanced configuration
+
--- a/fern/pages/kubernetes/deployment/minikube-setup.md
+++ b/fern/pages/kubernetes/deployment/minikube-setup.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Minikube Setup Guide"
+---
+
+Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run the Dynamo Kubernetes Platform locally.
+
+## 1. Install Minikube
+First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
+
+## 2. Configure GPU Support (Optional)
+Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
+
+<Tip>
+Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
+</Tip>
+
+
+## 3. Start Minikube
+Time to launch your local cluster!
+
+```bash
+# Start Minikube with GPU support (if configured)
+minikube start --driver docker --container-runtime docker --gpus all --memory=16000mb --cpus=8
+
+# Enable required addons
+minikube addons enable istio-provisioner
+minikube addons enable istio
+minikube addons enable storage-provisioner-rancher
+```
+
+## 4. Verify Installation
+Let's make sure everything is working correctly!
+
+```bash
+# Check Minikube status
+minikube status
+
+# Verify Istio installation
+kubectl get pods -n istio-system
+
+# Verify storage class
+kubectl get storageclass
+```
+
+## Next Steps
+
+Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](../installation-guide.md) to deploy the platform to your local cluster.
+
--- a/fern/pages/kubernetes/deployment/multinode-deployment.md
+++ b/fern/pages/kubernetes/deployment/multinode-deployment.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Multinode Deployment Guide"
+---
+
+This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
+
+## Overview
+
+Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to:
+
+- Distribute workloads across multiple physical nodes
+- Scale GPU resources beyond a single machine
+- Support large models requiring extensive tensor parallelism
+- Achieve high availability and fault tolerance
+
+## Basic requirements
+
+- **Kubernetes Cluster**: Version 1.24 or later
+- **GPU Nodes**: Multiple nodes with NVIDIA GPUs
+- **High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance)
+
+
+### Advanced Multinode Orchestration
+
+#### Using Grove (default)
+
+For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
+
+- **[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads
+- **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale
+
+These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
+
+**Features Enabled with Grove:**
+- Declarative composition of AI workloads
+- Multi-level horizontal auto-scaling
+- Custom startup ordering for components
+- Resource-aware rolling updates
+
+
+[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale.
+
+**Features Enabled with KAI-Scheduler:**
+- Gang scheduling
+- Network topology-aware pod placement
+- AI workload-optimized scheduling algorithms
+- GPU resource awareness and allocation
+- Support for complex scheduling constraints
+- Integration with Grove for enhanced capabilities
+- Performance optimizations for large-scale deployments
+
+
+##### Prerequisites
+
+- [Grove](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) installed on the cluster
+- (Optional) [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with the default queue name `dynamo` created. If no queue annotation is specified on the DGD resource, the operator uses the `dynamo` queue by default. Custom queue names can be specified via the `nvidia.com/kai-scheduler-queue` annotation, but the queue must exist in the cluster before deployment.
+
+KAI-Scheduler is optional but recommended for advanced scheduling capabilities.
+
+#### Using LWS and Volcano
+
+LWS (LeaderWorkerSet) is a simple multinode deployment mechanism that lets you deploy a workload across multiple nodes.
+
+- **LWS**: [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
+- **Volcano**: [Volcano Installation](https://volcano.sh/en/docs/installation/)
+
+Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.
+
+
+## Core Concepts
+
+### Orchestrator Selection Algorithm
+
+Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic:
+
+#### When Both Grove and LWS are Available:
+- **Grove is selected by default** (recommended for advanced AI workloads)
+- **LWS is selected** if you explicitly set `nvidia.com/enable-grove: "false"` annotation on your DGD resource
+
+#### When Only One Orchestrator is Available:
+- The installed orchestrator (Grove or LWS) is automatically selected
+
+#### Scheduler Integration:
+- **With Grove**: Automatically integrates with [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) when available, providing:
+  - Advanced queue management via `nvidia.com/kai-scheduler-queue` annotation
+  - AI-optimized scheduling policies
+  - Resource-aware workload placement
+- **With LWS**: Uses Volcano scheduler for gang scheduling and resource coordination
+
+#### Configuration Examples:
+
+**Default (Grove with KAI-Scheduler):**
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-multinode-deployment
+  annotations:
+    nvidia.com/kai-scheduler-queue: "dynamo"
+spec:
+  # ... your deployment spec
+```
+
+> **Note:** The `nvidia.com/kai-scheduler-queue` annotation defaults to `"dynamo"`. If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with `kubectl get queues`.
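+
+A sketch of the custom-queue case, assuming a queue named `my-team-queue` (a hypothetical name) has already been created in the cluster:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-multinode-deployment
+  annotations:
+    # Hypothetical queue name; the queue must exist before you deploy
+    nvidia.com/kai-scheduler-queue: "my-team-queue"
+spec:
+  # ... your deployment spec
+```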
+
+**Force LWS usage:**
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-multinode-deployment
+  annotations:
+    nvidia.com/enable-grove: "false"
+spec:
+  # ... your deployment spec
+```
+
+
+### The `multinode` Section
+
+The `multinode` section in a resource specification defines how many physical nodes the workload should span:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-multinode-deployment
+spec:
+  # ... your deployment spec
+  services:
+    my-service:
+      ...
+      multinode:
+        nodeCount: 2
+      resources:
+        limits:
+          gpu: "2"            # 2 GPUs per node
+```
+
+### GPU Distribution
+
+The relationship between `multinode.nodeCount` and `gpu` is multiplicative:
+
+- **`multinode.nodeCount`**: Number of physical nodes
+- **`gpu`**: Number of GPUs per node
+- **Total GPUs**: `multinode.nodeCount × gpu`
+
+**Example:**
+- `multinode.nodeCount: 2` × `gpu: "4"` = 8 total GPUs (4 GPUs per node across 2 nodes)
+- `multinode.nodeCount: 4` × `gpu: "8"` = 32 total GPUs (8 GPUs per node across 4 nodes)
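+
+As a worked sketch of the second example, the fragment below requests 8 GPUs on each of 4 nodes. The service name `my-service` is a placeholder.
+
+```yaml
+  services:
+    my-service:
+      multinode:
+        nodeCount: 4          # 4 physical nodes
+      resources:
+        limits:
+          gpu: "8"            # 8 GPUs per node
+      # Total GPUs = multinode.nodeCount × gpu = 4 × 8 = 32
+```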
+
+### Tensor Parallelism Alignment
+
+The tensor parallelism (`--tp-size` or `--tp`) in your command/args must match the total number of GPUs:
+
+```yaml
+# Example: 2 multinode.nodeCount × 4 GPUs = 8 total GPUs
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-multinode-deployment
+spec:
+  # ... your deployment spec
+  services:
+    my-service:
+      ...
+      multinode:
+        nodeCount: 2
+      resources:
+        limits:
+          gpu: "4"
+      extraPodSpec:
+        mainContainer:
+          ...
+          args:
+            # Command args must use tp-size=8
+            - "--tp-size"
+            - "8"  # Must equal multinode.nodeCount × gpu
+
+```
+
+
+## Backend-Specific Operator Behavior
+
+When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps troubleshoot issues and optimize your deployments.
+
+### vLLM Backend
+
+For vLLM multinode deployments, the operator automatically selects and configures the appropriate distributed execution mode based on your parallelism settings:
+
+#### Deployment Modes
+
+The operator automatically determines the deployment mode based on your parallelism configuration:
+
+**1. Tensor/Pipeline Parallelism Mode (Single model across nodes)**
+- **When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
+- **Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
+
+The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
+
+**Leader Node:**
+- **Command**: `ray start --head --port=6379 && <original-vllm-command> --distributed-executor-backend ray`
+- **Behavior**: Starts Ray head node, then runs vLLM which creates a placement group spanning all Ray workers
+- **Probes**: All health probes remain active (liveness, readiness, startup)
+
+**Worker Nodes:**
+- **Command**: `ray start --address=<leader-hostname>:6379 --block`
+- **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
+- **Probes**: All probes (liveness, readiness, startup) are automatically removed
+
+> **Note**: vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
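+
+To make the mode condition concrete, here is a hedged sketch of a service that would fall into this mode under the rule above. The service name and model are placeholders, and pipeline parallelism is assumed to be left at 1. With 8 GPUs per node, `world_size = 16 × 1 = 16` exceeds `GPUs_per_node = 8`, so the single model instance spans both nodes.
+
+```yaml
+  services:
+    my-service:
+      multinode:
+        nodeCount: 2
+      resources:
+        limits:
+          gpu: "8"                  # GPUs_per_node = 8
+      extraPodSpec:
+        mainContainer:
+          args:
+            - --model
+            - YOUR_MODEL            # placeholder
+            - --tensor-parallel-size
+            - "16"                  # world_size = 16 > 8, and 16 = nodeCount × gpu
+```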
+
+**2. Data Parallel Mode (Multiple model instances across nodes)**
+- **When used**: When `world_size × data_parallel_size > GPUs_per_node`
+- **Use case**: Running multiple independent model instances across nodes with data parallelism (e.g., MoE models with expert parallelism)
+
+**All Nodes (Leader and Workers):**
+- **Injected Flags**:
+  - `--data-parallel-address <leader-hostname>` - Address of the coordination server
+  - `--data-parallel-size-local <value>` - Number of data parallel workers per node
+  - `--data-parallel-rpc-port 13445` - RPC port for data parallel coordination
+  - `--data-parallel-start-rank <value>` - Starting rank for this node (calculated automatically)
+- **Probes**: Worker probes are removed; leader probes remain active
+
+**Note:** The operator injects these flags into your command regardless of its structure (direct Python commands or shell wrappers).
+
+#### Why Ray for Multi-Node TP/PP?
+
+vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments:
+
+- **Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything.
+- **mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate.
+
+The Dynamo operator uses Ray because:
+1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`)
+2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents
+3. vLLM automatically handles placement group creation and worker management
+
+#### Compilation Cache Support
+
+When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
+- **`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
+
+### SGLang Backend
+
+For SGLang multinode deployments, the operator injects distributed initialization flags:
+
+#### Leader Node
+- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank 0`
+- **Probes**: All health probes remain active
+
+#### Worker Nodes
+- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank <dynamic-rank>`
+  - The `node-rank` is automatically determined from the pod's stateful identity
+- **Probes**: All probes (liveness, readiness, startup) are automatically removed
+
+**Note:** The operator intelligently injects these flags regardless of your command structure (direct Python commands or shell wrappers).
+
+### TensorRT-LLM Backend
+
+For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:
+
+#### Leader Node
+- **SSH Configuration**: Automatically sets up SSH keys and configuration from a Kubernetes secret
+- **MPI Command**: Wraps your command in an `mpirun` command with:
+  - Proper host list including all worker nodes
+  - SSH configuration for passwordless authentication on port 2222
+  - Environment variable propagation to all nodes
+  - Activation of the Dynamo virtual environment
+- **Probes**: All health probes remain active
+
+#### Worker Nodes
+- **SSH Daemon**: Replaces your command with SSH daemon setup and execution
+  - Generates host keys in user-writable directories (non-privileged)
+  - Configures SSH daemon to listen on port 2222
+  - Sets up authorized keys for leader access
+- **Probes**:
+  - **Liveness and Startup**: Removed (workers run SSH daemon, not the main application)
+  - **Readiness**: Replaced with TCP socket check on SSH port 2222
+    - Initial Delay: 20 seconds
+    - Period: 20 seconds
+    - Timeout: 5 seconds
+    - Failure Threshold: 10
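+
+Expressed as a standard Kubernetes probe, the replacement readiness check would look roughly like the sketch below. The field names are the regular `readinessProbe` fields; the exact manifest the operator generates is not shown in this guide, so treat this as an illustration of the values listed above rather than the operator's literal output.
+
+```yaml
+# Sketch of the worker readiness probe described above (illustrative only)
+readinessProbe:
+  tcpSocket:
+    port: 2222            # SSH daemon port on the worker
+  initialDelaySeconds: 20
+  periodSeconds: 20
+  timeoutSeconds: 5
+  failureThreshold: 10
+```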
+
+#### Additional Configuration
+- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
+- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-<deployment-name>`)
+
+**Important:** TensorRT-LLM requires an SSH keypair secret to be created before deployment. The secret name follows the pattern `mpirun-ssh-key-<component-name>`.
+
+### Compilation Cache Configuration
+
+The operator supports compilation cache volumes for backend-specific optimization:
+
+| Backend | Support Level | Environment Variables | Default Mount Point |
+|---------|--------------|----------------------|---------------------|
+| vLLM | Fully Supported | `VLLM_CACHE_ROOT` | User-specified |
+| SGLang | Partial Support | _None (pending upstream)_ | User-specified |
+| TensorRT-LLM | Partial Support | _None (pending upstream)_ | User-specified |
+
+To enable compilation cache, add a volume mount with `useAsCompilationCache: true` in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.
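+
+As a minimal sketch, a compilation cache mount could look like the fragment below. The `pvcs` and `volumeMounts` layout follows the DGD examples elsewhere in this documentation; the PVC name, size, service name, and mount path are hypothetical placeholders, and the exact schema should be checked against the CRD API reference.
+
+```yaml
+# Illustrative fragment of a DynamoGraphDeployment spec (names and paths are assumptions)
+spec:
+  pvcs:
+    - name: vllm-compilation-cache        # hypothetical PVC name
+      size: 20Gi                          # hypothetical size
+  services:
+    VllmWorker:                           # hypothetical service name
+      volumeMounts:
+        - name: vllm-compilation-cache
+          mountPoint: /cache/vllm         # hypothetical mount path
+          useAsCompilationCache: true     # for vLLM, the operator sets VLLM_CACHE_ROOT to the mount point
+```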
+
+## Next Steps
+
+For additional support and examples, see the working multinode configurations in:
+
+- **SGLang**: [examples/backends/sglang/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/)
+- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/)
+- **vLLM**: [examples/backends/vllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/)
+
+These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
--- a/fern/pages/kubernetes/dynamo-operator.md
+++ b/fern/pages/kubernetes/dynamo-operator.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Working with Dynamo Kubernetes Operator"
+---
+
+## Overview
+
+The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It continuously reconciles custom resources so that the cluster converges on the state you declare. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
+
+## Architecture
+
+- **Operator Deployment:**
+  Deployed as a Kubernetes `Deployment` in a specific namespace.
+
+- **Controllers:**
+  - `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
+  - `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.
+  - `DynamoModelController`: Watches `DynamoModel` CRs and manages model lifecycle (e.g., loading LoRA adapters).
+
+- **Workflow:**
+  1. A custom resource is created by the user or API server.
+  2. The corresponding controller detects the change and runs reconciliation.
+  3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
+  4. Status fields are updated to reflect the current state.
+
+## Deployment Modes
+
+The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases:
+
+### 1. Cluster-Wide Mode (Default)
+
+The operator monitors and manages DynamoGraph resources across **all namespaces** in the cluster.
+
+**When to Use:**
+- You have full cluster admin access
+- You want centralized management of all Dynamo workloads
+- Standard production deployment on a dedicated cluster
+
+---
+
+### 2. Namespace-Scoped Mode
+
+The operator monitors and manages DynamoGraph resources **only in a specific namespace**. A lease marker is created to signal the operator's presence to any cluster-wide operators.
+
+**When to Use:**
+- You're on a shared/multi-tenant cluster
+- You only have namespace-level permissions
+- You want to test a new operator version in isolation
+- You need to avoid conflicts with other operators
+
+**Installation:**
+```bash
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
+  --namespace my-namespace \
+  --create-namespace \
+  --set dynamo-operator.namespaceRestriction.enabled=true
+```
+
+---
+
+### 3. Hybrid Mode
+
+A **cluster-wide operator** manages most namespaces, while **one or more namespace-scoped operators** run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.
+
+**When to Use:**
+- Running production workloads with a stable operator version
+- Testing new operator versions in isolated namespaces without affecting production
+- Gradual rollout of operator updates
+- Development/staging environments on production clusters
+
+**How It Works:**
+1. Namespace-scoped operator creates a lease named `dynamo-operator-namespace-scope` in its namespace
+2. Cluster-wide operator watches for these lease markers across all namespaces
+3. Cluster-wide operator automatically excludes any namespace with a lease marker
+4. If namespace-scoped operator stops, its lease expires (TTL: 30s by default)
+5. Cluster-wide operator automatically resumes managing that namespace
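+
+For orientation, the lease marker is an ordinary Kubernetes `Lease` object. A rough sketch of what it might contain is shown below; the namespace and the `holderIdentity` value are hypothetical, and mapping the 30s TTL to `leaseDurationSeconds` is an assumption rather than something this guide states.
+
+```yaml
+# Sketch of a namespace-scope lease marker (values are illustrative assumptions)
+apiVersion: coordination.k8s.io/v1
+kind: Lease
+metadata:
+  name: dynamo-operator-namespace-scope
+  namespace: test-namespace               # namespace of the namespace-scoped operator
+spec:
+  holderIdentity: <operator-identity>     # placeholder; the actual format is set by the operator
+  leaseDurationSeconds: 30                # assumed to correspond to the default 30s TTL
+```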
+
+**Setup Example:**
+
+```bash
+# 1. Install cluster-wide operator (production, v1.0.0)
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
+  --namespace dynamo-system \
+  --create-namespace
+
+# 2. Install namespace-scoped operator (testing, v2.0.0-beta)
+helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \
+  --namespace test-namespace \
+  --create-namespace \
+  --set dynamo-operator.namespaceRestriction.enabled=true \
+  --set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta
+```
+
+**Observability:**
+
+```bash
+# List all namespaces with local operators
+kubectl get lease -A --field-selector metadata.name=dynamo-operator-namespace-scope
+
+# Check which operator version is running in a namespace
+kubectl get lease -n my-namespace dynamo-operator-namespace-scope \
+  -o jsonpath='{.spec.holderIdentity}'
+```
+
+
+## Custom Resource Definitions (CRDs)
+
+Dynamo provides the following Custom Resources:
+
+- **DynamoGraphDeployment (DGD)**: Deploys complete inference pipelines
+- **DynamoComponentDeployment (DCD)**: Deploys individual components
+- **DynamoModel**: Manages model lifecycle (e.g., loading LoRA adapters)
+
+For the complete technical API reference for Dynamo Custom Resource Definitions, see:
+
+**📖 [Dynamo CRD API Reference](api-reference.md)**
+
+For a user-focused guide on deploying and managing models with DynamoModel, see:
+
+**📖 [Managing Models with DynamoModel Guide](deployment/dynamomodel-guide.md)**
+
+## Webhooks
+
+The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation of custom resources before they are persisted to the cluster. Webhooks are **enabled by default** and ensure that invalid configurations are rejected immediately at the API server level.
+
+**Key Features:**
+- ✅ Shared certificate infrastructure across all webhook types
+- ✅ Automatic certificate generation (for testing/development)
+- ✅ cert-manager integration (for production)
+- ✅ Multi-operator support with lease-based coordination
+- ✅ Immutability enforcement for critical fields
+
+For complete documentation on webhooks, certificate management, and troubleshooting, see:
+
+**📖 [Webhooks Guide](webhooks.md)**
+
+## Installation
+
+### Quick Install with Helm
+
+```bash
+# Set environment
+export NAMESPACE=dynamo-system
+export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+
+# Install Platform (includes operator)
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
+```
+
+> **Note:** For shared/multi-tenant clusters or testing scenarios, see [Deployment Modes](#deployment-modes) above for namespace-scoped and hybrid configurations.
+
+### Building from Source
+
+```bash
+# Set environment
+export NAMESPACE=dynamo-system
+export DOCKER_SERVER=your-registry.com  # your container registry (no trailing slash)
+export IMAGE_TAG=latest
+
+# Build operator image
+cd deploy/operator
+docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG .
+docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
+cd -
+
+# Install CRDs
+cd deploy/helm/charts
+helm install dynamo-crds ./crds/ --namespace default
+
+# Install platform with custom operator image
+helm install dynamo-platform ./platform/ \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
+  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}"
+```
+
+For detailed installation options, see the [Installation Guide](installation-guide.md).
+
+
+## Development
+
+**Code Structure:**
+
+The operator is built using Kubebuilder and the Operator SDK, with the following structure:
+
+- `controllers/`: Reconciliation logic
+- `api/v1alpha1/`: CRD types
+- `config/`: Manifests and Helm charts
+
+
+## References
+
+- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
+- [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+- [Operator SDK](https://sdk.operatorframework.io/)
+- [Helm Best Practices for CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/)
--- a/fern/pages/kubernetes/fluxcd.md
+++ b/fern/pages/kubernetes/fluxcd.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "GitOps Deployment with FluxCD"
+---
+
+This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../backends/vllm/README.md) to demonstrate the workflow.
+
+## Prerequisites
+
+- A Kubernetes cluster with [Dynamo Kubernetes Platform](installation-guide.md) installed
+- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
+- A Git repository to store your deployment configurations
+
+## Workflow Overview
+
+The GitOps workflow for Dynamo deployments consists of three main steps:
+
+1. Build and push the Dynamo Operator
+2. Create and commit a DynamoGraphDeployment custom resource for initial deployment
+3. For subsequent updates, build a new version of the graph and update the CR
+
+## Step 1: Build and Push Dynamo Operator
+
+First, follow the [Installation Guide for Dynamo Kubernetes Platform](installation-guide.md) to install the platform, including the operator.
+
+## Step 2: Create Initial Deployment
+
+Create a new file in your Git repository (e.g., `deployments/llm-agg.yaml`) with the following content:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: llm-agg
+spec:
+  pvcs:
+    - name: vllm-model-storage
+      size: 100Gi
+  services:
+    Frontend:
+      replicas: 1
+      envs:
+      - name: SPECIFIC_ENV_VAR
+        value: some_specific_value
+    Processor:
+      replicas: 1
+      envs:
+      - name: SPECIFIC_ENV_VAR
+        value: some_specific_value
+    VllmWorker:
+      replicas: 1
+      envs:
+      - name: SPECIFIC_ENV_VAR
+        value: some_specific_value
+      # Add PVC for model storage
+      volumeMounts:
+        - name: vllm-model-storage
+          mountPoint: /models
+```
+
+Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial Dynamo deployment in your cluster.
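+
+For FluxCD to pick up the file, the repository has to be registered as a source and reconciled by a Kustomization. The sketch below uses the standard Flux `GitRepository` and `Kustomization` resources; the resource names, repository URL, branch, path, intervals, and target namespace are all hypothetical and should be adapted to your setup.
+
+```yaml
+# Illustrative Flux configuration (names, URL, and paths are assumptions)
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: GitRepository
+metadata:
+  name: dynamo-deployments
+  namespace: flux-system
+spec:
+  interval: 1m
+  url: https://github.com/<your-org>/<your-repo>
+  ref:
+    branch: main
+---
+apiVersion: kustomize.toolkit.fluxcd.io/v1
+kind: Kustomization
+metadata:
+  name: dynamo-deployments
+  namespace: flux-system
+spec:
+  interval: 5m
+  sourceRef:
+    kind: GitRepository
+    name: dynamo-deployments
+  path: ./deployments                     # directory containing llm-agg.yaml
+  prune: true
+  targetNamespace: <namespace-with-the-dynamo-operator>
+```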
+
+## Step 3: Update Existing Deployment
+
+To update your pipeline, edit the associated DynamoGraphDeployment CR in your Git repository, then commit and push the change.
+
+The Dynamo operator will automatically reconcile it.
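+
+As an illustration, scaling the worker from the Step 2 example could be a one-line change to the committed file. Only the modified part of `deployments/llm-agg.yaml` is shown, and the new replica count is an arbitrary example value.
+
+```yaml
+# Fragment of deployments/llm-agg.yaml after the edit (illustrative)
+spec:
+  services:
+    VllmWorker:
+      replicas: 2   # previously 1
+```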
+
+## Monitoring the Deployment
+
+You can monitor the deployment status using:
+
+```bash
+export NAMESPACE=<namespace-with-the-dynamo-operator>
+
+# Check the DynamoGraphDeployment status
+kubectl get dynamographdeployment llm-agg -n $NAMESPACE
+```
\ No newline at end of file
--- a/fern/pages/kubernetes/grove.md
+++ b/fern/pages/kubernetes/grove.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Grove Deployment Guide"
+---
+
+Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
+
+## Overview
+
+Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
+
+### How Grove Works for Disaggregated Serving
+
+Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
+
+- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
+- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
+- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
+- **Fault Isolation**: Issues in one component don't necessarily affect others
+
+## Core Components and API Resources
+
+Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
+
+### PodCliqueSet
+The top-level Grove object that defines a group of components managed and colocated together. Key features include:
+- Support for autoscaling
+- Topology-aware spread of replicas for availability
+- Unified management of multiple disaggregated components
+
+### PodClique
+Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
+- Independent configuration options
+- Custom scaling logic support
+- Role-specific resource allocation
+
+### PodCliqueScalingGroup
+A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
+
+## Key Capabilities for Disaggregated Serving
+
+Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
+
+### Flexible Gang Scheduling
+PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
+
+### Multi-level Horizontal Auto-Scaling
+Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
+
+### Network Topology-Aware Scheduling
+Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
+
+### Custom Startup Dependencies
+Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
+
+## Use Cases and Examples
+
+Grove specifically supports:
+
+- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
+- **Single-node disaggregated inference** for optimized resource utilization
+- **Agentic pipelines of models** for complex AI workflows
+- **Standard aggregated serving** patterns for single node or single GPU inference
+
+## Integration with NVIDIA Dynamo
+
+Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
+
+### Complementary Roles
+- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
+- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
+
+### Release Coordination
+Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
+
+### Unified AI Platform
+The integration creates a comprehensive platform where:
+- Grove manages complex orchestration of disaggregated components
+- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
+- Together they enable sophisticated AI serving architectures with simplified management
+
+## Architecture Benefits
+
+Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
+
+1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
+2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
+3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
+4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
+
+## Getting Started
+
+Grove relies on KAI Scheduler for resource allocation and scheduling.
+
+For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).
+
+For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
+
+For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](deployment/multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
+
+For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
+
+The Dynamo Kubernetes Platform can also install Grove and KAI Scheduler as part of the platform installation. See the [Installation Guide for Dynamo Kubernetes Platform](installation-guide.md) for more details.
\ No newline at end of file
--- a/fern/pages/kubernetes/installation-guide.md
+++ b/fern/pages/kubernetes/installation-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Installation Guide for Dynamo Kubernetes Platform"
+---
+
+Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.
+
+## Before You Start
+
+Determine your cluster environment:
+
+**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
+- CRDs already installed cluster-wide - skip CRD installation step
+- A cluster-wide Dynamo operator is likely already running
+- **Do NOT install another operator** - use the existing cluster-wide operator
+- Only install a namespace-restricted operator if you specifically need to prevent the cluster-wide operator from managing your namespace (e.g., testing operator features you're developing)
+
+**Dedicated Cluster** (full cluster admin access):
+- You install CRDs yourself
+- Can use cluster-wide operator (default)
+
+**Local Development** (Minikube, testing):
+- See [Minikube Setup](deployment/minikube-setup.md) first, then follow installation steps below
+
+To check if CRDs already exist:
+```bash
+kubectl get crd | grep dynamo
+# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
+```
+
+To check if a cluster-wide operator already exists:
+```bash
+# Check for cluster-wide operator and show its namespace
+kubectl get clusterrolebinding -o json | \
+  jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
+  "Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'
+
+# If a cluster-wide operator exists: Do NOT install another operator
+# Only install namespace-restricted mode if you specifically need namespace isolation
+```
+
+## Installation Paths
+
+The platform is installed using the Dynamo Kubernetes Platform [Helm chart](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md).
+
+**Path A: Pre-built Artifacts**
+- Use case: Production deployment, shared or dedicated clusters
+- Source: NGC published Helm charts
+- Time: ~10 minutes
+- Jump to: [Path A](#path-a-production-install)
+
+**Path B: Custom Build from Source**
+- Use case: Contributing to Dynamo, using latest features from main branch, customization
+- Requirements: Docker build environment
+- Time: ~30 minutes
+- Jump to: [Path B](#path-b-custom-build-from-source)
+
+The default chart values used by any `helm install` command can be overridden by passing your own values file:
+
+```bash
+helm install ... \
+  -f your-values.yaml
+```
+
+and/or setting values as flags to the helm install command, as follows:
+
+```bash
+helm install ... \
+  --set "your-value=your-value"
+```
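+
+As a sketch, a values file equivalent to the `--set` flags used elsewhere in this guide might look like the following. The keys mirror the flag paths shown in this guide (`dynamo-operator.namespaceRestriction.*`, `grove.enabled`, `kai-scheduler.enabled`); the file name and the chosen values are only examples.
+
+```yaml
+# your-values.yaml (illustrative; enable only what you need)
+dynamo-operator:
+  namespaceRestriction:
+    enabled: true
+    targetNamespace: dynamo-namespace   # optional
+grove:
+  enabled: true
+kai-scheduler:
+  enabled: true
+```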
+
+## Prerequisites
+
+Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access:
+
+### Required Tools
+
+| Tool | Minimum Version | Description | Installation |
+|------|-----------------|-------------|--------------|
+| **kubectl** | v1.24+ | Kubernetes command-line tool | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
+| **Helm** | v3.0+ | Kubernetes package manager | [Install Helm](https://helm.sh/docs/intro/install/) |
+| **Docker** | Latest | Container runtime (Path B only) | [Install Docker](https://docs.docker.com/get-docker/) |
+
+### Cluster and Access Requirements
+
+- **Kubernetes cluster v1.24+** with admin or namespace-scoped access
+- **Cluster type determined** (shared vs dedicated) — see [Before You Start](#before-you-start)
+- **CRD status checked** if on a shared cluster
+- **NGC credentials** (optional) — required only if pulling NVIDIA images from NGC
+
+### Verify Tools
+
+Run the following to confirm your tools are correctly installed:
+
+```bash
+# Verify tools and versions
+kubectl version --client  # Should show v1.24+
+helm version              # Should show v3.0+
+docker version            # Required for Path B only
+
+# Set your release version
+export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+```
+
+### Pre-Deployment Checks
+
+Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:
+
+```bash
+./deploy/pre-deployment/pre-deployment-check.sh
+```
+
+This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for details.
+
+> **No cluster?** See [Minikube Setup](deployment/minikube-setup.md) for local development.
+
+**Estimated installation time:** 5-30 minutes depending on path
+
+## Path A: Production Install
+
+Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).
+
+```bash
+# 1. Set environment
+export NAMESPACE=dynamo-system
+export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+
+# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
+helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
+
+# 3. Install Platform
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
+```
+
+**For Shared/Multi-Tenant Clusters:**
+
+If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:
+
+```bash
+# Add this flag to the helm install command above
+--set dynamo-operator.namespaceRestriction.enabled=true
+```
+
+Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
+
+If you see this validation error, you need namespace restriction:
+```
+VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
+Found existing namespace-restricted Dynamo operators in namespaces: ...
+```
+
+<Tip>
+For multinode deployments, you need to install multinode orchestration components:
+**Option 1 (Recommended): Grove + KAI Scheduler**
+- Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
+- When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
+```bash
+--set "grove.enabled=true"
+--set "kai-scheduler.enabled=true"
+```
+**Option 2: LeaderWorkerSet (LWS) + Volcano**
+- If using LWS for multinode deployments, you must also install Volcano (required dependency):
+- [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
+- [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS)
+- These must be installed manually before deploying multinode workloads with LWS.
+See the [Multinode Deployment Guide](deployment/multinode-deployment.md) for details on orchestrator selection.
+</Tip>
+
+<Tip>
+By default, Model Express Server is not used.
+If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command:
+</Tip>
+
+```bash
+--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
+```
+
+<Tip>
+By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces.
+If you wish to restrict the operator to monitor only a specific namespace (the helm release namespace by default), you can set the namespaceRestriction.enabled to true.
+You can also change the restricted namespace by setting the targetNamespace property.
+</Tip>
+
+```bash
+--set "dynamo-operator.namespaceRestriction.enabled=true"
+--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional
+```
+
+→ [Verify Installation](#verify-installation)
+
+## Path B: Custom Build from Source
+
+Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch.
+
+Note: This gives you access to the latest unreleased features and fixes on the main branch.
+
+```bash
+# 1. Set environment
+export NAMESPACE=dynamo-system
+export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo  # or your registry (no trailing slash)
+export DOCKER_USERNAME='$oauthtoken'
+export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
+export IMAGE_TAG=${RELEASE_VERSION}
+
+# 2. Build operator
+cd deploy/operator
+
+# 2.1 Alternative 1 : Build and push the operator image for multiple platforms
+docker buildx create --name multiplatform --driver docker-container --bootstrap
+docker buildx use multiplatform
+docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG --push .
+
+# 2.2 Alternative 2 : Build and push the operator image for a single platform
+docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
+
+cd -
+
+# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
+kubectl create namespace ${NAMESPACE}
+kubectl create secret docker-registry docker-imagepullsecret \
+  --docker-server=${DOCKER_SERVER} \
+  --docker-username=${DOCKER_USERNAME} \
+  --docker-password=${DOCKER_PASSWORD} \
+  --namespace=${NAMESPACE}
+
+cd deploy/helm/charts
+
+# 4. Install CRDs
+helm upgrade --install dynamo-crds ./crds/ --namespace default
+
+# 5. Install Platform
+helm dep build ./platform/
+
+# To install cluster-wide instead, set NS_RESTRICT_FLAGS="" (empty) or omit that line entirely.
+
+NS_RESTRICT_FLAGS="--set dynamo-operator.namespaceRestriction.enabled=true"
+helm install dynamo-platform ./platform/ \
+  --namespace "${NAMESPACE}" \
+  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
+  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
+  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" \
+  ${NS_RESTRICT_FLAGS}
+
+```
+
+→ [Verify Installation](#verify-installation)
+
+## Verify Installation
+
+```bash
+# Check CRDs
+kubectl get crd | grep dynamo
+
+# Check operator and platform pods
+kubectl get pods -n ${NAMESPACE}
+# Expected: dynamo-operator-*, etcd-*, and nats-* pods in Running state
+```
+
+## Next Steps
+
+1. **Deploy Model/Workflow**
+   ```bash
+   # Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
+   kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
+
+   # Port forward and test
+   kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
+   curl http://localhost:8000/v1/models
+   ```
+
+2. **Explore Backend Guides**
+   - [vLLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)
+   - [SGLang Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)
+   - [TensorRT-LLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)
+
+3. **Optional:**
+   - [Set up Prometheus & Grafana](observability/metrics.md)
+   - [SLA Planner Quickstart Guide](../planner/sla-planner-quickstart.md) (for SLA-aware scheduling and autoscaling)
+
+## Troubleshooting
+
+**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**
+
+```
+VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
+Found existing namespace-restricted Dynamo operators in namespaces: ...
+```
+
+Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.
+
+Solution: Add namespace restriction to your installation:
+```bash
+--set dynamo-operator.namespaceRestriction.enabled=true
+```
+
+Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
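+
+The same setting can also live in a Helm values file rather than on the command line. The fragment below is a sketch that mirrors the `--set` paths shown in this guide; the file name `values-restricted.yaml` is a hypothetical choice, and `targetNamespace` is optional:
+
+```yaml
+# values-restricted.yaml (hypothetical file name)
+dynamo-operator:
+  namespaceRestriction:
+    enabled: true
+    # Optional: when omitted, the Helm release namespace is used
+    targetNamespace: dynamo-namespace
+```
+
+Pass the file to `helm install` with `-f values-restricted.yaml`. Nesting the keys under `dynamo-operator` corresponds to the full `dynamo-operator.` prefix required by `--set`.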
+
+**CRDs already exist**
+
+Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).
+
+Solution: Skip the CRD installation step and proceed directly to platform installation.
+
+To check if CRDs exist:
+```bash
+kubectl get crd | grep dynamo
+```
+
+**Pods not starting?**
+```bash
+kubectl describe pod <pod-name> -n ${NAMESPACE}
+kubectl logs <pod-name> -n ${NAMESPACE}
+```
+
+**HuggingFace model access?**
+```bash
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
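+
+If you manage secrets declaratively, an equivalent manifest might look like the sketch below. It assumes the same secret name and key as the command above; the token value is a placeholder, and `stringData` lets you write it without base64 encoding:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hf-token-secret
+type: Opaque
+stringData:
+  # Placeholder: replace with your HuggingFace access token
+  HF_TOKEN: "<your-hf-token>"
+```
+
+Apply it to the same namespace with `kubectl apply -f <file> -n ${NAMESPACE}`, and avoid committing the file with a real token in it.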
+
+**Bitnami etcd "unrecognized" image?**
+
+```
+ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
+```
+You might encounter this error during `helm install` because Bitnami moved its container images to a [secure repository](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).
+
+To work around it, add the following flags to the `helm install` command:
+```bash
+--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
+```
+
+**Clean uninstall?**
+
+To uninstall the platform, you can run the following command:
+```bash
+helm uninstall dynamo-platform --namespace ${NAMESPACE}
+```
+
+To uninstall the CRDs, follow these steps:
+
+Get all of the dynamo CRDs installed in your cluster:
+```bash
+kubectl get crd | grep "dynamo.*nvidia.com"
+```
+
+You should see something like this:
+```
+dynamocomponentdeployments.nvidia.com               2025-10-21T14:49:52Z
+dynamocomponents.nvidia.com                         2025-10-25T05:16:10Z
+dynamographdeploymentrequests.nvidia.com            2025-11-24T05:26:04Z
+dynamographdeployments.nvidia.com                   2025-09-04T20:56:40Z
+dynamographdeploymentscalingadapters.nvidia.com     2025-12-09T21:05:59Z
+dynamomodels.nvidia.com                             2025-11-07T00:19:43Z
+```
+
+Delete each CRD one by one:
+```bash
+kubectl delete crd <crd-name>
+```
+
+## Advanced Options
+
+- [Helm Chart Configuration](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md)
+- [Create custom deployments](deployment/create-deployment.md)
+- [Dynamo Operator details](dynamo-operator.md)
+- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)
--- a/fern/pages/kubernetes/model-caching-with-fluid.md
+++ b/fern/pages/kubernetes/model-caching-with-fluid.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Model Caching with Fluid: Cloud-Native Data Orchestration and Acceleration"
+---
+
+Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
+
+## Key Features
+
+- **Data Caching and Acceleration:** Cache remote data close to compute workloads for faster access.
+- **Unified Data Access:** Access data from S3, HDFS, NFS, and more through a single interface.
+- **Kubernetes Native:** Integrates with Kubernetes using CRDs for data management.
+- **Scalability:** Supports large-scale data and compute clusters.
+
+## Installation
+
+You can install Fluid on any Kubernetes cluster using Helm.
+
+**Prerequisites:**
+- Kubernetes >= 1.18
+- `kubectl` >= 1.18
+- `Helm` >= 3.5
+
+**Quick Install:**
+```sh
+kubectl create ns fluid-system
+helm repo add fluid https://fluid-cloudnative.github.io/charts
+helm repo update
+helm install fluid fluid/fluid -n fluid-system
+```
+For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation).
+
+## Pre-deployment Steps
+
+1. Install Fluid (see [Installation](#installation)).
+2. Create a Dataset and Runtime (see [the following example](#webufs-example)).
+3. Mount the resulting PVC in your workload.
+
+
+## Mounting Data Sources
+
+### WebUFS Example
+
+WebUFS allows mounting HTTP/HTTPS sources as filesystems.
+
+```yaml
+# Mount a public HTTP directory as a Fluid Dataset
+apiVersion: data.fluid.io/v1alpha1
+kind: Dataset
+metadata:
+  name: webufs-model
+spec:
+  mounts:
+    - mountPoint: https://myhost.org/path_to_my_model  # Replace with your HTTP source
+      name: webufs-model
+---
+apiVersion: data.fluid.io/v1alpha1
+kind: AlluxioRuntime
+metadata:
+  name: webufs-model
+spec:
+  replicas: 2
+  tieredstore:
+    levels:
+      - mediumtype: MEM
+        path: /dev/shm
+        quota: 2Gi
+        high: "0.95"
+        low: "0.7"
+```
+After applying, Fluid creates a PersistentVolumeClaim (PVC) named `webufs-model` containing the files.
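+
+To check that the PVC is usable before wiring it into a real workload, you can mount it in a throwaway Pod. The manifest below is a minimal sketch using only standard Kubernetes fields; the Pod name, the `busybox` image, and the `/model` mount path are illustrative assumptions:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: webufs-model-reader        # hypothetical name
+spec:
+  restartPolicy: Never
+  containers:
+    - name: reader
+      image: busybox:1.36          # any small image with a shell works
+      # List the cached files, then exit
+      command: ["sh", "-c", "ls -lR /model"]
+      volumeMounts:
+        - name: model
+          mountPath: /model
+          readOnly: true
+  volumes:
+    - name: model
+      persistentVolumeClaim:
+        claimName: webufs-model    # PVC created by Fluid for the Dataset above
+```
+
+The Pod must run in the same namespace as the Dataset, because a PVC can only be mounted from its own namespace.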
+
+### S3 Example
+
+Mount an S3 bucket as a Fluid Dataset.
+
+```yaml
+# Mount an S3 bucket as a Fluid Dataset
+apiVersion: data.fluid.io/v1alpha1
+kind: Dataset
+metadata:
+  name: s3-model
+spec:
+  mounts:
+    - mountPoint: s3://<your-bucket>  # Replace with your bucket name
+      options:
+        alluxio.underfs.s3.endpoint: http://minio:9000  # S3 endpoint (e.g., MinIO)
+        alluxio.underfs.s3.disable.dns.buckets: "true"
+        aws.secretKey: "<your-secret>"
+        aws.accessKeyId: "<your-access-key>"
+---
+apiVersion: data.fluid.io/v1alpha1
+kind: AlluxioRuntime
+metadata:
+  name: s3-model
+spec:
+  replicas: 1
+  tieredstore:
+    levels:
+      - mediumtype: MEM
+        path: /dev/shm
+        quota: 1Gi
+        high: "0.95"
+        low: "0.7"
+---
+apiVersion: data.fluid.io/v1alpha1
+kind: DataLoad
+metadata:
+  name: s3-model-loader
+spec:
+  dataset:
+    name: s3-model
+    namespace: <your-namespace>  # Replace with your namespace
+  loadMetadata: true
+  target:
+    - path: "/"
+      replicas: 1
+```
+
+The resulting PVC is named `s3-model`.
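+
+The example above places the S3 credentials directly in the Dataset. Fluid's Dataset API also documents an `encryptOptions` field that reads such values from a Kubernetes Secret. The following is a sketch under that assumption; the Secret name `s3-credentials` is hypothetical, and you should confirm the field against the Fluid version you run:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: s3-credentials             # hypothetical name
+stringData:
+  aws.accessKeyId: "<your-access-key>"
+  aws.secretKey: "<your-secret>"
+---
+apiVersion: data.fluid.io/v1alpha1
+kind: Dataset
+metadata:
+  name: s3-model
+spec:
+  mounts:
+    - mountPoint: s3://<your-bucket>  # Replace with your bucket name
+      options:
+        alluxio.underfs.s3.endpoint: http://minio:9000
+        alluxio.underfs.s3.disable.dns.buckets: "true"
+      # Credentials are resolved from the Secret instead of being inlined
+      encryptOptions:
+        - name: aws.accessKeyId
+          valueFrom:
+            secretKeyRef:
+              name: s3-credentials
+              key: aws.accessKeyId
+        - name: aws.secretKey
+          valueFrom:
+            secretKeyRef:
+              name: s3-credentials
+              key: aws.secretKey
+```
+
+This keeps the secret values out of the Dataset manifest, which is useful when the manifest is stored in version control.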
+
+## Using HuggingFace Models with Fluid
+
+**Limitations:**
+- HuggingFace models are not exposed as simple filesystems or buckets.
+- No native integration exists between Fluid and the HuggingFace Hub API.
+
+**Workaround: Download and Upload to S3/MinIO**
+
+1. Download the model using the HuggingFace CLI or SDK.
+2. Upload the model files to a supported storage backend (S3, GCS, NFS).
+3. Mount that backend using Fluid.
+
+**Example Pod to Download and Upload:**
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: download-hf-to-minio
+spec:
+  restartPolicy: Never
+  containers:
+    - name: downloader
+      image: python:3.10-slim
+      command: ["sh", "-c"]
+      args:
+        - |
+          set -eux
+          pip install --no-cache-dir huggingface_hub awscli
+          BUCKET_NAME=hf-models
+          ENDPOINT_URL=http://minio:9000
+          MODEL_NAME=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+          LOCAL_DIR=/tmp/model
+          if ! aws --endpoint-url $ENDPOINT_URL s3 ls "s3://$BUCKET_NAME" > /dev/null 2>&1; then
+            aws --endpoint-url $ENDPOINT_URL s3 mb "s3://$BUCKET_NAME"
+          fi
+          huggingface-cli download $MODEL_NAME --local-dir $LOCAL_DIR --local-dir-use-symlinks False
+          aws --endpoint-url $ENDPOINT_URL s3 cp $LOCAL_DIR s3://$BUCKET_NAME/$MODEL_NAME --recursive
+      env:
+        - name: AWS_ACCESS_KEY_ID
+          value: "<your-access-key>"
+        - name: AWS_SECRET_ACCESS_KEY
+          value: "<your-secret>"
+      volumeMounts:
+        - name: tmp-volume
+          mountPath: /tmp/model
+  volumes:
+    - name: tmp-volume
+      emptyDir: {}
+```
+
+You can then use `s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B` as your Dataset mount.
+
+## Usage with Dynamo
+
+Mount the Fluid-generated PVC in your DynamoGraphDeployment:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: model-caching
+spec:
+  pvcs:
+    - name: s3-model
+  envs:
+    - name: HF_HOME
+      value: /model
+    - name: DYN_DEPLOYMENT_CONFIG
+      value: '{"Common": {"model": "/model", ...}}'
+  services:
+    VllmWorker:
+      volumeMounts:
+        - name: s3-model
+          mountPoint: /model
+    Processor:
+      volumeMounts:
+        - name: s3-model
+          mountPoint: /model
+```
+
+
+## Full Example with Llama 3.3 70B
+
+### Performance
+
+When deploying Llama 3.3 70B with Fluid as the caching layer, we observed the best performance with a single-node cache that holds 100% of the model files locally. Scheduling the vLLM worker pod on the same node as the Fluid cache eliminated network I/O bottlenecks and gave the fastest model startup time and the highest inference efficiency in our tests.
+
+| Cache Configuration                          | vLLM Pod Placement               | Startup Time    |
+|----------------------------------------------|----------------------------------|-----------------|
+| ❌ No Cache (Download from HuggingFace)      | N/A                              | ~9 minutes      |
+| 🟡 Multi-Node Cache (100% Model Cached)      | Not on Cache Node                | ~18 minutes     |
+| 🟡 Multi-Node Cache (100% Model Cached)      | On Cache Node                    | ~10 minutes     |
+| ✅ Single-Node Cache (100% Model Cached)     | On Cache Node                    | ~80 seconds     |
+
+
+### Resources
+
+```yaml
+# dataset.yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: Dataset
+metadata:
+  name: llama-3-3-70b-instruct-model
+  namespace: my-namespace
+spec:
+  mounts:
+    - mountPoint: s3://hf-models/meta-llama/Llama-3.3-70B-Instruct
+      options:
+        alluxio.underfs.s3.endpoint: http://minio:9000
+        alluxio.underfs.s3.disable.dns.buckets: "true"
+        aws.secretKey: "minioadmin"
+        aws.accessKeyId: "minioadmin"
+        alluxio.underfs.s3.streaming.upload.enabled: "true"
+        alluxio.underfs.s3.multipart.upload.threads: "20"
+        alluxio.underfs.s3.socket.timeout: "50s"
+        alluxio.underfs.s3.request.timeout: "60s"
+---
+# runtime.yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: AlluxioRuntime
+metadata:
+  name: llama-3-3-70b-instruct-model
+  namespace: my-namespace
+spec:
+  replicas: 1
+  properties:
+    alluxio.user.file.readtype.default: CACHE_PROMOTE
+    alluxio.user.file.write.type.default: CACHE_THROUGH
+    alluxio.user.block.size.bytes.default: 128MB
+  tieredstore:
+    levels:
+      - mediumtype: MEM
+        path: /dev/shm
+        quota: 300Gi
+        high: "1.0"
+        low: "0.7"
+---
+# DataLoad - Preloads the model into cache
+apiVersion: data.fluid.io/v1alpha1
+kind: DataLoad
+metadata:
+  name: llama-3-3-70b-instruct-model-loader
+spec:
+  dataset:
+    name: llama-3-3-70b-instruct-model
+    namespace: my-namespace
+  loadMetadata: true
+  target:
+    - path: "/"
+      replicas: 1
+```
+
+The associated DynamoGraphDeployment uses node affinity to schedule the vLLM worker on the same node as the Alluxio cache worker:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-hello-world
+spec:
+  envs:
+  - name: DYN_LOG
+    value: "debug"
+  - name: DYN_DEPLOYMENT_CONFIG
+    value: '{"Common": {"model": "/model", "block-size": 64, "max-model-len": 16384},
+      "Frontend": {"served_model_name": "meta-llama/Llama-3.3-70B-Instruct", "endpoint":
+      "dynamo.Processor.chat/completions", "port": 8000}, "Processor": {"router":
+      "round-robin", "router-num-threads": 4, "common-configs": ["model", "block-size",
+      "max-model-len"]}, "VllmWorker": {"tensor-parallel-size": 4, "enforce-eager": true, "max-num-batched-tokens":
+      16384, "enable-prefix-caching": true, "ServiceArgs": {"workers": 1, "resources":
+      {"gpu": "4", "memory": "40Gi"}}, "common-configs": ["model", "block-size", "max-model-len"]},
+      "Planner": {"environment": "kubernetes", "no-operation": true}}'
+  pvcs:
+    - name: llama-3-3-70b-instruct-model
+  services:
+    Processor:
+      volumeMounts:
+        - name: llama-3-3-70b-instruct-model
+          mountPoint: /model
+    VllmWorker:
+      volumeMounts:
+        - name: llama-3-3-70b-instruct-model
+          mountPoint: /model
+      extraPodSpec:
+        affinity:
+          nodeAffinity:
+            requiredDuringSchedulingIgnoredDuringExecution:
+              nodeSelectorTerms:
+                - matchExpressions:
+                  - key: fluid.io/s-alluxio-my-namespace-llama-3-3-70b-instruct-model
+                    operator: In
+                    values:
+                      - "true"
+```
+
+
+## Troubleshooting & FAQ
+
+- **PVC not created?** Check Fluid and AlluxioRuntime pod logs.
+- **Model not found?** Ensure the model was uploaded to the correct bucket/path.
+- **Permission errors?** Verify S3/MinIO credentials and bucket policies.
+
+## Resources
+
+- [Fluid Documentation](https://fluid-cloudnative.github.io/)
+- [Alluxio Documentation](https://docs.alluxio.io/)
+- [MinIO Documentation](https://docs.min.io/)
+- [Hugging Face Hub](https://huggingface.co/docs/hub/index)
+- [Dynamo README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md)
+- [Dynamo Documentation](https://docs.nvidia.com/dynamo/latest/index.html)
--- a/fern/pages/kubernetes/observability/logging.md
+++ b/fern/pages/kubernetes/observability/logging.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Log Aggregation in Dynamo on Kubernetes"
+---
+
+This guide demonstrates how to set up logging for Dynamo in Kubernetes using Grafana Loki and Alloy. It provides a simple reference configuration that works on Kubernetes clusters, including Minikube and MicroK8s.
+
+<Note>
+This setup is intended for development and testing purposes. For production environments, please refer to the official documentation for high-availability configurations.
+</Note>
+
+## Components Overview
+
+- **[Grafana Loki](https://grafana.com/oss/loki/)**: Fast and cost-effective Kubernetes-native log aggregation system.
+
+- **[Grafana Alloy](https://grafana.com/oss/alloy/)**: OpenTelemetry collector that replaces Promtail, gathering logs, metrics and traces from Kubernetes pods.
+
+- **[Grafana](https://grafana.com/grafana/)**: Visualization platform for querying and exploring logs.
+
+## Prerequisites
+
+### 1. Dynamo Kubernetes Platform
+
+This guide assumes you have installed Dynamo Kubernetes Platform. For more information, see [Dynamo Kubernetes Platform](../README.md).
+
+### 2. Kube-prometheus
+
+While this guide does not use Prometheus, it assumes Grafana was installed as part of the kube-prometheus-stack. For more information, see [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
+
+### 3. Environment Variables
+
+#### Kubernetes Setup Variables
+
+Set the following environment variables:
+- `MONITORING_NAMESPACE`: The namespace where Loki is installed
+- `DYN_NAMESPACE`: The namespace where Dynamo Kubernetes Platform is installed
+
+```bash
+export MONITORING_NAMESPACE=monitoring
+export DYN_NAMESPACE=dynamo-system
+```
+
+#### Dynamo Logging Variables
+
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` |
+| `DYN_LOG` | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info,dynamo_runtime::system_status_server=trace` |
+| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `true` |
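+
+One way to set these variables for every service of a deployment is a top-level `envs` list on the DynamoGraphDeployment. The fragment below is a sketch that assumes your CRD version supports `spec.envs`; the deployment name is illustrative, and the `services` section is elided rather than shown in full:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: sglang-agg                 # illustrative name
+spec:
+  envs:
+    # Emit structured JSONL logs so Loki can parse them
+    - name: DYN_LOGGING_JSONL
+      value: "true"
+    # Default level "info", with one module raised to "trace"
+    - name: DYN_LOG
+      value: "info,dynamo_runtime::system_status_server=trace"
+  services: {}                     # elided: keep your existing services here
+```
+
+Environment variable values must be YAML strings, so quote `true` to keep it from being parsed as a boolean.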
+
+## Installation Steps
+
+### 1. Install Loki
+
+First, we'll install Loki in single binary mode, which is ideal for testing and development:
+
+```bash
+# Add the Grafana Helm repository
+helm repo add grafana https://grafana.github.io/helm-charts
+helm repo update
+
+# Install Loki
+helm install --values deploy/observability/k8s/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE
+```
+
+The provided `loki-values.yaml` configures a minimal Loki setup suitable for testing and development, with a local MinIO instance for storage. List the Loki pods with:
+```bash
+kubectl get pods -n $MONITORING_NAMESPACE -l app.kubernetes.io/name=loki
+```
+
+### 2. Install Grafana Alloy
+
+Next, install the Grafana Alloy collector to gather logs from your Kubernetes cluster and forward them to Loki. Here we use the Helm chart `k8s-monitoring` provided by Grafana to install the collector:
+
+```bash
+# Generate a custom values file with the namespace information
+envsubst < deploy/observability/k8s/logging/values/alloy-values.yaml > alloy-custom-values.yaml
+
+# Install the collector
+helm install --values alloy-custom-values.yaml alloy grafana/k8s-monitoring -n $MONITORING_NAMESPACE
+```
+
+The values file (`alloy-values.yaml`) includes the following configurations for the collector:
+- Destination to forward logs to Loki
+- Namespace to collect logs from
+- Pod labels to be mapped to Loki labels
+- Collection method (kubernetesApi or tailing `/var/log/containers/`)
+
+```yaml
+destinations:
+- name: loki
+  type: loki
+  url: http://loki-gateway.$MONITORING_NAMESPACE.svc.cluster.local/loki/api/v1/push
+podLogs:
+  enabled: true
+  gatherMethod: kubernetesApi # collect logs from the kubernetes api, rather than /var/log/containers/; friendly for testing and development
+  collector: alloy-logs
+  labels:
+    app_kubernetes_io_name: app.kubernetes.io/name
+    nvidia_com_dynamo_component_type: nvidia.com/dynamo-component-type
+    nvidia_com_dynamo_graph_deployment_name: nvidia.com/dynamo-graph-deployment-name
+  labelsToKeep:
+  - "app_kubernetes_io_name"
+  - "container"
+  - "instance"
+  - "job"
+  - "level"
+  - "namespace"
+  - "service_name"
+  - "service_namespace"
+  - "deployment_environment"
+  - "deployment_environment_name"
+  - "nvidia_com_dynamo_component_type" # extract this label from the dynamo graph deployment
+  - "nvidia_com_dynamo_graph_deployment_name" # extract this label from the dynamo graph deployment
+  namespaces:
+  - $DYN_NAMESPACE
+```
+
+### 3. Configure Grafana with the Loki datasource and Dynamo Logs dashboard
+
+We will be viewing the logs associated with our DynamoGraphDeployment in Grafana. To do this, we need to configure Grafana with the Loki datasource and Dynamo Logs dashboard.
+
+Since we are using Grafana with the Prometheus Operator, we can simply apply the following ConfigMaps to quickly achieve this configuration.
+
+```bash
+# Configure Grafana with the Loki datasource
+envsubst < deploy/observability/k8s/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
+
+# Configure Grafana with the Dynamo Logs dashboard
+envsubst < deploy/observability/k8s/logging/grafana/logging-dashboard.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
+```
+
+<Note>
+If using Grafana installed without the Prometheus Operator, you can manually import the Loki datasource and Dynamo Logs dashboard using the Grafana UI.
+</Note>
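+
+For orientation, a datasource ConfigMap of the kind applied above typically looks like the sketch below. It assumes the Grafana sidecar shipped with kube-prometheus-stack, which by default loads ConfigMaps labeled `grafana_datasource: "1"`. The ConfigMap name and data key are hypothetical, and the file in the repository may differ:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: loki-datasource            # hypothetical name
+  labels:
+    grafana_datasource: "1"        # picked up by the Grafana sidecar
+data:
+  loki-datasource.yaml: |
+    apiVersion: 1
+    datasources:
+      - name: Loki
+        type: loki
+        access: proxy
+        # Same gateway Service that Alloy pushes logs to
+        url: http://loki-gateway.$MONITORING_NAMESPACE.svc.cluster.local
+```
+
+The `$MONITORING_NAMESPACE` placeholder is filled in by `envsubst` before the manifest reaches `kubectl apply`.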
+
+### 4. Deploy a DynamoGraphDeployment with JSONL Logging
+
+At this point, we should have everything in place to collect and view logs in our Grafana instance. All that is left is to deploy a DynamoGraphDeployment to collect logs from.
+
+To enable structured logs in a DynamoGraphDeployment, we need to set the `DYN_LOGGING_JSONL` environment variable to `1`. This is done for us in the `agg_logging.yaml` setup for the SGLang backend. We can now deploy the DynamoGraphDeployment with:
+
+```bash
+kubectl apply -n $DYN_NAMESPACE -f examples/backends/sglang/deploy/agg_logging.yaml
+```
+
+Send a few chat completion requests to generate structured logs from the frontend and worker pods of the DynamoGraphDeployment. We are now all set to view the logs in Grafana.
+
+## Viewing Logs in Grafana
+
+Port-forward the Grafana service to access the UI:
+
+```bash
+kubectl port-forward svc/prometheus-grafana 3000:80 -n $MONITORING_NAMESPACE
+```
+
+If everything is working, under Home > Dashboards > Dynamo Logs, you should see a dashboard that can be used to view the logs associated with your DynamoGraphDeployments.
+
+The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend, worker, etc.).
--- a/fern/pages/kubernetes/observability/metrics.md
+++ b/fern/pages/kubernetes/observability/metrics.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Dynamo Metrics Collection on Kubernetes"
+---
+
+## Overview
+
+This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
+
+## Prerequisites
+
+### Install kube-prometheus-stack
+If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:
+
+- `PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
+- `ServiceMonitor`: Similar to PodMonitor but works with Services
+- `PrometheusRule`: Defines alerting and recording rules
+
+For a basic installation:
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+# Values allow PodMonitors to be picked up that are outside of the kube-prometheus-stack helm release
+helm install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
+  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
+  --set prometheus.prometheusSpec.podMonitorNamespaceSelector="{}" \
+  --set prometheus.prometheusSpec.probeNamespaceSelector="{}"
+```
+
+<Note>
+The commands below assume you installed the kube-prometheus-stack using the method above. If your monitoring stack is configured differently, adjust the `kubectl` commands in this document accordingly (e.g., change namespace or Service names).
+</Note>
+
+### Install Dynamo Operator
+Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../installation-guide.md) for detailed instructions on deploying the Dynamo operator.
+Make sure to set the `prometheusEndpoint` to the Prometheus endpoint you installed in the previous step.
+
+```bash
+helm install dynamo-platform ... \
+  --set prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+```
+
+
+### Node Exporter for CPU/Memory Metrics
+
+The Dynamo Grafana dashboard includes panels for node-level CPU utilization, system load, and container resource usage. These metrics are collected and exported to Prometheus via [node-exporter](https://github.com/prometheus/node_exporter), which exposes hardware and OS metrics from Linux systems.
+
+<Note>
+The kube-prometheus-stack installation described above includes node-exporter by default. If you're using a custom Prometheus setup, you'll need to ensure node-exporter is deployed as a DaemonSet on your cluster nodes.
+</Note>
+
+To verify node-exporter is running:
+
+```bash
+kubectl get daemonset -A | grep node-exporter
+```
+
+If node-exporter is not running, you can install it via the kube-prometheus-stack or deploy it separately. For more information, see the [node-exporter documentation](https://github.com/prometheus/node_exporter).
+
+### DCGM Metrics Collection (Optional)
+
+GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
+
+```bash
+kubectl get daemonset -A | grep dcgm-exporter
+```
+
+If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
+
+
+## Deploy a DynamoGraphDeployment
+
+Let's start by deploying a simple vLLM aggregated deployment:
+
+```bash
+export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
+pushd examples/backends/vllm/deploy
+kubectl apply -f agg.yaml -n $NAMESPACE
+popd
+```
+
+This will create two components:
+- A Frontend component exposing metrics on its HTTP port
+- A Worker component exposing metrics on its system port
+
+Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
+- Deployment configuration: See the [vLLM README](../../backends/vllm/README.md)
+- Available metrics: See the [metrics guide](../../observability/metrics.md)
+
+### Validate the Deployment
+
+Let's send some test requests to populate metrics:
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "messages": [
+    {
+        "role": "user",
+        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
+    }
+    ],
+    "stream": true,
+    "max_tokens": 30
+  }'
+```
+
+For more information about validating the deployment, see the [vLLM README](../../backends/vllm/README.md).
+
+## Set Up Metrics Collection
+
+### Create PodMonitors
+
+The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates a PodMonitor resource and adds these labels to all pods:
+- `nvidia.com/metrics-enabled: "true"` - Enables metrics collection
+- `nvidia.com/dynamo-component-type: "frontend|worker"` - Identifies the component type
+
+> **Note**: You can opt specific deployments out of metrics collection by adding this annotation to your DynamoGraphDeployment:
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+  annotations:
+    nvidia.com/enable-metrics: "false"
+spec:
+  # …
+```
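+
+For reference, a PodMonitor that selects pods by the labels listed above could look like the following sketch. The Dynamo operator creates its own PodMonitor, so you normally do not need to apply this. The resource name, the port name `system`, and the scrape interval are assumptions for illustration:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: dynamo-worker-example      # hypothetical name
+  namespace: monitoring
+spec:
+  # Look for matching pods in every namespace
+  namespaceSelector:
+    any: true
+  selector:
+    matchLabels:
+      nvidia.com/metrics-enabled: "true"
+      nvidia.com/dynamo-component-type: "worker"
+  podMetricsEndpoints:
+    - port: system                 # assumed container port name
+      path: /metrics
+      interval: 5s
+```
+
+To see what the operator actually created, run `kubectl get podmonitors -A`.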
+
+### Configure Grafana Dashboard
+
+Apply the Dynamo dashboard configuration to populate Grafana with the Dynamo dashboard:
+```bash
+kubectl apply -n monitoring -f deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
+```
+
+The dashboard is embedded in the ConfigMap. Because the ConfigMap is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds it to its list of available dashboards. The dashboard includes panels for:
+- Frontend request rates
+- Time to first token
+- Inter-token latency
+- Request duration
+- Input/Output sequence lengths
+- GPU utilization via DCGM
+- Node CPU utilization and system load
+- Container CPU usage per pod
+- Memory usage per pod
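+
+The same label-based discovery works for your own dashboards. The skeleton below is a sketch, assuming the default sidecar configuration of the kube-prometheus-stack; the ConfigMap name, file name, and the nearly empty dashboard JSON are placeholders:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: my-team-dashboard          # hypothetical name
+  namespace: monitoring
+  labels:
+    grafana_dashboard: "1"         # tells the Grafana sidecar to load this ConfigMap
+data:
+  my-team-dashboard.json: |
+    {
+      "title": "My Team Dashboard",
+      "panels": []
+    }
+```
+
+In practice you would export the dashboard JSON from the Grafana UI and paste it under the data key.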
+
+## Viewing the Metrics
+
+### In Prometheus
+```bash
+kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
+```
+
+Visit http://localhost:9090 and try these example queries:
+- `dynamo_frontend_requests_total`
+- `dynamo_frontend_time_to_first_token_seconds_bucket`
+
+![Prometheus UI showing Dynamo metrics](../../../assets/img/prometheus-k8s.png)
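+
+If you run such queries often, you can precompute them with recording rules. The PrometheusRule below is a sketch built on the two metrics listed above. The rule names are hypothetical, and the `release: prometheus` label assumes the Helm release name used in this guide, since the default kube-prometheus-stack configuration only loads rules that carry the release label:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: dynamo-frontend-example    # hypothetical name
+  namespace: monitoring
+  labels:
+    release: prometheus            # assumed to match the Helm release name
+spec:
+  groups:
+    - name: dynamo-frontend.rules
+      rules:
+        # Overall frontend request rate over the last 5 minutes
+        - record: dynamo:frontend_requests:rate5m
+          expr: sum(rate(dynamo_frontend_requests_total[5m]))
+        # 95th percentile time to first token, estimated from the histogram buckets
+        - record: dynamo:frontend_ttft_seconds:p95
+          expr: |
+            histogram_quantile(
+              0.95,
+              sum by (le) (rate(dynamo_frontend_time_to_first_token_seconds_bucket[5m]))
+            )
+```
+
+The `expr` values can also be pasted directly into the Prometheus UI to try them before creating the rule.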
+
+### In Grafana
+```bash
+# Get Grafana credentials
+export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
+export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
+echo "Grafana user: $GRAFANA_USER"
+echo "Grafana password: $GRAFANA_PASSWORD"
+
+# Port forward Grafana service
+kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
+```
+
+Visit http://localhost:3000 and log in with the credentials captured above.
+
+Once logged in, find the Dynamo dashboard under General.
+
+![Grafana dashboard showing Dynamo metrics](../../../assets/img/grafana-k8s.png)
--- a/fern/pages/kubernetes/quickstart.md
+++ b/fern/pages/kubernetes/quickstart.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Deploying Dynamo on Kubernetes"
+---
+
+High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
+
+## Important Terminology
+
+**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created.
+- Used for: Resource isolation, RBAC, organizing deployments
+- Example: `dynamo-system`, `team-a-namespace`
+
+**Dynamo Namespace**: The logical namespace used by Dynamo components for [service discovery](service-discovery.md).
+- Used for: Runtime component communication, service discovery
+- Specified in: `.spec.services.<ServiceName>.dynamoNamespace` field
+- Example: `my-llm`, `production-model`, `dynamo-dev`
+
+These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
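+
+As an illustration, the sketch below places two deployments in the same Kubernetes namespace while giving each its own Dynamo namespace. The deployment names and Dynamo namespace values are hypothetical, and only the fields already used elsewhere in this guide appear.
+
+```yaml
+# Both resources live in the Kubernetes namespace "team-a-namespace",
+# but their components discover each other in separate Dynamo namespaces.
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: model-a                    # hypothetical name
+  namespace: team-a-namespace
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: model-a     # Dynamo namespace for this deployment
+      componentType: frontend
+      replicas: 1
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: model-b                    # hypothetical name
+  namespace: team-a-namespace
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: model-b     # a different Dynamo namespace
+      componentType: frontend
+      replicas: 1
+```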
+
+## Prerequisites
+
+Before you begin, ensure you have the following tools installed:
+
+| Tool | Minimum Version | Installation Guide |
+|------|-----------------|-------------------|
+| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
+| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
+
+Verify your installation:
+```bash
+kubectl version --client  # Should show v1.24+
+helm version              # Should show v3.0+
+```
+
+For detailed installation instructions, see the [Prerequisites section](installation-guide.md#prerequisites) in the Installation Guide.
+
+## Pre-deployment Checks
+
+Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
+
+```bash
+./deploy/pre-deployment/pre-deployment-check.sh
+```
+
+This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for more details.
+
+## 1. Install Platform First
+
+```bash
+# 1. Set environment
+export NAMESPACE=dynamo-system
+export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+
+# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
+helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
+
+# 3. Install Platform
+helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
+```
+
+**For Shared/Multi-Tenant Clusters:**
+
+If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
+```bash
+--set dynamo-operator.namespaceRestriction.enabled=true
+```
+
+For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](installation-guide.md)**.
+
+## 2. Choose Your Backend
+
+Each backend has deployment examples and configuration options:
+
+| Backend      | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
+|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
+| **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)**       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
+| **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)**           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+
+## 3. Deploy Your First Model
+
+```bash
+export NAMESPACE=dynamo-system
+kubectl create namespace ${NAMESPACE}  # skip if the namespace already exists (for example, created in step 1)
+
+# to pull model from HF
+export HF_TOKEN=<Token-Here>
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="$HF_TOKEN" \
+  -n ${NAMESPACE}
+
+# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
+kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
+
+# Check status
+kubectl get dynamoGraphDeployment -n ${NAMESPACE}
+
+# Test it (port-forward blocks, so run curl from a second terminal)
+kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
+curl http://localhost:8000/v1/models
+```
+
+For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla-planner-quickstart.md).
+
+## Understanding Dynamo's Custom Resources
+
+Dynamo provides two main Kubernetes Custom Resources for deploying models:
+
+### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
+
+The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
+- Model name and backend framework
+- SLA targets (latency requirements)
+- GPU type (optional)
+
+Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
+- SLA-driven configuration generation
+- Automated resource optimization
+- Users who want simplicity over control
+
+**Note**: DGDR generates a DGD spec which you can then use to deploy.
+
+### DynamoGraphDeployment (DGD) - Direct Configuration
+
+A lower-level interface that defines your complete inference pipeline:
+- Model configuration
+- Resource allocation (GPUs, memory)
+- Scaling policies
+- Frontend/backend connections
+
+Use this when you need fine-grained control or have already completed profiling.
+
+Refer to the [API Reference and Documentation](api-reference.md) for more details.
+
+## 📖 API Reference & Documentation
+
+For detailed technical specifications of Dynamo's Kubernetes resources:
+
+- **[API Reference](api-reference.md)** - Complete CRD field specifications for all Dynamo resources
+- **[Create Deployment](deployment/create-deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
+- **[Operator Guide](dynamo-operator.md)** - Dynamo operator configuration and management
+
+### Choosing Your Architecture Pattern
+
+When creating a deployment, select the architecture pattern that best fits your use case:
+
+- **Development / Testing** - Use `agg.yaml` as the base configuration
+- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
+- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
+
+### Frontend and Worker Components
+
+You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
+
+- Provides OpenAI-compatible `/v1/chat/completions` endpoint
+- Auto-discovers backend workers via [service discovery](service-discovery.md) (Kubernetes-native by default)
+- Routes requests and handles load balancing
+- Validates and preprocesses requests
+
+### Customizing Your Deployment
+
+Example structure:
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-llm
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-llm
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
+      dynamoNamespace: my-llm
+      componentType: worker
+      replicas: 1
+      envFromSecret: hf-token-secret  # for HuggingFace models
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+          command: ["/bin/sh", "-c"]
+          args:
+            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
+```
+
+Worker command examples per backend:
+```yaml
+# vLLM worker
+args:
+  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
+
+# SGLang worker
+args:
+  - >-
+    python3 -m dynamo.sglang
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    --tp 1
+    --trust-remote-code
+
+# TensorRT-LLM worker
+args:
+  - >-
+    python3 -m dynamo.trtllm
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
+```
+
+Key customization points include:
+- **Model Configuration**: Specify the model in the worker's `args` command
+- **Resource Allocation**: Configure GPU requirements under `resources.limits`
+- **Scaling**: Set `replicas` for number of worker instances
+- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
+- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
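+
+The last two customization points could be combined as in the sketch below. It makes assumptions: the `envs` field name is inferred from the phrase "Frontend envs" above, and the `VllmPrefillWorker` service name and replica count are hypothetical. Check the backend deployment examples for the exact fields.
+
+```yaml
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-llm
+      componentType: frontend
+      replicas: 1
+      envs:                          # assumed field name
+        - name: DYN_ROUTER_MODE
+          value: kv                  # enables KV-cache routing
+    VllmPrefillWorker:               # hypothetical service name
+      dynamoNamespace: my-llm
+      componentType: worker
+      replicas: 2                    # illustrative value
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        mainContainer:
+          image: your-image
+          command: ["/bin/sh", "-c"]
+          args:
+            - python3 -m dynamo.vllm --model YOUR_MODEL --is-prefill-worker
+```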
+
+## Additional Resources
+
+- **[Examples](../getting-started/examples.md)** - Complete working examples
+- **[Create Custom Deployments](deployment/create-deployment.md)** - Build your own DynamoGraphDeployment resources
+- **[Managing Models with DynamoModel](deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models
+- **[Operator Documentation](dynamo-operator.md)** - How the platform works
+- **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
+- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
+- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
+- **[Logging](observability/logging.md)** - For logging setup
+- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
+- **[Grove](grove.md)** - For grove details and custom installation
+- **[Monitoring](observability/metrics.md)** - For monitoring setup
+- **[Model Caching with Fluid](model-caching-with-fluid.md)** - For model caching with Fluid
--- a/fern/pages/kubernetes/service-discovery.md
+++ b/fern/pages/kubernetes/service-discovery.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Service Discovery"
+---
+
+Dynamo components (frontends, workers, planner) need to discover each other and their capabilities at runtime. We refer to this as service discovery. Two service discovery backends are supported on Kubernetes.
+
+## Discovery Backends
+
+| Backend | Default | Dependencies | Use Case |
+|---------|---------|--------------|----------|
+| **Kubernetes** | ✅ Yes | None (native K8s) | Recommended for all Kubernetes deployments |
+| **KV Store (etcd)** | No | etcd cluster | Legacy deployments |
+
+## Kubernetes Discovery (Default)
+
+Kubernetes discovery is the default and recommended backend when running on Kubernetes. It uses native Kubernetes primitives to facilitate discovery of components:
+
+- **DynamoWorkerMetadata CRD**: Each worker stores its registered endpoints and model cards in a Custom Resource
+- **EndpointSlices**: EndpointSlices signal each component's readiness status
+
+### Implementation Details
+
+Each pod runs a **discovery daemon** that watches both EndpointSlices and DynamoWorkerMetadata CRs. A pod is only discoverable when it appears as "ready" in an EndpointSlice AND has a corresponding `DynamoWorkerMetadata` CR. This correlation ensures pods aren't discoverable until they're ready, metadata is immediately available, and stale entries are cleaned up when pods terminate.
+
+#### DynamoWorkerMetadata CRD
+
+Each worker pod creates a `DynamoWorkerMetadata` CR that stores its discovery metadata:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoWorkerMetadata
+metadata:
+  name: my-worker-pod-abc123
+  namespace: dynamo-system
+  ownerReferences:
+    - apiVersion: v1
+      kind: Pod
+      name: my-worker-pod-abc123
+      uid: <pod-uid>
+      controller: true
+spec:
+  data:
+    endpoints:
+      "dynamo/backend/generate":
+        type: Endpoint
+        namespace: dynamo
+        component: backend
+        endpoint: generate
+        instance_id: 12345678901234567890
+        transport:
+          nats_tcp: "dynamo_backend.generate-abc123"
+    model_cards: {}
+```
+
+The CR is named after the pod and includes an owner reference for automatic garbage collection when the pod is deleted.
+
+#### EndpointSlices
+
+While DynamoWorkerMetadata resources provide an up-to-date snapshot of a component's capabilities, EndpointSlices provide a snapshot of the health of the various Dynamo components.
+
+The operator creates a Kubernetes Service targeting the Dynamo components. The Kubernetes control plane in turn creates and maintains EndpointSlice resources that track the readiness of the pods targeted by the Service. Watching these slices gives us an up-to-date snapshot of which Dynamo components are ready to serve traffic.
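+
+For reference, the readiness signal lives in the `conditions.ready` field of each endpoint in an EndpointSlice. The sketch below uses the standard `discovery.k8s.io/v1` schema. The slice name, Service name, IP address, and port are hypothetical and do not reflect what the operator actually generates.
+
+```yaml
+apiVersion: discovery.k8s.io/v1
+kind: EndpointSlice
+metadata:
+  name: my-worker-service-x7k2p              # hypothetical; normally generated
+  namespace: dynamo-system
+  labels:
+    kubernetes.io/service-name: my-worker-service   # hypothetical Service name
+addressType: IPv4
+endpoints:
+  - addresses: ["10.0.0.12"]                 # illustrative pod IP
+    conditions:
+      ready: true                            # set once the readiness probe succeeds
+    targetRef:
+      kind: Pod
+      name: my-worker-pod-abc123
+      namespace: dynamo-system
+ports:
+  - name: http                               # illustrative port
+    port: 8000
+    protocol: TCP
+```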
+
+##### Readiness Probes
+A pod is marked ready if the readiness probe succeeds. On Dynamo workers, this is when the `generate` endpoint is available and healthy. These probes are configured by the Dynamo operator for each pod/component.
+
+#### RBAC
+
+Each Dynamo component pod is automatically given a ServiceAccount that allows it to watch `EndpointSlice` and `DynamoWorkerMetadata` resources within its namespace.
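+
+A namespaced Role granting that access might look like the sketch below. It is not the operator's actual manifest: the Role name is hypothetical, the plural resource name `dynamoworkermetadatas` is an assumption, and the real role may grant additional verbs.
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: example-discovery-reader             # hypothetical name
+  namespace: dynamo-system
+rules:
+  - apiGroups: ["discovery.k8s.io"]
+    resources: ["endpointslices"]
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["nvidia.com"]
+    resources: ["dynamoworkermetadatas"]     # assumed plural resource name
+    verbs: ["get", "list", "watch"]
+```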
+
+#### Environment Variables
+
+The following environment variables are automatically injected into pods by the operator to facilitate service discovery:
+
+| Variable | Description |
+|----------|-------------|
+| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
+| `POD_NAME` | Pod name (via downward API) |
+| `POD_NAMESPACE` | Pod namespace (via downward API) |
+| `POD_UID` | Pod UID (via downward API) |
+
+The pod's instance ID is deterministically generated by hashing the pod name, ensuring consistent identity and correlation between EndpointSlices and CRs.
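+
+In a plain pod spec, the variables in the table above correspond to the standard Kubernetes downward API pattern shown below. You do not need to add this yourself because the operator injects it. The snippet only illustrates what "via downward API" means.
+
+```yaml
+env:
+  - name: DYN_DISCOVERY_BACKEND
+    value: kubernetes
+  - name: POD_NAME
+    valueFrom:
+      fieldRef:
+        fieldPath: metadata.name         # resolved to the pod's name at runtime
+  - name: POD_NAMESPACE
+    valueFrom:
+      fieldRef:
+        fieldPath: metadata.namespace
+  - name: POD_UID
+    valueFrom:
+      fieldRef:
+        fieldPath: metadata.uid
+```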
+
+## KV Store Discovery (etcd)
+
+To use etcd-based discovery instead of Kubernetes-native discovery, add the following annotation to your DynamoGraphDeployment:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+  annotations:
+    nvidia.com/dynamo-discovery-backend: etcd
+spec:
+  services:
+    # ...
+```
+
+This requires an etcd cluster to be available. The etcd connection is configured via the platform Helm chart.
--- a/fern/pages/kubernetes/webhooks.md
+++ b/fern/pages/kubernetes/webhooks.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Webhooks"
+---
+
+This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Architecture](#architecture)
+- [Configuration](#configuration)
+  - [Enabling/Disabling Webhooks](#enablingdisabling-webhooks)
+  - [Certificate Management Options](#certificate-management-options)
+  - [Advanced Configuration](#advanced-configuration)
+- [Certificate Management](#certificate-management)
+  - [Automatic Certificates (Default)](#automatic-certificates-default)
+  - [cert-manager Integration](#cert-manager-integration)
+  - [External Certificates](#external-certificates)
+- [Multi-Operator Deployments](#multi-operator-deployments)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Overview
+
+The Dynamo Operator uses **Kubernetes admission webhooks** to check custom resources at admission time. Currently, the operator implements **validation webhooks**, which reject invalid configurations at the API server level and give users faster feedback than controller-based validation.
+
+All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
+
+### Key Features
+
+- ✅ **Enabled by default** - Zero-touch validation out of the box
+- ✅ **Shared certificate infrastructure** - All webhook types use the same TLS certificates
+- ✅ **Automatic certificate generation** - No manual certificate management required
+- ✅ **Defense in depth** - Controllers validate when webhooks are disabled
+- ✅ **cert-manager integration** - Optional integration for automated certificate lifecycle
+- ✅ **Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
+- ✅ **Immutability enforcement** - Critical fields protected via CEL validation rules
+
+### Current Webhook Types
+
+- **Validating Webhooks**: Validate custom resource specifications before persistence
+  - `DynamoComponentDeployment` validation
+  - `DynamoGraphDeployment` validation
+  - `DynamoModel` validation
+
+**Note:** Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document.
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                         API Server                               │
+│  1. User submits CR (kubectl apply)                             │
+│  2. API server calls ValidatingWebhookConfiguration             │
+└────────────────────────┬────────────────────────────────────────┘
+                         │ HTTPS (TLS required)
+                         ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                  Webhook Server (in Operator Pod)                │
+│  3. Validates CR against business rules                         │
+│  4. Returns admit/deny decision + warnings                      │
+└─────────────────────────────────────────────────────────────────┘
+                         │
+                         ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                      API Server                                  │
+│  5. If admitted: Persist CR to etcd                             │
+│  6. If denied: Return error to user                             │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Validation Flow
+
+1. **Webhook validation** (if enabled): Validates at API server level
+2. **CEL validation**: Kubernetes-native immutability checks (always active)
+3. **Controller validation** (if webhooks disabled): Defense-in-depth validation during reconciliation
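+
+To make the CEL step concrete, Kubernetes expresses immutability in a CRD schema with a transition rule that compares `self` to `oldSelf`. The fragment below is a generic sketch: `exampleImmutableField` is a hypothetical field, not one taken from the Dynamo CRDs.
+
+```yaml
+# Fragment of a CRD OpenAPI v3 schema
+properties:
+  spec:
+    type: object
+    properties:
+      exampleImmutableField:               # hypothetical field
+        type: string
+        x-kubernetes-validations:
+          - rule: "self == oldSelf"        # rejects updates that change the value
+            message: "exampleImmutableField is immutable"
+```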
+
+---
+
+## Configuration
+
+### Enabling/Disabling Webhooks
+
+Webhooks are **enabled by default**. To disable them:
+
+```yaml
+# Platform-level values.yaml
+dynamo-operator:
+  webhook:
+    enabled: false
+```
+
+**When to disable webhooks:**
+- During development/testing when rapid iteration is needed
+- In environments where admission webhooks are not supported
+- When troubleshooting validation issues
+
+**Note:** When webhooks are disabled, controllers perform validation during reconciliation (defense in depth).
+
+---
+
+### Certificate Management Options
+
+The operator supports three certificate management modes:
+
+| Mode | Description | Use Case |
+|------|-------------|----------|
+| **Automatic (Default)** | Helm hooks generate self-signed certificates | Testing and development environments |
+| **cert-manager** | Integrate with cert-manager for automated lifecycle | Production deployments with cert-manager |
+| **External** | Bring your own certificates | Production deployments with custom PKI |
+
+---
+
+### Advanced Configuration
+
+#### Complete Configuration Reference
+
+```yaml
+dynamo-operator:
+  webhook:
+    # Enable/disable validation webhooks
+    enabled: true
+
+    # Certificate management
+    certManager:
+      enabled: false
+      issuerRef:
+        kind: Issuer
+        name: selfsigned-issuer
+
+    # Certificate secret configuration
+    certificateSecret:
+      name: webhook-server-cert
+      external: false
+
+    # Certificate validity period (automatic generation only)
+    certificateValidity: 3650  # 10 years
+
+    # Certificate generator image (automatic generation only)
+    certGenerator:
+      image:
+        repository: bitnami/kubectl
+        tag: latest
+
+    # Webhook behavior configuration
+    failurePolicy: Fail        # Fail (reject on error) or Ignore (allow on error)
+    timeoutSeconds: 10         # Webhook timeout
+
+    # Namespace filtering (advanced)
+    namespaceSelector: {}      # Kubernetes label selector for namespaces
+```
+
+#### Failure Policy
+
+```yaml
+# Fail: Reject resources if webhook is unavailable (recommended for production)
+webhook:
+  failurePolicy: Fail
+
+# Ignore: Allow resources if webhook is unavailable (use with caution)
+webhook:
+  failurePolicy: Ignore
+```
+
+**Recommendation:** Use `Fail` in production to ensure validation is always enforced. Only use `Ignore` if you need high availability and can tolerate occasional invalid resources.
+
+#### Namespace Filtering
+
+Control which namespaces are validated (applies to **cluster-wide operator** only):
+
+```yaml
+# Only validate resources in namespaces with specific labels
+webhook:
+  namespaceSelector:
+    matchLabels:
+      dynamo-validation: enabled
+
+# Or exclude specific namespaces
+webhook:
+  namespaceSelector:
+    matchExpressions:
+    - key: dynamo-validation
+      operator: NotIn
+      values: ["disabled"]
+```
+
+**Note:** For **namespace-restricted operators**, the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.
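+
+With the `matchLabels` selector shown above, a namespace opts in to validation by carrying the matching label. A minimal sketch, using a hypothetical namespace name:
+
+```yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: team-b                     # hypothetical namespace
+  labels:
+    dynamo-validation: enabled     # matches the namespaceSelector above
+```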
+
+---
+
+## Certificate Management
+
+### Automatic Certificates (Default)
+
+**Zero configuration required!** Certificates are automatically generated during `helm install` and `helm upgrade`.
+
+#### How It Works
+
+1. **Pre-install/pre-upgrade hook**: Generates self-signed TLS certificates
+   - Root CA (valid 10 years)
+   - Server certificate (valid 10 years)
+   - Stores in Secret: `<release>-webhook-server-cert`
+
+2. **Post-install/post-upgrade hook**: Injects CA bundle into `ValidatingWebhookConfiguration`
+   - Reads `ca.crt` from Secret
+   - Patches `ValidatingWebhookConfiguration` with base64-encoded CA bundle
+
+3. **Operator pod**: Mounts certificate secret and serves webhook on port 9443
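+
+After the post-install hook runs, the patched resource would look roughly like the sketch below. The service name, path, and port are taken from examples elsewhere in this document. The webhook name and the plural resource name are assumptions, and the release is assumed to be `dynamo-platform` installed in `dynamo-system`.
+
+```yaml
+apiVersion: admissionregistration.k8s.io/v1
+kind: ValidatingWebhookConfiguration
+metadata:
+  name: dynamo-platform-validating
+webhooks:
+  - name: example.dynamo.nvidia.com          # hypothetical webhook name
+    admissionReviewVersions: ["v1"]
+    sideEffects: None
+    failurePolicy: Fail
+    timeoutSeconds: 10
+    clientConfig:
+      caBundle: <base64-encoded-ca.crt>      # injected from the Secret's ca.crt
+      service:
+        name: dynamo-platform-dynamo-operator-webhook-service
+        namespace: dynamo-system
+        port: 443
+        path: /validate-nvidia-com-v1alpha1-dynamocomponentdeployment
+    rules:
+      - apiGroups: ["nvidia.com"]
+        apiVersions: ["v1alpha1"]
+        operations: ["CREATE", "UPDATE"]
+        resources: ["dynamocomponentdeployments"]   # assumed plural resource name
+```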
+
+#### Certificate Validity
+
+- **Root CA**: 10 years
+- **Server Certificate**: 10 years (same as Root CA)
+- **Automatic rotation**: The generation hook runs on every `helm upgrade`, but certificates are regenerated only when they are missing, expiring soon, or have incorrect SANs
+
+#### Smart Certificate Generation
+
+The certificate generation hook avoids unnecessary work:
+- ✅ **Checks existing certificates** before generating new ones
+- ✅ **Skips generation** if valid certificates exist (valid for 30+ days with correct SANs)
+- ✅ **Regenerates** only when needed (missing, expiring soon, or incorrect SANs)
+
+This means:
+- Fast `helm upgrade` operations (no unnecessary cert generation)
+- Safe to run `helm upgrade` frequently
+- Certificates persist across reinstalls (stored in Secret)
+
+#### Manual Certificate Rotation
+
+If you need to rotate certificates manually:
+
+```bash
+# Delete the certificate secret
+kubectl delete secret <release>-webhook-server-cert -n <namespace>
+
+# Upgrade the release to regenerate certificates
+helm upgrade <release> dynamo-platform -n <namespace>
+```
+
+---
+
+### cert-manager Integration
+
+For clusters with cert-manager installed, you can enable automated certificate lifecycle management.
+
+#### Prerequisites
+
+1. **cert-manager installed** (v1.0+)
+2. **CA issuer configured** (e.g., `selfsigned-issuer`)
+
+#### Configuration
+
+```yaml
+dynamo-operator:
+  webhook:
+    certManager:
+      enabled: true
+      issuerRef:
+        kind: Issuer              # Or ClusterIssuer
+        name: selfsigned-issuer   # Your issuer name
+```
+
+#### How It Works
+
+1. **Helm creates Certificate resource**: Requests TLS certificate from cert-manager
+2. **cert-manager generates certificate**: Based on configured issuer
+3. **cert-manager stores in Secret**: `<release>-webhook-server-cert`
+4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration`
+5. **Operator pod**: Mounts certificate secret and serves webhook
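+
+The resources involved in steps 1 to 4 could look like the sketch below. It uses the standard cert-manager `Certificate` schema and the `cert-manager.io/inject-ca-from` annotation. The Certificate name is hypothetical, the release is assumed to be `dynamo-platform` in `dynamo-system`, and the chart's actual templates may differ.
+
+```yaml
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: example-webhook-cert                 # hypothetical name
+  namespace: dynamo-system
+spec:
+  secretName: dynamo-platform-webhook-server-cert   # <release>-webhook-server-cert
+  dnsNames:
+    - dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc
+  issuerRef:
+    kind: Issuer
+    name: selfsigned-issuer
+---
+# ca-injector acts on webhook configurations that carry this annotation
+apiVersion: admissionregistration.k8s.io/v1
+kind: ValidatingWebhookConfiguration
+metadata:
+  name: dynamo-platform-validating
+  annotations:
+    cert-manager.io/inject-ca-from: dynamo-system/example-webhook-cert
+# webhooks section omitted from this sketch
+```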
+
+#### Benefits Over Automatic Mode
+
+- ✅ **Automated rotation**: cert-manager renews certificates before expiration
+- ✅ **Custom validity periods**: Configure certificate lifetime
+- ✅ **CA rotation support**: ca-injector handles CA updates automatically
+- ✅ **Integration with existing PKI**: Use your organization's certificate infrastructure
+
+#### Certificate Rotation
+
+With cert-manager, certificate rotation is **fully automated**:
+
+1. **Leaf certificate rotation** (default: every year)
+   - cert-manager auto-renews before expiration
+   - controller-runtime auto-reloads new certificate
+   - **No pod restart required**
+   - **No caBundle update required** (same Root CA)
+
+2. **Root CA rotation** (every 10 years)
+   - cert-manager rotates Root CA
+   - ca-injector auto-updates caBundle in `ValidatingWebhookConfiguration`
+   - **No manual intervention required**
+
+#### Example: Self-Signed Issuer
+
+```yaml
+apiVersion: cert-manager.io/v1
+kind: Issuer
+metadata:
+  name: selfsigned-issuer
+  namespace: dynamo-system
+spec:
+  selfSigned: {}
+---
+# Enable in platform values.yaml
+dynamo-operator:
+  webhook:
+    certManager:
+      enabled: true
+      issuerRef:
+        kind: Issuer
+        name: selfsigned-issuer
+```
+
+---
+
+### External Certificates
+
+Bring your own certificates for custom PKI requirements.
+
+#### Steps
+
+1. **Create certificate secret manually**:
+
+```bash
+kubectl create secret tls <release>-webhook-server-cert \
+  --cert=tls.crt \
+  --key=tls.key \
+  -n <namespace>
+
+# Also add ca.crt to the secret
+kubectl patch secret <release>-webhook-server-cert -n <namespace> \
+  --type='json' \
+  -p='[{"op": "add", "path": "/data/ca.crt", "value": "'$(base64 -w0 < ca.crt)'"}]'
+```
+
+2. **Configure operator to use external secret**:
+
+```yaml
+dynamo-operator:
+  webhook:
+    certificateSecret:
+      external: true
+    caBundle: <base64-encoded-ca-cert>  # Must manually specify
+```
+
+3. **Deploy operator**:
+
+```bash
+helm install dynamo-platform . -n <namespace> -f values.yaml
+```
+
+#### Certificate Requirements
+
+- **Secret name**: Must match `webhook.certificateSecret.name` (default: `webhook-server-cert`)
+- **Secret keys**: `tls.crt`, `tls.key`, `ca.crt`
+- **Certificate SAN**: Must include `<service-name>.<namespace>.svc`
+  - Example: `dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc`
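+
+A Secret that satisfies these requirements has the shape sketched below, which can be applied as a manifest instead of the create-and-patch commands in step 1. The placeholders stand for your own base64-encoded PEM data.
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: <release>-webhook-server-cert
+  namespace: <namespace>
+type: kubernetes.io/tls
+data:
+  tls.crt: <base64-encoded server certificate>
+  tls.key: <base64-encoded private key>
+  ca.crt: <base64-encoded CA certificate>
+```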
+
+---
+
+## Multi-Operator Deployments
+
+The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**.
+
+### Scenario
+
+```
+Cluster:
+├─ Operator A (cluster-wide, namespace: platform-system)
+│  └─ Validates all namespaces EXCEPT team-a
+└─ Operator B (namespace-restricted, namespace: team-a)
+   └─ Validates only team-a namespace
+```
+
+### How It Works
+
+1. **Namespace-restricted operator** creates a Lease in its namespace
+2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock`
+3. **Cluster-wide operator** skips validation for namespaces with active Leases
+4. **Namespace-restricted operator** validates resources in its namespace
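+
+The Lease from step 1 is a standard `coordination.k8s.io/v1` object. In the sketch below, the Lease name and namespace come from this document, while the holder identity, duration, and timestamp are illustrative values rather than the operator's real defaults.
+
+```yaml
+apiVersion: coordination.k8s.io/v1
+kind: Lease
+metadata:
+  name: dynamo-operator-ns-lock
+  namespace: team-a
+spec:
+  holderIdentity: team-a-operator              # hypothetical identity
+  leaseDurationSeconds: 30                     # illustrative value
+  renewTime: "2026-01-26T12:00:00.000000Z"     # illustrative; refreshed while the operator is healthy
+```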
+
+### Lease Configuration
+
+The lease mechanism is **automatically configured** based on deployment mode:
+
+```yaml
+# Cluster-wide operator (default)
+namespaceRestriction:
+  enabled: false
+# → Watches for leases in all namespaces
+# → Skips validation for namespaces with active leases
+
+# Namespace-restricted operator
+namespaceRestriction:
+  enabled: true
+  namespace: team-a
+# → Creates lease in team-a namespace
+# → Does NOT check for leases (no cluster permissions)
+```
+
+### Deployment Example
+
+```bash
+# 1. Deploy cluster-wide operator
+helm install platform-operator dynamo-platform \
+  -n platform-system \
+  --set dynamo-operator.namespaceRestriction.enabled=false
+
+# 2. Deploy namespace-restricted operator for team-a
+helm install team-a-operator dynamo-platform \
+  -n team-a \
+  --set dynamo-operator.namespaceRestriction.enabled=true \
+  --set dynamo-operator.namespaceRestriction.namespace=team-a
+```
+
+### ValidatingWebhookConfiguration Naming
+
+The webhook configuration name reflects the deployment mode:
+
+- **Cluster-wide**: `<release>-validating`
+- **Namespace-restricted**: `<release>-validating-<namespace>`
+
+Example:
+
+```bash
+# Cluster-wide
+platform-operator-validating
+
+# Namespace-restricted (team-a)
+team-a-operator-validating-team-a
+```
+
+This allows multiple webhook configurations to coexist without conflicts.
+
+### Lease Health
+
+If the namespace-restricted operator is deleted or becomes unhealthy:
+- Lease expires after `leaseDuration + gracePeriod` (default: ~30 seconds)
+- Cluster-wide operator automatically resumes validation for that namespace
+
+---
+
+## Troubleshooting
+
+### Webhook Not Called
+
+**Symptoms:**
+- Invalid resources are accepted
+- No validation errors in logs
+
+**Checks:**
+
+1. **Verify webhook is enabled**:
+```bash
+kubectl get validatingwebhookconfiguration | grep dynamo
+```
+
+2. **Check webhook configuration**:
+```bash
+kubectl get validatingwebhookconfiguration <name> -o yaml
+# Verify:
+# - caBundle is present and non-empty
+# - clientConfig.service points to correct service
+# - webhooks[].namespaceSelector matches your namespace
+```
+
+3. **Verify webhook service exists**:
+```bash
+kubectl get service -n <namespace> | grep webhook
+```
+
+4. **Check operator logs for webhook startup**:
+```bash
+kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
+# Should see: "Webhooks are enabled - webhooks will validate, controllers will skip validation"
+# Should see: "Starting webhook server"
+```
+
+---
+
+### Connection Refused Errors
+
+**Symptoms:**
+```
+Error from server (InternalError): Internal error occurred: failed calling webhook:
+Post "https://...webhook-service...:443/validate-...": dial tcp ...:443: connect: connection refused
+```
+
+**Checks:**
+
+1. **Verify operator pod is running**:
+```bash
+kubectl get pods -n <namespace> -l app.kubernetes.io/name=dynamo-operator
+```
+
+2. **Check webhook server is listening**:
+```bash
+# Port-forward to pod
+kubectl port-forward -n <namespace> pod/<operator-pod> 9443:9443
+
+# In another terminal, test connection
+curl -k https://localhost:9443/validate-nvidia-com-v1alpha1-dynamocomponentdeployment
+# Should NOT get "connection refused"
+```
+
+3. **Verify webhook port in deployment**:
+```bash
+kubectl get deployment -n <namespace> <release>-dynamo-operator -o yaml | grep -A5 "containerPort: 9443"
+```
+
+4. **Check for webhook initialization errors**:
+```bash
+kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i error
+```
+
+---
+
+### Certificate Errors
+
+**Symptoms:**
+```
+Error from server (InternalError): Internal error occurred: failed calling webhook:
+x509: certificate signed by unknown authority
+```
+
+**Checks:**
+
+1. **Verify caBundle is present**:
+```bash
+kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
+# Should output a valid PEM certificate
+```
+
+2. **Verify certificate secret exists**:
+```bash
+kubectl get secret -n <namespace> <release>-webhook-server-cert
+```
+
+3. **Check certificate validity**:
+```bash
+kubectl get secret -n <namespace> <release>-webhook-server-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
+# Check:
+# - Not expired
+# - SAN includes: <service-name>.<namespace>.svc
+```
+
+4. **Check CA injection job logs**:
+```bash
+kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
+```
+
+---
+
+### Helm Hook Job Failures
+
+**Symptoms:**
+- `helm install` or `helm upgrade` hangs or fails
+- Certificate generation errors
+
+**Checks:**
+
+1. **List hook jobs**:
+```bash
+kubectl get jobs -n <namespace> | grep webhook
+```
+
+2. **Check job logs**:
+```bash
+# Certificate generation
+kubectl logs -n <namespace> job/<release>-webhook-cert-gen-<revision>
+
+# CA injection
+kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
+```
+
+3. **Check RBAC permissions**:
+```bash
+# Verify ServiceAccount exists
+kubectl get sa -n <namespace> <release>-webhook-ca-inject
+
+# Verify ClusterRole and ClusterRoleBinding exist
+kubectl get clusterrole <release>-webhook-ca-inject
+kubectl get clusterrolebinding <release>-webhook-ca-inject
+```
+
+4. **Manual cleanup**:
+```bash
+# Delete failed jobs
+kubectl delete job -n <namespace> <release>-webhook-cert-gen-<revision>
+kubectl delete job -n <namespace> <release>-webhook-ca-inject-<revision>
+
+# Retry helm upgrade
+helm upgrade <release> dynamo-platform -n <namespace>
+```
+
+---
+
+### Validation Errors Not Clear
+
+**Symptoms:**
+- Webhook rejects resource but error message is unclear
+
+**Solution:**
+
+Check operator logs for detailed validation errors:
+
+```bash
+kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep "validate create\|validate update"
+```
+
+Webhook logs include:
+- Resource name and namespace
+- Validation errors with context
+- Warnings for immutable field changes
+
+---
+
+### Stuck Deleting Resources
+
+**Symptoms:**
+- Resource stuck in "Terminating" state
+- Webhook blocks finalizer removal
+
+**Solution:**
+
+The webhook automatically skips validation for resources that are being deleted. If a resource is still stuck:
+
+1. **Check if webhook is blocking**:
+```bash
+kubectl describe <resource-type> <name> -n <namespace>
+# Look for events mentioning webhook errors
+```
+
+2. **Temporarily disable webhook**:
+```bash
+# Option 1: Delete ValidatingWebhookConfiguration
+kubectl delete validatingwebhookconfiguration <name>
+
+# Option 2: Set failurePolicy to Ignore
+kubectl patch validatingwebhookconfiguration <name> \
+  --type='json' \
+  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
+```
+
+3. **Delete resource again**:
+```bash
+kubectl delete <resource-type> <name> -n <namespace>
+```
+
+4. **Re-enable webhook**:
+```bash
+helm upgrade <release> dynamo-platform -n <namespace>
+```
+
+---
+
+## Best Practices
+
+### Production Deployments
+
+1. ✅ **Keep webhooks enabled** (default) for real-time validation
+2. ✅ **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
+3. ✅ **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
+4. ✅ **Use cert-manager** for automated certificate lifecycle in large deployments
+5. ✅ **Test webhook configuration** in staging before production
+
+### Development Deployments
+
+1. ✅ **Disable webhooks** for rapid iteration if needed
+2. ✅ **Use `failurePolicy: Ignore`** if webhook availability is problematic
+3. ✅ **Keep automatic certificates** (simpler than cert-manager for dev)
+
+### Multi-Tenant Deployments
+
+1. ✅ **Deploy one cluster-wide operator** for platform-wide validation
+2. ✅ **Deploy namespace-restricted operators** for tenant-specific namespaces
+3. ✅ **Monitor lease health** to ensure coordination works correctly
+4. ✅ **Use unique release names** per namespace to avoid naming conflicts
+
+---
+
+## Additional Resources
+
+- [Kubernetes Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
+- [cert-manager Documentation](https://cert-manager.io/docs/)
+- [Kubebuilder Webhook Tutorial](https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html)
+- [CEL Validation Rules](https://kubernetes.io/docs/reference/using-api/cel/)
+
+---
+
+## Support
+
+For issues or questions:
+- Check the [Troubleshooting](#troubleshooting) section
+- Review operator logs: `kubectl logs -n <namespace> deployment/<release>-dynamo-operator`
+- Open an issue on GitHub
+
--- a/fern/pages/kvbm/kvbm-architecture.md
+++ b/fern/pages/kvbm/kvbm-architecture.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "KVBM Architecture"
+---
+
+The KVBM serves as a critical infrastructure component for scaling LLM inference workloads efficiently. By cleanly separating runtime logic from memory management, and by enabling distributed block sharing, KVBM lays the foundation for high-throughput, multi-node, and memory-disaggregated AI systems.
+
+![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../../assets/img/kvbm-architecture.png)
+**High-level layered architecture view of the Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem**
+
+The KVBM has three primary logical layers. In the top layer, the LLM inference runtimes (TRTLLM, vLLM, and SGLang) integrate with the Dynamo KVBM module through dedicated connector modules. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
+
+The middle layer, the KVBM layer, encapsulates the core logic of the KV block manager and serves as the runtime substrate for managing block memory. The KVBM adapter layer normalizes the representations and data layout of incoming requests across runtimes and forwards them to the core memory manager. The KVBM and the core modules implement the required internal functionality, such as table lookups, memory allocation, block layout management, block lifecycle and state transitions, and policy-based block reuse or eviction. The KVBM layer also provides the abstractions that external components need to override or augment its behavior.
+
+The last layer, the NIXL layer, provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, and dynamic block registration and metadata exchange, and it provides a plugin interface for storage backends.
+
+NIXL integrates with several backends:
+
+- Block memory (for example, GPU HBM, host DRAM, remote DRAM, and local SSD when exposed as a block device)
+- Local file system (for example, POSIX)
+- Remote file system (for example, NFS)
+- Object stores (for example, S3-compatible)
+- Cloud storage (for example, blob storage APIs)
+
+**[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** abstracts away the registration and integration complexity for each backend through a custom, optimizable plugin architecture. It enables memory blocks to be published, serialized, and accessed remotely, allowing the disaggregation of compute and memory across nodes. Combined with the Dynamo KV Block Manager (KVBM), storage providers no longer need to retrofit or optimize individual LLM inference engines. Instead, they can focus on tuning their own stack, providing optimized endpoints, knowing that integration is smooth, standardized, and efficient. And for those who *do* want to go further, Dynamo KVBM offers a clean separation of concerns, making custom optimization not only possible, but simple.
\ No newline at end of file
--- a/fern/pages/kvbm/kvbm-components.md
+++ b/fern/pages/kvbm/kvbm-components.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "Understanding KVBM components"
+---
+
+The KVBM design takes inspiration from the KV block managers used in vLLM and SGLang, with added influence from memory tiering strategies long used in general GPU programming. For more details, see [KVBM Reading](kvbm-reading.md). The figure below illustrates the internal components of KVBM.
+
+![Internal Components of Dynamo KVBM. ](../../assets/img/kvbm-components.png)
+**Internal Components of Dynamo KVBM**
+
+## KVBM Components
+### Core
+- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
+- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
+- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
+- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
+- **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing.
+
+### Layouts and Blocks
+- **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
+- **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`.
+
+### Transfer Manager
+- **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).
+
+### Storage & Pools
+- **Device Pool (G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
+- **Host Pool (G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
+- **Disk Pool (G3)**: Local NVMe SSD-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.
+
+## KVBM Data Flows
+![KVBM Data Flows. ](../../assets/img/kvbm-data-flows.png)
+**KVBM Data Flows from device to other memory hierarchies**
+
+**Device → Host (Offload)**
+* Triggered when the connector scheduler explicitly requests an offload.
+* Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy.
+* Host pool registers the new immutable block (dedup by sequence hash).
+
+**Host → Disk (Offload)**
+* Local Disk: NIXL Write via POSIX; GDS when available.
+* Remote Disk (network file systems such as NFS, Lustre, or GPFS): NIXL Write via POSIX to the mounted file system; batching and concurrency are handled the same way as for local disk.
+* Triggered on registered host blocks or explicit offload requests.
+* Worker allocates a Disk block and performs NIXL Write (Host→Disk).
+* Disk pool registers the new immutable block (dedup by sequence hash).
+
+**Host → Device (Onboard)**
+* Called to bring a host block into GPU memory.
+* Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy.
+* Device pool registers the new immutable block.
+
+**Disk → Device (Onboard)**
+* Called to bring a disk block directly into GPU memory.
+* Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS.
+* Device pool registers the new immutable block.
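The four flows above share one pattern: copy the block into the destination tier, then register it there with deduplication by sequence hash. A minimal Python sketch of that pattern follows. The `Pool` class and `transfer` helper are hypothetical stand-ins rather than KVBM types, and a plain byte copy takes the place of the CUDA and NIXL transfers.

```python
class Pool:
    """Toy tier pool keyed by sequence hash (illustrative, not the KVBM type)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # sequence_hash -> immutable block bytes

    def register(self, sequence_hash, data):
        """Register a block; return False if the hash is already present."""
        if sequence_hash in self.blocks:
            return False  # dedup: keep the existing block
        self.blocks[sequence_hash] = bytes(data)
        return True


def transfer(src, dst, sequence_hash):
    """Copy one registered block from src to dst and register it there."""
    data = src.blocks[sequence_hash]  # KeyError if src does not hold the block
    return dst.register(sequence_hash, data)


device, host, disk = Pool("G1"), Pool("G2"), Pool("G3")
device.register(0xA1, b"kv-bytes")

stored_on_host = transfer(device, host, 0xA1)   # Device -> Host offload
duplicate = transfer(device, host, 0xA1)        # repeated offload is deduplicated
stored_on_disk = transfer(host, disk, 0xA1)     # Host -> Disk offload

fresh_device = Pool("G1")
onboarded = transfer(disk, fresh_device, 0xA1)  # Disk -> Device onboard
```

The boolean returned by `register` is only a convenience for this sketch; it makes the dedup behavior easy to observe.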
--- a/fern/pages/kvbm/kvbm-design-deepdive.md
+++ b/fern/pages/kvbm/kvbm-design-deepdive.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "KVBM components"
+---
+
+The design of the KVBM is inspired by the vLLM and SGLang KV block managers, with a twist drawn from memory tiering designs long used in general GPU programming. See [KVBM Reading](kvbm-reading.md). The following figure shows the internal architecture of KVBM and how it works across workers using NIXL.
+
+![Internal architecture and key modules in the Dynamo KVBM. ](../../assets/img/kvbm-internal-arch.png)
+**Internal architecture and key modules in the Dynamo KVBM**
+
+## KvBlockManager as Orchestration Layer
+
+The `KvBlockManager<H, D>` acts as a coordinator across the host (CPU), device (GPU), and remote memory tiers by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are the key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store and is unaware of its internal layout optimizations.
+
+`KvBlockManager<H, D>` owns:
+
+* A device-side `BlockPool<Device>`
+* A host-side `BlockPool<Host>`
+* A remote NIXL agent that supports communication and memory sharing across nodes
+* A block set registry for remote lookup and import/export of block metadata
+
+Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
+
+## Block Layout and Memory Mapping
+
+Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
+
+
+```none
+block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
+```
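As a worked example of the stride formula, here is a minimal Python sketch. The function names and the sample sizes (3 layers, a 1000-byte layer stride, 256-byte alignment) are assumptions chosen for illustration, not values taken from KVBM.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return ((value + alignment - 1) // alignment) * alignment


def block_stride_in_bytes(num_layers: int, layer_stride: int, alignment: int) -> int:
    """Distance in bytes between the starts of two consecutive blocks."""
    return align_up(num_layers * layer_stride, alignment)


def block_offset(block_idx: int, layer_idx: int, num_layers: int,
                 layer_stride: int, alignment: int) -> int:
    """Byte offset of one layer of one block from the start of the region."""
    stride = block_stride_in_bytes(num_layers, layer_stride, alignment)
    return block_idx * stride + layer_idx * layer_stride


# 3 layers x 1000 bytes = 3000 bytes, padded up to the next multiple of 256.
stride = block_stride_in_bytes(num_layers=3, layer_stride=1000, alignment=256)
offset = block_offset(block_idx=2, layer_idx=1, num_layers=3,
                      layer_stride=1000, alignment=256)
```

With these numbers the raw block size of 3000 bytes is padded to 3072, so layer 1 of block 2 starts at byte 2 × 3072 + 1000 = 7144.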
+
+
+Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
+
+* `DeviceStorage` → CUDA device buffer
+* `PinnedStorage` → page-locked host memory
+* `SystemStorage` → CPU heap memory (fallback/test)
+* `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)
+
+Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a `StorageAllocator`.
+
+## BlockPool and Memory Pools (Active and Inactive)
+
+Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, and so forth) tracks two sub-pools:
+
+* `ActivePool`: Contains blocks currently in use by sequences
+* `InactivePool`: Recycled blocks ready for allocation; think free list
+
+When a token block is requested (for example, `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
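The active/inactive split can be pictured with a small Python sketch. This is a toy model under stated assumptions: the class name `ToyBlockPool`, the `release` method, and the use of integer block IDs are all hypothetical, and the real pool's eviction and reuse policy is not modeled.

```python
from collections import deque


class ToyBlockPool:
    """Toy pool with an inactive free list and an active set of block IDs."""

    def __init__(self, num_blocks: int) -> None:
        self.inactive = deque(range(num_blocks))  # recycled blocks, ready to allocate
        self.active = set()                       # blocks currently held by sequences

    def get_mutable_block(self) -> int:
        """Pop a block from the inactive pool and mark it active."""
        if not self.inactive:
            raise RuntimeError("no free blocks available")
        block_id = self.inactive.popleft()
        self.active.add(block_id)
        return block_id

    def release(self, block_id: int) -> None:
        """Return a block to the inactive pool (hypothetical helper)."""
        self.active.remove(block_id)
        self.inactive.append(block_id)


pool = ToyBlockPool(num_blocks=2)
first = pool.get_mutable_block()
second = pool.get_mutable_block()
pool.release(first)
reused = pool.get_mutable_block()
```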
+
+The state machine (`BlockState`) that tracks the block lifecycle transitions includes:
+
+| State | Description | Ownership | Valid Actions/Transitions |
+| ----- | ----- | ----- | ----- |
+| Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | init_sequence(salt_hash) → Partial |
+| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | add_token() / add_tokens() (accumulate); commit() → Complete; reset() → Reset |
+| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | register() → Registered; reset() → Reset |
+| Registered | Block is finalized and visible for reuse. Available in the deduplication cache and usable for lookups. | Shared ownership (global registry) | Auto drop() → triggers Remove event and transitions to Reset |
+
+This table lists the valid KVBM transitions:
+
+| From → To | Trigger | Validation |
+| ----- | ----- | ----- |
+| Reset → Partial | init_sequence(salt_hash) | Must not be in use |
+| Partial → Complete | commit() | Must be full |
+| Complete → Registered | register() | Must be finalized |
+| Registered → Reset | Drop of RegistrationHandle | Automatic |
+| Partial → Reset | Aborted sequence | Explicit or drop |
+| Complete → Reset | Invalidated | Explicit or drop |
+
+Consider this example lifecycle of a block in the KVBM; in it, a sequence requests a new KV block:
+
+1. Allocator pops from InactivePool → Block is in Reset
+2. `init_sequence()` → Transitions to Partial
+3. Tokens are appended → State remains Partial
+4. On full → `commit()` → State becomes Complete
+5. `register()` → Block is hashed and moved to Registered. The block can now be used for lookups.
+6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
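The lifecycle above can be sketched as a small state machine in Python. This is an illustrative model, not the KVBM implementation: the `ToyBlock` class is hypothetical, the method names mirror the tables, and the hash computed in `register()` is a placeholder for whatever hashing KVBM actually uses.

```python
from enum import Enum, auto


class BlockState(Enum):
    RESET = auto()
    PARTIAL = auto()
    COMPLETE = auto()
    REGISTERED = auto()


class ToyBlock:
    """Toy block that enforces the transitions listed in the tables."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.state = BlockState.RESET
        self.tokens = []
        self.salt_hash = None
        self.sequence_hash = None

    def _require(self, expected):
        if self.state is not expected:
            raise RuntimeError(f"expected {expected.name}, got {self.state.name}")

    def init_sequence(self, salt_hash):
        self._require(BlockState.RESET)          # Reset -> Partial
        self.salt_hash = salt_hash
        self.state = BlockState.PARTIAL

    def add_token(self, token):
        self._require(BlockState.PARTIAL)        # stays Partial
        if len(self.tokens) >= self.page_size:
            raise ValueError("block is full")
        self.tokens.append(token)

    def commit(self):
        self._require(BlockState.PARTIAL)        # Partial -> Complete
        if len(self.tokens) != self.page_size:
            raise ValueError("commit requires a full block")
        self.state = BlockState.COMPLETE

    def register(self):
        self._require(BlockState.COMPLETE)       # Complete -> Registered
        # Placeholder hash; the real sequence hash is computed differently.
        self.sequence_hash = hash((self.salt_hash, tuple(self.tokens)))
        self.state = BlockState.REGISTERED
        return self.sequence_hash

    def reset(self):
        # Models abort, invalidation, or the drop of the registration handle.
        self.tokens.clear()
        self.salt_hash = None
        self.sequence_hash = None
        self.state = BlockState.RESET


block = ToyBlock(page_size=2)
block.init_sequence(salt_hash=7)
block.add_token(101)
block.add_token(102)
block.commit()
block.register()
```

Calling a method from the wrong state raises `RuntimeError`, which is how this sketch expresses the validation column of the transition table.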
+
+## Lifecycle Management using RAII and Event Plane
+
+The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
+
+* Creating a `PublishHandle` triggers a Register event
+* Dropping it triggers a Remove event
+
+This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
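Python has no deterministic destructors, so the sketch below uses a context manager as a stand-in for RAII to show the Register/Remove pairing. The `ToyEventManager` and `ToyPublishHandle` classes and their methods are hypothetical; only the event names come from the text above.

```python
class ToyEventManager:
    """Collects published events in memory (stand-in for the events plane)."""

    def __init__(self):
        self.events = []

    def publish(self, kind, sequence_hash):
        self.events.append((kind, sequence_hash))


class ToyPublishHandle:
    """Publishes Register on creation and Remove exactly once on drop."""

    def __init__(self, manager, sequence_hash):
        self._manager = manager
        self._sequence_hash = sequence_hash
        self._dropped = False
        manager.publish("Register", sequence_hash)

    def drop(self):
        if not self._dropped:  # guard against a double drop
            self._dropped = True
            self._manager.publish("Remove", self._sequence_hash)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.drop()
        return False  # never swallow exceptions


manager = ToyEventManager()
with ToyPublishHandle(manager, sequence_hash=0xABC):
    pass  # the block is visible to subscribers inside this scope
```

Leaving the `with` scope plays the role of the handle going out of scope in Rust: no caller has to remember to publish the Remove event.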
+
+## Remote Memory Integration using NIXL
+
+The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
+
+* `nixl_register()`: Registers memory region with NIXL runtime
+* `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
+* `import_remote_blockset()`: Loads remote node's block layouts into the manager
+* `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
+
+`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
+
+The left side of the figure in [KVBM Components](kvbm-components.md) illustrates a bidirectional remote memory registration and layout synchronization protocol between workers (for example, Worker 1 and Worker 2) using NIXL. The following steps break down the process:
+
+1. *Agent Creation & Memory Registration:*
+
+   Each worker independently sets up a `NixlAgent`:
+    * Registers its memory regions (that is, device memory) through `nixl_register()`.
+    * These regions correspond to blocks managed in the local `BlockPool`.
+
+   Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
+
+2. *Metadata exchange:*
+
+   After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
+
+   Why is this step critical?
+   * LLM inference workloads often differ in *tensor parallel (TP)* configurations:
+     * Worker 1 might have TP=4, while Worker 2 has TP=8.
+     * Thus, even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ.
+   * The metadata exchange bridges this semantic mismatch by sharing:
+     * `LayoutConfig` (num_layers, page_size, inner_dim, dtype)
+     * `BlockSetID`
+     * Base address + stride information (including alignment)
+     * Device ID + memory type (host/device)
+   * Once the workers share metadata, each can reconstruct the layout on its side using `deserialize()`. This enables NIXL to:
+     * Understand where each layer/block resides
+     * Perform correct gather-scatter operations during RDMA-like transfers
+
+   Without this step, remote fetches would result in data corruption or misaligned tokens.
+
+3. *Serialization & Deserialization: Making Layouts Portable*
+
+   In the serialization stage, KVBM exports the layout, and `FullyContiguous::serialize()` encodes:
+   * FullyContiguousConfig
+   * base_offset
+   * Physical memory descriptors (NixlStorage), including:
+     * Memory type (VRAM, DRAM)
+     * Address & size
+     * Device ID
+
+   The system sends this using a NIXL transfer and then injects it into a KVBM scheduler state. In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into:
+   * A fully reconstructed memory layout view
+   * A local representation of a remote memory slice with correct offsets and size semantics
+
+   It also enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
+
+4. *Ownership handles and lifetime tracking*
+
+   Memory ownership in NIXL is tightly coupled with RAII-based handles:
+   * Registering a block returns a `PublishHandle`, which wraps a `RegistrationHandle`.
+   * On drop of this handle, an automatic Remove event is published, which:
+     * Deregisters the block from the NIXL layer
+     * Removes it from the remote block registry
+   * This ensures that, once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes.
+
+   This mechanism avoids:
+   * Stale memory access
+   * Dangling pointers on GPU or host
+   * Manual deregistration bugs
+
+   The system can batch and publish registration events using a `Publisher`, optimizing performance under high concurrency.
+
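To make the metadata exchange concrete, the following Python sketch round-trips a layout description and then resolves a block address on the receiving side. It is a simplified model under stated assumptions: the `ToyLayout` class, its JSON wire format, and the `dtype_bytes` field are hypothetical, and the real `SerializedNixlBlockLayout` carries NIXL descriptors that are not modeled here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ToyLayout:
    """Fields loosely follow the metadata listed in step 2 (illustrative only)."""
    num_layers: int
    page_size: int
    inner_dim: int
    dtype_bytes: int      # assumption: dtype reduced to its size in bytes
    block_set_id: int
    base_address: int
    alignment: int
    device_id: int
    memory_type: str      # "device" or "host"

    def serialize(self) -> bytes:
        """Encode the layout so it can be sent to another worker."""
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def deserialize(cls, payload: bytes) -> "ToyLayout":
        """Rebuild the layout view from the received bytes."""
        return cls(**json.loads(payload.decode("utf-8")))

    def layer_stride(self) -> int:
        return self.page_size * self.inner_dim * self.dtype_bytes

    def block_stride(self) -> int:
        raw = self.num_layers * self.layer_stride()
        return -(-raw // self.alignment) * self.alignment  # round up to alignment

    def address_of(self, block_idx: int, layer_idx: int) -> int:
        """Address of one layer of one block in the owner's memory."""
        return (self.base_address
                + block_idx * self.block_stride()
                + layer_idx * self.layer_stride())


local = ToyLayout(num_layers=2, page_size=16, inner_dim=8, dtype_bytes=2,
                  block_set_id=1, base_address=0x1000, alignment=256,
                  device_id=0, memory_type="device")
remote_view = ToyLayout.deserialize(local.serialize())
remote_address = remote_view.address_of(block_idx=3, layer_idx=1)
```

Because both sides compute addresses from the same serialized fields, the receiving worker resolves exactly the address the owner would, which is the agreement the exchange is meant to guarantee.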
+
+## Storage backends and pluggability
+
+You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. We defer detailed integration guidance, since we collaborate with storage partners to simplify and standardize these integration paths.
+
+```mermaid
+---
+title: Example KVBM System Architecture
+---
+flowchart TD
+    A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
+
+    B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
+    B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
+
+    C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
+    D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
+```
+
+For now, the following breakdown provides a high-level understanding of how KVBM interacts with external storage using the NIXL storage interface and the Dynamo Event Plane:
+
+### NIXL Storage Interface (for Backend Integration)
+
+The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
+
+* `registerVolume(descriptor)`: Register a logical volume for KV cache data.
+* `unregisterVolume()`: Cleanly deregister and release volume mappings.
+* `get()` / `put()`: Block-level APIs used by KVBM to fetch and store token blocks.
+
+These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Please note that these APIs are still being finalized.
+
+### Dynamo Event Plane (Pub/Sub Coordination Layer)
+
+To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations. Two events are emitted:
+
+* StoreEvent: Emitted when a KV block is registered.
+* RemoveEvent: Emitted when a KV block is released or evicted.
+
+Each KVEvent (\~100 bytes) contains:
+
+* sequence_hash: Unique identifier of the KV block
+* prefix_hash: Prefix grouping for query-level aggregation
+* block_size: Size in bytes
+* storage_location: Logical volume identifier
+* event_type: Store or Remove
+* extra_metadata: Reserved fields for partner-specific optimization
+
+For scalability, the system batches and publishes these events periodically (for example, every \~10s, or dynamically based on system load).
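The event record and the batching behavior can be sketched in Python as follows. The `KVEvent` field names come from the list above; the field types, the `ToyEventBatcher` class, and its size and age thresholds are assumptions for illustration, and the publish callback stands in for a NATS publish call.

```python
from dataclasses import dataclass, field


@dataclass
class KVEvent:
    sequence_hash: int
    prefix_hash: int
    block_size: int
    storage_location: str
    event_type: str                      # "Store" or "Remove"
    extra_metadata: dict = field(default_factory=dict)


class ToyEventBatcher:
    """Flushes pending events when the batch is full or old enough."""

    def __init__(self, publish, max_batch=64, max_age_s=10.0):
        self._publish = publish          # callable that receives a list of events
        self._max_batch = max_batch
        self._max_age_s = max_age_s
        self._pending = []
        self._oldest = None              # timestamp of the oldest pending event

    def add(self, event, now):
        if not self._pending:
            self._oldest = now
        self._pending.append(event)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        if not self._pending:
            return
        too_many = len(self._pending) >= self._max_batch
        too_old = now - self._oldest >= self._max_age_s
        if too_many or too_old:
            self._publish(list(self._pending))
            self._pending.clear()
            self._oldest = None


batches = []
batcher = ToyEventBatcher(batches.append, max_batch=2, max_age_s=10.0)
batcher.add(KVEvent(1, 10, 4096, "vol-a", "Store"), now=0.0)
batcher.add(KVEvent(2, 10, 4096, "vol-a", "Store"), now=1.0)   # size-triggered flush
batcher.add(KVEvent(1, 10, 4096, "vol-a", "Remove"), now=2.0)
batcher.maybe_flush(now=5.0)    # too young, stays pending
batcher.maybe_flush(now=12.0)   # age-triggered flush
```

Timestamps are passed in explicitly so the sketch stays deterministic; a real publisher would read a clock and run the age check on a timer.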
+
+### A conceptual design of a storage advisor
+
+This section provides an overview for the storage provider who is interested in integrating as a custom backend to KVBM and providing optimized performance. **Please note, this is optional for KVBM integration with a backend.**
+
+External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
+
+* Storage volumes are pre-provisioned and mounted by the storage provider.
+* These volumes are then registered with Dynamo through the NIXL Storage Agent using registerVolume() APIs. Dynamo itself does not manage mounts or provisioning.
+* The Dynamo KV Block Manager interacts only with logical block-level APIs (that is, get() and put()).
+* In parallel, the Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel.
+* Storage vendors implement a lightweight subscriber process that listens to these events without interfering with the KV Manager's runtime behavior.
+* This decoupling ensures that external storage systems can optimize block placement and lifecycle tracking without modifying or instrumenting the core Dynamo codebase.
+
+Now, to enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream. Here is a high-level conceptual design:
+
+* On receiving a StoreEvent, the storage system:
+  * Inserts a record into an internal prefix tree, hash map, or LRU index.
+  * This record includes the prefix_hash and sequence_hash, which logically identify the token block and its grouping.
+  * Associated metadata (for example, block_size, storage_location) is also captured.
+* On receiving a RemoveEvent, the system:
+  * Deletes or prunes the corresponding record from its index.
+  * Optionally triggers cleanup or tier migration workflows.
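A subscriber-side index along these lines might look like the Python sketch below. It uses a hash map plus a prefix grouping rather than a prefix tree, and the `ToyBlockIndex` class and its method names are hypothetical; a vendor implementation would choose its own structures.

```python
from collections import defaultdict


class ToyBlockIndex:
    """Tracks live KV blocks from Store/Remove events (illustrative only)."""

    def __init__(self):
        self.by_sequence = {}               # sequence_hash -> record
        self.by_prefix = defaultdict(set)   # prefix_hash -> set of sequence hashes

    def on_store(self, sequence_hash, prefix_hash, block_size, storage_location):
        self.by_sequence[sequence_hash] = {
            "prefix_hash": prefix_hash,
            "block_size": block_size,
            "storage_location": storage_location,
        }
        self.by_prefix[prefix_hash].add(sequence_hash)

    def on_remove(self, sequence_hash):
        """Prune the record; return False if the block was not indexed."""
        record = self.by_sequence.pop(sequence_hash, None)
        if record is None:
            return False
        siblings = self.by_prefix[record["prefix_hash"]]
        siblings.discard(sequence_hash)
        if not siblings:
            del self.by_prefix[record["prefix_hash"]]  # drop empty groups
        return True

    def lookup(self, sequence_hash):
        return self.by_sequence.get(sequence_hash)


index = ToyBlockIndex()
index.on_store(sequence_hash=1, prefix_hash=10, block_size=4096, storage_location="vol-a")
index.on_store(sequence_hash=2, prefix_hash=10, block_size=4096, storage_location="vol-b")
removed = index.on_remove(1)
missing = index.on_remove(99)
```

Returning `False` for an unknown hash lets the subscriber tolerate Remove events that arrive for blocks it never saw, for example after a restart.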
+
+This event-driven indexing allows the storage system to track which KV blocks are live and where they belong—enabling low-latency lookup, efficient space reclamation, and multi-tier coordination. With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies, such as:
+
+* Hot block promotion: Frequently accessed KV blocks can be migrated to fast SSD volumes.
+* Cold block demotion: Infrequently used blocks can be demoted to slower storage (for example, HDDs, cloud object storage).
+* Proactive compaction: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks.
+
+These optimizations are performed entirely outside of Dynamo, with the assumption that storage providers adhere to SLA guarantees and volume availability.
+
+Critically, this entire system is designed to be non-intrusive:
+
+* The Dynamo KV Block Manager remains agnostic to how data is stored or optimized.
+* The Event Plane doesn't block or intercept any critical path of inference.
+* Storage vendors are given the freedom to innovate and optimize without requiring changes to the inference runtime.
+
+This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
--- a/fern/pages/kvbm/kvbm-integrations.md
+++ b/fern/pages/kvbm/kvbm-integrations.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "KVBM Integrations"
+---
+
+KVBM integrates with inference frameworks (vLLM, TRTLLM, SGLang) through Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
+The interface has two components, Scheduler and Worker. The Scheduler (leader) orchestrates KV block offload and onboard and builds the metadata that specifies transfer data for the workers. It also maintains hooks for handling asynchronous transfer completion. The Worker reads the metadata built by the Scheduler (leader) and performs asynchronous onboarding and offloading at the end of the forward pass.
+
+## Typical KVBM Integrations
+
+The following figure shows a typical integration of KVBM with an inference framework, using vLLM as an example.
+
+![vLLM KVBM Integration ](../../assets/img/kvbm-integrations.png)
+**vLLM KVBM Integration**
+
+
+## How to run KVBM with Frameworks
+* Instructions to [run KVBM in vLLM](vllm-setup.md)
+* Instructions to [run KVBM with TRTLLM](trtllm-setup.md)
+
+## Onboarding
+![Onboarding blocks from Host to Device](../../assets/img/kvbm-onboard-host2device.png)
+**Onboarding blocks from Host to Device**
+![Onboarding blocks from Disk to Device](../../assets/img/kvbm-onboard-disk2device.png)
+**Onboarding blocks from Disk to Device**
+
+## Offloading
+![Offloading blocks from Device to Host and Disk](../../assets/img/kvbm-offload.png)
+**Offloading blocks from Device to Host and Disk**
--- a/fern/pages/kvbm/kvbm-intro.md
+++ b/fern/pages/kvbm/kvbm-intro.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: "KV Block Manager"
+---
+
+The Dynamo KV Block Manager (KVBM) is a scalable runtime component
+designed to handle memory allocation, management, and remote sharing of
+Key-Value (KV) blocks for inference tasks across heterogeneous and
+distributed environments. It acts as a unified memory layer for
+frameworks like vLLM, SGLang, and TRT-LLM.
+
+It offers:
+
+- A **unified memory API** that spans GPU memory (in the future), pinned
+  host memory, remote RDMA-accessible memory, local or distributed pools
+  of SSDs, and remote file/object/cloud storage systems.
+- Support for evolving **block lifecycles** (allocate → register →
+  match) with event-based state transitions that storage can subscribe
+  to.
+- Integration with **NIXL**, a dynamic memory exchange layer used for
+  remote registration, sharing, and access of memory blocks over
+  RDMA/NVLink.
+
+The Dynamo KV Block Manager serves as a reference implementation that
+emphasizes modularity and extensibility. Its pluggable design enables
+developers to customize components and optimize for specific
+performance, memory, and deployment needs.
+
+