docs: migrate Fern docs from fern/ into docs/ (#6206)

Signed-off-by: Jont828 <jt572@cornell.edu>

39d645e5 · Jonathan Tong · GitHub · d381e6ff · d381e6ff · d381e6ff
Unverified Commit 39d645e5 authored Feb 11, 2026 by Jonathan Tong Committed by GitHub Feb 11, 2026
20 changed files
--- a/docs/kubernetes/chrek/standalone.md
+++ b/docs/kubernetes/chrek/standalone.md
-# ChReK Standalone Usage Guide
-> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
-This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
-## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Step 1: Deploy ChReK](#step-1-deploy-chrek)
- [Step 2: Build Checkpoint-Enabled Images](#step-2-build-checkpoint-enabled-images)
- [Step 3: Create Checkpoint Jobs](#step-3-create-checkpoint-jobs)
- [Step 4: Restore from Checkpoints](#step-4-restore-from-checkpoints)
- [Environment Variables Reference](#environment-variables-reference)
- [Checkpoint Flow Explained](#checkpoint-flow-explained)
- [Troubleshooting](#troubleshooting)
---
-## Overview
-When using ChReK standalone, you are responsible for:
-1. **Deploying the ChReK Helm chart** (DaemonSet + PVC)
-2. **Building checkpoint-enabled container images** with the restore entrypoint
-3. **Creating checkpoint jobs** with the correct environment variables
-4. **Creating restore pods** that detect and use the checkpoints
-The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.
---
-## Prerequisites
- Kubernetes cluster with:
-  - NVIDIA GPUs with checkpoint support
-  - **Privileged security context allowed** (⚠️ required for CRIU - see [Security Considerations](#security-considerations))
-  - PVC storage (ReadWriteMany recommended for multi-node)
- Docker or compatible container runtime for building images
- Access to the ChReK source code: `deploy/chrek/`
-### Security Considerations
-⚠️ **Important**: ChReK restore operations **require privileged mode**, which has significant security implications:
- **Privileged containers** can access all host devices and bypass most security restrictions
- This may violate security policies in production environments
- Privileged containers, if compromised, can potentially compromise node security
-**Recommended for:**
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
-**Not recommended for:**
- ❌ Multi-tenant clusters without proper isolation
- ❌ Security-sensitive production workloads without risk assessment
- ❌ Environments with strict security compliance requirements
-### Technical Limitations
-⚠️ **Current Restrictions:**
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
---
-## Step 1: Deploy ChReK
-### Install the Helm Chart
-```bash
-# Clone the repository
-git clone https://github.com/ai-dynamo/dynamo.git
-cd dynamo
-# Install ChReK in your namespace
-helm install chrek ./deploy/helm/charts/chrek \
-  --namespace my-app \
-  --create-namespace \
-  --set storage.pvc.size=100Gi \
-  --set storage.pvc.storageClass=your-storage-class
-```
-### Verify Installation
-```bash
-# Check the DaemonSet is running
-kubectl get daemonset -n my-app
-# NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
-# chrek-agent   3         3         3       3            3
-# Check the PVC is bound
-kubectl get pvc -n my-app
-# NAME        STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
-# chrek-pvc   Bound    pvc-xyz    100Gi      RWX            your-storage-class
-```
---
-## Step 2: Build Checkpoint-Enabled Images
-ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.
-### Quick Start: Using the Placeholder Target (Recommended)
-```bash
-cd deploy/chrek
-# Define your images
-export BASE_IMAGE="your-app:latest"           # Your existing application image
-export RESTORE_IMAGE="your-app:checkpoint-enabled"  # Output checkpoint-enabled image
-# Build using the placeholder target
-docker build \
-  --target placeholder \
-  --build-arg BASE_IMAGE="$BASE_IMAGE" \
-  -t "$RESTORE_IMAGE" \
-  .
-# Push to your registry
-docker push "$RESTORE_IMAGE"
-```
-**Example with a Dynamo vLLM image:**
-```bash
-cd deploy/chrek
-export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
-export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"
-docker build \
-  --target placeholder \
-  --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
-  -t "$RESTORE_IMAGE" \
-  .
-```
-### What the Placeholder Target Does
-The ChReK Dockerfile's `placeholder` stage automatically:
- ✅ Builds the restore-entrypoint binary
- ✅ Injects it into `/usr/local/bin/restore-entrypoint`
- ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/`
- ✅ Sets executable permissions
- ✅ Configures the entrypoint to detect and restore checkpoints
- ✅ Preserves your original application CMD
-### Alternative: Manual Multi-Stage Build
-If you need more control, you can create your own Dockerfile:
-```dockerfile
-# Stage 1: Build restore-entrypoint
-FROM golang:1.23-alpine AS restore-builder
-WORKDIR /build
-COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
-COPY deploy/chrek/pkg ./pkg
-COPY deploy/chrek/go.mod deploy/chrek/go.sum ./
-RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint
-# Stage 2: Your application image
-FROM your-base-image:latest
-# Copy restore-entrypoint
-COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint
-# Copy smart-entrypoint.sh
-COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
-RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint
-# Set smart-entrypoint as the default entrypoint
-ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]
-# Your application command (becomes CMD, can be overridden)
-CMD ["python", "your_app.py"]
-```
-> **💡 Tip**: Using the `placeholder` target is the recommended approach as it's maintained with the ChReK codebase and ensures compatibility.
---
-## Step 3: Create Checkpoint Jobs
-A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.
-### Required Environment Variables
-Your checkpoint job MUST set these environment variables:
-| Variable | Description | Example |
-|----------|-------------|---------|
-| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where DaemonSet writes completion signal | `/checkpoint-signal/my-checkpoint.done` |
-| `DYN_READY_FOR_CHECKPOINT_FILE` | Path where your app signals it's ready | `/tmp/ready-for-checkpoint` |
-| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
-| `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` |
-| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |
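A checkpoint job that is missing one of these variables may fail in ways that are hard to diagnose, so it can help to check them at startup. The following is a minimal sketch, not part of ChReK: the variable names come from the table above, while `missing_checkpoint_vars` is a hypothetical helper.

```python
import os

# Variable names taken from the table above; the helper itself is hypothetical.
REQUIRED_CHECKPOINT_VARS = (
    "DYN_CHECKPOINT_SIGNAL_FILE",
    "DYN_READY_FOR_CHECKPOINT_FILE",
    "DYN_CHECKPOINT_HASH",
    "DYN_CHECKPOINT_LOCATION",
    "DYN_CHECKPOINT_STORAGE_TYPE",
)


def missing_checkpoint_vars(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_CHECKPOINT_VARS if not env.get(name)]
```

Calling `missing_checkpoint_vars()` with no arguments inspects the process environment; an application could log the returned names and exit early if the list is not empty.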
-### Required Labels
-Add this label to enable DaemonSet checkpoint detection:
-```yaml
-labels:
-  nvidia.com/checkpoint-source: "true"
-```
-### Example Checkpoint Job
-```yaml
-apiVersion: batch/v1
-kind: Job
-metadata:
-  name: checkpoint-my-model
-  namespace: my-app
-spec:
-  template:
-    metadata:
-      labels:
-        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
-    spec:
-      restartPolicy: Never
-      # Init container to clean up stale signal files
-      initContainers:
-      - name: cleanup-signal-file
-        image: busybox:latest
-        command:
-        - sh
-        - -c
-        - |
-          rm -f /checkpoint-signal/my-checkpoint.done || true
-          echo "Signal file cleanup complete"
-        volumeMounts:
-        - name: checkpoint-signal
-          mountPath: /checkpoint-signal
-      containers:
-      - name: main
-        image: my-app:checkpoint-enabled
-        # Security context required for CRIU
-        securityContext:
-          privileged: true
-          capabilities:
-            add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
-        # Readiness probe: Pod becomes Ready when model is loaded
-        # This is what triggers the DaemonSet to start checkpointing
-        readinessProbe:
-          exec:
-            command: ["sh", "-c", "cat ${DYN_READY_FOR_CHECKPOINT_FILE}"]
-          initialDelaySeconds: 15
-          periodSeconds: 2
-        # Remove liveness/startup probes for checkpoint jobs
-        # Model loading can take several minutes
-        livenessProbe: null
-        startupProbe: null
-        # Checkpoint-related environment variables
-        env:
-        - name: DYN_CHECKPOINT_SIGNAL_FILE
-          value: "/checkpoint-signal/my-checkpoint.done"
-        - name: DYN_READY_FOR_CHECKPOINT_FILE
-          value: "/tmp/ready-for-checkpoint"
-        - name: DYN_CHECKPOINT_HASH
-          value: "abc123def456"
-        - name: DYN_CHECKPOINT_LOCATION
-          value: "/checkpoints/abc123def456"
-        - name: DYN_CHECKPOINT_STORAGE_TYPE
-          value: "pvc"
-        # GPU request
-        resources:
-          limits:
-            nvidia.com/gpu: 1
-        # Required volume mounts
-        volumeMounts:
-        - name: checkpoint-storage
-          mountPath: /checkpoints
-        - name: checkpoint-signal
-          mountPath: /checkpoint-signal
-        - name: tmp
-          mountPath: /tmp
-      volumes:
-      - name: checkpoint-storage
-        persistentVolumeClaim:
-          claimName: chrek-pvc
-      - name: checkpoint-signal
-        hostPath:
-          path: /var/lib/chrek/signals
-          type: DirectoryOrCreate
-      - name: tmp
-        emptyDir: {}
-```
-### Application Code Requirements
-Your application must implement the checkpoint flow. Here's the pattern used by Dynamo vLLM:
-```python
-import os
-import time
-def main():
-    # 1. Check for checkpoint mode
-    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
-    ready_file = os.environ.get("DYN_READY_FOR_CHECKPOINT_FILE")
-    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE")
-    is_checkpoint_mode = signal_file is not None
-    if is_checkpoint_mode:
-        print("Checkpoint mode detected")
-        # 2. Load your model/application
-        model = load_model()
-        # 3. Optional: Put model to sleep to reduce memory footprint
-        # model.sleep()
-        # 4. Write ready file (read by the readiness probe; the DaemonSet acts once the pod is Ready)
-        if ready_file:
-            with open(ready_file, "w") as f:
-                f.write("ready")
-            print(f"Wrote checkpoint ready file: {ready_file}")
-        # 5. Log readiness messages (helps debugging)
-        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
-        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
-        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")
-        # 6. Wait for checkpoint completion OR restore detection
-        while True:
-            # Check if we've been restored (marker file created by restore entrypoint)
-            if restore_marker and os.path.exists(restore_marker):
-                print(f"Detected restore from checkpoint (marker: {restore_marker})")
-                # Continue with normal application flow
-                break
-            # Check if checkpoint is complete (signal file created by DaemonSet)
-            if os.path.exists(signal_file):
-                print(f"Checkpoint signal file detected: {signal_file}")
-                print("Checkpoint complete, exiting")
-                return  # Exit gracefully
-            time.sleep(1)
-    # Normal application flow (or post-restore flow)
-    run_application()
-```
-**Important Notes:**
-1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file:
-   ```yaml
-   readinessProbe:
-     exec:
-       command: ["sh", "-c", "cat ${DYN_READY_FOR_CHECKPOINT_FILE}"]
-     initialDelaySeconds: 15
-     periodSeconds: 2
-   ```
-   The ChReK DaemonSet triggers checkpointing when:
-   - Pod has `nvidia.com/checkpoint-source: "true"` label
-   - Pod status is `Ready` (readiness probe passes = ready file exists)
-2. **Restore Marker**: `restore-entrypoint` creates this file before the CRIU restore so that the restored process can detect that it was restored
-3. **Two Exit Paths**:
-   - **Signal file found**: Checkpoint complete, exit gracefully
-   - **Restore marker found**: Process was restored, continue running
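The wait loop and its two outcomes can also be factored into a small function that is easier to test. This is a sketch under assumptions rather than the Dynamo vLLM implementation: the function name, the return values, and the optional `timeout` are hypothetical additions. It tolerates an unset restore marker path, since a checkpoint job does not necessarily define `DYN_RESTORE_MARKER_FILE`.

```python
import os
import time


def wait_for_checkpoint_or_restore(signal_file, restore_marker=None,
                                   poll_interval=1.0, timeout=None):
    """Block until the restore marker or the signal file appears.

    Returns "restored" if the restore marker exists, "checkpointed" if the
    signal file exists, or "timeout" if neither appears within `timeout`
    seconds. A timeout of None waits forever.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        # Check the restore marker first, in the same order as the loop above.
        if restore_marker and os.path.exists(restore_marker):
            return "restored"
        if os.path.exists(signal_file):
            return "checkpointed"
        if deadline is not None and time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll_interval)
```

A caller would exit on `"checkpointed"` and continue into the normal application flow on `"restored"`.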
---
-## Step 4: Restore from Checkpoints
-Restore pods automatically detect and restore from checkpoints if they exist.
-### Example Restore Pod
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: my-app-restored
-  namespace: my-app
-spec:
-  restartPolicy: Never
-  containers:
-  - name: main
-    image: my-app:checkpoint-enabled
-    # Security context required for CRIU restore
-    securityContext:
-      privileged: true
-      capabilities:
-        add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
-    # Set checkpoint environment variables
-    env:
-    - name: DYN_CHECKPOINT_HASH
-      value: "abc123def456"  # Must match checkpoint job
-    - name: DYN_CHECKPOINT_PATH
-      value: "/checkpoints"  # Base path (hash appended automatically)
-    - name: DYN_RESTORE_MARKER_FILE
-      value: "/tmp/dynamo-restored"
-    # GPU request
-    resources:
-      limits:
-        nvidia.com/gpu: 1
-    # Mount checkpoint storage (READ-ONLY is fine for restore)
-    volumeMounts:
-    - name: checkpoint-storage
-      mountPath: /checkpoints
-      readOnly: true
-    - name: checkpoint-signal
-      mountPath: /checkpoint-signal
-  volumes:
-  - name: checkpoint-storage
-    persistentVolumeClaim:
-      claimName: chrek-pvc
-  - name: checkpoint-signal
-    hostPath:
-      path: /var/lib/chrek/signals
-      type: DirectoryOrCreate
-```
-### How Restore Works
-1. **Smart Entrypoint Detects Checkpoint**: The `smart-entrypoint.sh` checks if a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/`
-2. **Calls Restore Entrypoint**: If found, calls `/usr/local/bin/restore-entrypoint` which invokes CRIU
-3. **CRIU Restores Process**: The entire process tree is restored from the checkpoint, including GPU state
-4. **Application Continues**: Your application resumes exactly where it was checkpointed
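The decision made in the first two steps can be sketched in Python. This is an illustration only, not the contents of `smart-entrypoint.sh`: the function name and return values are hypothetical, the `/checkpoints` default is an assumption, and the `checkpoint.done` marker name is taken from the restore flow diagram in this guide.

```python
import os


def choose_startup_mode(env, marker_name="checkpoint.done"):
    """Return ("restore", checkpoint_dir) or ("cold-start", None)."""
    checkpoint_hash = env.get("DYN_CHECKPOINT_HASH")
    # Assumed default base path; the restore pod example sets it explicitly.
    base_path = env.get("DYN_CHECKPOINT_PATH", "/checkpoints")
    if not checkpoint_hash:
        return ("cold-start", None)
    # The hash is appended to the base path to locate the checkpoint.
    checkpoint_dir = os.path.join(base_path, checkpoint_hash)
    if os.path.isfile(os.path.join(checkpoint_dir, marker_name)):
        return ("restore", checkpoint_dir)
    return ("cold-start", None)
```

On `"restore"` the entrypoint would hand the directory to `restore-entrypoint`; on `"cold-start"` it would run the original command.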
---
-## Environment Variables Reference
-### Checkpoint Jobs
-| Variable | Required | Description |
-|----------|----------|-------------|
-| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
-| `DYN_READY_FOR_CHECKPOINT_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/ready-for-checkpoint`) |
-| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) |
-| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123`) |
-| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend: `pvc` (the only backend currently implemented; `s3` and `oci` are planned) |
-### Restore Pods
-| Variable | Required | Description |
-|----------|----------|-------------|
-| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) |
-| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
-| `DYN_RESTORE_MARKER_FILE` | Yes | Path for restore marker file |
-### Optional CRIU Tuning (Advanced)
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `CRIU_TIMEOUT` | `0` (unlimited) | CRIU operation timeout in seconds |
-| `CRIU_LOG_LEVEL` | `4` | CRIU log verbosity (0-4) |
-| `CRIU_WORK_DIR` | `/tmp` | CRIU working directory |
-| `CUDA_PLUGIN_DIR` | `/usr/local/lib/criu` | Path to CRIU CUDA plugin |
-| `CRIU_SKIP_IN_FLIGHT` | `false` | Skip in-flight TCP connections |
-| `CRIU_AUTO_DEDUP` | `false` | Enable auto-deduplication |
-| `CRIU_LAZY_PAGES` | `false` | Enable lazy page migration (experimental) |
-| `WAIT_FOR_CHECKPOINT` | `false` | Wait for checkpoint to appear before starting |
-| `RESTORE_WAIT_TIMEOUT` | `300` | Max seconds to wait for checkpoint |
-| `DEBUG` | `false` | Enable debug mode (sleeps 300s on error) |
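To make the defaults concrete, here is a sketch of how a process might read the CRIU-related variables. It is not the `restore-entrypoint` parser: `load_criu_tuning` is a hypothetical helper, and the accepted boolean spellings (`1`, `true`, `yes`) are an assumption. The variable names, defaults, and the 0-4 log level range come from the table above.

```python
import os


def _as_bool(value):
    # Assumed spellings; the real parser may accept a different set.
    return value.strip().lower() in ("1", "true", "yes")


def load_criu_tuning(env=None):
    """Read the CRIU tuning variables, applying the documented defaults."""
    env = os.environ if env is None else env
    log_level = int(env.get("CRIU_LOG_LEVEL", "4"))
    if not 0 <= log_level <= 4:
        raise ValueError("CRIU_LOG_LEVEL must be between 0 and 4")
    return {
        "timeout": int(env.get("CRIU_TIMEOUT", "0")),  # 0 means unlimited
        "log_level": log_level,
        "work_dir": env.get("CRIU_WORK_DIR", "/tmp"),
        "cuda_plugin_dir": env.get("CUDA_PLUGIN_DIR", "/usr/local/lib/criu"),
        "skip_in_flight": _as_bool(env.get("CRIU_SKIP_IN_FLIGHT", "false")),
        "auto_dedup": _as_bool(env.get("CRIU_AUTO_DEDUP", "false")),
        "lazy_pages": _as_bool(env.get("CRIU_LAZY_PAGES", "false")),
    }
```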
---
-## Checkpoint Flow Explained
-### 1. Checkpoint Creation Flow
-```
-┌─────────────────────────────────────────────────────────────┐
-│ 1. Pod starts with nvidia.com/checkpoint-source=true label  │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 2. Application loads model and creates ready file           │
-│    /tmp/ready-for-checkpoint                                 │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 3. Pod becomes Ready (kubelet readiness probe passes)       │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 4. ChReK DaemonSet detects:                                 │
-│    - Pod is Ready                                            │
-│    - Has checkpoint-source label                             │
-│    - Ready file exists: /tmp/ready-for-checkpoint           │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 5. DaemonSet executes CRIU checkpoint via runc:             │
-│    - Freezes container process                               │
-│    - Dumps memory (CPU + GPU)                                │
-│    - Saves to /checkpoints/${HASH}/                          │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 6. DaemonSet writes signal file:                            │
-│    /checkpoint-signal/${HASH}.done                           │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 7. Application detects signal file and exits gracefully     │
-└─────────────────────────────────────────────────────────────┘
-```
-### 2. Restore Flow
-```
-┌─────────────────────────────────────────────────────────────┐
-│ 1. Pod starts with DYN_CHECKPOINT_HASH set                  │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 2. smart-entrypoint.sh checks for checkpoint:               │
-│    /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done      │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ├─ Not Found ─────────────────┐
-                       │                              │
-                       ▼                              ▼
-           ┌───────────────────────┐    ┌──────────────────────┐
-           │ Checkpoint exists     │    │ Cold start           │
-           └──────────┬────────────┘    │ Run original CMD     │
-                      │                 └──────────────────────┘
-                      ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 3. Call restore-entrypoint with checkpoint path             │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 4. restore-entrypoint extracts checkpoint and calls CRIU:   │
-│    criu restore --images-dir /checkpoints/${HASH}/images    │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 5. CRIU restores process from checkpoint                    │
-│    - Restores memory (CPU + GPU)                             │
-│    - Restores file descriptors                               │
-│    - Resumes process execution                               │
-└──────────────────────┬──────────────────────────────────────┘
-                       │
-                       ▼
-┌─────────────────────────────────────────────────────────────┐
-│ 6. Application continues from checkpointed state            │
-│    (Model already loaded, GPU memory initialized)           │
-└─────────────────────────────────────────────────────────────┘
-```
---
-## Troubleshooting
-### Checkpoint Not Created
-**Symptom**: Job runs but no checkpoint appears in `/checkpoints/`
-**Checks**:
-1. Verify the pod has the label:
-   ```bash
-   kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'
-   ```
-2. Check pod readiness:
-   ```bash
-   kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
-   ```
-3. Check ready file was created:
-   ```bash
-   kubectl exec <pod-name> -- ls -la /tmp/ready-for-checkpoint
-   ```
-4. Check DaemonSet logs:
-   ```bash
-   kubectl logs -n my-app daemonset/chrek-agent --all-containers
-   ```
-### Restore Fails
-**Symptom**: Pod fails to restore from checkpoint
-**Checks**:
-1. Verify checkpoint files exist:
-   ```bash
-   kubectl exec <pod-name> -- sh -c 'ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/'
-   ```
-2. Check privileged mode is enabled:
-   ```bash
-   kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext.privileged}'
-   ```
-3. Check CRIU logs in `/tmp/criu-restore.log`:
-   ```bash
-   kubectl exec <pod-name> -- cat /tmp/criu-restore.log
-   ```
-4. Ensure the checkpoint and restore pods have the same:
-   - Container image
-   - GPU count
-   - Volume mounts
-   - Environment variables (except POD_NAME, POD_IP, etc.)
-### Permission Denied Errors
-**Symptom**: `CRIU: Permission denied` or `Operation not permitted`
-**Solution**: Ensure pod has:
-```yaml
-securityContext:
-  privileged: true
-  capabilities:
-    add:
-    - SYS_ADMIN
-    - SYS_PTRACE
-    - SYS_CHROOT
-```
-### Signal File Not Appearing
-**Symptom**: Application waits forever for signal file
-**Checks**:
-1. Verify hostPath mount is correct:
-   ```bash
-   kubectl get pod <pod-name> -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'
-   ```
-2. Check DaemonSet has access to the same path:
-   ```bash
-   kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'
-   ```
-3. Verify paths match exactly:
-   - Pod: `/var/lib/chrek/signals`
-   - DaemonSet: `/var/lib/chrek/signals`
---
-## Additional Resources
- [ChReK Helm Chart Values](../../deploy/helm/charts/chrek/values.yaml)
- [Smart Entrypoint Script](../../deploy/chrek/scripts/smart-entrypoint.sh)
- [CRIU Documentation](https://criu.org/Main_Page)
- [CUDA Checkpoint Plugin](https://docs.nvidia.com/cuda/cuda-checkpoint-plugin/)
---
-## Getting Help
-If you encounter issues:
-1. Check the [Troubleshooting](#troubleshooting) section
-2. Review DaemonSet logs: `kubectl logs -n <namespace> daemonset/chrek-agent`
-3. Open an issue on [GitHub](https://github.com/ai-dynamo/dynamo/issues)
--- a/docs/kubernetes/deployment/create_deployment.md
+++ b/docs/kubernetes/deployment/create_deployment.md
-# Creating Kubernetes Deployments
-The scripts in the `examples/backends/<backend>/launch` folder, such as [agg.sh](../../../examples/backends/vllm/launch/agg.sh), demonstrate how you can serve your models locally.
-The corresponding YAML files like [agg.yaml](../../../examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
-This guide explains how to create your own deployment files.
-## Step 1: Choose Your Architecture Pattern
-Before choosing a template, understand the different architecture patterns:
-### Aggregated Serving (agg.yaml)
-**Pattern**: Prefill and decode on the same GPU in a single process.
-**Suggested for**:
- Small to medium models (under 70B parameters)
- Development and testing
- Low to moderate traffic
- Simplicity is prioritized over maximum throughput
-**Tradeoffs**:
- Simpler setup and debugging
- Lower operational complexity
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated
-**Example**: [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml)
-### Aggregated + Router (agg_router.yaml)
-**Pattern**: Load balancer routing across multiple aggregated worker instances.
-**Suggested for**:
- Medium traffic requiring high availability
- Need horizontal scaling
- Want some load balancing without disaggregation complexity
-**Tradeoffs**:
- Better scalability than plain aggregated
- High availability through multiple replicas
- Still has GPU underutilization issues of aggregated serving
- More complex than plain aggregated but simpler than disaggregated
-**Example**: [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml)
-### Disaggregated Serving (disagg_router.yaml)
-**Pattern**: Separate prefill and decode workers with specialized optimization.
-**Suggested for**:
- Production-style deployments
- High throughput requirements
- Large models (70B+ parameters)
- Maximum GPU utilization needed
-**Tradeoffs**:
- Maximum performance and throughput
- Better GPU utilization (prefill and decode specialized)
- Independent scaling of prefill and decode
- More complex setup and debugging
- Requires understanding of prefill/decode separation
-**Example**: [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml)
-### Quick Selection Guide
-Select the architecture pattern that best fits your use case and use it as your template.
-For example, when using the `vLLM` backend:
- **Development / Testing**: Use [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml) as the base configuration.
- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
-## Step 2: Customize the Template
-You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
-The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.
-It serves the following roles:
-1. OpenAI-Compatible HTTP Server
-  * Provides `/v1/chat/completions` endpoint
-  * Handles HTTP request/response formatting
-  * Supports streaming responses
-  * Validates incoming requests
-2. Service Discovery and Routing
-  * Auto-discovers backend workers via etcd
-  * Routes requests to the appropriate Processor/Worker components
-  * Handles load balancing between multiple workers
-3. Request Preprocessing
-  * Initial request validation
-  * Model name verification
-  * Request format standardization
-You should then pick a worker and specialize the config. For example:
-```yaml
-VllmWorker:         # vLLM-specific config
-  enforce-eager: true
-  enable-prefix-caching: true
-SglangWorker:       # SGLang-specific config
-  router-mode: kv
-  disagg-mode: true
-TrtllmWorker:       # TensorRT-LLM-specific config
-  engine-config: ./engine.yaml
-  kv-cache-transfer: ucx
-```
-Here's a template structure based on the examples:
-```yaml
-    YourWorker:
-      dynamoNamespace: your-namespace
-      componentType: worker
-      replicas: N
-      envFromSecret: your-secrets  # e.g., hf-token-secret
-      # Health checks for worker initialization
-      readinessProbe:
-        exec:
-          command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
-      resources:
-        requests:
-          gpu: "1"  # GPU allocation
-      extraPodSpec:
-        mainContainer:
-          image: your-image
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL --your-flags
-```
-Consult the corresponding `.sh` launch script. Each Python command that launches a component goes into your YAML spec under
-`extraPodSpec: -> mainContainer: -> args:`
-The frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
-Each worker launches the `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
-If you are a Dynamo contributor, see the [dynamo run guide](../../reference/cli.md) for details on how to run this command.
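The mapping from a launch command to the `mainContainer` fields can be sketched as follows. The helper name `main_container` is hypothetical, and the image, module, model, and flag values in the usage note are placeholders; only the `command`/`args` layout is taken from the template above.

```python
import shlex


def main_container(image, module, model, extra_flags=()):
    """Build the mainContainer mapping for a worker launch command."""
    # The whole launch command becomes a single string passed to `sh -c`.
    command_line = shlex.join(
        ["python", "-m", module, "--model", model, *extra_flags]
    )
    return {
        "image": image,
        "command": ["/bin/sh", "-c"],
        "args": [command_line],
    }
```

For example, `main_container("your-image", "dynamo.vllm", "Qwen/Qwen3-0.6B", ["--enforce-eager"])` yields a mapping whose `args` entry is the single string `python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager`.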
-## Step 3: Key Customization Points
-### Model Configuration
-```yaml
-   args:
-     - "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flag"
-```
-### Resource Allocation
-```yaml
-   resources:
-     requests:
-       cpu: "N"
-       memory: "NGi"
-       gpu: "N"
-```
-### Scaling
-```yaml
-   replicas: N  # Number of worker instances
-```
-### Routing Mode
-```yaml
-   args:
-     - --router-mode
-     - kv  # Enable KV-cache routing
-```
-### Worker Specialization
-```yaml
-   args:
-     - --is-prefill-worker  # For disaggregated prefill workers
-```
-### Image Pull Secret Configuration
-#### Automatic Discovery and Injection
-By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace and matches their registry hostnames to the container image URLs, automatically injecting the appropriate secrets into the pod's `imagePullSecrets`.
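The idea of registry host matching can be illustrated with a short sketch. This is not the operator's implementation: both function names are hypothetical, the input is assumed to map secret names to `.dockerconfigjson` payloads, and Docker Hub aliases such as `index.docker.io` are deliberately not handled.

```python
import json


def registry_host(image):
    """Return the registry host of an image reference.

    Uses the common Docker convention: the first path component names a
    registry only if it contains "." or ":" or is "localhost"; otherwise
    the image is assumed to live on Docker Hub.
    """
    first, sep, _rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def matching_pull_secrets(image, docker_config_secrets):
    """Return names of secrets whose registry hosts match the image."""
    host = registry_host(image)
    matches = []
    for name, config_json in sorted(docker_config_secrets.items()):
        auths = json.loads(config_json).get("auths", {})
        # Keys may be bare hosts or URLs such as "https://ghcr.io/v1/".
        hosts = {key.split("://")[-1].split("/")[0] for key in auths}
        if host in hosts:
            matches.append(name)
    return matches
```

The names returned for each container image would then be the candidates for the pod's `imagePullSecrets`.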
-**Disabling Automatic Discovery:**
-To disable this behavior for a component and manually control image pull secrets:
-```yaml
-    YourWorker:
-      dynamoNamespace: your-namespace
-      componentType: worker
-      annotations:
-        nvidia.com/disable-image-pull-secret-discovery: "true"
-```
-When disabled, you can manually specify secrets as you would for a normal pod spec via:
-```yaml
-    YourWorker:
-      dynamoNamespace: your-namespace
-      componentType: worker
-      annotations:
-        nvidia.com/disable-image-pull-secret-discovery: "true"
-      extraPodSpec:
-        imagePullSecrets:
-          - name: my-registry-secret
-          - name: another-secret
-        mainContainer:
-          image: your-image
-```
-This automatic discovery eliminates the need to manually configure image pull secrets for each deployment.
-## Step 4: Deploy LoRA Adapters (Optional)
-After your base model deployment is running, you can deploy LoRA adapters using the `DynamoModel` custom resource. This lets you serve fine-tuned adapters on top of your models without modifying the base deployment.
-To add a LoRA adapter to your deployment, link it using `modelRef` in your worker configuration:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-spec:
-  services:
-    Worker:
-      modelRef:
-        name: Qwen/Qwen3-0.6B  # Base model identifier
-      componentType: worker
-      # ... rest of worker config
-```
-Then create a `DynamoModel` resource for your LoRA:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: my-lora
-spec:
-  modelName: my-custom-lora
-  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name above
-  modelType: lora
-  source:
-    uri: s3://my-bucket/loras/my-lora
-```
-**For complete details on managing models and LoRA adapters, see:**
-📖 **[Managing Models with DynamoModel Guide](./dynamomodel-guide.md)**
--- a/docs/kubernetes/deployment/dynamomodel-guide.md
+++ b/docs/kubernetes/deployment/dynamomodel-guide.md
-# Managing Models with DynamoModel
-## Overview
-`DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to:
- **Deploy LoRA adapters** on top of running base models
- **Track model endpoints** and their readiness across your cluster
- **Manage model lifecycle** declaratively with Kubernetes
-DynamoModel works alongside `DynamoGraphDeployment` (DGD) or `DynamoComponentDeployment` (DCD) resources. While DGD/DCD deploy the inference infrastructure (pods, services), DynamoModel handles model-specific operations like loading LoRA adapters.
-## Quick Start
-### Prerequisites
-Before creating a DynamoModel, you need:
-1. A running `DynamoGraphDeployment` or `DynamoComponentDeployment`
-2. Components configured with `modelRef` pointing to your base model
-3. Pods that are ready and serving your base model
-For complete setup including DGD configuration, see [Integration with DynamoGraphDeployment](#integration-with-dynamographdeployment).
-### Deploy a LoRA Adapter
-**1. Create your DynamoModel:**
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: my-lora
-  namespace: dynamo-system
-spec:
-  modelName: my-custom-lora
-  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in your DGD
-  modelType: lora
-  source:
-    uri: s3://my-bucket/loras/my-lora
-```
-**2. Apply and verify:**
-```bash
-# Apply the DynamoModel
-kubectl apply -f my-lora.yaml
-# Check status
-kubectl get dynamomodel my-lora
-```
-**Expected output:**
-```
-NAME      TOTAL   READY   AGE
-my-lora   2       2       30s
-```
-That's it! The operator automatically discovers endpoints and loads the LoRA.
-For detailed status monitoring, see [Monitoring & Operations](#monitoring--operations).
-## Understanding DynamoModel
-### Model Types
-DynamoModel supports three model types:
-| Type | Description | Use Case |
-|------|-------------|----------|
-| **`base`** | Reference to an existing base model | Tracking endpoints for a base model (default) |
-| **`lora`** | LoRA adapter that extends a base model | Deploy fine-tuned adapters on existing models |
-| **`adapter`** | Generic model adapter | Future extensibility for other adapter types |
-Most users will use **`lora`** to deploy fine-tuned models on top of their base model deployments.
-### How It Works
-When you create a DynamoModel, the operator:
-1. **Discovers endpoints**: Finds all pods running your `baseModelName` (by matching `modelRef.name` in DGD/DCD)
-2. **Creates service**: Automatically creates a Kubernetes Service to track these pods
-3. **Loads LoRA**: Calls the LoRA load API on each endpoint (for `lora` type)
-4. **Updates status**: Reports which endpoints are ready
-**Key linkage:**
-```yaml
-# DGD modelRef.name ↔ DynamoModel baseModelName must match
-Worker:
-  modelRef:
-    name: Qwen/Qwen3-0.6B
---
-spec:
-  baseModelName: Qwen/Qwen3-0.6B
-```
-## Configuration Overview
-DynamoModel requires just a few key fields to deploy a model or adapter:
-| Field | Required | Purpose | Example |
-|-------|----------|---------|---------|
-| `modelName` | Yes | Model identifier | `my-custom-lora` |
-| `baseModelName` | Yes | Links to DGD modelRef | `Qwen/Qwen3-0.6B` |
-| `modelType` | No | Type: base/lora/adapter | `lora` (default: `base`) |
-| `source.uri` | For LoRA | Model location | `s3://bucket/path` or `hf://org/model` |
-**Example minimal LoRA configuration:**
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: my-lora
-spec:
-  modelName: my-custom-lora
-  baseModelName: Qwen/Qwen3-0.6B
-  modelType: lora
-  source:
-    uri: s3://my-bucket/my-lora
-```
-**For complete field specifications, validation rules, and all options, see:**
-📖 [DynamoModel API Reference](../api_reference.md#dynamomodel)
-### Status Summary
-The status shows discovered endpoints and their readiness:
-```bash
-kubectl get dynamomodel my-lora
-```
-**Key status fields:**
- `totalEndpoints` / `readyEndpoints`: Counts of discovered vs ready endpoints
- `endpoints[]`: List with addresses, pod names, and ready status
- `conditions`: Standard Kubernetes conditions (EndpointsReady, ServicesFound)
-For detailed status usage, see the [Monitoring & Operations](#monitoring--operations) section below.
-## Common Use Cases
-### Use Case 1: S3-Hosted LoRA Adapter
-Deploy a LoRA adapter stored in an S3 bucket.
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: customer-support-lora
-  namespace: production
-spec:
-  modelName: customer-support-adapter-v1
-  baseModelName: meta-llama/Llama-3.3-70B-Instruct
-  modelType: lora
-  source:
-    uri: s3://my-models-bucket/loras/customer-support/v1
-```
-**Prerequisites:**
- S3 bucket accessible from your pods (IAM role or credentials)
- Base model `meta-llama/Llama-3.3-70B-Instruct` running via DGD/DCD
-**Verification:**
-```bash
-# Check LoRA is loaded
-kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.readyEndpoints}'
-# Should output: 2 (or your number of replicas)
-# View which pods are serving
-kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.endpoints[*].podName}'
-```
-### Use Case 2: HuggingFace-Hosted LoRA
-Deploy a LoRA adapter from HuggingFace Hub.
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: multilingual-lora
-  namespace: dynamo-system
-spec:
-  modelName: multilingual-adapter
-  baseModelName: Qwen/Qwen3-0.6B
-  modelType: lora
-  source:
-    uri: hf://myorg/qwen-multilingual-lora@v1.0.0  # Optional: @revision
-```
-**Prerequisites:**
- HuggingFace Hub accessible from your pods
- If private repo: HF token configured as secret and mounted in pods
- Base model `Qwen/Qwen3-0.6B` running via DGD/DCD
-**With HuggingFace token:**
-```yaml
-# In your DGD/DCD
-spec:
-  services:
-    worker:
-      envFromSecret: hf-token-secret  # Provides HF_TOKEN env var
-      modelRef:
-        name: Qwen/Qwen3-0.6B
-      # ... rest of config
-```
-### Use Case 3: Multiple LoRAs on Same Base Model
-Deploy multiple LoRA adapters on the same base model deployment.
-```yaml
---
-# LoRA for customer support
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: support-lora
-spec:
-  modelName: support-adapter
-  baseModelName: Qwen/Qwen3-0.6B
-  modelType: lora
-  source:
-    uri: s3://models/support-lora
---
-# LoRA for code generation
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: code-lora
-spec:
-  modelName: code-adapter
-  baseModelName: Qwen/Qwen3-0.6B  # Same base model
-  modelType: lora
-  source:
-    uri: s3://models/code-lora
-```
-Both LoRAs will be loaded on all pods serving `Qwen/Qwen3-0.6B`. Your application can then route requests to the appropriate adapter.
-## Monitoring & Operations
-### Checking Status
-**Quick status check:**
-```bash
-kubectl get dynamomodel
-```
-**Example output:**
-```
-NAME              TOTAL   READY   AGE
-my-lora           2       2       5m
-customer-lora     4       3       2h
-```
-**Detailed status:**
-```bash
-kubectl describe dynamomodel my-lora
-```
-**Example output:**
-```
-Name:         my-lora
-Namespace:    dynamo-system
-Spec:
-  Model Name:       my-custom-lora
-  Base Model Name:  Qwen/Qwen3-0.6B
-  Model Type:       lora
-  Source:
-    Uri:  s3://my-bucket/my-lora
-Status:
-  Ready Endpoints:  2
-  Total Endpoints:  2
-  Endpoints:
-    Address:   http://10.0.1.5:9090
-    Pod Name:  worker-0
-    Ready:     true
-    Address:   http://10.0.1.6:9090
-    Pod Name:  worker-1
-    Ready:     true
-  Conditions:
-    Type:     EndpointsReady
-    Status:   True
-    Reason:   EndpointsDiscovered
-Events:
-  Type    Reason              Message
-  ----    ------              -------
-  Normal  EndpointsReady      Discovered 2 ready endpoints for base model Qwen/Qwen3-0.6B
-```
-### Understanding Readiness
-An endpoint is **ready** when:
-1. The pod is running and healthy
-2. The LoRA load API call succeeded
-**Condition states:**
- `EndpointsReady=True`: All endpoints are ready (full availability)
- `EndpointsReady=False, Reason=NotReady`: Not all endpoints ready (check message for counts)
- `EndpointsReady=False, Reason=NoEndpoints`: No endpoints found
-When `readyEndpoints < totalEndpoints`, the operator automatically retries loading every 30 seconds.
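The condition states above follow directly from the two endpoint counters. The sketch below is a hypothetical helper, not operator code. The reason strings are taken from this guide (`EndpointsDiscovered` from the example `describe` output), and the 30-second retry interval is the one stated above.

```python
RETRY_INTERVAL_SECONDS = 30  # retry interval stated in this guide


def endpoints_condition(total: int, ready: int) -> tuple[str, str]:
    """Map endpoint counts to an (EndpointsReady status, reason) pair."""
    if total == 0:
        return ("False", "NoEndpoints")  # nothing was discovered
    if ready < total:
        return ("False", "NotReady")  # some endpoints still loading or failing
    return ("True", "EndpointsDiscovered")  # full availability


def needs_retry(total: int, ready: int) -> bool:
    """True when the operator would retry loading on the next interval."""
    return ready < total
```

For example, `endpoints_condition(4, 3)` yields `("False", "NotReady")`, which corresponds to the `customer-lora` row in the example output above.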
-### Viewing Endpoints
-**Get endpoint addresses:**
-```bash
-kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].address}' | tr ' ' '\n'
-```
-**Output:**
-```
-http://10.0.1.5:9090
-http://10.0.1.6:9090
-```
-**Get endpoint pod names:**
-```bash
-kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].podName}' | tr ' ' '\n'
-```
-**Check readiness of each endpoint:**
-```bash
-kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | {podName, ready}'
-```
-**Output:**
-```json
-{
-  "podName": "worker-0",
-  "ready": true
-}
-{
-  "podName": "worker-1",
-  "ready": true
-}
-```
-### Updating a Model
-To update a LoRA (e.g., deploy a new version):
-```bash
-# Edit the source URI
-kubectl edit dynamomodel my-lora
-# Or apply an updated YAML
-kubectl apply -f my-lora-v2.yaml
-```
-The operator will detect the change and reload the LoRA on all endpoints.
-### Deleting a Model
-```bash
-kubectl delete dynamomodel my-lora
-```
-For LoRA models, the operator will:
-1. Unload the LoRA from all endpoints
-2. Clean up associated resources
-3. Remove the DynamoModel CR
-The base model deployment (DGD/DCD) continues running normally.
-## Troubleshooting
-### No Endpoints Found
-**Symptom:**
-```yaml
-status:
-  totalEndpoints: 0
-  readyEndpoints: 0
-  conditions:
-  - type: EndpointsReady
-    status: "False"
-    reason: NoEndpoints
-    message: "No endpoint slices found for base model Qwen/Qwen3-0.6B"
-```
-**Common Causes:**
-1. **Base model deployment not running**
-   ```bash
-   # Check if pods exist
-   kubectl get pods -l nvidia.com/dynamo-component-type=worker
-   ```
-   **Solution:** Deploy your DGD/DCD first, wait for pods to be ready.
-2. **`baseModelName` mismatch**
-   ```bash
-   # Check modelRef in your DGD
-   kubectl get dynamographdeployment my-deployment -o yaml | grep -A2 modelRef
-   ```
-   **Solution:** Ensure `baseModelName` in DynamoModel exactly matches `modelRef.name` in DGD.
-3. **Pods not ready**
-   ```bash
-   # Check pod status
-   kubectl get pods -l nvidia.com/dynamo-component-type=worker
-   ```
-   **Solution:** Wait for pods to reach `Running` and `Ready` state.
-4. **Wrong namespace**
-   **Solution:** Ensure DynamoModel is in the same namespace as your DGD/DCD.
-### LoRA Load Failures
-**Symptom:**
-```yaml
-status:
-  totalEndpoints: 2
-  readyEndpoints: 0  # ← No endpoints ready despite pods existing
-  conditions:
-  - type: EndpointsReady
-    status: "False"
-    reason: NoReadyEndpoints
-```
-**Common Causes:**
-1. **Source URI not accessible**
-   ```bash
-   # Check operator logs
-   kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f | grep "Failed to load"
-   ```
-   **Solution:**
-   - For S3: Verify bucket permissions, IAM role, credentials
-   - For HuggingFace: Verify token is valid, repo exists and is accessible
-2. **Invalid LoRA format**
-   **Solution:** Ensure your LoRA weights are in the format expected by your backend framework (vLLM, SGLang, etc.)
-3. **Endpoint API errors**
-   ```bash
-   # Check operator logs for HTTP errors
-   kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "error"
-   ```
-   **Solution:** Check the backend framework's logs in the worker pods:
-   ```bash
-   kubectl logs worker-0
-   ```
-4. **Out of memory**
-   **Solution:** LoRA adapters require additional memory. Increase memory limits in your DGD:
-   ```yaml
-   resources:
-     limits:
-       memory: "32Gi"  # Increase if needed
-   ```
-### Status Shows Not Ready
-**Symptom:**
-Some endpoints remain not ready for extended periods.
-**Diagnosis:**
-```bash
-# Check which endpoints are not ready
-kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | select(.ready == false)'
-# View operator logs for that specific pod
-kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "worker-0"
-# Check the worker pod logs
-kubectl logs worker-0 | tail -50
-```
-**Common Causes:**
-1. **Network issues**: Pod can't reach S3/HuggingFace
-2. **Resource constraints**: Pod is OOMing or being throttled
-3. **API endpoint not responding**: Backend framework isn't serving the LoRA API
-**When to wait vs investigate:**
- **Wait**: If readyEndpoints is increasing over time (LoRAs loading progressively)
- **Investigate**: If stuck at same readyEndpoints for >5 minutes
-### Viewing Events and Logs
-**Check events:**
-```bash
-kubectl describe dynamomodel my-lora | tail -20
-```
-**View operator logs:**
-```bash
-# Follow logs
-kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f
-# Filter for specific model
-kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "my-lora"
-```
-**Common events and messages:**
-| Event/Message | Meaning | Action |
-|---------------|---------|--------|
-| `EndpointsReady` | All endpoints are ready | ✅ Good - full service availability |
-| `NotReady` | Not all endpoints ready | ⚠️ Check readyEndpoints count - operator will retry |
-| `PartialEndpointFailure` | Some endpoints failed to load | Check logs for errors |
-| `NoEndpointsFound` | No pods discovered | Verify DGD running and modelRef matches |
-| `EndpointDiscoveryFailed` | Can't query endpoints | Check operator RBAC permissions |
-| `Successfully reconciled` | Reconciliation complete | ✅ Good |
-## Integration with DynamoGraphDeployment
-This section shows the complete end-to-end workflow for deploying base models and LoRA adapters together.
-DynamoModel and DynamoGraphDeployment work together to provide complete model deployment:
- **DGD**: Deploys the infrastructure (pods, services, resources)
- **DynamoModel**: Manages model-specific operations (LoRA loading)
-### Linking Models to Components
-The connection is established through the `modelRef` field in your DGD:
-**Complete example:**
-```yaml
---
-# 1. Deploy the base model infrastructure
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-spec:
-  backendFramework: vllm
-  services:
-    Frontend:
-      componentType: frontend
-      replicas: 1
-      dynamoNamespace: my-app
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
-    Worker:
-      # This modelRef creates the link to DynamoModel
-      modelRef:
-        name: Qwen/Qwen3-0.6B  # ← Key linking field
-      componentType: worker
-      replicas: 2
-      resources:
-        limits:
-          gpu: "1"
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
-          args:
-            - --model
-            - Qwen/Qwen3-0.6B
-            - --tensor-parallel-size
-            - "1"
---
-# 2. Deploy LoRA adapters on top
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoModel
-metadata:
-  name: my-lora
-spec:
-  modelName: my-custom-lora
-  baseModelName: Qwen/Qwen3-0.6B  # ← Must match modelRef.name above
-  modelType: lora
-  source:
-    uri: s3://my-bucket/loras/my-lora
-```
-### Deployment Workflow
-**Recommended order:**
-```bash
-# 1. Deploy base model infrastructure
-kubectl apply -f my-deployment.yaml
-# 2. Wait for pods to be ready
-kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-component-type=worker --timeout=5m
-# 3. Deploy LoRA adapters
-kubectl apply -f my-lora.yaml
-# 4. Verify LoRA is loaded
-kubectl get dynamomodel my-lora
-```
-**What happens behind the scenes:**
-| Step | DGD | DynamoModel |
-|------|-----|-------------|
-| 1 | Creates pods with modelRef | - |
-| 2 | Pods become running and ready | - |
-| 3 | - | CR created, discovers endpoints via auto-created Service |
-| 4 | - | Calls LoRA load API on each endpoint |
-| 5 | - | All endpoints ready ✓ |
-The operator handles all service discovery automatically; you don't configure services, labels, or selectors manually.
-## API Reference
-For complete field specifications, validation rules, and detailed type definitions, see:
-**📖 [Dynamo CRD API Reference](../api_reference.md#dynamomodel)**
-## Summary
-DynamoModel provides declarative model management for Dynamo deployments:
-✅ **Simple**: 2-step deployment of LoRA adapters
-✅ **Automatic**: Endpoint discovery and loading handled by operator
-✅ **Observable**: Rich status reporting and conditions
-✅ **Integrated**: Works seamlessly with DynamoGraphDeployment
-**Next Steps:**
- Try the [Quick Start](#quick-start) example
- Explore [Common Use Cases](#common-use-cases)
- Check the [API Reference](../api_reference.md#dynamomodel) for advanced configuration
--- a/docs/kubernetes/deployment/minikube.md
+++ b/docs/kubernetes/deployment/minikube.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Minikube Setup Guide
-Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run the Dynamo Kubernetes Platform locally.
-## 1. Install Minikube
-First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
-## 2. Configure GPU Support (Optional)
-Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
-> [!TIP]
-> Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
-## 3. Start Minikube
-Time to launch your local cluster!
-```bash
-# Start Minikube with GPU support (if configured)
-minikube start --driver docker --container-runtime docker --gpus all --memory=16000mb --cpus=8
-# Enable required addons
-minikube addons enable istio-provisioner
-minikube addons enable istio
-minikube addons enable storage-provisioner-rancher
-```
-## 4. Verify Installation
-Let's make sure everything is working correctly!
-```bash
-# Check Minikube status
-minikube status
-# Verify Istio installation
-kubectl get pods -n istio-system
-# Verify storage class
-kubectl get storageclass
-```
-## Next Steps
-Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](../installation_guide.md) to deploy the platform to your local cluster.
--- a/docs/kubernetes/deployment/multinode-deployment.md
+++ b/docs/kubernetes/deployment/multinode-deployment.md
-# Multinode Deployment Guide
-This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
-## Overview
-Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to:
- Distribute workloads across multiple physical nodes
- Scale GPU resources beyond a single machine
- Support large models requiring extensive tensor parallelism
- Achieve high availability and fault tolerance
-## Basic requirements
- **Kubernetes Cluster**: Version 1.24 or later
- **GPU Nodes**: Multiple nodes with NVIDIA GPUs
- **High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance)
-### Advanced Multinode Orchestration
-#### Using Grove (default)
-For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
- **[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads
- **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale
-These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
-**Features Enabled with Grove:**
- Declarative composition of AI workloads
- Multi-level horizontal auto-scaling
- Custom startup ordering for components
- Resource-aware rolling updates
-[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale.
-**Features Enabled with KAI-Scheduler:**
- Gang scheduling
- Network topology-aware pod placement
- AI workload-optimized scheduling algorithms
- GPU resource awareness and allocation
- Support for complex scheduling constraints
- Integration with Grove for enhanced capabilities
- Performance optimizations for large-scale deployments
-##### Prerequisites
- [Grove](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) installed on the cluster
- (Optional) [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with the default queue name `dynamo` created. If no queue annotation is specified on the DGD resource, the operator uses the `dynamo` queue by default. Custom queue names can be specified via the `nvidia.com/kai-scheduler-queue` annotation, but the queue must exist in the cluster before deployment.
-KAI-Scheduler is optional but recommended for advanced scheduling capabilities.
-#### Using LWS and Volcano
-LWS (LeaderWorkerSet) is a simple multinode deployment mechanism that lets you deploy a workload across multiple nodes.
- **LWS**: [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
- **Volcano**: [Volcano Installation](https://volcano.sh/en/docs/installation/)
-Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.
-## Core Concepts
-### Orchestrator Selection Algorithm
-Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic:
-#### When Both Grove and LWS are Available:
- **Grove is selected by default** (recommended for advanced AI workloads)
- **LWS is selected** if you explicitly set `nvidia.com/enable-grove: "false"` annotation on your DGD resource
-#### When Only One Orchestrator is Available:
- The installed orchestrator (Grove or LWS) is automatically selected
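The selection rules above can be written as a small decision function. This is a sketch of the documented behavior, not the operator's code; the function name and the boolean availability inputs are hypothetical. The case where only Grove is installed but the annotation disables it is not covered by this guide, so the sketch simply falls back to the installed orchestrator.

```python
from typing import Optional

ENABLE_GROVE_ANNOTATION = "nvidia.com/enable-grove"


def select_orchestrator(
    grove_available: bool,
    lws_available: bool,
    annotations: Optional[dict] = None,
) -> Optional[str]:
    """Return "grove", "lws", or None if no orchestrator is installed."""
    annotations = annotations or {}
    # Grove stays enabled unless the annotation is explicitly "false".
    grove_enabled = annotations.get(ENABLE_GROVE_ANNOTATION, "true").lower() != "false"

    if grove_available and lws_available:
        return "grove" if grove_enabled else "lws"
    if grove_available:
        return "grove"  # assumption: the only installed orchestrator wins
    if lws_available:
        return "lws"
    return None
```

Calling `select_orchestrator(True, True, {"nvidia.com/enable-grove": "false"})` returns `"lws"`, matching the "Force LWS usage" configuration.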
-#### Scheduler Integration:
- **With Grove**: Automatically integrates with [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) when available, providing:
-  - Advanced queue management via `nvidia.com/kai-scheduler-queue` annotation
-  - AI-optimized scheduling policies
-  - Resource-aware workload placement
- **With LWS**: Uses Volcano scheduler for gang scheduling and resource coordination
-#### Configuration Examples:
-**Default (Grove with KAI-Scheduler):**
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-multinode-deployment
-  annotations:
-    nvidia.com/kai-scheduler-queue: "dynamo"
-spec:
-  # ... your deployment spec
-```
-> **Note:** The `nvidia.com/kai-scheduler-queue` annotation defaults to `"dynamo"`. If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with `kubectl get queues`.
-**Force LWS usage:**
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-multinode-deployment
-  annotations:
-    nvidia.com/enable-grove: "false"
-spec:
-  # ... your deployment spec
-```
-### The `multinode` Section
-The `multinode` section in a resource specification defines how many physical nodes the workload should span:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-multinode-deployment
-spec:
-  # ... your deployment spec
-  services:
-    my-service:
-      ...
-      multinode:
-        nodeCount: 2
-      resources:
-        limits:
-          gpu: "2"            # 2 GPUs per node
-```
-### GPU Distribution
-The relationship between `multinode.nodeCount` and `gpu` is multiplicative:
- **`multinode.nodeCount`**: Number of physical nodes
- **`gpu`**: Number of GPUs per node
- **Total GPUs**: `multinode.nodeCount × gpu`
-**Example:**
 `multinode.nodeCount: 2` × `gpu: "4"` = 8 total GPUs (4 GPUs per node across 2 nodes)
 `multinode.nodeCount: 4` × `gpu: "8"` = 32 total GPUs (8 GPUs per node across 4 nodes)
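As a quick check of the arithmetic, the total is simply the product of the two settings. The helper name below is hypothetical and exists only to make the calculation explicit.

```python
def total_gpus(node_count: int, gpus_per_node: int) -> int:
    """Total GPUs for a service: multinode.nodeCount multiplied by the gpu limit."""
    if node_count < 1 or gpus_per_node < 1:
        raise ValueError("nodeCount and gpu must both be at least 1")
    return node_count * gpus_per_node


print(total_gpus(2, 4))  # 2 nodes with 4 GPUs each
print(total_gpus(4, 8))  # 4 nodes with 8 GPUs each
```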
-### Tensor Parallelism Alignment
-The tensor parallelism (`tp-size` or `--tp`) in your command/args must match the total number of GPUs:
-```yaml
-# Example: 2 multinode.nodeCount × 4 GPUs = 8 total GPUs
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-multinode-deployment
-spec:
-  # ... your deployment spec
-  services:
-    my-service:
-      ...
-      multinode:
-        nodeCount: 2
-      resources:
-        limits:
-          gpu: "4"
-      extraPodSpec:
-        mainContainer:
-          ...
-          args:
-            # Command args must use tp-size=8
-            - "--tp-size"
-            - "8"  # Must equal multinode.nodeCount × gpu
-```
-## Backend-Specific Operator Behavior
-When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps troubleshoot issues and optimize your deployments.
-### vLLM Backend
-For vLLM multinode deployments, the operator automatically selects and configures the appropriate distributed execution mode based on your parallelism settings:
-#### Deployment Modes
-The operator automatically determines the deployment mode based on your parallelism configuration:
-**1. Tensor/Pipeline Parallelism Mode (Single model across nodes)**
- **When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
- **Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
-The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
-**Leader Node:**
- **Command**: `ray start --head --port=6379 && <original-vllm-command> --distributed-executor-backend ray`
- **Behavior**: Starts Ray head node, then runs vLLM which creates a placement group spanning all Ray workers
- **Probes**: All health probes remain active (liveness, readiness, startup)
-**Worker Nodes:**
- **Command**: `ray start --address=<leader-hostname>:6379 --block`
- **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
-> **Note**: vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
-**2. Data Parallel Mode (Multiple model instances across nodes)**
- **When used**: When `world_size × data_parallel_size > GPUs_per_node`
- **Use case**: Running multiple independent model instances across nodes with data parallelism (e.g., MoE models with expert parallelism)
-**All Nodes (Leader and Workers):**
- **Injected Flags**:
-  - `--data-parallel-address <leader-hostname>` - Address of the coordination server
-  - `--data-parallel-size-local <value>` - Number of data parallel workers per node
-  - `--data-parallel-rpc-port 13445` - RPC port for data parallel coordination
-  - `--data-parallel-start-rank <value>` - Starting rank for this node (calculated automatically)
- **Probes**: Worker probes are removed; leader probes remain active
-**Note**: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers).
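The two "When used" conditions can be combined into one decision function. This is a sketch under stated assumptions, not the operator's implementation: the function name and return labels are hypothetical, and checking the tensor/pipeline condition before the data-parallel one is an assumed ordering that this guide does not spell out.

```python
def vllm_multinode_mode(tp: int, pp: int, dp: int, gpus_per_node: int) -> str:
    """Pick a vLLM deployment mode from parallelism settings.

    Returns "ray" (tensor/pipeline parallel across nodes),
    "data-parallel", or "single-node".
    """
    world_size = tp * pp  # GPUs needed by one model instance
    if world_size > gpus_per_node:
        return "ray"  # one instance spans several nodes
    if world_size * dp > gpus_per_node:
        return "data-parallel"  # several instances span several nodes
    return "single-node"  # everything fits on one node
```

For instance, `tp=8` on nodes with 4 GPUs each selects the Ray mode, while `tp=1` with `dp=16` on 8-GPU nodes selects data-parallel mode.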
-#### Why Ray for Multi-Node TP/PP?
-vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments:
- **Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything.
- **mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate.
-The Dynamo operator uses Ray because:
-1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`)
-2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents
-3. vLLM automatically handles placement group creation and worker management
-#### Compilation Cache Support
-When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
- **`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
-### SGLang Backend
-For SGLang multinode deployments, the operator injects distributed initialization flags:
-#### Leader Node
- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank 0`
- **Probes**: All health probes remain active
-#### Worker Nodes
- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank <dynamic-rank>`
-  - The `node-rank` is automatically determined from the pod's stateful identity
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
-**Note:** The operator intelligently injects these flags regardless of your command structure (direct Python commands or shell wrappers).
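To make the injected flags concrete, the sketch below builds the flag list for a given node. The helper name is hypothetical, and how the operator derives the rank from the pod identity is not shown here; the rank is simply passed in.

```python
SGLANG_DIST_PORT = 29500  # port used in the injected --dist-init-addr


def sglang_dist_flags(leader_hostname: str, node_count: int, node_rank: int) -> list[str]:
    """Flags injected for one SGLang node; rank 0 is the leader."""
    if not 0 <= node_rank < node_count:
        raise ValueError("node_rank must be in [0, node_count)")
    return [
        "--dist-init-addr", f"{leader_hostname}:{SGLANG_DIST_PORT}",
        "--nnodes", str(node_count),
        "--node-rank", str(node_rank),
    ]


# Leader and first worker of a 2-node deployment (hostname is a placeholder).
print(" ".join(sglang_dist_flags("leader-0", 2, 0)))
print(" ".join(sglang_dist_flags("leader-0", 2, 1)))
```

Every node receives the same address and node count; only `--node-rank` differs between the leader and the workers.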
-### TensorRT-LLM Backend
-For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:
-#### Leader Node
- **SSH Configuration**: Automatically sets up SSH keys and configuration from a Kubernetes secret
- **MPI Command**: Wraps your command in an `mpirun` command with:
-  - Proper host list including all worker nodes
-  - SSH configuration for passwordless authentication on port 2222
-  - Environment variable propagation to all nodes
-  - Activation of the Dynamo virtual environment
- **Probes**: All health probes remain active
-#### Worker Nodes
- **SSH Daemon**: Replaces your command with SSH daemon setup and execution
-  - Generates host keys in user-writable directories (non-privileged)
-  - Configures SSH daemon to listen on port 2222
-  - Sets up authorized keys for leader access
- **Probes**:
-  - **Liveness and Startup**: Removed (workers run SSH daemon, not the main application)
-  - **Readiness**: Replaced with TCP socket check on SSH port 2222
-    - Initial Delay: 20 seconds
-    - Period: 20 seconds
-    - Timeout: 5 seconds
-    - Failure Threshold: 10
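-Expressed as a standard Kubernetes probe, the replacement readiness check listed above would look roughly like this. The field names are the regular Pod `readinessProbe` fields; the exact manifest the operator generates is not shown in this guide, so treat the layout as an assumption.
-```yaml
-# Sketch of the worker readiness probe (values taken from the list above)
-readinessProbe:
-  tcpSocket:
-    port: 2222            # SSH daemon port
-  initialDelaySeconds: 20
-  periodSeconds: 20
-  timeoutSeconds: 5
-  failureThreshold: 10
-```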
-#### Additional Configuration
- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-<deployment-name>`)
-**Important:** TensorRT-LLM requires an SSH keypair secret to be created before deployment. The secret name follows the pattern `mpirun-ssh-key-<component-name>`.
-### Compilation Cache Configuration
-The operator supports compilation cache volumes for backend-specific optimization:
-| Backend | Support Level | Environment Variables | Default Mount Point |
-|---------|--------------|----------------------|---------------------|
-| vLLM | Fully Supported | `VLLM_CACHE_ROOT` | User-specified |
-| SGLang | Partial Support | _None (pending upstream)_ | User-specified |
-| TensorRT-LLM | Partial Support | _None (pending upstream)_ | User-specified |
-To enable compilation cache, add a volume mount with `useAsCompilationCache: true` in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.
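-A minimal sketch of such a mount is shown below. The `pvcs` and `volumeMounts` layout is assumed to match other DynamoGraphDeployment examples in these docs, and the PVC name, size, and mount path are hypothetical; check the CRD API reference for the exact schema.
-```yaml
-spec:
-  pvcs:
-    - name: vllm-compile-cache        # hypothetical PVC name
-      size: 20Gi                      # hypothetical size
-  services:
-    VllmWorker:
-      volumeMounts:
-        - name: vllm-compile-cache
-          mountPoint: /cache/vllm     # hypothetical path; VLLM_CACHE_ROOT points here for vLLM
-          useAsCompilationCache: true
-```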
-## Next Steps
-For additional support and examples, see the working multinode configurations in:
- **SGLang**: [examples/backends/sglang/deploy/](../../../examples/backends/sglang/deploy/)
- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](../../../examples/backends/trtllm/deploy/)
- **vLLM**: [examples/backends/vllm/deploy/](../../../examples/backends/vllm/deploy/)
-These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
-```{toctree}
-:hidden:
-Grove <../grove>
-```
--- a/docs/kubernetes/dynamo_operator.md
+++ b/docs/kubernetes/dynamo_operator.md
-# Working with Dynamo Kubernetes Operator
-## Overview
-The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It continuously reconciles custom resources so that the cluster converges on your desired state. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
-## Architecture
- **Operator Deployment:**
-  Deployed as a Kubernetes `Deployment` in a specific namespace.
- **Controllers:**
-  - `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
-  - `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.
-  - `DynamoModelController`: Watches `DynamoModel` CRs and manages model lifecycle (e.g., loading LoRA adapters).
- **Workflow:**
-  1. A custom resource is created by the user or API server.
-  2. The corresponding controller detects the change and runs reconciliation.
-  3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
-  4. Status fields are updated to reflect the current state.
-## Deployment Modes
-The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases:
-### 1. Cluster-Wide Mode (Default)
-The operator monitors and manages DynamoGraph resources across **all namespaces** in the cluster.
-**When to Use:**
- You have full cluster admin access
- You want centralized management of all Dynamo workloads
- Standard production deployment on a dedicated cluster
---
-### 2. Namespace-Scoped Mode
-The operator monitors and manages DynamoGraph resources **only in a specific namespace**. A lease marker is created to signal the operator's presence to any cluster-wide operators.
-**When to Use:**
- You're on a shared/multi-tenant cluster
- You only have namespace-level permissions
- You want to test a new operator version in isolation
- You need to avoid conflicts with other operators
-**Installation:**
-```bash
-helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
-  --namespace my-namespace \
-  --create-namespace \
-  --set dynamo-operator.namespaceRestriction.enabled=true
-```
---
-### 3. Hybrid Mode
-A **cluster-wide operator** manages most namespaces, while **one or more namespace-scoped operators** run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.
-**When to Use:**
- Running production workloads with a stable operator version
- Testing new operator versions in isolated namespaces without affecting production
- Gradual rollout of operator updates
- Development/staging environments on production clusters
-**How It Works:**
-1. Namespace-scoped operator creates a lease named `dynamo-operator-namespace-scope` in its namespace
-2. Cluster-wide operator watches for these lease markers across all namespaces
-3. Cluster-wide operator automatically excludes any namespace with a lease marker
-4. If namespace-scoped operator stops, its lease expires (TTL: 30s by default)
-5. Cluster-wide operator automatically resumes managing that namespace
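-For orientation, a lease marker of the kind described above could look like the following `coordination.k8s.io/v1` object. Only the lease name and the 30-second duration come from this guide; the namespace, the holder identity value, and any additional fields the operator sets are assumptions.
-```yaml
-apiVersion: coordination.k8s.io/v1
-kind: Lease
-metadata:
-  name: dynamo-operator-namespace-scope
-  namespace: test-namespace            # namespace owned by the namespace-scoped operator
-spec:
-  holderIdentity: example-operator-id  # hypothetical value
-  leaseDurationSeconds: 30             # default TTL mentioned above
-```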
-**Setup Example:**
-```bash
-# 1. Install cluster-wide operator (production, v1.0.0)
-helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
-  --namespace dynamo-system \
-  --create-namespace
-# 2. Install namespace-scoped operator (testing, v2.0.0-beta)
-helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \
-  --namespace test-namespace \
-  --create-namespace \
-  --set dynamo-operator.namespaceRestriction.enabled=true \
-  --set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta
-```
-**Observability:**
-```bash
-# List all namespaces with local operators
-kubectl get lease -A --field-selector metadata.name=dynamo-operator-namespace-scope
-# Check which operator version is running in a namespace
-kubectl get lease -n my-namespace dynamo-operator-namespace-scope \
-  -o jsonpath='{.spec.holderIdentity}'
-```
-## Custom Resource Definitions (CRDs)
-Dynamo provides the following Custom Resources:
- **DynamoGraphDeployment (DGD)**: Deploys complete inference pipelines
- **DynamoComponentDeployment (DCD)**: Deploys individual components
- **DynamoModel**: Manages model lifecycle (e.g., loading LoRA adapters)
-For the complete technical API reference for Dynamo Custom Resource Definitions, see:
-**📖 [Dynamo CRD API Reference](./api_reference.md)**
-For a user-focused guide on deploying and managing models with DynamoModel, see:
-**📖 [Managing Models with DynamoModel Guide](./deployment/dynamomodel-guide.md)**
-## Webhooks
-The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation of custom resources before they are persisted to the cluster. Webhooks are **enabled by default** and ensure that invalid configurations are rejected immediately at the API server level.
-**Key Features:**
- ✅ Shared certificate infrastructure across all webhook types
- ✅ Automatic certificate generation (for testing/development)
- ✅ cert-manager integration (for production)
- ✅ Multi-operator support with lease-based coordination
- ✅ Immutability enforcement for critical fields
-For complete documentation on webhooks, certificate management, and troubleshooting, see:
-**📖 [Webhooks Guide](./webhooks.md)**
-## Observability
-The Dynamo Operator provides comprehensive observability through Prometheus metrics and Grafana dashboards. This allows you to monitor:
- **Controller Performance**: Reconciliation loop duration, success rates, and error rates by resource type
- **Webhook Activity**: Validation performance, admission rates, and denial patterns
- **Resource Inventory**: Current count of managed resources by state and namespace
- **Operational Health**: Success rates and health indicators for controllers and webhooks
-### Metrics Collection
-Metrics are automatically exposed on the operator's `/metrics` endpoint (port 8443 by default) and collected by Prometheus via a ServiceMonitor. The ServiceMonitor is automatically created when you install the operator via Helm (controlled by `metricsService.enabled`, which defaults to `true`).
-### Grafana Dashboard
-A pre-built Grafana dashboard is available for visualizing operator metrics. The dashboard includes:
- **Reconciliation Metrics**: Rate, duration (P95), and errors by resource type
- **Webhook Metrics**: Request rate, duration (P95), and denials by resource type and operation
- **Resource Inventory**: Count of DynamoGraphDeployments by state and namespace
- **Operational Health**: Success rate gauges for controllers and webhooks
-For complete setup instructions and metrics reference, see:
-**📖 [Operator Metrics Guide](./observability/operator-metrics.md)**
-## Installation
-### Quick Install with Helm
-```bash
-# Set environment
-export NAMESPACE=dynamo-system
-export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
-# Install Platform (includes operator)
-helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
-helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
-```
-> **Note:** For shared/multi-tenant clusters or testing scenarios, see [Deployment Modes](#deployment-modes) above for namespace-scoped and hybrid configurations.
-### Building from Source
-```bash
-# Set environment
-export NAMESPACE=dynamo-system
-export DOCKER_SERVER=your-registry.com  # your container registry (no trailing slash; one is added below)
-export IMAGE_TAG=latest
-# Build operator image
-cd deploy/operator
-docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG .
-docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
-cd -
-# Install CRDs
-cd deploy/helm/charts
-helm install dynamo-crds ./crds/ --namespace default
-# Install platform with custom operator image
-helm install dynamo-platform ./platform/ \
-  --namespace ${NAMESPACE} \
-  --create-namespace \
-  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
-  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
-  --set etcd.enabled=false \
-  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
-```
-For detailed installation options, see the [Installation Guide](./installation_guide.md)
-## Development
- **Code Structure:**
-The operator is built using Kubebuilder and the operator-sdk, with the following structure:
- `controllers/`: Reconciliation logic
- `api/v1alpha1/`: CRD types
- `config/`: Manifests and Helm charts
-## References
- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
- [Operator SDK](https://sdk.operatorframework.io/)
- [Helm Best Practices for CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/)
--- a/docs/kubernetes/fluxcd.md
+++ b/docs/kubernetes/fluxcd.md
-# GitOps Deployment with FluxCD
-This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../backends/vllm/README.md) to demonstrate the workflow.
-## Prerequisites
- A Kubernetes cluster with [Dynamo Kubernetes Platform](./installation_guide.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
-## Workflow Overview
-The GitOps workflow for Dynamo deployments consists of three main steps:
-1. Build and push the Dynamo Operator
-2. Create and commit a DynamoGraphDeployment custom resource for initial deployment
-3. Update the graph by building a new version and updating the CR for subsequent updates
-## Step 1: Build and Push Dynamo Operator
-First, install the Dynamo Kubernetes Platform, which includes the operator, by following the [installation guide](./installation_guide.md).
-## Step 2: Create Initial Deployment
-Create a new file in your Git repository (e.g., `deployments/llm-agg.yaml`) with the following content:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: llm-agg
-spec:
-  pvcs:
-    - name: vllm-model-storage
-      size: 100Gi
-  services:
-    Frontend:
-      replicas: 1
-      envs:
-      - name: SPECIFIC_ENV_VAR
-        value: some_specific_value
-    Processor:
-      replicas: 1
-      envs:
-      - name: SPECIFIC_ENV_VAR
-        value: some_specific_value
-    VllmWorker:
-      replicas: 1
-      envs:
-      - name: SPECIFIC_ENV_VAR
-        value: some_specific_value
-      # Add PVC for model storage
-      volumeMounts:
-        - name: vllm-model-storage
-          mountPoint: /models
-```
-Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial Dynamo deployment in your cluster.
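-For FluxCD to pick up the file, the repository and path have to be registered with Flux. A minimal sketch using the standard Flux `GitRepository` and `Kustomization` resources follows; the repository URL, branch, resource names, intervals, and target namespace are placeholders, not values from this guide.
-```yaml
-apiVersion: source.toolkit.fluxcd.io/v1
-kind: GitRepository
-metadata:
-  name: dynamo-deployments        # hypothetical name
-  namespace: flux-system
-spec:
-  interval: 1m
-  url: https://github.com/<your-org>/<your-repo>   # placeholder
-  ref:
-    branch: main
----
-apiVersion: kustomize.toolkit.fluxcd.io/v1
-kind: Kustomization
-metadata:
-  name: dynamo-deployments        # hypothetical name
-  namespace: flux-system
-spec:
-  interval: 5m
-  sourceRef:
-    kind: GitRepository
-    name: dynamo-deployments
-  path: ./deployments             # directory containing llm-agg.yaml
-  prune: true
-  targetNamespace: <namespace-with-the-dynamo-operator>   # placeholder
-```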
-## Step 3: Update Existing Deployment
-To update your pipeline, edit the associated DynamoGraphDeployment CR in your Git repository and push the change.
-FluxCD applies the updated CR, and the Dynamo operator automatically reconciles it.
-## Monitoring the Deployment
-You can monitor the deployment status using:
-```bash
-export NAMESPACE=<namespace-with-the-dynamo-operator>
-# Check the DynamoGraphDeployment status
-kubectl get dynamographdeployment llm-agg -n $NAMESPACE
-```
\ No newline at end of file
--- a/docs/kubernetes/grove.md
+++ b/docs/kubernetes/grove.md
-# Grove Deployment Guide
-Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
-## Overview
-Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
-### How Grove Works for Disaggregated Serving
-Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
- **Fault Isolation**: Issues in one component don't necessarily affect others
-## Core Components and API Resources
-Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
-### PodCliqueSet
-The top-level Grove object that defines a group of components managed and colocated together. Key features include:
- Support for autoscaling
- Topology-aware spread of replicas for availability
- Unified management of multiple disaggregated components
-### PodClique
-Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
- Independent configuration options
- Custom scaling logic support
- Role-specific resource allocation
-### PodCliqueScalingGroup
-A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
-## Key Capabilities for Disaggregated Serving
-Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
-### Flexible Gang Scheduling
-PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
-### Multi-level Horizontal Auto-Scaling
-Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
-### Network Topology-Aware Scheduling
-Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
-### Custom Startup Dependencies
-Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
-## Use Cases and Examples
-Grove specifically supports:
- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
- **Single-node disaggregated inference** for optimized resource utilization
- **Agentic pipelines of models** for complex AI workflows
- **Standard aggregated serving** patterns for single node or single GPU inference
-## Integration with NVIDIA Dynamo
-Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
-### Complementary Roles
- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
-### Release Coordination
-Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
-### Unified AI Platform
-The integration creates a comprehensive platform where:
- Grove manages complex orchestration of disaggregated components
- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
- Together they enable sophisticated AI serving architectures with simplified management
-## Architecture Benefits
-Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
-1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
-2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
-3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
-4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
-## Getting Started
-Grove relies on KAI Scheduler for resource allocation and scheduling.
-For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).
-For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
-For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](./deployment/multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
-For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
-The Dynamo Kubernetes Platform can also install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Kubernetes Platform Installation Guide](./installation_guide.md) for more details.
\ No newline at end of file
--- a/docs/kubernetes/installation_guide.md
+++ b/docs/kubernetes/installation_guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Installation Guide for Dynamo Kubernetes Platform
-Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.
-## Before You Start
-Determine your cluster environment:
-**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
- CRDs already installed cluster-wide - skip CRD installation step
- A cluster-wide Dynamo operator is likely already running
- **Do NOT install another operator** - use the existing cluster-wide operator
- Only install a namespace-restricted operator if you specifically need to prevent the cluster-wide operator from managing your namespace (e.g., testing operator features you're developing)
-**Dedicated Cluster** (full cluster admin access):
- You install CRDs yourself
- Can use cluster-wide operator (default)
-**Local Development** (Minikube, testing):
- See [Minikube Setup](deployment/minikube.md) first, then follow installation steps below
-To check if CRDs already exist:
-```bash
-kubectl get crd | grep dynamo
-# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
-```
-To check if a cluster-wide operator already exists:
-```bash
-# Check for cluster-wide operator and show its namespace
-kubectl get clusterrolebinding -o json | \
-  jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
-  "Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'
-# If a cluster-wide operator exists: Do NOT install another operator
-# Only install namespace-restricted mode if you specifically need namespace isolation
-```
-## Installation Paths
-The platform is installed using the Dynamo Kubernetes Platform [Helm chart](../../deploy/helm/charts/platform/README.md).
-**Path A: Pre-built Artifacts**
- Use case: Production deployment, shared or dedicated clusters
- Source: NGC published Helm charts
- Time: ~10 minutes
- Jump to: [Path A](#path-a-production-install)
-**Path B: Custom Build from Source**
- Use case: Contributing to Dynamo, using latest features from main branch, customization
- Requirements: Docker build environment
- Time: ~30 minutes
- Jump to: [Path B](#path-b-custom-build-from-source)
-You can override the chart's default values in any helm install command, either by passing your own values file:
-```bash
-helm install ... \
-  -f your-values.yaml
-```
-and/or setting values as flags to the helm install command, as follows:
-```bash
-helm install ... \
-  --set "your-value=your-value"
-```
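-As a sketch, a `your-values.yaml` that collects a few of the options mentioned elsewhere in this guide might look like this. Every key below mirrors a `--set` flag shown in this guide; which ones you need depends on your cluster, so treat the combination as an example only.
-```yaml
-# your-values.yaml (example only)
-grove:
-  enabled: true
-kai-scheduler:
-  enabled: true
-dynamo-operator:
-  # Only needed when reusing an existing Model Express Server
-  modelExpressURL: http://model-express-server.model-express.svc.cluster.local:8080
-```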
-## Prerequisites
-Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access:
-### Required Tools
-| Tool | Minimum Version | Description | Installation |
-|------|-----------------|-------------|--------------|
-| **kubectl** | v1.24+ | Kubernetes command-line tool | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
-| **Helm** | v3.0+ | Kubernetes package manager | [Install Helm](https://helm.sh/docs/intro/install/) |
-| **Docker** | Latest | Container runtime (Path B only) | [Install Docker](https://docs.docker.com/get-docker/) |
-### Cluster and Access Requirements
- **Kubernetes cluster v1.24+** with admin or namespace-scoped access
- **Cluster type determined** (shared vs dedicated) — see [Before You Start](#before-you-start)
- **CRD status checked** if on a shared cluster
- **NGC credentials** (optional) — required only if pulling NVIDIA images from NGC
-### Verify Installation
-Run the following to confirm your tools are correctly installed:
-```bash
-# Verify tools and versions
-kubectl version --client  # Should show v1.24+
-helm version              # Should show v3.0+
-docker version            # Required for Path B only
-# Set your release version
-export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
-```
-### Pre-Deployment Checks
-Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:
-```bash
-./deploy/pre-deployment/pre-deployment-check.sh
-```
-This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](../../deploy/pre-deployment/README.md) for details.
-> **No cluster?** See [Minikube Setup](deployment/minikube.md) for local development.
-**Estimated installation time:** 5-30 minutes depending on path
-## Path A: Production Install
-Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).
-```bash
-# 1. Set environment
-export NAMESPACE=dynamo-system
-export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
-# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
-helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
-helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
-# 3. Install Platform
-helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
-helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
-```
-**For Shared/Multi-Tenant Clusters:**
-If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:
-```bash
-# Add this flag to the helm install command above
--set dynamo-operator.namespaceRestriction.enabled=true
-```
-Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
-If you see this validation error, you need namespace restriction:
-```
-VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
-Found existing namespace-restricted Dynamo operators in namespaces: ...
-```
-> [!TIP]
-> For multinode deployments, you need to install multinode orchestration components:
->
-> **Option 1 (Recommended): Grove + KAI Scheduler**
-> - Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
-> - When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
->
-> ```bash
-> --set "grove.enabled=true"
-> --set "kai-scheduler.enabled=true"
-> ```
->
-> **Option 2: LeaderWorkerSet (LWS) + Volcano**
-> - If using LWS for multinode deployments, you must also install Volcano (required dependency):
->   - [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
->   - [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS)
-> - These must be installed manually before deploying multinode workloads with LWS.
->
-> See the [Multinode Deployment Guide](./deployment/multinode-deployment.md) for details on orchestrator selection.
-> [!TIP]
-> By default, Model Express Server is not used.
-> If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command:
-```bash
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
-```
-> [!TIP]
-> By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces.
-> If you wish to restrict the operator to monitor only a specific namespace (the helm release namespace by default), you can set the namespaceRestriction.enabled to true.
-> You can also change the restricted namespace by setting the targetNamespace property.
-```bash
--set "dynamo-operator.namespaceRestriction.enabled=true"
--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional
-```
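-If you keep overrides in a values file rather than on the command line, the two flags above map to the following nested keys. This is the usual Helm translation of dotted `--set` paths; the namespace name is the same example value used above.
-```yaml
-dynamo-operator:
-  namespaceRestriction:
-    enabled: true
-    targetNamespace: dynamo-namespace   # optional; defaults to the helm release namespace
-```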
-→ [Verify Installation](#verify-installation)
-## Path B: Custom Build from Source
-Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch.
-Note: This gives you access to the latest unreleased features and fixes on the main branch.
-```bash
-# 1. Set environment
-export NAMESPACE=dynamo-system
-export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo  # or your registry (no trailing slash; one is added below)
-export DOCKER_USERNAME='$oauthtoken'
-export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
-export IMAGE_TAG=${RELEASE_VERSION}
-# 2. Build operator
-cd deploy/operator
-# 2.1 Alternative 1 : Build and push the operator image for multiple platforms
-docker buildx create --name multiplatform --driver docker-container --bootstrap
-docker buildx use multiplatform
-docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG --push .
-# 2.2 Alternative 2 : Build and push the operator image for a single platform
-docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
-cd -
-# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
-kubectl create namespace ${NAMESPACE}
-kubectl create secret docker-registry docker-imagepullsecret \
-  --docker-server=${DOCKER_SERVER} \
-  --docker-username=${DOCKER_USERNAME} \
-  --docker-password=${DOCKER_PASSWORD} \
-  --namespace=${NAMESPACE}
-cd deploy/helm/charts
-# 4. Install CRDs
-helm upgrade --install dynamo-crds ./crds/ --namespace default
-# 5. Install Platform
-helm dep build ./platform/
-# To install cluster-wide instead, set NS_RESTRICT_FLAGS="" (empty) or omit that line entirely.
-NS_RESTRICT_FLAGS="--set dynamo-operator.namespaceRestriction.enabled=true"
-helm install dynamo-platform ./platform/ \
-  --namespace "${NAMESPACE}" \
-  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
-  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
-  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" \
-  ${NS_RESTRICT_FLAGS}
-```
-→ [Verify Installation](#verify-installation)
-## Verify Installation
-```bash
-# Check CRDs
-kubectl get crd | grep dynamo
-# Check operator and platform pods
-kubectl get pods -n ${NAMESPACE}
-# Expected: dynamo-operator-*, etcd-*, and nats-* pods in Running state
-```
-## Next Steps
-1. **Deploy Model/Workflow**
-   ```bash
-   # Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
-   kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
-   # Port forward and test
-   kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
-   curl http://localhost:8000/v1/models
-   ```
-2. **Explore Backend Guides**
-   - [vLLM Deployments](../../examples/backends/vllm/deploy/README.md)
-   - [SGLang Deployments](../../examples/backends/sglang/deploy/README.md)
-   - [TensorRT-LLM Deployments](../../examples/backends/trtllm/deploy/README.md)
-3. **Optional:**
-   - [Set up Prometheus & Grafana](./observability/metrics.md)
-   - [SLA Planner Guide](../components/planner/planner_guide.md) (for SLA-aware scheduling and autoscaling)
-## Troubleshooting
-**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**
-```
-VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
-Found existing namespace-restricted Dynamo operators in namespaces: ...
-```
-Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.
-Solution: Add namespace restriction to your installation:
-```bash
--set dynamo-operator.namespaceRestriction.enabled=true
-```
-Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
-**CRDs already exist**
-Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).
-Solution: Skip step 2 (CRD installation), proceed directly to platform installation.
-To check if CRDs exist:
-```bash
-kubectl get crd | grep dynamo
-```
-**Pods not starting?**
-```bash
-kubectl describe pod <pod-name> -n ${NAMESPACE}
-kubectl logs <pod-name> -n ${NAMESPACE}
-```
-**HuggingFace model access?**
-```bash
-kubectl create secret generic hf-token-secret \
-  --from-literal=HF_TOKEN=${HF_TOKEN} \
-  -n ${NAMESPACE}
-```
-**Bitnami etcd "unrecognized" image?**
-```bash
-ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
-```
-This error can appear during `helm install` because Bitnami moved its Docker images to a [secure repository](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).
-To work around it, add the following flags to the `helm install` command:
-```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
-```
-**Clean uninstall?**
-To uninstall the platform, you can run the following command:
-```bash
-helm uninstall dynamo-platform --namespace ${NAMESPACE}
-```
-To uninstall the CRDs, follow these steps:
-Get all of the dynamo CRDs installed in your cluster:
-```bash
-kubectl get crd | grep "dynamo.*nvidia.com"
-```
-You should see something like this:
-```
-dynamocomponentdeployments.nvidia.com               2025-10-21T14:49:52Z
-dynamocomponents.nvidia.com                         2025-10-25T05:16:10Z
-dynamographdeploymentrequests.nvidia.com            2025-11-24T05:26:04Z
-dynamographdeployments.nvidia.com                   2025-09-04T20:56:40Z
-dynamographdeploymentscalingadapters.nvidia.com     2025-12-09T21:05:59Z
-dynamomodels.nvidia.com                             2025-11-07T00:19:43Z
-```
-Delete each CRD one by one:
-```bash
-kubectl delete crd <crd-name>
-```
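Deleting the CRDs one by one can be scripted by extracting the names first. This is a sketch only: `dynamo_crd_names` is a hypothetical helper, and the `cert-manager` line in the demo is made-up sample input that shows unrelated CRDs being filtered out.

```bash
# Read `kubectl get crd` output on stdin and print only the Dynamo CRD
# names (first column), using the same pattern as the grep above.
dynamo_crd_names() {
  awk '$1 ~ /dynamo.*nvidia\.com/ { print $1 }'
}

# Demo on sample lines; no cluster access is needed for this part.
printf '%s\n' \
  'dynamomodels.nvidia.com              2025-11-07T00:19:43Z' \
  'certificates.cert-manager.io         2025-01-01T00:00:00Z' \
  'dynamographdeployments.nvidia.com    2025-09-04T20:56:40Z' \
  | dynamo_crd_names
```

Against a real cluster, the same filter could feed the delete step, for example `kubectl get crd --no-headers | dynamo_crd_names | xargs -n1 kubectl delete crd`. Review the list first: deleting a CRD also deletes every custom resource of that type.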
-## Advanced Options
- [Helm Chart Configuration](../../deploy/helm/charts/platform/README.md)
- [Create custom deployments](./deployment/create_deployment.md)
- [Dynamo Operator details](./dynamo_operator.md)
- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)
--- a/docs/kubernetes/model_caching_with_fluid.md
+++ b/docs/kubernetes/model_caching_with_fluid.md
-# Model Caching with Fluid: Cloud-Native Data Orchestration and Acceleration
-Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
-## Key Features
- **Data Caching and Acceleration:** Cache remote data close to compute workloads for faster access.
- **Unified Data Access:** Access data from S3, HDFS, NFS, and more through a single interface.
- **Kubernetes Native:** Integrates with Kubernetes using CRDs for data management.
- **Scalability:** Supports large-scale data and compute clusters.
-## Installation
-You can install Fluid on any Kubernetes cluster using Helm.
-**Prerequisites:**
- Kubernetes >= 1.18
- `kubectl` >= 1.18
- `Helm` >= 3.5
-**Quick Install:**
-```sh
-kubectl create ns fluid-system
-helm repo add fluid https://fluid-cloudnative.github.io/charts
-helm repo update
-helm install fluid fluid/fluid -n fluid-system
-```
-For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation).
-## Pre-deployment Steps
-1. Install Fluid (see [Installation](#installation)).
-2. Create a Dataset and Runtime (see [the following example](#webufs-example)).
-3. Mount the resulting PVC in your workload.
-## Mounting Data Sources
-### WebUFS Example
-WebUFS allows mounting HTTP/HTTPS sources as filesystems.
-```yaml
-# Mount a public HTTP directory as a Fluid Dataset
-apiVersion: data.fluid.io/v1alpha1
-kind: Dataset
-metadata:
-  name: webufs-model
-spec:
-  mounts:
-    - mountPoint: https://myhost.org/path_to_my_model  # Replace with your HTTP source
-      name: webufs-model
---
-apiVersion: data.fluid.io/v1alpha1
-kind: AlluxioRuntime
-metadata:
-  name: webufs-model
-spec:
-  replicas: 2
-  tieredstore:
-    levels:
-      - mediumtype: MEM
-        path: /dev/shm
-        quota: 2Gi
-        high: "0.95"
-        low: "0.7"
-```
-After applying, Fluid creates a PersistentVolumeClaim (PVC) named `webufs-model` containing the files.
-### S3 Example
-Mount an S3 bucket as a Fluid Dataset.
-```yaml
-# Mount an S3 bucket as a Fluid Dataset
-apiVersion: data.fluid.io/v1alpha1
-kind: Dataset
-metadata:
-  name: s3-model
-spec:
-  mounts:
-    - mountPoint: s3://<your-bucket>  # Replace with your bucket name
-      options:
-        alluxio.underfs.s3.endpoint: http://minio:9000  # S3 endpoint (e.g., MinIO)
-        alluxio.underfs.s3.disable.dns.buckets: "true"
-        aws.secretKey: "<your-secret>"
-        aws.accessKeyId: "<your-access-key>"
---
-apiVersion: data.fluid.io/v1alpha1
-kind: AlluxioRuntime
-metadata:
-  name: s3-model
-spec:
-  replicas: 1
-  tieredstore:
-    levels:
-      - mediumtype: MEM
-        path: /dev/shm
-        quota: 1Gi
-        high: "0.95"
-        low: "0.7"
---
-apiVersion: data.fluid.io/v1alpha1
-kind: DataLoad
-metadata:
-  name: s3-model-loader
-spec:
-  dataset:
-    name: s3-model
-    namespace: <your-namespace>  # Replace with your namespace
-  loadMetadata: true
-  target:
-    - path: "/"
-      replicas: 1
-```
-The resulting PVC is named `s3-model`.
-## Using HuggingFace Models with Fluid
-**Limitations:**
- HuggingFace models are not exposed as simple filesystems or buckets.
- No native integration exists between Fluid and the HuggingFace Hub API.
-**Workaround: Download and Upload to S3/MinIO**
-1. Download the model using the HuggingFace CLI or SDK.
-2. Upload the model files to a supported storage backend (S3, GCS, NFS).
-3. Mount that backend using Fluid.
-**Example Pod to Download and Upload:**
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: download-hf-to-minio
-spec:
-  restartPolicy: Never
-  containers:
-    - name: downloader
-      image: python:3.10-slim
-      command: ["sh", "-c"]
-      args:
-        - |
-          set -eux
-          pip install --no-cache-dir huggingface_hub awscli
-          BUCKET_NAME=hf-models
-          ENDPOINT_URL=http://minio:9000
-          MODEL_NAME=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-          LOCAL_DIR=/tmp/model
-          if ! aws --endpoint-url $ENDPOINT_URL s3 ls "s3://$BUCKET_NAME" > /dev/null 2>&1; then
-            aws --endpoint-url $ENDPOINT_URL s3 mb "s3://$BUCKET_NAME"
-          fi
-          huggingface-cli download $MODEL_NAME --local-dir $LOCAL_DIR --local-dir-use-symlinks False
-          aws --endpoint-url $ENDPOINT_URL s3 cp $LOCAL_DIR s3://$BUCKET_NAME/$MODEL_NAME --recursive
-      env:
-        - name: AWS_ACCESS_KEY_ID
-          value: "<your-access-key>"
-        - name: AWS_SECRET_ACCESS_KEY
-          value: "<your-secret>"
-      volumeMounts:
-        - name: tmp-volume
-          mountPath: /tmp/model
-  volumes:
-    - name: tmp-volume
-      emptyDir: {}
-```
-You can then use `s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B` as your Dataset mount.
-## Usage with Dynamo
-Mount the Fluid-generated PVC in your DynamoGraphDeployment:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: model-caching
-spec:
-  pvcs:
-    - name: s3-model
-  envs:
-    - name: HF_HOME
-      value: /model
-    - name: DYN_DEPLOYMENT_CONFIG
-      value: '{"Common": {"model": "/model", ...}}'
-  services:
-    VllmWorker:
-      volumeMounts:
-        - name: s3-model
-          mountPoint: /model
-    Processor:
-      volumeMounts:
-        - name: s3-model
-          mountPoint: /model
-```
-## Full example with llama3.3 70B
-### Performance
-When deploying Llama 3.3 70B with Fluid as the caching layer, we observed the best performance with a single-node cache that holds 100% of the model files locally. Scheduling the vLLM worker pod on the same node as the Fluid cache eliminated network I/O bottlenecks and gave the fastest model startup time in our tests.
-| Cache Configuration                          | vLLM Pod Placement               | Startup Time    |
-|----------------------------------------------|----------------------------------|-----------------|
-| ❌ No Cache (Download from HuggingFace)      | N/A                              | ~9 minutes      |
-| 🟡 Multi-Node Cache (100% Model Cached)      | Not on Cache Node                | ~18 minutes     |
-| 🟡 Multi-Node Cache (100% Model Cached)      | On Cache Node                    | ~10 minutes     |
-| ✅ Single-Node Cache (100% Model Cached)     | On Cache Node                    | ~80 seconds     |
-### Resources
-```yaml
-# dataset.yaml
-apiVersion: data.fluid.io/v1alpha1
-kind: Dataset
-metadata:
-  name: llama-3-3-70b-instruct-model
-  namespace: my-namespace
-spec:
-  mounts:
-    - mountPoint: s3://hf-models/meta-llama/Llama-3.3-70B-Instruct
-      options:
-        alluxio.underfs.s3.endpoint: http://minio:9000
-        alluxio.underfs.s3.disable.dns.buckets: "true"
-        aws.secretKey: "minioadmin"
-        aws.accessKeyId: "minioadmin"
-        alluxio.underfs.s3.streaming.upload.enabled: "true"
-        alluxio.underfs.s3.multipart.upload.threads: "20"
-        alluxio.underfs.s3.socket.timeout: "50s"
-        alluxio.underfs.s3.request.timeout: "60s"
---
-# runtime.yaml
-apiVersion: data.fluid.io/v1alpha1
-kind: AlluxioRuntime
-metadata:
-  name: llama-3-3-70b-instruct-model
-  namespace: my-namespace
-spec:
-  replicas: 1
-  properties:
-    alluxio.user.file.readtype.default: CACHE_PROMOTE
-    alluxio.user.file.write.type.default: CACHE_THROUGH
-    alluxio.user.block.size.bytes.default: 128MB
-  tieredstore:
-    levels:
-      - mediumtype: MEM
-        path: /dev/shm
-        quota: 300Gi
-        high: "1.0"
-        low: "0.7"
---
-# DataLoad - Preloads the model into cache
-apiVersion: data.fluid.io/v1alpha1
-kind: DataLoad
-metadata:
-  name: llama-3-3-70b-instruct-model-loader
-spec:
-  dataset:
-    name: llama-3-3-70b-instruct-model
-    namespace: my-namespace
-  loadMetadata: true
-  target:
-    - path: "/"
-      replicas: 1
-```
-The associated DynamoGraphDeployment uses node affinity to schedule the vLLM worker on the same node as the Alluxio cache worker:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-hello-world
-spec:
-  envs:
-  - name: DYN_LOG
-    value: "debug"
-  - name: DYN_DEPLOYMENT_CONFIG
-    value: '{"Common": {"model": "/model", "block-size": 64, "max-model-len": 16384},
-      "Frontend": {"served_model_name": "meta-llama/Llama-3.3-70B-Instruct", "endpoint":
-      "dynamo.Processor.chat/completions", "port": 8000}, "Processor": {"router":
-      "round-robin", "router-num-threads": 4, "common-configs": ["model", "block-size",
-      "max-model-len"]}, "VllmWorker": {"tensor-parallel-size": 4, "enforce-eager": true, "max-num-batched-tokens":
-      16384, "enable-prefix-caching": true, "ServiceArgs": {"workers": 1, "resources":
-      {"gpu": "4", "memory": "40Gi"}}, "common-configs": ["model", "block-size", "max-model-len"]},
-      "Planner": {"environment": "kubernetes", "no-operation": true}}'
-  pvcs:
-    - name: llama-3-3-70b-instruct-model
-  services:
-    Processor:
-      volumeMounts:
-        - name: llama-3-3-70b-instruct-model
-          mountPoint: /model
-    VllmWorker:
-      volumeMounts:
-        - name: llama-3-3-70b-instruct-model
-          mountPoint: /model
-      extraPodSpec:
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                  - key: fluid.io/s-alluxio-my-namespace-llama-3-3-70b-instruct-model
-                    operator: In
-                    values:
-                      - "true"
-```
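The node label used in the affinity rule appears to combine the namespace and the Dataset name. The sketch below reproduces that pattern; note that the `fluid.io/s-alluxio-<namespace>-<dataset>` format is inferred from the example above rather than taken from the Fluid documentation, and `fluid_cache_label` is a hypothetical helper.

```bash
# Build the node label key for an Alluxio-backed Fluid dataset.
# Assumed pattern (inferred from the example): fluid.io/s-alluxio-<namespace>-<dataset>
fluid_cache_label() {
  printf 'fluid.io/s-alluxio-%s-%s\n' "$1" "$2"
}

fluid_cache_label my-namespace llama-3-3-70b-instruct-model
```

To check which nodes actually carry the label before relying on it, a selector such as `kubectl get nodes -l "$(fluid_cache_label my-namespace llama-3-3-70b-instruct-model)=true"` can be used.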
-## Troubleshooting & FAQ
- **PVC not created?** Check Fluid and AlluxioRuntime pod logs.
- **Model not found?** Ensure the model was uploaded to the correct bucket/path.
- **Permission errors?** Verify S3/MinIO credentials and bucket policies.
-## Resources
- [Fluid Documentation](https://fluid-cloudnative.github.io/)
- [Alluxio Documentation](https://docs.alluxio.io/)
- [MinIO Documentation](https://docs.min.io/)
- [Hugging Face Hub](https://huggingface.co/docs/hub/index)
- [Dynamo README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md)
- [Dynamo Documentation](https://docs.nvidia.com/dynamo/latest/index.html)
--- a/docs/kubernetes/observability/logging.md
+++ b/docs/kubernetes/observability/logging.md
-# Log Aggregation in Dynamo on Kubernetes
-This guide demonstrates how to set up logging for Dynamo in Kubernetes using Grafana Loki and Alloy. It provides a simple reference configuration that works on Kubernetes clusters, including Minikube and MicroK8s.
-> [!Note]
-> This setup is intended for development and testing purposes. For production environments, please refer to the official documentation for high-availability configurations.
-## Components Overview
- **[Grafana Loki](https://grafana.com/oss/loki/)**: Fast and cost-effective Kubernetes-native log aggregation system.
- **[Grafana Alloy](https://grafana.com/oss/alloy/)**: OpenTelemetry collector that replaces Promtail, gathering logs, metrics and traces from Kubernetes pods.
- **[Grafana](https://grafana.com/grafana/)**: Visualization platform for querying and exploring logs.
-## Prerequisites
-### 1. Dynamo Kubernetes Platform
-This guide assumes you have installed Dynamo Kubernetes Platform. For more information, see [Dynamo Kubernetes Platform](../README.md).
-### 2. Kube-prometheus
-While this guide does not use Prometheus, it assumes Grafana was installed as part of kube-prometheus-stack. For more information, see [kube-prometheus](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
-### 3. Environment Variables
-#### Kubernetes Setup Variables
-Set the following environment variables:
- `MONITORING_NAMESPACE`: The namespace where Loki is installed
- `DYN_NAMESPACE`: The namespace where Dynamo Kubernetes Platform is installed
-```bash
-export MONITORING_NAMESPACE=monitoring
-export DYN_NAMESPACE=dynamo-system
-```
-#### Dynamo Logging Variables
-| Variable | Description | Example |
-|----------|-------------|---------|
-| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` |
-| `DYN_LOG` | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info,dynamo_runtime::system_status_server=trace` |
-| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `true` |
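To make the `DYN_LOG` format concrete, the sketch below splits a value into its default level and per-target overrides. It assumes the `<module_path>=<level>` form described in the table; `parse_dyn_log` is a hypothetical helper for illustration and is not part of Dynamo.

```bash
# Split a DYN_LOG value into its comma-separated entries and label each one.
# Entries containing "=" are per-target overrides; a bare entry is the default level.
parse_dyn_log() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
    case "$entry" in
      *=*) printf 'target=%s level=%s\n' "${entry%=*}" "${entry##*=}" ;;
      *)   printf 'default level=%s\n' "$entry" ;;
    esac
  done
}

parse_dyn_log "info,dynamo_runtime::system_status_server=trace"
```

The module path itself contains `::`, so the split is done on the last `=` rather than on colons.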
-## Installation Steps
-### 1. Install Loki
-First, we'll install Loki in single binary mode, which is ideal for testing and development:
-```bash
-# Add the Grafana Helm repository
-helm repo add grafana https://grafana.github.io/helm-charts
-helm repo update
-# Install Loki
-helm install --values deploy/observability/k8s/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE
-```
-Our configuration (`loki-values.yaml`) sets up Loki in a simple configuration that is suitable for testing and development. It uses a local MinIO for storage. The installation pods can be viewed with:
-```bash
-kubectl get pods -n $MONITORING_NAMESPACE -l app=loki
-```
-### 2. Install Grafana Alloy
-Next, install the Grafana Alloy collector to gather logs from your Kubernetes cluster and forward them to Loki. Here we use the Helm chart `k8s-monitoring` provided by Grafana to install the collector:
-```bash
-# Generate a custom values file with the namespace information
-envsubst < deploy/observability/k8s/logging/values/alloy-values.yaml > alloy-custom-values.yaml
-# Install the collector
-helm install --values alloy-custom-values.yaml alloy grafana/k8s-monitoring -n $MONITORING_NAMESPACE
-```
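Because `envsubst` replaces unset variables with empty strings, it can help to confirm the variables are exported before generating the values file. This is a minimal sketch; `require_env` is a hypothetical helper, not something shipped with Dynamo or Grafana.

```bash
# Return non-zero if any named variable is unset, empty, or not exported.
require_env() {
  local missing=0
  local name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing or empty exported variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

export MONITORING_NAMESPACE=monitoring
export DYN_NAMESPACE=dynamo-system
if require_env MONITORING_NAMESPACE DYN_NAMESPACE; then
  echo "ok: safe to run envsubst"
fi
```

GNU `envsubst` also accepts a list of variables to substitute, for example `envsubst '$MONITORING_NAMESPACE $DYN_NAMESPACE'`, which leaves any other `$...` text in the values file untouched.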
-The values file (`alloy-values.yaml`) includes the following configurations for the collector:
- Destination to forward logs to Loki
- Namespace to collect logs from
- Pod labels to be mapped to Loki labels
- Collection method (kubernetesApi or tailing `/var/log/containers/`)
-```yaml
-destinations:
-- name: loki
-  type: loki
-  url: http://loki-gateway.$MONITORING_NAMESPACE.svc.cluster.local/loki/api/v1/push
-podLogs:
-  enabled: true
-  gatherMethod: kubernetesApi # collect logs from the kubernetes api, rather than /var/log/containers/; friendly for testing and development
-  collector: alloy-logs
-  labels:
-    app_kubernetes_io_name: app.kubernetes.io/name
-    nvidia_com_dynamo_component_type: nvidia.com/dynamo-component-type
-    nvidia_com_dynamo_graph_deployment_name: nvidia.com/dynamo-graph-deployment-name
-  labelsToKeep:
-  - "app_kubernetes_io_name"
-  - "container"
-  - "instance"
-  - "job"
-  - "level"
-  - "namespace"
-  - "service_name"
-  - "service_namespace"
-  - "deployment_environment"
-  - "deployment_environment_name"
-  - "nvidia_com_dynamo_component_type" # extract this label from the dynamo graph deployment
-  - "nvidia_com_dynamo_graph_deployment_name" # extract this label from the dynamo graph deployment
-  namespaces:
-  - $DYN_NAMESPACE
-```
-### 3. Configure Grafana with the Loki datasource and Dynamo Logs dashboard
-We will be viewing the logs associated with our DynamoGraphDeployment in Grafana. To do this, we need to configure Grafana with the Loki datasource and Dynamo Logs dashboard.
-Since we are using Grafana with the Prometheus Operator, we can simply apply the following ConfigMaps to quickly achieve this configuration.
-```bash
-# Configure Grafana with the Loki datasource
-envsubst < deploy/observability/k8s/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
-# Configure Grafana with the Dynamo Logs dashboard
-kubectl apply -f deploy/observability/k8s/logging/grafana/logging-dashboard.yaml -n $MONITORING_NAMESPACE
-```
-> [!Note]
-> If using Grafana installed without the Prometheus Operator, you can manually import the Loki datasource and Dynamo Logs dashboard using the Grafana UI.
-### 4. Deploy a DynamoGraphDeployment with JSONL Logging
-At this point, we should have everything in place to collect and view logs in our Grafana instance. All that is left is to deploy a DynamoGraphDeployment to collect logs from.
-To enable structured logs in a DynamoGraphDeployment, we need to set the `DYN_LOGGING_JSONL` environment variable to `1`. This is already done in the `agg_logging.yaml` example for the SGLang backend. We can now deploy the DynamoGraphDeployment with:
-```bash
-kubectl apply -n $DYN_NAMESPACE -f examples/backends/sglang/deploy/agg_logging.yaml
-```
-Send a few chat completion requests to generate structured logs from the frontend and worker pods of the DynamoGraphDeployment. We are now all set to view the logs in Grafana.
-## Viewing Logs in Grafana
-Port-forward the Grafana service to access the UI:
-```bash
-kubectl port-forward svc/prometheus-grafana 3000:80 -n $MONITORING_NAMESPACE
-```
-If everything is working, under Home > Dashboards > Dynamo Logs, you should see a dashboard for viewing the logs associated with your DynamoGraphDeployments.
-The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend, worker, etc.).
--- a/docs/kubernetes/observability/metrics.md
+++ b/docs/kubernetes/observability/metrics.md
-# Dynamo Metrics Collection on Kubernetes
-## Overview
-This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
-## Prerequisites
-### Install kube-prometheus-stack
-If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:
- `PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
- `ServiceMonitor`: Similar to PodMonitor but works with Services
- `PrometheusRule`: Defines alerting and recording rules
-For a basic installation:
-```bash
-helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
-helm repo update
-# Values allow PodMonitors to be picked up that are outside of the kube-prometheus-stack helm release
-helm install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
-  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
-  --set prometheus.prometheusSpec.podMonitorNamespaceSelector.matchLabels=null \
-  --set prometheus.prometheusSpec.probeNamespaceSelector.matchLabels=null
-```
-> [!Note]
-> The commands below assume you installed kube-prometheus-stack using the method shown above. If your monitoring stack is configured differently, adjust the `kubectl` commands in this document accordingly (e.g., the namespace or Service names).
-### Install Dynamo Operator
-Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../installation_guide.md) for detailed instructions on deploying the Dynamo operator.
-Make sure to set `dynamo-operator.dynamo.metrics.prometheusEndpoint` to the endpoint of the Prometheus instance you installed in the previous step.
-```bash
-helm install dynamo-platform ... \
-  --set dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
-```
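The endpoint follows the usual in-cluster DNS form `<service>.<namespace>.svc.cluster.local:<port>`. The sketch below assembles it from its parts; `prometheus_endpoint` is a hypothetical helper, and the service name shown is the one produced by the `prometheus` release in the `monitoring` namespace as installed above.

```bash
# Build an in-cluster Prometheus URL from service name, namespace, and port.
# The port defaults to 9090, matching the endpoint used in this guide.
prometheus_endpoint() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$1" "$2" "${3:-9090}"
}

prometheus_endpoint prometheus-kube-prometheus-prometheus monitoring
```

If your release or namespace is named differently, `kubectl get svc -n <namespace>` lists the service name to pass in.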
-### Node Exporter for CPU/Memory Metrics
-The Dynamo Grafana dashboard includes panels for node-level CPU utilization, system load, and container resource usage. These metrics are collected and exported to Prometheus via [node-exporter](https://github.com/prometheus/node_exporter), which exposes hardware and OS metrics from Linux systems.
-> [!Note]
-> The kube-prometheus-stack installation described above includes node-exporter by default. If you're using a custom Prometheus setup, you'll need to ensure node-exporter is deployed as a DaemonSet on your cluster nodes.
-To verify node-exporter is running:
-```bash
-kubectl get daemonset -A | grep node-exporter
-```
-If node-exporter is not running, you can install it via the kube-prometheus-stack or deploy it separately. For more information, see the [node-exporter documentation](https://github.com/prometheus/node_exporter).
-### DCGM Metrics Collection (Optional)
-GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
-```bash
-kubectl get daemonset -A | grep dcgm-exporter
-```
-If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
-## Deploy a DynamoGraphDeployment
-Let's start by deploying a simple vLLM aggregated deployment:
-```bash
-export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
-pushd examples/backends/vllm/deploy
-kubectl apply -f agg.yaml -n $NAMESPACE
-popd
-```
-This will create two components:
- A Frontend component exposing metrics on its HTTP port
- A Worker component exposing metrics on its system port
-Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](../../backends/vllm/README.md)
- Available metrics: See the [metrics guide](../../observability/metrics.md)
-### Validate the Deployment
-Let's send some test requests to populate metrics:
-```bash
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-    {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-    }
-    ],
-    "stream": true,
-    "max_tokens": 30
-  }'
-```
-For more information about validating the deployment, see the [vLLM README](../../backends/vllm/README.md).
-## Set Up Metrics Collection
-### Create PodMonitors
-The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates a PodMonitor resource and adds these labels to all pods:
- `nvidia.com/metrics-enabled: "true"` - Enables metrics collection
- `nvidia.com/dynamo-component-type: "frontend|worker"` - Identifies the component type
-> **Note**: You can opt specific deployments out of metrics collection by adding this annotation to your DynamoGraphDeployment:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-  annotations:
-    nvidia.com/enable-metrics: "false"
-spec:
-  # …
-```
-### Configure Grafana Dashboard
-Apply the Dynamo dashboard configuration to populate Grafana with the Dynamo dashboard:
-```bash
-kubectl apply -n monitoring -f deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
-```
-The dashboard is embedded in the ConfigMap. Because the ConfigMap is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds it to its list of available dashboards. The dashboard includes panels for:
- Frontend request rates
- Time to first token
- Inter-token latency
- Request duration
- Input/Output sequence lengths
- GPU utilization via DCGM
- Node CPU utilization and system load
- Container CPU usage per pod
- Memory usage per pod
-## Viewing the Metrics
-### In Prometheus
-```bash
-kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
-```
-Visit http://localhost:9090 and try these example queries:
- `dynamo_frontend_requests_total`
- `dynamo_frontend_time_to_first_token_seconds_bucket`
-![Prometheus UI showing Dynamo metrics](../../images/prometheus-k8s.png)
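The same metrics can be queried from the command line through the Prometheus HTTP API. The following is a sketch that assumes the port-forward above is running; `prom_query`, `PROM_URL`, and the two query variables are hypothetical names, and the 5-minute window and 0.95 quantile are arbitrary illustrative choices.

```bash
# Base URL of the port-forwarded Prometheus instance.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query through the Prometheus HTTP API and print the JSON reply.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Per-second frontend request rate over the last 5 minutes.
RATE_QUERY='rate(dynamo_frontend_requests_total[5m])'
# Approximate 95th-percentile time to first token from the histogram buckets.
TTFT_P95_QUERY='histogram_quantile(0.95, sum by (le) (rate(dynamo_frontend_time_to_first_token_seconds_bucket[5m])))'

# With the port-forward running:
#   prom_query "$RATE_QUERY"
#   prom_query "$TTFT_P95_QUERY"
```

The raw `_total` counter and `_bucket` series only ever increase, so wrapping them in `rate()` is what turns them into a usable per-second view.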
-### In Grafana
-```bash
-# Get Grafana credentials
-export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
-export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
-echo "Grafana user: $GRAFANA_USER"
-echo "Grafana password: $GRAFANA_PASSWORD"
-# Port forward Grafana service
-kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
-```
-Visit http://localhost:3000 and log in with the credentials captured above.
-Once logged in, find the Dynamo dashboard under General.
-![Grafana dashboard showing Dynamo metrics](../../images/grafana-k8s.png)
-## Operator Metrics
-> **Note:** The metrics described above are for Dynamo **applications** (frontends, workers). The Dynamo **Operator** itself also exposes metrics for monitoring controller reconciliation, webhook validation, and resource inventory.
->
-> See the **[Operator Metrics Guide](operator-metrics.md)** for details on operator-specific metrics and the operator dashboard.
-```{toctree}
-:hidden:
-Logging <logging>
-Operator Metrics <operator-metrics>
-```
--- a/docs/kubernetes/observability/operator-metrics.md
+++ b/docs/kubernetes/observability/operator-metrics.md
-# Dynamo Operator Metrics
-## Overview
-The Dynamo Operator exposes Prometheus metrics for monitoring its own health and performance. These metrics are separate from application metrics (frontend/worker) and provide visibility into:
- **Controller Reconciliation**: How efficiently controllers process DynamoGraphDeployments, DynamoComponentDeployments, and DynamoModels
- **Webhook Validation**: Performance and outcomes of admission webhook requests
- **Resource Inventory**: Current count of managed resources by state and namespace
-## Prerequisites
-The operator metrics feature requires the same monitoring infrastructure as application metrics. For detailed setup instructions, see the [Kubernetes Metrics Guide](./metrics.md#prerequisites).
-**Quick checklist:**
- ✅ kube-prometheus-stack installed (for ServiceMonitor support)
- ✅ Prometheus and Grafana running
- ✅ Dynamo Operator installed via Helm
-## Metrics Collection
-### ServiceMonitor
-Operator metrics are automatically collected via a ServiceMonitor, which is created by the Helm chart when `metricsService.enabled: true` (default).
-**Unlike application metrics** (which use PodMonitor), the operator uses ServiceMonitor and requires no manual RBAC configuration. The operator's kube-rbac-proxy sidecar is configured with `--ignore-paths=/metrics` to allow Prometheus access.
-To verify the ServiceMonitor is created:
-```bash
-kubectl get servicemonitor -n dynamo-system
-```
-### Disabling Metrics Collection
-To disable operator metrics collection:
-```bash
-helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
-  --namespace dynamo-system \
-  --set dynamo-operator.metricsService.enabled=false
-```
-## Available Metrics
-All metrics use the `dynamo_operator` namespace prefix.
-### Reconciliation Metrics
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `dynamo_operator_reconcile_duration_seconds` | Histogram | `resource_type`, `namespace`, `result` | Duration of reconciliation loops |
-| `dynamo_operator_reconcile_total` | Counter | `resource_type`, `namespace`, `result` | Total number of reconciliations |
-| `dynamo_operator_reconcile_errors_total` | Counter | `resource_type`, `namespace`, `error_type` | Total reconciliation errors by type |
-**Labels:**
- `resource_type`: `DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter`
- `namespace`: Target namespace of the resource
- `result`: `success`, `error`, `requeue`
- `error_type`: `not_found`, `already_exists`, `conflict`, `validation`, `bad_request`, `unauthorized`, `forbidden`, `timeout`, `server_timeout`, `unavailable`, `rate_limited`, `internal`
-### Webhook Metrics
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `dynamo_operator_webhook_duration_seconds` | Histogram | `resource_type`, `operation` | Duration of webhook validation requests |
-| `dynamo_operator_webhook_requests_total` | Counter | `resource_type`, `operation`, `result` | Total webhook admission requests |
-| `dynamo_operator_webhook_denials_total` | Counter | `resource_type`, `operation`, `reason` | Total webhook denials with reasons |
-**Labels:**
- `resource_type`: Same as reconciliation metrics
- `operation`: `CREATE`, `UPDATE`, `DELETE`
- `result`: `allowed`, `denied`
- `reason`: Validation failure reason (e.g., `immutable_field_changed`, `invalid_config`)
-### Resource Inventory Metrics
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `dynamo_operator_resources_total` | Gauge | `resource_type`, `namespace`, `status` | Current count of resources by state |
-**Labels:**
- `resource_type`: `DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter`
- `namespace`: Resource namespace
- `status`: Resource state derived from each CRD's status. Common values:
-  - `"ready"` - Resource is healthy and operational (DCD, DM, DGDSA)
-  - `"not_ready"` - Resource exists but is not operational (DCD, DM, DGDSA)
-  - `"unknown"` - State cannot be determined (default for empty status)
-  - DGD uses: `"pending"`, `"successful"`, `"failed"` from `.status.state`
-  - DGDR uses: `"Pending"`, `"Profiling"`, `"Deploying"`, `"Ready"`, `"DeploymentDeleted"`, `"Failed"` from `.status.state`
-## Example Queries
-### Reconciliation Performance
-```promql
-# P95 reconciliation duration by resource type
-histogram_quantile(0.95,
-  sum by (resource_type, le) (
-    rate(dynamo_operator_reconcile_duration_seconds_bucket[5m])
-  )
-)
-# Reconciliation rate by result
-sum by (resource_type, result) (
-  rate(dynamo_operator_reconcile_total[5m])
-)
-# Error rate by type
-sum by (resource_type, error_type) (
-  rate(dynamo_operator_reconcile_errors_total[5m])
-)
-```
-### Webhook Performance
-```promql
-# Webhook P95 latency
-histogram_quantile(0.95,
-  sum by (resource_type, le) (
-    rate(dynamo_operator_webhook_duration_seconds_bucket[5m])
-  )
-)
-# Webhook denial rate
-sum by (resource_type, operation, reason) (
-  rate(dynamo_operator_webhook_denials_total[5m])
-)
-```
-### Resource Inventory
-```promql
-# Total resources by type and state
-sum by (resource_type, status) (
-  dynamo_operator_resources_total
-)
-# DynamoGraphDeployments by state
-sum by (status) (
-  dynamo_operator_resources_total{resource_type="DynamoGraphDeployment"}
-)
-# All resources by namespace and state
-sum by (resource_type, namespace, status) (
-  dynamo_operator_resources_total
-)
-```
-## Grafana Dashboard
-A pre-built Grafana dashboard is available for visualizing operator metrics.
-### Dashboard Sections
-1. **Reconciliation Metrics** (3 panels)
-   - Reconciliation rate by resource type and result
-   - P95 reconciliation duration
-   - Reconciliation errors by type
-2. **Webhook Metrics** (3 panels)
-   - Webhook request rate by operation
-   - P95 webhook duration
-   - Webhook denials by reason
-3. **Resource Inventory** (2 panels)
-   - Resource inventory timeline by state and namespace (filterable by resource type)
-   - Current resource count by state (filterable by resource type)
-4. **Operational Health** (2 panels)
-   - Reconciliation success rate gauges
-   - Webhook admission success rate gauges
-### Deploying the Dashboard
-```bash
-kubectl apply -f deploy/observability/k8s/grafana-operator-dashboard-configmap.yaml
-```
-The dashboard appears in Grafana automatically, provided the Grafana dashboard sidecar is configured (kube-prometheus-stack includes it).
-### Finding the Dashboard
-1. Port-forward to Grafana (if needed):
-   ```bash
-   kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
-   ```
-2. Log in to Grafana at http://localhost:3000
-3. Navigate to **Dashboards** → Search for **"Dynamo Operator"**
-### Dashboard Filters
-The dashboard includes two filter variables:
- **Namespace**: View metrics across all namespaces or filter by specific ones (multi-select)
- **Resource Type**: Filter all panels by resource type or select "All" to see aggregated metrics across all CRDs (single select)
-When "All" is selected for Resource Type, every panel shows data for all five managed CRDs, differentiated by the `resource_type` label.
-## Accessing Metrics Directly
-For instructions on accessing Prometheus and Grafana, see the [Kubernetes Metrics Guide](./metrics.md#viewing-the-metrics).
-Once you have access to Prometheus, you can query operator metrics directly:
-```bash
-# Port-forward to Prometheus
-kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
-# Visit http://localhost:9090 and try queries like:
-# - dynamo_operator_reconcile_total
-# - dynamo_operator_webhook_requests_total
-# - dynamo_operator_resources_total
-```
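The same queries can also be issued programmatically through Prometheus's HTTP API. The following is a minimal sketch that only builds the request URL for the instant-query endpoint (`/api/v1/query`); it assumes the port-forward shown above is active on `localhost:9090`, and the helper name `instant_query_url` is illustrative rather than part of Dynamo.

```python
from urllib.parse import urlencode

# Assumes the port-forward above is active; adjust if Prometheus is exposed elsewhere.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    # urlencode percent-escapes the PromQL expression so it is safe in a query string.
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Reconciliation rate by resource type and result, as in the example queries.
    query = "sum by (resource_type, result) (rate(dynamo_operator_reconcile_total[5m]))"
    print(instant_query_url(query))
```

The resulting URL can be fetched with any HTTP client (for example `curl`) while the port-forward is running.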
-## Troubleshooting
-### Metrics Not Appearing in Prometheus
-1. **Check ServiceMonitor exists:**
-   ```bash
-   kubectl get servicemonitor -n dynamo-system | grep operator
-   ```
-2. **Check ServiceMonitor is discovered by Prometheus:**
-   - Go to Prometheus UI → Status → Targets
-   - Look for `serviceMonitor/dynamo-system/dynamo-platform-dynamo-operator-operator`
-   - Should show state: `UP`
-3. **Check Prometheus selector configuration:**
-   ```bash
-   kubectl get prometheus -n monitoring -o yaml | grep serviceMonitorSelector
-   ```
-   Ensure `serviceMonitorSelectorNilUsesHelmValues: false` was set during kube-prometheus-stack installation.
-### Dashboard Not Appearing in Grafana
-1. **Check ConfigMap is created:**
-   ```bash
-   kubectl get configmap -n monitoring grafana-operator-dashboard
-   ```
-2. **Check ConfigMap has the label:**
-   ```bash
-   kubectl get configmap -n monitoring grafana-operator-dashboard -o jsonpath='{.metadata.labels.grafana_dashboard}'
-   ```
-   Should return `1` (the jsonpath output prints the label value without quotes).
-3. **Check Grafana dashboard sidecar configuration:**
-   ```bash
-   kubectl get deployment -n monitoring prometheus-grafana -o yaml | grep -A 5 sidecar
-   ```
-   The sidecar should be configured to watch for `grafana_dashboard: "1"` label.
-4. **Restart Grafana pod** to force dashboard refresh:
-   ```bash
-   kubectl rollout restart deployment/prometheus-grafana -n monitoring
-   ```
-## Related Documentation
- [Kubernetes Metrics Guide](./metrics.md) - Application metrics for frontends and workers
- [Dynamo Operator Guide](../dynamo_operator.md) - Operator architecture and deployment modes
- [Operator Webhooks](../webhooks.md) - Webhook validation details
--- a/docs/kubernetes/service_discovery.md
+++ b/docs/kubernetes/service_discovery.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Service Discovery
-Dynamo components (frontends, workers, planner) need to discover each other and their capabilities at runtime. We refer to this as service discovery. Two service discovery backends are supported on Kubernetes.
-## Discovery Backends
-| Backend | Default | Dependencies | Use Case |
-|---------|---------|--------------|----------|
-| **Kubernetes** | ✅ Yes | None (native K8s) | Recommended for all Kubernetes deployments |
-| **KV Store (etcd)** | No | etcd cluster | Legacy deployments |
-## Kubernetes Discovery (Default)
-Kubernetes discovery is the default and recommended backend when running on Kubernetes. It uses native Kubernetes primitives to facilitate discovery of components:
- **DynamoWorkerMetadata CRD**: Each worker stores its registered endpoints and model cards in a Custom Resource
- **EndpointSlices**: EndpointSlices signal each component's readiness status
-### Implementation Details
-Each pod runs a **discovery daemon** that watches both EndpointSlices and DynamoWorkerMetadata CRs. A pod is only discoverable when it appears as "ready" in an EndpointSlice AND has a corresponding `DynamoWorkerMetadata` CR. This correlation ensures pods aren't discoverable until they're ready, metadata is immediately available, and stale entries are cleaned up when pods terminate.
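The correlation rule above amounts to joining two views on the pod name. The sketch below is illustrative only: `EndpointStatus`, `discoverable_pods`, and the plain dictionaries standing in for `DynamoWorkerMetadata` CRs are hypothetical names, not the discovery daemon's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStatus:
    """Hypothetical view of one EndpointSlice entry."""
    pod_name: str
    ready: bool


def discoverable_pods(endpoints, metadata_by_pod):
    """Return metadata only for pods that are ready AND have a metadata CR."""
    ready = {e.pod_name for e in endpoints if e.ready}
    # A pod missing from either side (not ready, or no CR yet) is excluded.
    return {pod: meta for pod, meta in metadata_by_pod.items() if pod in ready}


if __name__ == "__main__":
    endpoints = [EndpointStatus("worker-a", True), EndpointStatus("worker-b", False)]
    metadata = {"worker-a": {"endpoints": ["generate"]}, "worker-b": {"endpoints": []}}
    print(discoverable_pods(endpoints, metadata))
```

Because the result is recomputed from both inputs, a pod that terminates (and so leaves the EndpointSlice, or loses its CR through garbage collection) drops out without separate cleanup logic.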
-#### DynamoWorkerMetadata CRD
-Each worker pod creates a `DynamoWorkerMetadata` CR that stores its discovery metadata:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoWorkerMetadata
-metadata:
-  name: my-worker-pod-abc123
-  namespace: dynamo-system
-  ownerReferences:
-    - apiVersion: v1
-      kind: Pod
-      name: my-worker-pod-abc123
-      uid: <pod-uid>
-      controller: true
-spec:
-  data:
-    endpoints:
-      "dynamo/backend/generate":
-        type: Endpoint
-        namespace: dynamo
-        component: backend
-        endpoint: generate
-        instance_id: 12345678901234567890
-        transport:
-          nats_tcp: "dynamo_backend.generate-abc123"
-    model_cards: {}
-```
-The CR is named after the pod and includes an owner reference for automatic garbage collection when the pod is deleted.
-#### EndpointSlices
-While DynamoWorkerMetadata resources provide an up-to-date snapshot of a component's capabilities, EndpointSlices give a snapshot of the health of the various Dynamo components.
-The operator creates a Kubernetes Service targeting the Dynamo components. The Kubernetes controller in turn creates and maintains EndpointSlice resources that keep track of the readiness of the pods targeted by the Service. Watching these slices gives us an up-to-date snapshot of which Dynamo components are ready to serve traffic.
-##### Readiness Probes
-A pod is marked ready if the readiness probe succeeds. On Dynamo workers, this is when the `generate` endpoint is available and healthy. These probes are configured by the Dynamo operator for each pod/component.
-#### RBAC
-Each Dynamo component pod is automatically given a ServiceAccount that allows it to watch `EndpointSlice` and `DynamoWorkerMetadata` resources within its namespace.
-#### Environment Variables
-The following environment variables are automatically injected into pods by the operator to facilitate service discovery:
-| Variable | Description |
-|----------|-------------|
-| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
-| `POD_NAME` | Pod name (via downward API) |
-| `POD_NAMESPACE` | Pod namespace (via downward API) |
-| `POD_UID` | Pod UID (via downward API) |
-The pod's instance ID is deterministically generated by hashing the pod name, ensuring consistent identity and correlation between EndpointSlices and CRs.
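To illustrate what "deterministically generated by hashing the pod name" can look like, here is a minimal sketch. The hash function is an assumption: this document does not say which one Dynamo uses, so SHA-256 truncated to 64 bits stands in for it, and `instance_id_from_pod_name` is a hypothetical helper.

```python
import hashlib


def instance_id_from_pod_name(pod_name: str) -> int:
    """Map a pod name to a stable unsigned 64-bit integer.

    Assumption: SHA-256 truncated to 8 bytes; the real hash may differ.
    """
    digest = hashlib.sha256(pod_name.encode("utf-8")).digest()
    # Keep the first 8 bytes so the ID fits in an unsigned 64-bit field.
    return int.from_bytes(digest[:8], byteorder="big")


if __name__ == "__main__":
    print(instance_id_from_pod_name("my-worker-pod-abc123"))
```

The property that matters is that any process that knows the pod name can derive the same ID without coordination, which is what allows EndpointSlice entries and CRs to be matched up.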
-## KV Store Discovery (etcd)
-To use etcd-based discovery instead of Kubernetes-native discovery, add the following annotation to your DynamoGraphDeployment:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-  annotations:
-    nvidia.com/dynamo-discovery-backend: etcd
-spec:
-  services:
-    # ...
-```
-This requires an etcd cluster to be available. The etcd connection is configured via the platform Helm chart.
--- a/docs/kubernetes/webhooks.md
+++ b/docs/kubernetes/webhooks.md
-# Webhooks
-This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
-## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Configuration](#configuration)
-  - [Enabling/Disabling Webhooks](#enablingdisabling-webhooks)
-  - [Certificate Management Options](#certificate-management-options)
-  - [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management)
-  - [Automatic Certificates (Default)](#automatic-certificates-default)
-  - [cert-manager Integration](#cert-manager-integration)
-  - [External Certificates](#external-certificates)
- [Multi-Operator Deployments](#multi-operator-deployments)
- [Troubleshooting](#troubleshooting)
---
-## Overview
-The Dynamo Operator uses **Kubernetes admission webhooks** to validate custom resources in real time. Currently, the operator implements **validation webhooks** only; they reject invalid configurations at the API server level, giving users faster feedback than controller-based validation.
-All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
-### Key Features
- ✅ **Enabled by default** - Zero-touch validation out of the box
- ✅ **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- ✅ **Automatic certificate generation** - No manual certificate management required
- ✅ **Defense in depth** - Controllers validate when webhooks are disabled
- ✅ **cert-manager integration** - Optional integration for automated certificate lifecycle
- ✅ **Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
- ✅ **Immutability enforcement** - Critical fields protected via CEL validation rules
-### Current Webhook Types
- **Validating Webhooks**: Validate custom resource specifications before persistence
-  - `DynamoComponentDeployment` validation
-  - `DynamoGraphDeployment` validation
-  - `DynamoModel` validation
-**Note:** Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document.
---
-## Architecture
-```
-┌─────────────────────────────────────────────────────────────────┐
-│                         API Server                               │
-│  1. User submits CR (kubectl apply)                             │
-│  2. API server calls ValidatingWebhookConfiguration             │
-└────────────────────────┬────────────────────────────────────────┘
-                         │ HTTPS (TLS required)
-                         ▼
-┌─────────────────────────────────────────────────────────────────┐
-│                  Webhook Server (in Operator Pod)                │
-│  3. Validates CR against business rules                         │
-│  4. Returns admit/deny decision + warnings                      │
-└─────────────────────────────────────────────────────────────────┘
-                         │
-                         ▼
-┌─────────────────────────────────────────────────────────────────┐
-│                      API Server                                  │
-│  5. If admitted: Persist CR to etcd                             │
-│  6. If denied: Return error to user                             │
-└─────────────────────────────────────────────────────────────────┘
-```
-### Validation Flow
-1. **Webhook validation** (if enabled): Validates at API server level
-2. **CEL validation**: Kubernetes-native immutability checks (always active)
-3. **Controller validation** (if webhooks disabled): Defense-in-depth validation during reconciliation
---
-## Configuration
-### Enabling/Disabling Webhooks
-Webhooks are **enabled by default**. To disable them:
-```yaml
-# Platform-level values.yaml
-dynamo-operator:
-  webhook:
-    enabled: false
-```
-**When to disable webhooks:**
- During development/testing when rapid iteration is needed
- In environments where admission webhooks are not supported
- When troubleshooting validation issues
-**Note:** When webhooks are disabled, controllers perform validation during reconciliation (defense in depth).
---
-### Certificate Management Options
-The operator supports three certificate management modes:
-| Mode | Description | Use Case |
-|------|-------------|----------|
-| **Automatic (Default)** | Helm hooks generate self-signed certificates | Testing and development environments |
-| **cert-manager** | Integrate with cert-manager for automated lifecycle | Production deployments with cert-manager |
-| **External** | Bring your own certificates | Production deployments with custom PKI |
---
-### Advanced Configuration
-#### Complete Configuration Reference
-```yaml
-dynamo-operator:
-  webhook:
-    # Enable/disable validation webhooks
-    enabled: true
-    # Certificate management
-    certManager:
-      enabled: false
-      issuerRef:
-        kind: Issuer
-        name: selfsigned-issuer
-    # Certificate secret configuration
-    certificateSecret:
-      name: webhook-server-cert
-      external: false
-    # Certificate validity period (automatic generation only)
-    certificateValidity: 3650  # 10 years
-    # Certificate generator image (automatic generation only)
-    certGenerator:
-      image:
-        repository: bitnami/kubectl
-        tag: latest
-    # Webhook behavior configuration
-    failurePolicy: Fail        # Fail (reject on error) or Ignore (allow on error)
-    timeoutSeconds: 10         # Webhook timeout
-    # Namespace filtering (advanced)
-    namespaceSelector: {}      # Kubernetes label selector for namespaces
-```
-#### Failure Policy
-```yaml
-# Fail: Reject resources if webhook is unavailable (recommended for production)
-webhook:
-  failurePolicy: Fail
-# Ignore: Allow resources if webhook is unavailable (use with caution)
-webhook:
-  failurePolicy: Ignore
-```
-**Recommendation:** Use `Fail` in production to ensure validation is always enforced. Only use `Ignore` if you need high availability and can tolerate occasional invalid resources.
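The difference between the two policies only shows up when the webhook cannot be reached. The following sketch models that decision; `admit` is a hypothetical function for illustration, not API server code.

```python
from typing import Optional


def admit(webhook_verdict: Optional[bool], failure_policy: str = "Fail") -> bool:
    """Decide whether a request is admitted.

    webhook_verdict is True/False when the webhook answered,
    or None when it could not be reached (timeout, connection refused).
    """
    if webhook_verdict is not None:
        # The webhook answered, so failurePolicy plays no role.
        return webhook_verdict
    if failure_policy == "Ignore":
        return True   # fail open: the resource is accepted without validation
    if failure_policy == "Fail":
        return False  # fail closed: the request is rejected
    raise ValueError(f"unknown failurePolicy: {failure_policy!r}")


if __name__ == "__main__":
    print(admit(None, "Fail"), admit(None, "Ignore"))
```

With `Ignore`, an outage of the operator pod silently turns validation off, which is why `Fail` is the safer production choice.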
-#### Namespace Filtering
-Control which namespaces are validated (applies to **cluster-wide operator** only):
-```yaml
-# Only validate resources in namespaces with specific labels
-webhook:
-  namespaceSelector:
-    matchLabels:
-      dynamo-validation: enabled
-# Or exclude specific namespaces
-webhook:
-  namespaceSelector:
-    matchExpressions:
-    - key: dynamo-validation
-      operator: NotIn
-      values: ["disabled"]
-```
-**Note:** For **namespace-restricted operators**, the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.
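The two selector forms above behave differently for namespaces that carry no `dynamo-validation` label at all: `matchLabels` excludes them, while `NotIn` includes them. The sketch below mimics standard Kubernetes label-selector semantics to make that visible; `namespace_matches` is an illustrative helper, not part of the operator.

```python
def namespace_matches(labels: dict, selector: dict) -> bool:
    """Evaluate a Kubernetes-style label selector against namespace labels."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, op, values = expr["key"], expr["operator"], expr.get("values", [])
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            # NotIn also matches when the label key is absent.
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unsupported operator: {op!r}")
        if not ok:
            return False
    # An empty selector ({}) matches every namespace.
    return True


if __name__ == "__main__":
    opt_out = {"matchExpressions": [
        {"key": "dynamo-validation", "operator": "NotIn", "values": ["disabled"]}]}
    print(namespace_matches({}, opt_out))
```

In other words, the `matchLabels` form is opt-in per namespace and the `NotIn` form is opt-out.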
---
-## Certificate Management
-### Automatic Certificates (Default)
-**Zero configuration required!** Certificates are automatically generated during `helm install` and `helm upgrade`.
-#### How It Works
-1. **Pre-install/pre-upgrade hook**: Generates self-signed TLS certificates
-   - Root CA (valid 10 years)
-   - Server certificate (valid 10 years)
-   - Stores in Secret: `<release>-webhook-server-cert`
-2. **Post-install/post-upgrade hook**: Injects CA bundle into `ValidatingWebhookConfiguration`
-   - Reads `ca.crt` from Secret
-   - Patches `ValidatingWebhookConfiguration` with base64-encoded CA bundle
-3. **Operator pod**: Mounts certificate secret and serves webhook on port 9443
-#### Certificate Validity
- **Root CA**: 10 years
- **Server Certificate**: 10 years (same as Root CA)
 **Automatic rotation**: Certificates are checked on every `helm upgrade` and regenerated only when they are missing, expiring soon, or have incorrect SANs
-#### Smart Certificate Generation
-The certificate generation hook avoids unnecessary work:
- ✅ **Checks existing certificates** before generating new ones
- ✅ **Skips generation** if valid certificates exist (valid for 30+ days with correct SANs)
- ✅ **Regenerates** only when needed (missing, expiring soon, or incorrect SANs)
-This means:
- Fast `helm upgrade` operations (no unnecessary cert generation)
- Safe to run `helm upgrade` frequently
- Certificates persist across reinstalls (stored in Secret)
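The skip-or-regenerate rule described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the hook's actual script: `needs_regeneration` and its parameters are hypothetical, and the certificate's expiry and SANs are assumed to have been parsed already.

```python
from datetime import datetime, timedelta, timezone

# From the text: existing certificates are kept if valid for 30+ more days.
MIN_REMAINING = timedelta(days=30)


def needs_regeneration(not_after, sans, expected_sans, now=None):
    """Return True if a new certificate should be generated.

    not_after: expiry as a timezone-aware datetime, or None if no certificate exists.
    sans / expected_sans: iterables of DNS names.
    """
    if not_after is None:
        return True  # missing
    now = now or datetime.now(timezone.utc)
    if not_after - now < MIN_REMAINING:
        return True  # expired or expiring soon
    # Every expected SAN must be present on the existing certificate.
    return not set(expected_sans).issubset(sans)


if __name__ == "__main__":
    san = "dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc"
    expiry = datetime.now(timezone.utc) + timedelta(days=365)
    print(needs_regeneration(expiry, [san], [san]))
```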
-#### Manual Certificate Rotation
-If you need to rotate certificates manually:
-```bash
-# Delete the certificate secret
-kubectl delete secret <release>-webhook-server-cert -n <namespace>
-# Upgrade the release to regenerate certificates
-helm upgrade <release> dynamo-platform -n <namespace>
-```
---
-### cert-manager Integration
-For clusters with cert-manager installed, you can enable automated certificate lifecycle management.
-#### Prerequisites
-1. **cert-manager installed** (v1.0+)
-2. **CA issuer configured** (e.g., `selfsigned-issuer`)
-#### Configuration
-```yaml
-dynamo-operator:
-  webhook:
-    certManager:
-      enabled: true
-      issuerRef:
-        kind: Issuer              # Or ClusterIssuer
-        name: selfsigned-issuer   # Your issuer name
-```
-#### How It Works
-1. **Helm creates Certificate resource**: Requests TLS certificate from cert-manager
-2. **cert-manager generates certificate**: Based on configured issuer
-3. **cert-manager stores in Secret**: `<release>-webhook-server-cert`
-4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration`
-5. **Operator pod**: Mounts certificate secret and serves webhook
-#### Benefits Over Automatic Mode
- ✅ **Automated rotation**: cert-manager renews certificates before expiration
- ✅ **Custom validity periods**: Configure certificate lifetime
- ✅ **CA rotation support**: ca-injector handles CA updates automatically
- ✅ **Integration with existing PKI**: Use your organization's certificate infrastructure
-#### Certificate Rotation
-With cert-manager, certificate rotation is **fully automated**:
-1. **Leaf certificate rotation** (default: every year)
-   - cert-manager auto-renews before expiration
-   - controller-runtime auto-reloads new certificate
-   - **No pod restart required**
-   - **No caBundle update required** (same Root CA)
-2. **Root CA rotation** (every 10 years)
-   - cert-manager rotates Root CA
-   - ca-injector auto-updates caBundle in `ValidatingWebhookConfiguration`
-   - **No manual intervention required**
-#### Example: Self-Signed Issuer
-```yaml
-apiVersion: cert-manager.io/v1
-kind: Issuer
-metadata:
-  name: selfsigned-issuer
-  namespace: dynamo-system
-spec:
-  selfSigned: {}
---
-# Enable in platform values.yaml
-dynamo-operator:
-  webhook:
-    certManager:
-      enabled: true
-      issuerRef:
-        kind: Issuer
-        name: selfsigned-issuer
-```
---
-### External Certificates
-Bring your own certificates for custom PKI requirements.
-#### Steps
-1. **Create certificate secret manually**:
-```bash
-kubectl create secret tls <release>-webhook-server-cert \
-  --cert=tls.crt \
-  --key=tls.key \
-  -n <namespace>
-# Also add ca.crt to the secret
-kubectl patch secret <release>-webhook-server-cert -n <namespace> \
-  --type='json' \
-  -p='[{"op": "add", "path": "/data/ca.crt", "value": "'$(base64 -w0 < ca.crt)'"}]'
-```
-2. **Configure operator to use external secret**:
-```yaml
-dynamo-operator:
-  webhook:
-    certificateSecret:
-      external: true
-    caBundle: <base64-encoded-ca-cert>  # Must manually specify
-```
-3. **Deploy operator**:
-```bash
-helm install dynamo-platform . -n <namespace> -f values.yaml
-```
-#### Certificate Requirements
- **Secret name**: Must match `webhook.certificateSecret.name` (default: `webhook-server-cert`)
- **Secret keys**: `tls.crt`, `tls.key`, `ca.crt`
- **Certificate SAN**: Must include `<service-name>.<namespace>.svc`
-  - Example: `dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc`
---
-## Multi-Operator Deployments
-The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**.
-### Scenario
-```
-Cluster:
-├─ Operator A (cluster-wide, namespace: platform-system)
-│  └─ Validates all namespaces EXCEPT team-a
-└─ Operator B (namespace-restricted, namespace: team-a)
-   └─ Validates only team-a namespace
-```
-### How It Works
-1. **Namespace-restricted operator** creates a Lease in its namespace
-2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock`
-3. **Cluster-wide operator** skips validation for namespaces with active Leases
-4. **Namespace-restricted operator** validates resources in its namespace
-### Lease Configuration
-The lease mechanism is **automatically configured** based on deployment mode:
-```yaml
-# Cluster-wide operator (default)
-namespaceRestriction:
-  enabled: false
-# → Watches for leases in all namespaces
-# → Skips validation for namespaces with active leases
-# Namespace-restricted operator
-namespaceRestriction:
-  enabled: true
-  namespace: team-a
-# → Creates lease in team-a namespace
-# → Does NOT check for leases (no cluster permissions)
-```
-### Deployment Example
-```bash
-# 1. Deploy cluster-wide operator
-helm install platform-operator dynamo-platform \
-  -n platform-system \
-  --set namespaceRestriction.enabled=false
-# 2. Deploy namespace-restricted operator for team-a
-helm install team-a-operator dynamo-platform \
-  -n team-a \
-  --set namespaceRestriction.enabled=true \
-  --set namespaceRestriction.namespace=team-a
-```
-### ValidatingWebhookConfiguration Naming
-The webhook configuration name reflects the deployment mode:
- **Cluster-wide**: `<release>-validating`
- **Namespace-restricted**: `<release>-validating-<namespace>`
-Example:
-```bash
-# Cluster-wide
-platform-operator-validating
-# Namespace-restricted (team-a)
-team-a-operator-validating-team-a
-```
-This allows multiple webhook configurations to coexist without conflicts.
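As a quick illustration of the naming rule, the hypothetical helper below reproduces the two patterns listed above; it is not the Helm template itself.

```python
from typing import Optional


def webhook_config_name(release: str, restricted_namespace: Optional[str] = None) -> str:
    """Return the ValidatingWebhookConfiguration name for a release."""
    if restricted_namespace:
        # Namespace-restricted mode appends the namespace to keep names unique.
        return f"{release}-validating-{restricted_namespace}"
    return f"{release}-validating"


if __name__ == "__main__":
    print(webhook_config_name("platform-operator"))
    print(webhook_config_name("team-a-operator", "team-a"))
```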
-### Lease Health
-If the namespace-restricted operator is deleted or becomes unhealthy:
- Lease expires after `leaseDuration + gracePeriod` (default: ~30 seconds)
- Cluster-wide operator automatically resumes validation for that namespace
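The lease coordination described in this section can be sketched as follows. This is a simplified model, not the operator's implementation: the `Lease` fields loosely mirror the `renewTime` and `leaseDurationSeconds` fields of a Kubernetes Lease, and the function names and the `grace` parameter are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

LEASE_NAME = "dynamo-operator-ns-lock"


@dataclass
class Lease:
    """Hypothetical, simplified view of a coordination Lease."""
    name: str
    namespace: str
    renew_time: datetime
    duration: timedelta


def lease_is_active(lease: Lease, now: datetime, grace: timedelta) -> bool:
    """A lease counts as active until renew_time + duration + grace has passed."""
    return now <= lease.renew_time + lease.duration + grace


def cluster_wide_should_validate(namespace, leases, now, grace=timedelta(0)) -> bool:
    """Cluster-wide operator: skip namespaces held by an active lock lease."""
    for lease in leases:
        if (lease.name == LEASE_NAME and lease.namespace == namespace
                and lease_is_active(lease, now, grace)):
            return False
    return True


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    held = Lease(LEASE_NAME, "team-a", now, timedelta(seconds=15))
    print(cluster_wide_should_validate("team-a", [held], now))
    print(cluster_wide_should_validate("team-b", [held], now))
```

Once the namespace-restricted operator stops renewing, the same check starts returning `True` for its namespace after the lease window elapses, which is how the cluster-wide operator resumes validation there.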
---
-## Troubleshooting
-### Webhook Not Called
-**Symptoms:**
- Invalid resources are accepted
- No validation errors in logs
-**Checks:**
-1. **Verify webhook is enabled**:
-```bash
-kubectl get validatingwebhookconfiguration | grep dynamo
-```
-2. **Check webhook configuration**:
-```bash
-kubectl get validatingwebhookconfiguration <name> -o yaml
-# Verify:
-# - caBundle is present and non-empty
-# - clientConfig.service points to correct service
-# - webhooks[].namespaceSelector matches your namespace
-```
-3. **Verify webhook service exists**:
-```bash
-kubectl get service -n <namespace> | grep webhook
-```
-4. **Check operator logs for webhook startup**:
-```bash
-kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
-# Should see: "Webhooks are enabled - webhooks will validate, controllers will skip validation"
-# Should see: "Starting webhook server"
-```
---
-### Connection Refused Errors
-**Symptoms:**
-```
-Error from server (InternalError): Internal error occurred: failed calling webhook:
-Post "https://...webhook-service...:443/validate-...": dial tcp ...:443: connect: connection refused
-```
-**Checks:**
-1. **Verify operator pod is running**:
-```bash
-kubectl get pods -n <namespace> -l app.kubernetes.io/name=dynamo-operator
-```
-2. **Check webhook server is listening**:
-```bash
-# Port-forward to pod
-kubectl port-forward -n <namespace> pod/<operator-pod> 9443:9443
-# In another terminal, test connection
-curl -k https://localhost:9443/validate-nvidia-com-v1alpha1-dynamocomponentdeployment
-# Should NOT get "connection refused"
-```
-3. **Verify webhook port in deployment**:
-```bash
-kubectl get deployment -n <namespace> <release>-dynamo-operator -o yaml | grep -A5 "containerPort: 9443"
-```
-4. **Check for webhook initialization errors**:
-```bash
-kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i error
-```
---
-### Certificate Errors
-**Symptoms:**
-```
-Error from server (InternalError): Internal error occurred: failed calling webhook:
-x509: certificate signed by unknown authority
-```
-**Checks:**
-1. **Verify caBundle is present**:
-```bash
-kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
-# Should output a valid PEM certificate
-```
-2. **Verify certificate secret exists**:
-```bash
-kubectl get secret -n <namespace> <release>-webhook-server-cert
-```
-3. **Check certificate validity**:
-```bash
-kubectl get secret -n <namespace> <release>-webhook-server-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
-# Check:
-# - Not expired
-# - SAN includes: <service-name>.<namespace>.svc
-```
-4. **Check CA injection job logs**:
-```bash
-kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
-```
---
-### Helm Hook Job Failures
-**Symptoms:**
- `helm install` or `helm upgrade` hangs or fails
- Certificate generation errors
-**Checks:**
-1. **List hook jobs**:
-```bash
-kubectl get jobs -n <namespace> | grep webhook
-```
-2. **Check job logs**:
-```bash
-# Certificate generation
-kubectl logs -n <namespace> job/<release>-webhook-cert-gen-<revision>
-# CA injection
-kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
-```
-3. **Check RBAC permissions**:
-```bash
-# Verify ServiceAccount exists
-kubectl get sa -n <namespace> <release>-webhook-ca-inject
-# Verify ClusterRole and ClusterRoleBinding exist
-kubectl get clusterrole <release>-webhook-ca-inject
-kubectl get clusterrolebinding <release>-webhook-ca-inject
-```
-4. **Manual cleanup**:
-```bash
-# Delete failed jobs
-kubectl delete job -n <namespace> <release>-webhook-cert-gen-<revision>
-kubectl delete job -n <namespace> <release>-webhook-ca-inject-<revision>
-# Retry helm upgrade
-helm upgrade <release> dynamo-platform -n <namespace>
-```
---
-### Validation Errors Not Clear
-**Symptoms:**
- Webhook rejects resource but error message is unclear
-**Solution:**
-Check operator logs for detailed validation errors:
-```bash
-kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep "validate create\|validate update"
-```
-Webhook logs include:
- Resource name and namespace
- Validation errors with context
- Warnings for immutable field changes
---
-### Stuck Deleting Resources
-**Symptoms:**
- Resource stuck in "Terminating" state
- Webhook blocks finalizer removal
-**Solution:**
-The webhook automatically skips validation for resources being deleted. If a resource is still stuck:
-1. **Check if webhook is blocking**:
-```bash
-kubectl describe <resource-type> <name> -n <namespace>
-# Look for events mentioning webhook errors
-```
-2. **Temporarily disable webhook**:
-```bash
-# Option 1: Delete ValidatingWebhookConfiguration
-kubectl delete validatingwebhookconfiguration <name>
-# Option 2: Set failurePolicy to Ignore (this patch changes only webhooks[0]; repeat for other entries as needed)
-kubectl patch validatingwebhookconfiguration <name> \
-  --type='json' \
-  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
-```
-3. **Delete resource again**:
-```bash
-kubectl delete <resource-type> <name> -n <namespace>
-```
-4. **Re-enable webhook**:
-```bash
-helm upgrade <release> dynamo-platform -n <namespace>
-```
---
-## Best Practices
-### Production Deployments
-1. ✅ **Keep webhooks enabled** (default) for real-time validation
-2. ✅ **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
-3. ✅ **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
-4. ✅ **Use cert-manager** for automated certificate lifecycle in large deployments
-5. ✅ **Test webhook configuration** in staging before production
-### Development Deployments
-1. ✅ **Disable webhooks** for rapid iteration if needed
-2. ✅ **Use `failurePolicy: Ignore`** if webhook availability is problematic
-3. ✅ **Keep automatic certificates** (simpler than cert-manager for dev)
-### Multi-Tenant Deployments
-1. ✅ **Deploy one cluster-wide operator** for platform-wide validation
-2. ✅ **Deploy namespace-restricted operators** for tenant-specific namespaces
-3. ✅ **Monitor lease health** to ensure coordination works correctly
-4. ✅ **Use unique release names** per namespace to avoid naming conflicts
---
-## Additional Resources
- [Kubernetes Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
- [cert-manager Documentation](https://cert-manager.io/docs/)
- [Kubebuilder Webhook Tutorial](https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html)
- [CEL Validation Rules](https://kubernetes.io/docs/reference/using-api/cel/)
---
-## Support
-For issues or questions:
- Check [Troubleshooting](#troubleshooting) section
- Review operator logs: `kubectl logs -n <namespace> deployment/<release>-dynamo-operator`
- Open an issue on GitHub
--- a/fern/main.css
+++ b/fern/main.css
--- a/docs/observability/README.md
+++ b/docs/observability/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-# Dynamo Observability
-## Getting Started Quickly
-This is an example to get started quickly on a single machine.
-### Prerequisites
-Install these on your machine:
- [Docker](https://docs.docker.com/get-docker/)
- [Docker Compose](https://docs.docker.com/compose/install/)
-### Starting the Observability Stack
-Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, and various exporters for metrics, tracing, and visualization.
-From the Dynamo root directory:
-```bash
-# Start infrastructure (NATS, etcd)
-docker compose -f deploy/docker-compose.yml up -d
-# Start observability stack (Prometheus, Grafana, Tempo, DCGM GPU exporter, NATS exporter)
-docker compose -f deploy/docker-observability.yml up -d
-```
-For detailed setup instructions and configuration, see [Prometheus + Grafana Setup](prometheus-grafana.md).
-## Observability Documentation
-| Guide | Description | Environment Variables to Control |
-|-------|-------------|----------------------------------|
-| [Metrics](metrics.md) | Available metrics reference | `DYN_SYSTEM_PORT`† |
-| [Operator Metrics (Kubernetes)](../kubernetes/observability/operator-metrics.md) | Operator controller and webhook metrics for Kubernetes | N/A (configured via Helm) |
-| [Health Checks](health-checks.md) | Component health monitoring and readiness probes | `DYN_SYSTEM_PORT`†, `DYN_SYSTEM_STARTING_HEALTH_STATUS`, `DYN_SYSTEM_HEALTH_PATH`, `DYN_SYSTEM_LIVE_PATH`, `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` |
-| [Tracing](tracing.md) | Distributed tracing with OpenTelemetry and Tempo | `DYN_LOGGING_JSONL`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_SERVICE_NAME`† |
-| [Logging](logging.md) | Structured logging configuration | `DYN_LOGGING_JSONL`†, `DYN_LOG`, `DYN_LOG_USE_LOCAL_TZ`, `DYN_LOGGING_CONFIG_PATH`, `OTEL_SERVICE_NAME`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`† |
-**Variables marked with † are shared across multiple observability systems.**
-## Developer Guides
-| Guide | Description | Environment Variables to Control |
-|-------|-------------|----------------------------------|
-| [Metrics Developer Guide](metrics-developer-guide.md) | Creating custom metrics in Rust and Python | `DYN_SYSTEM_PORT`† |
-## Kubernetes
-For Kubernetes-specific setup and configuration, see [docs/kubernetes/observability/](../kubernetes/observability/).
-**Operator Metrics**: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the [Operator Metrics Guide](../kubernetes/observability/operator-metrics.md).
---
-## Topology
-This provides:
- **Prometheus** on `http://localhost:9090` - metrics collection and querying
- **Grafana** on `http://localhost:3000` - visualization dashboards (username: `dynamo`, password: `dynamo`)
- **Tempo** on `http://localhost:3200` - distributed tracing backend
- **DCGM Exporter** on `http://localhost:9401/metrics` - GPU metrics
- **NATS Exporter** on `http://localhost:7777/metrics` - NATS messaging metrics
-### Service Relationship Diagram
-```mermaid
-graph TD
-    BROWSER[Browser] -->|:3000| GRAFANA[Grafana :3000]
-    subgraph DockerComposeNetwork [Network inside Docker Compose]
-        NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
-        PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
-        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
-        PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-        PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
-        PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
-        DYNAMOFE --> DYNAMOBACKEND
-        GRAFANA -->|:9090/query API| PROMETHEUS
-    end
-```
-The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This adjustment is made to avoid port conflicts with other dcgm-exporter instances that may be running simultaneously. Such a configuration is typical in distributed systems like SLURM.
-### Configuration Files
-The following configuration files are located in the `deploy/` and `deploy/observability/` directories:
- [docker-compose.yml](../../deploy/docker-compose.yml): Defines NATS and etcd services
- [docker-observability.yml](../../deploy/docker-observability.yml): Defines Prometheus, Grafana, Tempo, and exporters
- [prometheus.yml](../../deploy/observability/prometheus.yml): Contains Prometheus scraping configuration
- [grafana-datasources.yml](../../deploy/observability/grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana_dashboards/dashboard-providers.yml](../../deploy/observability/grafana_dashboards/dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/dynamo.json](../../deploy/observability/grafana_dashboards/dynamo.json): A general Dynamo Dashboard for both SW and HW metrics
- [grafana_dashboards/dcgm-metrics.json](../../deploy/observability/grafana_dashboards/dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
- [grafana_dashboards/kvbm.json](../../deploy/observability/grafana_dashboards/kvbm.json): Contains Grafana dashboard configuration for KVBM metrics
-```{toctree}
-:hidden:
-Prometheus + Grafana Setup <prometheus-grafana>
-Metrics <metrics>
-Metrics Developer Guide <metrics-developer-guide>
-Health Checks <health-checks>
-Tracing <tracing>
-Logging <logging>
-```
--- a/docs/observability/health-checks.md
+++ b/docs/observability/health-checks.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-# Dynamo Health Checks
-## Overview
-Dynamo provides health check and liveness HTTP endpoints for each component,
-which can be used to configure startup, liveness, and readiness probes in
-orchestration frameworks such as Kubernetes.
-## Environment Variables
-| Variable | Description | Default | Example |
-|----------|-------------|---------|---------|
-| `DYN_SYSTEM_PORT` | System status server port | `8081` | `9090` |
-| `DYN_SYSTEM_STARTING_HEALTH_STATUS` | Initial health status | `notready` | `ready`, `notready` |
-| `DYN_SYSTEM_HEALTH_PATH` | Custom health endpoint path | `/health` | `/custom/health` |
-| `DYN_SYSTEM_LIVE_PATH` | Custom liveness endpoint path | `/live` | `/custom/live` |
-| `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` | Endpoints required for ready state | none | `["generate"]` |
-| `DYN_HEALTH_CHECK_ENABLED` | Enable canary health checks | `false` (K8s: `true`) | `true`, `false` |
-| `DYN_CANARY_WAIT_TIME` | Seconds before sending canary health check | `10` | `5`, `30` |
-| `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | Health check request timeout in seconds | `3` | `5`, `10` |
-## Getting Started Quickly
-Enable health checks and query endpoints:
-```bash
-# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
-python -m dynamo.frontend &
-# Enable system status server on port 8081
-DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
-```
-Check health status:
-```bash
-# Frontend health (port 8000)
-curl -s localhost:8000/health | jq
-# Worker health (port 8081)
-curl -s localhost:8081/health | jq
-```
-## Frontend Liveness Check
-The frontend liveness endpoint reports a status of `live` as long as
-the service is running.
-> **Note**: Frontend liveness doesn't depend on worker health or liveness, only on the Frontend service itself.
-### Example Request
-```
-curl -s localhost:8000/live | jq
-```
-### Example Response
-```
-{
-  "message": "Service is live",
-  "status": "live"
-}
-```
-## Frontend Health Check
-The frontend health endpoint responds with `HTTP/1.1 200 OK` as long as
-the service is running. Before any workers have registered, the body
-reports a status of `unhealthy` with an empty `instances` list. Once
-workers have been registered, the status becomes `healthy` and the
-`health` endpoint also lists registered endpoints and instances.
-> **Note**: Frontend liveness doesn't depend on worker health or liveness, only on the Frontend service itself.
-### Example Request
-```
-curl -v localhost:8000/health | jq
-```
-### Example Response
-Before workers are registered:
-```
-HTTP/1.1 200 OK
-content-type: application/json
-content-length: 72
-date: Wed, 03 Sep 2025 13:31:44 GMT
-{
-  "instances": [],
-  "message": "No endpoints available",
-  "status": "unhealthy"
-}
-```
-After workers are registered:
-```
-HTTP/1.1 200 OK
-content-type: application/json
-content-length: 609
-date: Wed, 03 Sep 2025 13:32:03 GMT
-{
-  "endpoints": [
-    "dyn://dynamo.backend.generate"
-  ],
-  "instances": [
-    {
-      "component": "backend",
-      "endpoint": "clear_kv_blocks",
-      "instance_id": 7587888160958628000,
-      "namespace": "dynamo",
-      "transport": {
-        "nats_tcp": "dynamo_backend.clear_kv_blocks-694d98147d54be25"
-      }
-    },
-    {
-      "component": "backend",
-      "endpoint": "generate",
-      "instance_id": 7587888160958628000,
-      "namespace": "dynamo",
-      "transport": {
-        "nats_tcp": "dynamo_backend.generate-694d98147d54be25"
-      }
-    },
-    {
-      "component": "backend",
-      "endpoint": "load_metrics",
-      "instance_id": 7587888160958628000,
-      "namespace": "dynamo",
-      "transport": {
-        "nats_tcp": "dynamo_backend.load_metrics-694d98147d54be25"
-      }
-    }
-  ],
-  "status": "healthy"
-}
-```
-## Worker Liveness and Health Check
-Health checks for components other than the frontend are enabled
-selectively based on environment variables. If a health check for a
-component is enabled, the starting status can be set along with the set
-of endpoints that must be served before the component is
-declared `ready`.
-Once all endpoints declared in `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS`
-are served, the component transitions to the `ready` state and stays there
-until the component is shut down. The endpoints return `HTTP/1.1 503 Service Unavailable`
-while initializing and `HTTP/1.1 200 OK` once ready.
-> **Note**: Both `/live` and `/health` return the same information.
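The status codes described above are enough to build a simple readiness wait. The following is a minimal sketch, not part of Dynamo: `probe_status` and `wait_until_ready` are hypothetical helper names, and the URL in the usage note assumes the system status server listens on port 9090 with the default `/health` path.

```python
import time
import urllib.error
import urllib.request


def probe_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as 503 while initializing) land here.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


def wait_until_ready(url, attempts=30, interval=1.0,
                     probe=probe_status, sleep=time.sleep):
    """Poll url until it returns 200; give up after the given number of attempts."""
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        sleep(interval)
    return False
```

For example, `wait_until_ready("http://localhost:9090/health")` would return `True` once the worker reports `ready`, and `False` if it is still returning 503 after 30 attempts. The `probe` and `sleep` parameters exist only to make the helper easy to test without a running server.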
-### Example Environment Setting
-```
-export DYN_SYSTEM_PORT=9090
-export DYN_SYSTEM_STARTING_HEALTH_STATUS="notready"
-export DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS="[\"generate\"]"
-```
-#### Example Request
-```
-curl -v localhost:9090/health | jq
-```
-#### Example Response
-Before endpoints are being served:
-```
-HTTP/1.1 503 Service Unavailable
-content-type: text/plain; charset=utf-8
-content-length: 96
-date: Wed, 03 Sep 2025 13:42:39 GMT
-{
-  "endpoints": {
-    "generate": "notready"
-  },
-  "status": "notready",
-  "uptime": {
-    "nanos": 313803539,
-    "secs": 12
-  }
-}
-```
-After endpoints are being served:
-```
-HTTP/1.1 200 OK
-content-type: text/plain; charset=utf-8
-content-length: 139
-date: Wed, 03 Sep 2025 13:42:45 GMT
-{
-  "endpoints": {
-    "clear_kv_blocks": "ready",
-    "generate": "ready",
-    "load_metrics": "ready"
-  },
-  "status": "ready",
-  "uptime": {
-    "nanos": 356504530,
-    "secs": 18
-  }
-}
-```
-## Canary Health Checks (Active Monitoring)
-In addition to the HTTP endpoints described above, Dynamo includes a **canary health check** system that actively monitors worker endpoints.
-### Overview
-The canary health check system:
- **Monitors endpoint health** by sending periodic test requests to worker endpoints
- **Only activates during idle periods** - if there's ongoing traffic, health checks are skipped to avoid overhead
- **Automatically enabled in Kubernetes** deployments via the operator
- **Disabled by default** in local/development environments
-### How It Works
-1. **Idle Detection**: After no activity on an endpoint for a configurable wait time (default: 10 seconds), a canary health check is triggered
-2. **Health Check Request**: A lightweight test request is sent to the endpoint with a minimal payload (generates 1 token)
-3. **Activity Resets Timer**: If normal requests arrive, the canary timer resets and no health check is sent
-4. **Timeout Handling**: If a health check doesn't respond within the timeout (default: 3 seconds), the endpoint is marked as unhealthy
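The idle-detection steps above can be illustrated with a small simulation. This is a sketch under stated assumptions, not Dynamo's implementation: `CanaryTimer` and `simulate` are hypothetical names, time advances in whole seconds, and the timer is assumed to restart after each canary check so that checks repeat while the endpoint stays idle.

```python
class CanaryTimer:
    """Tracks the last activity on an endpoint (hypothetical helper)."""

    def __init__(self, wait_time=10.0):
        self.wait_time = wait_time
        self.last_activity = 0.0

    def record_activity(self, now):
        # Any normal request (or a completed canary check) resets the timer.
        self.last_activity = now

    def should_probe(self, now):
        # A canary check is due once the endpoint has been idle for wait_time.
        return now - self.last_activity >= self.wait_time


def simulate(request_times, end, wait_time=10.0):
    """Return the seconds at which canary checks would fire."""
    timer = CanaryTimer(wait_time)
    pending = sorted(request_times)
    probes = []
    for now in range(int(end) + 1):
        while pending and pending[0] <= now:
            timer.record_activity(pending.pop(0))
        if timer.should_probe(now):
            probes.append(now)
            timer.record_activity(now)  # assumption: timer restarts after a check
    return probes


print(simulate([3, 8, 12], end=40))
```

With requests arriving at seconds 3, 8, and 12 and the default 10-second wait, no check is sent while traffic keeps resetting the timer; the first check fires only after 10 idle seconds have passed since the last request.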
-### Configuration
-#### In Kubernetes (Enabled by Default)
-Health checks are automatically enabled by the Dynamo operator. No additional configuration is required.
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-spec:
-  services:
-    VllmWorker:
-      componentType: worker
-      replicas: 2
-      # Health checks automatically enabled by operator
-```
-#### In Local/Development Environments (Disabled by Default)
-To enable health checks locally:
-```bash
-# Enable health checks
-export DYN_HEALTH_CHECK_ENABLED=true
-# Optional: Customize timing
-export DYN_CANARY_WAIT_TIME=5  # Wait 5 seconds before sending health check
-export DYN_HEALTH_CHECK_REQUEST_TIMEOUT=5  # 5 second timeout
-# Start worker
-python -m dynamo.vllm --model Qwen/Qwen3-0.6B
-```
-#### Configuration Options
-| Environment Variable | Description | Default | Notes |
-|---------------------|-------------|---------|-------|
-| `DYN_HEALTH_CHECK_ENABLED` | Enable/disable canary health checks | `false` (K8s: `true`) | Automatically set to `true` in K8s |
-| `DYN_CANARY_WAIT_TIME` | Seconds to wait (during idle) before sending health check | `10` | Lower values = more frequent checks |
-| `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | Max seconds to wait for health check response | `3` | Higher values = more tolerance for slow responses |
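To make the defaults in the table concrete, here is a minimal sketch of reading the three variables from a test harness or wrapper script. `load_canary_config` is a hypothetical helper, not a Dynamo API, and treating only the literal string `true` as enabled is an assumption; the runtime's own parsing may accept other values.

```python
import os


def load_canary_config(env=None):
    """Read the canary settings, falling back to the local (non-K8s) defaults."""
    env = os.environ if env is None else env
    # Assumption: only the string "true" (case-insensitive) enables the checks.
    enabled = env.get("DYN_HEALTH_CHECK_ENABLED", "false").strip().lower() == "true"
    return {
        "enabled": enabled,
        "wait_time": float(env.get("DYN_CANARY_WAIT_TIME", "10")),
        "request_timeout": float(env.get("DYN_HEALTH_CHECK_REQUEST_TIMEOUT", "3")),
    }


print(load_canary_config({"DYN_HEALTH_CHECK_ENABLED": "true",
                          "DYN_CANARY_WAIT_TIME": "5"}))
```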
-### Health Check Payloads
-Each backend defines its own minimal health check payload:
- **vLLM**: Single token generation with minimal sampling options
- **TensorRT-LLM**: Single token with BOS token ID
- **SGLang**: Single token generation request
-These payloads are designed to:
- Complete quickly (< 100ms typically)
- Minimize GPU overhead
- Verify the full inference stack is working
-### Observing Health Checks
-When health checks are enabled, you'll see logs like:
-```
-INFO Health check manager started (canary_wait_time: 10s, request_timeout: 3s)
-INFO Spawned health check task for endpoint: generate
-INFO Canary timer expired for generate, sending health check
-INFO Health check successful for generate
-```
-If an endpoint fails:
-```
-WARN Health check timeout for generate
-ERROR Health check request failed for generate: connection refused
-```
-### When to Use Canary Health Checks
-**Enable in production (Kubernetes):**
- ✅ Detect unhealthy workers before they affect user traffic
- ✅ Enable faster failure detection and recovery
- ✅ Monitor worker availability continuously
-**Disable in development:**
- ✅ Reduce log noise during debugging
- ✅ Avoid overhead when not needed
- ✅ Simplify local testing
-### Troubleshooting
-**Health checks timing out:**
- Increase `DYN_HEALTH_CHECK_REQUEST_TIMEOUT`
- Check worker logs for errors
- Verify network connectivity
-**Too many health check logs:**
- Increase `DYN_CANARY_WAIT_TIME` to reduce frequency
- Or disable with `DYN_HEALTH_CHECK_ENABLED=false` in dev
-**Health checks not running:**
- Verify `DYN_HEALTH_CHECK_ENABLED=true` is set
- Check that `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` includes the endpoint
- Ensure the worker is serving the endpoint
-## Related Documentation
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
- [Dynamo Architecture Overview](../design_docs/architecture.md)
- [Backend Guide](../development/backend-guide.md)
--- a/docs/observability/logging.md
+++ b/docs/observability/logging.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-# Dynamo Logging
-## Overview
-Dynamo provides structured logging in both plain-text and JSONL formats. When
-JSONL is enabled, logs support `trace_id` and `span_id` fields for
-distributed tracing. Span creation and exit events can optionally be
-enabled via the `DYN_LOGGING_SPAN_EVENTS` environment variable.
-## Environment Variables
-| Variable | Description | Default | Example |
-|----------|-------------|---------|---------|
-| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
-| `DYN_LOGGING_SPAN_EVENTS` | Enable span entry/close event logging (`SPAN_FIRST_ENTRY`, `SPAN_CLOSED` messages) | `false` | `true` |
-| `DYN_LOG` | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info` | `DYN_LOG=info,dynamo_runtime::system_status_server:trace` |
-| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps (default is UTC) | `false` | `true` |
-| `DYN_LOGGING_CONFIG_PATH` | Path to custom TOML logging configuration | none | `/path/to/config.toml` |
-| `OTEL_SERVICE_NAME` | Service name for trace and span information | `dynamo` | `dynamo-frontend` |
-| `OTEL_EXPORT_ENABLED` | Enable OTLP trace exporting | `false` | `true` |
-| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP exporter endpoint | `http://localhost:4317` | `http://tempo:4317` |
-## Getting Started Quickly
-### Start Observability Stack
-For collecting and visualizing logs with Grafana Loki (Kubernetes), or viewing trace context in logs alongside Grafana Tempo, start the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
-### Enable Structured Logging
-Enable structured JSONL logging:
-```bash
-export DYN_LOGGING_JSONL=true
-export DYN_LOG=debug
-# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
-python -m dynamo.frontend &
-python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
-```
-Logs will be written to stderr in JSONL format with trace context.
-## Available Logging Levels
-| **Logging Levels (Least to Most Verbose)** | **Description**                                                                 |
-|-------------------------------------------|---------------------------------------------------------------------------------|
-| **ERROR**                                 | Critical errors (e.g., unrecoverable failures, resource exhaustion)              |
-| **WARN**                                  | Unexpected or degraded situations (e.g., retries, recoverable errors)           |
-| **INFO**                                  | Operational information (e.g., startup/shutdown, major events)                 |
-| **DEBUG**                                 | General debugging information (e.g., variable values, flow control)            |
-| **TRACE**                                 | Very low-level, detailed information (e.g., internal algorithm steps)           |
-## Example Readable Format
-Environment Setting:
-```
-export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
-export DYN_LOGGING_JSONL="false"
-```
-Resulting Log format:
-```
-2025-09-02T15:50:01.770028Z  INFO main.init: VllmWorker for Qwen/Qwen3-0.6B has been initialized
-2025-09-02T15:50:01.770195Z  INFO main.init: Reading Events from tcp://127.0.0.1:21555
-2025-09-02T15:50:01.770265Z  INFO main.init: Getting engine runtime configuration metadata from vLLM engine...
-2025-09-02T15:50:01.770316Z  INFO main.get_engine_cache_info: Cache config values: {'num_gpu_blocks': 24064}
-2025-09-02T15:50:01.770358Z  INFO main.get_engine_cache_info: Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}
-```
-## Example JSONL Format
-Environment Setting:
-```
-export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
-export DYN_LOGGING_JSONL="true"
-```
-Resulting Log format:
-```
-{"time":"2025-09-02T15:53:31.943377Z","level":"INFO","target":"log","message":"VllmWorker for Qwen/Qwen3-0.6B has been initialized","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":191,"log.target":"main.init"}
-{"time":"2025-09-02T15:53:31.943550Z","level":"INFO","target":"log","message":"Reading Events from tcp://127.0.0.1:26771","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":212,"log.target":"main.init"}
-{"time":"2025-09-02T15:53:31.943636Z","level":"INFO","target":"log","message":"Getting engine runtime configuration metadata from vLLM engine...","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":220,"log.target":"main.init"}
-{"time":"2025-09-02T15:53:31.943701Z","level":"INFO","target":"log","message":"Cache config values: {'num_gpu_blocks': 24064}","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":267,"log.target":"main.get_engine_cache_info"}
-{"time":"2025-09-02T15:53:31.943747Z","level":"INFO","target":"log","message":"Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":268,"log.target":"main.get_engine_cache_info"}
-```
-## Logging of Trace and Span IDs
-When `DYN_LOGGING_JSONL` is enabled, all logs include `trace_id` and `span_id` fields, and spans are automatically created for requests. This is useful for correlating log messages with traces, and for short debugging sessions where you want to examine trace context in logs without setting up a full tracing backend.
-The trace and span information uses the OpenTelemetry format and libraries, which means the IDs are compatible with OpenTelemetry-based tracing backends like Tempo or Jaeger if you later choose to enable trace export.
-**Note:** This section overlaps with [Distributed Tracing with Tempo](tracing.md); see that guide for trace visualization in Grafana Tempo and persistent trace analysis.
-### Configuration for Logging
-To see trace information in logs:
-```bash
-export DYN_LOGGING_JSONL=true
-export DYN_LOG=debug  # Set to debug to see detailed trace logs
-# Start your Dynamo components (e.g., frontend and worker) (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
-python -m dynamo.frontend &
-python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
-```
-This enables JSONL logging with `trace_id` and `span_id` fields. Traces appear in logs but are not exported to any backend.
-### Example Request
-Send a request to generate logs with trace context:
-```bash
-curl -H 'Content-Type: application/json' \
-  -H 'x-request-id: test-trace-001' \
-  -d '{
-  "model": "Qwen/Qwen3-0.6B",
-  "max_completion_tokens": 100,
-  "messages": [
-    {"role": "user", "content": "What is the capital of France?"}
-  ]
-}' \
-http://localhost:8000/v1/chat/completions
-```
-Check the logs (stderr) for JSONL output containing `trace_id`, `span_id`, and `x_request_id` fields.
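Once the JSONL output has been captured (for example by redirecting stderr to a file), the records can be grouped by `trace_id`. The following is a minimal sketch: `group_by_trace` is a hypothetical helper, and the sample records are shortened, made-up lines that only reuse the field names shown in this guide.

```python
import json
from collections import defaultdict


def group_by_trace(lines, request_id=None):
    """Group JSONL log records by trace_id, optionally filtering on x_request_id."""
    traces = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        if request_id is not None and record.get("x_request_id") != request_id:
            continue
        trace_id = record.get("trace_id")
        if trace_id:
            traces[trace_id].append(record)
    return dict(traces)


# Made-up sample records for illustration only.
sample = [
    '{"level":"DEBUG","message":"first","trace_id":"aaa","span_id":"s1","x_request_id":"test-trace-001"}',
    '{"level":"DEBUG","message":"second","trace_id":"bbb","span_id":"s2","x_request_id":"other"}',
    '{"level":"DEBUG","message":"third","trace_id":"aaa","span_id":"s3","x_request_id":"test-trace-001"}',
]

for trace_id, records in group_by_trace(sample, request_id="test-trace-001").items():
    print(trace_id, [r["message"] for r in records])
```

In practice, `lines` could be an open file handle over the captured log, since the helper only iterates over it line by line.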
-## Trace and Span Information in Logs
-This section shows how trace and span information appears in JSONL logs. These logs can be used to understand request flows even without a trace visualization backend.
-### Example Disaggregated Trace in Grafana
-When viewing the corresponding trace in Grafana, you should be able to see something like the following:
-![Disaggregated Trace Example](grafana-disagg-trace.png)
-### Trace Overview
-Dynamo creates distributed traces that span across multiple services in a disaggregated serving setup. The following sections describe the key spans you'll see in Grafana when viewing traces for chat completion requests.
-#### Available Spans in Disaggregated Mode
-When running Dynamo in disaggregated mode, a typical request creates the following spans:
-##### 1. `http-request` (Frontend - Root Span)
-The root span for the entire request lifecycle, created in the **dynamo-frontend** service.
-**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Handles the HTTP request from client to completion
- **Duration**: Total end-to-end request time (includes prefill + decode)
- **Method**: HTTP method (typically `POST`)
- **URI**: Request endpoint (e.g., `/v1/chat/completions`)
- **Status**: Request completion status
- **Children**: Typically 2-3 child spans (routing span + worker spans)
-This span represents the complete request flow from when the frontend receives the HTTP request until the final response is sent back to the client.
-##### 2. `prefill_routing` (Frontend - Routing Span)
-A child span of `http-request`, created in the **dynamo-frontend** service during the routing phase.
-**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Routes the prefill request to an appropriate prefill worker
 - **Duration**: Time spent selecting a prefill worker plus the duration of the prefill request
 - **Parent**: `http-request` span
-This span captures the routing decision and the request sent to the prefill worker.
-##### 3. `handle_payload` (Prefill Worker Span)
-A child span of `http-request`, created in the **dynamo-worker-vllm-prefill** service.
-**Key Attributes:**
- **Service**: `dynamo-worker-vllm-prefill` (or `dynamo-worker-sglang-prefill` for SGLang)
- **Operation**: Processes the prefill phase of generation
- **Duration**: Time to compute prefill (typically milliseconds to seconds)
- **Component**: `prefill`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
-This span represents the actual prefill computation on a prefill-specialized worker, including prompt processing and initial KV cache generation.
-##### 4. `handle_payload` (Decode Worker Span)
-A child span of `http-request`, created in the **dynamo-worker-vllm-decode** service.
-**Key Attributes:**
- **Service**: `dynamo-worker-vllm-decode` (or `dynamo-worker-sglang-decode` for SGLang)
- **Operation**: Processes the decode phase of generation
- **Duration**: Time to generate all output tokens (typically seconds)
- **Component**: `decode` or `backend`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
-This span represents the iterative token generation phase on a decode-specialized worker, which consumes the KV cache from prefill and produces output tokens.
-#### Understanding Span Metrics
-Each span provides several useful metrics:
-| Metric | Description |
-|--------|-------------|
-| **Duration** | Total time from span start to end |
-| **Busy Time** | Time actively processing (excluding waiting) |
-| **Idle Time** | Time spent waiting (e.g., for network, other services) |
-| **Start Time** | When the span began |
-| **Child Count** | Number of direct child spans |
-The relationship **Duration = Busy Time + Idle Time** helps identify where time is spent and potential bottlenecks.
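As a worked example of this relationship, the sketch below derives idle time and its share of the span from a duration and a busy time. The numbers are hypothetical and `idle_share` is an illustrative helper, not something Dynamo or Grafana provides.

```python
def idle_share(duration_ms, busy_ms):
    """Return (idle time in ms, idle fraction of the span) from Duration = Busy + Idle."""
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    idle_ms = duration_ms - busy_ms
    return idle_ms, idle_ms / duration_ms


# Hypothetical span: 1200 ms total, of which 300 ms was spent actively processing.
idle_ms, share = idle_share(duration_ms=1200.0, busy_ms=300.0)
print(f"idle: {idle_ms:.0f} ms ({share:.0%} of the span)")
```

A span that is mostly idle, as in this example, suggests the time is going to waiting on the network or on another service rather than to work inside the span itself.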
-## Custom Request IDs in Logs
-You can provide a custom request ID using the `x-request-id` header. This ID will be attached to all spans and logs for that request, making it easier to correlate traces with application-level request tracking.
-### Example Request with Custom Request ID
-```sh
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H 'Content-Type: application/json' \
-  -H 'x-request-id: 8372eac7-5f43-4d76-beca-0a94cfb311d0' \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-      {
-        "role": "user",
-        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
-      }
-    ],
-    "stream": false,
-    "max_tokens": 1000
-  }'
-```
-All spans and logs for this request will include the `x_request_id` attribute with value `8372eac7-5f43-4d76-beca-0a94cfb311d0`.
-### Frontend Logs with Custom Request ID
-Notice how the `x_request_id` field appears in all log entries, alongside the `trace_id` (`80196f3e3a6fdf06d23bb9ada3788518`) and `span_id`:
-```
-{"time":"2025-10-31T21:06:45.397194Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
-{"time":"2025-10-31T21:06:45.418584Z","level":"DEBUG","file":"/opt/dynamo/lib/llm/src/kv_router/prefill_router.rs","line":232,"target":"dynamo_llm::kv_router::prefill_router","message":"Prefill succeeded, using disaggregated params for decode","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
-{"time":"2025-10-31T21:06:45.418854Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
-```
-## Related Documentation
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
- [Dynamo Architecture Overview](../design_docs/architecture.md)
- [Backend Guide](../development/backend-guide.md)
- [Log Aggregation in Kubernetes](../kubernetes/observability/logging.md)
--- a/docs/observability/metrics-developer-guide.md
+++ b/docs/observability/metrics-developer-guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-# Metrics Developer Guide
-This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API.
-## Metrics Exposure
-All metrics created via the Dynamo metrics API are automatically exposed on the `/metrics` HTTP endpoint in Prometheus Exposition Format text when the following environment variable is set:
- `DYN_SYSTEM_PORT=<port>` - Port for the metrics endpoint (set to positive value to enable, default: `-1` disabled)
-Example:
-```bash
-DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
-```
-Prometheus Exposition Format text metrics will be available at: `http://localhost:8081/metrics`
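To inspect the endpoint's output without a Prometheus server, the text format can be parsed directly. This is a minimal sketch: `parse_samples` is a hypothetical helper, the metric names in `example` are made up for illustration, and the parser assumes samples carry no trailing timestamp.

```python
def parse_samples(text):
    """Map each series (name plus labels) to its float value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE comments
        # The value is the last space-separated field (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Made-up exposition text for illustration only.
example = """\
# HELP requests_total Total requests
# TYPE requests_total counter
requests_total 42
latency_seconds_bucket{le="0.1"} 7
latency_seconds_sum 1.5
"""

for series, value in parse_samples(example).items():
    print(series, value)
```

The same function could be fed the body of `http://localhost:8081/metrics` fetched with `urllib.request.urlopen`, provided the worker was started with `DYN_SYSTEM_PORT=8081`.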
-## Metric Name Constants
-The [prometheus_names.rs](../../lib/runtime/src/metrics/prometheus_names.rs) module provides centralized metric name constants and sanitization functions to ensure consistency across all Dynamo components.
---
-## Metrics API in Rust
-The metrics API is accessible through the `.metrics()` method on runtime, namespace, component, and endpoint objects. See [Runtime Hierarchy](metrics.md#runtime-hierarchy) for details on the hierarchical structure.
-### Available Methods
- `.metrics().create_counter()`: Create a counter metric
- `.metrics().create_gauge()`: Create a gauge metric
- `.metrics().create_histogram()`: Create a histogram metric
- `.metrics().create_countervec()`: Create a counter with labels
- `.metrics().create_gaugevec()`: Create a gauge with labels
- `.metrics().create_histogramvec()`: Create a histogram with labels
-### Creating Metrics
-```rust
-use dynamo_runtime::DistributedRuntime;
-let runtime = DistributedRuntime::new()?;
-let endpoint = runtime.namespace("my_namespace").component("my_component").endpoint("my_endpoint");
-// Simple metrics
-let requests_total = endpoint.metrics().create_counter(
-    "requests_total",
-    "Total requests",
-    &[]
-)?;
-let active_connections = endpoint.metrics().create_gauge(
-    "active_connections",
-    "Active connections",
-    &[]
-)?;
-let latency = endpoint.metrics().create_histogram(
-    "latency_seconds",
-    "Request latency",
-    &[],
-    Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
-)?;
-```
-### Using Metrics
-```rust
-// Counters
-requests_total.inc();
-// Gauges
-active_connections.set(42.0);
-active_connections.inc();
-active_connections.dec();
-// Histograms
-latency.observe(0.023);  // 23ms
-```
-### Vector Metrics with Labels
-```rust
-// Create vector metrics with label names
-let requests_by_model = endpoint.metrics().create_countervec(
-    "requests_by_model",
-    "Requests by model",
-    &["model_type", "model_size"],
-    &[]
-)?;
-let memory_by_gpu = endpoint.metrics().create_gaugevec(
-    "gpu_memory_bytes",
-    "GPU memory by device",
-    &["gpu_id", "memory_type"],
-    &[]
-)?;
-// Use with specific label values
-requests_by_model.with_label_values(&["llama", "7b"]).inc();
-memory_by_gpu.with_label_values(&["0", "allocated"]).set(8192.0);
-```
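To make the label semantics concrete, here is a minimal standalone sketch of the idea behind a labeled counter: each distinct combination of label values identifies its own series, and the number of values must match the number of declared label names. `ToyCounterVec` is a toy type invented for this illustration. It is not the Dynamo or Prometheus implementation, and the real `with_label_values` may handle a mismatch differently.

```rust
use std::collections::HashMap;

// Toy stand-in for a labeled counter (hypothetical, for illustration only).
struct ToyCounterVec {
    label_names: Vec<String>,
    // One entry per distinct combination of label values.
    series: HashMap<Vec<String>, f64>,
}

impl ToyCounterVec {
    fn new(label_names: &[&str]) -> Self {
        Self {
            label_names: label_names.iter().map(|s| s.to_string()).collect(),
            series: HashMap::new(),
        }
    }

    // Increment the series identified by `values`, creating it on first use.
    fn inc(&mut self, values: &[&str]) -> Result<(), String> {
        if values.len() != self.label_names.len() {
            return Err(format!(
                "expected {} label values, got {}",
                self.label_names.len(),
                values.len()
            ));
        }
        let key: Vec<String> = values.iter().map(|s| s.to_string()).collect();
        *self.series.entry(key).or_insert(0.0) += 1.0;
        Ok(())
    }

    // Current value of one series, or 0.0 if it was never incremented.
    fn get(&self, values: &[&str]) -> f64 {
        let key: Vec<String> = values.iter().map(|s| s.to_string()).collect();
        self.series.get(&key).copied().unwrap_or(0.0)
    }
}

fn main() {
    let mut requests = ToyCounterVec::new(&["model_type", "model_size"]);
    requests.inc(&["llama", "7b"]).unwrap();
    requests.inc(&["llama", "7b"]).unwrap();
    requests.inc(&["llama", "70b"]).unwrap();
    println!("llama/7b = {}", requests.get(&["llama", "7b"]));
    println!("distinct series = {}", requests.series.len());
}
```

Because every new combination of values creates another series, labels with unbounded values (request IDs, for instance) are usually a poor fit for this kind of metric.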
-### Advanced Features
-**Custom histogram buckets:**
-```rust
-let latency = endpoint.metrics().create_histogram(
-    "latency_seconds",
-    "Request latency",
-    &[],
-    Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
-)?;
-```
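As a rough illustration of what the bucket boundaries mean, the standalone sketch below counts observations against the same upper bounds and reports cumulative counts, which is how Prometheus-style histograms expose their `le` buckets. `ToyHistogram` is a toy type made up for this example, not the Dynamo or Prometheus implementation.

```rust
// Toy histogram (hypothetical, for illustration only).
struct ToyHistogram {
    upper_bounds: Vec<f64>, // sorted ascending; the implicit +Inf bucket is not listed
    counts: Vec<u64>,       // per-bucket counts; the last slot is the +Inf bucket
    sum: f64,               // running sum of all observed values
}

impl ToyHistogram {
    fn new(upper_bounds: Vec<f64>) -> Self {
        let n = upper_bounds.len();
        Self { upper_bounds, counts: vec![0; n + 1], sum: 0.0 }
    }

    // Record one observation in the first bucket whose bound is >= v.
    fn observe(&mut self, v: f64) {
        let idx = self
            .upper_bounds
            .iter()
            .position(|&bound| v <= bound)
            .unwrap_or(self.upper_bounds.len());
        self.counts[idx] += 1;
        self.sum += v;
    }

    // Cumulative counts: entry i counts all observations <= upper_bounds[i];
    // the final entry is the total number of observations.
    fn cumulative(&self) -> Vec<u64> {
        let mut total = 0;
        self.counts
            .iter()
            .map(|c| {
                total += c;
                total
            })
            .collect()
    }
}

fn main() {
    let mut latency = ToyHistogram::new(vec![0.001, 0.01, 0.1, 1.0, 10.0]);
    latency.observe(0.023); // falls in the 0.1 bucket
    latency.observe(0.5);   // falls in the 1.0 bucket
    latency.observe(42.0);  // exceeds every bound, so only +Inf counts it
    println!("{:?}", latency.cumulative()); // prints [0, 0, 1, 2, 2, 3]
    println!("sum = {}", latency.sum);
}
```

Choosing bounds that bracket the latencies you expect keeps most observations out of the `+Inf` bucket, where they carry no resolution.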
-**Constant labels:**
-```rust
-let counter = endpoint.metrics().create_counter(
-    "requests_total",
-    "Total requests",
-    &[("region", "us-west"), ("env", "prod")]
-)?;
-```
---
-## Related Documentation
- [Metrics Overview](metrics.md)
- [Prometheus and Grafana Setup](prometheus-grafana.md)
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)