fix(recipes): address VDR feedback - fix bugs, improve docs, add READMEs (#5479)

Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: dagil-nvidia <dagil@nvidia.com>

Unverified Commit f366932a authored Jan 22, 2026 by Ben Hamm Committed by GitHub Jan 22, 2026
20 changed files
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -7,18 +7,35 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
 ## Available Recipes
-| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |GAIE integration |
+### Multi-Feature Recipe
-|-------|-----------|------|------|------------|------------------|-------|------------------|
-| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ | ❌ |
+This recipe combines multiple Dynamo performance features (disaggregated serving + KV-aware routing):
+| Model | Framework | Configuration | GPUs | Features |
+|-------|-----------|---------------|------|----------|
+| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — includes benchmark comparison with real-world Mooncake traces |
+### Aggregated & Disaggregated Recipes
+These recipes demonstrate aggregated or disaggregated serving:
+**GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.
+| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes | GAIE |
+|-------|-----------|------|------|------------|------------------|-------|------|
+| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
-| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization | ❌ |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x GPU | ✅ | ✅ | FP8 quantization | ❌ |
 | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation | ❌ |
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x GPU | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x GPU | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest | ❌ |
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 8x H200 | ✅*1 | ❌ | Benchmark recipe pending | ❌ |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅*1 | ❌ | TP=8 per worker, single-node | ❌ |
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅*1 | ❌ | Benchmark recipe pending | ❌ |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅*1 | ❌ | TP=16 per worker, multi-node | ❌ |
-| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ |Multi-node: 8 decode + 1 prefill nodes | ❌ |
+| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
+| **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
 *1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
@@ -188,7 +205,7 @@ First, deploy the Dynamo Graph per instructions above.
 Then follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE.
-Update the containers.epp.image in the deployment file, i.e. llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml. It should match the release tag and be in the format `nvcr.io/nvidia/ai-dynamo/frontend:<my-tag>` i.e. `nvcr.io/nvstaging/ai-dynamo/dynamo-frontend:0.7.0rc2-amd64`
+Update the containers.epp.image in the deployment file, i.e. llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml. It should match the release tag and be in the format `nvcr.io/nvidia/ai-dynamo/frontend:<version>` e.g. `nvcr.io/nvidia/ai-dynamo/frontend:0.8.0`
 The recipe assumes you are using Kubernetes discovery backend and sets the `DYN_DISCOVERY_BACKEND` env variable in the epp deployment. If you want to use etcd enable the lines below and remove the DYN_DISCOVERY_BACKEND env var.
 ```bash
 - name: ETCD_ENDPOINTS

--- a/recipes/deepseek-r1/README.md
+++ b/recipes/deepseek-r1/README.md
+# DeepSeek-R1 Recipes
+Production-ready deployments for **DeepSeek-R1** (671B MoE) across multiple backends and hardware configurations.
+## Available Configurations
+| Configuration | GPUs | Backend | Mode | Description |
+|--------------|------|---------|------|-------------|
+| [**sglang/disagg-8gpu**](sglang/disagg-8gpu/) | 16x H200 | SGLang | Disaggregated WideEP | TP=8 per worker, single-node |
+| [**sglang/disagg-16gpu**](sglang/disagg-16gpu/) | 32x H200 | SGLang | Disaggregated WideEP | TP=16 per worker, multi-node |
+| [**trtllm/disagg/wide_ep/gb200**](trtllm/disagg/wide_ep/gb200/) | 36x GB200 | TensorRT-LLM | Disaggregated WideEP | 8 decode + 1 prefill nodes |
+| [**vllm/disagg**](vllm/disagg/) | 32x H200 | vLLM | Disaggregated DEP16 | Multi-node, data-expert parallel |
+## Prerequisites
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with H200 or GB200 GPUs matching the configuration requirements
+3. **HuggingFace token** with access to DeepSeek models
+4. **High-bandwidth networking** — InfiniBand or RoCE recommended for multi-node deployments
+## Quick Start
+```bash
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
+# Download model (update storageClassName in model-cache.yaml first!)
+# For SGLang deployments:
+kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
+kubectl apply -f model-cache/model-download-sglang.yaml -n ${NAMESPACE}
+# For vLLM/TRT-LLM deployments:
+kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
+kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
+# Wait for download (this is a large model - may take 1+ hours)
+# For SGLang: kubectl wait --for=condition=Complete job/model-download-sglang ...
+# For vLLM/TRT-LLM: kubectl wait --for=condition=Complete job/model-download ...
+kubectl wait --for=condition=Complete job/model-download-sglang -n ${NAMESPACE} --timeout=7200s
+# Deploy (choose one configuration)
+kubectl apply -f sglang/disagg-8gpu/deploy.yaml -n ${NAMESPACE}
+```
+## Test the Deployment
+```bash
+# Port-forward the frontend (service name varies by deployment)
+kubectl port-forward svc/sgl-dsr1-8gpu-frontend 8000:8000 -n ${NAMESPACE}
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }'
+```
+## Model Details
+- **Model**: `deepseek-ai/DeepSeek-R1`
+- **Architecture**: 671B parameter Mixture-of-Experts (MoE)
+- **Active parameters**: ~37B per token
+- **Recommended**: FP8 quantization for production deployments
+## Hardware Requirements
+DeepSeek-R1 is a very large model requiring significant GPU memory:
+| Configuration | Min GPU Memory | Recommended |
+|--------------|----------------|-------------|
+| 16x H200 (SGLang TP=8) | 1.1TB total | H200 SXM (141GB each) |
+| 32x H200 (SGLang TP=16, vLLM) | 2.2TB total | H200 SXM (141GB each) |
+| 36x GB200 (TRT-LLM) | ~2.5TB total | GB200 NVL72 |
+## Notes
+- **Model download time**: DeepSeek-R1 is ~1.3TB; expect 1-2 hours for download
+- **NCCL errors**: Usually indicate OOM. Reduce `--mem-fraction-static` in worker args
+- **Multi-node**: Requires InfiniBand/IBGDA enabled. See [vLLM EP docs](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/)
+- **Storage class**: Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
+## Backend-Specific Notes
+### SGLang
+- Uses WideEP (Wide Expert Parallel) for efficient MoE inference
+- See [sglang/README.md](sglang/README.md) for SGLang-specific configuration
+### TensorRT-LLM
+- Requires FP4 quantized checkpoint
+- GB200-specific optimizations
+### vLLM
+- Uses DEP (Data-Expert Parallel) with hybrid load balancing
+- See [vllm/disagg/README.md](vllm/disagg/README.md) for detailed setup
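
The GPU counts in the configuration table above are consistent with the per-worker tensor-parallel size if each SGLang deployment runs one prefill worker and one decode worker. That is an assumption, since replica counts are not visible in this diff. A minimal sketch of the arithmetic, using hypothetical helper names and the 141 GB per-H200 figure from the Hardware Requirements table:

```bash
# Hypothetical helpers for sanity-checking the sizing tables; not part of the recipe.

# total_gpus WORKERS TP_SIZE -> number of GPUs the deployment needs
total_gpus() {
  echo $(( $1 * $2 ))
}

# aggregate_mem_gb GPU_COUNT PER_GPU_GB -> combined GPU memory in GB
aggregate_mem_gb() {
  echo $(( $1 * $2 ))
}

# Assumption: one prefill worker + one decode worker per SGLang deployment.
total_gpus 2 8            # disagg-8gpu: prints 16
total_gpus 2 16           # disagg-16gpu: prints 32
aggregate_mem_gb 8 141    # one TP=8 worker: prints 1128
aggregate_mem_gb 16 141   # all 16 GPUs: prints 2256
```

Under this reading, the 1.1TB figure in the table matches a single TP=8 worker (8 × 141 GB), while all 16 GPUs together provide about 2.2TB. The README does not say which of the two "Min GPU Memory" refers to.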
--- a/recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml
+++ b/recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml
@@ -21,7 +21,7 @@ spec:
          mountPoint: /opt/model
      extraPodSpec:
        mainContainer:
-          image: my-registry/sglang-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
    decode:
      componentType: worker
      subComponentType: decode
@@ -38,7 +38,7 @@ spec:
        size: 80Gi
      extraPodSpec:
        mainContainer:
-          image: my-registry/sglang-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
          workingDir: /sgl-workspace/dynamo
          command:
            - python3
@@ -83,7 +83,7 @@ spec:
        size: 80Gi
      extraPodSpec:
        mainContainer:
-          image: my-registry/sglang-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
          workingDir: /sgl-workspace/dynamo
          command:
            - python3

--- a/recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml
+++ b/recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml
@@ -21,7 +21,7 @@ spec:
          mountPoint: /opt/model
      extraPodSpec:
        mainContainer:
-          image: my-registry/sglang-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
    decode:
      componentType: worker
      subComponentType: decode
@@ -36,7 +36,7 @@ spec:
        size: 80Gi
      extraPodSpec:
        mainContainer:
-          image: my-registry/sglang-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
          workingDir: /workspace
          command:
            - python3
@@ -78,7 +78,7 @@ spec:
        size: 80Gi
      extraPodSpec:
        mainContainer:
-          image: my-registry/sglang-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
          workingDir: /workspace
          command:
            - python3

--- a/recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
+++ b/recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
@@ -126,7 +126,7 @@ spec:
        tolerations: []
        affinity: {}
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          args:
          - |
            python3 -m dynamo.frontend --http-port 8000
@@ -158,7 +158,7 @@ spec:
        tolerations: []
        affinity: {}
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          workingDir: /workspace/components/backends/trtllm
          # NOTE: If your PVCs (Persistent Volume Claims) are really slow,
          #       you might need to increase 'failureThreshold' below to allow more time for startup
@@ -216,7 +216,7 @@ spec:
        tolerations: []
        affinity: {}
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          workingDir: /workspace/components/backends/trtllm
          # NOTE: If your PVCs (Persistent Volume Claims) are really slow,
          #       you might need to increase 'failureThreshold' below to allow more time for startup

--- a/recipes/deepseek-r1/vllm/disagg/deploy_hopper_16gpu.yaml
+++ b/recipes/deepseek-r1/vllm/disagg/deploy_hopper_16gpu.yaml
@@ -7,6 +7,9 @@ metadata:
  name: vllm-dsr1
 spec:
  backendFramework: vllm
+  envs:
+    - name: HF_HOME
+      value: /model-cache
  pvcs:
    - name: model-cache
      create: false
@@ -23,7 +26,7 @@ spec:
            periodSeconds: 10
            timeoutSeconds: 1800
            failureThreshold: 60
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
    decode:
      componentType: worker
      subComponentType: decode
@@ -49,7 +52,7 @@ spec:
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 600
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/dynamo
          env:
            - name: VLLM_USE_DEEP_GEMM
@@ -70,7 +73,7 @@ spec:
          args:
            - |
              exec python3 -m dynamo.vllm \
-                --model /model-cache/deepseek-r1 \
+                --model deepseek-ai/DeepSeek-R1 \
                --served-model-name deepseek-ai/DeepSeek-R1 \
                --all2all-backend deepep_low_latency \
                --data-parallel-hybrid-lb \
@@ -110,7 +113,7 @@ spec:
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 600
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/dynamo
          env:
            - name: VLLM_USE_DEEP_GEMM
@@ -129,7 +132,7 @@ spec:
          args:
            - |
              exec python3 -m dynamo.vllm \
-                --model /model-cache/deepseek-r1 \
+                --model deepseek-ai/DeepSeek-R1 \
                --is-prefill-worker \
                --served-model-name deepseek-ai/DeepSeek-R1 \
                --all2all-backend deepep_high_throughput \

--- a/recipes/gpt-oss-120b/trtllm/agg/deploy.yaml
+++ b/recipes/gpt-oss-120b/trtllm/agg/deploy.yaml
@@ -45,7 +45,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: my-registry/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
      replicas: 1
    TrtllmWorker:
      componentType: main
@@ -79,7 +79,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: my-registry/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"
@@ -90,7 +90,7 @@ spec:
          - name: ENGINE_ARGS
            value: "/opt/dynamo/configs/config.yaml"
          - name: MODEL_PATH
-            value: "/opt/models/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
+            value: "openai/gpt-oss-120b"
          - name: HF_HOME
            value: /opt/models
          volumeMounts:

--- a/recipes/llama-3-70b/README.md
+++ b/recipes/llama-3-70b/README.md
+# Llama-3.3-70B Recipes
+Production-ready deployments for **Llama-3.3-70B-Instruct** using vLLM with FP8 dynamic quantization.
+## Available Configurations
+| Configuration | GPUs | Mode | Description |
+|--------------|------|------|-------------|
+| [**vllm/agg**](vllm/agg/) | 4x H100/H200 | Aggregated | Single-node, TP4 |
+| [**vllm/disagg-single-node**](vllm/disagg-single-node/) | 8x H100/H200 | Disaggregated | Prefill/decode separation on one node |
+| [**vllm/disagg-multi-node**](vllm/disagg-multi-node/) | 16x H100/H200 | Disaggregated | 2 nodes, 8 GPUs each |
+## Prerequisites
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with H100 or H200 GPUs matching the configuration requirements
+3. **HuggingFace token** with access to Llama models
+## Quick Start
+```bash
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
+# Download model (update storageClassName in model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+# Deploy (choose one configuration)
+kubectl apply -f vllm/agg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f vllm/disagg-single-node/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f vllm/disagg-multi-node/deploy.yaml -n ${NAMESPACE}
+```
+## Test the Deployment
+```bash
+# Port-forward the frontend
+kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
+```
+## Model Details
+- **Model**: `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`
+- **Quantization**: FP8 dynamic (applied at runtime)
+- **Context length**: Default model context
+## Notes
+- Update `storageClassName` in `model-cache/model-cache.yaml` to match your cluster before deploying
+- Model download takes approximately 15-30 minutes depending on network speed
+- For GAIE (Gateway API Inference Extension) integration, see [vllm/agg/gaie/](vllm/agg/gaie/)
--- a/recipes/llama-3-70b/vllm/agg/deploy.yaml
+++ b/recipes/llama-3-70b/vllm/agg/deploy.yaml
@@ -17,7 +17,7 @@ spec:
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      envs:
        - name: HF_HOME
@@ -37,7 +37,7 @@ spec:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
-              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
+              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: HF_HOME
              value: /opt/models
          args:
@@ -45,7 +45,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:
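
The `MODEL_PATH` change in this file replaces a pinned snapshot directory with a Hugging Face repo ID and relies on `HF_HOME` to locate the cached files. The sketch below uses a hypothetical helper name to show how the repo ID maps back to the cache directory seen in the removed line. The `hub/models--<org>--<name>` layout is taken from that removed path.

```bash
# hf_cache_dir HF_HOME REPO_ID -> cache directory for the model repo
# (hypothetical helper for illustration; not part of the recipe)
hf_cache_dir() {
  # Replace every "/" in the repo ID with "--", as in the removed MODEL_PATH.
  repo_dir=$(printf '%s' "$2" | sed 's|/|--|g')
  printf '%s/hub/models--%s\n' "$1" "$repo_dir"
}

hf_cache_dir /opt/models RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
# prints /opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic
```

The removed value also included a `snapshots/<revision>` suffix, so the repo ID form no longer hard-codes a revision hash in the manifest.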

--- a/recipes/llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml
+++ b/recipes/llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml
@@ -38,7 +38,7 @@ spec:
      containers:
        - name: epp
-          image: nvcr.io/nvidia/ai-dynamo/frontend:<my-tag>
+          image: nvcr.io/nvidia/ai-dynamo/frontend:0.8.0
          imagePullPolicy: IfNotPresent
          resources:
            requests:

--- a/recipes/llama-3-70b/vllm/disagg-multi-node/deploy.yaml
+++ b/recipes/llama-3-70b/vllm/disagg-multi-node/deploy.yaml
@@ -17,7 +17,7 @@ spec:
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      envs:
        - name: HF_HOME
@@ -38,15 +38,15 @@ spec:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
-              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
+              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: HF_HOME
              value: /opt/models
          args:
-          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.95 --no-enable-prefix-caching --block-size 128"
+          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:
@@ -69,7 +69,7 @@ spec:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
-              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
+              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: HF_HOME
              value: /opt/models
          args:
@@ -77,7 +77,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:

--- a/recipes/llama-3-70b/vllm/disagg-single-node/deploy.yaml
+++ b/recipes/llama-3-70b/vllm/disagg-single-node/deploy.yaml
@@ -17,7 +17,7 @@ spec:
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      envs:
        - name: HF_HOME
@@ -50,15 +50,15 @@ spec:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
-              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
+              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: HF_HOME
              value: /opt/models
          args:
-          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 2 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.95 --no-enable-prefix-caching --block-size 128"
+          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 2 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      replicas: 2
      resources:
@@ -93,7 +93,7 @@ spec:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
-              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
+              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: HF_HOME
              value: /opt/models
          args:
@@ -101,7 +101,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:
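
The prefill workers above drop `--gpu-memory-utilization` from 0.95 to 0.90. That flag sets the fraction of each GPU's memory that vLLM may use, so lowering it leaves more memory outside vLLM's budget. The commit does not state its motivation. The following is a rough sketch of the headroom arithmetic, with a hypothetical helper name and assuming the 80 GB H100 variant and 141 GB H200:

```bash
# gpu_headroom_gb TOTAL_GB UTILIZATION -> memory left outside vLLM's budget, in GB
# (hypothetical helper for illustration; not part of the recipe)
gpu_headroom_gb() {
  LC_ALL=C awk -v total="$1" -v util="$2" \
    'BEGIN { printf "%.1f\n", total * (1 - util) }'
}

gpu_headroom_gb 80 0.95    # H100 80 GB at the old setting
gpu_headroom_gb 80 0.90    # H100 80 GB at the new setting
gpu_headroom_gb 141 0.90   # H200 141 GB at the new setting
```

On an 80 GB GPU the change doubles the per-GPU headroom, from about 4 GB to about 8 GB.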

--- a/recipes/qwen3-235b-a22b-fp8/README.md
+++ b/recipes/qwen3-235b-a22b-fp8/README.md
+# Qwen3-235B-A22B-FP8 Recipes
+Production-ready deployments for **Qwen3-235B-A22B** (MoE model with 22B active parameters) using TensorRT-LLM.
+## Available Configurations
+| Configuration | GPUs | Mode | Description |
+|--------------|------|------|-------------|
+| [**trtllm/agg**](trtllm/agg/) | 16x GPU | Aggregated | TP4, EP4, KV-aware routing |
+| [**trtllm/disagg**](trtllm/disagg/) | 16x GPU | Disaggregated | Prefill/decode separation |
+## Prerequisites
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with H100/H200 GPUs (high memory recommended)
+3. **HuggingFace token** with access to Qwen models
+## Quick Start
+```bash
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
+# Download model (update storageClassName in model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+# Deploy (choose one configuration)
+kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
+```
+## Test the Deployment
+```bash
+# Port-forward the frontend
+kubectl port-forward svc/qwen3-235b-a22b-agg-frontend 8000:8000 -n ${NAMESPACE}
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-235B-A22B-FP8",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
+```
+## Model Details
+- **Model**: `Qwen/Qwen3-235B-A22B-FP8`
+- **Architecture**: 235B parameter Mixture-of-Experts (MoE)
+- **Active parameters**: ~22B per token
+- **Backend**: TensorRT-LLM (PyTorch backend)
+- **Parallelism**: TP4 × EP4 (Expert Parallel)
+## Hardware Requirements
+This is a large MoE model requiring significant GPU resources:
+| Configuration | GPUs | Min GPU VRAM (Total) |
+|--------------|------|----------------------|
+| Aggregated | 16x H100/H200 | ~1.3TB |
+| Disaggregated | 16x H100/H200 | ~1.3TB |
+## Notes
+- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
+- Model download may take 30-60 minutes
+- Uses KV-aware routing for efficient cache utilization
+- Chunked prefill enabled for aggregated mode (disabled for disaggregated)
--- a/recipes/qwen3-235b-a22b-fp8/trtllm/agg/deploy.yaml
+++ b/recipes/qwen3-235b-a22b-fp8/trtllm/agg/deploy.yaml
@@ -54,7 +54,7 @@ spec:
                    - qwen3-235b-a22b-agg-frontend
              topologyKey: kubernetes.io/hostname
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
@@ -78,7 +78,9 @@ spec:
        mainContainer:
          env:
            - name: MODEL_PATH
-              value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-FP8/snapshots/39eb2b067ea6b8e3e1dd97d3cd0c7ffeaf3e1a35
+              value: Qwen/Qwen3-235B-A22B-FP8
+            - name: HF_HOME
+              value: /mnt/model-cache
            - name: ENGINE_ARGS
              value: /engine_configs/agg.yaml
          command:
@@ -90,7 +92,7 @@ spec:
              --model-path "${MODEL_PATH}" \
              --served-model-name "Qwen/Qwen3-235B-A22B-FP8" \
              --extra-engine-args "${ENGINE_ARGS}"
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          workingDir: /workspace/components/backends/trtllm
          volumeMounts:
            - name: agg-config

--- a/recipes/qwen3-235b-a22b-fp8/trtllm/disagg/deploy.yaml
+++ b/recipes/qwen3-235b-a22b-fp8/trtllm/disagg/deploy.yaml
@@ -83,7 +83,7 @@ spec:
                    - qwen3-235b-a22b-disagg-frontend
              topologyKey: kubernetes.io/hostname
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
@@ -112,10 +112,12 @@ spec:
        mainContainer:
          env:
            - name: MODEL_PATH
-              value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-FP8/snapshots/39eb2b067ea6b8e3e1dd97d3cd0c7ffeaf3e1a35
+              value: Qwen/Qwen3-235B-A22B-FP8
+            - name: HF_HOME
+              value: /mnt/model-cache
            - name: ENGINE_ARGS
              value: /engine_configs/prefill.yaml
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          workingDir: /workspace/components/backends/trtllm
          command:
            - /bin/sh
@@ -163,10 +165,12 @@ spec:
        mainContainer:
          env:
            - name: MODEL_PATH
-              value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-FP8/snapshots/39eb2b067ea6b8e3e1dd97d3cd0c7ffeaf3e1a35
+              value: Qwen/Qwen3-235B-A22B-FP8
+            - name: HF_HOME
+              value: /mnt/model-cache
            - name: ENGINE_ARGS
              value: /engine_configs/decode.yaml
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          workingDir: /workspace/components/backends/trtllm
          command:
            - /bin/sh

--- a/recipes/qwen3-32b-fp8/README.md
+++ b/recipes/qwen3-32b-fp8/README.md
+# Qwen3-32B-FP8 Recipes
+Production-ready deployments for **Qwen3-32B** with FP8 quantization using TensorRT-LLM.
+## Available Configurations
+| Configuration | GPUs | Mode | Description |
+|--------------|------|------|-------------|
+| [**trtllm/agg**](trtllm/agg/) | 2x GPU | Aggregated | TP2, round-robin routing |
+| [**trtllm/disagg**](trtllm/disagg/) | 8x GPU | Disaggregated | Prefill/decode separation |
+## Prerequisites
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with H100/H200/A100 GPUs
+3. **HuggingFace token** with access to Qwen models
+## Quick Start
+```bash
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
+# Download model (update storageClassName in model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=1800s
+# Deploy (choose one configuration)
+kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
+```
+## Test the Deployment
+```bash
+# Port-forward the frontend (this command blocks; run it in a separate terminal)
+kubectl port-forward svc/qwen3-32b-fp8-agg-frontend 8000:8000 -n ${NAMESPACE}
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-32B-FP8",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
+```
+## Model Details
+- **Model**: `Qwen/Qwen3-32B-FP8`
+- **Backend**: TensorRT-LLM (PyTorch backend)
+- **Quantization**: FP8
+- **Tensor Parallel**: 2 (aggregated configuration)
+## Notes
+- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
+- The aggregated config uses CUDA graphs to reduce kernel-launch overhead during inference
+- The KV cache is stored in FP8 to reduce memory use
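
The "Test the Deployment" steps in the README above need two terminals, because `kubectl port-forward` stays in the foreground. The following is a minimal sketch, not part of this commit, of how the same check could be scripted in one shell. The function names `chat_payload` and `smoke_test` are hypothetical, as is the 5-second wait; the service name, port, model, and request body are taken from the README.

```shell
#!/usr/bin/env bash
# Sketch only: helper functions for smoke-testing the aggregated deployment.
# chat_payload and smoke_test are hypothetical names, not part of the recipe.

# Build the chat-completions request body for a model, prompt, and token limit.
# Note: the prompt is inserted as-is, so it must not contain quotes or backslashes.
chat_payload() {
  local model="$1" prompt="$2" max_tokens="$3"
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %s}' \
    "${model}" "${prompt}" "${max_tokens}"
}

# Port-forward in the background, send one request, then stop the port-forward.
smoke_test() {
  local namespace="$1"
  kubectl port-forward svc/qwen3-32b-fp8-agg-frontend 8000:8000 -n "${namespace}" &
  local pf_pid=$!
  sleep 5  # assumption: a few seconds is enough for the tunnel to come up
  curl --fail --silent --show-error http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "Qwen/Qwen3-32B-FP8" "Hello!" 50)"
  local status=$?
  kill "${pf_pid}"
  return "${status}"
}
```

With the functions defined, `smoke_test "${NAMESPACE}"` would run the whole check and return curl's exit status, which makes it usable from a CI step or a post-deploy script.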
--- a/recipes/qwen3-32b-fp8/trtllm/agg/deploy.yaml
+++ b/recipes/qwen3-32b-fp8/trtllm/agg/deploy.yaml
@@ -61,7 +61,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
      replicas: 1
    TrtllmWorker:
      componentType: main
@@ -94,7 +94,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"

--- a/recipes/qwen3-32b-fp8/trtllm/disagg/deploy.yaml
+++ b/recipes/qwen3-32b-fp8/trtllm/disagg/deploy.yaml
@@ -218,7 +218,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
      replicas: 1
    TrtllmPrefillWorker:
      componentType: worker
@@ -253,7 +253,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"
@@ -313,7 +313,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
          env:
          - name: TRTLLM_ENABLE_PDL
            value: "1"

--- a/recipes/qwen3-32b/vllm/agg-round-robin/deploy.yaml
+++ b/recipes/qwen3-32b/vllm/agg-round-robin/deploy.yaml
@@ -18,7 +18,7 @@ spec:
          value: /home/dynamo/.cache/huggingface
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          command:
            - python3
@@ -63,7 +63,7 @@ spec:
          - python3
          - -m
          - dynamo.vllm
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          env:
          - name: DYN_HEALTH_CHECK_ENABLED
            value: "false"

--- a/recipes/qwen3-32b/vllm/disagg-kv-router/deploy.yaml
+++ b/recipes/qwen3-32b/vllm/disagg-kv-router/deploy.yaml
@@ -26,7 +26,7 @@ spec:
            - python
            - -m
            - dynamo.frontend
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
      replicas: 1
      resources:
@@ -60,7 +60,7 @@ spec:
          - python3
          - -m
          - dynamo.vllm
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          env:
          - name: DYN_HEALTH_CHECK_ENABLED
@@ -112,7 +112,7 @@ spec:
          - python3
          - -m
          - dynamo.vllm
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          env:
          - name: DYN_HEALTH_CHECK_ENABLED
            value: "false"