docs: fix recipes landing page — add missing models, specify GPU types, fix errors (#7246)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Unverified commit 91b9b148, authored Mar 11, 2026 by Ben Hamm, committed by GitHub Mar 11, 2026
Showing 3 changed files with 63 additions and 61 deletions

recipes/README.md +22 -27

recipes/gpt-oss-120b/README.md +40 -33

recipes/qwen3-32b/README.md +1 -1

--- a/recipes/README.md
+++ b/recipes/README.md
@@ -12,13 +12,15 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D

 ## Available Recipes

-### Multi-Feature Recipe
+### Feature Comparison Recipes

-This recipe combines multiple Dynamo performance features (disaggregated serving + KV-aware routing):
+These recipes compare Dynamo performance features with benchmark results, each including both baseline and optimized deployment configurations:

 | Model | Framework | Configuration | GPUs | Features |
 |-------|-----------|---------------|------|----------|
-| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — includes benchmark comparison with real-world Mooncake traces |
+| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with real-world Mooncake traces |
+| **[DeepSeek-V3.2-NVFP4](deepseek-v32-fp4/)** | TensorRT-LLM | Agg + Disagg WideEP | 32x GB200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with Mooncake-based synthetic coding trace |
+| **[Qwen3-VL-30B-A3B-FP8](qwen3-vl-30b/)** | vLLM | Agg + Embedding Cache | 1x GB200 | **Multimodal Embedding Cache** — benchmark comparison showing +16% throughput, -28% TTFT |

 ### Aggregated & Disaggregated Recipes

@@ -26,28 +28,29 @@ These recipes demonstrate aggregated or disaggregated serving:

 **GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.

-| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes | GAIE |
-|-------|-----------|------|------|------------|------------------|-------|------|
+| Model | Framework | Mode | GPUs | Deployment | Benchmark | Notes | GAIE |
+|-------|-----------|------|------|------------|-----------|-------|------|
 | **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
-| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x GPU | ✅ | ✅ | FP8 quantization | ❌ |
-| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation | ❌ |
-| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x GPU | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
-| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x GPU | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 | ✅ | ✅ | FP8 quantization | ❌ |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 | ✅ | ✅ | Prefill + Decode separation | ❌ |
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
-| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest | ❌ |
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅*1 | ❌ | TP=8 per worker, single-node | ❌ |
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅*1 | ❌ | TP=16 per worker, multi-node | ❌ |
-| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
-| **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
-| **[Kimi-K2.5](kimi-k2.5/trtllm/agg/)** | TensorRT-LLM | Aggregated | 8x GPU | ✅ | ❌ | MoE model, TP8×EP8, reasoning + tool calling | ❌ |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | TP=8, single-node. Use `model-download-sglang.yaml` | ❌ |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
+| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
+| **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
+| **[Kimi-K2.5](kimi-k2.5/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | ❌ | MoE model, TP8×EP8, reasoning + tool calling | ❌ |

-*1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
+**Legend:**
+- **Deployment**: ✅ = Complete `deploy.yaml` manifest available
+- **Benchmark**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks

-### Non-Optimized Recipes
+### Functional Recipes (Not Yet Benchmarked)

-These recipes demonstrate functional deployments with Dynamo features, but have not yet been tuned for best performance or paired with benchmark manifests.
+These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.

 | Model | Framework | Mode | GPUs | Deployment | Notes |
 |-------|-----------|-------|------|------------|-------|
@@ -56,10 +59,6 @@ These recipes demonstrate functional deployments with Dynamo features, but have
 | **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
 | **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, 1.0+ |

-**Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
- **Benchmark Recipe**: In the production-ready table above, ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
-
 ## Recipe Structure

 Each complete recipe follows this standard structure:
@@ -130,9 +129,6 @@ cd recipes
 # Update storageClassName in model-cache.yaml first!
 kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

-# Create model cache PVC
-kubectl apply -f <model>/model-cache/model-download.yaml -n ${NAMESPACE}
-
 # Wait for download to complete (may take 10-60 minutes depending on model size)
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s

@@ -206,7 +202,6 @@ kubectl create secret generic hf-token-secret \
 # Deploy
 cd recipes
 kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
-kubectl apply -f llama-3-70b/model-cache/model-download.yaml -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
 kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

@@ -214,7 +209,7 @@ kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
 kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
 ```

-### Inference Gateway (GAIE) Integration (Optional)**
+### Inference Gateway (GAIE) Integration (Optional)

 For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.


--- a/recipes/gpt-oss-120b/README.md
+++ b/recipes/gpt-oss-120b/README.md
-# GPT-OSS-120B Recipe Guide
+# GPT-OSS-120B Recipes

-This guide will help you run the GPT-OSS-120B language model using Dynamo's optimized setup.
+Production-ready deployment for **GPT-OSS-120B** using TensorRT-LLM on Blackwell (GB200) hardware.
+
+## Available Configurations
+
+| Configuration | GPUs | Mode | Description |
+|--------------|------|------|-------------|
+| [**trtllm/agg**](trtllm/agg/) | 4x GB200 | Aggregated | WideEP, ARM64 |
+
+> **Note:** A [disaggregated configuration](trtllm/disagg/) exists with engine configs but is not yet production-ready. See [trtllm/disagg/README.md](trtllm/disagg/README.md) for details.

 ## Prerequisites

-Follow the instructions in recipe [README.md](../README.md) to create a namespace and kubernetes secret for huggingface token.
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with GB200 (Blackwell) GPUs
+3. **HuggingFace token** with access to the model

 ## Quick Start

-To run the model, simply execute this command in your terminal:
-
 ```bash
-cd recipe
-./run.sh --model gpt-oss-120b --framework trtllm agg
-```
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}

-## (Alternative) Step by Step Guide
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}

-### 1. Download the Model
+# Download model (update storageClassName in model-cache/model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

-```bash
-cd recipes/gpt-oss-120b
-kubectl apply -n $NAMESPACE -f ./model-cache
+# Deploy
+kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
 ```

-### 2. Deploy and Benchmark the Model
+## Test the Deployment

 ```bash
-cd recipes/gpt-oss-120b
-kubectl apply -n $NAMESPACE -f ./trtllm/agg
-```
-
-### Container Image
-This recipe was tested with dynamo trtllm runtime container for ARM64 processors.
-
-**Important Note:**
-
-Before dynamo v0.5.1 release, following container image is supported:
-```
-nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3
-```
-
-After dynamo v0.5.1 release, following container image will be supported:
-```
-nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
+# Port-forward the frontend
+kubectl port-forward svc/gpt-oss-agg-frontend 8000:8000 -n ${NAMESPACE}
+
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-oss-120b",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
 ```

 ## Notes
-1. The benchmark container image uses a specific commit of aiperf to ensure reproducible results and compatibility with the benchmarking setup.

-2. storage class is not specified in the recipe, you need to specify it in the `deploy.yaml` file.
\ No newline at end of file
+- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
+- This recipe requires ARM64 (GB200) nodes — it will not run on x86 Hopper/Ampere hardware
+- Update the container image tag in `deploy.yaml` to match your Dynamo release version
--- a/recipes/qwen3-32b/README.md
+++ b/recipes/qwen3-32b/README.md
@@ -147,7 +147,7 @@ kubectl delete pod -l app=benchmark -n ${NAMESPACE}

 # Delete deployments
 kubectl delete dynamographdeployment agg-8xtp2 -n ${NAMESPACE}
-kubectl delete dynamographdeployment disagg-router-6p-2d-n ${NAMESPACE}
+kubectl delete dynamographdeployment disagg-router-6p-2d -n ${NAMESPACE}
 ```

 ## References
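
Review note on the one-character fix in `recipes/qwen3-32b/README.md`: with the space missing, the shell passes `disagg-router-6p-2d-n` as a single token, so no `-n` flag is present and the namespace value becomes a stray positional argument. The sketch below only illustrates the tokenization. `show_args` is a hypothetical helper that stands in for `kubectl`, and `dynamo-demo` is an example namespace value, so nothing here touches a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each argument on its own line,
# i.e. the tokens a command such as kubectl would receive.
show_args() {
  printf '%s\n' "$@"
}

NAMESPACE=dynamo-demo  # example value only

# Before the fix: "-n" is glued to the resource name (4 tokens, no flag).
echo "before:"
show_args delete dynamographdeployment disagg-router-6p-2d-n "${NAMESPACE}"

# After the fix: "-n" is its own token, followed by the namespace (5 tokens).
echo "after:"
show_args delete dynamographdeployment disagg-router-6p-2d -n "${NAMESPACE}"
```

In the "before" case, `kubectl delete` would most likely look up two resource names (`disagg-router-6p-2d-n` and the namespace string) in the current context's default namespace, so the intended deployment would be left running.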