"...git@developer.sourcefind.cn:2222/OpenDAS/vllm_cscc.git" did not exist on "5bcc153d7bf69ef34bc5788a33f60f1792cf2861"
Commit 91b9b148 authored by Ben Hamm, committed by GitHub

docs: fix recipes landing page — add missing models, specify GPU types, fix errors (#7246)


Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
parent 5178a4a4
...@@ -12,13 +12,15 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D ...@@ -12,13 +12,15 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
## Available Recipes ## Available Recipes
### Multi-Feature Recipe ### Feature Comparison Recipes
This recipe combines multiple Dynamo performance features (disaggregated serving + KV-aware routing): These recipes compare Dynamo performance features with benchmark results, each including both baseline and optimized deployment configurations:
| Model | Framework | Configuration | GPUs | Features | | Model | Framework | Configuration | GPUs | Features |
|-------|-----------|---------------|------|----------| |-------|-----------|---------------|------|----------|
| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — includes benchmark comparison with real-world Mooncake traces | | **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with real-world Mooncake traces |
| **[DeepSeek-V3.2-NVFP4](deepseek-v32-fp4/)** | TensorRT-LLM | Agg + Disagg WideEP | 32x GB200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with Mooncake-based synthetic coding trace |
| **[Qwen3-VL-30B-A3B-FP8](qwen3-vl-30b/)** | vLLM | Agg + Embedding Cache | 1x GB200 | **Multimodal Embedding Cache** — benchmark comparison showing +16% throughput, -28% TTFT |
### Aggregated & Disaggregated Recipes ### Aggregated & Disaggregated Recipes
...@@ -26,28 +28,29 @@ These recipes demonstrate aggregated or disaggregated serving: ...@@ -26,28 +28,29 @@ These recipes demonstrate aggregated or disaggregated serving:
**GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management. **GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.
| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes | GAIE | | Model | Framework | Mode | GPUs | Deployment | Benchmark | Notes | GAIE |
|-------|-----------|------|------|------------|------------------|-------|------| |-------|-----------|------|------|------------|-----------|-------|------|
| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ | | **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ | | **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ | | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x GPU | ✅ | ✅ | FP8 quantization | ❌ | | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 | ✅ | ✅ | FP8 quantization | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation | ❌ | | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x GPU | ✅ | ✅ | MoE model, TP4×EP4 | ❌ | | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x GPU | ✅ | ✅ | MoE model, Prefill + Decode | ❌ | | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ | | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest | ❌ | | **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | TP=8, single-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅*1 | ❌ | TP=8 per worker, single-node | ❌ | | **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅*1 | ❌ | TP=16 per worker, multi-node | ❌ | | **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ | | **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
| **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ | | **[Kimi-K2.5](kimi-k2.5/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | ❌ | MoE model, TP8×EP8, reasoning + tool calling | ❌ |
| **[Kimi-K2.5](kimi-k2.5/trtllm/agg/)** | TensorRT-LLM | Aggregated | 8x GPU | ✅ | ❌ | MoE model, TP8×EP8, reasoning + tool calling | ❌ |
*1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC. **Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available
- **Benchmark**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks
### Non-Optimized Recipes ### Functional Recipes (Not Yet Benchmarked)
These recipes demonstrate functional deployments with Dynamo features, but have not yet been tuned for best performance or paired with benchmark manifests. These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.
| Model | Framework | Mode | GPUs | Deployment | Notes | | Model | Framework | Mode | GPUs | Deployment | Notes |
|-------|-----------|-------|------|------------|-------| |-------|-----------|-------|------|------------|-------|
...@@ -56,10 +59,6 @@ These recipes demonstrate functional deployments with Dynamo features, but have ...@@ -56,10 +59,6 @@ These recipes demonstrate functional deployments with Dynamo features, but have
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer | | **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, 1.0+ | | **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, 1.0+ |
**Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
- **Benchmark Recipe**: In the production-ready table above, ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
## Recipe Structure ## Recipe Structure
Each complete recipe follows this standard structure: Each complete recipe follows this standard structure:
...@@ -130,9 +129,6 @@ cd recipes ...@@ -130,9 +129,6 @@ cd recipes
# Update storageClassName in model-cache.yaml first! # Update storageClassName in model-cache.yaml first!
kubectl apply -f <model>/model-cache/ -n ${NAMESPACE} kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}
# Create model cache PVC
kubectl apply -f <model>/model-cache/model-download.yaml -n ${NAMESPACE}
# Wait for download to complete (may take 10-60 minutes depending on model size) # Wait for download to complete (may take 10-60 minutes depending on model size)
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
...@@ -206,7 +202,6 @@ kubectl create secret generic hf-token-secret \ ...@@ -206,7 +202,6 @@ kubectl create secret generic hf-token-secret \
# Deploy # Deploy
cd recipes cd recipes
kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE} kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl apply -f llama-3-70b/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE} kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
...@@ -214,7 +209,7 @@ kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE} ...@@ -214,7 +209,7 @@ kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE} kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
``` ```
### Inference Gateway (GAIE) Integration (Optional)** ### Inference Gateway (GAIE) Integration (Optional)
For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided. For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.
......
# GPT-OSS-120B Recipe Guide # GPT-OSS-120B Recipes
This guide will help you run the GPT-OSS-120B language model using Dynamo's optimized setup. Production-ready deployment for **GPT-OSS-120B** using TensorRT-LLM on Blackwell (GB200) hardware.
## Available Configurations
| Configuration | GPUs | Mode | Description |
|--------------|------|------|-------------|
| [**trtllm/agg**](trtllm/agg/) | 4x GB200 | Aggregated | WideEP, ARM64 |
> **Note:** A [disaggregated configuration](trtllm/disagg/) exists with engine configs but is not yet production-ready. See [trtllm/disagg/README.md](trtllm/disagg/README.md) for details.
## Prerequisites ## Prerequisites
Follow the instructions in recipe [README.md](../README.md) to create a namespace and kubernetes secret for huggingface token. 1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with GB200 (Blackwell) GPUs
3. **HuggingFace token** with access to the model
## Quick Start ## Quick Start
To run the model, simply execute this command in your terminal:
```bash ```bash
cd recipe # Set namespace
./run.sh --model gpt-oss-120b --framework trtllm agg export NAMESPACE=dynamo-demo
``` kubectl create namespace ${NAMESPACE}
## (Alternative) Step by Step Guide # Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
### 1. Download the Model # Download model (update storageClassName in model-cache/model-cache.yaml first!)
kubectl apply -f model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
```bash # Deploy
cd recipes/gpt-oss-120b kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
kubectl apply -n $NAMESPACE -f ./model-cache
``` ```
### 2. Deploy and Benchmark the Model ## Test the Deployment
```bash ```bash
cd recipes/gpt-oss-120b # Port-forward the frontend
kubectl apply -n $NAMESPACE -f ./trtllm/agg kubectl port-forward svc/gpt-oss-agg-frontend 8000:8000 -n ${NAMESPACE}
```
# Send a test request
### Container Image curl http://localhost:8000/v1/chat/completions \
This recipe was tested with dynamo trtllm runtime container for ARM64 processors. -H "Content-Type: application/json" \
-d '{
**Important Note:** "model": "openai/gpt-oss-120b",
"messages": [{"role": "user", "content": "Hello!"}],
Before dynamo v0.5.1 release, following container image is supported: "max_tokens": 50
``` }'
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3
```
After dynamo v0.5.1 release, following container image will be supported:
```
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
``` ```
## Notes ## Notes
1. The benchmark container image uses a specific commit of aiperf to ensure reproducible results and compatibility with the benchmarking setup.
2. storage class is not specified in the recipe, you need to specify it in the `deploy.yaml` file. - Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
\ No newline at end of file - This recipe requires ARM64 (GB200) nodes — it will not run on x86 Hopper/Ampere hardware
- Update the container image tag in `deploy.yaml` to match your Dynamo release version
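
The ARM64 requirement in the notes above can be checked before applying the recipe. The following is a minimal sketch, not part of the recipe itself: `count_arm64_nodes` is a hypothetical helper name, and the sketch assumes node architectures are read from the standard `.status.nodeInfo.architecture` field of each Kubernetes node. The live-cluster command is left as a comment so the block runs without cluster access.

```bash
#!/usr/bin/env bash
# Sketch: count ARM64 nodes before deploying the GB200 recipe.
# count_arm64_nodes is a hypothetical helper, not shipped with the recipe.

# Reads one architecture string per line on stdin and prints the number
# of lines that are exactly "arm64". grep -c exits non-zero when the count
# is 0, so "|| true" keeps the helper usable under "set -e".
count_arm64_nodes() {
  grep -c '^arm64$' || true
}

# Against a live cluster (requires kubectl access to the cluster):
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.nodeInfo.architecture}{"\n"}{end}' \
#     | count_arm64_nodes

# Offline demonstration with sample input (one amd64 node, two arm64 nodes):
printf 'amd64\narm64\narm64\n' | count_arm64_nodes
```

A result of `0` from the live-cluster command would suggest that the cluster has no ARM64 nodes, in which case the GB200 deployment is unlikely to schedule.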
...@@ -147,7 +147,7 @@ kubectl delete pod -l app=benchmark -n ${NAMESPACE} ...@@ -147,7 +147,7 @@ kubectl delete pod -l app=benchmark -n ${NAMESPACE}
# Delete deployments # Delete deployments
kubectl delete dynamographdeployment agg-8xtp2 -n ${NAMESPACE} kubectl delete dynamographdeployment agg-8xtp2 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-router-6p-2d-n ${NAMESPACE} kubectl delete dynamographdeployment disagg-router-6p-2d -n ${NAMESPACE}
``` ```
## References ## References
......