Unverified commit f366932a, authored by Ben Hamm, committed by GitHub

fix(recipes): address VDR feedback - fix bugs, improve docs, add READMEs (#5479)


Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: dagil-nvidia <dagil@nvidia.com>
parent 53a609e5
...@@ -7,18 +7,35 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D ...@@ -7,18 +7,35 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
## Available Recipes ## Available Recipes
| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |GAIE integration | ### Multi-Feature Recipe
|-------|-----------|------|------|------------|------------------|-------|------------------|
| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ | ❌ | This recipe combines multiple Dynamo performance features (disaggregated serving + KV-aware routing):
| Model | Framework | Configuration | GPUs | Features |
|-------|-----------|---------------|------|----------|
| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — includes benchmark comparison with real-world Mooncake traces |
### Aggregated & Disaggregated Recipes
These recipes demonstrate aggregated or disaggregated serving:
**GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.
| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes | GAIE |
|-------|-----------|------|------|------------|------------------|-------|------|
| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ | | **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ | | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization | ❌ | | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x GPU | ✅ | ✅ | FP8 quantization | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation | ❌ | | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x GPU | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x GPU | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ | | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest | ❌ | | **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 8x H200 | ✅*1 | ❌ | Benchmark recipe pending | ❌ | | **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅*1 | ❌ | TP=8 per worker, single-node | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅*1 | ❌ | Benchmark recipe pending | ❌ | | **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅*1 | ❌ | TP=16 per worker, multi-node | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ |Multi-node: 8 decode + 1 prefill nodes | ❌ | | **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
| **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
*1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC. *1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
...@@ -188,7 +205,7 @@ First, deploy the Dynamo Graph per instructions above. ...@@ -188,7 +205,7 @@ First, deploy the Dynamo Graph per instructions above.
Then follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE. Then follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE.
Update the containers.epp.image in the deployment file, i.e. llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml. It should match the release tag and be in the format `nvcr.io/nvidia/ai-dynamo/frontend:<my-tag>` i.e. `nvcr.io/nvstaging/ai-dynamo/dynamo-frontend:0.7.0rc2-amd64` Update the containers.epp.image in the deployment file, i.e. llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml. It should match the release tag and be in the format `nvcr.io/nvidia/ai-dynamo/frontend:<version>` e.g. `nvcr.io/nvidia/ai-dynamo/frontend:0.8.0`
The recipe assumes you are using Kubernetes discovery backend and sets the `DYN_DISCOVERY_BACKEND` env variable in the epp deployment. If you want to use etcd enable the lines below and remove the DYN_DISCOVERY_BACKEND env var. The recipe assumes you are using Kubernetes discovery backend and sets the `DYN_DISCOVERY_BACKEND` env variable in the epp deployment. If you want to use etcd enable the lines below and remove the DYN_DISCOVERY_BACKEND env var.
```bash ```bash
- name: ETCD_ENDPOINTS - name: ETCD_ENDPOINTS
......
# DeepSeek-R1 Recipes
Production-ready deployments for **DeepSeek-R1** (671B MoE) across multiple backends and hardware configurations.
## Available Configurations
| Configuration | GPUs | Backend | Mode | Description |
|--------------|------|---------|------|-------------|
| [**sglang/disagg-8gpu**](sglang/disagg-8gpu/) | 16x H200 | SGLang | Disaggregated WideEP | TP=8 per worker, single-node |
| [**sglang/disagg-16gpu**](sglang/disagg-16gpu/) | 32x H200 | SGLang | Disaggregated WideEP | TP=16 per worker, multi-node |
| [**trtllm/disagg/wide_ep/gb200**](trtllm/disagg/wide_ep/gb200/) | 36x GB200 | TensorRT-LLM | Disaggregated WideEP | 8 decode + 1 prefill nodes |
| [**vllm/disagg**](vllm/disagg/) | 32x H200 | vLLM | Disaggregated DEP16 | Multi-node, data-expert parallel |
## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with H200 or GB200 GPUs matching the configuration requirements
3. **HuggingFace token** with access to DeepSeek models
4. **High-bandwidth networking** — InfiniBand or RoCE recommended for multi-node deployments
## Quick Start
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Apply the model cache manifest (update storageClassName in model-cache.yaml first!)
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
# Download the model: run ONE of the two download jobs below
# For SGLang deployments:
kubectl apply -f model-cache/model-download-sglang.yaml -n ${NAMESPACE}
# For vLLM/TRT-LLM deployments (instead of the SGLang job):
# kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
# Wait for the download to finish (this is a large model; it may take 1+ hours)
# For SGLang:
kubectl wait --for=condition=Complete job/model-download-sglang -n ${NAMESPACE} --timeout=7200s
# For vLLM/TRT-LLM:
# kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s
# Deploy (choose one configuration)
kubectl apply -f sglang/disagg-8gpu/deploy.yaml -n ${NAMESPACE}
```
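The download job name differs by backend, as the comments in the block above note. A minimal sketch of a helper that picks the job name and waits on it could look like the following. The function names `download_job_for_backend` and `wait_for_model_download` are hypothetical and not part of the recipe; the job names are taken from the Quick Start comments.

```bash
# Hypothetical helper: map a backend name to the download Job name
# mentioned in the Quick Start comments.
download_job_for_backend() {
  case "$1" in
    sglang)      echo "model-download-sglang" ;;
    vllm|trtllm) echo "model-download" ;;
    *)           echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: wait for the matching Job to complete.
# Usage: wait_for_model_download <backend> [timeout]
wait_for_model_download() {
  local job
  job="$(download_job_for_backend "$1")" || return 1
  kubectl wait --for=condition=Complete "job/${job}" \
    -n "${NAMESPACE}" --timeout="${2:-7200s}"
}
```

For example, `wait_for_model_download vllm` would wait on `job/model-download` with the default two-hour timeout.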
## Test the Deployment
```bash
# Port-forward the frontend (service name varies by deployment)
kubectl port-forward svc/sgl-dsr1-8gpu-frontend 8000:8000 -n ${NAMESPACE}
# Send a test request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```
## Model Details
- **Model**: `deepseek-ai/DeepSeek-R1`
- **Architecture**: 671B parameter Mixture-of-Experts (MoE)
- **Active parameters**: ~37B per token
- **Recommended**: FP8 quantization for production deployments
## Hardware Requirements
DeepSeek-R1 is a very large model requiring significant GPU memory:
| Configuration | Min GPU Memory | Recommended |
|--------------|----------------|-------------|
| 16x H200 (SGLang TP=8) | 1.1TB total | H200 SXM (141GB each) |
| 32x H200 (SGLang TP=16, vLLM) | 2.2TB total | H200 SXM (141GB each) |
| 36x GB200 (TRT-LLM) | ~2.5TB total | GB200 NVL72 |
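As a rough sanity check, the raw HBM capacity of a configuration is the GPU count multiplied by the per-GPU memory. The sketch below uses the 141GB-per-H200 figure from the table; `total_hbm_gb` is a hypothetical helper name. The result is raw capacity only, not the usable budget after KV cache, activations, and framework overhead, so the minimums in the table should fit within it with room to spare.

```bash
# Hypothetical helper: raw aggregate GPU memory in GB.
# Usage: total_hbm_gb <gpu-count> <gb-per-gpu>
total_hbm_gb() {
  local gpus="$1" per_gpu_gb="$2"
  echo $(( gpus * per_gpu_gb ))
}

total_hbm_gb 16 141   # 16x H200, prints 2256
total_hbm_gb 32 141   # 32x H200, prints 4512
```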
## Notes
- **Model download time**: DeepSeek-R1 is ~1.3TB; expect 1-2 hours for download
- **NCCL errors**: Often a symptom of GPU out-of-memory (OOM). Try reducing `--mem-fraction-static` in the SGLang worker args
- **Multi-node**: Requires InfiniBand/IBGDA enabled. See [vLLM EP docs](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/)
- **Storage class**: Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
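Updating the storage class can be scripted. The sketch below assumes that `model-cache/model-cache.yaml` contains a single-line `storageClassName: <value>` entry (the file itself is not shown here); `set_storage_class` is a hypothetical helper, not part of the recipe.

```bash
# List the storage classes available in the cluster first:
#   kubectl get storageclass

# Hypothetical helper: rewrite the storageClassName value in a manifest.
# Assumes the key appears as "storageClassName: <value>" on one line.
# Usage: set_storage_class <file> <class-name>
set_storage_class() {
  local file="$1" class="$2" tmp
  tmp="$(mktemp)" || return 1
  # Keep the original indentation and key; replace only the value.
  sed -E "s|^([[:space:]]*storageClassName:).*|\1 ${class}|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

For example, `set_storage_class model-cache/model-cache.yaml my-class` would rewrite the entry in place, where `my-class` is a placeholder for a class reported by `kubectl get storageclass`.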
## Backend-Specific Notes
### SGLang
- Uses WideEP (Wide Expert Parallel) for efficient MoE inference
- See [sglang/README.md](sglang/README.md) for SGLang-specific configuration
### TensorRT-LLM
- Requires FP4 quantized checkpoint
- GB200-specific optimizations
### vLLM
- Uses DEP (Data-Expert Parallel) with hybrid load balancing
- See [vllm/disagg/README.md](vllm/disagg/README.md) for detailed setup
...@@ -21,7 +21,7 @@ spec: ...@@ -21,7 +21,7 @@ spec:
mountPoint: /opt/model mountPoint: /opt/model
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
decode: decode:
componentType: worker componentType: worker
subComponentType: decode subComponentType: decode
...@@ -38,7 +38,7 @@ spec: ...@@ -38,7 +38,7 @@ spec:
size: 80Gi size: 80Gi
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
workingDir: /sgl-workspace/dynamo workingDir: /sgl-workspace/dynamo
command: command:
- python3 - python3
...@@ -83,7 +83,7 @@ spec: ...@@ -83,7 +83,7 @@ spec:
size: 80Gi size: 80Gi
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
workingDir: /sgl-workspace/dynamo workingDir: /sgl-workspace/dynamo
command: command:
- python3 - python3
......
...@@ -21,7 +21,7 @@ spec: ...@@ -21,7 +21,7 @@ spec:
mountPoint: /opt/model mountPoint: /opt/model
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
decode: decode:
componentType: worker componentType: worker
subComponentType: decode subComponentType: decode
...@@ -36,7 +36,7 @@ spec: ...@@ -36,7 +36,7 @@ spec:
size: 80Gi size: 80Gi
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
workingDir: /workspace workingDir: /workspace
command: command:
- python3 - python3
...@@ -78,7 +78,7 @@ spec: ...@@ -78,7 +78,7 @@ spec:
size: 80Gi size: 80Gi
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.0
workingDir: /workspace workingDir: /workspace
command: command:
- python3 - python3
......
...@@ -126,7 +126,7 @@ spec: ...@@ -126,7 +126,7 @@ spec:
tolerations: [] tolerations: []
affinity: {} affinity: {}
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
args: args:
- | - |
python3 -m dynamo.frontend --http-port 8000 python3 -m dynamo.frontend --http-port 8000
...@@ -158,7 +158,7 @@ spec: ...@@ -158,7 +158,7 @@ spec:
tolerations: [] tolerations: []
affinity: {} affinity: {}
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
workingDir: /workspace/components/backends/trtllm workingDir: /workspace/components/backends/trtllm
# NOTE: If your PVCs (Persistent Volume Claims) are really slow, # NOTE: If your PVCs (Persistent Volume Claims) are really slow,
# you might need to increase 'failureThreshold' below to allow more time for startup # you might need to increase 'failureThreshold' below to allow more time for startup
...@@ -216,7 +216,7 @@ spec: ...@@ -216,7 +216,7 @@ spec:
tolerations: [] tolerations: []
affinity: {} affinity: {}
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
workingDir: /workspace/components/backends/trtllm workingDir: /workspace/components/backends/trtllm
# NOTE: If your PVCs (Persistent Volume Claims) are really slow, # NOTE: If your PVCs (Persistent Volume Claims) are really slow,
# you might need to increase 'failureThreshold' below to allow more time for startup # you might need to increase 'failureThreshold' below to allow more time for startup
......
...@@ -7,6 +7,9 @@ metadata: ...@@ -7,6 +7,9 @@ metadata:
name: vllm-dsr1 name: vllm-dsr1
spec: spec:
backendFramework: vllm backendFramework: vllm
envs:
- name: HF_HOME
value: /model-cache
pvcs: pvcs:
- name: model-cache - name: model-cache
create: false create: false
...@@ -23,7 +26,7 @@ spec: ...@@ -23,7 +26,7 @@ spec:
periodSeconds: 10 periodSeconds: 10
timeoutSeconds: 1800 timeoutSeconds: 1800
failureThreshold: 60 failureThreshold: 60
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
decode: decode:
componentType: worker componentType: worker
subComponentType: decode subComponentType: decode
...@@ -49,7 +52,7 @@ spec: ...@@ -49,7 +52,7 @@ spec:
periodSeconds: 10 periodSeconds: 10
timeoutSeconds: 10 timeoutSeconds: 10
failureThreshold: 600 failureThreshold: 600
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/dynamo workingDir: /workspace/dynamo
env: env:
- name: VLLM_USE_DEEP_GEMM - name: VLLM_USE_DEEP_GEMM
...@@ -70,7 +73,7 @@ spec: ...@@ -70,7 +73,7 @@ spec:
args: args:
- | - |
exec python3 -m dynamo.vllm \ exec python3 -m dynamo.vllm \
--model /model-cache/deepseek-r1 \ --model deepseek-ai/DeepSeek-R1 \
--served-model-name deepseek-ai/DeepSeek-R1 \ --served-model-name deepseek-ai/DeepSeek-R1 \
--all2all-backend deepep_low_latency \ --all2all-backend deepep_low_latency \
--data-parallel-hybrid-lb \ --data-parallel-hybrid-lb \
...@@ -110,7 +113,7 @@ spec: ...@@ -110,7 +113,7 @@ spec:
periodSeconds: 10 periodSeconds: 10
timeoutSeconds: 10 timeoutSeconds: 10
failureThreshold: 600 failureThreshold: 600
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/dynamo workingDir: /workspace/dynamo
env: env:
- name: VLLM_USE_DEEP_GEMM - name: VLLM_USE_DEEP_GEMM
...@@ -129,7 +132,7 @@ spec: ...@@ -129,7 +132,7 @@ spec:
args: args:
- | - |
exec python3 -m dynamo.vllm \ exec python3 -m dynamo.vllm \
--model /model-cache/deepseek-r1 \ --model deepseek-ai/DeepSeek-R1 \
--is-prefill-worker \ --is-prefill-worker \
--served-model-name deepseek-ai/DeepSeek-R1 \ --served-model-name deepseek-ai/DeepSeek-R1 \
--all2all-backend deepep_high_throughput \ --all2all-backend deepep_high_throughput \
......
...@@ -45,7 +45,7 @@ spec: ...@@ -45,7 +45,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: my-registry/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
replicas: 1 replicas: 1
TrtllmWorker: TrtllmWorker:
componentType: main componentType: main
...@@ -79,7 +79,7 @@ spec: ...@@ -79,7 +79,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: my-registry/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
env: env:
- name: TRTLLM_ENABLE_PDL - name: TRTLLM_ENABLE_PDL
value: "1" value: "1"
...@@ -90,7 +90,7 @@ spec: ...@@ -90,7 +90,7 @@ spec:
- name: ENGINE_ARGS - name: ENGINE_ARGS
value: "/opt/dynamo/configs/config.yaml" value: "/opt/dynamo/configs/config.yaml"
- name: MODEL_PATH - name: MODEL_PATH
value: "/opt/models/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a" value: "openai/gpt-oss-120b"
- name: HF_HOME - name: HF_HOME
value: /opt/models value: /opt/models
volumeMounts: volumeMounts:
......
# Llama-3.3-70B Recipes
Production-ready deployments for **Llama-3.3-70B-Instruct** using vLLM with FP8 dynamic quantization.
## Available Configurations
| Configuration | GPUs | Mode | Description |
|--------------|------|------|-------------|
| [**vllm/agg**](vllm/agg/) | 4x H100/H200 | Aggregated | Single-node, TP4 |
| [**vllm/disagg-single-node**](vllm/disagg-single-node/) | 8x H100/H200 | Disaggregated | Prefill/decode separation on one node |
| [**vllm/disagg-multi-node**](vllm/disagg-multi-node/) | 16x H100/H200 | Disaggregated | 2 nodes, 8 GPUs each |
## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with H100 or H200 GPUs matching the configuration requirements
3. **HuggingFace token** with access to Llama models
## Quick Start
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Download model (update storageClassName in model-cache.yaml first!)
kubectl apply -f model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
# Deploy (choose one configuration)
kubectl apply -f vllm/agg/deploy.yaml -n ${NAMESPACE}
# OR: kubectl apply -f vllm/disagg-single-node/deploy.yaml -n ${NAMESPACE}
# OR: kubectl apply -f vllm/disagg-multi-node/deploy.yaml -n ${NAMESPACE}
```
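To keep the token out of the command itself, the secret can be created from an environment variable instead of a literal string. A minimal sketch, assuming `HF_TOKEN` and `NAMESPACE` are already set in the shell; `require_env` is a hypothetical guard, not part of the recipe.

```bash
# Hypothetical guard: fail early if a required variable is unset or empty.
# Usage: require_env <variable-name>
require_env() {
  local name="$1" value
  # Indirect lookup of the variable whose name is stored in $name.
  eval "value=\${${name}:-}"
  if [ -z "$value" ]; then
    echo "error: ${name} is not set" >&2
    return 1
  fi
}

# Example (not executed here):
#   require_env HF_TOKEN && require_env NAMESPACE && \
#     kubectl create secret generic hf-token-secret \
#       --from-literal=HF_TOKEN="${HF_TOKEN}" -n "${NAMESPACE}"
```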
## Test the Deployment
```bash
# Port-forward the frontend
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
# Send a test request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
```
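`kubectl port-forward` blocks the terminal, and the frontend may need some time before it answers, so the test request can be wrapped in a retry loop. The sketch below is generic; `retry` is a hypothetical helper, and the `/v1/models` path in the usage comment is an assumption about the OpenAI-compatible API rather than something this recipe states.

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): port-forward in the background, then poll.
#   kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n "${NAMESPACE}" &
#   retry 30 10 curl -sf http://localhost:8000/v1/models
```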
## Model Details
- **Model**: `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`
- **Quantization**: FP8 dynamic (FP8 weights stored in the checkpoint; activation scales computed at runtime)
- **Context length**: Default model context
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` to match your cluster before deploying
- Model download takes approximately 15-30 minutes depending on network speed
- For GAIE (Gateway API Inference Extension) integration, see [vllm/agg/gaie/](vllm/agg/gaie/)
...@@ -17,7 +17,7 @@ spec: ...@@ -17,7 +17,7 @@ spec:
mountPoint: /opt/models mountPoint: /opt/models
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
envs: envs:
- name: HF_HOME - name: HF_HOME
...@@ -37,7 +37,7 @@ spec: ...@@ -37,7 +37,7 @@ spec:
- name: SERVED_MODEL_NAME - name: SERVED_MODEL_NAME
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: MODEL_PATH - name: MODEL_PATH
value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: HF_HOME - name: HF_HOME
value: /opt/models value: /opt/models
args: args:
...@@ -45,7 +45,7 @@ spec: ...@@ -45,7 +45,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
replicas: 1 replicas: 1
resources: resources:
......
...@@ -38,7 +38,7 @@ spec: ...@@ -38,7 +38,7 @@ spec:
containers: containers:
- name: epp - name: epp
image: nvcr.io/nvidia/ai-dynamo/frontend:<my-tag> image: nvcr.io/nvidia/ai-dynamo/frontend:0.8.0
imagePullPolicy: IfNotPresent imagePullPolicy: IfNotPresent
resources: resources:
requests: requests:
......
...@@ -17,7 +17,7 @@ spec: ...@@ -17,7 +17,7 @@ spec:
mountPoint: /opt/models mountPoint: /opt/models
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
envs: envs:
- name: HF_HOME - name: HF_HOME
...@@ -38,15 +38,15 @@ spec: ...@@ -38,15 +38,15 @@ spec:
- name: SERVED_MODEL_NAME - name: SERVED_MODEL_NAME
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: MODEL_PATH - name: MODEL_PATH
value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: HF_HOME - name: HF_HOME
value: /opt/models value: /opt/models
args: args:
- "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.95 --no-enable-prefix-caching --block-size 128" - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
replicas: 1 replicas: 1
resources: resources:
...@@ -69,7 +69,7 @@ spec: ...@@ -69,7 +69,7 @@ spec:
- name: SERVED_MODEL_NAME - name: SERVED_MODEL_NAME
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: MODEL_PATH - name: MODEL_PATH
value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: HF_HOME - name: HF_HOME
value: /opt/models value: /opt/models
args: args:
...@@ -77,7 +77,7 @@ spec: ...@@ -77,7 +77,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
replicas: 1 replicas: 1
resources: resources:
......
...@@ -17,7 +17,7 @@ spec: ...@@ -17,7 +17,7 @@ spec:
mountPoint: /opt/models mountPoint: /opt/models
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
envs: envs:
- name: HF_HOME - name: HF_HOME
...@@ -50,15 +50,15 @@ spec: ...@@ -50,15 +50,15 @@ spec:
- name: SERVED_MODEL_NAME - name: SERVED_MODEL_NAME
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: MODEL_PATH - name: MODEL_PATH
value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: HF_HOME - name: HF_HOME
value: /opt/models value: /opt/models
args: args:
- "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 2 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.95 --no-enable-prefix-caching --block-size 128" - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 2 --data-parallel-size 1 --is-prefill-worker --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
replicas: 2 replicas: 2
resources: resources:
...@@ -93,7 +93,7 @@ spec: ...@@ -93,7 +93,7 @@ spec:
- name: SERVED_MODEL_NAME - name: SERVED_MODEL_NAME
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: MODEL_PATH - name: MODEL_PATH
value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd" value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
- name: HF_HOME - name: HF_HOME
value: /opt/models value: /opt/models
args: args:
...@@ -101,7 +101,7 @@ spec: ...@@ -101,7 +101,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
replicas: 1 replicas: 1
resources: resources:
......
# Qwen3-235B-A22B-FP8 Recipes
Production-ready deployments for **Qwen3-235B-A22B** (MoE model with 22B active parameters) using TensorRT-LLM.
## Available Configurations
| Configuration | GPUs | Mode | Description |
|--------------|------|------|-------------|
| [**trtllm/agg**](trtllm/agg/) | 16x GPU | Aggregated | TP4, EP4, KV-aware routing |
| [**trtllm/disagg**](trtllm/disagg/) | 16x GPU | Disaggregated | Prefill/decode separation |
## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with H100/H200 GPUs (high memory recommended)
3. **HuggingFace token** with access to Qwen models
## Quick Start
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
# Download model (update storageClassName in model-cache.yaml first!)
kubectl apply -f model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
# Deploy (choose one configuration)
kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
# OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
```
## Test the Deployment
```bash
# Port-forward the frontend
kubectl port-forward svc/qwen3-235b-a22b-agg-frontend 8000:8000 -n ${NAMESPACE}
# Send a test request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-235B-A22B-FP8",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
```
## Model Details
- **Model**: `Qwen/Qwen3-235B-A22B-FP8`
- **Architecture**: 235B parameter Mixture-of-Experts (MoE)
- **Active parameters**: ~22B per token
- **Backend**: TensorRT-LLM (PyTorch backend)
- **Parallelism**: TP4 × EP4 (Expert Parallel)
## Hardware Requirements
This is a large MoE model requiring significant GPU resources:
| Configuration | GPUs | Min GPU VRAM (Total) |
|--------------|------|----------------------|
| Aggregated | 16x H100/H200 | ~1.3TB |
| Disaggregated | 16x H100/H200 | ~1.3TB |
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- Model download may take 30-60 minutes
- Uses KV-aware routing for efficient cache utilization
- Chunked prefill enabled for aggregated mode (disabled for disaggregated)
...@@ -54,7 +54,7 @@ spec: ...@@ -54,7 +54,7 @@ spec:
- qwen3-235b-a22b-agg-frontend - qwen3-235b-a22b-agg-frontend
topologyKey: kubernetes.io/hostname topologyKey: kubernetes.io/hostname
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
args: args:
- python3 -m dynamo.frontend --router-mode kv --http-port 8000 - python3 -m dynamo.frontend --router-mode kv --http-port 8000
command: command:
...@@ -78,7 +78,9 @@ spec: ...@@ -78,7 +78,9 @@ spec:
mainContainer: mainContainer:
env: env:
- name: MODEL_PATH - name: MODEL_PATH
value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-FP8/snapshots/39eb2b067ea6b8e3e1dd97d3cd0c7ffeaf3e1a35 value: Qwen/Qwen3-235B-A22B-FP8
- name: HF_HOME
value: /mnt/model-cache
- name: ENGINE_ARGS - name: ENGINE_ARGS
value: /engine_configs/agg.yaml value: /engine_configs/agg.yaml
command: command:
...@@ -90,7 +92,7 @@ spec: ...@@ -90,7 +92,7 @@ spec:
--model-path "${MODEL_PATH}" \ --model-path "${MODEL_PATH}" \
--served-model-name "Qwen/Qwen3-235B-A22B-FP8" \ --served-model-name "Qwen/Qwen3-235B-A22B-FP8" \
--extra-engine-args "${ENGINE_ARGS}" --extra-engine-args "${ENGINE_ARGS}"
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
workingDir: /workspace/components/backends/trtllm workingDir: /workspace/components/backends/trtllm
volumeMounts: volumeMounts:
- name: agg-config - name: agg-config
......
...@@ -83,7 +83,7 @@ spec: ...@@ -83,7 +83,7 @@ spec:
- qwen3-235b-a22b-disagg-frontend - qwen3-235b-a22b-disagg-frontend
topologyKey: kubernetes.io/hostname topologyKey: kubernetes.io/hostname
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
args: args:
- python3 -m dynamo.frontend --router-mode kv --http-port 8000 - python3 -m dynamo.frontend --router-mode kv --http-port 8000
command: command:
...@@ -112,10 +112,12 @@ spec: ...@@ -112,10 +112,12 @@ spec:
mainContainer: mainContainer:
env: env:
- name: MODEL_PATH - name: MODEL_PATH
value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-FP8/snapshots/39eb2b067ea6b8e3e1dd97d3cd0c7ffeaf3e1a35 value: Qwen/Qwen3-235B-A22B-FP8
- name: HF_HOME
value: /mnt/model-cache
- name: ENGINE_ARGS - name: ENGINE_ARGS
value: /engine_configs/prefill.yaml value: /engine_configs/prefill.yaml
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
workingDir: /workspace/components/backends/trtllm workingDir: /workspace/components/backends/trtllm
command: command:
- /bin/sh - /bin/sh
...@@ -163,10 +165,12 @@ spec: ...@@ -163,10 +165,12 @@ spec:
mainContainer: mainContainer:
env: env:
- name: MODEL_PATH - name: MODEL_PATH
value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-FP8/snapshots/39eb2b067ea6b8e3e1dd97d3cd0c7ffeaf3e1a35 value: Qwen/Qwen3-235B-A22B-FP8
- name: HF_HOME
value: /mnt/model-cache
- name: ENGINE_ARGS - name: ENGINE_ARGS
value: /engine_configs/decode.yaml value: /engine_configs/decode.yaml
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
workingDir: /workspace/components/backends/trtllm workingDir: /workspace/components/backends/trtllm
command: command:
- /bin/sh - /bin/sh
......
# Qwen3-32B-FP8 Recipes
Production-ready deployments for **Qwen3-32B** with FP8 quantization using TensorRT-LLM.
## Available Configurations
| Configuration | GPUs | Mode | Description |
|--------------|------|------|-------------|
| [**trtllm/agg**](trtllm/agg/) | 2x GPU | Aggregated | TP2, round-robin routing |
| [**trtllm/disagg**](trtllm/disagg/) | 8x GPU | Disaggregated | Prefill/decode separation |
## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with H100/H200/A100 GPUs
3. **HuggingFace token** with access to Qwen models
## Quick Start
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model (update storageClassName in model-cache.yaml first!)
kubectl apply -f model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=1800s

# Deploy (choose one configuration)
kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
# OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
```
## Test the Deployment
```bash
# Port-forward the aggregated frontend (blocks; run the curl from a second terminal)
kubectl port-forward svc/qwen3-32b-fp8-agg-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
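For scripted smoke tests, the same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of Dynamo, and the sketch assumes the port-forward above is still running on `localhost:8000`.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-32B-FP8", max_tokens=50,
                       base_url="http://localhost:8000"):
    """Build (without sending) the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs the port-forward to be active):
#   reply = send_chat_request(build_chat_request("Hello!"))
#   print(json.dumps(reply, indent=2))
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.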
## Model Details
- **Model**: `Qwen/Qwen3-32B-FP8`
- **Backend**: TensorRT-LLM (PyTorch backend)
- **Quantization**: FP8
- **Tensor Parallel**: 2
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- The aggregated config uses CUDA graphs for optimized inference
- KV cache uses FP8 dtype for memory efficiency
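To make the FP8 KV cache note concrete, here is a rough back-of-the-envelope sketch of the per-token cache footprint. The layer count, KV-head count, and head dimension below are illustrative assumptions rather than values taken from this recipe; read the real ones from the model's `config.json` before relying on the numbers.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache one token occupies across all layers.

    The factor of 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Illustrative dimensions only (assumed, not taken from this recipe).
dims = dict(num_layers=64, num_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes_per_token(**dims, bytes_per_element=2)
fp8 = kv_cache_bytes_per_token(**dims, bytes_per_element=1)

print(f"FP16: {fp16 / 1024:.0f} KiB per token")
print(f"FP8:  {fp8 / 1024:.0f} KiB per token")
```

Whatever the exact dimensions, a one-byte element halves the cache size relative to a two-byte one, so roughly twice as many tokens fit in the same GPU memory budget.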
...@@ -61,7 +61,7 @@ spec: ...@@ -61,7 +61,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
replicas: 1 replicas: 1
TrtllmWorker: TrtllmWorker:
componentType: main componentType: main
...@@ -94,7 +94,7 @@ spec: ...@@ -94,7 +94,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
env: env:
- name: TRTLLM_ENABLE_PDL - name: TRTLLM_ENABLE_PDL
value: "1" value: "1"
......
...@@ -218,7 +218,7 @@ spec: ...@@ -218,7 +218,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
replicas: 1 replicas: 1
TrtllmPrefillWorker: TrtllmPrefillWorker:
componentType: worker componentType: worker
...@@ -253,7 +253,7 @@ spec: ...@@ -253,7 +253,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
env: env:
- name: TRTLLM_ENABLE_PDL - name: TRTLLM_ENABLE_PDL
value: "1" value: "1"
...@@ -313,7 +313,7 @@ spec: ...@@ -313,7 +313,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0
env: env:
- name: TRTLLM_ENABLE_PDL - name: TRTLLM_ENABLE_PDL
value: "1" value: "1"
......
...@@ -18,7 +18,7 @@ spec: ...@@ -18,7 +18,7 @@ spec:
value: /home/dynamo/.cache/huggingface value: /home/dynamo/.cache/huggingface
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace workingDir: /workspace
command: command:
- python3 - python3
...@@ -63,7 +63,7 @@ spec: ...@@ -63,7 +63,7 @@ spec:
- python3 - python3
- -m - -m
- dynamo.vllm - dynamo.vllm
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
env: env:
- name: DYN_HEALTH_CHECK_ENABLED - name: DYN_HEALTH_CHECK_ENABLED
value: "false" value: "false"
......
...@@ -26,7 +26,7 @@ spec: ...@@ -26,7 +26,7 @@ spec:
- python - python
- -m - -m
- dynamo.frontend - dynamo.frontend
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace workingDir: /workspace
replicas: 1 replicas: 1
resources: resources:
...@@ -60,7 +60,7 @@ spec: ...@@ -60,7 +60,7 @@ spec:
- python3 - python3
- -m - -m
- dynamo.vllm - dynamo.vllm
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
workingDir: /workspace workingDir: /workspace
env: env:
- name: DYN_HEALTH_CHECK_ENABLED - name: DYN_HEALTH_CHECK_ENABLED
...@@ -112,7 +112,7 @@ spec: ...@@ -112,7 +112,7 @@ spec:
- python3 - python3
- -m - -m
- dynamo.vllm - dynamo.vllm
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
env: env:
- name: DYN_HEALTH_CHECK_ENABLED - name: DYN_HEALTH_CHECK_ENABLED
value: "false" value: "false"
......