Commit 88dfd1b3 authored by Ben Hamm, committed by GitHub

docs: Clean up incomplete recipes and clarify Kubernetes-only focus (#4159)


Signed-off-by: Ben Hamm <ben.hamm@gmail.com>
Signed-off-by: Tanmay Verma <tanmay2592@gmail.com>
Signed-off-by: atchernych <atchernych@nvidia.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: tanmayv25 <tanmay2592@gmail.com>
Co-authored-by: Tanmay Verma <tanmayv@nvidia.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: atchernych <atchernych@nvidia.com>
parent 09bb1c68
...@@ -17,11 +17,11 @@ NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} ...@@ -17,11 +17,11 @@ NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4} NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4}
NUM_PREFILL_WORKERS=${NUM_PREFILL_WORKERS:-1} NUM_PREFILL_WORKERS=${NUM_PREFILL_WORKERS:-1}
PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml}" PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_prefill.yaml}"
NUM_DECODE_NODES=${NUM_DECODE_NODES:-4} NUM_DECODE_NODES=${NUM_DECODE_NODES:-4}
NUM_DECODE_WORKERS=${NUM_DECODE_WORKERS:-1} NUM_DECODE_WORKERS=${NUM_DECODE_WORKERS:-1}
DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml}" DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_decode.yaml}"
# Automate settings of certain variables for convenience, but you are free # Automate settings of certain variables for convenience, but you are free
# to manually set these for more control as well. # to manually set these for more control as well.
......
# Dynamo Model Serving Recipes # Dynamo Production-Ready Recipes
This repository contains production-ready recipes for deploying large language models using the Dynamo platform. Each recipe includes deployment configurations, performance benchmarking, and model caching setup. Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
## Contents > **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform.
- [Available Models](#available-models) > If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first.
- [Quick Start](#quick-start)
- [Prerequisites](#prerequisites)
- Deployment Methods
- [Option 1: Automated Deployment](#option-1-automated-deployment)
- [Option 2: Manual Deployment](#option-2-manual-deployment)
## Available Recipes
## Available Models | Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |GAIE integration |
|-------|-----------|------|------|------------|------------------|-------|------------------|
| Model Family | Framework | Deployment Mode | GPU Requirements | Status | Benchmark |GAIE-integration | | **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
|-----------------|-----------|---------------------|------------------|--------|-----------|------------------| | **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| llama-3-70b | vllm | agg | 4x H100/H200 | ✅ | ✅ |✅ | | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
| llama-3-70b | vllm | disagg (1 node) | 8x H100/H200 | ✅ | ✅ | 🚧 | | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization | ❌ |
| llama-3-70b | vllm | disagg (multi-node) | 16x H100/H200 | ✅ | ✅ |🚧 | | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation | ❌ |
| deepseek-r1 | sglang | disagg (1 node, wide-ep) | 8x H200 | ✅ | 🚧 |🚧 | | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| deepseek-r1 | sglang | disagg (multi-node, wide-ep) | 16x H200 | ✅ | 🚧 |🚧 | | **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest | ❌ |
| gpt-oss-120b | trtllm | agg | 4x GB200 | ✅ | ✅ |🚧 | | **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ |Multi-node: 8 decode + 1 prefill nodes | ❌ |
**Legend:** **Legend:**
- ✅ Functional - **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
- 🚧 Under development - **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
## Recipe Structure
Each complete recipe follows this standard structure:
**Recipe Directory Structure:** ```
Recipes are organized into a directory structure that follows the pattern:
```text
<model-name>/ <model-name>/
├── README.md (optional) # Model-specific deployment notes
├── model-cache/ ├── model-cache/
│ ├── model-cache.yaml # PVC for model cache │ ├── model-cache.yaml # PersistentVolumeClaim for model storage
│ └── model-download.yaml # Job for model download │ └── model-download.yaml # Job to download model from HuggingFace
├── <framework>/ └── <framework>/ # vllm, sglang, or trtllm
│ └── <deployment-mode>/ └── <deployment-mode>/ # agg, disagg, disagg-single-node, etc.
│ ├── deploy.yaml # DynamoGraphDeployment CRD and optional configmap for custom configuration ├── deploy.yaml # Complete DynamoGraphDeployment manifest
│ └── perf.yaml (optional) # Performance benchmark └── perf.yaml (optional) # AIPerf benchmark job
└── README.md (optional) # Model documentation
``` ```
## Quick Start ## Quick Start
Follow the instructions in the [Prerequisites](#prerequisites) section to set up your environment. ### Prerequisites
Choose your preferred deployment method: using the `run.sh` script or manual deployment steps.
## Prerequisites **1. Dynamo Platform Installed**
### 1. Environment Setup The recipes require the Dynamo Kubernetes Platform to be installed. Follow the installation guide:
Create a Kubernetes namespace and set environment variable: - **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Quickstart (~10 minutes)
- **[Detailed Installation Guide](../docs/kubernetes/installation_guide.md)** - Advanced options
```bash **2. GPU Cluster Requirements**
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
```
### 2. Deploy Dynamo Platform Ensure your cluster has:
- GPU nodes matching recipe requirements (see table above)
Install the Dynamo Cloud Platform following the [Quickstart Guide](../docs/kubernetes/README.md).
### 3. GPU Cluster
Ensure your Kubernetes cluster has:
- GPU nodes with appropriate GPU types (see model requirements above)
- GPU operator installed - GPU operator installed
- Sufficient GPU memory and compute resources - Appropriate GPU drivers and container runtime
### 4. Container Registry Access
Ensure access to NVIDIA container registry for runtime images: **3. HuggingFace Access**
- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:x.y.z`
### 5. HuggingFace Access and Kubernetes Secret Creation Configure authentication to download models:
Set up a kubernetes secret with the HuggingFace token for model download:
```bash ```bash
# Update the token in the secret file export NAMESPACE=your-namespace
vim hf_hub_secret/hf_hub_secret.yaml kubectl create namespace ${NAMESPACE}
# Apply the secret # Create HuggingFace token secret
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE} kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}
``` ```
6. Configure Storage Class **4. Storage Configuration**
Update the `storageClassName` in `<model>/model-cache/model-cache.yaml` to match your cluster:
```bash ```bash
# Check available storage classes # Find your storage class name
kubectl get storageclass kubectl get storageclass
```
Replace "your-storage-class-name" with your actual storage class in the file: `<model>/model-cache/model-cache.yaml`
```yaml # Edit the model-cache.yaml file and update:
# In <model>/model-cache/model-cache.yaml # spec:
spec: # storageClassName: "your-actual-storage-class"
storageClassName: "your-actual-storage-class" # Replace this
``` ```
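The storage class edit described above can also be scripted. The following is a minimal sketch and not part of the recipes: the function name `set_storage_class` is hypothetical, and it assumes the manifest carries a single-line `storageClassName:` field like the one shown above.

```bash
# Hypothetical helper: rewrite the storageClassName value in a model-cache.yaml.
# Assumes the field sits on one line, e.g.  storageClassName: "your-storage-class-name"
# and that the class name does not contain the '|' character used as the sed delimiter.
set_storage_class() {
  local file="$1" class="$2"
  # Keep the original indentation and key; replace only the value.
  sed -E "s|^([[:space:]]*storageClassName:).*|\1 \"${class}\"|" "$file" > "${file}.tmp" \
    && mv "${file}.tmp" "$file"
}
```

For example, `set_storage_class <model>/model-cache/model-cache.yaml my-storage-class` updates the file in place, where `my-storage-class` stands for a name reported by `kubectl get storageclass`.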
## Option 1: Automated Deployment ### Deploy a Recipe
Use the `run.sh` script for fully automated deployment:
**Note:** The script automatically:
- Creates the model cache PVC and downloads the model
- Deploys the model service
- Runs performance benchmark if a `perf.yaml` file is present in the deployment directory
**Step 1: Download Model**
#### Script Usage
```bash ```bash
./run.sh [OPTIONS] --model <model> --framework <framework> --deployment <deployment-type> # Update storageClassName in model-cache.yaml first!
``` kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}
**Required Options:** # Wait for download to complete (may take 10-60 minutes depending on model size)
- `--model <model>`: Model name matching the directory name in the recipes directory (e.g., llama-3-70b, gpt-oss-120b, deepseek-r1) kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
- `--framework <framework>`: Backend framework (`vllm`, `trtllm`, `sglang`)
- `--deployment <deployment-type>`: Deployment mode (e.g., agg, disagg, disagg-single-node, disagg-multi-node)
**Optional Options:** # Monitor progress
- `--namespace <namespace>`: Kubernetes namespace (default: dynamo) kubectl logs -f job/model-download -n ${NAMESPACE}
- `--dry-run`: Show commands without executing them ```
- `-h, --help`: Show help message
**Environment Variables:** **Step 2: Deploy Service**
- `NAMESPACE`: Kubernetes namespace (default: dynamo)
#### Example Usage
```bash ```bash
# Set up environment kubectl apply -f <model>/<framework>/<mode>/deploy.yaml -n ${NAMESPACE}
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
# Configure HuggingFace token
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
# use run.sh script to deploy the model
# Deploy Llama-3-70B with vLLM (aggregated mode)
./run.sh --model llama-3-70b --framework vllm --deployment agg
# Deploy GPT-OSS-120B with TensorRT-LLM # Check deployment status
./run.sh --model gpt-oss-120b --framework trtllm --deployment agg kubectl get dynamographdeployment -n ${NAMESPACE}
# Deploy DeepSeek-R1 with SGLang (disaggregated mode)
./run.sh --model deepseek-r1 --framework sglang --deployment disagg
# Deploy with custom namespace # Check pod status
./run.sh --namespace my-namespace --model llama-3-70b --framework vllm --deployment agg kubectl get pods -n ${NAMESPACE}
# Dry run to see what would be executed # Wait for pods to be ready
./run.sh --dry-run --model llama-3-70b --framework vllm --deployment agg kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=<deployment-name> -n ${NAMESPACE} --timeout=600s
``` ```
## If deploying with Gateway API Inference extension GAIE **Step 3: Test Deployment**
1. Follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE. ```bash
# Port forward to access the service locally
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n ${NAMESPACE}
2. Apply manifests by running a script. # In another terminal, test the endpoint
curl http://localhost:8000/v1/models
```bash # Send a test request
# Match the block size to the cli value in your deployment file deploy.yaml: - "python3 -m dynamo.vllm ... --block-size 128" curl http://localhost:8000/v1/chat/completions \
export DYNAMO_KV_BLOCK_SIZE=128 -H "Content-Type: application/json" \
export EPP_IMAGE=nvcr.io/you/epp:tag -d '{
# Add --gaie argument to the script i.e.: "model": "<model-name>",
./run.sh --model llama-3-70b --framework vllm --gaie agg --deployment agg "messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
``` ```
The script will perform gateway checks and apply the manifests.
## Option 2: Manual Deployment **Step 4: Run Benchmark (Optional)**
For step-by-step manual deployment, follow these steps:
```bash ```bash
# 0. Set up environment (see Prerequisites section) # Only if perf.yaml exists in the recipe directory
export NAMESPACE=your-namespace kubectl apply -f <model>/<framework>/<mode>/perf.yaml -n ${NAMESPACE}
kubectl create namespace ${NAMESPACE}
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
# 1. Download model (see Model Download section) # Monitor benchmark progress
kubectl apply -n $NAMESPACE -f <model>/model-cache/ kubectl logs -f job/<benchmark-job-name> -n ${NAMESPACE}
# 2. Deploy model (see Deployment section) # View results after completion
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/deploy.yaml kubectl logs job/<benchmark-job-name> -n ${NAMESPACE} | tail -50
# 3. Run benchmarks (optional, if perf.yaml exists)
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
``` ```
### Step 1: Download Model **Inference Gateway (GAIE) Integration (Optional)**
```bash For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.
# Start the download job
kubectl apply -n $NAMESPACE -f <model>/model-cache
# Verify job creation Follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE, then apply the manifests.
kubectl get jobs -n $NAMESPACE | grep model-download
```
Monitor and wait for the model download to complete:
```bash ```bash
export DEPLOY_PATH=llama-3-70b/vllm/agg/
#DEPLOY_PATH=<model>/<framework>/<mode>/
kubectl apply -R -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"
# Wait for job completion (timeout after 100 minutes) ## Example Deployments
kubectl wait --for=condition=Complete job/model-download -n $NAMESPACE --timeout=6000s
# Check job status ### Llama-3-70B with vLLM (Aggregated)
kubectl get job model-download -n $NAMESPACE
# View download logs
kubectl logs job/model-download -n $NAMESPACE
```
### Step 2: Deploy Model Service
```bash ```bash
# Navigate to the specific deployment configuration export NAMESPACE=dynamo-demo
cd <model>/<framework>/<deployment-mode>/ kubectl create namespace ${NAMESPACE}
# Deploy the model service # Create HF token secret
kubectl apply -n $NAMESPACE -f deploy.yaml kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token" \
-n ${NAMESPACE}
# Verify deployment creation # Deploy
kubectl get deployments -n $NAMESPACE kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
# Test
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
``` ```
#### Wait for Deployment Ready ### DeepSeek-R1 on GB200 (Multi-node)
```bash See [deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml](deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) for the complete multi-node WideEP configuration.
# Get deployment name from the deploy.yaml file
DEPLOYMENT_NAME=$(grep "name:" deploy.yaml | head -1 | awk '{print $2}')
# Wait for deployment to be ready (timeout after 10 minutes) ## Customization
kubectl wait --for=condition=available deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=1200s
# Check deployment status Each `deploy.yaml` contains:
kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE - **ConfigMap**: Engine-specific configuration (embedded in the manifest)
- **DynamoGraphDeployment**: Kubernetes resource definitions
- **Resource limits**: GPU count, memory, CPU requests/limits
- **Image references**: Container images with version tags
# Check pod status ### Key Customization Points
kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT_NAME
```
#### Verify Model Service **Model Configuration:**
```yaml
# In deploy.yaml under worker args:
args:
- python3 -m dynamo.vllm --model <your-model-path> --served-model-name <name>
```
```bash **GPU Resources:**
# Check if service is running ```yaml
kubectl get services -n $NAMESPACE resources:
limits:
gpu: "4" # Adjust based on your requirements
requests:
gpu: "4"
```
# Test model endpoint (port-forward to test locally) **Scaling:**
kubectl port-forward service/${DEPLOYMENT_NAME}-frontend 8000:8000 -n $NAMESPACE ```yaml
services:
VllmDecodeWorker:
replicas: 2 # Scale to multiple workers
```
# Test the model API (in another terminal) **Router Mode:**
curl http://localhost:8000/v1/models ```yaml
# In Frontend args:
args:
- python3 -m dynamo.frontend --router-mode kv --http-port 8000
# Options: round-robin, kv (KV-aware routing)
```
# Stop port-forward when done **Container Images:**
pkill -f "kubectl port-forward" ```yaml
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
# Update version tag as needed
``` ```
### Step 3: Performance Benchmarking (Optional) ## Troubleshooting
Run performance benchmarks to evaluate model performance. Note that benchmarking is only available for models that include a `perf.yaml` file (optional): ### Common Issues
#### Launch Benchmark Job **Pods stuck in Pending:**
- Check GPU availability: `kubectl describe node <node-name>`
- Verify storage class exists: `kubectl get storageclass`
- Check resource requests vs. available resources
```bash **Model download fails:**
# From the deployment directory - Verify HuggingFace token is correct
kubectl apply -n $NAMESPACE -f perf.yaml - Check network connectivity from cluster
- Review job logs: `kubectl logs job/model-download -n ${NAMESPACE}`
# Verify benchmark job creation **Workers fail to start:**
kubectl get jobs -n $NAMESPACE - Check GPU compatibility (driver version, CUDA version)
``` - Verify image pull secrets if using private registries
- Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`
#### Monitor Benchmark Progress **For more troubleshooting:**
- [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting)
- [Observability Documentation](../docs/kubernetes/observability/)
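The checks for pods stuck in Pending can be gathered into one first-pass helper. This is a sketch only: the function name `diagnose_pending` and the `KUBECTL` override variable are hypothetical, while the `kubectl` subcommands and flags are standard ones.

```bash
# Sketch: collect first-pass diagnostics for pods stuck in Pending.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
diagnose_pending() {
  local ns="$1" kubectl="${KUBECTL:-kubectl}"
  # Pods the scheduler has not placed yet
  $kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  # Recent events usually name the cause (insufficient GPUs, unbound PVC, ...)
  $kubectl get events -n "$ns" --sort-by=.lastTimestamp
  # PVC binding state, relevant when the storage class is wrong
  $kubectl get pvc -n "$ns"
}
```

Running `KUBECTL=echo diagnose_pending "${NAMESPACE}"` prints the three commands without contacting the cluster, which is a quick way to review them before use.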
```bash ## Related Documentation
# Get benchmark job name
PERF_JOB_NAME=$(grep "name:" perf.yaml | head -1 | awk '{print $2}')
# Monitor benchmark logs in real-time - **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts
kubectl logs -f job/$PERF_JOB_NAME -n $NAMESPACE - **[API Reference](../docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specification
- **[vLLM Backend Guide](../docs/backends/vllm/README.md)** - vLLM-specific features
- **[SGLang Backend Guide](../docs/backends/sglang/README.md)** - SGLang-specific features
- **[TensorRT-LLM Backend Guide](../docs/backends/trtllm/README.md)** - TensorRT-LLM features
- **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging
- **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing
# Wait for benchmark completion (timeout after 100 minutes) ## Contributing
kubectl wait --for=condition=Complete job/$PERF_JOB_NAME -n $NAMESPACE --timeout=6000s
```
#### View Benchmark Results We welcome contributions of new recipes! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
- Recipe submission guidelines
- Required components checklist
- Testing and validation requirements
- Documentation standards
```bash ### Recipe Quality Standards
# Check final benchmark results
kubectl logs job/$PERF_JOB_NAME -n $NAMESPACE | tail -50 A production-ready recipe must include:
``` - ✅ Complete `deploy.yaml` with DynamoGraphDeployment
\ No newline at end of file - ✅ Model cache PVC and download job
- ✅ Benchmark recipe (`perf.yaml`) for performance testing
- ✅ Verification on target hardware
- ✅ Documentation of GPU requirements
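The file-level items in this checklist can be checked mechanically. The sketch below is illustrative and not part of the repository: `check_recipe` is a hypothetical name, and it covers only file presence, since hardware verification and GPU documentation need human review.

```bash
# Sketch: report which required files a recipe directory is missing.
# Usage: check_recipe <model-dir> <framework> <mode>
# Returns 0 when all four files exist, 1 otherwise.
check_recipe() {
  local model_dir="$1" framework="$2" mode="$3" missing=0
  local required=(
    "model-cache/model-cache.yaml"
    "model-cache/model-download.yaml"
    "${framework}/${mode}/deploy.yaml"
    "${framework}/${mode}/perf.yaml"
  )
  local f
  for f in "${required[@]}"; do
    if [[ ! -f "${model_dir}/${f}" ]]; then
      echo "missing: ${f}"
      missing=1
    fi
  done
  return "$missing"
}
```

For instance, `check_recipe gpt-oss-120b trtllm disagg` would be expected to list `deploy.yaml` as missing for a directory that holds only engine configuration files.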
# GPT-OSS-120B Disaggregated Mode
> **⚠️ INCOMPLETE**: This directory contains only engine configuration files and is not ready for Kubernetes deployment.
## Current Status
This directory contains TensorRT-LLM engine configurations for disaggregated serving:
- `decode.yaml` - Decode worker engine configuration
- `prefill.yaml` - Prefill worker engine configuration
## Missing Components
To complete this recipe, the following files are needed:
- `deploy.yaml` - Kubernetes DynamoGraphDeployment manifest
- `perf.yaml` - Performance benchmarking job (optional)
## Alternative
For a production-ready GPT-OSS-120B deployment, use the **aggregated mode**:
- [gpt-oss-120b/trtllm/agg/](../agg/) - Complete with `deploy.yaml` and `perf.yaml`
## Contributing
If you'd like to complete this recipe, see [recipes/CONTRIBUTING.md](../../../CONTRIBUTING.md) for guidelines on creating proper Kubernetes deployment manifests.
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -euo pipefail
IFS=$'\n\t'
RECIPES_DIR="$( cd "$( dirname "$0" )" && pwd )"
# Default values
NAMESPACE="${NAMESPACE:-dynamo}"
DEPLOY_TYPE=""
GAIE="${GAIE:-false}"
DEPLOYMENT=""
MODEL=""
FRAMEWORK=""
DRY_RUN=""
# Frameworks - following container/build.sh pattern
declare -A FRAMEWORKS=(["VLLM"]=1 ["TRTLLM"]=2 ["SGLANG"]=3)
DEFAULT_FRAMEWORK=VLLM
# Function to show usage
usage() {
echo "Usage: $0 [OPTIONS] --model <model> --framework <framework> --deployment <deployment-type>"
echo ""
echo "Required Options:"
echo " --model <model> Model name (e.g., llama-3-70b)"
echo " --framework <fw> Framework one of ${!FRAMEWORKS[*]} (default: ${DEFAULT_FRAMEWORK})"
echo " --deployment <type> Deployment type (e.g., agg, disagg etc, please refer to the README.md for available deployment types)"
echo ""
echo "Optional:"
echo " --namespace <ns> Kubernetes namespace (default: dynamo)"
echo " --dry-run Print commands without executing them"
echo " --gaie[=true|false] Enable GAIE integration subfolder (applies GAIE manifests skips benchmark) (default: ${GAIE})"
echo " -h, --help Show this help message"
echo ""
echo "Environment Variables:"
echo " NAMESPACE Kubernetes namespace (default: dynamo)"
echo ""
echo "Examples:"
echo " $0 --model llama-3-70b --framework vllm --deployment agg"
echo " $0 --model llama-3-70b --framework trtllm --deployment disagg-single-node"
echo " $0 --namespace my-ns --model llama-3-70b --framework vllm --deployment disagg-multi-node"
exit 1
}
missing_requirement() {
echo "ERROR: $1 requires an argument."
usage
}
error() {
printf '%s %s\n' "$1" "$2" >&2
exit 1
}
while [[ $# -gt 0 ]]; do
case $1 in
--dry-run)
DRY_RUN="echo"
shift
;;
--model)
if [ "$2" ]; then
MODEL=$2
shift 2
else
missing_requirement "$1"
fi
;;
--framework)
if [ "$2" ]; then
FRAMEWORK=$2
shift 2
else
missing_requirement "$1"
fi
;;
--deployment)
if [ "$2" ]; then
DEPLOYMENT=$2
shift 2
else
missing_requirement "$1"
fi
;;
--namespace)
if [ "$2" ]; then
NAMESPACE=$2
shift 2
else
missing_requirement "$1"
fi
;;
--gaie)
GAIE=true
shift
;;
--gaie=false)
GAIE=false
shift
;;
--gaie=*)
GAIE="${1#*=}"
case "${GAIE,,}" in
true|false) GAIE="${GAIE,,}";;
*) echo "ERROR: --gaie must be true or false"; exit 1;;
esac
shift
;;
-h|--help)
usage
;;
-*)
error 'ERROR: Unknown option: ' "$1"
;;
*)
error "ERROR: Unknown argument: " "$1"
;;
esac
done
if [ -z "$FRAMEWORK" ]; then
FRAMEWORK=$DEFAULT_FRAMEWORK
fi
if [ -n "$FRAMEWORK" ]; then
FRAMEWORK=${FRAMEWORK^^}
if [[ -z "${FRAMEWORKS[$FRAMEWORK]}" ]]; then
error 'ERROR: Unknown framework: ' "$FRAMEWORK"
fi
fi
# Validate required arguments
if [[ -z "$MODEL" ]] || [[ -z "$DEPLOYMENT" ]]; then
if [[ -z "$MODEL" ]]; then
echo "ERROR: --model argument is required"
fi
if [[ -z "$DEPLOYMENT" ]]; then
echo "ERROR: --deployment argument is required"
fi
echo ""
usage
fi
# Construct paths based on new structure: recipes/<model>/<framework>/<deployment-type>/
MODEL_DIR="$RECIPES_DIR/$MODEL"
FRAMEWORK_DIR="$MODEL_DIR/${FRAMEWORK,,}"
DEPLOY_PATH="$FRAMEWORK_DIR/$DEPLOYMENT"
INTEGRATION="$([[ "${GAIE,,}" == "true" ]] && echo gaie || echo "")"
# Check if model directory exists
if [[ ! -d "$MODEL_DIR" ]]; then
echo "Error: Model directory '$MODEL' does not exist in $RECIPES_DIR"
echo "Available models:"
ls -1 "$RECIPES_DIR" | grep -v "\.sh$\|\.md$\|model-cache$" | sed 's/^/ /'
exit 1
fi
# Check if framework directory exists
if [[ ! -d "$FRAMEWORK_DIR" ]]; then
echo "Error: Framework directory '${FRAMEWORK,,}' does not exist in $MODEL_DIR"
echo "Available frameworks for $MODEL:"
ls -1 "$MODEL_DIR" | grep -v "\.sh$\|\.md$" | sed 's/^/ /'
exit 1
fi
# Check if deployment directory exists
if [[ ! -d "$DEPLOY_PATH" ]]; then
echo "Error: Deployment type '$DEPLOYMENT' does not exist in $FRAMEWORK_DIR"
echo "Available deployment types for $MODEL/${FRAMEWORK,,}:"
ls -1 "$FRAMEWORK_DIR" | grep -v "\.sh$\|\.md$" | sed 's/^/ /'
exit 1
fi
# Check if deployment files exist
DEPLOY_FILE="$DEPLOY_PATH/deploy.yaml"
PERF_FILE="$DEPLOY_PATH/perf.yaml"
if [[ ! -f "$DEPLOY_FILE" ]]; then
echo "Error: Deployment file '$DEPLOY_FILE' not found"
exit 1
fi
# Check if perf file exists (optional)
PERF_AVAILABLE=false
if [[ -f "$PERF_FILE" ]]; then
PERF_AVAILABLE=true
echo "Performance benchmark file found: $PERF_FILE"
else
echo "Performance benchmark file not found: $PERF_FILE (skipping benchmarks)"
fi
# Show deployment information
echo "======================================"
echo "Dynamo Recipe Deployment"
echo "======================================"
echo "Model: $MODEL"
echo "Framework: ${FRAMEWORK,,}"
echo "Deployment Type: $DEPLOYMENT"
echo "Namespace: $NAMESPACE"
echo "GAIE integration: $GAIE"
echo "======================================"
# Handle model downloading
MODEL_CACHE_DIR="$MODEL_DIR/model-cache"
echo "Creating PVC for model cache and downloading model..."
$DRY_RUN kubectl apply -n $NAMESPACE -f $MODEL_CACHE_DIR/model-cache.yaml
$DRY_RUN kubectl apply -n $NAMESPACE -f $MODEL_CACHE_DIR/model-download.yaml
# Wait for the model download to complete
MODEL_DOWNLOAD_JOB_NAME=$(grep "name:" $MODEL_CACHE_DIR/model-download.yaml | head -1 | awk '{print $2}')
echo "Waiting for job '$MODEL_DOWNLOAD_JOB_NAME' to complete..."
$DRY_RUN kubectl wait --for=condition=Complete job/$MODEL_DOWNLOAD_JOB_NAME -n $NAMESPACE --timeout=6000s
# Deploy the specified configuration
echo "Deploying $MODEL ${FRAMEWORK,,} $DEPLOYMENT configuration..."
$DRY_RUN kubectl apply -n $NAMESPACE -f $DEPLOY_FILE
if [[ "$INTEGRATION" == "gaie" ]]; then
# run gaie checks.
SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
"${SCRIPT_DIR}/gaie_checks.sh"
$DRY_RUN kubectl apply -R -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"
# For now do not run the benchmark
exit
fi
# Launch the benchmark job (if available)
if [[ "$PERF_AVAILABLE" == "true" ]]; then
echo "Launching benchmark job..."
$DRY_RUN kubectl apply -n $NAMESPACE -f $PERF_FILE
# Construct job name from the perf file
JOB_NAME=$(grep "name:" $PERF_FILE | head -1 | awk '{print $2}')
echo "Waiting for job '$JOB_NAME' to complete..."
$DRY_RUN kubectl wait --for=condition=Complete job/$JOB_NAME -n $NAMESPACE --timeout=6000s
# Print logs from the benchmark job
echo "======================================"
echo "Benchmark completed. Logs:"
echo "======================================"
$DRY_RUN kubectl logs job/$JOB_NAME -n $NAMESPACE
else
echo "======================================"
echo "Deployment completed successfully!"
echo "No performance benchmark available for this configuration."
echo "======================================"
fi
\ No newline at end of file
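One note on the removed script: with `set -u` active, a test such as `[ "$2" ]` aborts with an unbound-variable error when a flag is the last argument, so `missing_requirement` never runs. A minimal sketch of a guard follows; `require_arg` and `parse_model` are hypothetical names used only for illustration.

```bash
# The removed script also sets -e and pipefail; -u is the option that matters here.
set -u

# Hypothetical helper: print the value that follows a flag, or fail cleanly.
# "${2:-}" expands to an empty string instead of tripping `set -u`.
require_arg() {
  local flag="$1" value="${2:-}"
  if [[ -z "$value" ]]; then
    echo "ERROR: ${flag} requires an argument." >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Minimal parser for a single option, for illustration only.
parse_model() {
  local model=""
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --model)
        model="$(require_arg "$@")" || return 1
        shift 2
        ;;
      *)
        shift
        ;;
    esac
  done
  printf '%s\n' "$model"
}
```

Calling `parse_model --model` with no value then reports the missing argument on stderr and returns a non-zero status, rather than exiting on an unbound variable.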