# Dynamo Model Serving Recipes This repository contains production-ready recipes for deploying large language models using the Dynamo platform. Each recipe includes deployment configurations, performance benchmarking, and model caching setup. ## Contents - [Available Models](#available-models) - [Quick Start](#quick-start) - [Prerequisites](#prerequisites) - Deployment Methods - [Option 1: Automated Deployment](#option-1-automated-deployment) - [Option 2: Manual Deployment](#option-2-manual-deployment) ## Available Models | Model Family | Framework | Deployment Mode | GPU Requirements | Status | Benchmark |GAIE-integration | |-----------------|-----------|---------------------|------------------|--------|-----------|------------------| | llama-3-70b | vllm | agg | 4x H100/H200 | ✅ | ✅ |✅ | | llama-3-70b | vllm | disagg (1 node) | 8x H100/H200 | ✅ | ✅ | 🚧 | | llama-3-70b | vllm | disagg (multi-node) | 16x H100/H200 | ✅ | ✅ |🚧 | | deepseek-r1 | sglang | disagg (1 node, wide-ep) | 8x H200 | ✅ | 🚧 |🚧 | | deepseek-r1 | sglang | disagg (multi-node, wide-ep) | 16x H200 | ✅ | 🚧 |🚧 | | gpt-oss-120b | trtllm | agg | 4x GB200 | ✅ | ✅ |🚧 | **Legend:** - ✅ Functional - 🚧 Under development **Recipe Directory Structure:** Recipes are organized into a directory structure that follows the pattern: ```text / ├── model-cache/ │ ├── model-cache.yaml # PVC for model cache │ └── model-download.yaml # Job for model download ├── / │ └── / │ ├── deploy.yaml # DynamoGraphDeployment CRD and optional configmap for custom configuration │ └── perf.yaml (optional) # Performance benchmark └── README.md (optional) # Model documentation ``` ## Quick Start Follow the instructions in the [Prerequisites](#prerequisites) section to set up your environment. Choose your preferred deployment method: using the `run.sh` script or manual deployment steps. ## Prerequisites ### 1. Environment Setup Create a Kubernetes namespace and set environment variable: ```bash export NAMESPACE=your-namespace kubectl create namespace ${NAMESPACE} ``` ### 2. Deploy Dynamo Platform Install the Dynamo Cloud Platform following the [Quickstart Guide](../docs/kubernetes/README.md). ### 3. GPU Cluster Ensure your Kubernetes cluster has: - GPU nodes with appropriate GPU types (see model requirements above) - GPU operator installed - Sufficient GPU memory and compute resources ### 4. Container Registry Access Ensure access to NVIDIA container registry for runtime images: - `nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z` - `nvcr.io/nvidia/ai-dynamo/trtllm-runtime:x.y.z` - `nvcr.io/nvidia/ai-dynamo/sglang-runtime:x.y.z` ### 5. HuggingFace Access and Kubernetes Secret Creation Set up a kubernetes secret with the HuggingFace token for model download: ```bash # Update the token in the secret file vim hf_hub_secret/hf_hub_secret.yaml # Apply the secret kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE} ``` 6. Configure Storage Class ```bash # Check available storage classes kubectl get storageclass ``` Replace "your-storage-class-name" with your actual storage class in the file: `/model-cache/model-cache.yaml` ```yaml # In /model-cache/model-cache.yaml spec: storageClassName: "your-actual-storage-class" # Replace this ``` ## Option 1: Automated Deployment Use the `run.sh` script for fully automated deployment: **Note:** The script automatically: - Create model cache PVC and downloads the model - Deploy the model service - Runs performance benchmark if a `perf.yaml` file is present in the deployment directory #### Script Usage ```bash ./run.sh [OPTIONS] --model --framework --deployment ``` **Required Options:** - `--model `: Model name matching the directory name in the recipes directory (e.g., llama-3-70b, gpt-oss-120b, deepseek-r1) - `--framework `: Backend framework (`vllm`, `trtllm`, `sglang`) - `--deployment `: Deployment mode (e.g., agg, disagg, disagg-single-node, disagg-multi-node) **Optional Options:** - `--namespace `: Kubernetes namespace (default: dynamo) - `--dry-run`: Show commands without executing them - `-h, --help`: Show help message **Environment Variables:** - `NAMESPACE`: Kubernetes namespace (default: dynamo) #### Example Usage ```bash # Set up environment export NAMESPACE=your-namespace kubectl create namespace ${NAMESPACE} # Configure HuggingFace token kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE} # use run.sh script to deploy the model # Deploy Llama-3-70B with vLLM (aggregated mode) ./run.sh --model llama-3-70b --framework vllm --deployment agg # Deploy GPT-OSS-120B with TensorRT-LLM ./run.sh --model gpt-oss-120b --framework trtllm --deployment agg # Deploy DeepSeek-R1 with SGLang (disaggregated mode) ./run.sh --model deepseek-r1 --framework sglang --deployment disagg # Deploy with custom namespace ./run.sh --namespace my-namespace --model llama-3-70b --framework vllm --deployment agg # Dry run to see what would be executed ./run.sh --dry-run --model llama-3-70b --framework vllm --deployment agg ``` ## If deploying with Gateway API Inference extension GAIE 1. Follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE. 2. Apply manifests by running a script. ```bash # Match the block size to the cli value in your deployment file deploy.yaml: - "python3 -m dynamo.vllm ... --block-size 128" export DYNAMO_KV_BLOCK_SIZE=128 export EPP_IMAGE=nvcr.io/you/epp:tag # Add --gaie argument to the script i.e.: ./run.sh --model llama-3-70b --framework vllm --gaie agg --deployment agg ``` The script will perform gateway checks and apply the manifests. ## Option 2: Manual Deployment For step-by-step manual deployment follow these steps : ```bash # 0. Set up environment (see Prerequisites section) export NAMESPACE=your-namespace kubectl create namespace ${NAMESPACE} kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE} # 1. Download model (see Model Download section) kubectl apply -n $NAMESPACE -f /model-cache/ # 2. Deploy model (see Deployment section) kubectl apply -n $NAMESPACE -f ///deploy.yaml # 3. Run benchmarks (optional, if perf.yaml exists) kubectl apply -n $NAMESPACE -f ///perf.yaml ``` ### Step 1: Download Model ```bash # Start the download job kubectl apply -n $NAMESPACE -f /model-cache # Verify job creation kubectl get jobs -n $NAMESPACE | grep model-download ``` Monitor and wait for the model download to complete: ```bash # Wait for job completion (timeout after 100 minutes) kubectl wait --for=condition=Complete job/model-download -n $NAMESPACE --timeout=6000s # Check job status kubectl get job model-download -n $NAMESPACE # View download logs kubectl logs job/model-download -n $NAMESPACE ``` ### Step 2: Deploy Model Service ```bash # Navigate to the specific deployment configuration cd /// # Deploy the model service kubectl apply -n $NAMESPACE -f deploy.yaml # Verify deployment creation kubectl get deployments -n $NAMESPACE ``` #### Wait for Deployment Ready ```bash # Get deployment name from the deploy.yaml file DEPLOYMENT_NAME=$(grep "name:" deploy.yaml | head -1 | awk '{print $2}') # Wait for deployment to be ready (timeout after 10 minutes) kubectl wait --for=condition=available deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=1200s # Check deployment status kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE # Check pod status kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT_NAME ``` #### Verify Model Service ```bash # Check if service is running kubectl get services -n $NAMESPACE # Test model endpoint (port-forward to test locally) kubectl port-forward service/${DEPLOYMENT_NAME}-frontend 8000:8000 -n $NAMESPACE # Test the model API (in another terminal) curl http://localhost:8000/v1/models # Stop port-forward when done pkill -f "kubectl port-forward" ``` ### Step 3: Performance Benchmarking (Optional) Run performance benchmarks to evaluate model performance. Note that benchmarking is only available for models that include a `perf.yaml` file (optional): #### Launch Benchmark Job ```bash # From the deployment directory kubectl apply -n $NAMESPACE -f perf.yaml # Verify benchmark job creation kubectl get jobs -n $NAMESPACE ``` #### Monitor Benchmark Progress ```bash # Get benchmark job name PERF_JOB_NAME=$(grep "name:" perf.yaml | head -1 | awk '{print $2}') # Monitor benchmark logs in real-time kubectl logs -f job/$PERF_JOB_NAME -n $NAMESPACE # Wait for benchmark completion (timeout after 100 minutes) kubectl wait --for=condition=Complete job/$PERF_JOB_NAME -n $NAMESPACE --timeout=6000s ``` #### View Benchmark Results ```bash # Check final benchmark results kubectl logs job/$PERF_JOB_NAME -n $NAMESPACE | tail -50 ```