# Llama-3.3-70B Recipes Production-ready deployments for **Llama-3.3-70B-Instruct** using vLLM with FP8 dynamic quantization. ## Available Configurations | Configuration | GPUs | Mode | Description | |--------------|------|------|-------------| | [**vllm/agg**](vllm/agg/) | 4x H100/H200 | Aggregated | Single-node, TP4 | | [**vllm/disagg-single-node**](vllm/disagg-single-node/) | 8x H100/H200 | Disaggregated | Prefill/decode separation on one node | | [**vllm/disagg-multi-node**](vllm/disagg-multi-node/) | 16x H100/H200 | Disaggregated | 2 nodes, 8 GPUs each | ## Prerequisites 1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md) 2. **GPU cluster** with H100 or H200 GPUs matching the configuration requirements 3. **HuggingFace token** with access to Llama models ## Quick Start ```bash # Set namespace export NAMESPACE=dynamo-demo kubectl create namespace ${NAMESPACE} # Create HuggingFace token secret kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN="your-token-here" \ -n ${NAMESPACE} # Download model (update storageClassName in model-cache.yaml first!) kubectl apply -f model-cache/ -n ${NAMESPACE} kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s # Deploy (choose one configuration) kubectl apply -f vllm/agg/deploy.yaml -n ${NAMESPACE} # OR: kubectl apply -f vllm/disagg-single-node/deploy.yaml -n ${NAMESPACE} # OR: kubectl apply -f vllm/disagg-multi-node/deploy.yaml -n ${NAMESPACE} ``` ## Test the Deployment ```bash # Port-forward the frontend kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE} # Send a test request curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50 }' ``` ## Model Details - **Model**: `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` - **Quantization**: FP8 dynamic (applied at runtime) - **Context length**: Default model context ## Notes - Update `storageClassName` in `model-cache/model-cache.yaml` to match your cluster before deploying - Model download takes approximately 15-30 minutes depending on network speed - For GAIE (Gateway API Inference Extension) integration, `kubectl apply` the files from the corresponding subfolder i.e. [vllm/agg/gaie/](vllm/agg/gaie/)