This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
## Overview
Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to:
- Distribute workloads across multiple physical nodes
- Scale GPU resources beyond a single machine
- Support large models requiring extensive tensor parallelism
- Achieve high availability and fault tolerance
## Basic requirements
-**Kubernetes Cluster**: Version 1.24 or later
-**GPU Nodes**: Multiple nodes with NVIDIA GPUs
-**High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance)
### Advanced Multinode Orchestration
#### Using Grove (default)
For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
-**[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads
- (optional) **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale
These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
**Features Enabled with Grove:**
- Hierarchical gang scheduling with `PodGangSet` and `PodClique`
- Multi-level horizontal auto-scaling
- Custom startup ordering for components
- Resource-aware rolling updates
[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is an optional enhancement that provides a Kubernetes native scheduler optimized for AI workloads at large scale.
**Features Enabled with KAI-Scheduler:**
- Network topology-aware pod placement
- AI workload-optimized scheduling algorithms
- GPU resource awareness and allocation
- Support for complex scheduling constraints
- Integration with Grove for enhanced capabilities
- Performance optimizations for large-scale deployments
#### Using LWS and Volcano
LWS is a simple multinode deployment mechanism that allows you to deploy a workload across multiple nodes.