---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Multi-Node
---

This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.

## Prerequisites

Multi-node deployments require:

- Multiple nodes with GPU resources
- Network connectivity between nodes (the faster, the better)
- Firewall rules allowing NATS/ETCD communication

## Infrastructure Setup

### Step 1: Start NATS/ETCD on Head Node

Start the required services on your head node. These endpoints must be accessible from all worker nodes:

```bash
# On head node (node-1)
docker compose -f deploy/docker-compose.yml up -d
```

Default ports:

- NATS: 4222
- ETCD: 2379

### Step 2: Configure Environment Variables

Set the head node IP address and service endpoints. **Set these on all nodes** so that the commands below can be copied and pasted unchanged:

```bash
# Set this on ALL nodes - replace with your actual head node IP
export HEAD_NODE_IP=""

# Service endpoints (set on all nodes)
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```

## Deployment Patterns

### Multi-node Aggregated Serving

Deploy vLLM workers across multiple nodes for horizontal scaling:

**Node 1 (Head Node)**: Run ingress and first worker

```bash
# Start ingress (backgrounded so the worker can start from the same shell)
python -m dynamo.frontend --router-mode kv &

# Start vLLM worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager
```

**Node 2**: Run additional worker

```bash
# Start vLLM worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager
```

### Multi-node Disaggregated Serving

Deploy prefill and decode workers on separate nodes for optimized resource utilization:

**Node 1**: Run ingress and decode worker

```bash
# Start ingress
python -m dynamo.frontend --router-mode kv &

# Start decode worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager \
  --disaggregation-mode decode
```

**Node 2**: Run prefill worker

```bash
# Start prefill worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager \
  --disaggregation-mode prefill
```

### Multi-node Tensor/Pipeline Parallelism

When the total parallelism (TP × PP) exceeds the number of GPUs on a single node, you need multiple nodes to host a **single** model instance. One node runs the full `dynamo.vllm` process (head) while additional nodes run in `--headless` mode, spawning only vLLM workers.

See [`examples/backends/vllm/launch/multi_node_tp.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/multi_node_tp.sh) for a ready-to-use launch script that supports both head and worker roles via `--head` / `--worker` flags. The model, TP size, and node count are configurable via `MODEL`, `TENSOR_PARALLEL_SIZE`, and `NNODES` environment variables.

For details on the flags used for multi-node distributed execution (`--master-addr`, `--master-port`, `--nnodes`, `--node-rank`), see the [vLLM multiprocessing docs](https://docs.vllm.ai/en/stable/serving/parallelism_scaling/#running-vllm-with-multiprocessing).
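
As a rough sizing check for the TP × PP rule in this section, the node count for a single model instance can be estimated from the total parallelism and the GPUs available per node. The sketch below is illustrative only: `GPUS_PER_NODE` and `PIPELINE_PARALLEL_SIZE` are hypothetical variable names chosen for this example and are not known to be read by the launch script, the values (TP=16, PP=1, 8 GPUs per node) are assumptions, and only `TENSOR_PARALLEL_SIZE` and `NNODES` mirror the environment variables named above.

```bash
#!/usr/bin/env bash
# Estimate how many nodes one model instance needs.
# All values below are example assumptions; adjust them for your cluster.
TENSOR_PARALLEL_SIZE=16     # TP degree (example value)
PIPELINE_PARALLEL_SIZE=1    # PP degree (hypothetical variable, example value)
GPUS_PER_NODE=8             # GPUs on each node (hypothetical variable)

# Total GPUs required by a single model instance: TP x PP
TOTAL_GPUS=$((TENSOR_PARALLEL_SIZE * PIPELINE_PARALLEL_SIZE))

# Ceiling division: round up to a whole number of nodes
NNODES=$(( (TOTAL_GPUS + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))

if [ "$NNODES" -gt 1 ]; then
  echo "Multi-node required: ${TOTAL_GPUS} GPUs across ${NNODES} nodes"
else
  echo "Single node is sufficient: ${TOTAL_GPUS} GPUs"
fi
```

With these example values the instance needs 16 GPUs, so it spans two 8-GPU nodes: one head and one `--headless` worker. If the result is 1, the aggregated or disaggregated patterns above apply without multi-node parallelism.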