multi-node.md 2.26 KB
Newer Older
Alec's avatar
Alec committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Multi-node Examples

This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.

## Prerequisites

Multi-node deployments require:
- Multiple nodes with GPU resources
- Network connectivity between nodes (faster the better)
- Firewall rules allowing NATS/ETCD communication

## Infrastructure Setup

### Step 1: Start NATS/ETCD on Head Node

Start the required services on your head node. These endpoints must be accessible from all worker nodes:

```bash
# On head node (node-1)
25
docker compose -f deploy/docker-compose.yml up -d
Alec's avatar
Alec committed
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
```

Default ports:
- NATS: 4222
- ETCD: 2379

### Step 2: Configure Environment Variables

Set the head node IP address and service endpoints. **Set this on all nodes** for easy copy-paste:

```bash
# Set this on ALL nodes - replace with your actual head node IP
export HEAD_NODE_IP="<your-head-node-ip>"

# Service endpoints (set on all nodes)
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```

## Deployment Patterns

### Multi-node Aggregated Serving

Deploy vLLM workers across multiple nodes for horizontal scaling:

**Node 1 (Head Node)**: Run ingress and first worker
```bash
# Start ingress
Alec's avatar
Alec committed
54
python -m dynamo.frontend --router-mode kv
Alec's avatar
Alec committed
55
56

# Start vLLM worker
Alec's avatar
Alec committed
57
python -m dynamo.vllm \
Alec's avatar
Alec committed
58
59
60
61
62
63
64
65
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager
```

**Node 2**: Run additional worker
```bash
# Start vLLM worker
Alec's avatar
Alec committed
66
python -m dynamo.vllm \
Alec's avatar
Alec committed
67
68
69
70
71
72
73
74
75
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager
```

### Multi-node Disaggregated Serving

Deploy prefill and decode workers on separate nodes for optimized resource utilization:

76
**Node 1**: Run ingress and decode worker
Alec's avatar
Alec committed
77
78
```bash
# Start ingress
Alec's avatar
Alec committed
79
python -m dynamo.frontend --router-mode kv &
Alec's avatar
Alec committed
80
81

# Start prefill worker
Alec's avatar
Alec committed
82
python -m dynamo.vllm \
Alec's avatar
Alec committed
83
84
85
86
87
  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager
```

88
**Node 2**: Run prefill worker
Alec's avatar
Alec committed
89
90
```bash
# Start decode worker
Alec's avatar
Alec committed
91
python -m dynamo.vllm \
Alec's avatar
Alec committed
92
93
94
95
96
  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager \
  --is-prefill-worker
```