multi-node.md 2.52 KB
Newer Older
Alec's avatar
Alec committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Multi-node Examples

This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.

## Prerequisites

Multi-node deployments require:
- Multiple nodes with GPU resources
- Network connectivity between nodes (faster the better)
- Firewall rules allowing NATS/ETCD communication

## Infrastructure Setup

### Step 1: Start NATS/ETCD on Head Node

Start the required services on your head node. These endpoints must be accessible from all worker nodes:

```bash
# On head node (node-1)
docker compose -f deploy/metrics/docker-compose.yml up -d
```

Default ports:
- NATS: 4222
- ETCD: 2379

### Step 2: Configure Environment Variables

Set the head node IP address and service endpoints. **Set this on all nodes** for easy copy-paste:

```bash
# Set this on ALL nodes - replace with your actual head node IP
export HEAD_NODE_IP="<your-head-node-ip>"

# Service endpoints (set on all nodes)
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```

## Deployment Patterns

### Multi-node Aggregated Serving

Deploy vLLM workers across multiple nodes for horizontal scaling:

**Node 1 (Head Node)**: Run ingress and first worker
```bash
# Start ingress
Alec's avatar
Alec committed
54
python -m dynamo.frontend --router-mode kv
Alec's avatar
Alec committed
55
56

# Start vLLM worker
Alec's avatar
Alec committed
57
python -m dynamo.vllm \
Alec's avatar
Alec committed
58
59
60
61
62
63
64
65
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager
```

**Node 2**: Run additional worker
```bash
# Start vLLM worker
Alec's avatar
Alec committed
66
python -m dynamo.vllm \
Alec's avatar
Alec committed
67
68
69
70
71
72
73
74
75
76
77
78
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager
```

### Multi-node Disaggregated Serving

Deploy prefill and decode workers on separate nodes for optimized resource utilization:

**Node 1**: Run ingress and prefill workers
```bash
# Start ingress
Alec's avatar
Alec committed
79
python -m dynamo.frontend --router-mode kv &
Alec's avatar
Alec committed
80
81

# Start prefill worker
Alec's avatar
Alec committed
82
python -m dynamo.vllm \
Alec's avatar
Alec committed
83
84
85
86
87
88
89
90
  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager
```

**Node 2**: Run decode workers
```bash
# Start decode worker
Alec's avatar
Alec committed
91
python -m dynamo.vllm \
Alec's avatar
Alec committed
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager \
  --is-prefill-worker
```


## TODO

## Large Model Deployment

For models requiring more GPUs than available on a single node such as tensor-parallel-size 16:

**Node 1**: First part of tensor-parallel model
```bash
# Start ingress
Alec's avatar
Alec committed
108
python -m dynamo.frontend --router-mode kv &
Alec's avatar
Alec committed
109
110
```