# Container Development Guide

## Overview

The NVIDIA Dynamo project uses containerized development and deployment to maintain consistent environments across different AI inference frameworks and deployment scenarios. This directory contains the tools for building and running Dynamo containers:

### Core Components

- **`build.sh`** - A Docker image builder that creates containers for different AI inference frameworks (vLLM, TensorRT-LLM, SGLang). It handles framework-specific dependencies, multi-stage builds, and development vs production configurations.

- **`run.sh`** - A container runtime manager that launches Docker containers with proper GPU access, volume mounts, and environment configurations. It supports different development workflows from root-based legacy setups to user-based development environments.

- **Multiple Dockerfiles** - Framework-specific Dockerfiles that define the container images:
  - `Dockerfile.vllm` - For vLLM inference backend
  - `Dockerfile.trtllm` - For TensorRT-LLM inference backend
  - `Dockerfile.sglang` - For SGLang inference backend
  - `Dockerfile` - Base/standalone configuration
  - `Dockerfile.epp` - For building the Endpoint Picker (EPP) image

### Stage Summary for Frameworks

<details>
<summary>Show Stage Summary Table</summary>
Dockerfile.${FRAMEWORK} General Structure

Below is a summary of the general file structure for the framework Dockerfile stages. Some exceptions exist.

| Stage/Filepath | Base Image / Source (→ Used By) |
| --- | --- |
| **STAGE: dynamo_base** | **FROM ${BASE_IMAGE}** |
| /bin/uv, /bin/uvx | COPY from ghcr.io/astral-sh/uv:latest (→ framework, runtime) |
| /usr/bin/nats-server | Downloaded from GitHub (→ runtime) |
| /usr/local/bin/etcd/ | Downloaded from GitHub (→ runtime) |
| /usr/local/rustup/ | Installed via rustup-init (→ wheel_builder, dev) |
| /usr/local/cargo/ | Installed via rustup-init (→ wheel_builder, dev) |
| /usr/local/cuda/ | Inherited from BASE_IMAGE (→ wheel_builder, runtime) |
| **STAGE: wheel_builder** | **FROM quay.io/pypa/manylinux_2_28_${ARCH_ALT}** |
| /usr/local/ucx/ | Built from source (→ runtime) |
| /opt/nvidia/nvda_nixl/ | Built from source (→ runtime) |
| /opt/nvidia/nvda_nixl/lib64/ | Built from source (→ runtime) |
| /opt/dynamo/target/ | Cargo build output (→ runtime) |
| /opt/dynamo/dist/*.whl | Built wheels (→ runtime) |
| /opt/dynamo/dist/nixl/ | Built nixl wheels (→ runtime) |
| **STAGE: framework** | **FROM ${BASE_IMAGE}** |
| /opt/dynamo/venv/ | Created with uv venv (→ runtime) |
| /${FRAMEWORK_INSTALL} | Built framework (→ runtime) |
| **STAGE: runtime** | **FROM ${RUNTIME_IMAGE}** |
| /usr/local/cuda/{bin,include,nvvm}/ | COPY from dynamo_base |
| /usr/bin/nats-server | COPY from dynamo_base |
| /usr/local/bin/etcd/ | COPY from dynamo_base |
| /usr/local/ucx/ | COPY from wheel_builder |
| /opt/nvidia/nvda_nixl/ | COPY from wheel_builder |
| /opt/dynamo/wheelhouse/ | COPY from wheel_builder |
| /opt/dynamo/venv/ | COPY from framework |
| /opt/vllm/ | COPY from framework |
| /workspace/{tests,examples,deploy}/ | COPY from build context |
| **STAGE: dev** | **FROM runtime** |
| /usr/local/rustup/ | COPY from dynamo_base |
| /usr/local/cargo/ | COPY from dynamo_base |

</details>

### Why Containerization?

Each inference framework (vLLM, TensorRT-LLM, SGLang) has specific CUDA versions, Python dependencies, and system libraries. Containers provide consistent environments, framework isolation, and proper GPU configurations across development and production.

The scripts in this directory abstract away the complexity of Docker commands while providing fine-grained control over build and runtime configurations.

### Convenience Scripts vs Direct Docker Commands

The `build.sh` and `run.sh` scripts are convenience wrappers that simplify common Docker operations. They automatically handle:
- Framework-specific image selection and tagging
- GPU access configuration and runtime selection
- Volume mount setup for development workflows
- Environment variable management
- Build argument construction for multi-stage builds

**You can always use Docker commands directly** if you prefer more control or want to customize beyond what the scripts provide. Both scripts accept a `--dry-run` flag that prints the exact Docker commands they would execute, which makes the underlying operations easy to inspect and modify.
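
As a rough illustration of what "direct Docker commands" could look like, the sketch below only prints command lines; it does not run Docker. The Dockerfile path, the build context (`.`), and the exact flag set are assumptions made for this example; the authoritative commands are whatever `./build.sh --dry-run` and `./run.sh --dry-run` print in your checkout.

```bash
#!/bin/sh
# Sketch only: prints (does not execute) Docker commands that roughly mirror
# the wrapper scripts. Paths, tags, and flags here are assumptions.

print_build_cmd() {
    framework="$1"   # vllm, trtllm, or sglang
    target="$2"      # Docker stage to build, e.g. runtime
    printf 'docker build -f Dockerfile.%s --target %s -t dynamo:latest-%s-%s .\n' \
        "$framework" "$target" "$framework" "$target"
}

print_run_cmd() {
    image="$1"
    # --gpus and --network are standard docker run flags; run.sh may add more.
    printf 'docker run --rm -it --gpus all --network host %s\n' "$image"
}

print_build_cmd vllm runtime
print_run_cmd dynamo:latest-vllm-runtime
```

Comparing this output with the real `--dry-run` output is a quick way to see which extra mounts, environment variables, and build arguments the scripts add.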

## Development Targets Feature Matrix

**Note**: In Dynamo, "targets" and "Docker stages" are synonymous. Each target corresponds to a stage in the multi-stage Docker build. Similarly, "frameworks" and "engines" are synonymous (vLLM, TensorRT-LLM, SGLang).

| Feature | **runtime + `run.sh`** | **local-dev (`run.sh` or Dev Container)** | **dev + `run.sh`** (legacy) |
|---------|----------------------|-------------------------------------------|--------------------------|
| **Usage** | Benchmarking inference and deployments, non-root | Development, compilation, testing locally | Legacy workflows, root user, use with caution |
| **User** | dynamo (UID 1000) | dynamo (UID=host user) with sudo | root (UID 0, use with caution) |
| **Home Directory** | `/home/dynamo` | `/home/dynamo` | `/root` |
| **Working Directory** | `/workspace` (in-container or mounted) | `/workspace` (must be mounted w/ `--mount-workspace`) | `/workspace` (must be mounted w/ `--mount-workspace`) |
| **Rust Toolchain** | None (uses pre-built wheels) | System install (`/usr/local/rustup`, `/usr/local/cargo`) | System install (`/usr/local/rustup`, `/usr/local/cargo`) |
| **Cargo Target** | None | `/workspace/target` | `/workspace/target` |
| **Python Env** | venv (`/opt/dynamo/venv`) for vllm/trtllm, system site-packages for sglang | venv (`/opt/dynamo/venv`) for vllm/trtllm, system site-packages for sglang | venv (`/opt/dynamo/venv`) for vllm/trtllm, system site-packages for sglang |

**Note (SGLang)**: SGLang runtime uses system site-packages, but the `dev` image creates `/opt/dynamo/venv` (and `local-dev` inherits it from `dev`) for build tooling like `maturin`.
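
To check from inside a container which of the two Python setups is present, something like the following sketch may help. The helper name `python_env_kind` is hypothetical; the default path is the venv location listed in the matrix. Note that it only reports whether the venv exists, so an SGLang `dev` image would report a venv even though the SGLang runtime uses system site-packages.

```bash
#!/bin/sh
# Sketch: report whether the Dynamo venv exists at the given location.
# python_env_kind is a hypothetical helper, not part of build.sh or run.sh.
python_env_kind() {
    venv_dir="${1:-/opt/dynamo/venv}"
    if [ -x "$venv_dir/bin/python" ]; then
        echo "venv:$venv_dir"
    else
        echo "system"
    fi
}

python_env_kind
```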

## Usage Guidelines

- **Use runtime target**: for benchmarking inference and deployments. Runs as non-root `dynamo` user (UID 1000, GID 0) for security
- **Use local-dev + `run.sh`**: for command-line development and Docker mounted partitions. Runs as `dynamo` user with UID matched to your local user, GID 0. Add `-it` flag for interactive sessions
- **Use local-dev + Dev Container**: VS Code/Cursor Dev Container Plugin, using `dynamo` user with UID matched to your local user, GID 0
- **Use dev + `run.sh`**: Root user, use with caution. Runs as root for backward compatibility with early workflows

## Example Commands

### 1. runtime target (runs as non-root dynamo user):
```bash
# Build runtime image
./build.sh --framework vllm --target runtime

# Run runtime container
./run.sh --image dynamo:latest-vllm-runtime -it
```

### 2. local-dev + `run.sh` (runs as dynamo user with matched host UID/GID):
```bash
./run.sh --mount-workspace -it --image dynamo:latest-vllm-local-dev ...
```

### 3. local-dev + Dev Container Extension:
Use VS Code/Cursor Dev Container Extension with devcontainer.json configuration. The `dynamo` user UID is automatically matched to your local user.
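
The image tags used in this guide follow a visible pattern: `dynamo:latest-<framework>` for `dev`, with a `-runtime` or `-local-dev` suffix for the other two targets. A minimal sketch of that mapping is shown below; `image_name` is a hypothetical helper, and it covers only the default tags for these three targets (a custom `--tag` or other targets are out of scope).

```bash
#!/bin/sh
# Sketch: derive the default image tag from framework and target, following
# the names that appear in this guide. image_name is a hypothetical helper.
image_name() {
    framework="$1"
    target="$2"
    case "$target" in
        dev)
            printf 'dynamo:latest-%s\n' "$framework" ;;
        runtime|local-dev)
            printf 'dynamo:latest-%s-%s\n' "$framework" "$target" ;;
        *)
            echo "unknown target: $target" >&2
            return 1 ;;
    esac
}

image_name vllm runtime
image_name vllm local-dev
image_name trtllm dev
```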

## Build and Run Scripts Overview

### build.sh - Docker Image Builder

The `build.sh` script is responsible for building Docker images for different AI inference frameworks. It supports multiple frameworks and configurations:

**Purpose:**
- Builds Docker images for NVIDIA Dynamo with support for vLLM, TensorRT-LLM, SGLang, or standalone configurations
- Handles framework-specific dependencies and optimizations
- Manages build contexts, caching, and multi-stage builds
- Configures development vs production targets

**Key Features:**
- **Framework Support**: vLLM (default when `--framework` is not specified), TensorRT-LLM, SGLang, or NONE
- **Multi-stage Builds**: Builds each image in stages (for example `dynamo_base`, `wheel_builder`, `framework`, `runtime`, `dev`) on top of configurable base images
- **Development Targets**: Supports the `dev` and `local-dev` targets
- **Build Caching**: Docker layer caching and sccache support
- **GPU Optimization**: CUDA, EFA, and NIXL support

**Common Usage Examples:**

```bash
# Build vLLM dev image called dynamo:latest-vllm (default). This runs as root and is fine to use for inference, benchmarking, etc.
./build.sh

# Build both dev and local-dev images (integrated into build.sh). The dev image runs as root; the local-dev image runs as the dynamo user with UID/GID matched to your host user, which is useful when mounting partitions. It also contains development tools.
./build.sh --framework vllm --target local-dev

# Build TensorRT-LLM development image called dynamo:latest-trtllm
./build.sh --framework trtllm

# Build with custom tag
./build.sh --framework sglang --tag my-custom-tag

# Dry run to see commands
./build.sh --dry-run

# Build with no cache
./build.sh --no-cache

# Build with build arguments
./build.sh --build-arg CUSTOM_ARG=value
```

### Building the Frontend Image

The frontend image is a specialized container that includes the Dynamo components (Dynamo, NIXL, etc) along with the Endpoint Picker (EPP) for Kubernetes Gateway API Inference Extension integration. This image is primarily used for inference gateway deployments.

```bash
# Build the frontend image (automatically builds EPP image as a dependency)
./build.sh --framework none --target frontend
```

The build process automatically:
1. Clones the Gateway API Inference Extension (GAIE) repository
2. Builds the custom EPP image with Dynamo routing capabilities
3. Builds the frontend image with the EPP binary and Dynamo runtime components

For more details, see [`deploy/inference-gateway/README.md`](../deploy/inference-gateway/README.md).

**Note:** With `--framework none`, `ENABLE_MEDIA_NIXL` defaults to `false`.

#### Frontend Image Contents

The frontend image includes:
- **EPP (Endpoint Picker)**: Handles request routing and load balancing for inference gateway
- **Dynamo Runtime**: Core platform components and routing logic
- **NIXL**: NVIDIA Inference Xfer Library for high-performance network communication
- **Benchmarking Tools**: Performance testing utilities (aiperf, aiconfigurator, etc)
- **Python Environment**: Virtual environment with all required dependencies

#### Deployment

The frontend image is designed for Kubernetes deployment with the Gateway API Inference Extension. See [`deploy/inference-gateway/README.md`](../deploy/inference-gateway/README.md) for complete deployment instructions using Helm charts.

### run.sh - Container Runtime Manager

The `run.sh` script launches Docker containers with the appropriate configuration for development and inference workloads.

**Purpose:**
- Runs pre-built Dynamo Docker images with proper GPU access
- Configures volume mounts, networking, and environment variables
- Supports different development workflows (root vs user-based)
- Manages container lifecycle and resource allocation

**Key Features:**
- **GPU Management**: Automatic GPU detection and allocation
- **Volume Mounting**: Workspace and HuggingFace cache mounting
- **User Management**: Non-root `dynamo` user execution (UID 1000, GID 0), with optional `--user` flag to override
- **Network Configuration**: Configurable networking modes (host, bridge, none, container sharing)
- **Resource Limits**: Memory, file descriptors, and IPC configuration
- **Interactive Mode**: Use `-it` flag for interactive terminal sessions (required for shells, debugging, and interactive development)

**Common Usage Examples:**

```bash
# Basic container launch with dev image (runs as root by default, non-interactive)
./run.sh --image dynamo:latest-vllm -v $HOME/.cache:/root/.cache

# Interactive development with workspace mounted using dev image (runs as root)
./run.sh --image dynamo:latest-vllm --mount-workspace -it -v $HOME/.cache:/root/.cache

# Interactive development with local-dev image (runs as dynamo user with matched host UID/GID)
./run.sh --image dynamo:latest-vllm-local-dev --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache

# Use specific image and framework for development
./run.sh --image v0.1.0.dev.08cc44965-vllm-local-dev --framework vllm --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache

# Interactive development shell with workspace mounted (local-dev)
./run.sh --image dynamo:latest-vllm-local-dev --mount-workspace -v $HOME/.cache:/home/dynamo/.cache -it -- bash

# Development with custom environment variables
./run.sh --image dynamo:latest-vllm-local-dev -e CUDA_VISIBLE_DEVICES=0,1 --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache

# Dry run to see docker command
./run.sh --dry-run

# Development with custom volume mounts
./run.sh --image dynamo:latest-vllm-local-dev -v /host/path:/container/path --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache

# Run runtime image as non-root dynamo user (for production)
./run.sh --image dynamo:latest-vllm-runtime -v $HOME/.cache:/home/dynamo/.cache

# Run dev image as specific user (override default root)
./run.sh --image dynamo:latest-vllm --user dynamo -v $HOME/.cache:/home/dynamo/.cache
```

### Network Configuration Options

The `run.sh` script supports different networking modes via the `--network` flag (defaults to `host`):

#### Host Networking (Default)
```bash
# Examples with dynamo user
./run.sh --image dynamo:latest-vllm-local-dev --network host -v $HOME/.cache:/home/dynamo/.cache
./run.sh --image dynamo:latest-vllm-local-dev -v $HOME/.cache:/home/dynamo/.cache
```
**Use cases:**
- High-performance ML inference (default for GPU workloads)
- Services that need direct host port access
- Maximum network performance with minimal overhead
- Sharing services with the host machine (NATS, etcd, etc.)

**⚠️ Port Sharing Limitation:** Host networking shares all ports with the host machine, which means you can only run **one instance** of services like NATS (port 4222) or etcd (port 2379) across all containers and the host.

#### Bridge Networking (Isolated)
```bash
# CI/testing with isolated bridge networking and host cache sharing (no -it for automated CI)
./run.sh --image dynamo:latest-vllm --mount-workspace --network bridge -v $HOME/.cache:/root/.cache
```
**Use cases:**
- Secure isolation from host network
- CI/CD pipelines requiring complete isolation
- When you need full control over which ports are exposed
- Exposing specific services to host while maintaining isolation

**Note:** For port sharing with the host, use the `--port` or `-p` option with format `host_port:container_port` (e.g., `--port 8000:8000` or `-p 9081:8081`) to expose specific container ports to the host.
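
Before passing a mapping to `--port`/`-p`, a wrapper script might want to sanity-check the `host_port:container_port` format. The sketch below is one way to do that in POSIX shell; `valid_port_mapping` is a hypothetical helper, and it deliberately accepts only the simple two-number form described above (not IP prefixes, ranges, or protocol suffixes that plain Docker also understands).

```bash
#!/bin/sh
# Sketch: accept only "host_port:container_port" with both ports in 1-65535.
# valid_port_mapping is a hypothetical helper, not part of run.sh.
valid_port_mapping() {
    mapping="$1"
    case "$mapping" in
        *:*:*|:*|*:|*[!0-9:]*) return 1 ;;  # extra colons, empty side, non-digits
        *:*) ;;                             # exactly one colon, digits only
        *) return 1 ;;                      # no colon at all
    esac
    host_port="${mapping%%:*}"
    container_port="${mapping##*:}"
    for port in "$host_port" "$container_port"; do
        [ "$port" -ge 1 ] && [ "$port" -le 65535 ] || return 1
    done
    return 0
}

for m in 8000:8000 9081:8081 8000 http:80; do
    if valid_port_mapping "$m"; then
        echo "$m ok"
    else
        echo "$m rejected"
    fi
done
```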

#### No Networking ⚠️ **LIMITED FUNCTIONALITY**
```bash
# Complete network isolation - no external connectivity
./run.sh --image dynamo:latest-vllm --network none --mount-workspace -it -v $HOME/.cache:/root/.cache

# Same with local-dev image (dynamo user with matched host UID/GID)
./run.sh --image dynamo:latest-vllm-local-dev --network none --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
```
**⚠️ WARNING: `--network none` severely limits Dynamo functionality:**
- **No model downloads** - HuggingFace models cannot be downloaded
- **No API access** - Cannot reach external APIs or services
- **No distributed inference** - Multi-node setups won't work
- **No monitoring/logging** - External monitoring systems unreachable
- **Limited debugging** - Cannot access external debugging tools

**Very limited use cases:**
- Pre-downloaded models with purely local processing
- Air-gapped security environments (models must be pre-staged)

#### Container Network Sharing
Use `--network container:name` to share the network namespace with another container.

**Use cases:**
- Sidecar patterns (logging, monitoring, caching)
- Service mesh architectures
- Sharing network namespaces between related containers

See Docker documentation for `--network container:name` usage.

#### Custom Networks
Use custom Docker networks for multi-container applications. Create with `docker network create` and specify with `--network network-name`.

**Use cases:**
- Multi-container applications
- Service discovery by container name

See Docker documentation for custom network creation and management.
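
A minimal sketch of the custom-network flow, assuming a hypothetical network name `dynamo-net`. `docker network create` is the standard Docker CLI command, and the `run.sh` flags are the ones used elsewhere in this guide. The script defaults to printing the commands so it can be read and tried safely; set `DRY_RUN=0` to actually execute them.

```bash
#!/bin/sh
# Sketch: create a user-defined Docker network and attach a Dynamo container.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

NETWORK_NAME="dynamo-net"   # hypothetical name; pick your own

run docker network create "$NETWORK_NAME"
run ./run.sh --image dynamo:latest-vllm-local-dev --network "$NETWORK_NAME" --mount-workspace -it
```

Containers attached to the same user-defined network can usually reach each other by container name, which is what the service discovery use case relies on.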

#### Network Mode Comparison

| Mode | Performance | Security | Use Case | Dynamo Compatibility | Port Sharing | Port Publishing |
|------|-------------|----------|----------|---------------------|---------------|-----------------|
| `host` | Highest | Lower | ML/GPU workloads, high-performance services | ✅ Full | ⚠️ **Shared with host** (one NATS/etcd only) | ❌ Not needed |
| `bridge` | Good | Higher | General web services, controlled port exposure | ✅ Full | ✅ Isolated ports | ✅ `-p host:container` |
| `none` | N/A | Highest | Air-gapped environments only | ⚠️ **Very Limited** | ✅ No network | ❌ No network |
| `container:name` | Good | Medium | Sidecar patterns, shared network stacks | ✅ Full | ⚠️ Shared with target container | ❌ Use target's ports |
| Custom networks | Good | Medium | Multi-container applications | ✅ Full | ✅ Isolated ports | ✅ `-p host:container` |

## Workflow Examples

### Development Workflow
```bash
# 1. Build local-dev image (creates both dynamo:latest-vllm and dynamo:latest-vllm-local-dev)
./build.sh --framework vllm --target local-dev

# 2. Run development container using the local-dev image
./run.sh --image dynamo:latest-vllm-local-dev --mount-workspace -v $HOME/.cache:/home/dynamo/.cache -it

# 3. Inside container, run inference (requires both frontend and backend)
# Start frontend
python -m dynamo.frontend &

# Start backend (vLLM example)
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --gpu-memory-utilization 0.20 &
```

### Production Workflow
```bash
# 1. Build production runtime image (runs as non-root dynamo user)
./build.sh --framework vllm --target runtime

# 2. Run production container as non-root dynamo user
./run.sh --image dynamo:latest-vllm-runtime --gpus all -v $HOME/.cache:/home/dynamo/.cache
```

### Testing Workflow
```bash
# 1. Build dev image
./build.sh --framework vllm --no-cache

# 2. Run tests with network isolation for reproducible results (no -it needed for CI)
./run.sh --image dynamo:latest-vllm --mount-workspace --network bridge -v $HOME/.cache:/root/.cache -- python -m pytest tests/

# 3. Inside the container with bridge networking, start services
# Note: Services are only accessible from the same container - no port conflicts with host
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
python -m dynamo.frontend &

# 4. Start worker backend (choose one framework):
# vLLM
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --gpu-memory-utilization 0.20 --enforce-eager --no-enable-prefix-caching --max-num-seqs 64 &

# SGLang
DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model Qwen/Qwen3-0.6B --mem-fraction-static 0.20 --max-running-requests 64 &

# TensorRT-LLM
DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model Qwen/Qwen3-0.6B --free-gpu-memory-fraction 0.20 --max-num-tokens 8192 --max-batch-size 64 &
```

**Framework-Specific GPU Memory Arguments:**
- **vLLM**: `--gpu-memory-utilization 0.20` (use 20% GPU memory), `--enforce-eager` (disable CUDA graphs), `--no-enable-prefix-caching` (save memory), `--max-num-seqs 64` (max concurrent sequences)
- **SGLang**: `--mem-fraction-static 0.20` (20% GPU memory for static allocation), `--max-running-requests 64` (max concurrent requests)
- **TensorRT-LLM**: `--free-gpu-memory-fraction 0.20` (use 20% of the free GPU memory for the KV cache), `--max-num-tokens 8192` (max tokens in batch), `--max-batch-size 64` (max batch size)
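
To get a feel for what a fraction such as `0.20` means in absolute terms, the arithmetic can be sketched as below. `gpu_budget_mib` is a hypothetical helper that takes the total memory in MiB and a whole-number percentage. It matches the "fraction of total GPU memory" reading used for vLLM; for TensorRT-LLM the fraction applies to free memory, so the input would be the free amount instead.

```bash
#!/bin/sh
# Sketch: convert a memory percentage into MiB using integer arithmetic.
# gpu_budget_mib is a hypothetical helper, not part of Dynamo.
gpu_budget_mib() {
    total_mib="$1"
    percent="$2"
    echo $(( total_mib * percent / 100 ))
}

# On a real machine the total could come from:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
# Here we assume an 80 GiB GPU (81920 MiB) and a 20% budget.
gpu_budget_mib 81920 20   # prints 16384
```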