README.md 25.2 KB
Newer Older
1
2
3
4
5
6
# Container Development Guide

## Overview

The NVIDIA Dynamo project uses containerized development and deployment to maintain consistent environments across different AI inference frameworks and deployment scenarios. This directory contains the tools for building and running Dynamo containers:

7
### Rendering Requirements:
8
9
10
11
- Python
- Python Packages:
  - pyyaml
  - jinja2
12

13
14
15
16
17
18
### Core Components

- **`render.py`** - A render script used to generate Dockerfiles for AI inference frameworks (vLLM, TensorRT-LLM, SGLang) and the frontend image. The generated Dockerfile includes the needed multi-stage steps for development vs production configurations.

- **`run.sh`** - A container runtime manager that launches Docker containers with proper GPU access, volume mounts, and environment configurations. It supports different development workflows from root-based legacy setups to user-based development environments.

19
20
21
22
### Stage Summary for Frameworks

<details>
<summary>Show Stage Summary Table</summary>
23
Dockerfile General Structure
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47

Below is a summary of the general file structure for the framework Dockerfile stages. Some exceptions exist.

| Stage/Filepath | Target |
| --- | --- |
| **STAGE dynamo_base** | **FROM ${BASE_IMAGE}** |
| /bin/uv, /bin/uvx | COPY from ghcr.io/astral-sh/uv:latest (→ framework, runtime) |
|  /usr/bin/nats-server | Downloaded from GitHub (→ runtime) |
|  /usr/local/bin/etcd/ | Downloaded from GitHub (→ runtime) |
|  /usr/local/rustup/ | Installed via rustup-init (→ wheel_builder, dev) |
|  /usr/local/cargo/ | Installed via rustup-init (→ wheel_builder, dev) |
|  /usr/local/cuda/ | Inherited from BASE_IMAGE (→ wheel_builder, runtime) |
| **STAGE: wheel_builder** | **FROM quay.io/pypa/manylinux_2_28_${ARCH_ALT}** |
|  /usr/local/ucx/ | Built from source (→ runtime)
|  /opt/nvidia/nvda_nixl/ | Built from source (→ runtime)
|  /opt/nvidia/nvda_nixl/lib64/ | Built from source (→ runtime)
|  /opt/dynamo/target/ | Cargo build output (→ runtime)
|  /opt/dynamo/dist/*.whl | Built wheels (→ runtime)
|  /opt/dynamo/dist/nixl/ | Built nixl wheels (→ runtime)
| **STAGE: framework** | **FROM ${BASE_IMAGE}** |
|  /opt/dynamo/venv/ | Created with uv venv (→ runtime)
|  /${FRAMEWORK_INSTALL} | Built framework (→ runtime)
| **STAGE: runtime** | **FROM ${RUNTIME_IMAGE}** |
|  /usr/local/cuda/{bin,include,nvvm}/ | COPY from dynamo_base |
48
49
50
|  /usr/bin/nats-server | COPY from dynamo_base |
|  /usr/local/bin/etcd/ | COPY from dynamo_base |
|  /usr/local/ucx/ | COPY from wheel_builder |
51
52
53
54
55
|  /opt/nvidia/nvda_nixl/ | COPY from wheel_builder |
|  /opt/dynamo/wheelhouse/ | COPY from wheel_builder |
|  /opt/dynamo/venv/ | COPY from framework |
|  /opt/vllm/ | COPY from framework |
|  /workspace/{tests,examples,deploy}/ |COPY from build context |
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
| **STAGE: dev** | **FROM runtime (via dev/Dockerfile.dev)** |
|  /usr/bin/, /usr/lib/, etc. | COPY from dynamo_tools (dev utilities, git, sudo, etc.) |
|  /usr/local/rustup/ | COPY from dynamo_tools |
|  /usr/local/cargo/ | COPY from dynamo_tools |
|  /usr/local/bin/maturin | COPY from dynamo_tools |
|  /opt/dynamo/venv/ | For SGLang: created with --system-site-packages, includes uv and maturin |
|  /workspace/ | Full source code copied from build context with editable install |
|  **💡 Recommendation** | **Use --mount-workspace with run.sh** for live editing (bind mount overrides baked-in code) |
|  PATH | Includes /opt/dynamo/venv/bin:/usr/local/cargo/bin |
|  umask 002 | Login shell sources /etc/profile.d/00-umask.sh for group-writable files |
| **STAGE: local-dev** | **FROM dev (via dev/Dockerfile.dev)** |
|  /home/dynamo/.rustup/ | COPY from /usr/local/rustup (user-writable) |
|  USER | dynamo (UID/GID remapped to match host user) |
|  **💡 Recommendation** | **Use --mount-workspace with run.sh** for live editing (bind mount overrides baked-in code) |
|  RUSTUP_HOME | /home/dynamo/.rustup |
|  CARGO_HOME | /home/dynamo/.cargo |
72
73
</details>

74
75
76
77
78
79
80
81
### Why Containerization?

Each inference framework (vLLM, TensorRT-LLM, SGLang) has specific CUDA versions, Python dependencies, and system libraries. Containers provide consistent environments, framework isolation, and proper GPU configurations across development and production.

The scripts in this directory abstract away the complexity of Docker commands while providing fine-grained control over build and runtime configurations.

### Convenience Scripts vs Direct Docker Commands

82
The `run.sh` script and rendering scripts are conveniences that simplify common Docker operations. They automatically handle:
83
84
85
86
87
- GPU access configuration and runtime selection
- Volume mount setup for development workflows
- Environment variable management
- Build argument construction for multi-stage builds

88
**You can always use Docker commands directly** if you prefer more control or want to customize beyond what the scripts provide. `run.sh` supports a `--dry-run` flag to show you the exact commands they would execute, making it easy to understand and modify the underlying operations.
89
90
91

## Development Targets Feature Matrix

92
93
94
95
96
97
98
**Note**: In Dynamo, "targets" and "Docker stages" are synonymous. Each target corresponds to a stage in the multi-stage Docker build. Similarly, "frameworks" and "engines" are synonymous (vLLM, TensorRT-LLM, SGLang).

| Feature | **runtime + `run.sh`** | **local-dev (`run.sh` or Dev Container)** | **dev + `run.sh`** (legacy) |
|---------|----------------------|-------------------------------------------|--------------------------|
| **Usage** | Benchmarking inference and deployments, non-root | Development, compilation, testing locally | Legacy workflows, root user, use with caution |
| **User** | dynamo (UID 1000) | dynamo (UID=host user) with sudo | root (UID 0, use with caution) |
| **Home Directory** | `/home/dynamo` | `/home/dynamo` | `/root` |
99
| **Working Directory** | `/workspace` (in-container or mounted) | `/workspace` (baked-in, optionally mounted w/ `--mount-workspace`) | `/workspace` (baked-in, optionally mounted w/ `--mount-workspace`) |
100
101
| **Rust Toolchain** | None (uses pre-built wheels) | System install (`/usr/local/rustup`, `/usr/local/cargo`) | System install (`/usr/local/rustup`, `/usr/local/cargo`) |
| **Cargo Target** | None | `/workspace/target` | `/workspace/target` |
102
| **Python Env** | venv (`/opt/dynamo/venv`) for vllm/trtllm, system site-packages for sglang | venv (`/opt/dynamo/venv`) for all frameworks (with --system-site-packages for sglang) | venv (`/opt/dynamo/venv`) for all frameworks (with --system-site-packages for sglang) |
103

104
**Note (SGLang)**: SGLang runtime uses system site-packages, but the `dev` and `local-dev` images create `/opt/dynamo/venv` with `--system-site-packages` for build tooling like `maturin` and `uv`.
105

106
107
## Usage Guidelines

108
109
110
111
- **Use runtime target**: for benchmarking inference and deployments. Runs as non-root `dynamo` user (UID 1000, GID 0) for security
- **Use local-dev + `run.sh`**: for command-line development and Docker mounted partitions. Runs as `dynamo` user with UID matched to your local user, GID 0. Add `-it` flag for interactive sessions
- **Use local-dev + Dev Container**: VS Code/Cursor Dev Container Plugin, using `dynamo` user with UID matched to your local user, GID 0
- **Use dev + `run.sh`**: Root user, use with caution. Runs as root for backward compatibility with early workflows
112
113
114

## Example Commands

115
### 1. runtime target (runs as non-root dynamo user):
116
```bash
117
# Build runtime image
118
container/render.py --framework vllm --target runtime --output-short-filename
119
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
120
121

# Run runtime container
122
container/run.sh --image dynamo:latest-vllm-runtime -it
123
124
```

125
126
127
128
129
130
131
### 2. test image (layers test deps on top of runtime):
```bash
# Build test image from a runtime image (for running tests locally)
docker build -f container/Dockerfile.test --build-arg BASE_IMAGE=dynamo:latest-vllm-runtime -t dynamo:latest-vllm-test .
```

### 3. local-dev + `run.sh` (runs as dynamo user with matched host UID/GID):
132
```bash
133
run.sh --mount-workspace -it --image dynamo:latest-vllm-local-dev ...
134
135
```

136
### 3. local-dev + Dev Container Extension:
137
Use VS Code/Cursor Dev Container Extension with devcontainer.json configuration. The `dynamo` user UID is automatically matched to your local user.
138
139
140

## Build and Run Scripts Overview

141
### render.py - Docker Image Generator
142

143
The `render.py` script is responsible for generating Dockerfiles for different AI inference frameworks. It supports multiple frameworks and configurations:
144
145

**Purpose:**
146
- Generates Dockerfiles for NVIDIA Dynamo with support for vLLM, TensorRT-LLM, SGLang, or standalone configurations
147
148
149
150
151
- Handles framework-specific dependencies and optimizations
- Manages build contexts, caching, and multi-stage builds
- Configures development vs production targets

**Key Features:**
152
- **Framework Support**: vLLM (default when --framework not specified), TensorRT-LLM, SGLang, or NONE (standalone Dynamo)
153
- **Multi-stage Builds**: Build process with base images
154
- **Development Targets**: Supports `dev`, `runtime`, and `local-dev` targets via `render.py`.
155
156
157
- **Build Caching**: Docker layer caching and sccache support
- **GPU Optimization**: CUDA, EFA, and NIXL support

158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
#### BuildKit cache mounts in Dockerfiles

The framework Dockerfiles use BuildKit cache mounts (`RUN --mount=type=cache,...`) to reduce repeated downloads across builds. These caches are stored in Docker/BuildKit’s cache storage on the host (not in your host `~/.cache`), and are shared across builds that use the same builder.

Common cache mount targets:
- `--mount=type=cache,target=/root/.cache/uv`: `uv` download cache (wheels/sdists, git checkouts used by `uv`, etc.)
- `--mount=type=cache,target=/var/cache/apt,sharing=locked`: apt download cache (`sharing=locked` avoids apt/dpkg races with concurrent builds)
- `--mount=type=cache,target=/var/cache/{yum,dnf},sharing=locked`: yum/dnf metadata cache (`sharing=locked` avoids corruption with concurrent builds)
- `--mount=type=cache,target=/root/.cargo/{registry,git}`: Cargo crate/git download caches (Cargo has its own locking; no `sharing=locked` needed)

To inspect cache usage:
```bash
docker buildx du
docker info --format 'DockerRootDir: {{.DockerRootDir}}'
```

##### Inspecting BuildKit cache on the host (quick checklist)

1. Quick summary:
```bash
docker buildx du | tail -5
```

2. Find Docker root:
```bash
docker info | grep "Docker Root Dir"
# Output example: Docker Root Dir: /var/lib/docker
```

3. Check executor storage size:
```bash
DOCKER_ROOT="$(docker info --format '{{.DockerRootDir}}')"
sudo du -sh "${DOCKER_ROOT}/buildkit/executor" 2>/dev/null || true
```

4. Find specific caches (example: uv cache under BuildKit executor rootfs):
```bash
DOCKER_ROOT="$(docker info --format '{{.DockerRootDir}}')"
sudo sh -c 'find '"${DOCKER_ROOT}"'/buildkit/executor/*/rootfs/root/.cache/uv -type d 2>/dev/null | while read -r dir; do
  parent=$(dirname "$(dirname "$(dirname "$dir")")")
  du -sh "$parent/root/.cache/uv" 2>/dev/null
done'
```

5. List all large cache directories:
```bash
DOCKER_ROOT="$(docker info --format '{{.DockerRootDir}}')"
sudo sh -c 'du -sh '"${DOCKER_ROOT}"'/buildkit/executor/* 2>/dev/null | sort -h | tail -10'
```

Cleanup commands:
```bash
# Safe: clean only reclaimable cache
docker buildx prune

# Aggressive: clean everything
docker buildx prune --all

# Time-based: remove cache older than 3 days
docker buildx prune --filter until=72h
```

Current cache types (as mounted in various Dockerfiles):
1. `/root/.cache/uv` and `/home/dynamo/.cache/uv` - Python packages (uv; match the current `USER`)
2. `/root/.cargo/registry` - Rust crates
3. `/root/.cargo/git` - Rust git deps
4. `/var/cache/yum`, `/var/cache/dnf` - AlmaLinux packages
5. `/var/cache/apt` - Ubuntu packages

Note: `uv` commands set `UV_CACHE_DIR` per `RUN` so `uv` always uses the same path as the cache mount (instead of relying on `$HOME`).

229
230
> **💡 Tip**: The `dev` and `local-dev` images have source code baked in, but **using `--mount-workspace` with `run.sh` is recommended for development** to bind mount your local workspace for live editing.

231
232
233
**Common Usage Examples:**

```bash
234
# Build a vLLM local-dev image called dynamo:latest-vllm-local-dev. The local-dev image will run as `dynamo` with UID/GID matched to your host user,
235
# which is useful when mounting partitions for development.
236
container/render.py --framework=vllm --target=local-dev --output-short-filename
237
docker build --build-arg USER_UID=$(id -u) --build-arg USER_GID=$(id -g) -f container/rendered.Dockerfile -t dynamo:latest-vllm-local-dev .
238

239
240
# Build TensorRT-LLM runtime image called dynamo:latest-trtllm-runtime
container/render.py --framework=trtllm --target=runtime --output-short-filename --cuda-version=13.1
241
docker build -t dynamo:latest-trtllm-runtime -f container/rendered.Dockerfile .
242
243
```

244
245
246
247
248
249
After building, use `run.sh` to launch the container (see [run.sh - Container Runtime Manager](#runsh---container-runtime-manager) below for full options):
```bash
# Launch local-dev container with workspace mounted for live editing
container/run.sh --image dynamo:latest-vllm-local-dev --mount-workspace -it
```

250
251
### Building the Frontend Image

252
The frontend image is a specialized container that includes the Dynamo components (Dynamo, NIXL, etc) along with the Endpoint Picker (EPP) for Kubernetes Gateway API Inference Extension integration. This image is primarily used for inference gateway deployments.
253

254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
**Build EPP Image**
```bash
sudo apt-get update && sudo apt-get install -y git build-essential protobuf-compiler libclang-dev
curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain stable
. "$HOME/.cargo/env"
cargo install cbindgen

pushd deploy/inference-gateway/epp
make all
popd

EPP_GIT_TAG=$(git describe --tags --dirty --always 2>/dev/null || echo "dev")
EPP_IMAGE="dynamo/dynamo-epp:${EPP_GIT_TAG}"
```

**Build Frontend Image**
270
```bash
271
# Build the frontend image (automatically builds EPP image as a dependency)
272
container/render.py --framework=dynamo --target=frontend --output-short-filename
273
docker build -t dynamo:frontend --build-arg EPP_IMAGE=${EPP_IMAGE} -f container/rendered.Dockerfile .
274
275
```

276
The build process automatically:
277
278
1. Builds the Dynamo static library for EPP KV-aware routing
2. Builds the custom EPP Docker image using `make all` from `deploy/inference-gateway/epp/Makefile`
279
3. Builds the frontend image with the EPP binary and Dynamo runtime components
280

281
282
For more details, see [`deploy/inference-gateway/README.md`](../deploy/inference-gateway/README.md).

283
284
285
286
287
288
289
290
291
292
293
294
295
#### Frontend Image Contents

The frontend image includes:
- **EPP (Endpoint Picker)**: Handles request routing and load balancing for inference gateway
- **Dynamo Runtime**: Core platform components and routing logic
- **NIXL**: NVIDIA InfiniBand Library for high-performance network communication
- **Benchmarking Tools**: Performance testing utilities (aiperf, aiconfigurator, etc)
- **Python Environment**: Virtual environment with all required dependencies

#### Deployment

The frontend image is designed for Kubernetes deployment with the Gateway API Inference Extension. See [`deploy/inference-gateway/README.md`](../deploy/inference-gateway/README.md) for complete deployment instructions using Helm charts.

296
297
298
299
300
301
302
303
304
305
306
307
308
### run.sh - Container Runtime Manager

The `run.sh` script launches Docker containers with the appropriate configuration for development and inference workloads.

**Purpose:**
- Runs pre-built Dynamo Docker images with proper GPU access
- Configures volume mounts, networking, and environment variables
- Supports different development workflows (root vs user-based)
- Manages container lifecycle and resource allocation

**Key Features:**
- **GPU Management**: Automatic GPU detection and allocation
- **Volume Mounting**: Workspace and HuggingFace cache mounting
309
- **User Management**: Non-root `dynamo` user execution (UID 1000, GID 0), with optional `--user` flag to override
310
- **Network Configuration**: Configurable networking modes (host, bridge, none, container sharing)
311
- **Resource Limits**: Memory, file descriptors, and IPC configuration
312
- **Interactive Mode**: Use `-it` flag for interactive terminal sessions (required for shells, debugging, and interactive development)
313
314
315
316

**Common Usage Examples:**

```bash
317
# Basic container launch with dev image (runs as root by default, non-interactive)
318
container/run.sh --image dynamo:latest-vllm -v $HOME/.cache:/root/.cache
319
320

# Interactive development with workspace mounted using dev image (runs as root)
321
container/run.sh --image dynamo:latest-vllm --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
322

323
# Interactive development with local-dev image (runs as dynamo user with matched host UID/GID)
324
container/run.sh --image dynamo:latest-vllm-local-dev --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
325

326
# Use specific image and framework for development
327
container/run.sh --image v0.1.0.dev.08cc44965-vllm-local-dev --framework vllm --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
328

329
# Interactive development shell with workspace mounted (local-dev)
330
container/run.sh --image dynamo:latest-vllm-local-dev --mount-workspace -v $HOME/.cache:/home/dynamo/.cache -it -- bash
331

332
# Development with custom environment variables
333
container/run.sh --image dynamo:latest-vllm-local-dev -e CUDA_VISIBLE_DEVICES=0,1 --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
334
335

# Dry run to see docker command
336
container/run.sh --dry-run
337

338
# Development with custom volume mounts
339
container/run.sh --image dynamo:latest-vllm-local-dev -v /host/path:/container/path --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
340
341

# Run runtime image as non-root dynamo user (for production)
342
container/run.sh --image dynamo:latest-vllm-runtime -v $HOME/.cache:/home/dynamo/.cache
343
344

# Run dev image as specific user (override default root)
345
container/run.sh --image dynamo:latest-vllm --user dynamo -v $HOME/.cache:/home/dynamo/.cache
346
347
348
349
350
351
352
353
```

### Network Configuration Options

The `run.sh` script supports different networking modes via the `--network` flag (defaults to `host`):

#### Host Networking (Default)
```bash
354
# Examples with dynamo user
355
356
container/run.sh --image dynamo:latest-vllm-local-dev --network host -v $HOME/.cache:/home/dynamo/.cache
container/run.sh --image dynamo:latest-vllm-local-dev -v $HOME/.cache:/home/dynamo/.cache
357
358
359
360
361
362
363
364
365
366
367
```
**Use cases:**
- High-performance ML inference (default for GPU workloads)
- Services that need direct host port access
- Maximum network performance with minimal overhead
- Sharing services with the host machine (NATS, etcd, etc.)

**⚠️ Port Sharing Limitation:** Host networking shares all ports with the host machine, which means you can only run **one instance** of services like NATS (port 4222) or etcd (port 2379) across all containers and the host.

#### Bridge Networking (Isolated)
```bash
368
# CI/testing with isolated bridge networking and host cache sharing (no -it for automated CI)
369
container/run.sh --image dynamo:latest-vllm --mount-workspace --network bridge -v $HOME/.cache:/home/dynamo/.cache
370
371
372
373
374
375
376
377
378
379
380
381
```
**Use cases:**
- Secure isolation from host network
- CI/CD pipelines requiring complete isolation
- When you need absolute control of ports
- Exposing specific services to host while maintaining isolation

**Note:** For port sharing with the host, use the `--port` or `-p` option with format `host_port:container_port` (e.g., `--port 8000:8000` or `-p 9081:8081`) to expose specific container ports to the host.

#### No Networking ⚠️ **LIMITED FUNCTIONALITY**
```bash
# Complete network isolation - no external connectivity
382
container/run.sh --image dynamo:latest-vllm --network none --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
383

384
# Same with local-dev image (dynamo user with matched host UID/GID)
385
container/run.sh --image dynamo:latest-vllm-local-dev --network none --mount-workspace -it -v $HOME/.cache:/home/dynamo/.cache
386
```
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
**⚠️ WARNING: `--network none` severely limits Dynamo functionality:**
- **No model downloads** - HuggingFace models cannot be downloaded
- **No API access** - Cannot reach external APIs or services
- **No distributed inference** - Multi-node setups won't work
- **No monitoring/logging** - External monitoring systems unreachable
- **Limited debugging** - Cannot access external debugging tools

**Very limited use cases:**
- Pre-downloaded models with purely local processing
- Air-gapped security environments (models must be pre-staged)

#### Container Network Sharing
Use `--network container:name` to share the network namespace with another container.

**Use cases:**
- Sidecar patterns (logging, monitoring, caching)
- Service mesh architectures
- Sharing network namespaces between related containers

See Docker documentation for `--network container:name` usage.

#### Custom Networks
Use custom Docker networks for multi-container applications. Create with `docker network create` and specify with `--network network-name`.

**Use cases:**
- Multi-container applications
- Service discovery by container name

See Docker documentation for custom network creation and management.

#### Network Mode Comparison

| Mode | Performance | Security | Use Case | Dynamo Compatibility | Port Sharing | Port Publishing |
|------|-------------|----------|----------|---------------------|---------------|-----------------|
| `host` | Highest | Lower | ML/GPU workloads, high-performance services | ✅ Full | ⚠️ **Shared with host** (one NATS/etcd only) | ❌ Not needed |
| `bridge` | Good | Higher | General web services, controlled port exposure | ✅ Full | ✅ Isolated ports | ✅ `-p host:container` |
| `none` | N/A | Highest | Air-gapped environments only | ⚠️ **Very Limited** | ✅ No network | ❌ No network |
| `container:name` | Good | Medium | Sidecar patterns, shared network stacks | ✅ Full | ⚠️ Shared with target container | ❌ Use target's ports |
| Custom networks | Good | Medium | Multi-container applications | ✅ Full | ✅ Isolated ports | ✅ `-p host:container` |
426
427
428
429
430

## Workflow Examples

### Development Workflow
```bash
431
# 1. Build local-dev image (builds runtime, then dev as intermediate, then local-dev as final image)
432
container/render.py --framework=vllm --target=local-dev --output-short-filename
433
docker build --build-arg USER_UID=$(id -u) --build-arg USER_GID=$(id -g) -f container/rendered.Dockerfile -t dynamo:latest-vllm-local-dev .
434

435
# 2. Run development container using the local-dev image
436
# RECOMMENDED: --mount-workspace for live editing in dev and local-dev images
437
container/run.sh --image dynamo:latest-vllm-local-dev --mount-workspace -v $HOME/.cache:/home/dynamo/.cache -it
438

439
440
441
442
443
444
# From this point forward, commands run inside the container started in step 2.

# 3. Sanity check (optional but recommended)
deploy/sanity_check.py

# 4. Run inference (requires both frontend and backend)
445
446
447
448
# Start frontend
python -m dynamo.frontend &

# Start backend (vLLM example)
449
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --gpu-memory-utilization 0.20 &
450
451
452
453
```

### Production Workflow
```bash
454
# 1. Build production runtime image (runs as non-root dynamo user)
455
container/render.py --framework=vllm --target=runtime --output-short-filename
456
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
457

458
# 2. Run production container as non-root dynamo user
459
container/run.sh --image dynamo:latest-vllm-runtime --gpus all -v $HOME/.cache:/home/dynamo/.cache
460
461
```

462
### Testing Workflow
463
```bash
464
# 1. Build dev image
465
container/render.py --framework=vllm --target=dev --output-short-filename
466
docker build -t dynamo:latest-vllm-dev -f container/rendered.Dockerfile .
467

468
469
470
471
472
473
474
# 2. Launch the container
# Without --network (default: host networking, ports shared with host -- simplest for development)
container/run.sh --image dynamo:latest-vllm-dev --mount-workspace -v $HOME/.cache:/home/dynamo/.cache -it
# Or with --network bridge (isolated networking, no port conflicts with host)
container/run.sh --image dynamo:latest-vllm-dev --mount-workspace --network bridge -v $HOME/.cache:/home/dynamo/.cache -it

# From this point forward, commands run inside the container started in step 2.
475

476
# 3. Start infrastructure services
477
478
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
479
480
481
482
483
484
485
486
487
488
489
490

# 4. Compile code
cargo build --locked --features dynamo-llm/block-manager --workspace
cd lib/bindings/python && maturin develop --uv && cd -

# 5. Sanity check (optional but recommended)
deploy/sanity_check.py --runtime-check-only

# 6. Run tests
python -m pytest tests/

# 7. (Optional) Start frontend and backend for interactive testing
491
492
python -m dynamo.frontend &

493
# Start worker backend (choose one framework):
494
# vLLM
495
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --gpu-memory-utilization 0.20 --enforce-eager --no-enable-prefix-caching --max-num-seqs 64 &
496
497

# SGLang
498
DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model Qwen/Qwen3-0.6B --mem-fraction-static 0.20 --max-running-requests 64 &
499
500

# TensorRT-LLM
501
DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model Qwen/Qwen3-0.6B --free-gpu-memory-fraction 0.20 --max-num-tokens 8192 --max-batch-size 64 &
502
```
503
504
505
506
507

**Framework-Specific GPU Memory Arguments:**
- **vLLM**: `--gpu-memory-utilization 0.20` (use 20% GPU memory), `--enforce-eager` (disable CUDA graphs), `--no-enable-prefix-caching` (save memory), `--max-num-seqs 64` (max concurrent sequences)
- **SGLang**: `--mem-fraction-static 0.20` (20% GPU memory for static allocation), `--max-running-requests 64` (max concurrent requests)
- **TensorRT-LLM**: `--free-gpu-memory-fraction 0.20` (reserve 20% GPU memory), `--max-num-tokens 8192` (max tokens in batch), `--max-batch-size 64` (max batch size)