---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Encoder Disaggregation
subtitle: Separate vision encoding into a dedicated worker for independent scaling
---

## Overview

Encoder disaggregation separates the vision encoder from the prefill/decode pipeline into its own worker. Instead of running image encoding inline, a dedicated encode worker handles media processing and transfers the resulting embeddings to downstream workers via NIXL (RDMA).

This enables:

- Independent scaling of encode workers based on vision workload
- Reduced GPU memory pressure on prefill/decode workers
- Better GPU utilization by matching worker counts to actual bottlenecks

## When to Use

Use encoder disaggregation when:

- Vision encoding is a bottleneck and you need to scale encoders independently of LLM workers
- You want to run the vision encoder on different hardware (e.g., smaller GPUs for encoding, larger GPUs for LLM inference)
- Your deployment handles high volumes of multimodal requests and encoding throughput is the limiting factor

For simple deployments or development/testing, the aggregated (EPD) pattern is easier to set up.

## Support Matrix

| Backend | E/PD | E/P/D | Notes |
|---------|------|-------|-------|
| **vLLM** | ✅ | ✅ | NIXL transfer for embeddings; NIXL KV cache transfer for P/D |
| **TRT-LLM** | ❌ | ✅ | Supports image URLs (via `MultimodalEncoder`) and pre-computed embeddings (via NIXL) |
| **SGLang** | ✅ | ✅ | NIXL for embeddings; bootstrap mechanism for P/D KV transfer |
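
A deployment script can check the matrix before it starts any workers. The following is a minimal sketch that only transcribes the table above. The function name `supports_pattern` and the pattern keys `e_pd` and `e_p_d` are hypothetical and are not part of Dynamo.

```bash
# Hypothetical preflight check that mirrors the support matrix above.
# The name supports_pattern and the keys "e_pd" / "e_p_d" are illustrative only.
supports_pattern() {
  local backend="$1" pattern="$2"
  case "${backend}:${pattern}" in
    vllm:e_pd|vllm:e_p_d)     return 0 ;;  # vLLM: both patterns
    trtllm:e_p_d)             return 0 ;;  # TRT-LLM: E/P/D only, per the table
    sglang:e_pd|sglang:e_p_d) return 0 ;;  # SGLang: both patterns
    *)                        return 1 ;;  # unsupported combination or unknown backend
  esac
}
```

For example, `supports_pattern trtllm e_pd || echo "not supported" >&2` lets a script stop early instead of failing partway through a launch.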

## Deployment Patterns

**E/PD** — Separate encoder, combined prefill+decode:

```text
Frontend → Processor → Encode Worker → PD Worker → Response
                           (NIXL)
```

The encode worker runs the vision model and transfers embeddings via NIXL to a combined prefill+decode worker.

**E/P/D** — All stages separate:

```text
Frontend → Processor → Encode Worker → Prefill Worker → Decode Worker → Response
                           (NIXL)          (KV transfer)
```

This pattern is full disaggregation, with a separate worker for each stage. The encode worker transfers embeddings to the prefill worker, which then transfers the KV cache to the decode worker.

## Launching

### vLLM

```bash
cd "$DYNAMO_HOME/examples/backends/vllm"

# E/PD
bash launch/disagg_multimodal_e_pd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"

# E/P/D
bash launch/disagg_multimodal_epd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
```

### TRT-LLM

```bash
cd "$DYNAMO_HOME/examples/backends/trtllm"

# E/PD
bash launch/disagg_e_pd.sh

# E/P/D
./launch/epd_multimodal_image_and_embeddings.sh
```

### SGLang

```bash
cd "$DYNAMO_HOME/examples/backends/sglang"

# E/PD
./launch/multimodal_epd.sh

# E/P/D
./launch/multimodal_disagg.sh
```

See the backend-specific documentation ([vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md), [TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md), [SGLang](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)) for full configuration details and component flags.
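
When one wrapper script drives several backends, the commands in this section can be collected into a single lookup. The sketch below is one possible way to do that. The function name `launch_script_for` and the pattern keys `e_pd` and `e_p_d` are hypothetical. The script paths are copied from the commands listed in this section and are relative to each backend's example directory. The function does not consult the support matrix.

```bash
# Hypothetical helper: prints the launch script for a backend and pattern.
# Paths are relative to $DYNAMO_HOME/examples/backends/<backend> and are
# copied from the launch commands in this section; nothing is executed here.
launch_script_for() {
  local backend="$1" pattern="$2"
  case "${backend}:${pattern}" in
    vllm:e_pd)    echo "launch/disagg_multimodal_e_pd.sh" ;;
    vllm:e_p_d)   echo "launch/disagg_multimodal_epd.sh" ;;
    trtllm:e_pd)  echo "launch/disagg_e_pd.sh" ;;
    trtllm:e_p_d) echo "launch/epd_multimodal_image_and_embeddings.sh" ;;
    sglang:e_pd)  echo "launch/multimodal_epd.sh" ;;
    sglang:e_p_d) echo "launch/multimodal_disagg.sh" ;;
    *)
      echo "unknown backend/pattern: ${backend}/${pattern}" >&2
      return 1
      ;;
  esac
}
```

A caller would change into `"$DYNAMO_HOME/examples/backends/$backend"` and run `bash "$(launch_script_for "$backend" "$pattern")"`, appending any flags the script expects, such as `--model` for the vLLM scripts.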