# Encode-Prefill-Decode (EPD) Flow with NIXL

For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.

## Enabling the Feature

This is an experimental feature that requires a specific TensorRT-LLM commit.
To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag followed by the commit hash:

```bash
./container/build.sh --framework trtllm --tensorrtllm-commit b4065d8ca64a64eee9fdc64b39cb66d73d4be47c
```

## Key Features

- **High Performance**: Zero-copy RDMA transfer for embeddings
- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON

## How to Use

```bash
cd $DYNAMO_HOME/components/backends/trtllm

# Launch 3-worker EPD flow with NIXL
./launch/epd_disagg.sh
```

## Configuration

The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:

```bash
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
```

This endpoint follows Dynamo's standard format, `dyn://namespace.component.endpoint`, where the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`.
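
To illustrate that format, here is a minimal Python sketch that splits an endpoint string into its three parts. The helper `parse_dyn_endpoint` and the `DynEndpoint` type are hypothetical names used only for this example and are not part of Dynamo's API. The sketch also assumes that none of the three parts contains a dot.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    """The three parts of a dyn:// endpoint (hypothetical type)."""
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(uri: str) -> DynEndpoint:
    """Split 'dyn://namespace.component.endpoint' into its parts."""
    prefix = "dyn://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a {prefix} URI, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    # Assumption: exactly three non-empty, dot-separated parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


parsed = parse_dyn_endpoint("dyn://dynamo.tensorrt_llm_encode.generate")
print(parsed.namespace, parsed.component, parsed.endpoint)
```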

To load embedding files from local disk, pass `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` to set the directory from which embedding files may be loaded (default: `/tmp`). Restricting loads to this directory prevents path traversal attacks while still allowing flexible file access within it.

```bash
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
```
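
One common way to enforce a directory restriction like this is to resolve both paths and check containment. The sketch below shows that general technique only. The helper name `is_within_allowed_dir` is hypothetical, and nothing here is taken from Dynamo's actual implementation.

```python
from pathlib import Path


def is_within_allowed_dir(candidate: str, allowed_dir: str = "/tmp") -> bool:
    """Return True if candidate resolves to a path strictly inside allowed_dir."""
    # resolve() collapses '..' segments and follows symlinks, so a path such
    # as '/tmp/../etc/passwd' is compared by its real location.
    allowed = Path(allowed_dir).resolve()
    target = Path(candidate).resolve()
    return allowed in target.parents


print(is_within_allowed_dir("/tmp/embeddings.pt"))
print(is_within_allowed_dir("/tmp/../etc/passwd"))
```

Comparing resolved paths rather than raw strings also rejects sibling directories that merely share a prefix, such as `/tmpfoo` when the allowed directory is `/tmp`.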

To guard against oversized inputs, pass `--max-file-size-mb "$MAX_FILE_SIZE_MB"` to cap the size of embedding files and images downloaded from URLs (default: `50`, in MB). The cap prevents denial-of-service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes.

```bash
export MAX_FILE_SIZE_MB=50
```
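
As a rough illustration of how such a cap can be applied to a download, the sketch below reads a stream in chunks and aborts once the limit is exceeded. The function name `read_with_limit` is hypothetical, and the sketch assumes 1 MB means 1024 * 1024 bytes, which this document does not specify. It is not Dynamo's implementation.

```python
import io

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def read_with_limit(stream, max_file_size_mb: int = 50, chunk_size: int = 64 * 1024) -> bytes:
    """Read a binary stream fully, raising ValueError once it exceeds the cap."""
    limit = max_file_size_mb * BYTES_PER_MB
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        # Check while streaming so an oversized body is never fully buffered.
        if total > limit:
            raise ValueError(f"file exceeds {max_file_size_mb} MB limit")
        chunks.append(chunk)
    return b"".join(chunks)


data = read_with_limit(io.BytesIO(b"\x00" * 1024), max_file_size_mb=1)
print(len(data))
```

Checking the running total during the read, rather than trusting a declared length, keeps the cap effective even when the sender omits or misreports the size.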

## Architecture Overview

The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:

- **Encode Worker**: Loads and processes multimodal embeddings
- **Prefill Worker**: Handles initial context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation

## Request Flow Diagrams

### Prefill-First Disaggregation Strategy

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant PrefillWorker as "Prefill Worker<br/>(AggregatedHandler)"
    participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
    participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)"
    participant NIXL as "NIXL<br/>(RDMA Transfer)"

    Note over Client,NIXL: Prefill-First Strategy: Context processing first, then streaming generation

    Client->>Gateway: POST /v1/chat/completions<br/>(multimodal request)
    Gateway->>PrefillWorker: Route request

    Note over PrefillWorker: Check for multimodal content
    PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)

    Note over EncodeWorker: Load embeddings from file/url<br/>
    EncodeWorker->>NIXL: Create readable operation<br/>
    EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)

    Note over PrefillWorker: Allocate tensor with dynamic shape
    PrefillWorker->>NIXL: Begin read operation
    NIXL-->>PrefillWorker: Zero-copy transfer complete<br/>

    Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
    Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
    Note over PrefillWorker: Generate KV-cache<br/>(max_tokens=1 in prefill mode)

    PrefillWorker->>DecodeWorker: Transfer KV-cache + disaggregated_params<br/>(generation_only mode)

    Note over DecodeWorker: Continue generation<br/>(streaming tokens)
    DecodeWorker->>Gateway: Stream response chunk 1
    Gateway->>Client: Response chunk 1
    DecodeWorker->>Gateway: Stream response chunk 2
    Gateway->>Client: Response chunk 2
    DecodeWorker->>Gateway: ... (continue streaming)
    Gateway->>Client: ... (continue streaming)
    DecodeWorker->>Gateway: Final response + [DONE]
    Gateway->>Client: Final response + [DONE]
```

### Decode-First Disaggregation Strategy

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)<br/>PRIMARY"
    participant PrefillWorker as "Prefill Worker<br/>(PrefillHandler)"
    participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
    participant NIXL as "NIXL<br/>(RDMA Transfer)"

    Note over Client,NIXL: Decode-First Strategy: DecodeWorker orchestrates prefill then handles generation

    Client->>Gateway: POST /v1/chat/completions<br/>(multimodal request)
    Gateway->>DecodeWorker: Route request<br/>(primary worker)

    Note over DecodeWorker: Check disaggregation_strategy == DECODE_FIRST
    Note over DecodeWorker: Call remote_prefill() to trigger prefill

    DecodeWorker->>PrefillWorker: Send request via remote_prefill()

    Note over PrefillWorker: Check for multimodal content
    PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)

    Note over EncodeWorker: Load embeddings from file<br/>
    EncodeWorker->>NIXL: Create readable operation<br/>
    EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)

    Note over PrefillWorker: Allocate tensor with dynamic shape
    PrefillWorker->>NIXL: Begin read operation
    NIXL-->>PrefillWorker: Zero-copy transfer complete<br/>

    Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
    Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
    Note over PrefillWorker: Generate prefill response<br/>(max_tokens=1 in prefill mode)

    PrefillWorker->>DecodeWorker: Return prefill response<br/>(disaggregated_params)

    Note over DecodeWorker: Extract disaggregated_params<br/>from prefill_response
    Note over DecodeWorker: Update request with params<br/>request["disaggregated_params"] = response_data["disaggregated_params"]
    Note over DecodeWorker: Begin local generation<br/>(generate_locally with prefill state)

    DecodeWorker->>Gateway: Stream response chunk 1
    Gateway->>Client: Response chunk 1
    DecodeWorker->>Gateway: Stream response chunk 2
    Gateway->>Client: Response chunk 2
    DecodeWorker->>Gateway: ... (continue streaming)
    Gateway->>Client: ... (continue streaming)
    DecodeWorker->>Gateway: Final response + [DONE]
    Gateway->>Client: Final response + [DONE]
```
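
The Decode worker's orchestration role in this strategy can be sketched in a few lines of Python. The names `remote_prefill`, `generate_locally`, and `disaggregated_params` come from the diagram above. Their signatures, the `handle_decode_first` wrapper, and the stub workers are assumptions made for illustration, not the actual handler code.

```python
from typing import Any, Callable, Dict, Iterator

Request = Dict[str, Any]


def handle_decode_first(
    request: Request,
    remote_prefill: Callable[[Request], Dict[str, Any]],
    generate_locally: Callable[[Request], Iterator[str]],
) -> Iterator[str]:
    """Run prefill remotely, then stream tokens from local generation."""
    response_data = remote_prefill(request)
    # Copy so the caller's request is left untouched (a choice of this sketch).
    request = dict(request)
    request["disaggregated_params"] = response_data["disaggregated_params"]
    yield from generate_locally(request)


# Stub workers standing in for the real Prefill and Decode engines.
def fake_remote_prefill(request: Request) -> Dict[str, Any]:
    return {"disaggregated_params": {"placeholder": True}}


def fake_generate_locally(request: Request) -> Iterator[str]:
    for token in ("Hello", " world"):
        yield token


chunks = list(handle_decode_first({"messages": []}, fake_remote_prefill, fake_generate_locally))
print(chunks)
```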

## How the System Works

1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed according to the disaggregation strategy
2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata
3. **NIXL Transfer**: Main tensors are transferred via zero-copy RDMA, while small metadata is sent as JSON for efficiency
4. **Dynamic Allocation**: Consumer workers allocate tensors with the exact shapes received from EncodeWorker
5. **Reconstruction**: The original embedding format (dictionary or tensor) is reconstructed for model processing
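
Steps 3 and 4 can be illustrated with plain Python. In this sketch the field names `shape`, `dtype`, and `aux_data` follow the diagrams above, while the helper names, the `DTYPE_SIZES` table, and the example values are hypothetical. The NIXL transfer itself is not shown, and a `bytearray` stands in for the destination tensor.

```python
import json
import math

# Bytes per element for a few dtypes (illustrative subset, not exhaustive).
DTYPE_SIZES = {"float16": 2, "bfloat16": 2, "float32": 4}


def build_metadata(shape, dtype, aux_data):
    """Encode side: serialize the small metadata that travels as JSON."""
    return json.dumps({"shape": list(shape), "dtype": dtype, "aux_data": aux_data})


def allocate_receive_buffer(metadata_json):
    """Consumer side: size a destination buffer from the received shape and dtype."""
    meta = json.loads(metadata_json)
    num_elements = math.prod(meta["shape"])
    buffer = bytearray(num_elements * DTYPE_SIZES[meta["dtype"]])
    return meta, buffer


# Example values are made up; real shapes vary per image.
message = build_metadata((576, 4096), "float16", {"special_tokens": [1, 2]})
meta, buffer = allocate_receive_buffer(message)
print(meta["shape"], len(buffer))
```

Because the shape arrives with each response, the consumer does not need a fixed-size staging buffer and can size the allocation per request.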

## Example Request

The request format is identical to regular multimodal requests:

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {
                    "type": "image_url",
                    "image_url": {"url": "/path/to/embeddings.pt"}
                }
            ]
        }
    ],
    "max_tokens": 160
}'
```
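
For callers who prefer Python to `curl`, the same request can be built and sent with the standard library. This is a sketch under assumptions: the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the base URL assumes the frontend is listening on `localhost:8000` as in the `curl` example.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, embedding_path: str, max_tokens: int = 160) -> dict:
    """Build the JSON body used in the curl example above."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # The embedding file path is passed in the image_url field.
                    {"type": "image_url", "image_url": {"url": embedding_path}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the chat completions endpoint and decode the reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_chat_request(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "Describe the image",
    "/path/to/embeddings.pt",
)
print(json.dumps(payload, indent=2))
```

With the EPD workers running, calling `send_chat_request(payload)` submits the request. The block above only prints the body, so it can be run without a live server.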