# Encode-Prefill-Decode (EPD) Flow with NIXL

For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.

## Enabling the Feature

This is an experimental feature that requires a specific TensorRT-LLM commit.
To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag followed by the commit hash or tag (the example below uses the `v1.2.0rc3` tag):

```bash
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit v1.2.0rc3
```

## Key Features

- **High Performance**: Zero-copy RDMA transfer for embeddings
- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON

## How to Use

```bash
cd $DYNAMO_HOME/examples/backends/trtllm

# Launch 3-worker EPD flow with NIXL.
./launch/epd_disagg.sh
```

## Prerequisites

This script is designed specifically for an 8-node H200 setup running the `Llama-4-Maverick-17B-128E-Instruct` model. It assumes that a model-specific embedding file is already available.

## Configuration

The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:

```bash
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
```

This endpoint follows Dynamo's standard format: `dyn://namespace.component.endpoint` where the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`.
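
To make the `dyn://namespace.component.endpoint` format concrete, the sketch below splits such a string into its three parts. The `DynEndpoint` type and `parse_dyn_endpoint` helper are hypothetical names used for illustration only; they are not part of Dynamo's API, and Dynamo's own parsing may apply different validation rules.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    """Hypothetical container for the three parts of a dyn:// endpoint."""
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(uri: str) -> DynEndpoint:
    """Split a dyn://namespace.component.endpoint string into its parts."""
    scheme = "dyn://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected a {scheme} URI, got {uri!r}")
    parts = uri[len(scheme):].split(".")
    # Exactly three non-empty, dot-separated fields are expected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


parsed = parse_dyn_endpoint("dyn://dynamo.tensorrt_llm_encode.generate")
print(parsed.namespace, parsed.component, parsed.endpoint)
```

For the endpoint above, the namespace is `dynamo`, the component is `tensorrt_llm_encode`, and the endpoint is `generate`.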

For local embedding file access, use the `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` parameter to specify the secure directory path where embedding files can be loaded from (default: `/tmp`). This prevents path traversal attacks while allowing flexible file access within the designated directory.

```bash
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
```
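
As an illustration of how restricting loads to one directory blocks path traversal, here is a minimal sketch of a containment check. It is not Dynamo's implementation: `is_path_allowed` is a hypothetical helper, and the only assumption carried over from above is the `/tmp` default.

```python
from pathlib import Path


def is_path_allowed(candidate: str, allowed_root: str = "/tmp") -> bool:
    """Return True only if candidate resolves to a location inside allowed_root."""
    root = Path(allowed_root).resolve()
    # resolve() collapses ".." segments and follows symlinks, so a path is
    # judged by where it really points, not by how it is spelled.
    target = Path(candidate).resolve()
    return target == root or root in target.parents


print(is_path_allowed("/tmp/embeddings.pt"))
print(is_path_allowed("/tmp/../etc/passwd"))
```

The first path stays inside the allowed directory, while the second escapes it through `..` and is rejected even though it starts with `/tmp`.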

To guard against oversized inputs, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"` parameter to cap the size of embedding files and image URLs that can be downloaded (default: `50` MB). This prevents Denial of Service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes.

```bash
export MAX_FILE_SIZE_MB=50
```
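
One common way to enforce such a cap is to read the input in chunks and abort as soon as the running total exceeds the limit, so an oversized file is never held in memory in full. The sketch below shows that idea under stated assumptions: `read_with_limit` is a hypothetical helper, 1 MB is taken to be 1024 × 1024 bytes, and Dynamo's actual enforcement may differ.

```python
import io


def read_with_limit(stream, max_file_size_mb: int = 50, chunk_size: int = 1 << 20) -> bytes:
    """Read a binary stream fully, aborting once it exceeds the size limit."""
    limit = max_file_size_mb * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        # Fail before keeping the chunk that crosses the limit.
        if total > limit:
            raise ValueError(f"file exceeds {max_file_size_mb} MB limit")
        chunks.append(chunk)
    return b"".join(chunks)


data = read_with_limit(io.BytesIO(b"\x00" * 1024), max_file_size_mb=1)
print(len(data))
```

Because the check runs per chunk, the same function works for local files and for HTTP response bodies whose declared length cannot be trusted.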

## Architecture Overview

The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:

- **Encode Worker**: Loads and processes multimodal embeddings
- **Prefill Worker**: Handles initial context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation

## Request Flow Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Frontend
    participant PrefillWorker as "Prefill Worker<br/>(PrefillHandler)"
    participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
    participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)"
    participant NIXL as "NIXL<br/>(RDMA Transfer)"

    Note over Client,NIXL: Unified Frontend: Context processing followed by streaming generation

    Client->>Frontend: POST /v1/chat/completions<br/>(multimodal request)
    Frontend->>PrefillWorker: Route to prefill worker

    Note over PrefillWorker: Check for multimodal content
    PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)

    Note over EncodeWorker: Load embeddings from file/url<br/>
    EncodeWorker->>NIXL: Create readable operation<br/>
    EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)

    Note over PrefillWorker: Allocate tensor with dynamic shape
    PrefillWorker->>NIXL: Begin read operation
    NIXL-->>PrefillWorker: Zero-copy transfer complete<br/>

    Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
    Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
    Note over PrefillWorker: Generate KV-cache<br/>(max_tokens=1 in prefill mode)

    PrefillWorker->>Frontend: Return prefill response<br/>(disaggregated_params)

    Frontend->>DecodeWorker: Route to decode worker<br/>with disaggregated_params

    Note over DecodeWorker: Continue generation<br/>(streaming tokens)
    DecodeWorker->>Frontend: Stream response chunk 1
    Frontend->>Client: Response chunk 1
    DecodeWorker->>Frontend: Stream response chunk 2
    Frontend->>Client: Response chunk 2
    DecodeWorker->>Frontend: ... (continue streaming)
    Frontend->>Client: ... (continue streaming)
    DecodeWorker->>Frontend: Final response + [DONE]
    Frontend->>Client: Final response + [DONE]
```

## How the System Works

1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed by the frontend to prefill workers
2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata
3. **NIXL Transfer**: Main tensors are transferred via zero-copy RDMA, while small metadata is sent as JSON for efficiency
4. **Dynamic Allocation**: Consumer workers allocate tensors with exact shapes received from EncodeWorker
5. **Reconstruction**: Original embedding format (dictionary or tensor) is reconstructed for model processing
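
The metadata and allocation steps above can be sketched as follows. This is a simplified, standard-library-only illustration, not the actual worker code: the `shape`, `dtype`, and `aux_data` field names come from the request flow diagram, while `encode_metadata`, `allocate_for`, the `DTYPE_SIZES` table, and the sample `aux_data` contents are hypothetical. A plain `bytearray` stands in for the tensor that would receive the NIXL read.

```python
import json
from math import prod

# Bytes per element for a few common dtypes (illustrative subset, an assumption).
DTYPE_SIZES = {"float16": 2, "bfloat16": 2, "float32": 4}


def encode_metadata(shape, dtype, aux_data=None) -> str:
    """Producer side: describe the tensor in a small JSON message."""
    return json.dumps({"shape": list(shape), "dtype": dtype, "aux_data": aux_data or {}})


def allocate_for(metadata_json: str) -> bytearray:
    """Consumer side: size a receive buffer from the advertised shape and dtype."""
    meta = json.loads(metadata_json)
    nbytes = prod(meta["shape"]) * DTYPE_SIZES[meta["dtype"]]
    # A real consumer would allocate a tensor of this shape and dtype as the
    # destination of the read; a bytearray of the same size stands in here.
    return bytearray(nbytes)


message = encode_metadata((576, 4096), "float16", {"special_tokens": [1, 2]})
buffer = allocate_for(message)
print(len(buffer))
```

Keeping the shape and dtype in the JSON message is what lets the consumer allocate exactly the right amount of memory per image before any bulk data moves.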

## Example Request

The request format is identical to regular multimodal requests:

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {
                    "type": "image_url",
                    "image_url": {"url": "/path/to/embeddings.pt"}
                }
            ]
        }
    ],
    "max_tokens": 160
}'
```
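
For callers who prefer Python to `curl`, the same request can be built with the standard library. This is a minimal sketch: `build_request` is a hypothetical helper, and the base URL assumes the frontend is listening on `localhost:8000` as in the example above.

```python
import json
import urllib.request


def build_request(embedding_path: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the chat completions request shown in the curl example."""
    payload = {
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {"type": "image_url", "image_url": {"url": embedding_path}},
            ],
        }],
        "max_tokens": 160,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("/path/to/embeddings.pt")
print(request.full_url)

# Sending the request requires a running deployment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```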