# Encode-Prefill-Decode (EPD) Flow with NIXL

For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.

## Enabling the Feature

This is an experimental feature that requires a specific TensorRT-LLM commit.
To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag followed by the commit hash (or a ref such as `main`, as in the example):

```bash
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main
```

## Key Features

- **High Performance**: Zero-copy RDMA transfer for embeddings
- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON

## How to Use

```bash
cd $DYNAMO_HOME/examples/backends/trtllm

# Launch 3-worker EPD flow with NIXL
./launch/epd_disagg.sh
```

## Configuration

The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:

```bash
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
```

This endpoint follows Dynamo's standard format, `dyn://namespace.component.endpoint`, where the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`.
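
As a minimal sketch of how the three-part format breaks down, the snippet below splits an endpoint string into namespace, component, and endpoint. `DynEndpoint` and `parse_dyn_endpoint` are hypothetical names introduced for illustration; they are not part of Dynamo's API, and Dynamo's own parsing may differ.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    """Illustrative container only; not a Dynamo type."""
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(uri: str) -> DynEndpoint:
    """Split 'dyn://namespace.component.endpoint' into its three parts."""
    scheme = "dyn://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected a {scheme} URI, got {uri!r}")
    parts = uri[len(scheme):].split(".")
    # Exactly three non-empty, dot-separated parts are expected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


parsed = parse_dyn_endpoint("dyn://dynamo.tensorrt_llm_encode.generate")
print(parsed.namespace, parsed.component, parsed.endpoint)
```

A check like this can catch a mistyped `ENCODE_ENDPOINT` value before any worker is launched.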

To load embedding files from local disk, pass `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` to set the only directory from which embedding files may be loaded (default: `/tmp`). Restricting loads to this directory prevents path traversal attacks while still allowing access to any file inside it.

```bash
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
```
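
The containment check behind this kind of protection can be sketched as follows. This is an assumption about the general technique (resolve the path, then compare it with the allowed root), not the actual Dynamo implementation; `is_path_allowed` is a hypothetical helper.

```python
import os


def is_path_allowed(candidate: str, allowed_root: str = "/tmp") -> bool:
    """Return True only if `candidate` resolves to a location inside `allowed_root`."""
    # realpath collapses '..' segments and follows symlinks before comparing.
    root = os.path.realpath(allowed_root)
    target = os.path.realpath(candidate)
    # commonpath compares whole path components, so '/tmpfoo' does not match '/tmp'.
    return os.path.commonpath([root, target]) == root


print(is_path_allowed("/tmp/embeddings.pt"))
print(is_path_allowed("/tmp/../etc/passwd"))
```

Comparing resolved paths component by component, rather than with a plain string prefix test, is what rejects both `..` escapes and sibling directories that merely share a prefix.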

To cap file size, pass `--max-file-size-mb "$MAX_FILE_SIZE_MB"` to limit the size of downloaded embedding files and images fetched by URL (default: 50 MB). This guards against Denial of Service (DoS) attacks that use maliciously large files while still accommodating typical embedding file sizes.

```bash
export MAX_FILE_SIZE_MB=50
```
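
One common way to enforce such a cap is to read the payload in chunks and stop as soon as the limit is exceeded, so an oversized file is never fully buffered. The sketch below illustrates that idea; `read_with_limit` is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the source does not say which convention Dynamo uses.

```python
import io
from typing import BinaryIO


def read_with_limit(stream: BinaryIO, max_file_size_mb: int = 50,
                    chunk_size: int = 64 * 1024) -> bytes:
    """Read a binary stream, raising ValueError once it exceeds the size cap."""
    limit = max_file_size_mb * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.extend(chunk)
        if len(buf) > limit:
            # Abort early instead of buffering the whole oversized payload.
            raise ValueError(f"payload exceeds {max_file_size_mb} MB limit")
    return bytes(buf)


payload = read_with_limit(io.BytesIO(b"\x00" * 1024), max_file_size_mb=1)
print(len(payload))
```

The same function works for any file-like object, including an HTTP response body, which is why the limit is checked during the read rather than from a declared length.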

## Architecture Overview

The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:

- **Encode Worker**: Loads and processes multimodal embeddings
- **Prefill Worker**: Handles initial context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation

## Request Flow Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Frontend
    participant PrefillWorker as "Prefill Worker<br/>(PrefillHandler)"
    participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
    participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)"
    participant NIXL as "NIXL<br/>(RDMA Transfer)"

    Note over Client,NIXL: Unified Frontend: Context processing followed by streaming generation

    Client->>Frontend: POST /v1/chat/completions<br/>(multimodal request)
    Frontend->>PrefillWorker: Route to prefill worker

    Note over PrefillWorker: Check for multimodal content
    PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)

    Note over EncodeWorker: Load embeddings from file/URL
    EncodeWorker->>NIXL: Create readable operation
    EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)

    Note over PrefillWorker: Allocate tensor with dynamic shape
    PrefillWorker->>NIXL: Begin read operation
    NIXL-->>PrefillWorker: Zero-copy transfer complete

    Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
    Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
    Note over PrefillWorker: Generate KV-cache<br/>(max_tokens=1 in prefill mode)

    PrefillWorker->>Frontend: Return prefill response<br/>(disaggregated_params)

    Frontend->>DecodeWorker: Route to decode worker<br/>with disaggregated_params

    Note over DecodeWorker: Continue generation<br/>(streaming tokens)
    DecodeWorker->>Frontend: Stream response chunk 1
    Frontend->>Client: Response chunk 1
    DecodeWorker->>Frontend: Stream response chunk 2
    Frontend->>Client: Response chunk 2
    DecodeWorker->>Frontend: ... (continue streaming)
    Frontend->>Client: ... (continue streaming)
    DecodeWorker->>Frontend: Final response + [DONE]
    Frontend->>Client: Final response + [DONE]
```
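
The prefill-to-decode hand-off in the diagram can be mimicked with a toy, in-process sketch. Everything here is a stand-in: `prefill`, `decode`, and `handle` are hypothetical functions, the contents of `disaggregated_params` are a placeholder, and no real worker, KV-cache, or network call is involved.

```python
from typing import Dict, Iterator, List


def prefill(request: Dict) -> Dict:
    """Stand-in prefill step: runs with max_tokens=1 and returns hand-off state."""
    prefill_request = dict(request, max_tokens=1)  # prefill mode generates one token
    # Placeholder hand-off state; the real parameters are opaque to this sketch.
    opaque_state = {"context_words": len(prefill_request["prompt"].split())}
    return {"disaggregated_params": opaque_state}


def decode(request: Dict, disaggregated_params: Dict) -> Iterator[str]:
    """Stand-in decode step: streams chunks, then a final [DONE] marker."""
    for i in range(request["max_tokens"]):
        yield f"token{i}"
    yield "[DONE]"


def handle(request: Dict) -> List[str]:
    """Frontend-style routing: prefill first, then decode with the returned params."""
    prefill_response = prefill(request)
    return list(decode(request, prefill_response["disaggregated_params"]))


chunks = handle({"prompt": "Describe the image", "max_tokens": 3})
print(chunks)
```

The point of the sketch is the ordering: the decode step never sees the original context again, only the parameters returned by prefill.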

## How the System Works

1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed by the frontend to prefill workers
2. **Multimodal Loading**: The Encode Worker loads large embedding files and extracts auxiliary metadata
3. **NIXL Transfer**: Main tensors are transferred via zero-copy RDMA; small metadata is sent as JSON for efficiency
4. **Dynamic Allocation**: Consumer workers allocate tensors with the exact shapes received from the Encode Worker
5. **Reconstruction**: The original embedding format (dictionary or tensor) is reconstructed for model processing
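
The split between JSON metadata and bulk tensor bytes in steps 3 and 4 can be sketched with the standard library alone. This is an illustration under stated assumptions: the `DTYPE_SIZES` table, the helper names, and the `aux_data` contents are hypothetical, and a plain in-process byte copy stands in for the NIXL RDMA read (no NIXL API is used or implied).

```python
import json
import math
import struct

# Assumed dtype-name -> bytes-per-element table, for illustration only.
DTYPE_SIZES = {"float16": 2, "bfloat16": 2, "float32": 4}


def make_metadata(shape, dtype, aux_data) -> str:
    """Producer side: describe the tensor in a small JSON message."""
    return json.dumps({"shape": list(shape), "dtype": dtype, "aux_data": aux_data})


def allocate_from_metadata(metadata_json: str):
    """Consumer side: size the destination buffer from the received shape and dtype."""
    meta = json.loads(metadata_json)
    nbytes = math.prod(meta["shape"]) * DTYPE_SIZES[meta["dtype"]]
    return meta, bytearray(nbytes)


# Producer: a 2x3 float32 "embedding" plus placeholder auxiliary data.
source = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
metadata = make_metadata((2, 3), "float32", {"offsets": [0]})

# Consumer: allocate exactly the bytes the metadata describes, then fill the buffer.
meta, dest = allocate_from_metadata(metadata)
dest[:] = source  # stand-in for the NIXL read; here just an in-process copy
values = struct.unpack("<6f", dest)
print(meta["shape"], len(dest), values)
```

Because the shape travels in the metadata, the consumer does not need a fixed-size buffer, which is what lets embedding shapes vary per image.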

## Example Request

The request format is identical to regular multimodal requests:

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {
                    "type": "image_url",
                    "image_url": {"url": "/path/to/embeddings.pt"}
                }
            ]
        }
    ],
    "max_tokens": 160
}'
```
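
For scripted clients, the same payload can be assembled in Python. The sketch below uses only the standard library; `build_epd_request` and `send` are hypothetical helper names, and the URL assumes the frontend is listening on `localhost:8000` as in the `curl` example. `send` is defined but not called here, since it requires a running deployment.

```python
import json
import urllib.request


def build_epd_request(model: str, prompt: str, embedding_ref: str,
                      max_tokens: int = 160) -> dict:
    """Build a chat-completions payload whose image_url points at an embedding file."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": embedding_ref}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def send(payload: dict, url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload as JSON and return the decoded response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_epd_request(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "Describe the image",
    "/path/to/embeddings.pt",
)
print(json.dumps(payload, indent=2))
```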