## Encode-Prefill-Decode (EPD) Flow with NIXL
For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
### Enabling the Feature
This is an experimental feature that requires using a specific TensorRT-LLM commit.
To enable it build the dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
```bash
./container/build.sh --framework trtllm --tensorrtllm-commit b4065d8ca64a64eee9fdc64b39cb66d73d4be47c
```
### Key Features
- **High Performance**: Zero-copy RDMA transfer for embeddings
- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON
### How to use
```bash
cd $DYNAMO_HOME/components/backends/trtllm
# Launch 3-worker EPD flow with NIXL
./launch/epd_disagg.sh
```
### Configuration
The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:
```bash
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
```
This endpoint follows Dynamo's standard format: `dyn://namespace.component.endpoint` where the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`.
For local embedding file access, use the `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` parameter to specify the secure directory path where embedding files can be loaded from (default: `/tmp`). This prevents path traversal attacks while allowing flexible file access within the designated directory.
```bash
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
```
For tensor file size protection, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"` parameter to limit the maximum size of downloadable embedding files/Image URLs (default: `50MB`). This prevents Denial of Service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes.
```bash
export MAX_FILE_SIZE_MB=50
```
### Architecture Overview
The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:
- **Encode Worker**: Loads and processes multimodal embeddings
- **Prefill Worker**: Handles initial context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation
### Request Flow Diagrams
#### Prefill-First Disaggregation Strategy
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant PrefillWorker as "Prefill Worker
(AggregatedHandler)"
participant EncodeWorker as "Encode Worker
(EncodeHandler)"
participant DecodeWorker as "Decode Worker
(DecodeHandler)"
participant NIXL as "NIXL
(RDMA Transfer)"
Note over Client,NIXL: Prefill-First Strategy: Context processing first, then streaming generation
Client->>Gateway: POST /v1/chat/completions
(multimodal request)
Gateway->>PrefillWorker: Route request
Note over PrefillWorker: Check for multimodal content
PrefillWorker->>EncodeWorker: Send request
(contains embedding paths)
Note over EncodeWorker: Load embeddings from file/url
EncodeWorker->>NIXL: Create readable operation
EncodeWorker->>PrefillWorker: Send metadata + NIXL info
(JSON: shape, dtype, aux_data)
Note over PrefillWorker: Allocate tensor with dynamic shape
PrefillWorker->>NIXL: Begin read operation
NIXL-->>PrefillWorker: Zero-copy transfer complete
Note over PrefillWorker: Reconstruct embeddings
(mm_embeddings + special_tokens + offsets)
Note over PrefillWorker: Process full context
(text + multimodal embeddings)
Note over PrefillWorker: Generate KV-cache
(max_tokens=1 in prefill mode)
PrefillWorker->>DecodeWorker: Transfer KV-cache + disaggregated_params
(generation_only mode)
Note over DecodeWorker: Continue generation
(streaming tokens)
DecodeWorker->>Gateway: Stream response chunk 1
Gateway->>Client: Response chunk 1
DecodeWorker->>Gateway: Stream response chunk 2
Gateway->>Client: Response chunk 2
DecodeWorker->>Gateway: ... (continue streaming)
Gateway->>Client: ... (continue streaming)
DecodeWorker->>Gateway: Final response + [DONE]
Gateway->>Client: Final response + [DONE]
```
#### Decode-First Disaggregation Strategy
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant DecodeWorker as "Decode Worker
(DecodeHandler)
PRIMARY"
participant PrefillWorker as "Prefill Worker
(PrefillHandler)"
participant EncodeWorker as "Encode Worker
(EncodeHandler)"
participant NIXL as "NIXL
(RDMA Transfer)"
Note over Client,NIXL: Decode-First Strategy: DecodeWorker orchestrates prefill then handles generation
Client->>Gateway: POST /v1/chat/completions
(multimodal request)
Gateway->>DecodeWorker: Route request
(primary worker)
Note over DecodeWorker: Check disaggregation_strategy == DECODE_FIRST
Note over DecodeWorker: Call remote_prefill() to trigger prefill
DecodeWorker->>PrefillWorker: Send request via remote_prefill()
Note over PrefillWorker: Check for multimodal content
PrefillWorker->>EncodeWorker: Send request
(contains embedding paths)
Note over EncodeWorker: Load embeddings from file
EncodeWorker->>NIXL: Create readable operation
EncodeWorker->>PrefillWorker: Send metadata + NIXL info
(JSON: shape, dtype, aux_data)
Note over PrefillWorker: Allocate tensor with dynamic shape
PrefillWorker->>NIXL: Begin read operation
NIXL-->>PrefillWorker: Zero-copy transfer complete
Note over PrefillWorker: Reconstruct embeddings
(mm_embeddings + special_tokens + offsets)
Note over PrefillWorker: Process full context
(text + multimodal embeddings)
Note over PrefillWorker: Generate prefill response
(max_tokens=1 in prefill mode)
PrefillWorker->>DecodeWorker: Return prefill response
(disaggregated_params)
Note over DecodeWorker: Extract disaggregated_params
from prefill_response
Note over DecodeWorker: Update request with params
request["disaggregated_params"] = response_data["disaggregated_params"]
Note over DecodeWorker: Begin local generation
(generate_locally with prefill state)
DecodeWorker->>Gateway: Stream response chunk 1
Gateway->>Client: Response chunk 1
DecodeWorker->>Gateway: Stream response chunk 2
Gateway->>Client: Response chunk 2
DecodeWorker->>Gateway: ... (continue streaming)
Gateway->>Client: ... (continue streaming)
DecodeWorker->>Gateway: Final response + [DONE]
Gateway->>Client: Final response + [DONE]
```
### How the System Works
1. **Request Processing**: Multimodal requests containing embedding file paths OR urls are routed based on disaggregation strategy
2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata
3. **NIXL Transfer**: Main tensors transferred via zero-copy RDMA, small metadata via JSON for efficiency
4. **Dynamic Allocation**: Consumer workers allocate tensors with exact shapes received from EncodeWorker
5. **Reconstruction**: Original embedding format (dictionary or tensor) is reconstructed for model processing
### Example Request
The request format is identical to regular multimodal requests:
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image"},
{
"type": "image_url",
"image_url": {"url": "/path/to/embeddings.pt"}
}
]
}
],
"max_tokens": 160
}'
```