SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Encode-Prefill-Decode (EPD) Flow with NIXL
For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
- workers: For aggregated serving, we have two workers, [MultimodalEncodeWorker](src/dynamo/sglang/request_handlers/multimodal_encode_worker_handler.py) for encoding and [MultimodalWorker](src/dynamo/sglang/request_handlers/multimodal_worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow
The MultimodalEncodeWorker is responsible for encoding the image and passing the embeddings to the MultimodalWorker via a combination of NATS and RDMA.
The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
Its MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../README.md) example.
By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
MultimodalEncodeWorker independently from the prefill and decode workers if needed.
"content":"This image shows a public transit bus on a dimly lit, street-level track in what appears to be a quiet urban neighborhood or suburban area. The bus displays \"OUT OF SERVICE\" in red on its illuminated sign. It is positioned",
"role":"assistant",
"reasoning_content":null
},
"finish_reason":"length"
}
],
"created":1758824222,
"model":"Qwen/Qwen2.5-VL-7B-Instruct",
"object":"chat.completion",
"usage":{
"prompt_tokens":0,
"completion_tokens":40,
"total_tokens":40
}
}
```
## Multimodal Disaggregated Serving
### Components
- workers: For disaggregated serving, we have three workers, [MultimodalEncodeWorker](src/dynamo/sglang/request_handlers/multimodal_encode_worker_handler.py) for encoding, [MultimodalWorker](src/dynamo/sglang/request_handlers/multimodal_worker_handler.py) for decoding, and [MultimodalPrefillWorker](src/dynamo/sglang/request_handlers/multimodal_worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow
For the Qwen2.5-VL model, embeddings are only required during the prefill stage. As such, the image embeddings are transferred using a NIXL descriptor from the encode worker to the worker and then passed to the prefill worker for processing.
The prefill worker performs the prefilling step and forwards the KV cache to the worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../README.md) example.
"content":"This image shows a public transit bus on a dimly lit, street-level track in what appears to be a quiet urban neighborhood or suburban area. The bus displays \"OUT OF SERVICE\" in red on its illuminated sign. It is positioned",
"help":"Use SGLang's tokenizer. This will skip tokenization of the input and output and only v1/chat/completions will be available when using the dynamo frontend. Cannot be used with --custom-jinja-template.",
"help":"Use SGLang's tokenizer. This will skip tokenization of the input and output and only v1/chat/completions will be available when using the dynamo frontend. Cannot be used with --custom-jinja-template.",
},
},
"multimodal-processor":{
"flags":["--multimodal-processor"],
"action":"store_true",
"default":False,
"help":"Run as multimodal processor component for handling multimodal requests",
},
"multimodal-encode-worker":{
"flags":["--multimodal-encode-worker"],
"action":"store_true",
"default":False,
"help":"Run as multimodal encode worker component for processing images/videos",
},
"multimodal-worker":{
"flags":["--multimodal-worker"],
"action":"store_true",
"default":False,
"help":"Run as multimodal worker component for LLM inference with multimodal data",
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
- frontend: HTTP endpoint to handle incoming requests.
### Graph
### Workflow
In this graph, we have two workers, [VllmEncodeWorker](components/encode_worker.py) and [VllmPDWorker](components/worker.py).
In this workflow, we have two workers, [VllmEncodeWorker](components/encode_worker.py) and [VllmPDWorker](components/worker.py).
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the VllmPDWorker via a combination of NATS and RDMA.
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the VllmPDWorker via a combination of NATS and RDMA.
The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../../components/backends/vllm/README.md) example.
By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
VllmEncodeWorker independently from the prefill and decode workers if needed.
VllmEncodeWorker independently from the prefill and decode workers if needed.
This figure shows the flow of the graph:
This figure illustrates the workflow:
```mermaid
```mermaid
flowchart LR
flowchart LR
HTTP --> processor
HTTP --> processor
...
@@ -115,16 +115,16 @@ You should see a response similar to this:
...
@@ -115,16 +115,16 @@ You should see a response similar to this:
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- processor: Tokenizes the prompt and passes it to the VllmEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
- frontend: HTTP endpoint to handle incoming requests.
### Graph
### Workflow
In this graph, we have three workers, [VllmEncodeWorker](components/encode_worker.py), [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
In this workflow, we have three workers, [VllmEncodeWorker](components/encode_worker.py), [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
For the Llava model, embeddings are only required during the prefill stage. As such, the VllmEncodeWorker is connected directly to the prefill worker.
For the Llava model, embeddings are only required during the prefill stage. As such, the VllmEncodeWorker is connected directly to the prefill worker.
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
Its work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
Its work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../../components/backends/vllm/README.md) example.
This figure shows the flow of the graph:
This figure illustrates the workflow:
```mermaid
```mermaid
flowchart LR
flowchart LR
HTTP --> processor
HTTP --> processor
...
@@ -201,11 +201,11 @@ of the model per node.
...
@@ -201,11 +201,11 @@ of the model per node.
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- frontend: HTTP endpoint to handle incoming requests.
- frontend: HTTP endpoint to handle incoming requests.
#### Graph
#### Workflow
In this graph, we have [VllmPDWorker](components/worker.py) which will encode the image, prefill and decode the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
In this workflow, we have [VllmPDWorker](components/worker.py) which will encode the image, prefill and decode the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
This figure shows the flow of the graph:
This figure illustrates the workflow:
```mermaid
```mermaid
flowchart LR
flowchart LR
HTTP --> processor
HTTP --> processor
...
@@ -263,13 +263,13 @@ You should see a response similar to this:
...
@@ -263,13 +263,13 @@ You should see a response similar to this:
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- processor: Tokenizes the prompt and passes it to the VllmPDWorker.
- frontend: HTTP endpoint to handle incoming requests.
- frontend: HTTP endpoint to handle incoming requests.
#### Graph
#### Workflow
In this graph, we have two workers, [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
In this workflow, we have two workers, [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
The prefill worker performs the encoding and prefilling steps and forwards the KV cache to the decode worker for decoding.
The prefill worker performs the encoding and prefilling steps and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.
This figure shows the flow of the graph:
This figure illustrates the workflow:
```mermaid
```mermaid
flowchart LR
flowchart LR
HTTP --> processor
HTTP --> processor
...
@@ -337,16 +337,16 @@ This example demonstrates deploying an aggregated multimodal model that can proc
...
@@ -337,16 +337,16 @@ This example demonstrates deploying an aggregated multimodal model that can proc
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
- frontend: HTTP endpoint to handle incoming requests.
### Graph
### Workflow
In this graph, we have two workers, [VideoEncodeWorker](components/video_encode_worker.py) and [VllmPDWorker](components/worker.py).
In this workflow, we have two workers, [VideoEncodeWorker](components/video_encode_worker.py) and [VllmPDWorker](components/worker.py).
The VideoEncodeWorker is responsible for decoding the video into a series of frames. Unlike the image pipeline which generates embeddings,
The VideoEncodeWorker is responsible for decoding the video into a series of frames. Unlike the image pipeline which generates embeddings,
this pipeline passes the raw frames directly to the VllmPDWorker via a combination of NATS and RDMA.
this pipeline passes the raw frames directly to the VllmPDWorker via a combination of NATS and RDMA.
Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
By separating the video processing from the prefill and decode stages, we can have a more flexible deployment and scale the
By separating the video processing from the prefill and decode stages, we can have a more flexible deployment and scale the
VideoEncodeWorker independently from the prefill and decode workers if needed.
VideoEncodeWorker independently from the prefill and decode workers if needed.
This figure shows the flow of the graph:
This figure illustrates the workflow:
```mermaid
```mermaid
flowchart LR
flowchart LR
HTTP --> processor
HTTP --> processor
...
@@ -425,15 +425,15 @@ This example demonstrates deploying a disaggregated multimodal model that can pr
...
@@ -425,15 +425,15 @@ This example demonstrates deploying a disaggregated multimodal model that can pr
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
- frontend: HTTP endpoint to handle incoming requests.
### Graph
### Workflow
In this graph, we have three workers, [VideoEncodeWorker](components/video_encode_worker.py), [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
In this workflow, we have three workers, [VideoEncodeWorker](components/video_encode_worker.py), [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py).
For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. As such, the VideoEncodeWorker is connected directly to the prefill worker.
For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. As such, the VideoEncodeWorker is connected directly to the prefill worker.
The VideoEncodeWorker is responsible for decoding the video into a series of frames and passing them to the prefill worker via RDMA.
The VideoEncodeWorker is responsible for decoding the video into a series of frames and passing them to the prefill worker via RDMA.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.