docs: Update Multimodal Example README (#1275)

This change corrects the README.md file in the examples/multimodal folder:

- Correct "vllm worker" to "decode worker".
- Correct the assertion that data is moved via NATS; the embeddings are moved via RDMA.

Additionally, this change replaces the textual graphs with Mermaid graphs for improved presentation on github.com.

fb4bf252 · J Wyman · GitHub · f67dc38b · fb4bf252
Commit fb4bf252 (unverified), authored May 29, 2025 by J Wyman, committed by GitHub May 29, 2025

Showing with 75 additions and 69 deletions

examples/multimodal/README.md +75 -69

--- a/examples/multimodal/README.md
+++ b/examples/multimodal/README.md
@@ -24,26 +24,29 @@ The examples are based on the [llava-1.5-7b-hf](https://huggingface.co/llava-hf/
 ### Components
-- workers: For aggregated serving, we have two workers, [encode_worker](components/encode_worker.py) for encoding and [vllm_worker](components/worker.py) for prefilling and decoding.
+- workers: For aggregated serving, we have two workers, [encode_worker](components/encode_worker.py) for encoding and [decode_worker](components/decode_worker.py) for prefilling and decoding.
-- processor: Tokenizes the prompt and passes it to the vllm worker.
+- processor: Tokenizes the prompt and passes it to the decode worker.
-- frontend: Http endpoint to handle incoming requests.
+- frontend: HTTP endpoint to handle incoming requests.
 ### Deployment
-In this deployment, we have two workers, [encode_worker](components/encode_worker.py) and [vllm_worker](components/worker.py).
+In this deployment, we have two workers, [encode_worker](components/encode_worker.py) and [decode_worker](components/decode_worker.py).
-The encode worker is responsible for encoding the image and passing the embeddings to the vllm worker via NATS.
+The encode worker is responsible for encoding the image and passing the embeddings to the decode worker via a combination of NATS and RDMA.
-The vllm worker then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.
+The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
+Its decode worker then prefills and decodes the prompt, just like the [LLM aggregated serving](../llm/README.md) example.
 By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
 encode worker independently from the prefill and decode workers if needed.
 This figure shows the flow of the deployment:
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --> decode_worker
+  decode_worker --> processor
+  decode_worker --image_url--> encode_worker
+  encode_worker --embeddings--> decode_worker
 ```
-+------+      +-----------+      +------------------+      image url       +---------------+
-| HTTP |----->| processor |----->|   vllm worker    |--------------------->| encode worker |
-|      |<-----|           |<-----|                  |<---------------------|               |
-+------+      +-----------+      +------------------+   image embeddings   +---------------+
-```
 ```bash
@@ -58,31 +61,31 @@ In another terminal:
 curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "llava-hf/llava-1.5-7b-hf",
+      "model": "llava-hf/llava-1.5-7b-hf",
-    "messages": [
+      "messages": [
-      {
+        {
-        "role": "user",
+          "role": "user",
-        "content": [
+          "content": [
-          {
+            {
-            "type": "text",
+              "type": "text",
-            "text": "What is in this image?"
+              "text": "What is in this image?"
-          },
+            },
-          {
+            {
-            "type": "image_url",
+              "type": "image_url",
-            "image_url": {
+              "image_url": {
-              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+              }
            }
-          }
+          ]
-        ]
+        }
-      }
+      ],
-    ],
+      "max_tokens": 300,
-    "max_tokens": 300,
+      "stream": false
-    "stream": false
+    }'
-  }'
 ```
 You should see a response similar to this:
-```
+```json
 {"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
 ```
@@ -90,29 +93,32 @@ You should see a response similar to this:
 ### Components
-- workers: For disaggregated serving, we have three workers, [encode_worker](components/encode_worker.py) for encoding, [vllm_worker](components/worker.py) for decoding, and [prefill_worker](components/prefill_worker.py) for prefilling.
+- workers: For disaggregated serving, we have three workers, [encode_worker](components/encode_worker.py) for encoding, [decode_worker](components/decode_worker.py) for decoding, and [prefill_worker](components/prefill_worker.py) for prefilling.
-- processor: Tokenizes the prompt and passes it to the vllm worker.
+- processor: Tokenizes the prompt and passes it to the decode worker.
-- frontend: Http endpoint to handle incoming requests.
+- frontend: HTTP endpoint to handle incoming requests.
 ### Deployment
-In this deployment, we have three workers, [encode_worker](components/encode_worker.py), [vllm_worker](components/worker.py), and [prefill_worker](components/prefill_worker.py).
+In this deployment, we have three workers, [encode_worker](components/encode_worker.py), [decode_worker](components/decode_worker.py), and [prefill_worker](components/prefill_worker.py).
 For the Llava model, embeddings are only required during the prefill stage. As such, the encode worker is connected directly to the prefill worker.
-The encode worker handles image encoding and transmits the resulting embeddings to the prefill worker via NATS.
+The encode worker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
-The prefill worker performs the prefilling step and forwards the KV cache to the vllm worker for decoding.
+Its work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-For more details on the roles of the prefill and vllm workers, refer to the [LLM disaggregated serving](../llm/README.md) example.
+The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../llm/README.md) example.
 This figure shows the flow of the deployment:
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --> decode_worker
+  decode_worker --> processor
+  decode_worker --> prefill_worker
+  prefill_worker --> decode_worker
+  prefill_worker --image_url--> encode_worker
+  encode_worker --embeddings--> prefill_worker
 ```
-+------+      +-----------+      +------------------+      +------------------+      image url       +---------------+
-| HTTP |----->| processor |----->|   vllm worker    |----->|  prefill worker  |--------------------->| encode worker |
-|      |<-----|           |<-----|  (decode worker) |<-----|                  |<---------------------|               |
-+------+      +-----------+      +------------------+      +------------------+   image embeddings   +---------------+
-```
 ```bash
 cd $DYNAMO_HOME/examples/multimodal
 dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml
@@ -125,30 +131,30 @@ In another terminal:
 curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "llava-hf/llava-1.5-7b-hf",
+      "model": "llava-hf/llava-1.5-7b-hf",
-    "messages": [
+      "messages": [
-      {
+        {
-        "role": "user",
+          "role": "user",
-        "content": [
+          "content": [
-          {
+            {
-            "type": "text",
+              "type": "text",
-            "text": "What is in this image?"
+              "text": "What is in this image?"
-          },
+            },
-          {
+            {
-            "type": "image_url",
+              "type": "image_url",
-            "image_url": {
+              "image_url": {
-              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+              }
            }
-          }
+          ]
-        ]
+        }
-      }
+      ],
-    ],
+      "max_tokens": 300,
-    "max_tokens": 300,
+      "stream": false
-    "stream": false
+    }'
-  }'
 ```
 You should see a response similar to this:
-```
+```json
 {"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
 ```
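
For readers who would rather script the request than use curl, the following is a minimal Python sketch of the same call, using only the standard library. The endpoint, model name, and payload shape are copied from the curl command in the diff above. The helper names `build_payload`, `extract_text`, and `ask` are hypothetical and are not part of the example's code. The response parsing assumes a non-streaming reply shaped like the sample output shown in the README.

```python
import json
import urllib.request

# Endpoint and model name are taken from the curl example in the README diff.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llava-hf/llava-1.5-7b-hf"


def build_payload(question, image_url, max_tokens=300):
    """Build the same JSON body that the curl example sends (hypothetical helper)."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }


def extract_text(response_body):
    """Return the assistant text from a non-streaming response (hypothetical helper).

    Assumes the response has the shape of the sample output in the README.
    """
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"].strip()


def ask(question, image_url, url=URL, timeout=120):
    """POST the request; the `dynamo serve` deployment must already be running."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(question, image_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(response.read().decode("utf-8"))
```

With either the aggregated or the disaggregated deployment running, calling `ask("What is in this image?", "http://images.cocodataset.org/test2017/000000155781.jpg")` should return a description similar to the sample responses, since both graphs expose the same HTTP endpoint.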