Unverified commit 87112373, authored by Yan Ru Pei, committed by GitHub

docs: router cost function mermaid (#4943)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent 45e881d3
@@ -84,38 +84,6 @@ This includes the specific commit [vllm-project/vllm#19790](https://github.com/v
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
-This figure shows an overview of the major components to deploy:
-```mermaid
-%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'10px', 'primaryColor':'#2e8b57', 'primaryTextColor':'#fff', 'primaryBorderColor':'#333', 'lineColor':'#81b1db', 'secondaryColor':'#b35900', 'tertiaryColor':'#808080', 'edgeLabelBackground':'transparent'}}}%%
-graph TD
-%% Node Definitions with custom shapes
-HTTP[HTTP]
-ROUTER[Router]
-PREFILL[vLLM Prefill Worker]
-DECODE[vLLM Decode Worker]
-%% Class Definitions for color
-classDef worker_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff;
-classDef router_style fill:#b35900,stroke:#333,stroke-width:2px,color:#fff;
-%% Applying classes to nodes
-class PREFILL,DECODE worker_style
-class ROUTER router_style
-%% Request/Response flow
-HTTP <--> |"request/response"| ROUTER
-ROUTER --> |"1. send to prefill"| PREFILL
-PREFILL --> |"2. return NIXL metadata"| ROUTER
-ROUTER --> |"3. send with metadata"| DECODE
-DECODE --> |"4. stream response"| ROUTER
-%% KV Events publishing
-PREFILL -.-> |"publish kv events"| ROUTER
-```
-Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
### Aggregated Serving
```bash
......
@@ -88,27 +88,64 @@ When both workers are registered, requests are automatically routed.
```python
# Decode worker registration (in your decode worker)
+decode_endpoint = runtime.namespace("dynamo").component("decode").endpoint("generate")
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
-    endpoint=generate_endpoint,
+    endpoint=decode_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)
+await decode_endpoint.serve_endpoint(decode_handler.generate)
# Prefill worker registration (in your prefill worker)
+prefill_endpoint = runtime.namespace("dynamo").component("prefill").endpoint("generate")
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Prefill,  # <-- Mark as prefill worker
-    endpoint=generate_endpoint,
+    endpoint=prefill_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",  # Must match decode model name
    # ... other parameters
)
+await prefill_endpoint.serve_endpoint(prefill_handler.generate)
```
> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
+### Request Flow
+The following diagram shows an overview of the major components in disaggregated serving:
+```mermaid
+graph TD
+HTTP[HTTP]
+ROUTER[Router]
+PREFILL[Prefill Worker]
+DECODE[Decode Worker]
+classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333;
+classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff;
+class PREFILL,DECODE worker_style
+class ROUTER router_style
+HTTP <--> |"request/response"| ROUTER
+ROUTER --> |"1. send to prefill"| PREFILL
+PREFILL --> |"2. return NIXL metadata"| ROUTER
+ROUTER --> |"3. send with metadata"| DECODE
+DECODE --> |"4. stream response"| ROUTER
+PREFILL -.-> |"publish kv events"| ROUTER
+linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px
+linkStyle 5 stroke:#2196f3,stroke-width:2px
+```
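The four numbered edges in the Request Flow diagram can be read as a small sequential protocol. The following is a minimal sketch of that ordering only, not Dynamo's implementation: `prefill`, `decode`, `route`, and `collect` are hypothetical stand-ins, and the metadata dictionary is a placeholder for whatever NIXL metadata the prefill worker actually returns.

```python
import asyncio
from typing import AsyncIterator


async def prefill(tokens: list[int]) -> dict:
    """Hypothetical prefill worker: returns placeholder transfer metadata (step 2)."""
    return {"kv_transfer": {"num_tokens": len(tokens)}}


async def decode(tokens: list[int], metadata: dict) -> AsyncIterator[str]:
    """Hypothetical decode worker: streams placeholder output chunks (step 4)."""
    for i in range(3):
        await asyncio.sleep(0)  # yield control, as a real streaming worker would
        yield f"tok{i}"


async def route(tokens: list[int]) -> AsyncIterator[str]:
    """Hypothetical router: prefill first, then decode with the returned metadata."""
    metadata = await prefill(tokens)              # steps 1-2
    async for chunk in decode(tokens, metadata):  # steps 3-4
        yield chunk


async def collect(tokens: list[int]) -> list[str]:
    """Gather the streamed chunks into a list for inspection."""
    return [chunk async for chunk in route(tokens)]


print(asyncio.run(collect([1, 2, 3])))
```

The point of the sketch is the dependency between the steps: the decode call cannot start until the prefill call has returned its metadata, while the response itself is streamed back chunk by chunk.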
## Overview
The KV-aware router operates on two key principles to optimize request routing:
@@ -153,13 +190,15 @@ graph TD
JS -->|Consume as Durable Consumer| R2
JS -->|Periodic Snapshot| OS
-style JS fill:#e1f5fe,color:#5a850f
-style OS fill:#e8f5e9,color:#5a850f
-style E1 fill:#fff3e0,color:#5a850f
-style E2 fill:#fff3e0,color:#5a850f
-style E3 fill:#fff3e0,color:#5a850f
-style R1 fill:#f3e5f5,color:#5a850f
-style R2 fill:#f3e5f5,color:#5a850f
+style JS fill:#e1f5fe,stroke:#333,color:#333
+style OS fill:#e1f5fe,stroke:#333,color:#333
+style E1 fill:#f3e5f5,stroke:#333,color:#333
+style E2 fill:#f3e5f5,stroke:#333,color:#333
+style E3 fill:#f3e5f5,stroke:#333,color:#333
+style R1 fill:#2e8b57,stroke:#333,color:#fff
+style R2 fill:#2e8b57,stroke:#333,color:#fff
+linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px
```
#### Mode 2: NATS Core with Local Indexer
@@ -194,12 +233,14 @@ graph TD
NC -->|Subscribe| R1
NC -->|Subscribe| R2
-style NC fill:#e1f5fe,color:#5a850f
-style E1 fill:#fff3e0,color:#5a850f
-style E2 fill:#fff3e0,color:#5a850f
-style E3 fill:#fff3e0,color:#5a850f
-style R1 fill:#f3e5f5,color:#5a850f
-style R2 fill:#f3e5f5,color:#5a850f
+style NC fill:#e1f5fe,stroke:#333,color:#333
+style E1 fill:#f3e5f5,stroke:#333,color:#333
+style E2 fill:#f3e5f5,stroke:#333,color:#333
+style E3 fill:#f3e5f5,stroke:#333,color:#333
+style R1 fill:#2e8b57,stroke:#333,color:#fff
+style R2 fill:#2e8b57,stroke:#333,color:#fff
+linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px
```
**How gap detection works:**
@@ -372,20 +413,21 @@ To get a feel for how KV Cache management works on a single worker with KV Cache
Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
## KV Cache Routing and Load Balancing
-```text
-+---------+          +------------------+           +---------+
-| Tokens  |--------->| KV Aware Router  |---------> | Worker 2|
-+---------+          +------------------+           +---------+
-                              |
-           +------------------+------------------+
-           |                  |                  |
-           | Cached: 2 blocks | Cached: 5 blocks | Cached: 8 blocks
-           | Prefill: 8 blks  | Prefill: 5 blks  | Prefill: 2 blks
-           | Decode: 10 blks  | Decode: 5 blks   | Decode: 9 blks
-           v                  v                  v
-   +----------------+ +----------------+ +----------------+
-   |    Worker 1    | |    Worker 2    | |    Worker 3    |
-   +----------------+ +----------------+ +----------------+
+```mermaid
+graph TD
+T[Tokens] --> R[KV Aware Router]
+R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
+R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
+R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
+style T fill:#fff3e0,stroke:#333,color:#333
+style R fill:#2e8b57,stroke:#333,color:#fff
+style W1 fill:#f3e5f5,stroke:#333,color:#333
+style W2 fill:#c8e6c9,stroke:#333,color:#333
+style W3 fill:#f3e5f5,stroke:#333,color:#333
+linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
......
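The worker selection shown in the KV Cache Routing and Load Balancing diagram can be reproduced with a simple cost function. The visible part of the diff does not state the formula, so the sketch below assumes the cost is a weighted sum of the blocks still to be prefilled and the blocks in active decode, with the lowest cost winning. `WorkerState`, `cost`, `select_worker`, and `prefill_weight` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    prefill_blocks: int  # blocks that would still need prefill on this worker
    decode_blocks: int   # blocks occupied by active decoding on this worker


def cost(worker: WorkerState, prefill_weight: float = 1.0) -> float:
    """Assumed cost: weighted prefill work plus current decode load."""
    return prefill_weight * worker.prefill_blocks + worker.decode_blocks


def select_worker(workers: list[WorkerState], prefill_weight: float = 1.0) -> WorkerState:
    """Pick the worker with the lowest assumed cost."""
    return min(workers, key=lambda w: cost(w, prefill_weight))


# Block counts taken from the diagram.
workers = [
    WorkerState("Worker 1", prefill_blocks=8, decode_blocks=10),
    WorkerState("Worker 2", prefill_blocks=5, decode_blocks=5),
    WorkerState("Worker 3", prefill_blocks=2, decode_blocks=9),
]

best = select_worker(workers)
print(best.name, cost(best))  # Worker 2 10.0
```

Under this assumption the costs are 18, 10, and 11, which matches the diagram's choice of Worker 2. Worker 3 has the most cached blocks and therefore the least prefill work, but its higher decode load makes it slightly more expensive overall.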