Unverified commit e0373bd7, authored by dagil-nvidia, committed by GitHub

docs: alphabetize backends (SGLang, TensorRT-LLM, vLLM) (#6537)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 80955ef4
...@@ -8,7 +8,7 @@ title: FlexKV ...@@ -8,7 +8,7 @@ title: FlexKV
## Introduction ## Introduction
[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM, and SGLang. [FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like SGLang, TensorRT-LLM, and vLLM.
### Key Features ### Key Features
......
...@@ -17,7 +17,7 @@ Events are published over the **Dynamo event plane**, a transport-agnostic pub/s ...@@ -17,7 +17,7 @@ Events are published over the **Dynamo event plane**, a transport-agnostic pub/s
`KvEventPublisher` supports two publishing modes: `KvEventPublisher` supports two publishing modes:
1. **Direct publishing** — Your engine calls `publish_stored()` / `publish_removed()` to push events directly over the event plane. Simplest approach for custom engines. 1. **Direct publishing** — Your engine calls `publish_stored()` / `publish_removed()` to push events directly over the event plane. Simplest approach for custom engines.
2. **ZMQ relay** — For engines that emit raw KV events over a ZMQ socket (like vLLM and SGLang). The publisher subscribes to the ZMQ endpoint and relays events to the event plane automatically. 2. **ZMQ relay** — For engines that emit raw KV events over a ZMQ socket (like SGLang and vLLM). The publisher subscribes to the ZMQ endpoint and relays events to the event plane automatically.
## Event Types ## Event Types
...@@ -137,11 +137,11 @@ async def main(): ...@@ -137,11 +137,11 @@ async def main():
## ZMQ Relay (For Engines with Raw KV Events) ## ZMQ Relay (For Engines with Raw KV Events)
For engines that already publish raw KV events over a ZMQ socket (like vLLM and SGLang), use the same `KvEventPublisher` with a `zmq_endpoint`. The publisher subscribes to the ZMQ socket and relays events to the event plane automatically. For engines that already publish raw KV events over a ZMQ socket (like SGLang and vLLM), use the same `KvEventPublisher` with a `zmq_endpoint`. The publisher subscribes to the ZMQ socket and relays events to the event plane automatically.
```mermaid ```mermaid
flowchart LR flowchart LR
subgraph Engine["Custom Engine / vLLM / SGLang"] subgraph Engine["Custom Engine / SGLang / vLLM"]
cache["KV Cache Manager"] cache["KV Cache Manager"]
zmq_pub["ZMQ Publisher"] zmq_pub["ZMQ Publisher"]
end end
...@@ -170,7 +170,7 @@ flowchart LR ...@@ -170,7 +170,7 @@ flowchart LR
``` ```
**When to use:** **When to use:**
- Your engine already publishes KV events via ZMQ (like vLLM or SGLang) - Your engine already publishes KV events via ZMQ (like SGLang or vLLM)
- You want to decouple event publishing from your engine's main loop - You want to decouple event publishing from your engine's main loop
### Setup ### Setup
...@@ -192,7 +192,7 @@ No further calls to `publish_stored()` / `publish_removed()` are needed — the ...@@ -192,7 +192,7 @@ No further calls to `publish_stored()` / `publish_removed()` are needed — the
### ZMQ Wire Format ### ZMQ Wire Format
The ZMQ message format (compatible with vLLM / SGLang): The ZMQ message format (compatible with SGLang / vLLM):
| Frame | Description | | Frame | Description |
|-------|-------------| |-------|-------------|
......
...@@ -212,7 +212,7 @@ Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits) ...@@ -212,7 +212,7 @@ Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits)
| Field | Required | Affects Hash | Example | | Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------| |-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` | | `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `framework` | ✓ | ✓ | `vllm`, `sglang`, `trtllm` | | `framework` | ✓ | ✓ | `sglang`, `trtllm`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` | | `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) | | `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) | | `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
......
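To illustrate what the "Affects Hash" column means, here is a minimal sketch of deriving a 16-character (64-bit) identifier from identity fields. Only the SHA256 truncation to 16 hex characters comes from the text above. The helper name `checkpoint_id` and the canonical-JSON serialization are assumptions made for illustration, not Dynamo's actual implementation.

```python
import hashlib
import json


def checkpoint_id(identity: dict) -> str:
    """Return a 16-hex-character (64-bit) ID for a checkpoint identity.

    Hypothetical helper: serializing the fields as sorted, compact JSON
    is an assumption; the real serialization is not shown in this diff.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two identities that differ only in a hash-affecting field.
base = {
    "model": "meta-llama/Llama-3-8B",
    "framework": "vllm",
    "tensorParallelSize": 1,
}
other = dict(base, framework="sglang")

print(checkpoint_id(base))
print(checkpoint_id(other))
```

Under these assumptions, changing any hash-affecting field such as `framework` yields a different identifier, while key order in the input dictionary does not matter.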
...@@ -437,7 +437,7 @@ status: ...@@ -437,7 +437,7 @@ status:
- For HuggingFace: Verify token is valid, repo exists and is accessible - For HuggingFace: Verify token is valid, repo exists and is accessible
2. **Invalid LoRA format** 2. **Invalid LoRA format**
**Solution:** Ensure your LoRA weights are in the format expected by your backend framework (vLLM, SGLang, etc.) **Solution:** Ensure your LoRA weights are in the format expected by your backend framework (SGLang, vLLM, etc.)
3. **Endpoint API errors** 3. **Endpoint API errors**
```bash ```bash
......
...@@ -194,7 +194,7 @@ curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ ...@@ -194,7 +194,7 @@ curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
**Timeline:** **Timeline:**
``` ```
Timeline: 0, 1, ... Timeline: 0, 1, ...
Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (vLLM, SGLang, TRT) Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (SGLang, TRT, vLLM)
│request start │received │ │request start │received │
| | | | | |
│ ├──> start prefill ──> first token ──> |last token │ ├──> start prefill ──> first token ──> |last token
......
...@@ -16,20 +16,20 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu ...@@ -16,20 +16,20 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
## Quick Comparison ## Quick Comparison
| Feature | vLLM | TensorRT-LLM | SGLang | Source | | Feature | SGLang | TensorRT-LLM | vLLM | Source |
| :--- | :---: | :---: | :---: | :--- | | :--- | :---: | :---: | :---: | :--- |
| **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc][disagg] | | **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc][disagg] |
| **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc][kv-routing] | | **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc][kv-routing] |
| **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] | | **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] |
| **KV Block Manager** | | ✅ | 🚧 | [KVBM Doc][kvbm] | | **KV Block Manager** | 🚧 | ✅ | | [KVBM Doc][kvbm] |
| **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] | | **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] |
| **Multimodal (Video)** | | | | [Multimodal Doc][mm] | | **Multimodal (Video)** | | | | [Multimodal Doc][mm] |
| **Multimodal (Audio)** | 🚧 | | | [Multimodal Doc][mm] | | **Multimodal (Audio)** | | | 🚧 | [Multimodal Doc][mm] |
| **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] | | **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] |
| **Request Cancellation** | | ✅ | 🚧 | Backend READMEs | | **Request Cancellation** | 🚧 | ✅ | | Backend READMEs |
| **LoRA** | | | | [K8s Guide][lora] | | **LoRA** | | | | [K8s Guide][lora] |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] | | **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | | ✅ | 🚧 | Backend READMEs | | **Speculative Decoding** | 🚧 | ✅ | | Backend READMEs |
## 1. vLLM Backend ## 1. vLLM Backend
......
...@@ -48,7 +48,7 @@ We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtl ...@@ -48,7 +48,7 @@ We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtl
| Package | Description | Python | Platform | PyPI | | Package | Description | Python | Platform | PyPI |
|---------|-------------|--------|----------|------| |---------|-------------|--------|----------|------|
| `ai-dynamo==0.8.1` | Main package with backend integrations (vLLM, SGLang, TRT-LLM) | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo/0.8.1/) | | `ai-dynamo==0.8.1` | Main package with backend integrations (SGLang, TRT-LLM, vLLM) | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo/0.8.1/) |
| `ai-dynamo-runtime==0.8.1` | Core Python bindings for Dynamo runtime | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo-runtime/0.8.1/) | | `ai-dynamo-runtime==0.8.1` | Core Python bindings for Dynamo runtime | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo-runtime/0.8.1/) |
| `kvbm==0.8.1` | KV Block Manager for disaggregated KV cache | `3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/kvbm/0.8.1/) | | `kvbm==0.8.1` | KV Block Manager for disaggregated KV cache | `3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/kvbm/0.8.1/) |
...@@ -75,7 +75,7 @@ We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtl ...@@ -75,7 +75,7 @@ We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtl
### Container Images (NGC) ### Container Images (NGC)
> For detailed run instructions, see the [Container README](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md) or backend-specific guides: [vLLM](../backends/vllm/README.md) | [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md) > For detailed run instructions, see the [Container README](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md) or backend-specific guides: [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md) | [vLLM](../backends/vllm/README.md)
```bash ```bash
# Runtime containers # Runtime containers
...@@ -158,7 +158,7 @@ For a complete list of known issues, refer to the release notes for each patch: ...@@ -158,7 +158,7 @@ For a complete list of known issues, refer to the release notes for each patch:
- **v0.8.1.post1 Patch**: Updated TRT-LLM to `v1.2.0rc6.post2` (PyPI wheels and TRT-LLM container only) - **v0.8.1.post1 Patch**: Updated TRT-LLM to `v1.2.0rc6.post2` (PyPI wheels and TRT-LLM container only)
- **Standalone Frontend Container**: `dynamo-frontend` added in v0.8.0 - **Standalone Frontend Container**: `dynamo-frontend` added in v0.8.0
- **CUDA 13 Runtimes**: Experimental CUDA 13 runtime for vLLM and SGLang in v0.8.0 - **CUDA 13 Runtimes**: Experimental CUDA 13 runtime for SGLang and vLLM in v0.8.0
- **New Rust Crates**: `dynamo-memory` and `dynamo-config` added in v0.8.0 - **New Rust Crates**: `dynamo-memory` and `dynamo-config` added in v0.8.0
### GitHub Releases ### GitHub Releases
......
...@@ -14,7 +14,7 @@ This document provides the support matrix for Dynamo, including hardware, softwa ...@@ -14,7 +14,7 @@ This document provides the support matrix for Dynamo, including hardware, softwa
The following table shows the backend framework versions included with each Dynamo release: The following table shows the backend framework versions included with each Dynamo release:
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **NIXL** | | **Dynamo** | **SGLang** | **TensorRT-LLM** | **vLLM** | **NIXL** |
| :--- | :--- | :--- | :--- | :--- | | :--- | :--- | :--- | :--- | :--- |
| **main (ToT)** | `0.15.1` | `0.5.9` | `1.3.0rc3` | `0.9.0` | | **main (ToT)** | `0.15.1` | `0.5.9` | `1.3.0rc3` | `0.9.0` |
| **v1.0.0** *(planned)* | `0.15.0` | *Latest as of 2/17* | *Latest as of 2/17* | `0.10.0` | | **v1.0.0** *(planned)* | `0.15.0` | *Latest as of 2/17* | *Latest as of 2/17* | `0.10.0` |
...@@ -44,14 +44,14 @@ The following table shows the backend framework versions included with each Dyna ...@@ -44,14 +44,14 @@ The following table shows the backend framework versions included with each Dyna
### CUDA Versions by Backend ### CUDA Versions by Backend
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **Notes** | | **Dynamo** | **SGLang** | **TensorRT-LLM** | **vLLM** | **Notes** |
| :--- | :--- | :--- | :--- | :--- | | :--- | :--- | :--- | :--- | :--- |
| **v0.8.1** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support | | **v0.8.1** | `12.9`, `13.0` | `13.0` | `12.9`, `13.0` | Experimental SGLang/vLLM CUDA 13 support |
| **v0.8.0** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support | | **v0.8.0** | `12.9`, `13.0` | `13.0` | `12.9`, `13.0` | Experimental SGLang/vLLM CUDA 13 support |
| **v0.7.1** | `12.9` | `12.8` | `13.0` | | | **v0.7.1** | `12.8` | `13.0` | `12.9` | |
| **v0.7.0** | `12.8` | `12.9` | `13.0` | TensorRT-LLM CUDA 13 support - CUDA 12.9 deprecated | | **v0.7.0** | `12.9` | `13.0` | `12.8` | TensorRT-LLM CUDA 13 support - CUDA 12.9 deprecated |
| **v0.6.1** | `12.8` | `12.9` | `12.9` | | | **v0.6.1** | `12.9` | `12.9` | `12.8` | |
| **v0.6.0** | `12.8` | `12.8` | `12.9` | | | **v0.6.0** | `12.8` | `12.9` | `12.8` | |
Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version. Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version.
...@@ -101,22 +101,22 @@ Dynamo container images include CUDA toolkit libraries. The host machine must ha ...@@ -101,22 +101,22 @@ Dynamo container images include CUDA toolkit libraries. The host machine must ha
| Dynamo Version | Backend | CUDA Toolkit | Min Driver (Linux) | Min Driver (Windows) | Notes | | Dynamo Version | Backend | CUDA Toolkit | Min Driver (Linux) | Min Driver (Windows) | Notes |
| :--- | :--- | :--- | :--- | :--- | :--- | | :--- | :--- | :--- | :--- | :--- | :--- |
| **0.8.1** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | | | **0.8.1** | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental | | | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.8.0** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental | | | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | | | **0.8.0** | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental | | | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.7.1** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | **SGLang** | 12.8 | 570.xx+ | 571.xx+ | | | | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| **0.7.1** | **SGLang** | 12.8 | 570.xx+ | 571.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.7.0** | **vLLM** | 12.8 | 570.xx+ | 571.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | | | **0.7.0** | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| | **vLLM** | 12.8 | 570.xx+ | 571.xx+ | |
Experimental CUDA 13 images are not published for all versions. Check [Release Artifacts](release-artifacts.md) for availability. Experimental CUDA 13 images are not published for all versions. Check [Release Artifacts](release-artifacts.md) for availability.
......
...@@ -30,7 +30,7 @@ Templates for creating consistent Dynamo documentation. ...@@ -30,7 +30,7 @@ Templates for creating consistent Dynamo documentation.
└──────────────────────────────────────────────────────────────┘ └──────────────────────────────────────────────────────────────┘
``` ```
### Backends (vLLM, SGLang, TRT-LLM) ### Backends (SGLang, TRT-LLM, vLLM)
``` ```
┌─────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────┐
......
...@@ -38,7 +38,7 @@ block2 in B: seq_hash = hash(hash(hash(block0') || block1') || block2) = 0x2222 ...@@ -38,7 +38,7 @@ block2 in B: seq_hash = hash(hash(hash(block0') || block1') || block2) = 0x2222
> **Important: Engine-Provided Hashes** > **Important: Engine-Provided Hashes**
> >
> In practice, the `ExternalSequenceBlockHash` may come directly from the inference engine (e.g., vLLM, TensorRT-LLM) using a rolling hash algorithm that we don't know or control. The engine computes these hashes internally and reports them via KV cache events. > In practice, the `ExternalSequenceBlockHash` may come directly from the inference engine (e.g., TensorRT-LLM, vLLM) using a rolling hash algorithm that we don't know or control. The engine computes these hashes internally and reports them via KV cache events.
> >
> **LoRA identity**: The engine is responsible for incorporating the LoRA adapter identity into the `ExternalSequenceBlockHash` before emitting KV events. Dynamo does not add LoRA information at the router layer. For example, vLLM does this via `_gen_lora_extra_hash_keys`, which appends the LoRA ID as extra keys when calling `hash_block_tokens(..., extra_keys)`. Any engine integrating with the KV router must follow the same convention to ensure correct cache isolation between LoRA adapters. > **LoRA identity**: The engine is responsible for incorporating the LoRA adapter identity into the `ExternalSequenceBlockHash` before emitting KV events. Dynamo does not add LoRA information at the router layer. For example, vLLM does this via `_gen_lora_extra_hash_keys`, which appends the LoRA ID as extra keys when calling `hash_block_tokens(..., extra_keys)`. Any engine integrating with the KV router must follow the same convention to ensure correct cache isolation between LoRA adapters.
> >
......
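The chained form `seq_hash = hash(hash(hash(block0) || block1) || block2)` from the hunk above can be sketched as follows. This is an illustration only: as the note says, the engine's real rolling hash is not known or controlled here, so truncated SHA-256 stands in for it, and the names `block_hash` and `sequence_hashes` are hypothetical.

```python
import hashlib
from typing import Iterable, List


def block_hash(parent: bytes, tokens: Iterable[int]) -> bytes:
    """Hash one block's tokens chained onto the parent sequence hash.

    Assumption: SHA-256 truncated to 8 bytes stands in for the engine's
    own (unknown) rolling hash. Token IDs must fit in 32 bits.
    """
    payload = parent + b"".join(t.to_bytes(4, "little") for t in tokens)
    return hashlib.sha256(payload).digest()[:8]


def sequence_hashes(blocks: List[List[int]]) -> List[str]:
    """Return the chained sequence hash of every block, as hex strings."""
    parent = b""  # the first block has no parent, so this is hash(block0)
    hashes = []
    for tokens in blocks:
        parent = block_hash(parent, tokens)
        hashes.append(parent.hex())
    return hashes


# Sequences A and B end with the same block but differ in block 0.
seq_a = sequence_hashes([[1, 2], [3, 4], [5, 6]])
seq_b = sequence_hashes([[9, 2], [3, 4], [5, 6]])

print(seq_a[2], seq_b[2])
```

Because each hash folds in its parent, the final blocks of the two sequences get different sequence hashes even though their tokens are identical, while sequences that share a prefix share the hashes for that prefix.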