Commit e0373bd7 authored by dagil-nvidia, committed by GitHub

docs: alphabetize backends (SGLang, TensorRT-LLM, vLLM) (#6537)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 80955ef4
......@@ -8,7 +8,7 @@ title: FlexKV
## Introduction
[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM, and SGLang.
[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like SGLang, TensorRT-LLM, and vLLM.
### Key Features
......
......@@ -17,7 +17,7 @@ Events are published over the **Dynamo event plane**, a transport-agnostic pub/s
`KvEventPublisher` supports two publishing modes:
1. **Direct publishing** — Your engine calls `publish_stored()` / `publish_removed()` to push events directly over the event plane. Simplest approach for custom engines.
2. **ZMQ relay** — For engines that emit raw KV events over a ZMQ socket (like vLLM and SGLang). The publisher subscribes to the ZMQ endpoint and relays events to the event plane automatically.
2. **ZMQ relay** — For engines that emit raw KV events over a ZMQ socket (like SGLang and vLLM). The publisher subscribes to the ZMQ endpoint and relays events to the event plane automatically.
## Event Types
......@@ -137,11 +137,11 @@ async def main():
## ZMQ Relay (For Engines with Raw KV Events)
For engines that already publish raw KV events over a ZMQ socket (like vLLM and SGLang), use the same `KvEventPublisher` with a `zmq_endpoint`. The publisher subscribes to the ZMQ socket and relays events to the event plane automatically.
For engines that already publish raw KV events over a ZMQ socket (like SGLang and vLLM), use the same `KvEventPublisher` with a `zmq_endpoint`. The publisher subscribes to the ZMQ socket and relays events to the event plane automatically.
```mermaid
flowchart LR
subgraph Engine["Custom Engine / vLLM / SGLang"]
subgraph Engine["Custom Engine / SGLang / vLLM"]
cache["KV Cache Manager"]
zmq_pub["ZMQ Publisher"]
end
......@@ -170,7 +170,7 @@ flowchart LR
```
**When to use:**
- Your engine already publishes KV events via ZMQ (like vLLM or SGLang)
- Your engine already publishes KV events via ZMQ (like SGLang or vLLM)
- You want to decouple event publishing from your engine's main loop
### Setup
......@@ -192,7 +192,7 @@ No further calls to `publish_stored()` / `publish_removed()` are needed — the
### ZMQ Wire Format
The ZMQ message format (compatible with vLLM / SGLang):
The ZMQ message format (compatible with SGLang / vLLM):
| Frame | Description |
|-------|-------------|
......
......@@ -212,7 +212,7 @@ Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits)
| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `framework` | ✓ | ✓ | `vllm`, `sglang`, `trtllm` |
| `framework` | ✓ | ✓ | `sglang`, `trtllm`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
......
......@@ -437,7 +437,7 @@ status:
- For HuggingFace: Verify token is valid, repo exists and is accessible
2. **Invalid LoRA format**
**Solution:** Ensure your LoRA weights are in the format expected by your backend framework (vLLM, SGLang, etc.)
**Solution:** Ensure your LoRA weights are in the format expected by your backend framework (SGLang, vLLM, etc.)
3. **Endpoint API errors**
```bash
......
......@@ -194,7 +194,7 @@ curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
**Timeline:**
```
Timeline: 0, 1, ...
Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (vLLM, SGLang, TRT)
Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (SGLang, TRT, vLLM)
│request start │received │
| | |
│ ├──> start prefill ──> first token ──> |last token
......
......@@ -16,20 +16,20 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
## Quick Comparison
| Feature | vLLM | TensorRT-LLM | SGLang | Source |
| Feature | SGLang | TensorRT-LLM | vLLM | Source |
| :--- | :---: | :---: | :---: | :--- |
| **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc][disagg] |
| **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc][kv-routing] |
| **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] |
| **KV Block Manager** | | ✅ | 🚧 | [KVBM Doc][kvbm] |
| **KV Block Manager** | 🚧 | ✅ | | [KVBM Doc][kvbm] |
| **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] |
| **Multimodal (Video)** | | | | [Multimodal Doc][mm] |
| **Multimodal (Audio)** | 🚧 | | | [Multimodal Doc][mm] |
| **Multimodal (Video)** | | | | [Multimodal Doc][mm] |
| **Multimodal (Audio)** | | | 🚧 | [Multimodal Doc][mm] |
| **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] |
| **Request Cancellation** | | ✅ | 🚧 | Backend READMEs |
| **LoRA** | | | | [K8s Guide][lora] |
| **Request Cancellation** | 🚧 | ✅ | | Backend READMEs |
| **LoRA** | | | | [K8s Guide][lora] |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | | ✅ | 🚧 | Backend READMEs |
| **Speculative Decoding** | 🚧 | ✅ | | Backend READMEs |
## 1. vLLM Backend
......
......@@ -48,7 +48,7 @@ We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtl
| Package | Description | Python | Platform | PyPI |
|---------|-------------|--------|----------|------|
| `ai-dynamo==0.8.1` | Main package with backend integrations (vLLM, SGLang, TRT-LLM) | `3.10` to `3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo/0.8.1/) |
| `ai-dynamo==0.8.1` | Main package with backend integrations (SGLang, TRT-LLM, vLLM) | `3.10` to `3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo/0.8.1/) |
| `ai-dynamo-runtime==0.8.1` | Core Python bindings for Dynamo runtime | `3.10` to `3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/ai-dynamo-runtime/0.8.1/) |
| `kvbm==0.8.1` | KV Block Manager for disaggregated KV cache | `3.12` | Linux (glibc `v2.28+`) | [link](https://pypi.org/project/kvbm/0.8.1/) |
......@@ -75,7 +75,7 @@ We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtl
### Container Images (NGC)
> For detailed run instructions, see the [Container README](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md) or backend-specific guides: [vLLM](../backends/vllm/README.md) | [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md)
> For detailed run instructions, see the [Container README](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md) or backend-specific guides: [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md) | [vLLM](../backends/vllm/README.md)
```bash
# Runtime containers
......@@ -158,7 +158,7 @@ For a complete list of known issues, refer to the release notes for each patch:
- **v0.8.1.post1 Patch**: Updated TRT-LLM to `v1.2.0rc6.post2` (PyPI wheels and TRT-LLM container only)
- **Standalone Frontend Container**: `dynamo-frontend` added in v0.8.0
- **CUDA 13 Runtimes**: Experimental CUDA 13 runtime for vLLM and SGLang in v0.8.0
- **CUDA 13 Runtimes**: Experimental CUDA 13 runtime for SGLang and vLLM in v0.8.0
- **New Rust Crates**: `dynamo-memory` and `dynamo-config` added in v0.8.0
### GitHub Releases
......
......@@ -14,7 +14,7 @@ This document provides the support matrix for Dynamo, including hardware, softwa
The following table shows the backend framework versions included with each Dynamo release:
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **NIXL** |
| **Dynamo** | **SGLang** | **TensorRT-LLM** | **vLLM** | **NIXL** |
| :--- | :--- | :--- | :--- | :--- |
| **main (ToT)** | `0.15.1` | `0.5.9` | `1.3.0rc3` | `0.9.0` |
| **v1.0.0** *(planned)* | `0.15.0` | *Latest as of 2/17* | *Latest as of 2/17* | `0.10.0` |
......@@ -44,14 +44,14 @@ The following table shows the backend framework versions included with each Dyna
### CUDA Versions by Backend
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **Notes** |
| **Dynamo** | **SGLang** | **TensorRT-LLM** | **vLLM** | **Notes** |
| :--- | :--- | :--- | :--- | :--- |
| **v0.8.1** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support |
| **v0.8.0** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support |
| **v0.7.1** | `12.9` | `12.8` | `13.0` | |
| **v0.7.0** | `12.8` | `12.9` | `13.0` | TensorRT-LLM CUDA 13 support - CUDA 12.9 deprecated |
| **v0.6.1** | `12.8` | `12.9` | `12.9` | |
| **v0.6.0** | `12.8` | `12.8` | `12.9` | |
| **v0.8.1** | `12.9`, `13.0` | `13.0` | `12.9`, `13.0` | Experimental SGLang/vLLM CUDA 13 support |
| **v0.8.0** | `12.9`, `13.0` | `13.0` | `12.9`, `13.0` | Experimental SGLang/vLLM CUDA 13 support |
| **v0.7.1** | `12.8` | `13.0` | `12.9` | |
| **v0.7.0** | `12.9` | `13.0` | `12.8` | TensorRT-LLM CUDA 13 support - CUDA 12.9 deprecated |
| **v0.6.1** | `12.9` | `12.9` | `12.8` | |
| **v0.6.0** | `12.8` | `12.9` | `12.8` | |
Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version.
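
The patch-version rule above amounts to stripping the `.postN` suffix before looking up CUDA support. The following is a minimal sketch, assuming patch releases always use the `.postN` suffix shown in the examples. The names `base_version`, `cuda_versions`, and `CUDA_BY_BACKEND` are hypothetical and not part of Dynamo; the values are copied from the reordered v0.8.1 and v0.7.1 rows in this hunk.

```python
import re

# Hypothetical lookup table; values copied from the reordered rows above
# (column order: SGLang, TensorRT-LLM, vLLM).
CUDA_BY_BACKEND = {
    "v0.8.1": {"sglang": ["12.9", "13.0"], "trtllm": ["13.0"], "vllm": ["12.9", "13.0"]},
    "v0.7.1": {"sglang": ["12.8"], "trtllm": ["13.0"], "vllm": ["12.9"]},
}


def base_version(version: str) -> str:
    """Drop a trailing '.postN' so a patch release maps to its base version."""
    return re.sub(r"\.post\d+$", "", version)


def cuda_versions(version: str, backend: str) -> list[str]:
    """Look up CUDA versions for a release, treating patches like their base."""
    return CUDA_BY_BACKEND[base_version(version)][backend]
```

With this sketch, `cuda_versions("v0.8.1.post1", "trtllm")` resolves through the `v0.8.1` entry, so a patch release never needs its own row.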
......@@ -101,22 +101,22 @@ Dynamo container images include CUDA toolkit libraries. The host machine must ha
| Dynamo Version | Backend | CUDA Toolkit | Min Driver (Linux) | Min Driver (Windows) | Notes |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **0.8.1** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| **0.8.1** | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.8.0** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| **0.8.0** | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.7.1** | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | **SGLang** | 12.8 | 570.xx+ | 571.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| | | 13.0 | 580.xx+ | 581.xx+ | Experimental |
| **0.7.1** | **SGLang** | 12.8 | 570.xx+ | 571.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| **0.7.0** | **vLLM** | 12.8 | 570.xx+ | 571.xx+ | |
| | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | 576.xx+ | |
| **0.7.0** | **SGLang** | 12.9 | 575.xx+ | 576.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | 581.xx+ | |
| | **vLLM** | 12.8 | 570.xx+ | 571.xx+ | |
Experimental CUDA 13 images are not published for all versions. Check [Release Artifacts](release-artifacts.md) for availability.
......
......@@ -30,7 +30,7 @@ Templates for creating consistent Dynamo documentation.
└──────────────────────────────────────────────────────────────┘
```
### Backends (vLLM, SGLang, TRT-LLM)
### Backends (SGLang, TRT-LLM, vLLM)
```
┌─────────────────────────────────────────────────────┐
......
......@@ -38,7 +38,7 @@ block2 in B: seq_hash = hash(hash(hash(block0') || block1') || block2) = 0x2222
> **Important: Engine-Provided Hashes**
>
> In practice, the `ExternalSequenceBlockHash` may come directly from the inference engine (e.g., vLLM, TensorRT-LLM) using a rolling hash algorithm that we don't know or control. The engine computes these hashes internally and reports them via KV cache events.
> In practice, the `ExternalSequenceBlockHash` may come directly from the inference engine (e.g., TensorRT-LLM, vLLM) using a rolling hash algorithm that we don't know or control. The engine computes these hashes internally and reports them via KV cache events.
>
> **LoRA identity**: The engine is responsible for incorporating the LoRA adapter identity into the `ExternalSequenceBlockHash` before emitting KV events. Dynamo does not add LoRA information at the router layer. For example, vLLM does this via `_gen_lora_extra_hash_keys`, which appends the LoRA ID as extra keys when calling `hash_block_tokens(..., extra_keys)`. Any engine integrating with the KV router must follow the same convention to ensure correct cache isolation between LoRA adapters.
>
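
To illustrate the chained form `seq_hash = hash(parent_hash || block)` from the hunk header, here is a minimal sketch. It is not the algorithm used by vLLM, TensorRT-LLM, or Dynamo (the text above notes that engine hashes may be opaque). SHA-256, the 4-byte token encoding, the 16-hex-character truncation, and the `extra_key` parameter (standing in for a LoRA identity) are all assumptions made for illustration, and `sequence_hashes` is a hypothetical name.

```python
import hashlib


def sequence_hashes(blocks: list[list[int]], extra_key: bytes = b"") -> list[str]:
    """Return one chained hash per block: h_i = H(h_{i-1} || extra_key || tokens_i).

    Assumes token IDs are non-negative and fit in 32 bits.
    """
    parent = b""  # the first block has no parent hash
    hashes = []
    for tokens in blocks:
        payload = b"".join(t.to_bytes(4, "little") for t in tokens)
        parent = hashlib.sha256(parent + extra_key + payload).digest()
        hashes.append(parent.hex()[:16])  # truncated only for readability
    return hashes
```

Two sequences that share the same final block but differ in an earlier block produce different hashes for that final block, because each hash folds in its whole prefix. Passing a different `extra_key` changes every hash, which is the isolation property the LoRA note asks engines to provide.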
......