Commit e0373bd7 authored by dagil-nvidia, committed by GitHub

docs: alphabetize backends (SGLang, TensorRT-LLM, vLLM) (#6537)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 80955ef4
...@@ -36,7 +36,7 @@ High-throughput, low-latency inference framework designed for serving generative ...@@ -36,7 +36,7 @@ High-throughput, low-latency inference framework designed for serving generative
Large language models exceed single-GPU capacity. Tensor parallelism spreads layers across GPUs but creates coordination challenges. Dynamo closes this orchestration gap. Large language models exceed single-GPU capacity. Tensor parallelism spreads layers across GPUs but creates coordination challenges. Dynamo closes this orchestration gap.
Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provides: Dynamo is inference engine agnostic (supports SGLang, TRT-LLM, vLLM) and provides:
- **Disaggregated Prefill & Decode** – Maximizes GPU throughput with latency/throughput trade-offs - **Disaggregated Prefill & Decode** – Maximizes GPU throughput with latency/throughput trade-offs
- **Dynamic GPU Scheduling** – Optimizes performance based on fluctuating demand - **Dynamic GPU Scheduling** – Optimizes performance based on fluctuating demand
......
...@@ -14,7 +14,7 @@ A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that s ...@@ -14,7 +14,7 @@ A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that s
- **What** model you want to deploy (`model`) - **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`) - **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences) - **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm) - **Which** backend to use (`backend`: sglang, trtllm, or vllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`) - **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically: The Dynamo Operator watches for DGDRs and automatically:
...@@ -186,7 +186,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin ...@@ -186,7 +186,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin
- **Duration**: 2-4 hours - **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements) - **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings - **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM - **Backends**: SGLang, TensorRT-LLM, vLLM
```yaml ```yaml
profilingConfig: profilingConfig:
...@@ -202,7 +202,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r ...@@ -202,7 +202,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r
- **Duration**: 20-30 seconds - **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations) - **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None - **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon) - **Backends**: TensorRT-LLM only (SGLang/vLLM coming soon)
```yaml ```yaml
profilingConfig: profilingConfig:
...@@ -401,7 +401,7 @@ The profiler uses the DGD config as a **base template**, then optimizes it based ...@@ -401,7 +401,7 @@ The profiler uses the DGD config as a **base template**, then optimizes it based
| Argument | Type | Default | Description | | Argument | Type | Default | Description |
|----------|------|---------|-------------| |----------|------|---------|-------------|
| `--backend` | string | - | Inference backend: vllm, sglang, trtllm | | `--backend` | string | - | Inference backend: sglang, trtllm, vllm |
| `--config` | string | - | Path to DGD YAML config file | | `--config` | string | - | Path to DGD YAML config file |
| `--model` | string | - | HuggingFace model ID | | `--model` | string | - | HuggingFace model ID |
| `--ttft` | float | - | Target TTFT in milliseconds | | `--ttft` | float | - | Target TTFT in milliseconds |
......
...@@ -18,7 +18,7 @@ To enable this feature, you should set the following flag while launching the ba ...@@ -18,7 +18,7 @@ To enable this feature, you should set the following flag while launching the ba
- `--dyn-tool-call-parser` : select the parser from the available parsers list using the below command - `--dyn-tool-call-parser` : select the parser from the available parsers list using the below command
```bash ```bash
# <backend> can be vllm, sglang, trtllm, etc. based on your installation # <backend> can be sglang, trtllm, vllm, etc. based on your installation
python -m dynamo.<backend> --help" python -m dynamo.<backend> --help"
``` ```
......
...@@ -139,7 +139,7 @@ python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-nam ...@@ -139,7 +139,7 @@ python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-nam
The benchmarking framework supports various comparative analysis scenarios: The benchmarking framework supports various comparative analysis scenarios:
- **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations) - **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations)
- **Compare different backends** (e.g., vLLM vs TensorRT-LLM vs SGLang) - **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix) - **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B) - **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200) - **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
...@@ -529,6 +529,6 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/ ...@@ -529,6 +529,6 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/
- **CI/CD pipelines** that need to validate infrastructure without model execution - **CI/CD pipelines** that need to validate infrastructure without model execution
- **Benchmarking framework validation** to ensure your setup works before using real backends - **Benchmarking framework validation** to ensure your setup works before using real backends
The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference. The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference.
See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options. See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
...@@ -60,7 +60,7 @@ Tune these values based on your workload. Connection window should accommodate ` ...@@ -60,7 +60,7 @@ Tune these values based on your workload. Connection window should accommodate `
Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_model()` API will be used. Currently the frontend support serving of the following model type and model input combination: Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_model()` API will be used. Currently the frontend support serving of the following model type and model input combination:
* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor * `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend) * `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo SGLang / TRTLLM / vLLM backend)
* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference * `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference
The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail: The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail:
......
...@@ -121,10 +121,10 @@ Backend engine scheduling priority forwarded to the engine's `generate` call. In ...@@ -121,10 +121,10 @@ Backend engine scheduling priority forwarded to the engine's `generate` call. In
The semantics of the priority value differ between backends: The semantics of the priority value differ between backends:
- **vLLM**: Smaller values = higher priority. A request with `priority: 0` is scheduled before `priority: 10`. Ties are broken by arrival time. Requires `--scheduling-policy priority` on the engine.
- **SGLang**: By default, larger values = higher priority. This can be inverted with `--schedule-low-priority-values-first` to match vLLM's convention. Requires `--enable-priority-scheduling` on the engine. - **SGLang**: By default, larger values = higher priority. This can be inverted with `--schedule-low-priority-values-first` to match vLLM's convention. Requires `--enable-priority-scheduling` on the engine.
- **vLLM**: Smaller values = higher priority. A request with `priority: 0` is scheduled before `priority: 10`. Ties are broken by arrival time. Requires `--scheduling-policy priority` on the engine.
When omitted, vLLM defaults to `0`; SGLang defaults to `None` (engine default). TensorRT-LLM does not currently support per-request priority. When omitted, SGLang defaults to `None` (engine default); vLLM defaults to `0`. TensorRT-LLM does not currently support per-request priority.
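Because the two engines order priority values in opposite directions by default, a caller that targets more than one backend may want to translate a single internal rank before sending the request. The following is a minimal sketch of that idea, not Dynamo code: `to_backend_priority`, `MAX_RANK`, and the `sglang_low_values_first` parameter are hypothetical names introduced here, and the sketch assumes ranks are small non-negative integers where 0 is the most urgent.

```python
from typing import Optional

# Hypothetical helper, not part of Dynamo. It maps a caller-side rank
# (0 = most urgent) to the priority value each backend expects.
MAX_RANK = 10  # assumed upper bound on ranks, chosen only for illustration


def to_backend_priority(
    rank: int,
    backend: str,
    sglang_low_values_first: bool = False,
) -> Optional[int]:
    if not 0 <= rank <= MAX_RANK:
        raise ValueError(f"rank must be in [0, {MAX_RANK}], got {rank}")
    if backend == "vllm":
        # vLLM: smaller values are scheduled first, so the rank passes through.
        return rank
    if backend == "sglang":
        # SGLang: larger values win by default; the rank passes through only
        # when the engine runs with --schedule-low-priority-values-first.
        return rank if sglang_low_values_first else MAX_RANK - rank
    if backend == "trtllm":
        # No per-request priority is supported, so the field is omitted.
        return None
    raise ValueError(f"unknown backend: {backend!r}")
```

With this mapping, the most urgent rank becomes `0` for vLLM and `MAX_RANK` for a default SGLang engine, so both engines schedule it first.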
```json ```json
{ {
......
...@@ -25,9 +25,9 @@ When both modes are enabled, throughput-based scaling provides a lower bound on ...@@ -25,9 +25,9 @@ When both modes are enabled, throughput-based scaling provides a lower bound on
| Disaggregated | Supported | Supported | | Disaggregated | Supported | Supported |
| Aggregated | Unsupported | Supported | | Aggregated | Unsupported | Supported |
| **LLM Framework** | | | | **LLM Framework** | | |
| vLLM | Supported | Supported |
| TensorRT-LLM | Supported | Supported |
| SGLang | Supported | Supported | | SGLang | Supported | Supported |
| TensorRT-LLM | Supported | Supported |
| vLLM | Supported | Supported |
| **Requires Profiling Data** | Yes | No | | **Requires Profiling Data** | Yes | No |
| **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A | | **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
| **Connectors** | | | | **Connectors** | | |
...@@ -98,7 +98,7 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE ...@@ -98,7 +98,7 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
|----------|---------|-------------| |----------|---------|-------------|
| **Common** | | | | **Common** | | |
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace | | `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) | | `--backend` | `vllm` | Backend framework (`sglang`, `trtllm`, `vllm`) |
| `--mode` | `disagg` | Planner mode (`disagg`, `prefill`, `decode`, `agg`) | | `--mode` | `disagg` | Planner mode (`disagg`, `prefill`, `decode`, `agg`) |
| `--environment` | `kubernetes` | Deployment environment | | `--environment` | `kubernetes` | Deployment environment |
| `--ttft` | `500.0` | Target Time To First Token (ms) | | `--ttft` | `500.0` | Target Time To First Token (ms) |
......
...@@ -71,7 +71,7 @@ A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface ...@@ -71,7 +71,7 @@ A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface
- **What** model to deploy (`model`) - **What** model to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`) - **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences) - **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm) - **Which** backend to use (`backend`: sglang, trtllm, or vllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`) - **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically: The Dynamo Operator watches for DGDRs and automatically:
...@@ -161,7 +161,7 @@ metadata: ...@@ -161,7 +161,7 @@ metadata:
| Field | Type | Description | | Field | Type | Description |
|-------|------|-------------| |-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) | | `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` | | `spec.backend` | enum | Inference backend: `sglang`, `trtllm`, or `vllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job | | `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) | | `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
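As a rough illustration of how the fields in this table nest, the sketch below assembles only the `spec` portion of a DGDR as a Python dictionary and rejects backends outside the documented enum. It is a sketch under stated assumptions: `build_dgdr_spec` is a hypothetical helper, the image name and SLA numbers are placeholders, and the resource's `apiVersion` and `metadata` are left out because they are not shown here.

```python
from typing import Any, Dict, Mapping

# Backend values taken from the `spec.backend` row of the table.
VALID_BACKENDS = ("sglang", "trtllm", "vllm")


def build_dgdr_spec(
    model: str,
    backend: str,
    profiler_image: str,
    sla: Mapping[str, float],
) -> Dict[str, Any]:
    """Hypothetical helper: build the `spec` mapping of a DGDR."""
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {VALID_BACKENDS}, got {backend!r}")
    return {
        "model": model,
        "backend": backend,
        "profilingConfig": {
            "profilerImage": profiler_image,
            # SLA targets live under profilingConfig.config.sla.
            "config": {"sla": dict(sla)},
        },
    }


# Placeholder values for illustration only.
spec = build_dgdr_spec(
    model="meta-llama/Llama-3-70b",
    backend="vllm",
    profiler_image="example.com/profiler:latest",
    sla={"isl": 3000, "osl": 150, "ttft": 500.0, "itl": 20.0},
)
```

Validating the backend before the resource is submitted surfaces a typo such as `tensorrt-llm` locally rather than at admission time.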
......
...@@ -10,14 +10,14 @@ The Dynamo Profiler is an automated performance analysis tool that measures mode ...@@ -10,14 +10,14 @@ The Dynamo Profiler is an automated performance analysis tool that measures mode
## Feature Matrix ## Feature Matrix
| Feature | vLLM | SGLang | TensorRT-LLM | | Feature | SGLang | TensorRT-LLM | vLLM |
|---------|------|--------|--------------| |---------|--------|--------------|------|
| Dense Model Profiling | ✅ | ✅ | ✅ | | Dense Model Profiling | ✅ | ✅ | ✅ |
| MoE Model Profiling | 🚧 | | 🚧 | | MoE Model Profiling | | 🚧 | 🚧 |
| AI Configurator (Offline) | ❌ | | | | AI Configurator (Offline) | ❌ | | |
| Online Profiling (AIPerf) | ✅ | ✅ | ✅ | | Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
| Interactive WebUI | ✅ | ✅ | ✅ | | Interactive WebUI | ✅ | ✅ | ✅ |
| Runtime Profiling Endpoints | | | ❌ | | Runtime Profiling Endpoints | | | ❌ |
## Quick Start ## Quick Start
......
...@@ -15,7 +15,7 @@ A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that s ...@@ -15,7 +15,7 @@ A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that s
- **What** model you want to deploy (`model`) - **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`) - **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences) - **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm) - **Which** backend to use (`backend`: sglang, trtllm, or vllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`) - **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically: The Dynamo Operator watches for DGDRs and automatically:
...@@ -187,7 +187,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin ...@@ -187,7 +187,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin
- **Duration**: 2-4 hours - **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements) - **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings - **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM - **Backends**: SGLang, TensorRT-LLM, vLLM
```yaml ```yaml
profilingConfig: profilingConfig:
...@@ -203,7 +203,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r ...@@ -203,7 +203,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r
- **Duration**: 20-30 seconds - **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations) - **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None - **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon) - **Backends**: TensorRT-LLM only (SGLang/vLLM coming soon)
```yaml ```yaml
profilingConfig: profilingConfig:
...@@ -422,7 +422,7 @@ The profiler uses the DGD config as a **base template**, then optimizes it based ...@@ -422,7 +422,7 @@ The profiler uses the DGD config as a **base template**, then optimizes it based
| Argument | Type | Default | Description | | Argument | Type | Default | Description |
|----------|------|---------|-------------| |----------|------|---------|-------------|
| `--backend` | string | - | Inference backend: vllm, sglang, trtllm | | `--backend` | string | - | Inference backend: sglang, trtllm, vllm |
| `--config` | string | - | Path to DGD YAML config file | | `--config` | string | - | Path to DGD YAML config file |
| `--model` | string | - | HuggingFace model ID | | `--model` | string | - | HuggingFace model ID |
| `--ttft` | float | - | Target TTFT in milliseconds | | `--ttft` | float | - | Target TTFT in milliseconds |
......
...@@ -88,7 +88,7 @@ For more configuration options and tuning guidelines, see the [Router Guide](rou ...@@ -88,7 +88,7 @@ For more configuration options and tuning guidelines, see the [Router Guide](rou
- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead) - You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
**Multimodal Support:** **Multimodal Support:**
- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes - **TRT-LLM and vLLM**: Multimodal routing supported for images via multimodal hashes
- **SGLang**: Image routing not yet supported - **SGLang**: Image routing not yet supported
- **Other modalities** (audio, video, etc.): Not yet supported - **Other modalities** (audio, video, etc.): Not yet supported
......
...@@ -279,7 +279,7 @@ See [Router Design](../../design-docs/router-design.md) for architecture details ...@@ -279,7 +279,7 @@ See [Router Design](../../design-docs/router-design.md) for architecture details
For full documentation on implementing KV event publishing for custom inference engines, see the dedicated [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md) guide. It covers: For full documentation on implementing KV event publishing for custom inference engines, see the dedicated [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md) guide. It covers:
- **Direct publishing**: Call `publish_stored()` / `publish_removed()` to push events over the Dynamo event plane - **Direct publishing**: Call `publish_stored()` / `publish_removed()` to push events over the Dynamo event plane
- **ZMQ relay**: For engines that emit raw KV events over ZMQ (like vLLM and SGLang), the same `KvEventPublisher` subscribes to the ZMQ socket and relays events automatically - **ZMQ relay**: For engines that emit raw KV events over ZMQ (like SGLang and vLLM), the same `KvEventPublisher` subscribes to the ZMQ socket and relays events automatically
- API reference, event structure, ZMQ wire format, and best practices - API reference, event structure, ZMQ wire format, and best practices
## Global Router (Hierarchical Routing) ## Global Router (Hierarchical Routing)
......
...@@ -6,7 +6,7 @@ title: Overall Architecture ...@@ -6,7 +6,7 @@ title: Overall Architecture
# High Level Architecture # High Level Architecture
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities: Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting SGLang, TRT-LLM, vLLM and others, while capturing essential LLM capabilities:
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency - **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand - **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
......
...@@ -20,7 +20,7 @@ While theoretically each `DistributedRuntime` can have multiple `Namespace`s as ...@@ -20,7 +20,7 @@ While theoretically each `DistributedRuntime` can have multiple `Namespace`s as
For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components: For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:
- `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality. - `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM). - `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (SGLang, TensorRT-LLM, vLLM).
Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`: Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:
......
...@@ -61,7 +61,7 @@ Coordination and messaging support: ...@@ -61,7 +61,7 @@ Coordination and messaging support:
### NIXL (NVIDIA Interchange Library): ### NIXL (NVIDIA Interchange Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe - Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
- Transfer metadata exchanged via `disaggregated_params` in prefill response - Transfer metadata exchanged via `disaggregated_params` in prefill response
- Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state - Backend-specific coordination: SGLang uses bootstrap connections, TRTLLM uses opaque state, vLLM uses block IDs
### Disaggregated KV Cache: ### Disaggregated KV Cache:
- Each worker maintains local KV cache in its GPU memory - Each worker maintains local KV cache in its GPU memory
......
...@@ -45,11 +45,11 @@ export DYN_EVENT_PLANE=zmq ...@@ -45,11 +45,11 @@ export DYN_EVENT_PLANE=zmq
Python components also accept this as a CLI flag: Python components also accept this as a CLI flag:
```bash ```bash
# vLLM backend
python3 -m dynamo.vllm --event-plane zmq --model Qwen/Qwen3-0.6B
# SGLang backend # SGLang backend
python3 -m dynamo.sglang --event-plane zmq --model Qwen/Qwen3-0.6B python3 -m dynamo.sglang --event-plane zmq --model Qwen/Qwen3-0.6B
# vLLM backend
python3 -m dynamo.vllm --event-plane zmq --model Qwen/Qwen3-0.6B
``` ```
### Environment Variables ### Environment Variables
......
...@@ -6,7 +6,7 @@ title: KVBM Design ...@@ -6,7 +6,7 @@ title: KVBM Design
# KVBM Design # KVBM Design
This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in vLLM and SGLang, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading). This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in SGLang and vLLM, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
## KVBM Components ## KVBM Components
...@@ -313,7 +313,7 @@ This design ensures that performance, resilience, and extensibility scale indepe ...@@ -313,7 +313,7 @@ This design ensures that performance, resilience, and extensibility scale indepe
## Framework Integrations ## Framework Integrations
KVBM integrates with inference frameworks (vLLM, TensorRT-LLM, SGLang) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution. KVBM integrates with inference frameworks (SGLang, TensorRT-LLM, vLLM) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
### Connector Architecture ### Connector Architecture
......
...@@ -107,7 +107,7 @@ To get a feel for how KV Cache management works on a single worker with KV Cache ...@@ -107,7 +107,7 @@ To get a feel for how KV Cache management works on a single worker with KV Cache
- These tensors are stored in the newly allocated cache blocks - These tensors are stored in the newly allocated cache blocks
- **KVPublisher emits a kv stored event notifying KVIndexer about newly stored blocks**. - **KVPublisher emits a kv stored event notifying KVIndexer about newly stored blocks**.
Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/). Further details can be found for: [SGLang](https://lmsys.org/blog/2024-01-17-sglang/), [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/) and [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching).
## Events ## Events
...@@ -214,7 +214,7 @@ By default, workers have local indexer enabled. Each worker maintains its own lo ...@@ -214,7 +214,7 @@ By default, workers have local indexer enabled. Each worker maintains its own lo
- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios; deployments without NATS (using ZMQ event plane) - **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios; deployments without NATS (using ZMQ event plane)
- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available - **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
- **Switch to JetStream**: Use `--durable-kv-events` flag on **both** workers (vLLM, SGLang, TRT-LLM, mocker) **and** frontend - **Switch to JetStream**: Use `--durable-kv-events` flag on **both** workers (SGLang, TRT-LLM, vLLM, mocker) **and** frontend
```mermaid ```mermaid
graph TD graph TD
......
...@@ -86,7 +86,7 @@ aiconfigurator cli default \ ...@@ -86,7 +86,7 @@ aiconfigurator cli default \
- `--total_gpus`: Number of GPUs available for deployment - `--total_gpus`: Number of GPUs available for deployment
- `--isl` / `--osl`: Input/Output sequence lengths in tokens - `--isl` / `--osl`: Input/Output sequence lengths in tokens
- `--ttft` / `--tpot`: SLA targets - Time To First Token (ms) and Time Per Output Token (ms) - `--ttft` / `--tpot`: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
- `--backend`: Inference backend (`vllm`, `trtllm`, or `sglang`) - `--backend`: Inference backend (`sglang`, `trtllm`, or `vllm`)
- `--backend_version`: Backend version (e.g., `0.12.0` for vLLM) - `--backend_version`: Backend version (e.g., `0.12.0` for vLLM)
- `--save_dir`: Directory to save generated deployment configs - `--save_dir`: Directory to save generated deployment configs
...@@ -623,13 +623,13 @@ AIConfigurator's default predictions assume no prefix caching. Enable it post-de ...@@ -623,13 +623,13 @@ AIConfigurator's default predictions assume no prefix caching. Enable it post-de
### Systems ### Systems
| GPU System | TensorRT-LLM | vLLM | SGLang | | GPU System | SGLang | TensorRT-LLM | vLLM |
|------------|--------------|------|--------| |------------|--------|--------------|------|
| H200 SXM | Yes | Yes | Yes | | H200 SXM | Yes | Yes | Yes |
| H100 SXM | Yes | Yes | Yes | | H100 SXM | Yes | Yes | Yes |
| A100 SXM | Yes | Yes | -- | | A100 SXM | -- | Yes | Yes |
| B200 SXM | Yes | -- | Yes | | B200 SXM | Yes | Yes | -- |
| GB200 SXM | Yes | -- | -- | | GB200 SXM | -- | Yes | -- |
### Models ### Models
......
...@@ -39,10 +39,10 @@ Dynamo supports multimodal inference across multiple LLM backends, enabling mode ...@@ -39,10 +39,10 @@ Dynamo supports multimodal inference across multiple LLM backends, enabling mode
### Input Format Support ### Input Format Support
| Format | vLLM | TRT-LLM | SGLang | | Format | SGLang | TRT-LLM | vLLM |
|--------|------|---------|--------| |--------|--------|---------|------|
| HTTP/HTTPS URL | ✅ | ✅ | ✅ | | HTTP/HTTPS URL | ✅ | ✅ | ✅ |
| Data URL (Base64) | | ❌ | | | Data URL (Base64) | | ❌ | |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ | | Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
## Architecture Patterns ## Architecture Patterns
......