docs: alphabetize backends (SGLang, TensorRT-LLM, vLLM) (#6537)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Commit e0373bd7 authored Feb 25, 2026 by dagil-nvidia; committed by GitHub Feb 25, 2026
20 changed files
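
The ordering applied throughout this commit can be sketched as a case-insensitive sort of the backend names. This is only an illustration of the convention, not tooling from the repository; the helper name `alphabetize` is hypothetical.

```python
def alphabetize(names):
    """Return names sorted alphabetically, ignoring letter case.

    Hypothetical helper: it only illustrates the ordering convention
    used in this commit and is not part of the Dynamo codebase.
    """
    return sorted(names, key=str.casefold)


# Display names as they appear in prose and tables.
display_names = alphabetize(["vLLM", "SGLang", "TensorRT-LLM"])

# Lowercase identifiers as they appear in `--backend` values.
backend_ids = alphabetize(["vllm", "sglang", "trtllm"])

print(display_names)
print(backend_ids)
```

The same order results for the short form `TRT-LLM`, since only the first letters (S, T, v) decide the comparison here.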
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ High-throughput, low-latency inference framework designed for serving generative
 Large language models exceed single-GPU capacity. Tensor parallelism spreads layers across GPUs but creates coordination challenges. Dynamo closes this orchestration gap.
-Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provides:
+Dynamo is inference engine agnostic (supports SGLang, TRT-LLM, vLLM) and provides:
 - **Disaggregated Prefill & Decode** – Maximizes GPU throughput with latency/throughput trade-offs
 - **Dynamic GPU Scheduling** – Optimizes performance based on fluctuating demand

--- a/docs/components/profiler/profiler_guide.md
+++ b/docs/components/profiler/profiler_guide.md
@@ -14,7 +14,7 @@ A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that s
 - **What** model you want to deploy (`model`)
 - **How** it should perform (SLA targets: `ttft`, `itl`)
 - **Where** it should run (optional GPU preferences)
-- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
+- **Which** backend to use (`backend`: sglang, trtllm, or vllm)
 - **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
 The Dynamo Operator watches for DGDRs and automatically:
@@ -186,7 +186,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin
 - **Duration**: 2-4 hours
 - **Accuracy**: Highest (real measurements)
 - **GPU Requirements**: Full access to test different parallelization mappings
-- **Backends**: vLLM, SGLang, TensorRT-LLM
+- **Backends**: SGLang, TensorRT-LLM, vLLM
 ```yaml
 profilingConfig:
@@ -202,7 +202,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r
 - **Duration**: 20-30 seconds
 - **Accuracy**: Estimated (may have errors for unusual configurations)
 - **GPU Requirements**: None
-- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
+- **Backends**: TensorRT-LLM only (SGLang/vLLM coming soon)
 ```yaml
 profilingConfig:
@@ -401,7 +401,7 @@ The profiler uses the DGD config as a **base template**, then optimizes it based
 | Argument | Type | Default | Description |
 |----------|------|---------|-------------|
-| `--backend` | string | - | Inference backend: vllm, sglang, trtllm |
+| `--backend` | string | - | Inference backend: sglang, trtllm, vllm |
 | `--config` | string | - | Path to DGD YAML config file |
 | `--model` | string | - | HuggingFace model ID |
 | `--ttft` | float | - | Target TTFT in milliseconds |

--- a/docs/pages/agents/tool-calling.md
+++ b/docs/pages/agents/tool-calling.md
@@ -18,7 +18,7 @@ To enable this feature, you should set the following flag while launching the ba
 - `--dyn-tool-call-parser` : select the parser from the available parsers list using the below command
 ```bash
-# <backend> can be vllm, sglang, trtllm, etc. based on your installation
+# <backend> can be sglang, trtllm, vllm, etc. based on your installation
 python -m dynamo.<backend> --help
 ```

--- a/docs/pages/benchmarks/benchmarking.md
+++ b/docs/pages/benchmarks/benchmarking.md
@@ -139,7 +139,7 @@ python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-nam
 The benchmarking framework supports various comparative analysis scenarios:
 - **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations)
-- **Compare different backends** (e.g., vLLM vs TensorRT-LLM vs SGLang)
+- **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM)
 - **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
 - **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B)
 - **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
@@ -529,6 +529,6 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/
 - **CI/CD pipelines** that need to validate infrastructure without model execution
 - **Benchmarking framework validation** to ensure your setup works before using real backends
-The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
+The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference.
 See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
--- a/docs/pages/components/frontend/frontend-guide.md
+++ b/docs/pages/components/frontend/frontend-guide.md
@@ -60,7 +60,7 @@ Tune these values based on your workload. Connection window should accommodate `
 Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_model()` API will be used. Currently the frontend support serving of the following model type and model input combination:
 * `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
-* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
+* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo SGLang / TRTLLM / vLLM backend)
 * `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference
 The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail:

--- a/docs/pages/components/frontend/nvext.md
+++ b/docs/pages/components/frontend/nvext.md
@@ -121,10 +121,10 @@ Backend engine scheduling priority forwarded to the engine's `generate` call. In
 The semantics of the priority value differ between backends:
-- **vLLM**: Smaller values = higher priority. A request with `priority: 0` is scheduled before `priority: 10`. Ties are broken by arrival time. Requires `--scheduling-policy priority` on the engine.
 - **SGLang**: By default, larger values = higher priority. This can be inverted with `--schedule-low-priority-values-first` to match vLLM's convention. Requires `--enable-priority-scheduling` on the engine.
+- **vLLM**: Smaller values = higher priority. A request with `priority: 0` is scheduled before `priority: 10`. Ties are broken by arrival time. Requires `--scheduling-policy priority` on the engine.
-When omitted, vLLM defaults to `0`; SGLang defaults to `None` (engine default). TensorRT-LLM does not currently support per-request priority.
+When omitted, SGLang defaults to `None` (engine default); vLLM defaults to `0`. TensorRT-LLM does not currently support per-request priority.
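
The differing priority conventions described above can be modeled with a small sort key. This is a minimal sketch of the stated semantics only, not code from Dynamo or either engine; the names `schedule_key` and `schedule_order` are hypothetical, and the `low_values_first` parameter stands in for SGLang's `--schedule-low-priority-values-first` flag.

```python
def schedule_key(priority, backend, low_values_first=False):
    """Return a sort key: requests with smaller keys are scheduled first.

    Hypothetical model of the documented semantics, not engine code.
    """
    if backend == "vllm":
        # vLLM: smaller values = higher priority.
        return priority
    if backend == "sglang":
        # SGLang: larger values = higher priority by default,
        # inverted when low values are configured to go first.
        return priority if low_values_first else -priority
    # TensorRT-LLM has no per-request priority per the text above.
    raise ValueError(f"per-request priority not supported for {backend!r}")


def schedule_order(priorities, backend, low_values_first=False):
    """Order request priorities as the given backend would schedule them."""
    # sorted() is stable, so equal priorities keep their arrival order.
    return sorted(
        priorities,
        key=lambda p: schedule_key(p, backend, low_values_first),
    )


print(schedule_order([10, 0], "vllm"))
print(schedule_order([10, 0], "sglang"))
print(schedule_order([10, 0], "sglang", low_values_first=True))
```

Stable sorting mirrors the arrival-time tie-breaking stated for vLLM; the text does not say how SGLang breaks ties, so that part is an assumption of the sketch.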
 ```json
 {

--- a/docs/pages/components/planner/README.md
+++ b/docs/pages/components/planner/README.md
@@ -25,9 +25,9 @@ When both modes are enabled, throughput-based scaling provides a lower bound on
 | Disaggregated | Supported | Supported |
 | Aggregated | Unsupported | Supported |
 | **LLM Framework** | | |
-| vLLM | Supported | Supported |
-| TensorRT-LLM | Supported | Supported |
 | SGLang | Supported | Supported |
+| TensorRT-LLM | Supported | Supported |
+| vLLM | Supported | Supported |
 | **Requires Profiling Data** | Yes | No |
 | **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
 | **Connectors** | | |
@@ -98,7 +98,7 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
 |----------|---------|-------------|
 | **Common** | | |
 | `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
-| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
+| `--backend` | `vllm` | Backend framework (`sglang`, `trtllm`, `vllm`) |
 | `--mode` | `disagg` | Planner mode (`disagg`, `prefill`, `decode`, `agg`) |
 | `--environment` | `kubernetes` | Deployment environment |
 | `--ttft` | `500.0` | Target Time To First Token (ms) |

--- a/docs/pages/components/planner/planner-guide.md
+++ b/docs/pages/components/planner/planner-guide.md
@@ -71,7 +71,7 @@ A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface
 - **What** model to deploy (`model`)
 - **How** it should perform (SLA targets: `ttft`, `itl`)
 - **Where** it should run (optional GPU preferences)
-- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
+- **Which** backend to use (`backend`: sglang, trtllm, or vllm)
 - **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
 The Dynamo Operator watches for DGDRs and automatically:
@@ -161,7 +161,7 @@ metadata:
 | Field | Type | Description |
 |-------|------|-------------|
 | `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
-| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
+| `spec.backend` | enum | Inference backend: `sglang`, `trtllm`, or `vllm` |
 | `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
 | `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |

--- a/docs/pages/components/profiler/README.md
+++ b/docs/pages/components/profiler/README.md
@@ -10,14 +10,14 @@ The Dynamo Profiler is an automated performance analysis tool that measures mode
 ## Feature Matrix
-| Feature | vLLM | SGLang | TensorRT-LLM |
+| Feature | SGLang | TensorRT-LLM | vLLM |
-|---------|------|--------|--------------|
+|---------|--------|--------------|------|
 | Dense Model Profiling | ✅ | ✅ | ✅ |
-| MoE Model Profiling | 🚧 | ✅ | 🚧 |
+| MoE Model Profiling | ✅ | 🚧 | 🚧 |
-| AI Configurator (Offline) | ❌ | ❌ | ✅ |
+| AI Configurator (Offline) | ❌ | ✅ | ❌ |
 | Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
 | Interactive WebUI | ✅ | ✅ | ✅ |
-| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
+| Runtime Profiling Endpoints | ✅ | ❌ | ❌ |
 ## Quick Start

--- a/docs/pages/components/profiler/profiler-guide.md
+++ b/docs/pages/components/profiler/profiler-guide.md
@@ -15,7 +15,7 @@ A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that s
 - **What** model you want to deploy (`model`)
 - **How** it should perform (SLA targets: `ttft`, `itl`)
 - **Where** it should run (optional GPU preferences)
-- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
+- **Which** backend to use (`backend`: sglang, trtllm, or vllm)
 - **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
 The Dynamo Operator watches for DGDRs and automatically:
@@ -187,7 +187,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin
 - **Duration**: 2-4 hours
 - **Accuracy**: Highest (real measurements)
 - **GPU Requirements**: Full access to test different parallelization mappings
-- **Backends**: vLLM, SGLang, TensorRT-LLM
+- **Backends**: SGLang, TensorRT-LLM, vLLM
 ```yaml
 profilingConfig:
@@ -203,7 +203,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r
 - **Duration**: 20-30 seconds
 - **Accuracy**: Estimated (may have errors for unusual configurations)
 - **GPU Requirements**: None
-- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
+- **Backends**: TensorRT-LLM only (SGLang/vLLM coming soon)
 ```yaml
 profilingConfig:
@@ -422,7 +422,7 @@ The profiler uses the DGD config as a **base template**, then optimizes it based
 | Argument | Type | Default | Description |
 |----------|------|---------|-------------|
-| `--backend` | string | - | Inference backend: vllm, sglang, trtllm |
+| `--backend` | string | - | Inference backend: sglang, trtllm, vllm |
 | `--config` | string | - | Path to DGD YAML config file |
 | `--model` | string | - | HuggingFace model ID |
 | `--ttft` | float | - | Target TTFT in milliseconds |

--- a/docs/pages/components/router/README.md
+++ b/docs/pages/components/router/README.md
@@ -88,7 +88,7 @@ For more configuration options and tuning guidelines, see the [Router Guide](rou
 - You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
 **Multimodal Support:**
-- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes
+- **TRT-LLM and vLLM**: Multimodal routing supported for images via multimodal hashes
 - **SGLang**: Image routing not yet supported
 - **Other modalities** (audio, video, etc.): Not yet supported

--- a/docs/pages/components/router/router-examples.md
+++ b/docs/pages/components/router/router-examples.md
@@ -279,7 +279,7 @@ See [Router Design](../../design-docs/router-design.md) for architecture details
 For full documentation on implementing KV event publishing for custom inference engines, see the dedicated [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md) guide. It covers:
 - **Direct publishing**: Call `publish_stored()` / `publish_removed()` to push events over the Dynamo event plane
-- **ZMQ relay**: For engines that emit raw KV events over ZMQ (like vLLM and SGLang), the same `KvEventPublisher` subscribes to the ZMQ socket and relays events automatically
+- **ZMQ relay**: For engines that emit raw KV events over ZMQ (like SGLang and vLLM), the same `KvEventPublisher` subscribes to the ZMQ socket and relays events automatically
 - API reference, event structure, ZMQ wire format, and best practices
 ## Global Router (Hierarchical Routing)

--- a/docs/pages/design-docs/architecture.md
+++ b/docs/pages/design-docs/architecture.md
@@ -6,7 +6,7 @@ title: Overall Architecture
 # High Level Architecture
-Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
+Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting SGLang, TRT-LLM, vLLM and others, while capturing essential LLM capabilities:
 - **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
 - **Dynamic GPU scheduling**: Optimizes performance based on real-time demand

--- a/docs/pages/design-docs/distributed-runtime.md
+++ b/docs/pages/design-docs/distributed-runtime.md
@@ -20,7 +20,7 @@ While theoretically each `DistributedRuntime` can have multiple `Namespace`s as
 For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:
 - `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
-- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
+- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (SGLang, TensorRT-LLM, vLLM).
 Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:

--- a/docs/pages/design-docs/dynamo-flow.md
+++ b/docs/pages/design-docs/dynamo-flow.md
@@ -61,7 +61,7 @@ Coordination and messaging support:
 ### NIXL (NVIDIA Interchange Library):
 - Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
 - Transfer metadata exchanged via `disaggregated_params` in prefill response
-- Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
+- Backend-specific coordination: SGLang uses bootstrap connections, TRTLLM uses opaque state, vLLM uses block IDs
 ### Disaggregated KV Cache:
 - Each worker maintains local KV cache in its GPU memory

--- a/docs/pages/design-docs/event-plane.md
+++ b/docs/pages/design-docs/event-plane.md
@@ -45,11 +45,11 @@ export DYN_EVENT_PLANE=zmq
 Python components also accept this as a CLI flag:
 ```bash
-# vLLM backend
-python3 -m dynamo.vllm --event-plane zmq --model Qwen/Qwen3-0.6B
 # SGLang backend
 python3 -m dynamo.sglang --event-plane zmq --model Qwen/Qwen3-0.6B
+# vLLM backend
+python3 -m dynamo.vllm --event-plane zmq --model Qwen/Qwen3-0.6B
 ```
 ### Environment Variables

--- a/docs/pages/design-docs/kvbm-design.md
+++ b/docs/pages/design-docs/kvbm-design.md
@@ -6,7 +6,7 @@ title: KVBM Design
 # KVBM Design
-This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in vLLM and SGLang, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
+This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in SGLang and vLLM, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
 ## KVBM Components
@@ -313,7 +313,7 @@ This design ensures that performance, resilience, and extensibility scale indepe
 ## Framework Integrations
-KVBM integrates with inference frameworks (vLLM, TensorRT-LLM, SGLang) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
+KVBM integrates with inference frameworks (SGLang, TensorRT-LLM, vLLM) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
 ### Connector Architecture

--- a/docs/pages/design-docs/router-design.md
+++ b/docs/pages/design-docs/router-design.md
@@ -107,7 +107,7 @@ To get a feel for how KV Cache management works on a single worker with KV Cache
    - These tensors are stored in the newly allocated cache blocks
    - **KVPublisher emits a kv stored event notifying KVIndexer about newly stored blocks**.
-Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
+Further details can be found for: [SGLang](https://lmsys.org/blog/2024-01-17-sglang/), [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/) and [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching).
 ## Events
@@ -214,7 +214,7 @@ By default, workers have local indexer enabled. Each worker maintains its own lo
 - **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios; deployments without NATS (using ZMQ event plane)
 - **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
-- **Switch to JetStream**: Use `--durable-kv-events` flag on **both** workers (vLLM, SGLang, TRT-LLM, mocker) **and** frontend
+- **Switch to JetStream**: Use `--durable-kv-events` flag on **both** workers (SGLang, TRT-LLM, vLLM, mocker) **and** frontend
 ```mermaid
 graph TD

--- a/docs/pages/features/disaggregated-serving/README.md
+++ b/docs/pages/features/disaggregated-serving/README.md
@@ -86,7 +86,7 @@ aiconfigurator cli default \
 - `--total_gpus`: Number of GPUs available for deployment
 - `--isl` / `--osl`: Input/Output sequence lengths in tokens
 - `--ttft` / `--tpot`: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
-- `--backend`: Inference backend (`vllm`, `trtllm`, or `sglang`)
+- `--backend`: Inference backend (`sglang`, `trtllm`, or `vllm`)
 - `--backend_version`: Backend version (e.g., `0.12.0` for vLLM)
 - `--save_dir`: Directory to save generated deployment configs
@@ -623,13 +623,13 @@ AIConfigurator's default predictions assume no prefix caching. Enable it post-de
 ### Systems
-| GPU System | TensorRT-LLM | vLLM | SGLang |
+| GPU System | SGLang | TensorRT-LLM | vLLM |
-|------------|--------------|------|--------|
+|------------|--------|--------------|------|
 | H200 SXM | Yes | Yes | Yes |
 | H100 SXM | Yes | Yes | Yes |
-| A100 SXM | Yes | Yes | -- |
+| A100 SXM | -- | Yes | Yes |
-| B200 SXM | Yes | -- | Yes |
+| B200 SXM | Yes | Yes | -- |
-| GB200 SXM | Yes | -- | -- |
+| GB200 SXM | -- | Yes | -- |
 ### Models
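
Reordering table columns, as in the Systems table above, requires moving every cell together with its header so that no value ends up under the wrong backend. The following is a rough sketch of that check; the helpers `split_row` and `reorder_row` are hypothetical and not part of the repository.

```python
def split_row(row):
    """Split a Markdown table row like '| a | b |' into ['a', 'b']."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def reorder_row(row, old_header, new_header):
    """Rebuild a table row so its cells follow new_header's column order.

    Hypothetical helper for verifying a column reorder by hand.
    """
    cells = split_row(row)
    # Pair each cell with the column name it belonged to originally.
    by_column = dict(zip(old_header, cells))
    return "| " + " | ".join(by_column[name] for name in new_header) + " |"


old_header = ["GPU System", "TensorRT-LLM", "vLLM", "SGLang"]
new_header = ["GPU System", "SGLang", "TensorRT-LLM", "vLLM"]

# Rows taken from the removed ("-") side of the hunk above.
print(reorder_row("| A100 SXM | Yes | Yes | -- |", old_header, new_header))
print(reorder_row("| GB200 SXM | Yes | -- | -- |", old_header, new_header))
```

Running the removed rows through this mapping reproduces the added rows in the hunk, which is a quick way to confirm that the support values stayed attached to the right backend.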

--- a/docs/pages/features/multimodal/README.md
+++ b/docs/pages/features/multimodal/README.md
@@ -39,10 +39,10 @@ Dynamo supports multimodal inference across multiple LLM backends, enabling mode
 ### Input Format Support
-| Format | vLLM | TRT-LLM | SGLang |
+| Format | SGLang | TRT-LLM | vLLM |
-|--------|------|---------|--------|
+|--------|--------|---------|------|
 | HTTP/HTTPS URL | ✅ | ✅ | ✅ |
-| Data URL (Base64) | ✅ | ❌ | ❌ |
+| Data URL (Base64) | ❌ | ❌ | ✅ |
 | Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
 ## Architecture Patterns