Unverified Commit c6b59045 authored by Anish, committed by GitHub

docs: address Harry/VDR feedback + fixing broken links across repository (#3802)


Signed-off-by: Harry Kim <harry_kim@live.com>
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-authored-by: Harry Kim <harry_kim@live.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Co-authored-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
parent d712ce8d
...@@ -22,7 +22,7 @@ limitations under the License. ...@@ -22,7 +22,7 @@ limitations under the License.
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ) [![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)
| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Support matrix](https://github.com/ai-dynamo/dynamo/blob/main/docs/reference/support-matrix.md)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Prebuilt containers](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Blogs](https://developer.nvidia.com/blog/tag/nvidia-dynamo)** | **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/2486)** | **[Support matrix](https://github.com/ai-dynamo/dynamo/blob/main/docs/reference/support-matrix.md)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Prebuilt containers](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Blogs](https://developer.nvidia.com/blog/tag/nvidia-dynamo)**
# NVIDIA Dynamo # NVIDIA Dynamo
...@@ -56,9 +56,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa ...@@ -56,9 +56,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
| Feature | vLLM | SGLang | TensorRT-LLM | | Feature | vLLM | SGLang | TensorRT-LLM |
| ------------------------------------------------------------------------------------------------- | ---- | ------ | ------------ | | ------------------------------------------------------------------------------------------------- | ---- | ------ | ------------ |
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ | | [**Disaggregated Serving**](/docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 | | [**Conditional Disaggregation**](/docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ | | [**KV-Aware Routing**](/docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**Load Based Planner**](docs/planner/load_planner.md) | 🚧 | 🚧 | 🚧 | | [**Load Based Planner**](docs/planner/load_planner.md) | 🚧 | 🚧 | 🚧 |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ | | [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | ✅ | 🚧 | ✅ | | [**KVBM**](docs/kvbm/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
......
...@@ -116,7 +116,7 @@ To see all available router arguments, run: ...@@ -116,7 +116,7 @@ To see all available router arguments, run:
python -m dynamo.frontend --help python -m dynamo.frontend --help
``` ```
For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/architecture/kv_cache_routing.md). For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md).
#### Disaggregated Serving with Automatic Prefill Routing #### Disaggregated Serving with Automatic Prefill Routing
...@@ -125,7 +125,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a ...@@ -125,7 +125,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
- Uses KV-aware routing regardless of the frontend's `--router-mode` setting - Uses KV-aware routing regardless of the frontend's `--router-mode` setting
- Seamlessly integrates with your decode workers for token generation - Seamlessly integrates with your decode workers for token generation
No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/architecture/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details. No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.
**Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead: **Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
......
...@@ -31,7 +31,7 @@ Each engine provides launch scripts for different deployment patterns in their r ...@@ -31,7 +31,7 @@ Each engine provides launch scripts for different deployment patterns in their r
## Core Components ## Core Components
### [Backends](src/dynamo/) ### [Backends](backends/)
The backends directory contains inference engine integrations and implementations, with a key focus on: The backends directory contains inference engine integrations and implementations, with a key focus on:
......
...@@ -144,7 +144,7 @@ All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model. But you ...@@ -144,7 +144,7 @@ All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model. But you
## Further Reading ## Further Reading
- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md) - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md) - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md) - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
......
...@@ -153,7 +153,7 @@ args: ...@@ -153,7 +153,7 @@ args:
### 3. Deploy ### 3. Deploy
See the [Create Deployment Guide](../../../../docs/kubernetes/create_deployment.md) to learn how to deploy the deployment file. See the [Create Deployment Guide](../../../../docs/kubernetes/deployment/create_deployment.md) to learn how to deploy the deployment file.
First, create a secret for the HuggingFace token. First, create a secret for the HuggingFace token.
```bash ```bash
...@@ -258,7 +258,7 @@ For detailed configuration instructions, see the [KV cache transfer guide](../.. ...@@ -258,7 +258,7 @@ For detailed configuration instructions, see the [KV cache transfer guide](../..
## Request Migration ## Request Migration
You can enable [request migration](../../../../docs/architecture/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations: You can enable [request migration](../../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:
```yaml ```yaml
args: args:
...@@ -277,11 +277,11 @@ Configure the `model` name and `host` based on your deployment. ...@@ -277,11 +277,11 @@ Configure the `model` name and `host` based on your deployment.
## Further Reading ## Further Reading
- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md) - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md) - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md) - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md) - **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/kv_cache_routing.md)
- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md) - **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md) - **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
......
...@@ -224,7 +224,7 @@ All templates use **Qwen/Qwen3-0.6B** as the default model, but you can use any ...@@ -224,7 +224,7 @@ All templates use **Qwen/Qwen3-0.6B** as the default model, but you can use any
## Request Migration ## Request Migration
You can enable [request migration](../../../../docs/architecture/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations: You can enable [request migration](../../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:
```yaml ```yaml
args: args:
...@@ -234,12 +234,12 @@ args: ...@@ -234,12 +234,12 @@ args:
## Further Reading ## Further Reading
- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md) - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md) - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md) - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/planner/sla_planner_quickstart.md) - **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/planner/sla_planner_quickstart.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md) - **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/kv_cache_routing.md)
## Troubleshooting ## Troubleshooting
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
# Standalone Router # Standalone Router
A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [KV Cache Routing documentation](/docs/architecture/kv_cache_routing.md). A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md).
## Overview ## Overview
...@@ -29,7 +29,7 @@ python -m dynamo.router \ ...@@ -29,7 +29,7 @@ python -m dynamo.router \
- `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`) - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
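
As an illustration of the `namespace.component.endpoint` format described for `--endpoint`, the sketch below splits such a path into its three parts. The `EndpointPath` type and `parse_endpoint` helper are hypothetical names introduced here for illustration; they are not part of Dynamo's API, and the real router may validate the path differently.

```python
from typing import NamedTuple


class EndpointPath(NamedTuple):
    """Hypothetical container for the three parts of an endpoint path."""
    namespace: str
    component: str
    endpoint: str


def parse_endpoint(path: str) -> EndpointPath:
    """Split 'namespace.component.endpoint' into its parts (illustrative only)."""
    parts = path.split(".")
    # Require exactly three non-empty, dot-separated segments.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected 'namespace.component.endpoint', got {path!r}"
        )
    return EndpointPath(*parts)


if __name__ == "__main__":
    print(parse_endpoint("dynamo.prefill.generate"))
```

With the example value from the option description, `dynamo.prefill.generate`, the namespace is `dynamo`, the component is `prefill`, and the endpoint is `generate`.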
**Router Configuration:** **Router Configuration:**
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [KV Cache Routing documentation](/docs/architecture/kv_cache_routing.md). For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md).
## Architecture ## Architecture
...@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p ...@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
## Example: Manual Disaggregated Serving (Alternative Setup) ## Example: Manual Disaggregated Serving (Alternative Setup)
> [!Note] > [!Note]
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [KV Cache Routing documentation](/docs/architecture/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for the default setup. > **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [KV Cache Routing documentation](../../../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for the default setup.
> >
> Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately. > Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
...@@ -103,6 +103,6 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere ...@@ -103,6 +103,6 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
## See Also ## See Also
- [KV Cache Routing Architecture](/docs/architecture/kv_cache_routing.md) - Detailed explanation of KV-aware routing - [KV Cache Routing Architecture](/docs/router/kv_cache_routing.md) - Detailed explanation of KV-aware routing
- [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing - [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
- [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning - [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
...@@ -21,7 +21,7 @@ This directory contains a pre-deployment check script that verifies your Kuberne ...@@ -21,7 +21,7 @@ This directory contains a pre-deployment check script that verifies your Kuberne
- For NCCL tests, please refer to the [NCCL tests](https://docs.nebius.com/kubernetes/gpu/nccl-test#run-tests) for more details. - For NCCL tests, please refer to the [NCCL tests](https://docs.nebius.com/kubernetes/gpu/nccl-test#run-tests) for more details.
- For NIXL benchmark, please refer to the [NIXL benchmark pre-deployment checks](/deploy/cloud/pre-deployment/nixl/README.md) for more details. For the latest pre-deployment check instructions, see the [main branch version of this README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/cloud/pre-deployment/README.md).
## Usage ## Usage
......
...@@ -16,7 +16,7 @@ Currently, these setups are only supported with the kGateway based Inference Gat ...@@ -16,7 +16,7 @@ Currently, these setups are only supported with the kGateway based Inference Gat
- [Prerequisites](#prerequisites) - [Prerequisites](#prerequisites)
- [Installation Steps](#installation-steps) - [Installation Steps](#installation-steps)
- [Usage](#usage) - [Usage](#6-usage)
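
The change from `#usage` to `#6-usage` suggests that the target heading carries a number (assumed here to read `6. Usage`; the heading itself is not shown in this hunk). A rough sketch of how Markdown renderers commonly derive heading anchors follows. `heading_anchor` is a hypothetical helper that approximates the usual rules (lowercase, drop punctuation, turn spaces into hyphens) and is not an exact copy of any renderer's algorithm.

```python
import re


def heading_anchor(heading: str) -> str:
    """Approximate the anchor a Markdown renderer derives from a heading."""
    text = heading.strip().lower()
    # Drop everything except word characters, hyphens, and spaces.
    text = re.sub(r"[^\w\- ]", "", text)
    # Each remaining space becomes a hyphen.
    return text.replace(" ", "-")


if __name__ == "__main__":
    for title in ("Prerequisites", "Installation Steps", "6. Usage"):
        print(f"{title!r} -> #{heading_anchor(title)}")
```

Under these assumptions a numbered heading keeps its number in the anchor, which is why a table-of-contents link that omits it would not resolve.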
## Prerequisites ## Prerequisites
...@@ -160,7 +160,7 @@ You can configure the plugin by setting environment vars in your [values-dynamo- ...@@ -160,7 +160,7 @@ You can configure the plugin by setting environment vars in your [values-dynamo-
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration). - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using kv-routing - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using kv-routing
- See the [KV cache routing design](../../docs/architecture/kv_cache_routing.md) for details. - See the [KV cache routing design](../../docs/router/kv_cache_routing.md) for details.
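
To make the effect of `DYNAMO_ROUTER_TEMPERATURE` concrete, here is a generic sketch of temperature-scaled softmax sampling over per-worker scores. It is not Dynamo's implementation: the `pick_worker` function, the "higher score is better" convention, and the example scores are all assumptions made for illustration.

```python
import math
import random


def pick_worker(scores, temperature, rng=random):
    """Sample a worker ID from a {worker: score} map (illustrative only).

    Assumes a higher score means a better candidate.
    """
    if not scores:
        raise ValueError("no workers to choose from")
    # Zero (or negative) temperature: always take the top-scoring worker.
    if temperature <= 0:
        return max(scores, key=scores.get)
    # Softmax weights, shifted by the top score for numerical stability.
    top = max(scores.values())
    workers = list(scores)
    weights = [math.exp((scores[w] - top) / temperature) for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


if __name__ == "__main__":
    example = {"worker-a": 0.9, "worker-b": 0.6, "worker-c": 0.1}
    rng = random.Random(0)
    for t in (0.0, 0.1, 1.0, 10.0):
        picks = [pick_worker(example, t, rng) for _ in range(1000)]
        share = picks.count("worker-a") / len(picks)
        print(f"temperature={t}: worker-a chosen {share:.0%} of the time")
```

In this sketch a temperature near zero makes the choice effectively deterministic, while larger values flatten the weights so lower-scoring workers are picked more often.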
......
# Dynamo Logging on Kubernetes # Dynamo Logging on Kubernetes
For detailed documentation on collecting and visualizing logs on Kubernetes, see [docs/kubernetes/logging.md](../../docs/kubernetes/logging.md). For detailed documentation on collecting and visualizing logs on Kubernetes, see [docs/kubernetes/observability/logging.md](../../docs/kubernetes/observability/logging.md).
# Dynamo Metrics Collection on Kubernetes # Dynamo Metrics Collection on Kubernetes
For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/kubernetes/metrics.md](../../../docs/kubernetes/metrics.md). For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/kubernetes/observability/metrics.md](../../../docs/kubernetes/observability/metrics.md).
Overview
============
.. include:: ../architecture/architecture.md
:parser: myst_parser.sphinx_
.. toctree::
:hidden:
Overview <self>
Disaggregated Serving <../architecture/disagg_serving>
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Backends Backends
======== ========
NVIDIA Dynamo supports multiple inference backends to provide flexibility and performance optimization for different use cases and model architectures. Backends are the underlying engines that execute AI model inference, each optimized for specific scenarios, hardware configurations, and performance requirements.
Overview
--------
Dynamo's multi-backend architecture allows you to:
* **Choose the optimal engine** for your specific workload and hardware
* **Switch between backends** without changing your application code
* **Leverage specialized optimizations** from each backend
* **Scale flexibly** across different deployment scenarios
Supported Backends
------------------
Dynamo currently supports the following high-performance inference backends:
.. toctree:: .. toctree::
:maxdepth: 1 :maxdepth: 1
......
Deployment Guide
================
.. toctree::
:hidden:
Kubernetes Quickstart <../kubernetes/README>
Detailed Installation Guide <../kubernetes/installation_guide>
Dynamo Operator <../kubernetes/dynamo_operator>
Minikube Setup <../kubernetes/deployment/minikube>
Multinode
=========
.. toctree::
:hidden:
Multinode Deployments <../kubernetes/deployment/multinode-deployment>
Grove <../kubernetes/grove>
Observability
=============
.. toctree::
:hidden:
Metrics <../kubernetes/observability/metrics>
Logging <../kubernetes/observability/logging>
Observability
=============
.. toctree::
:hidden:
Metrics <../observability/metrics>
Logging <../observability/logging>
Health Checks <../observability/health-checks>
\ No newline at end of file
...@@ -34,9 +34,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -34,9 +34,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | SGLang | Notes | | Feature | SGLang | Notes |
|---------|--------|-------| |---------|--------|-------|
| [**Disaggregated Serving**](../../architecture/disagg_serving.md) | ✅ | | | [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) | | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../architecture/kv_cache_routing.md) | ✅ | | | [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | | | [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | | | [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | |
| [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned | | [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |
...@@ -55,7 +55,7 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu ...@@ -55,7 +55,7 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu
| Argument | Description | Default | SGLang Equivalent | | Argument | Description | Default | SGLang Equivalent |
|----------|-------------|---------|-------------------| |----------|-------------|---------|-------------------|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A | | `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../../docs/architecture/request_migration.md). | `0` (disabled) | N/A | | `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault_tolerance/request_migration.md). | `0` (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` | | `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` | | `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A | | `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
...@@ -83,7 +83,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re ...@@ -83,7 +83,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
> [!WARNING] > [!WARNING]
> ⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode. > ⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
For more details, see the [Request Cancellation Architecture](../../architecture/request_cancellation.md) documentation. For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
## Installation ## Installation
......
...@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
### Components ### Components
- workers: For aggregated serving, we have two workers, [MultimodalEncodeWorker](src/dynamo/sglang/request_handlers/multimodal_encode_worker_handler.py) for encoding and [MultimodalWorker](src/dynamo/sglang/request_handlers/multimodal_worker_handler.py) for prefilling and decoding. - workers: For aggregated serving, we have two workers, [MultimodalEncodeWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding and [MultimodalWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker. - processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow ### Workflow
...@@ -109,7 +109,7 @@ You should see a response similar to this: ...@@ -109,7 +109,7 @@ You should see a response similar to this:
### Components ### Components
- workers: For disaggregated serving, we have three workers, [MultimodalEncodeWorker](src/dynamo/sglang/request_handlers/multimodal_encode_worker_handler.py) for encoding, [MultimodalWorker](src/dynamo/sglang/request_handlers/multimodal_worker_handler.py) for decoding, and [MultimodalPrefillWorker](src/dynamo/sglang/request_handlers/multimodal_worker_handler.py) for prefilling. - workers: For disaggregated serving, we have three workers, [MultimodalEncodeWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding, [MultimodalWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding, and [MultimodalPrefillWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker. - processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow ### Workflow
......