Commit b19de4ed authored by dagil-nvidia, committed by GitHub

docs: cleanup of docs refactor for components, integrations, and features (#6019)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 80e7bafd
...@@ -52,10 +52,10 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open ...@@ -52,10 +52,10 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
|---|:----:|:----------:|:--:| |---|:----:|:----------:|:--:|
| **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage | | **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ | | [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ | | [**KV-Aware Routing**](docs/components/router/README.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ | | [**SLA-Based Planner**](docs/components/planner/planner_guide.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ | | [**KVBM**](docs/components/kvbm/README.md) | 🚧 | ✅ | ✅ |
| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ | | [**Multimodal**](docs/features/multimodal/README.md) | ✅ | ✅ | ✅ |
| [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ | | [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |
> **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions. > **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
...@@ -347,7 +347,7 @@ python3 -m dynamo.frontend ...@@ -347,7 +347,7 @@ python3 -m dynamo.frontend
Dynamo provides comprehensive benchmarking tools: Dynamo provides comprehensive benchmarking tools:
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf - **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements - **[SLA-Driven Deployments](docs/components/planner/planner_guide.md)** – Optimize deployments to meet SLA requirements
## Frontend OpenAPI Specification ## Frontend OpenAPI Specification
...@@ -357,7 +357,7 @@ The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To ...@@ -357,7 +357,7 @@ The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To
cargo run -p dynamo-llm --bin generate-frontend-openapi cargo run -p dynamo-llm --bin generate-frontend-openapi
``` ```
This writes to `docs/frontends/openapi.json`. This writes to `docs/reference/api/openapi.json`.
## Service Discovery and Messaging ## Service Discovery and Messaging
...@@ -388,9 +388,9 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL ...@@ -388,9 +388,9 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL
<!-- Reference links for Feature Compatibility Matrix --> <!-- Reference links for Feature Compatibility Matrix -->
[disagg]: docs/design_docs/disagg_serving.md [disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/router/README.md [kv-routing]: docs/components/router/README.md
[planner]: docs/planner/sla_planner.md [planner]: docs/components/planner/planner_guide.md
[kvbm]: docs/kvbm/README.md [kvbm]: docs/components/kvbm/README.md
[mm]: examples/multimodal/ [mm]: examples/multimodal/
[migration]: docs/fault_tolerance/request_migration.md [migration]: docs/fault_tolerance/request_migration.md
[lora]: examples/backends/vllm/deploy/lora/README.md [lora]: examples/backends/vllm/deploy/lora/README.md
......
../../docs/benchmarks/sla_driven_profiling.md
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Profiler
Documentation for the Dynamo Profiler has moved to [docs/components/profiler/](../../docs/components/profiler/README.md).
- [Profiler Overview](../../docs/components/profiler/README.md)
- [Profiler Guide](../../docs/components/profiler/profiler_guide.md)
- [Profiler Examples](../../docs/components/profiler/profiler_examples.md)
...@@ -620,7 +620,7 @@ def create_gradio_interface( ...@@ -620,7 +620,7 @@ def create_gradio_interface(
> 📝 **Note:** The dotted red line in the prefill and decode charts are default TTFT and ITL SLAs if not specified. > 📝 **Note:** The dotted red line in the prefill and decode charts are default TTFT and ITL SLAs if not specified.
> ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md#2-correction-factor-calculation) to adjust dynamically at runtime. > ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/design_docs/planner_design.md#step-2-correction-factor-calculation) to adjust dynamically at runtime.
> 💡 **Tip:** Use the GPU cost checkbox and input in the charts section to convert GPU hours to cost. > 💡 **Tip:** Use the GPU cost checkbox and input in the charts section to convert GPU hours to cost.
""" """
......
...@@ -127,7 +127,7 @@ To see all available router arguments, run: ...@@ -127,7 +127,7 @@ To see all available router arguments, run:
python -m dynamo.frontend --help python -m dynamo.frontend --help
``` ```
For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md). For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/components/router/router_guide.md).
> [!Note] > [!Note]
> If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead: > If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
...@@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a ...@@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
- Uses the same routing mode as the frontend's `--router-mode` setting - Uses the same routing mode as the frontend's `--router-mode` setting
- Seamlessly integrates with your decode workers for token generation - Seamlessly integrates with your decode workers for token generation
No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details. No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/components/router/router_guide.md#disaggregated-serving) for more details.
> [!Note] > [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh) > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)
......
...@@ -60,7 +60,7 @@ python -m dynamo.mocker \ ...@@ -60,7 +60,7 @@ python -m dynamo.mocker \
The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume. The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.
To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/benchmarks/sla_driven_profiling.md) for details): To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler_guide.md) for details):
```bash ```bash
python benchmarks/profiler/profile_sla.py \ python benchmarks/profiler/profile_sla.py \
......
...@@ -19,5 +19,5 @@ limitations under the License. ...@@ -19,5 +19,5 @@ limitations under the License.
SLA-driven autoscaling controller for Dynamo inference graphs. SLA-driven autoscaling controller for Dynamo inference graphs.
- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples) - **User docs**: [docs/planner/](/docs/components/planner/) (deployment, configuration, examples)
- **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms) - **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
...@@ -29,7 +29,7 @@ logger = logging.getLogger(__name__) ...@@ -29,7 +29,7 @@ logger = logging.getLogger(__name__)
MISSING_PROFILING_DATA_ERROR_MESSAGE = ( MISSING_PROFILING_DATA_ERROR_MESSAGE = (
"SLA-Planner requires pre-deployment profiling results to run.\n" "SLA-Planner requires pre-deployment profiling results to run.\n"
"Please follow /docs/benchmarks/sla_driven_profiling.md to run the profiling first,\n" "Please follow /docs/components/profiler/profiler_guide.md to run the profiling first,\n"
"and make sure the profiling results are present in --profile-results-dir." "and make sure the profiling results are present in --profile-results-dir."
) )
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
# Standalone Router # Standalone Router
A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md). A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/components/router/router_guide.md).
## Overview ## Overview
...@@ -29,7 +29,7 @@ python -m dynamo.router \ ...@@ -29,7 +29,7 @@ python -m dynamo.router \
- `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`) - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
**Router Configuration:** **Router Configuration:**
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md). For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/components/router/router_guide.md).
## Architecture ## Architecture
...@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p ...@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
## Example: Manual Disaggregated Serving (Alternative Setup) ## Example: Manual Disaggregated Serving (Alternative Setup)
> [!Note] > [!Note]
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup. > **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/components/router/router_guide.md#disaggregated-serving) for the default setup.
> >
> Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately. > Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
...@@ -103,7 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere ...@@ -103,7 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
## See Also ## See Also
- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing - [Router Guide](/docs/components/router/router_guide.md) - Configuration and tuning for KV-aware routing
- [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes - [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes
- [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing - [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
- [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning - [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
...@@ -220,7 +220,7 @@ Common Vars for Routing Configuration: ...@@ -220,7 +220,7 @@ Common Vars for Routing Configuration:
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration). - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
- See the [Router Guide](../../docs/router/router_guide.md) for details. - See the [Router Guide](../../docs/components/router/router_guide.md) for details.
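
The effect of `DYNAMO_ROUTER_TEMPERATURE` described above can be illustrated with a small sketch. It assumes a plain softmax over per-worker scores where a higher score is better; the function name `pick_worker`, the score values, and the softmax form itself are illustrative assumptions and may not match the router's actual scoring formula.

```python
import math
import random


def pick_worker(scores, temperature, rng=None):
    """Pick a worker from {worker: score}; higher score is assumed better.

    Hypothetical helper: temperature <= 0 selects the top score
    deterministically, larger values flatten the distribution.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if temperature <= 0:
        return max(scores, key=scores.get)
    rng = rng or random.Random()
    top = max(scores.values())
    # Subtract the top score before exponentiating for numerical stability.
    weights = {w: math.exp((s - top) / temperature) for w, s in scores.items()}
    workers = list(weights)
    return rng.choices(workers, weights=[weights[w] for w in workers], k=1)[0]
```

With `temperature=0` the call always returns the highest-scoring worker, while a large temperature makes all workers nearly equally likely.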
Stand-Alone installation only: Stand-Alone installation only:
......
...@@ -145,7 +145,7 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE ...@@ -145,7 +145,7 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE
For complete benchmarking and profiling workflows: For complete benchmarking and profiling workflows:
- **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints - **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
- **Pre-Deployment Profiling**: See [docs/benchmarks/sla_driven_profiling.md](../../docs/benchmarks/sla_driven_profiling.md) for optimizing configurations before deployment - **Pre-Deployment Profiling**: See [docs/components/profiler/profiler_guide.md](../../docs/components/profiler/profiler_guide.md) for optimizing configurations before deployment
## Notes ## Notes
......
Frontends
=========
.. toctree::
:maxdepth: 1
Frontend Overview <../components/frontend/README.md>
Frontend Guide <../components/frontend/frontend_guide.md>
KServe (deprecated) <../frontends/kserve.md>
\ No newline at end of file
...@@ -103,7 +103,7 @@ flowchart LR ...@@ -103,7 +103,7 @@ flowchart LR
### Multimodal Example ### Multimodal Example
In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md): In the case of the [Dynamo Multimodal Disaggregated Example](../../features/multimodal/multimodal_vllm.md):
1. The HTTP frontend accepts a text prompt and a URL to an image. 1. The HTTP frontend accepts a text prompt and a URL to an image.
......
...@@ -36,10 +36,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -36,10 +36,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|--------|-------| |---------|--------|-------|
| [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | | | [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) | | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | | | [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | | | [**SLA-Based Planner**](../../components/planner/planner_guide.md) | ✅ | |
| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | | | [**Multimodal Support**](../../features/multimodal/multimodal_sglang.md) | ✅ | |
| [**KVBM**](../../kvbm/README.md) | ❌ | Planned | | [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
## Dynamo SGLang Integration ## Dynamo SGLang Integration
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Enable SGLang Hierarchical Cache (HiCache)
This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
## 1) Start the SGLang worker with HiCache enabled
```bash
python -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 \
--page-size 64 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl \
--log-level debug \
--skip-tokenizer-init
```
- **--enable-hierarchical-cache**: Enables hierarchical KV cache/offload
- **--hicache-ratio**: The ratio of the size of host KV cache memory pool to the size of device pool. Lower this number if your machine has less CPU memory.
- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)
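
As a rough illustration of what `--hicache-ratio` implies for host memory, the sketch below multiplies an assumed device KV pool size by the ratio. The function name and the 20 GiB device pool figure are hypothetical and not taken from SGLang; actual pool sizes depend on the model, the GPU, and SGLang's memory settings.

```python
def host_kv_pool_gib(device_kv_pool_gib: float, hicache_ratio: float) -> float:
    """Estimate the host KV pool size as ratio x device pool size (hypothetical helper)."""
    if device_kv_pool_gib < 0 or hicache_ratio < 0:
        raise ValueError("pool size and ratio must be non-negative")
    return device_kv_pool_gib * hicache_ratio


# Assumed example: a 20 GiB device KV pool with --hicache-ratio 2
print(host_kv_pool_gib(20.0, 2.0))  # 40.0
```

If the estimate exceeds the CPU memory you can spare, lowering the ratio is the adjustment the flag description suggests.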
Then, start the frontend:
```bash
python -m dynamo.frontend --http-port 8000
```
## 2) Send a single request
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": false,
"max_tokens": 30
}'
```
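
The same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are illustrative, and the base URL assumes the frontend is listening on `localhost:8000` as in the command above.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B",
                       base_url="http://localhost:8000", max_tokens=30):
    """Build a POST request mirroring the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON body; requires a running frontend."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Once the worker and frontend are up, `send(build_chat_request("Explain why Roger Federer is considered one of the greatest tennis players of all time"))` returns the parsed response as a dictionary.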
## 3) (Optional) Benchmarking
Run the perf script:
```bash
bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
--model Qwen/Qwen3-0.6B \
--tensor-parallelism 1 \
--data-parallelism 1 \
--concurrency "2,4,8" \
--input-sequence-length 2048 \
--output-sequence-length 256
```
...@@ -55,10 +55,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -55,10 +55,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|--------------|-------| |---------|--------------|-------|
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | | | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet | | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | | | [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned | | [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | | | [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |
### Large Scale P/D and WideEP Features ### Large Scale P/D and WideEP Features
...@@ -114,7 +114,7 @@ apt-get update && apt-get -y install git git-lfs ...@@ -114,7 +114,7 @@ apt-get update && apt-get -y install git git-lfs
> [!IMPORTANT] > [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals. > Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md). For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router_guide.md).
### Aggregated ### Aggregated
```bash ```bash
...@@ -231,7 +231,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t ...@@ -231,7 +231,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t
## Multimodal support ## Multimodal support
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md). Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal_trtllm.md).
## Logits Processing ## Logits Processing
...@@ -327,7 +327,7 @@ For detailed instructions on running comprehensive performance sweeps across bot ...@@ -327,7 +327,7 @@ For detailed instructions on running comprehensive performance sweeps across bot
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests. Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) . Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
## Known Issues and Mitigations ## Known Issues and Mitigations
......
...@@ -37,10 +37,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -37,10 +37,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|------|-------| |---------|------|-------|
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | | | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP | | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | | | [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP | | [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | | | [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |
| [**LMCache**](../../integrations/lmcache_integration.md) | ✅ | | | [**LMCache**](../../integrations/lmcache_integration.md) | ✅ | |
| [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag | | [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
...@@ -144,7 +144,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu ...@@ -144,7 +144,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node. Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy. This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.
**Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md) **Guide:** [Speculative Decoding Quickstart](../../features/speculative_decoding/speculative_decoding_vllm.md)
> **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation. > **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
> **Note**: This content has moved to [Speculative Decoding with vLLM](../../features/speculative_decoding/speculative_decoding_vllm.md).
> See [Speculative Decoding Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
> This file will be removed in a future release.
# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
Since the model is only **8B parameters**, you can run it on **any GPU with at least 16GB VRAM**.
## Step 1: Set Up Your Docker Environment
First, we’ll initialize a Docker container using the VLLM backend.
You can refer to the [VLLM Quickstart Guide](./README.md#vllm-quick-start) — or follow the full steps below.
### 1. Launch Docker Compose
```bash
docker compose -f deploy/docker-compose.yml up -d
```
### 2. Build the Container
```bash
./container/build.sh --framework VLLM
```
### 3. Run the Container
```bash
./container/run.sh -it --framework VLLM --mount-workspace
```
## Step 2: Get Access to the Llama-3 Model
The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
Approval usually takes around **5 minutes**.
Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:
```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```
## Step 3: Run Aggregated Speculative Decoding
Now that your environment is ready, start the aggregated server with **speculative decoding**.
```bash
# Requires only one GPU
cd examples/backends/vllm
bash launch/agg_spec_decoding.sh
```
Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
## Step 4: Example Request
To verify your setup, try sending a simple prompt to your model:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
],
"max_tokens": 250
}'
```
### Example Output
```json
{
"id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
"choices": [
{
"text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
"index": 0,
"finish_reason": "stop"
}
],
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"usage": {
"prompt_tokens": 16,
"completion_tokens": 250,
"total_tokens": 266
}
}
```
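
A quick way to sanity-check a response like the one above is to confirm that the token counts in `usage` add up. The sketch below is illustrative only: `check_usage` is a hypothetical helper, and the sample is a trimmed copy of the example output.

```python
import json


def check_usage(response: dict) -> bool:
    """Return True when prompt and completion tokens sum to the reported total."""
    usage = response["usage"]
    return usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]


# Trimmed copy of the example output shown above
sample = json.loads(
    '{"usage": {"prompt_tokens": 16, "completion_tokens": 250, "total_tokens": 266}}'
)
print(check_usage(sample))  # True
```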
## Additional Resources
* [VLLM Quickstart](./README.md#vllm-quick-start)
* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
\ No newline at end of file
...@@ -78,4 +78,4 @@ See the [Frontend Guide](frontend_guide.md) for full configuration options. ...@@ -78,4 +78,4 @@ See the [Frontend Guide](frontend_guide.md) for full configuration options.
| Document | Description | | Document | Description |
|----------|-------------| |----------|-------------|
| [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration | | [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration |
| [Router Documentation](../../router/README.md) | KV-aware routing configuration | | [Router Documentation](../router/README.md) | KV-aware routing configuration |