docs: cleanup of docs refactor for components, integrations, and features (#6019)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Commit b19de4ed (unverified), authored Feb 05, 2026 by dagil-nvidia, committed by GitHub Feb 05, 2026
20 changed files
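For reviewers who want to check other branches or downstream docs for links this refactor leaves stale, here is a minimal sketch of the old-to-new path mapping. The mapping entries are copied from the renames visible in this diff. The helper names (`suggest`, `find_stale_links`) and the regex are hypothetical and not part of the repository. The sketch only catches paths that include the `docs/` prefix, so relative links such as `../../router/README.md` are missed, and it does not handle anchors that changed (for example, the correction-factor link now points into `docs/design_docs/planner_design.md`).

```python
import re
from typing import List, Optional, Tuple

# Exact file moves seen in this diff (checked before the prefix moves).
FILE_MOVES = {
    "docs/planner/sla_planner.md": "docs/components/planner/planner_guide.md",
    "docs/planner/sla_planner_quickstart.md": "docs/components/planner/planner_guide.md",
    "docs/planner/load_planner.md": "docs/components/planner/README.md",
    "docs/benchmarks/sla_driven_profiling.md": "docs/components/profiler/profiler_guide.md",
    "docs/multimodal/index.md": "docs/features/multimodal/README.md",
    "docs/frontends/openapi.json": "docs/reference/api/openapi.json",
}

# Directory moves seen in this diff.
PREFIX_MOVES = {
    "docs/router/": "docs/components/router/",
    "docs/kvbm/": "docs/components/kvbm/",
    "docs/planner/": "docs/components/planner/",
}

# Hypothetical pattern: a "docs/..." path, stopping at '#', ')', or whitespace.
_DOC_PATH = re.compile(r"docs/[\w./-]+")


def suggest(path: str) -> Optional[str]:
    """Return the post-refactor path for an old docs path, or None if unchanged."""
    if path in FILE_MOVES:
        return FILE_MOVES[path]
    for old, new in PREFIX_MOVES.items():
        if path.startswith(old):
            return new + path[len(old):]
    return None


def find_stale_links(text: str) -> List[Tuple[str, str]]:
    """List (old_path, suggested_path) pairs for stale docs paths found in text."""
    hits = []
    for match in _DOC_PATH.finditer(text):
        path = match.group(0).rstrip(".")  # drop a sentence-ending period
        new = suggest(path)
        if new is not None:
            hits.append((path, new))
    return hits
```

Running `find_stale_links` over each Markdown file and printing the pairs gives a quick list of candidates to review by hand. The suggestions are starting points, not guaranteed targets.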
--- a/README.md
+++ b/README.md
@@ -52,10 +52,10 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
 |---|:----:|:----------:|:--:|
 | **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
 | [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
-| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ |
+| [**KV-Aware Routing**](docs/components/router/README.md) | ✅ | ✅ | ✅ |
-| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
+| [**SLA-Based Planner**](docs/components/planner/planner_guide.md) | ✅ | ✅ | ✅ |
-| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
+| [**KVBM**](docs/components/kvbm/README.md) | 🚧 | ✅ | ✅ |
-| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
+| [**Multimodal**](docs/features/multimodal/README.md) | ✅ | ✅ | ✅ |
 | [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |
 > **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
@@ -347,7 +347,7 @@ python3 -m dynamo.frontend
 Dynamo provides comprehensive benchmarking tools:
 - **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
+- **[SLA-Driven Deployments](docs/components/planner/planner_guide.md)** – Optimize deployments to meet SLA requirements
 ## Frontend OpenAPI Specification
@@ -357,7 +357,7 @@ The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To
 cargo run -p dynamo-llm --bin generate-frontend-openapi
 ```
-This writes to `docs/frontends/openapi.json`.
+This writes to `docs/reference/api/openapi.json`.
 ## Service Discovery and Messaging
@@ -388,9 +388,9 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL
 <!-- Reference links for Feature Compatibility Matrix -->
 [disagg]: docs/design_docs/disagg_serving.md
-[kv-routing]: docs/router/README.md
+[kv-routing]: docs/components/router/README.md
-[planner]: docs/planner/sla_planner.md
+[planner]: docs/components/planner/planner_guide.md
-[kvbm]: docs/kvbm/README.md
+[kvbm]: docs/components/kvbm/README.md
 [mm]: examples/multimodal/
 [migration]: docs/fault_tolerance/request_migration.md
 [lora]: examples/backends/vllm/deploy/lora/README.md

--- a/benchmarks/profiler/README.md
+++ b/benchmarks/profiler/README.md
-../../docs/benchmarks/sla_driven_profiling.md
\ No newline at end of file
--- a/benchmarks/profiler/README.md
+++ b/benchmarks/profiler/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Profiler
+Documentation for the Dynamo Profiler has moved to [docs/components/profiler/](../../docs/components/profiler/README.md).
+- [Profiler Overview](../../docs/components/profiler/README.md)
+- [Profiler Guide](../../docs/components/profiler/profiler_guide.md)
+- [Profiler Examples](../../docs/components/profiler/profiler_examples.md)
--- a/benchmarks/profiler/webui/utils.py
+++ b/benchmarks/profiler/webui/utils.py
@@ -620,7 +620,7 @@ def create_gradio_interface(
            > 📝 **Note:** The dotted red line in the prefill and decode charts are default TTFT and ITL SLAs if not specified.
-            > ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md#2-correction-factor-calculation) to adjust dynamically at runtime.
+            > ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/design_docs/planner_design.md#step-2-correction-factor-calculation) to adjust dynamically at runtime.
            > 💡 **Tip:** Use the GPU cost checkbox and input in the charts section to convert GPU hours to cost.
            """

--- a/benchmarks/router/README.md
+++ b/benchmarks/router/README.md
@@ -127,7 +127,7 @@ To see all available router arguments, run:
 python -m dynamo.frontend --help
 ```
-For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md).
+For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/components/router/router_guide.md).
 > [!Note]
 > If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
@@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
 - Uses the same routing mode as the frontend's `--router-mode` setting
 - Seamlessly integrates with your decode workers for token generation
-No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details.
+No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/components/router/router_guide.md#disaggregated-serving) for more details.
 > [!Note]
 > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)

--- a/components/src/dynamo/mocker/README.md
+++ b/components/src/dynamo/mocker/README.md
@@ -60,7 +60,7 @@ python -m dynamo.mocker \
 The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.
-To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/benchmarks/sla_driven_profiling.md) for details):
+To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler_guide.md) for details):
 ```bash
 python benchmarks/profiler/profile_sla.py \

--- a/components/src/dynamo/planner/README.md
+++ b/components/src/dynamo/planner/README.md
@@ -19,5 +19,5 @@ limitations under the License.
 SLA-driven autoscaling controller for Dynamo inference graphs.
- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
+- **User docs**: [docs/planner/](/docs/components/planner/) (deployment, configuration, examples)
 - **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
--- a/components/src/dynamo/planner/utils/perf_interpolation.py
+++ b/components/src/dynamo/planner/utils/perf_interpolation.py
@@ -29,7 +29,7 @@ logger = logging.getLogger(__name__)
 MISSING_PROFILING_DATA_ERROR_MESSAGE = (
    "SLA-Planner requires pre-deployment profiling results to run.\n"
-    "Please follow /docs/benchmarks/sla_driven_profiling.md to run the profiling first,\n"
+    "Please follow /docs/components/profiler/profiler_guide.md to run the profiling first,\n"
    "and make sure the profiling results are present in --profile-results-dir."
 )

--- a/components/src/dynamo/router/README.md
+++ b/components/src/dynamo/router/README.md
@@ -3,7 +3,7 @@
 # Standalone Router
-A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md).
+A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/components/router/router_guide.md).
 ## Overview
@@ -29,7 +29,7 @@ python -m dynamo.router \
 - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
 **Router Configuration:**
-For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md).
+For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/components/router/router_guide.md).
 ## Architecture
@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
 ## Example: Manual Disaggregated Serving (Alternative Setup)
 > [!Note]
-> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup.
+> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/components/router/router_guide.md#disaggregated-serving) for the default setup.
 >
 > Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
@@ -103,7 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
 ## See Also
- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing
+- [Router Guide](/docs/components/router/router_guide.md) - Configuration and tuning for KV-aware routing
 - [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes
 - [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
 - [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -220,7 +220,7 @@ Common Vars for Routing Configuration:
  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
-  - See the [Router Guide](../../docs/router/router_guide.md) for details.
+  - See the [Router Guide](../../docs/components/router/router_guide.md) for details.
 Stand-Alone installation only:

--- a/deploy/utils/README.md
+++ b/deploy/utils/README.md
@@ -145,7 +145,7 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE
 For complete benchmarking and profiling workflows:
 - **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
- **Pre-Deployment Profiling**: See [docs/benchmarks/sla_driven_profiling.md](../../docs/benchmarks/sla_driven_profiling.md) for optimizing configurations before deployment
+- **Pre-Deployment Profiling**: See [docs/components/profiler/profiler_guide.md](../../docs/components/profiler/profiler_guide.md) for optimizing configurations before deployment
 ## Notes

--- a/docs/_sections/frontends.rst
+++ b/docs/_sections/frontends.rst
-Frontends
-=========
-.. toctree::
-   :maxdepth: 1
-   Frontend Overview <../components/frontend/README.md>
-   Frontend Guide <../components/frontend/frontend_guide.md>
-   KServe (deprecated) <../frontends/kserve.md>
\ No newline at end of file
--- a/docs/api/nixl_connect/README.md
+++ b/docs/api/nixl_connect/README.md
@@ -103,7 +103,7 @@ flowchart LR
 ### Multimodal Example
-In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):
+In the case of the [Dynamo Multimodal Disaggregated Example](../../features/multimodal/multimodal_vllm.md):
 1. The HTTP frontend accepts a text prompt and a URL to an image.

--- a/docs/backends/sglang/README.md
+++ b/docs/backends/sglang/README.md
@@ -36,10 +36,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------|-------|
 | [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../components/planner/planner_guide.md) | ✅ |  |
-| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ |  |
+| [**Multimodal Support**](../../features/multimodal/multimodal_sglang.md) | ✅ |  |
-| [**KVBM**](../../kvbm/README.md) | ❌ | Planned |
+| [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
 ## Dynamo SGLang Integration

--- a/docs/backends/sglang/sgl-hicache-example.md
+++ b/docs/backends/sglang/sgl-hicache-example.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-# Enable SGLang Hierarchical Cache (HiCache)
-This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
-## 1) Start the SGLang worker with HiCache enabled
-```bash
-python -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --host 0.0.0.0 --port 8000 \
-  --page-size 64 \
-  --enable-hierarchical-cache \
-  --hicache-ratio 2 \
-  --hicache-write-policy write_through \
-  --hicache-storage-backend nixl \
-  --log-level debug \
-  --skip-tokenizer-init
-```
- **--enable-hierarchical-cache**: Enables hierarchical KV cache/offload
- **--hicache-ratio**: The ratio of the size of host KV cache memory pool to the size of device pool. Lower this number if your machine has less CPU memory.
- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)
-Then, start the frontend:
-```bash
-python -m dynamo.frontend --http-port 8000
-```
-## 2) Send a single request
-```bash
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-      {
-        "role": "user",
-        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
-      }
-    ],
-    "stream": false,
-    "max_tokens": 30
-  }'
-```
-## 3) (Optional) Benchmarking
-Run the perf script:
-```bash
-bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
-  --model Qwen/Qwen3-0.6B \
-  --tensor-parallelism 1 \
-  --data-parallelism 1 \
-  --concurrency "2,4,8" \
-  --input-sequence-length 2048 \
-  --output-sequence-length 256
-```
--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -55,10 +55,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ |  |
-| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
+| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | Planned |
-| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
+| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |
 ### Large Scale P/D and WideEP Features
@@ -114,7 +114,7 @@ apt-get update && apt-get -y install git git-lfs
 > [!IMPORTANT]
 > Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
-For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md).
+For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router_guide.md).
 ### Aggregated
 ```bash
@@ -231,7 +231,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t
 ## Multimodal support
-Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal_trtllm.md).
 ## Logits Processing
@@ -327,7 +327,7 @@ For detailed instructions on running comprehensive performance sweeps across bot
 Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
-Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
+Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
 ## Known Issues and Mitigations

--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -37,10 +37,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ |  |
-| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
+| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | WIP |
-| [**KVBM**](../../../docs/kvbm/README.md) | ✅ |  |
+| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ |  |
 | [**LMCache**](../../integrations/lmcache_integration.md) | ✅ |  |
 | [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
@@ -144,7 +144,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
 Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
 This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.
-**Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md)
+**Guide:** [Speculative Decoding Quickstart](../../features/speculative_decoding/speculative_decoding_vllm.md)
 > **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.

--- a/docs/backends/vllm/speculative_decoding.md
+++ b/docs/backends/vllm/speculative_decoding.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-> **Note**: This content has moved to [Speculative Decoding with vLLM](../../features/speculative_decoding/speculative_decoding_vllm.md).
-> See [Speculative Decoding Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
-> This file will be removed in a future release.
-# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
-This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
-Since the model is only **8B parameters**, you can run it on **any GPU with at least 16GB VRAM**.
-## Step 1: Set Up Your Docker Environment
-First, we’ll initialize a Docker container using the VLLM backend.
-You can refer to the [VLLM Quickstart Guide](./README.md#vllm-quick-start) — or follow the full steps below.
-### 1. Launch Docker Compose
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-### 2. Build the Container
-```bash
-./container/build.sh --framework VLLM
-```
-### 3. Run the Container
-```bash
-./container/run.sh -it --framework VLLM --mount-workspace
-```
-## Step 2: Get Access to the Llama-3 Model
-The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
-Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
-Approval usually takes around **5 minutes**.
-Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:
-```bash
-export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
-export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
-```
-## Step 3: Run Aggregated Speculative Decoding
-Now that your environment is ready, start the aggregated server with **speculative decoding**.
-```bash
-# Requires only one GPU
-cd examples/backends/vllm
-bash launch/agg_spec_decoding.sh
-```
-Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
-## Step 4: Example Request
-To verify your setup, try sending a simple prompt to your model:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
-     "messages": [
-       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
-     ],
-     "max_tokens": 250
-   }'
-```
-### Example Output
-```json
-{
-  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
-  "choices": [
-    {
-      "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
-      "index": 0,
-      "finish_reason": "stop"
-    }
-  ],
-  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
-  "usage": {
-    "prompt_tokens": 16,
-    "completion_tokens": 250,
-    "total_tokens": 266
-  }
-}
-```
-## Additional Resources
-* [VLLM Quickstart](./README.md#vllm-quick-start)
-* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
\ No newline at end of file
--- a/docs/benchmarks/sla_driven_profiling.md
+++ b/docs/benchmarks/sla_driven_profiling.md
-# SLA-Driven Profiling with DynamoGraphDeploymentRequest
-> [!TIP]
-> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
-> [!NOTE]
-> **See also**: [Profiler Component Overview](/docs/components/profiler/README.md) for a quick start guide and feature matrix.
-## Overview
-Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically.
-**Key Benefits:**
- **Declarative**: Specify SLAs, not implementation details
- **Automated**: No manual job setup or result processing
- **Integrated**: Seamlessly works with Dynamo Operator
- **Production-Ready**: Generates optimized configurations with SLA planner
-This document covers:
- Technical details of online vs offline profiling
- Profiling process internals (GPU usage, measurements, interpolation)
- Direct script usage for advanced scenarios
- Comprehensive troubleshooting
-## Support Matrix
-| Backend | Dense Models | MoE Models |
-|---------|-------------|------------|
-| vLLM | ✅ | 🚧 |
-| SGLang | ✅ | ✅ |
-| TensorRT-LLM | ✅ | 🚧 |
-Specifically, the profiler sweeps over the following parallelization mapping for prefill and decode:
-| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
-|---------|-------------|------------|
-| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
-| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
-| Other Models | TP | TP |
-> [!NOTE]
-> - Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend.
-## Using DGDR for Profiling (Recommended)
-The recommended way to profile models is through DGDRs. Sample configurations are provided in `deploy/`:
-**Available Samples:**
- **`profile_sla_dgdr.yaml`**: Standard profiling with AIPerf on real engines
- **`profile_sla_aic_dgdr.yaml`**: Fast profiling with AI Configurator simulation
- **`profile_sla_moe_dgdr.yaml`**: MoE model profiling
-The Dynamo Operator automatically:
-1. Discovers GPU resources (cluster-scoped operators only)
-2. Runs profiling (AIPerf on real engines or AI Configurator simulation)
-3. Generates optimal DGD configuration with SLA planner
-4. Deploys the DGD to your cluster
-See the [Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for prerequisites and detailed instructions.
-## Hardware Configuration
-Hardware parameters have sensible defaults and are **optional** - you can override them if needed:
-```yaml
-profilingConfig:
-  config:
-    # Override hardware defaults if needed
-    hardware:
-      minNumGpusPerEngine: 1
-      maxNumGpusPerEngine: 8
-      numGpusPerNode: 8
-    # Only needed when using AI Configurator (sweep.useAiConfigurator: true)
-    sweep:
-      aicSystem: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
-```
-### Automatic GPU Discovery (Optional Feature)
-Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, hardware config is auto-detected and overrides any manually specified values.
-```yaml
-spec:
-  enableGpuDiscovery: true
-```
-This feature is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.
-## Profiling Method
-1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
-2. **Identify Sweep Ranges**: Automatically determine minimum and maximum number of GPUs per engine. Minimum is determined by the model size and GPU VRAM. Maximum is set to one node for dense model and 4 nodes for MoE models.
-3. **Parallelization Mapping Sweep**: Use the input ISL and OSL, test the performance of the engines with different parallelization mappings.
-   - For dense models, we test different TP sizes for both prefill and decode.
-   - For MoE models (SGLang), we evaluate both TEP and DEP as candidates for prefill and decode.
-   - **Prefill**:
-     - TP/TEP: We measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
-     - DEP: Attention uses data parallelism. We send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst. This stabilizes measurements when the first batch may launch before all requests arrive.
-   ![Prefill Performance](../images/h100_prefill_performance.png)
-   - **Decode**: Since the ITL (or iteration time) is relevant with how many requests are in-flight, we measure the ITL under different number of in-flight requests. The range of the number of in-flight requests is from 1 to the maximum number of requests that the kv cache of the engine can hold. To measure the ITL without being affected by piggy-backed prefill requests, the script will enable kv-reuse and warm up the engine by issuing the same prompts before measuring the ITL. Since the kv cache is sufficient for all the requests, it can hold the kv cache of the pre-computed prompts and skip the prefill phase when measuring the ITL. However, for MoE models, this is not guaranteed because the kv cache in different attention DP ranks is different. We are working on framework-side change to fix this issue. For example, the below plot shows the decode parallelization mapping sweep results for H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B.
-   ![Decode Performance](../images/h100_decode_performance.png)
-4. **Recommendation**: Select the parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL. Specifically, the profiler chooses the point (or, for decode, a point on the curve) that lies to the left of the vertical red dashed line representing the SLA and has the highest y coordinate (throughput per GPU).
-5. **In-Depth Profiling on the Recommended P/D Engine**: After finding the best TP size for prefill and decode, the script interpolates TTFT against ISL, and ITL against active KV cache and decode context length. This gives a more accurate performance estimate when ISL and OSL change, and it is used by the SLA planner.
-![ITL Interpolation](../images/pd_interpolation.png)
-   - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
-   - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths. The active KV usage determines the complexity of the memory-bound attention kernel, while the active KV usage divided by the average context length determines the complexity of the compute-bound MLP kernel. For example, the figure shows the ITL of the DS-Distilled Llama 8B model on H100 TP4. The ITL grows near-linearly with active KV usage at a fixed context length, and the slope increases as the context length decreases.
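The recommendation rule in step 4 can be sketched as a filter followed by a maximum. This is a minimal illustration, not the profiler's own code: the `SweepPoint` record, the `pick_best` helper, and all numbers below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SweepPoint:
    """One measured sweep point (hypothetical record, not the profiler's type)."""
    mapping: str         # parallelization mapping label, e.g. "TP4"
    latency_ms: float    # TTFT for prefill, ITL for decode
    thpt_per_gpu: float  # tokens/s/GPU


def pick_best(points: Sequence[SweepPoint], sla_ms: float) -> Optional[SweepPoint]:
    """Return the SLA-compliant point with the highest per-GPU throughput."""
    # Keep only points to the left of the SLA line (latency within target).
    feasible = [p for p in points if p.latency_ms <= sla_ms]
    if not feasible:
        return None  # corresponds to the "SLA cannot be met" case
    # Among feasible points, take the highest y coordinate (throughput per GPU).
    return max(feasible, key=lambda p: p.thpt_per_gpu)


# Illustrative, made-up prefill measurements.
prefill_points = [
    SweepPoint("TP1", latency_ms=310.0, thpt_per_gpu=9000.0),
    SweepPoint("TP2", latency_ms=160.0, thpt_per_gpu=8800.0),
    SweepPoint("TP4", latency_ms=90.0, thpt_per_gpu=7600.0),
]
best = pick_best(prefill_points, sla_ms=200.0)
print(best.mapping if best else "no mapping meets the SLA")
```

With a 200 ms TTFT target, TP1 is excluded even though it has the highest throughput, so the sketch selects TP2.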
-To run the parallelization mapping sweep and the in-depth profiling on the recommended P/D engine, the profiler needs to know the engine's forward pass time under different loads. There are two ways to obtain this: run AIPerf on real engines or use AI Configurator to run simulations.
-### AIPerf on Real Engines
-Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
-**Characteristics:**
- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM
-**DGDR Configuration:**
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      useAiConfigurator: false  # Default
-```
-### AI Configurator Simulation
-Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
-**Characteristics:**
- **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
-**DGDR Configuration:**
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      useAiConfigurator: true
-      aicSystem: h200_sxm          # GPU system type
-      aicHfId: Qwen/Qwen3-32B      # HuggingFace model ID
-      aicBackendVersion: "0.20.0"
-```
-**Supported Configurations:**
-For the current list of supported models, systems, and backend versions, see the [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features).
-To check from the command line: `aiconfigurator cli --help`
-**Currently supports:**
- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
-### Output Format
-After profiling, the DGDR status contains:
-1. **Recommended Configuration**: Optimal TP for prefill and decode
-2. **Performance Data**: Interpolation models for SLA planner
-3. **Generated DGD**: Complete deployment manifest
-**Example Recommendations:**
-```
-Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
-Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
-```
-#### Interactive Configuration Selection WebUI
-When running the profiler with `--pick-with-webui`, an interactive web interface is launched that allows you to visually explore profiling results and manually select configurations.
-**Features:**
- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
-**Selection Methods:**
-1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
-2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
-**Example DGD Config Output:**
-When you click "Show Config", you'll see a DynamoGraphDeployment configuration like:
-```yaml
-# DynamoGraphDeployment Configuration
-# Prefill: 1 GPU(s), TP=1
-# Decode: 4 GPU(s), TP=4
-# Model: Qwen/Qwen3-32B-FP8
-# Backend: trtllm
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-spec:
-  services:
-    PrefillWorker:
-      subComponentType: prefill
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --tensor-parallel-size=1
-    DecodeWorker:
-      subComponentType: decode
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --tensor-parallel-size=4
-```
-**Usage:**
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend trtllm \
-  --config path/to/disagg.yaml \
-  --pick-with-webui \
-  --use-ai-configurator \
-  --model Qwen/Qwen3-32B-FP8 \
-  --aic-system h200_sxm \
-  --ttft 200 --itl 15
-```
-Once you have selected a configuration, the full DynamoGraphDeployment CRD will be saved in your output folder as `config_with_planner.yaml`.
-The WebUI launches on port 8000 by default (configurable with `--webui-port`).
-#### Output Performance Plots
-The profiler will generate the following plots to better visualize the performance data:
-**Parallelization Mapping Sweep Plots:**
- `prefill_performance.png`: TTFT vs Parallelization Mapping size
- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests
-Note these two plots are based on the input ISL and OSL.
-**In-Depth Profiling for the Recommended P/D Engine Plots:**
- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL for the recommended prefill engine
- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL for the recommended prefill engine
- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length for the recommended decode engine
- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length for the recommended decode engine
-### Output Interpolation Data
-The profiler generates `.npz` files to store the performance data for the recommended P/D engine:
-**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
- `prefill_isl`: 1D array of input sequence lengths tested
- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
-**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
- `max_kv_tokens`: Total KV tokens capacity in decode engine
- `x_kv_usage`: 1D array of active KV usage fractions in [0, 1]
- `y_context_length`: 1D array of average context lengths tested
- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
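As a rough sketch of how the prefill data might be consumed, the snippet below loads the arrays named above and estimates TTFT at an unseen ISL. It assumes NumPy is installed; `load_prefill_curve` and `estimate_ttft` are hypothetical helpers, and piecewise-linear interpolation is only one possible choice, not necessarily what the SLA planner uses.

```python
import numpy as np


def load_prefill_curve(path):
    """Load prefill arrays from raw_data.npz and return them sorted by ISL."""
    with np.load(path) as data:
        isl = np.asarray(data["prefill_isl"], dtype=float)
        ttft = np.asarray(data["prefill_ttft"], dtype=float)
        thpt = np.asarray(data["prefill_thpt_per_gpu"], dtype=float)
    order = np.argsort(isl)  # np.interp expects increasing x values
    return isl[order], ttft[order], thpt[order]


def estimate_ttft(isl_samples, ttft_samples, isl):
    """Piecewise-linear TTFT estimate in ms.

    np.interp clamps to the end values outside the sampled ISL range,
    so estimates beyond the profiled range are not extrapolated.
    """
    return float(np.interp(isl, isl_samples, ttft_samples))


# Example usage (path as described above):
# isl, ttft, thpt = load_prefill_curve("selected_prefill_interpolation/raw_data.npz")
# print(estimate_ttft(isl, ttft, 3000))
```

The decode file can be read the same way, but it needs a two-variable interpolation over `x_kv_usage` and `y_context_length`.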
-## DGDR Configuration Reference
-This section provides detailed explanations of all DGDR `profilingConfig` options. The DGDR controller passes this configuration to the profiler script, which is defined in `benchmarks/profiler/utils/profiler_argparse.py`.
-### Configuration Structure
-All profiler configuration goes under `spec.profilingConfig.config`:
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: my-deployment
-spec:
-  model: "Qwen/Qwen3-0.6B"         # High-level: model to deploy
-  backend: vllm                    # High-level: inference backend
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Required
-    configMapRef:                  # Optional: base DGD config
-      name: my-config
-      key: disagg.yaml
-    config:                        # Profiler configuration
-      sla: { ... }
-      hardware: { ... }
-      sweep: { ... }               # AIC settings go here (aicSystem, aicHfId, etc.)
-      planner: { ... }
-  deploymentOverrides:             # Optional
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-```
-### SLA Configuration (Required)
-Define your performance requirements and workload characteristics:
-```yaml
-profilingConfig:
-  config:
-    sla:
-      isl: 3000      # Average input sequence length (tokens)
-      osl: 150       # Average output sequence length (tokens)
-      ttft: 200.0    # Target Time To First Token (milliseconds)
-      itl: 20.0      # Target Inter-Token Latency (milliseconds)
-```
-**What these control:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
- **Trade-offs**: Tighter SLAs require more GPU resources
-### Hardware Configuration (Optional)
-Control GPU search space and constraints:
-```yaml
-profilingConfig:
-  config:
-    hardware:
-      minNumGpusPerEngine: 2      # if not provided, will automatically determine based on model and VRAM size
-      maxNumGpusPerEngine: 8      # Maximum GPUs to test
-      numGpusPerNode: 8            # GPUs per node (for multi-node MoE)
-      gpuType: h200_sxm              # GPU type hint
-```
-**When to use:**
- **minNumGpusPerEngine**: Skip small TP sizes if your model is large
- **maxNumGpusPerEngine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **numGpusPerNode**: Determines the upper bound on the number of GPUs per node for dense models and configures Grove for multi-node MoE engines.
- **gpuType**: Informational, auto-detected by controller
-> [!TIP]
-> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
-### Sweep Configuration (Optional)
-Control profiling behavior:
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      useAiConfigurator: false              # Use offline profiling (default: false)
-      prefillInterpolationGranularity: 16   # Samples for prefill TTFT curve
-      decodeInterpolationGranularity: 6     # Samples for decode ITL curve
-```
-**Use cases:**
- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefillInterpolationGranularity**: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
- **decodeInterpolationGranularity**: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Because ITL is interpolated over two variables (KV usage and context length) and takes longer to run, the default uses fewer samples. Increasing this value might increase the profiling time quadratically.
-### AI Configurator Configuration (Required if `useAiConfigurator: true`)
-Configure AI Configurator profiling mode:
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      useAiConfigurator: true
-      aicSystem: h200_sxm              # GPU system: h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm
-      aicHfId: Qwen/Qwen3-32B         # Huggingface model id
-      aicBackendVersion: "0.20.0"     # TensorRT-LLM version: 0.20.0, 1.0.0rc3
-```
-**Supported configurations:** See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features)
-### Planner Configuration (Optional)
-Pass arguments to the SLA planner:
-```yaml
-profilingConfig:
-  config:
-    planner:
-      planner_min_endpoint: 2                    # Minimum endpoints to maintain
-      planner_adjustment_interval: 60            # Adjustment interval (seconds)
-      planner_load_predictor: linear             # Load prediction method
-```
-> [!NOTE]
-> Planner arguments use the `planner_` prefix. See the planner documentation for the full list.
-### Model Cache PVC (Advanced)
-For large models, you can use a pre-populated PVC containing model weights instead of downloading from HuggingFace. This is useful when:
- The model is not publicly available on HuggingFace
- You want to avoid repeated downloads during profiling
- You have a shared model cache across your cluster
-```yaml
-profilingConfig:
-  config:
-    deployment:
-      modelCache:
-        pvcName: "model-cache"                        # Name of PVC containing model weights (required)
-        pvcPath: "hub/models--deepseek-ai--DeepSeek-R1"  # Subpath within PVC (optional)
-        mountPath: "/opt/model-cache"                 # Mount path in container (optional, default: /opt/model-cache)
-```
-**Requirements:**
- The PVC must exist in the same namespace as the DGDR
- The model weights must be accessible at `{mountPath}/{pvcPath}`
-### Engine Configuration (Auto-configured)
-The controller automatically sets these from high-level fields:
-```yaml
-# You specify:
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-# Controller auto-injects into config:
-profilingConfig:
-  config:
-    deployment:
-      model: "Qwen/Qwen3-0.6B"       # From spec.model
-    engine:
-      backend: vllm                  # From spec.backend
-      config: /path/to/configmap     # From spec.profilingConfig.configMapRef (if provided)
-```
-**You should not manually set** `deployment.model` or `engine.backend` in `profilingConfig.config` - they are automatically injected from the high-level fields.
-### Complete Example: AIPerf on Real Engines
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: vllm-dense-online
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200.0
-        itl: 20.0
-      hardware:
-        minNumGpusPerEngine: 1
-        maxNumGpusPerEngine: 8
-      sweep:
-        useAiConfigurator: false
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-  autoApply: true
-```
-### Complete Example: AI Configurator Simulation
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: trtllm-aic-offline
-spec:
-  model: "Qwen/Qwen3-32B"
-  backend: trtllm
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
-    config:
-      sla:
-        isl: 4000
-        osl: 500
-        ttft: 300.0
-        itl: 10.0
-      sweep:
-        useAiConfigurator: true
-        aicSystem: h200_sxm
-        aicHfId: Qwen/Qwen3-32B
-        aicBackendVersion: "0.20.0"
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
-  autoApply: true
-```
-### Complete Example: MoE Model
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: sglang-moe
-spec:
-  model: "deepseek-ai/DeepSeek-R1"
-  backend: sglang
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-    config:
-      sla:
-        isl: 2048
-        osl: 512
-        ttft: 300.0
-        itl: 25.0
-      hardware:
-        numGpusPerNode: 8
-        maxNumGpusPerEngine: 32
-      engine:
-        isMoeModel: true       # Enable MoE profiling mode
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-  autoApply: true
-```
-## Troubleshooting
-### Profiling Takes Too Long
-**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only):
-```yaml
-sweep:
-  useAiConfigurator: true
-```
-**Solution 2**: Reduce search space:
-```yaml
-config:
-  hardware:
-    minNumGpusPerEngine: 4  # Skip TP1, TP2
-    maxNumGpusPerEngine: 8  # Don't test beyond TP8
-```
-### SLA Cannot Be Met
-**Symptoms**: Profiler reports no configuration meets targets
-**Solutions:**
-1. Relax SLA targets (increase TTFT/ITL)
-2. Add more GPU resources
-3. Try a different backend
-4. Use a smaller model
-### AI Configurator: Attention Head Constraint Error
-**Symptoms**: Profiling fails with error:
-```
-AssertionError: num_heads <N> should be divisible by tp_size <M> and the division result should be >= 4
-```
-**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
-**Affected Models:**
- **Qwen3-0.6B** (16 heads): Max TP = 4 ❌ Fails at TP=8
- **GPT-2** (12 heads): Max TP = 3
- Most models **<1B parameters**: May hit this constraint
-**Solution**: Limit `maxNumGpusPerEngine` in your DGDR:
-```yaml
-profilingConfig:
-  profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
-  config:
-    hardware:
-      maxNumGpusPerEngine: 4  # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
-    sweep:
-      useAiConfigurator: true
-      aicSystem: h200_sxm
-      aicHfId: Qwen/Qwen3-0.6B
-```
-**Calculate Max TP**: `max_tp = num_attention_heads / 4`
-> **Note**: This is an AI Configurator limitation. Online profiling doesn't have this constraint.
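The constraint in the error message can be checked ahead of time. The sketch below lists the TP sizes that satisfy both conditions (even division and at least 4 heads per GPU). `valid_tp_sizes` is a hypothetical helper, not part of AI Configurator or the profiler, and it does not say which of these sizes the profiler actually sweeps.

```python
def valid_tp_sizes(num_attention_heads, max_gpus, min_heads_per_gpu=4):
    """Return TP sizes in [1, max_gpus] that satisfy the stated constraint.

    A TP size is valid when it divides the head count evenly and leaves
    at least `min_heads_per_gpu` attention heads on each GPU.
    """
    return [
        tp
        for tp in range(1, max_gpus + 1)
        if num_attention_heads % tp == 0
        and num_attention_heads // tp >= min_heads_per_gpu
    ]


# Qwen3-0.6B has 16 heads: TP=8 would leave 2 heads per GPU and is rejected.
print(valid_tp_sizes(16, max_gpus=8))
# GPT-2 has 12 heads: the largest valid size is 3.
print(valid_tp_sizes(12, max_gpus=8))
```

Setting `maxNumGpusPerEngine` to the largest value in the returned list keeps the sweep inside the supported range.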
-### Image Pull Errors
-**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
-**Solution**: Ensure image pull secrets are configured:
-```bash
-kubectl create secret docker-registry nvcr-imagepullsecret \
-  --docker-server=nvcr.io \
-  --docker-username='$oauthtoken' \
-  --docker-password=<NGC_API_KEY> \
-  --namespace <your-namespace>
-```
-### Out of Memory During Profiling
-**Symptoms**: OOM errors in profiling jobs
-**Solutions:**
-1. Reduce `gpu_memory_utilization` in engine config
-2. Reduce `--max-context-length`
-3. Skip larger TP configurations
-4. Use fewer GPUs per test
-### Unsupported Parallelization Mapping in Backend
-**Symptoms**: Startup or runtime error in the backend. For example, a prime number of attention heads restricts the TP size to 1 (e.g., falcon-7b with 71 attention heads), or the backend does not support different TP sizes for prefill and decode.
-**Solutions:**
-1. Ask the backend maintainers to add support for the use case, then bump the backend version in Dynamo.
-2. Restrict the minimum and maximum number of GPUs per engine to the supported range.
-## Next Steps
- **Deploy with DGDR**: See [Quick Start Guide](/docs/planner/sla_planner_quickstart.md)
- **Understand SLA Planner**: Read [SLA Planner Deep Dive](/docs/planner/sla_planner.md)
- **Monitor Deployments**: Set up [Observability](/docs/kubernetes/observability/metrics.md)
- **Optimize Performance**: See [Performance Tuning](/docs/performance/tuning.md)
-## Related Documentation
- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md)
- [SLA Planner Architecture](/docs/planner/sla_planner.md)
- [Profiler Arguments Reference](/benchmarks/profiler/utils/profiler_argparse.py)
--- a/docs/components/frontend/README.md
+++ b/docs/components/frontend/README.md
@@ -78,4 +78,4 @@ See the [Frontend Guide](frontend_guide.md) for full configuration options.
 | Document | Description |
 |----------|-------------|
 | [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration |
-| [Router Documentation](../../router/README.md) | KV-aware routing configuration |
+| [Router Documentation](../router/README.md) | KV-aware routing configuration |