docs: cleanup of docs refactor for components, integrations, and features (#6019)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Commit b19de4ed (unverified), authored Feb 05, 2026 by dagil-nvidia, committed by GitHub Feb 05, 2026
20 changed files
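
Most hunks in this commit repoint Markdown links from the old docs layout to `docs/components/` and `docs/features/`. The sketch below shows how such a rewrite could be scripted. It is not the tooling used for this commit: `PATH_MOVES` and `rewrite_links` are hypothetical names, and the mapping holds only a few of the renames visible in the diff. It matches only targets that contain the full `docs/...` path, so links written relative to `docs/` itself (for example `../../router/README.md`) and renamed anchors are left untouched.

```python
import re

# Old -> new documentation paths, copied from renames visible in this diff.
# Illustrative and deliberately incomplete.
PATH_MOVES = {
    "docs/router/": "docs/components/router/",
    "docs/kvbm/": "docs/components/kvbm/",
    "docs/planner/sla_planner.md": "docs/components/planner/planner_guide.md",
    "docs/multimodal/index.md": "docs/features/multimodal/README.md",
}

# Matches the target of an inline Markdown link: "](target)" or "](target#anchor)".
LINK_TARGET = re.compile(r"\]\(([^)\s]+)\)")


def rewrite_links(markdown, moves=PATH_MOVES):
    """Return `markdown` with moved doc paths updated inside link targets."""

    def fix(match):
        target = match.group(1)
        # Try longer (more specific) old paths first; apply at most one move.
        for old in sorted(moves, key=len, reverse=True):
            if old in target:
                target = target.replace(old, moves[old], 1)
                break
        return "](" + target + ")"

    return LINK_TARGET.sub(fix, markdown)
```

Because none of the new paths contains an old path as a substring, running the function twice on the same text gives the same result as running it once.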
--- a/README.md
+++ b/README.md
@@ -52,10 +52,10 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
 |---|:----:|:----------:|:--:|
 | **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
 | [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
-| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ |
+| [**KV-Aware Routing**](docs/components/router/README.md) | ✅ | ✅ | ✅ |
-| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
+| [**SLA-Based Planner**](docs/components/planner/planner_guide.md) | ✅ | ✅ | ✅ |
-| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
+| [**KVBM**](docs/components/kvbm/README.md) | 🚧 | ✅ | ✅ |
-| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
+| [**Multimodal**](docs/features/multimodal/README.md) | ✅ | ✅ | ✅ |
 | [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |
 > **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
@@ -347,7 +347,7 @@ python3 -m dynamo.frontend
 Dynamo provides comprehensive benchmarking tools:
 - **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
+- **[SLA-Driven Deployments](docs/components/planner/planner_guide.md)** – Optimize deployments to meet SLA requirements
 ## Frontend OpenAPI Specification
@@ -357,7 +357,7 @@ The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To
 cargo run -p dynamo-llm --bin generate-frontend-openapi
 ```
-This writes to `docs/frontends/openapi.json`.
+This writes to `docs/reference/api/openapi.json`.
 ## Service Discovery and Messaging
@@ -388,9 +388,9 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL
 <!-- Reference links for Feature Compatibility Matrix -->
 [disagg]: docs/design_docs/disagg_serving.md
-[kv-routing]: docs/router/README.md
+[kv-routing]: docs/components/router/README.md
-[planner]: docs/planner/sla_planner.md
+[planner]: docs/components/planner/planner_guide.md
-[kvbm]: docs/kvbm/README.md
+[kvbm]: docs/components/kvbm/README.md
 [mm]: examples/multimodal/
 [migration]: docs/fault_tolerance/request_migration.md
 [lora]: examples/backends/vllm/deploy/lora/README.md

--- a/benchmarks/profiler/README.md
+++ b/benchmarks/profiler/README.md
-../../docs/benchmarks/sla_driven_profiling.md
\ No newline at end of file
--- a/benchmarks/profiler/README.md
+++ b/benchmarks/profiler/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Profiler
+Documentation for the Dynamo Profiler has moved to [docs/components/profiler/](../../docs/components/profiler/README.md).
+- [Profiler Overview](../../docs/components/profiler/README.md)
+- [Profiler Guide](../../docs/components/profiler/profiler_guide.md)
+- [Profiler Examples](../../docs/components/profiler/profiler_examples.md)
--- a/benchmarks/profiler/webui/utils.py
+++ b/benchmarks/profiler/webui/utils.py
@@ -620,7 +620,7 @@ def create_gradio_interface(
            > 📝 **Note:** The dotted red line in the prefill and decode charts are default TTFT and ITL SLAs if not specified.
-            > ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md#2-correction-factor-calculation) to adjust dynamically at runtime.
+            > ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/design_docs/planner_design.md#step-2-correction-factor-calculation) to adjust dynamically at runtime.
            > 💡 **Tip:** Use the GPU cost checkbox and input in the charts section to convert GPU hours to cost.
            """

--- a/benchmarks/router/README.md
+++ b/benchmarks/router/README.md
@@ -127,7 +127,7 @@ To see all available router arguments, run:
 python -m dynamo.frontend --help
 ```
-For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md).
+For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/components/router/router_guide.md).
 > [!Note]
 > If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
@@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
 - Uses the same routing mode as the frontend's `--router-mode` setting
 - Seamlessly integrates with your decode workers for token generation
-No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details.
+No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/components/router/router_guide.md#disaggregated-serving) for more details.
 > [!Note]
 > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)

--- a/components/src/dynamo/mocker/README.md
+++ b/components/src/dynamo/mocker/README.md
@@ -60,7 +60,7 @@ python -m dynamo.mocker \
 The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.
-To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/benchmarks/sla_driven_profiling.md) for details):
+To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler_guide.md) for details):
 ```bash
 python benchmarks/profiler/profile_sla.py \

--- a/components/src/dynamo/planner/README.md
+++ b/components/src/dynamo/planner/README.md
@@ -19,5 +19,5 @@ limitations under the License.
 SLA-driven autoscaling controller for Dynamo inference graphs.
- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
+- **User docs**: [docs/planner/](/docs/components/planner/) (deployment, configuration, examples)
 - **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
--- a/components/src/dynamo/planner/utils/perf_interpolation.py
+++ b/components/src/dynamo/planner/utils/perf_interpolation.py
@@ -29,7 +29,7 @@ logger = logging.getLogger(__name__)
 MISSING_PROFILING_DATA_ERROR_MESSAGE = (
    "SLA-Planner requires pre-deployment profiling results to run.\n"
-    "Please follow /docs/benchmarks/sla_driven_profiling.md to run the profiling first,\n"
+    "Please follow /docs/components/profiler/profiler_guide.md to run the profiling first,\n"
    "and make sure the profiling results are present in --profile-results-dir."
 )

--- a/components/src/dynamo/router/README.md
+++ b/components/src/dynamo/router/README.md
@@ -3,7 +3,7 @@
 # Standalone Router
-A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md).
+A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/components/router/router_guide.md).
 ## Overview
@@ -29,7 +29,7 @@ python -m dynamo.router \
 - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
 **Router Configuration:**
-For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md).
+For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/components/router/router_guide.md).
 ## Architecture
@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
 ## Example: Manual Disaggregated Serving (Alternative Setup)
 > [!Note]
-> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup.
+> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/components/router/router_guide.md#disaggregated-serving) for the default setup.
 >
 > Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
@@ -103,7 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
 ## See Also
- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing
+- [Router Guide](/docs/components/router/router_guide.md) - Configuration and tuning for KV-aware routing
 - [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes
 - [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
 - [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -220,7 +220,7 @@ Common Vars for Routing Configuration:
  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
-  - See the [Router Guide](../../docs/router/router_guide.md) for details.
+  - See the [Router Guide](../../docs/components/router/router_guide.md) for details.
 Stand-Alone installation only:

--- a/deploy/utils/README.md
+++ b/deploy/utils/README.md
@@ -145,7 +145,7 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE
 For complete benchmarking and profiling workflows:
 - **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
- **Pre-Deployment Profiling**: See [docs/benchmarks/sla_driven_profiling.md](../../docs/benchmarks/sla_driven_profiling.md) for optimizing configurations before deployment
+- **Pre-Deployment Profiling**: See [docs/components/profiler/profiler_guide.md](../../docs/components/profiler/profiler_guide.md) for optimizing configurations before deployment
 ## Notes

--- a/docs/_sections/frontends.rst
+++ b/docs/_sections/frontends.rst
-Frontends
-=========
-.. toctree::
-   :maxdepth: 1
-   Frontend Overview <../components/frontend/README.md>
-   Frontend Guide <../components/frontend/frontend_guide.md>
-   KServe (deprecated) <../frontends/kserve.md>
\ No newline at end of file
--- a/docs/api/nixl_connect/README.md
+++ b/docs/api/nixl_connect/README.md
@@ -103,7 +103,7 @@ flowchart LR
 ### Multimodal Example
-In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):
+In the case of the [Dynamo Multimodal Disaggregated Example](../../features/multimodal/multimodal_vllm.md):
 1. The HTTP frontend accepts a text prompt and a URL to an image.

--- a/docs/backends/sglang/README.md
+++ b/docs/backends/sglang/README.md
@@ -36,10 +36,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------|-------|
 | [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../components/planner/planner_guide.md) | ✅ |  |
-| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ |  |
+| [**Multimodal Support**](../../features/multimodal/multimodal_sglang.md) | ✅ |  |
-| [**KVBM**](../../kvbm/README.md) | ❌ | Planned |
+| [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
 ## Dynamo SGLang Integration

--- a/docs/backends/sglang/sgl-hicache-example.md
+++ b/docs/backends/sglang/sgl-hicache-example.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-# Enable SGLang Hierarchical Cache (HiCache)
-This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
-## 1) Start the SGLang worker with HiCache enabled
-```bash
-python -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --host 0.0.0.0 --port 8000 \
-  --page-size 64 \
-  --enable-hierarchical-cache \
-  --hicache-ratio 2 \
-  --hicache-write-policy write_through \
-  --hicache-storage-backend nixl \
-  --log-level debug \
-  --skip-tokenizer-init
-```
- **--enable-hierarchical-cache**: Enables hierarchical KV cache/offload
- **--hicache-ratio**: The ratio of the size of host KV cache memory pool to the size of device pool. Lower this number if your machine has less CPU memory.
- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)
-Then, start the frontend:
-```bash
-python -m dynamo.frontend --http-port 8000
-```
-## 2) Send a single request
-```bash
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-      {
-        "role": "user",
-        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
-      }
-    ],
-    "stream": false,
-    "max_tokens": 30
-  }'
-```
-## 3) (Optional) Benchmarking
-Run the perf script:
-```bash
-bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
-  --model Qwen/Qwen3-0.6B \
-  --tensor-parallelism 1 \
-  --data-parallelism 1 \
-  --concurrency "2,4,8" \
-  --input-sequence-length 2048 \
-  --output-sequence-length 256
-```
--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -55,10 +55,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ |  |
-| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
+| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | Planned |
-| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
+| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |
 ### Large Scale P/D and WideEP Features
@@ -114,7 +114,7 @@ apt-get update && apt-get -y install git git-lfs
 > [!IMPORTANT]
 > Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
-For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md).
+For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router_guide.md).
 ### Aggregated
 ```bash
@@ -231,7 +231,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t
 ## Multimodal support
-Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal_trtllm.md).
 ## Logits Processing
@@ -327,7 +327,7 @@ For detailed instructions on running comprehensive performance sweeps across bot
 Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
-Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
+Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
 ## Known Issues and Mitigations

--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -37,10 +37,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ |  |
-| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
+| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | WIP |
-| [**KVBM**](../../../docs/kvbm/README.md) | ✅ |  |
+| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ |  |
 | [**LMCache**](../../integrations/lmcache_integration.md) | ✅ |  |
 | [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
@@ -144,7 +144,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
 Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
 This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.
-**Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md)
+**Guide:** [Speculative Decoding Quickstart](../../features/speculative_decoding/speculative_decoding_vllm.md)
 > **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.

--- a/docs/backends/vllm/speculative_decoding.md
+++ b/docs/backends/vllm/speculative_decoding.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-> **Note**: This content has moved to [Speculative Decoding with vLLM](../../features/speculative_decoding/speculative_decoding_vllm.md).
-> See [Speculative Decoding Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
-> This file will be removed in a future release.
-# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
-This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
-Since the model is only **8B parameters**, you can run it on **any GPU with at least 16GB VRAM**.
-## Step 1: Set Up Your Docker Environment
-First, we’ll initialize a Docker container using the VLLM backend.
-You can refer to the [VLLM Quickstart Guide](./README.md#vllm-quick-start) — or follow the full steps below.
-### 1. Launch Docker Compose
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-### 2. Build the Container
-```bash
-./container/build.sh --framework VLLM
-```
-### 3. Run the Container
-```bash
-./container/run.sh -it --framework VLLM --mount-workspace
-```
-## Step 2: Get Access to the Llama-3 Model
-The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
-Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
-Approval usually takes around **5 minutes**.
-Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:
-```bash
-export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
-export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
-```
-## Step 3: Run Aggregated Speculative Decoding
-Now that your environment is ready, start the aggregated server with **speculative decoding**.
-```bash
-# Requires only one GPU
-cd examples/backends/vllm
-bash launch/agg_spec_decoding.sh
-```
-Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
-## Step 4: Example Request
-To verify your setup, try sending a simple prompt to your model:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
-     "messages": [
-       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
-     ],
-     "max_tokens": 250
-   }'
-```
-### Example Output
-```json
-{
-  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
-  "choices": [
-    {
-      "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
-      "index": 0,
-      "finish_reason": "stop"
-    }
-  ],
-  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
-  "usage": {
-    "prompt_tokens": 16,
-    "completion_tokens": 250,
-    "total_tokens": 266
-  }
-}
-```
-## Additional Resources
-* [VLLM Quickstart](./README.md#vllm-quick-start)
-* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
\ No newline at end of file
--- a/docs/benchmarks/sla_driven_profiling.md
+++ b/docs/benchmarks/sla_driven_profiling.md
--- a/docs/components/frontend/README.md
+++ b/docs/components/frontend/README.md
@@ -78,4 +78,4 @@ See the [Frontend Guide](frontend_guide.md) for full configuration options.
 | Document | Description |
 |----------|-------------|
 | [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration |
-| [Router Documentation](../../router/README.md) | KV-aware routing configuration |
+| [Router Documentation](../router/README.md) | KV-aware routing configuration |