docs: migrate Speculative Decoding docs to three-tier structure (#6001)

Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit 8aa7335e authored Feb 05, 2026 by dagil-nvidia Committed by GitHub Feb 05, 2026
6 changed files
--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -146,6 +146,8 @@ This setup demonstrates how to use Dynamo to create an instance using Eagle-base
 **Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md)
+> **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
 ### Kubernetes Deployment
 For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../examples/backends/vllm/deploy/README.md)

--- a/docs/backends/vllm/speculative_decoding.md
+++ b/docs/backends/vllm/speculative_decoding.md
@@ -14,6 +14,11 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
+> **Note**: This content has moved to [Speculative Decoding with vLLM](../../features/speculative_decoding/speculative_decoding_vllm.md).
+> See [Speculative Decoding Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
+> This file will be removed in a future release.
 # Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
 This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.

--- a/docs/conf.py
+++ b/docs/conf.py
@@ -95,6 +95,8 @@ redirects = {
    "backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
    "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html",
    "multimodal/multimodal_intro": "index.html",
+    # Speculative decoding consolidation (PR speculative-migration)
+    "backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html",
 }
 # Custom extensions

--- a/docs/features/speculative_decoding/README.md
+++ b/docs/features/speculative_decoding/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Speculative Decoding
+Speculative decoding is an optimization technique that uses a smaller "draft" model to predict multiple tokens, which are then verified by the main model in parallel. This can significantly reduce latency for autoregressive generation.
+## Backend Support
+| Backend | Status | Notes |
+|---------|--------|-------|
+| vLLM | ✅ | Eagle3 draft model support |
+| SGLang | 🚧 | Not yet documented |
+| TensorRT-LLM | 🚧 | Not yet documented |
+## Overview
+Speculative decoding works by:
+1. **Draft phase**: A smaller, faster model generates candidate tokens
+2. **Verify phase**: The main model verifies these candidates in a single forward pass
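+3. **Accept/reject**: Draft tokens are accepted up to the first one that differs from what the main model would have generated; from that point the main model's own token is used and the remaining draft tokens are discarded
+This approach trades additional compute for lower latency: when draft tokens are accepted, the main model emits several tokens per forward pass instead of one.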
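+The draft, verify, and accept/reject steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the vLLM or Eagle3 implementation: `target` and `draft` are hypothetical stand-ins for the two models, verification is written as a loop rather than a single batched forward pass, and acceptance uses exact greedy matching.
+```python
+from typing import Callable, List
+
+# Hypothetical model interface: maps a token prefix to the next greedy token.
+NextToken = Callable[[List[int]], int]
+
+def speculative_step(target: NextToken, draft: NextToken,
+                     prefix: List[int], k: int) -> List[int]:
+    """Return the tokens emitted by one draft/verify round (toy sketch)."""
+    # Draft phase: the small model proposes k candidate tokens.
+    ctx, proposed = list(prefix), []
+    for _ in range(k):
+        proposed.append(draft(ctx))
+        ctx.append(proposed[-1])
+    # Verify phase: a real engine scores all positions in one forward pass;
+    # this sketch calls the target once per position for clarity.
+    ctx, emitted = list(prefix), []
+    for tok in proposed:
+        expected = target(ctx)
+        if tok != expected:
+            emitted.append(expected)  # reject: keep the target's token, stop
+            return emitted
+        emitted.append(tok)           # accept: draft matched the target
+        ctx.append(tok)
+    emitted.append(target(ctx))       # all accepted: target adds one more
+    return emitted
+
+# Toy models (assumptions): the target counts upward mod 10; the draft agrees
+# except after token 3, where it guesses wrong.
+def target(ctx: List[int]) -> int:
+    return (ctx[-1] + 1) % 10
+
+def draft(ctx: List[int]) -> int:
+    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
+
+print(speculative_step(target, draft, [0], k=4))  # [1, 2, 3, 4]
+print(speculative_step(target, draft, [5], k=2))  # [6, 7, 8]
+```
+Because every emitted token is either confirmed or produced by the target model, the output matches what the target model would generate on its own; the draft model only affects how many tokens each round yields.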
+## Quick Start (vLLM + Eagle3)
+This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** with **Eagle3** speculative decoding on a single GPU with at least 16 GB of VRAM.
+### Prerequisites
+1. Start infrastructure services:
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+2. Build and run the vLLM container:
+```bash
+./container/build.sh --framework VLLM
+./container/run.sh -it --framework VLLM --mount-workspace
+```
+3. Set up Hugging Face access (Meta-Llama-3.1-8B-Instruct is gated):
+```bash
+export HUGGING_FACE_HUB_TOKEN="your_token_here"
+export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"
+```
+### Run Speculative Decoding
+```bash
+cd examples/backends/vllm
+bash launch/agg_spec_decoding.sh
+```
+### Test the Deployment
+```bash
+curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+     "messages": [
+       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
+     ],
+     "max_tokens": 250
+   }'
+```
+## Backend-Specific Guides
+| Backend | Guide |
+|---------|-------|
+| vLLM | [speculative_decoding_vllm.md](./speculative_decoding_vllm.md) |
+## See Also
+- [vLLM Backend](../../backends/vllm/README.md) - Full vLLM deployment guide
+- [Disaggregated Serving](../../design_docs/disagg_serving.md) - Alternative optimization approach
+- [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
--- a/docs/features/speculative_decoding/speculative_decoding_vllm.md
+++ b/docs/features/speculative_decoding/speculative_decoding_vllm.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Speculative Decoding with vLLM
+This guide covers speculative decoding with the vLLM backend.
+> **See also**: [Speculative Decoding Overview](./README.md) for cross-backend documentation.
+## Prerequisites
+- vLLM container with Eagle3 support
+- GPU with at least 16 GB of VRAM
+- Hugging Face access token (for gated models)
+## Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3
+This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** with **Eagle3** speculative decoding on a single node.
+### Step 1: Set Up Your Docker Environment
+First, initialize a Docker container using the vLLM backend. See the [vLLM Quickstart Guide](../../backends/vllm/README.md#vllm-quick-start) for details.
+```bash
+# Launch infrastructure services
+docker compose -f deploy/docker-compose.yml up -d
+# Build the container
+./container/build.sh --framework VLLM
+# Run the container
+./container/run.sh -it --framework VLLM --mount-workspace
+```
+### Step 2: Get Access to the Llama 3.1 Model
+The **Meta-Llama-3.1-8B-Instruct** model is gated. Request access on Hugging Face:
+[Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
+Approval time varies depending on Hugging Face review traffic.
+Once approved, set your access token inside the container:
+```bash
+export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
+export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"
+```
+### Step 3: Run Aggregated Speculative Decoding
+```bash
+# Requires only one GPU
+cd examples/backends/vllm
+bash launch/agg_spec_decoding.sh
+```
+Once the weights finish downloading and the models load, the server is ready for inference requests.
+### Step 4: Test the Deployment
+```bash
+curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+     "messages": [
+       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
+     ],
+     "max_tokens": 250
+   }'
+```
+### Example Output
+```json
+{
+  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
+  "choices": [
+    {
+      "message": {
+        "role": "assistant",
+        "content": "In cherry blossom's gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
+      },
+      "index": 0,
+      "finish_reason": "stop"
+    }
+  ],
+  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "usage": {
+    "prompt_tokens": 16,
+    "completion_tokens": 250,
+    "total_tokens": 266
+  }
+}
+```
+## Configuration
+This example uses Eagle3 as the draft model for speculative decoding in vLLM. The launch script configures:
+- Target model: `meta-llama/Meta-Llama-3.1-8B-Instruct`
+- Draft model: Eagle3 variant
+- Aggregated serving mode
+See `examples/backends/vllm/launch/agg_spec_decoding.sh` for the full configuration.
+## Limitations
+- Only Eagle3 is currently supported as the draft model
+- The target and draft models must have compatible architectures
+## See Also
+| Document | Link |
+|----------|------|
+| Speculative Decoding Overview | [README.md](./README.md) |
+| vLLM Backend Guide | [vLLM README](../../backends/vllm/README.md) |
+| Meta-Llama-3.1-8B-Instruct | [Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -83,6 +83,9 @@
   backends/vllm/prompt-embeddings.md
   backends/vllm/speculative_decoding.md
+   features/speculative_decoding/README.md
+   features/speculative_decoding/speculative_decoding_vllm.md
   benchmarks/kv-router-ab-testing.md
   mocker/mocker.md