docs: Guide for Speculative Decoding in VLLM using Eagle3 and Meta-Llama-3.1-8B-Instruct (#3895)

Signed-off-by: DilreetRaju <dilreetraju@gmail.com>

Unverified Commit f315374f authored Dec 05, 2025 by Dilreet Raju Committed by GitHub Dec 05, 2025
4 changed files
--- a/components/backends/vllm/launch/agg_spec_decoding.sh
+++ b/components/backends/vllm/launch/agg_spec_decoding.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+
+
+# ---------------------------
+# 1. Frontend (Ingress)
+# ---------------------------
+python -m dynamo.frontend --http-port=8000 &
+
+
+# ---------------------------
+# 2. Speculative Main Worker
+# ---------------------------
+# This runs the main model with EAGLE as the draft model for speculative decoding
+DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
+CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \
+    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --enforce-eager \
+    --speculative_config '{
+        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
+        "draft_tensor_parallel_size": 1,
+        "num_speculative_tokens": 2,
+        "method": "eagle"
+    }' \
+    --connector none \
+    --gpu-memory-utilization 0.8
\ No newline at end of file
--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -165,6 +165,13 @@ bash launch/dep.sh

 Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

+### Speculative Decoding with Aggregated Serving (Meta-Llama-3.1-8B-Instruct + Eagle3)
+
+Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as the draft model using **aggregated speculative decoding** on a single node.
+This setup shows how to use Dynamo to launch a **vLLM aggregated serving** instance with Eagle3-based speculative decoding, which speeds up inference while maintaining accuracy.
+
+**Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md)
+
 ### Kubernetes Deployment

 For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../examples/backends/vllm/deploy/README.md)

--- a/docs/backends/vllm/speculative_decoding.md
+++ b/docs/backends/vllm/speculative_decoding.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
+
+This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
+Because the model has only **8B parameters**, you can run it on **a single GPU with at least 16GB of VRAM**.
+
+
+
+## Step 1: Set Up Your Docker Environment
+
+First, initialize a Docker container with the vLLM backend.
+You can follow the [vLLM Quickstart Guide](./README.md#vllm-quick-start), or use the full steps below.
+
+### 1. Launch Docker Compose
+
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+
+### 2. Build the Container
+
+```bash
+./container/build.sh --framework VLLM
+```
+
+### 3. Run the Container
+
+```bash
+./container/run.sh -it --framework VLLM --mount-workspace
+```
+
+
+
+## Step 2: Get Access to the Llama 3.1 Model
+
+The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
+Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
+Approval usually takes around **5 minutes**, but it can take longer.
+
+Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:
+
+```bash
+export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
+export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"
+```
+
+
+
+## Step 3: Run Aggregated Speculative Decoding
+
+Now that your environment is ready, start the aggregated server with **speculative decoding**.
+
+```bash
+# Requires only one GPU
+cd components/backends/vllm
+bash launch/agg_spec_decoding.sh
+```
+
+Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
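+
+For reference, `launch/agg_spec_decoding.sh` passes the configuration below to the worker through `--speculative_config`. The values are copied from that script; the field descriptions that follow reflect how these options are generally understood in vLLM and are worth checking against the vLLM version in your container.
+
+```json
+{
+  "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
+  "draft_tensor_parallel_size": 1,
+  "num_speculative_tokens": 2,
+  "method": "eagle"
+}
+```
+
+Here `model` names the draft model, `num_speculative_tokens` is how many tokens the draft model proposes per decoding step before the main model verifies them, and `draft_tensor_parallel_size` is the tensor-parallel degree used for the draft model.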
+
+
+
+
+## Step 4: Example Request
+
+To verify your setup, try sending a simple prompt to your model:
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+     "messages": [
+       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
+     ],
+     "max_tokens": 250
+   }'
+```
+
+### Example Output
+
+```json
+{
+  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
+  "choices": [
+    {
+      "message": {"role": "assistant", "content": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."},
+      "index": 0,
+      "finish_reason": "stop"
+    }
+  ],
+  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "usage": {
+    "prompt_tokens": 16,
+    "completion_tokens": 250,
+    "total_tokens": 266
+  }
+}
+```
+
+
+
+## Additional Resources
+
+* [vLLM Quickstart](./README.md#vllm-quick-start)
+* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
\ No newline at end of file
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -77,6 +77,7 @@
   backends/vllm/multi-node.md
   backends/vllm/multimodal.md
   backends/vllm/prometheus.md
+   backends/vllm/speculative_decoding.md

   benchmarks/kv-router-ab-testing.md