speculative-decoding-vllm.md 3.31 KB
Newer Older
1
2
3
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
4
title: Speculative Decoding with vLLM
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
---

# Speculative Decoding with vLLM

Using Speculative Decoding with the vLLM backend.

> **See also**: [Speculative Decoding Overview](./README.md) for cross-backend documentation.

## Prerequisites

- vLLM container with Eagle3 support
- GPU with at least 16GB VRAM
- Hugging Face access token (for gated models)

## Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3

This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** with **Eagle3** speculative decoding on a single node.

### Step 1: Set Up Your Docker Environment

First, initialize a Docker container using the vLLM backend. See the [vLLM Quickstart Guide](../../backends/vllm/README.md#vllm-quick-start) for details.

```bash
# Launch infrastructure services
docker compose -f deploy/docker-compose.yml up -d

# Build the container
./container/build.sh --framework VLLM

# Run the container
./container/run.sh -it --framework VLLM --mount-workspace
```

### Step 2: Get Access to the Llama-3 Model

The **Meta-Llama-3.1-8B-Instruct** model is gated. Request access on Hugging Face:
[Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

Approval time varies depending on Hugging Face review traffic.

Once approved, set your access token inside the container:

```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```

### Step 3: Run Aggregated Speculative Decoding

```bash
# Requires only one GPU
cd examples/backends/vllm
bash launch/agg_spec_decoding.sh
```

Once the weights finish downloading, the server will be ready for inference requests.

### Step 4: Test the Deployment

```bash
curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
     "messages": [
       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
     ],
     "max_tokens": 250
   }'
```

### Example Output

```json
{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "In cherry blossom's gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
```

## Configuration

Speculative decoding in vLLM uses Eagle3 as the draft model. The launch script configures:

- Target model: `meta-llama/Meta-Llama-3.1-8B-Instruct`
- Draft model: Eagle3 variant
- Aggregated serving mode

See `examples/backends/vllm/launch/agg_spec_decoding.sh` for the full configuration.

## Limitations

- Currently only supports Eagle3 as the draft model
- Requires compatible model architectures between target and draft

## See Also

| Document | Path |
|----------|------|
| Speculative Decoding Overview | [README.md](./README.md) |
| vLLM Backend Guide | [vLLM README](../../backends/vllm/README.md) |
| Meta-Llama-3.1-8B-Instruct | [Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |