Commit 973f5dc5 (unverified), authored Jan 07, 2025 by sroy745; committed by GitHub, Jan 07, 2025

[Doc]Add documentation for using EAGLE in vLLM (#11417)

Signed-off-by: Sourashis Roy <sroy@roblox.com>

parent c994223d

Showing 1 changed file with 66 additions and 0 deletions

docs/source/features/spec_decode.md

@@ -159,6 +159,72 @@ A variety of speculative models of this type are available on HF hub:

- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)

## Speculating using EAGLE based draft models

The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    speculative_model="path/to/modified/eagle/model",
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
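
To make "proposals are generated by a draft model" concrete, here is a minimal sketch of one greedy propose-and-verify step. It is a toy under stated assumptions, not how vLLM or EAGLE implement speculation: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, the "models" are plain Python functions over integer tokens, and a real engine scores all proposals in a single batched forward pass of the target model.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    num_speculative_tokens: int,
) -> List[Token]:
    """Run one greedy propose-and-verify step and return the emitted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposals: List[Token] = []
    for _ in range(num_speculative_tokens):
        proposals.append(draft_next(context + proposals))

    # 2. The target model checks the proposals in order.
    accepted: List[Token] = []
    for proposed in proposals:
        expected = target_next(context + accepted)
        if proposed != expected:
            # First mismatch: emit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy stand-ins (assumptions for illustration only): the target counts up by
# one, and the draft agrees with it except right after token 3.
def target_next(tokens: List[Token]) -> Token:
    return tokens[-1] + 1


def draft_next(tokens: List[Token]) -> Token:
    return tokens[-1] + 1 if tokens[-1] != 3 else 0


print(speculative_step([1], draft_next, target_next, 4))  # [2, 3, 4]
```

In this greedy setting the emitted tokens are exactly what the target model would have produced on its own; the draft model only affects how many tokens are emitted per step.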

A few important things to consider when using the EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
   used directly with vLLM due to differences in the expected layer names and model definition.
   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
   to convert them. Note that this script does not modify the model's weights.

   In the above example, use the script to first convert
   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
   and then use the converted checkpoint as the draft model in vLLM.

2. The EAGLE based draft models need to be run without tensor parallelism
   (i.e. `speculative_draft_tensor_parallel_size` is set to 1), although
   it is possible to run the main model using tensor parallelism (see example above).

3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
   reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).

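
The tensor-parallelism constraint from the list above can be encoded once so it is not forgotten at each call site. The following is a minimal sketch: `build_eagle_llm_kwargs` is a hypothetical helper, not part of vLLM, and it only assembles the keyword arguments already used in the example earlier in this section.

```python
from typing import Any, Dict


def build_eagle_llm_kwargs(
    target_model: str,
    converted_eagle_path: str,
    target_tp_size: int = 1,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble LLM keyword arguments for an EAGLE draft.

    `converted_eagle_path` is assumed to point at a checkpoint that has
    already been converted for vLLM.
    """
    if target_tp_size < 1:
        raise ValueError("target_tp_size must be a positive integer")
    return {
        "model": target_model,
        "tensor_parallel_size": target_tp_size,
        "speculative_model": converted_eagle_path,
        # EAGLE draft models must run without tensor parallelism.
        "speculative_draft_tensor_parallel_size": 1,
    }


kwargs = build_eagle_llm_kwargs(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "path/to/modified/eagle/model",
    target_tp_size=4,
)
print(kwargs)
```

The resulting dictionary would then be passed as `LLM(**kwargs)`, which is equivalent to the constructor call in the example above.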
A variety of EAGLE draft models are available on the Hugging Face hub:
| Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
|---------------------------------------------------------------------|-------------------------------------------|--------------------|
| Vicuna-7B-v1.3 | yuhuili/EAGLE-Vicuna-7B-v1.3 | 0.24B |
| Vicuna-13B-v1.3 | yuhuili/EAGLE-Vicuna-13B-v1.3 | 0.37B |
| Vicuna-33B-v1.3 | yuhuili/EAGLE-Vicuna-33B-v1.3 | 0.56B |
| LLaMA2-Chat 7B | yuhuili/EAGLE-llama2-chat-7B | 0.24B |
| LLaMA2-Chat 13B | yuhuili/EAGLE-llama2-chat-13B | 0.37B |
| LLaMA2-Chat 70B | yuhuili/EAGLE-llama2-chat-70B | 0.99B |
| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B |
| LLaMA3-Instruct 8B | yuhuili/EAGLE-LLaMA3-Instruct-8B | 0.25B |
| LLaMA3-Instruct 70B | yuhuili/EAGLE-LLaMA3-Instruct-70B | 0.99B |
| Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
| Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |
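
For scripting, the table above can be mirrored as a lookup from base model to EAGLE draft repository. This is only an illustrative sketch: `EAGLE_DRAFTS` and `find_eagle_draft` are hypothetical names, the keys are the base-model labels exactly as written in the table (not necessarily Hugging Face repository IDs), and the returned checkpoints still need to be converted before vLLM can load them.

```python
from typing import Dict, Optional

# Base model label (as written in the table) -> EAGLE draft repository.
EAGLE_DRAFTS: Dict[str, str] = {
    "Vicuna-7B-v1.3": "yuhuili/EAGLE-Vicuna-7B-v1.3",
    "Vicuna-13B-v1.3": "yuhuili/EAGLE-Vicuna-13B-v1.3",
    "Vicuna-33B-v1.3": "yuhuili/EAGLE-Vicuna-33B-v1.3",
    "LLaMA2-Chat 7B": "yuhuili/EAGLE-llama2-chat-7B",
    "LLaMA2-Chat 13B": "yuhuili/EAGLE-llama2-chat-13B",
    "LLaMA2-Chat 70B": "yuhuili/EAGLE-llama2-chat-70B",
    "Mixtral-8x7B-Instruct-v0.1": "yuhuili/EAGLE-mixtral-instruct-8x7B",
    "LLaMA3-Instruct 8B": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
    "LLaMA3-Instruct 70B": "yuhuili/EAGLE-LLaMA3-Instruct-70B",
    "Qwen2-7B-Instruct": "yuhuili/EAGLE-Qwen2-7B-Instruct",
    "Qwen2-72B-Instruct": "yuhuili/EAGLE-Qwen2-72B-Instruct",
}


def find_eagle_draft(base_model: str) -> Optional[str]:
    """Return the unconverted EAGLE draft repository, or None if not listed."""
    return EAGLE_DRAFTS.get(base_model)


print(find_eagle_draft("LLaMA3-Instruct 8B"))
```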
## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
...