Commit 0e12cd67
Authored Aug 07, 2024 by Stas Bekman
Committed by GitHub, Aug 07, 2024

[Doc] add online speculative decoding example (#7243)

Parent: 80cbe10c

Showing 1 changed file with 55 additions and 11 deletions

docs/source/models/spec_decode.rst (+55 -11)

@@ -14,17 +14,17 @@ Speculative decoding is a technique which improves inter-token latency in memory
 Speculating with a draft model
 ------------------------------

-The following code configures vLLM to use speculative decoding with a draft model, speculating 5 tokens at a time.
+The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

 .. code-block:: python

     from vllm import LLM, SamplingParams

     prompts = [
         "The future of AI is",
     ]
     sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

     llm = LLM(
         model="facebook/opt-6.7b",
         tensor_parallel_size=1,
@@ -33,12 +33,56 @@ The following code configures vLLM to use speculative decoding with a draft mode
         use_v2_block_manager=True,
     )
     outputs = llm.generate(prompts, sampling_params)

     for output in outputs:
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

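
The draft-and-verify idea behind the offline example above can be illustrated with a small toy. This is a minimal sketch and not vLLM's implementation: `target_model` and `draft_model` are hypothetical stand-ins that map a token list to a single next token, and the loop covers only greedy decoding (the example above samples with `temperature=0.8`, which needs a probabilistic acceptance rule that is not shown here). The draft proposes up to 5 tokens, the target keeps the longest agreeing prefix, and the final check suggests that in the greedy case the output matches plain decoding with the target alone.

```python
from typing import Callable, List

# Hypothetical stand-ins: each "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_generate(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_speculative_tokens: int = 5,
) -> List[int]:
    """Greedy draft-and-verify loop; returns only the newly generated tokens."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(num_speculative_tokens):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position. A real engine
        #    would score all positions in one batched forward pass; this toy
        #    calls the target once per position for clarity.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess == expected:
                accepted.append(guess)
            else:
                accepted.append(expected)  # keep the target's own token
                break
        else:
            # Every guess matched, so the target contributes one bonus token.
            accepted.append(target(tokens + accepted))

        accepted = accepted[: max_new_tokens - produced]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[len(prompt):]


def target_model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10  # toy rule: count upward modulo 10


def draft_model(seq: List[int]) -> int:
    # Agrees with the target except right after a 4, where it guesses wrong.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def greedy_generate(model: NextToken, prompt: List[int], n: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens[len(prompt):]


spec_out = speculative_generate(target_model, draft_model, [1], max_new_tokens=12)
plain_out = greedy_generate(target_model, [1], 12)
print(spec_out == plain_out)
```

Whatever speedup there is comes from step 2: when the draft guesses well, one verification round yields several tokens instead of one.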
+To perform the same with an online mode launch the server:
+
+.. code-block:: bash
+
+    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
+        --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
+        --num_speculative_tokens 5 --gpu_memory_utilization 0.8
+
+Then use a client:
+
+.. code-block:: python
+
+    from openai import OpenAI
+
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+
+    client = OpenAI(
+        # defaults to os.environ.get("OPENAI_API_KEY")
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+
+    models = client.models.list()
+    model = models.data[0].id
+
+    # Completion API
+    stream = False
+    completion = client.completions.create(
+        model=model,
+        prompt="The future of AI is",
+        echo=False,
+        n=1,
+        stream=stream,
+    )
+
+    print("Completion results:")
+    if stream:
+        for c in completion:
+            print(c)
+    else:
+        print(completion)
+
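
When the server is launched from a script, it can help to keep the flags in one place and assemble the command programmatically. The following is a minimal sketch: the flag names and values are copied from the server command in this diff, while `server_options` and `build_server_command` are hypothetical names introduced only for illustration.

```python
import shlex
from typing import Dict, List, Optional

# Settings copied from the server command in the diff. A value of None marks
# a boolean switch that takes no argument.
server_options: Dict[str, Optional[str]] = {
    "--host": "0.0.0.0",
    "--port": "8000",
    "--model": "facebook/opt-6.7b",
    "--seed": "42",
    "-tp": "1",
    "--speculative_model": "facebook/opt-125m",
    "--use-v2-block-manager": None,
    "--num_speculative_tokens": "5",
    "--gpu_memory_utilization": "0.8",
}


def build_server_command(options: Dict[str, Optional[str]]) -> List[str]:
    """Turn the options mapping into an argv list for the API server module."""
    argv = ["python", "-m", "vllm.entrypoints.openai.api_server"]
    for flag, value in options.items():
        argv.append(flag)
        if value is not None:
            argv.append(value)
    return argv


argv = build_server_command(server_options)
print(shlex.join(argv))  # shell-quoted, single-line form of the command
```

The resulting `argv` list could be passed to `subprocess.Popen` to start the server from Python; the sketch stops short of that, since launching requires vLLM and the model weights to be available.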
 Speculating by matching n-grams in the prompt
 ---------------------------------------------

@@ -48,12 +92,12 @@ matching n-grams in the prompt. For more information read `this thread. <https:/
 .. code-block:: python

     from vllm import LLM, SamplingParams

     prompts = [
         "The future of AI is",
     ]
     sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

     llm = LLM(
         model="facebook/opt-6.7b",
         tensor_parallel_size=1,
@@ -63,7 +107,7 @@ matching n-grams in the prompt. For more information read `this thread. <https:/
         use_v2_block_manager=True,
     )
     outputs = llm.generate(prompts, sampling_params)

     for output in outputs:
         prompt = output.prompt
         generated_text = output.outputs[0].text
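
Matching n-grams in the prompt can be pictured as a lookup that needs no draft model at all. The sketch below is a toy illustration and not vLLM's implementation; `propose_from_prompt` and its parameters `max_ngram`, `min_ngram`, and `num_speculative_tokens` are hypothetical names. It takes the last few tokens of the sequence, searches for an earlier occurrence of the same n-gram, and proposes the tokens that followed that occurrence.

```python
from typing import List, Sequence


def propose_from_prompt(
    tokens: Sequence[int],
    max_ngram: int = 3,
    min_ngram: int = 1,
    num_speculative_tokens: int = 5,
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`."""
    # Prefer longer n-grams: a longer match is a stronger hint.
    for n in range(min(max_ngram, len(tokens) - 1), min_ngram - 1, -1):
        suffix = list(tokens[-n:])
        # Scan earlier positions, most recent first, excluding the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                follow = start + n
                return list(tokens[follow:follow + num_speculative_tokens])
    return []  # no match: nothing to speculate on this step


# The trailing bigram (5, 6) also appears at the start of the sequence,
# so the tokens that followed it there become the proposal.
proposal = propose_from_prompt([5, 6, 7, 8, 9, 1, 5, 6])
print(proposal)  # [7, 8, 9, 1, 5]
```

Proposals from such a lookup would still be checked by the main model, so a poor match costs a verification step and does not change the output.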
@@ -74,7 +118,7 @@ Speculating using MLP speculators
 The following code configures vLLM to use speculative decoding where proposals are generated by
 draft models that conditioning draft predictions on both context vectors and sampled tokens.
 For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
 `this technical report <https://arxiv.org/abs/2404.19124>`_.

 .. code-block:: python

@@ -100,9 +144,9 @@ For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-
         generated_text = output.outputs[0].text
         print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

 Note that these speculative models currently need to be run without tensor parallelism, although
 it is possible to run the main model using tensor parallelism (see example above). Since the
 speculative models are relatively small, we still see significant speedups. However, this
 limitation will be fixed in a future release.

 A variety of speculative models of this type are available on HF hub:
...