OpenDAS / vllm_cscc
Commit 0e12cd67
Authored Aug 07, 2024 by Stas Bekman
Committed by GitHub, Aug 07, 2024

[Doc] add online speculative decoding example (#7243)

Parent: 80cbe10c
Showing 1 changed file with 55 additions and 11 deletions

docs/source/models/spec_decode.rst (+55, -11)
...
...
@@ -14,7 +14,7 @@ Speculative decoding is a technique which improves inter-token latency in memory
 Speculating with a draft model
 ------------------------------

-The following code configures vLLM to use speculative decoding with a draft model, speculating 5 tokens at a time.
+The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

 .. code-block:: python
...
...
@@ -39,6 +39,50 @@ The following code configures vLLM to use speculative decoding with a draft mode
         generated_text = output.outputs[0].text
         print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+To perform the same with an online mode launch the server:
+
+.. code-block:: bash
+
+    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
+        --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
+        --num_speculative_tokens 5 --gpu_memory_utilization 0.8
+
+Then use a client:
+
+.. code-block:: python
+
+    from openai import OpenAI
+
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+
+    client = OpenAI(
+        # defaults to os.environ.get("OPENAI_API_KEY")
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+
+    models = client.models.list()
+    model = models.data[0].id
+
+    # Completion API
+    stream = False
+    completion = client.completions.create(
+        model=model,
+        prompt="The future of AI is",
+        echo=False,
+        n=1,
+        stream=stream,
+    )
+
+    print("Completion results:")
+    if stream:
+        for c in completion:
+            print(c)
+    else:
+        print(completion)
+
 Speculating by matching n-grams in the prompt
 ---------------------------------------------
...
...
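
Note for readers of this commit: both the offline and online examples ask for "5 tokens at a time" (`num_speculative_tokens`). The toy sketch below illustrates what that setting means mechanically under greedy decoding. It does not use vLLM. `target_next`, `draft_next`, `greedy`, and `speculative_greedy` are hypothetical stand-ins invented for illustration, and a real engine typically verifies all drafted positions in one batched forward pass of the target model rather than in a Python loop.

```python
from typing import Callable, List, Tuple

# Toy "model": maps the tokens seen so far to the next token id.
NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return sum(tokens) % 7


def draft_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.

    Agrees with target_next except when the context length is a multiple of 4.
    """
    guess = sum(tokens) % 7
    return (guess + 1) % 7 if len(tokens) % 4 == 0 else guess


def greedy(prompt: List[int], max_new_tokens: int,
           target: NextToken = target_next) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_greedy(prompt: List[int], max_new_tokens: int,
                       num_speculative_tokens: int = 5,
                       target: NextToken = target_next,
                       draft: NextToken = draft_next) -> Tuple[List[int], int]:
    """Draft-and-verify loop; returns (generated tokens, verification rounds)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < limit:
        rounds += 1
        # 1. The draft model proposes k tokens, feeding on its own guesses.
        proposal: List[int] = []
        for _ in range(num_speculative_tokens):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks the proposals left to right.
        for proposed in proposal:
            expected = target(tokens)
            tokens.append(expected)  # equals `proposed` when the draft was right
            if proposed != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            # Every proposal was accepted, so the target adds one more token.
            tokens.append(target(tokens))
    # A round may overshoot the budget; trim to the requested length.
    return tokens[len(prompt):limit], rounds


if __name__ == "__main__":
    out, rounds = speculative_greedy([1, 2, 3], max_new_tokens=12)
    print("tokens:", out)
    print("verification rounds:", rounds)
    print("same as plain greedy:", out == greedy([1, 2, 3], max_new_tokens=12))
```

In this toy the output matches plain greedy decoding by construction, since every kept token comes from the target; what changes is the number of verification rounds, which drops when the draft guesses well.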