Commit 0e12cd67, authored by Stas Bekman, committed by GitHub

[Doc] add online speculative decoding example (#7243)

parent 80cbe10c
@@ -14,17 +14,17 @@ Speculative decoding is a technique which improves inter-token latency in memory

Speculating with a draft model
------------------------------

The following code configures vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
@@ -33,12 +33,56 @@ The following code configures vLLM to use speculative decoding with a draft mode
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

To do the same in online mode, launch the server:

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
        --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
        --num_speculative_tokens 5 --gpu_memory_utilization 0.8

Then use a client:

.. code-block:: python

    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Use the first (and only) model the server reports.
    models = client.models.list()
    model = models.data[0].id

    # Completion API
    stream = False
    completion = client.completions.create(
        model=model,
        prompt="The future of AI is",
        echo=False,
        n=1,
        stream=stream,
    )

    print("Completion results:")
    if stream:
        # With stream=True the result is an iterator of chunks.
        for c in completion:
            print(c)
    else:
        print(completion)

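
For intuition about what ``num_speculative_tokens`` controls, the following toy sketch shows one draft-then-verify step in the greedy case. It is not vLLM's implementation: ``speculative_step``, ``draft_next`` and ``target_next`` are hypothetical names, the "models" are plain functions over integer tokens, and a real engine verifies all proposed positions in one batched forward pass and uses a probabilistic acceptance rule when sampling.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    num_speculative_tokens: int,
) -> List[Token]:
    """Return the tokens emitted by one draft-then-verify step (greedy case)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(num_speculative_tokens):
        proposal.append(draft_next(context + proposal))

    # 2. The target model checks each proposed token in order.
    #    (Looped here for clarity; an engine would batch this.)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target adds one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy models (assumptions for illustration only).
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    # Agrees with the target except right after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens_partial = speculative_step([0], draft_next, target_next, 5)
tokens_full = speculative_step([4], draft_next, target_next, 3)
print(tokens_partial)
print(tokens_full)
```

In this greedy toy, the emitted tokens are always the ones the target model alone would have produced; the draft model only affects how many of them come out of a single step.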
Speculating by matching n-grams in the prompt
---------------------------------------------

@@ -48,12 +92,12 @@ matching n-grams in the prompt. For more information read `this thread. <https:/

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
@@ -63,7 +107,7 @@ matching n-grams in the prompt. For more information read `this thread. <https:/
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
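
As a rough illustration of the n-gram idea, the sketch below proposes draft tokens by finding the most recent earlier occurrence of the sequence's final n-gram and copying what followed it. This is a simplified assumption about how prompt lookup can work, not vLLM's code; ``propose_from_prompt`` and its parameters are hypothetical names, and words stand in for token ids.

```python
from typing import List, Sequence


def propose_from_prompt(
    tokens: Sequence[str],
    ngram_size: int,
    num_speculative_tokens: int,
) -> List[str]:
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`."""
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan right to left so the most recent earlier match wins.
    # The range excludes the trailing n-gram's own position.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + num_speculative_tokens])
    return []  # no match: nothing to speculate on this step


sequence = "the cat sat on the mat and the cat".split()
proposal = propose_from_prompt(sequence, ngram_size=2, num_speculative_tokens=3)
print(proposal)
```

The proposed tokens would still need to be verified by the main model, exactly as with a draft model; the lookup only replaces the source of the proposals.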
@@ -74,7 +118,7 @@ Speculating using MLP speculators

The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
`this technical report <https://arxiv.org/abs/2404.19124>`_.

.. code-block:: python

@@ -100,9 +144,9 @@ For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note that these speculative models currently need to be run without tensor parallelism, although
it is possible to run the main model using tensor parallelism (see example above). Since the
speculative models are relatively small, we still see significant speedups. This
limitation will be fixed in a future release.

A variety of speculative models of this type are available on HF hub:

......