.. _spec_decode:

Speculative decoding in vLLM
============================

.. warning::
    Speculative decoding in vLLM is not yet optimized and does not yet reduce
    inter-token latency for every prompt dataset or set of sampling parameters. The work
    to optimize it is ongoing and can be followed in `this issue <https://github.com/vllm-project/vllm/issues/4630>`_.

This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
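
The draft-then-verify idea can be illustrated with a toy sketch. This is not vLLM's implementation: the two "models" are hypothetical stand-in functions, and decoding is assumed to be greedy, so acceptance is a plain equality check. Sampled decoding needs a probabilistic acceptance rule instead.

.. code-block:: python

    from typing import Callable, List

    # Hypothetical stand-in for a model: maps a token sequence to the single
    # next token it would pick greedily.
    NextToken = Callable[[List[int]], int]


    def speculative_step(target: NextToken, draft: NextToken,
                         tokens: List[int], k: int) -> List[int]:
        """Run one draft-then-verify step and return the newly accepted tokens."""
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position. A real engine can
        #    score all positions in one batched forward pass; this loop is
        #    sequential only to keep the toy readable.
        accepted: List[int] = []
        for proposed in proposal:
            expected = target(tokens + accepted)
            if proposed != expected:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                return accepted
            accepted.append(proposed)

        # 3. Every proposal matched, so the target contributes one bonus token.
        accepted.append(target(tokens + accepted))
        return accepted


    def target_model(tokens: List[int]) -> int:
        # Toy target: counts upward modulo 10.
        return (tokens[-1] + 1) % 10


    def draft_model(tokens: List[int]) -> int:
        # Toy draft: agrees with the target except right after a 3.
        return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


    print(speculative_step(target_model, draft_model, [0], k=5))  # [1, 2, 3, 4]

Every accepted token is one the target would have chosen itself, so the output matches plain greedy decoding with the target. The gain is that several tokens can be confirmed per target pass.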

Speculating with a draft model
------------------------------

The following code configures vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        # Draft model that proposes tokens for the main model to verify.
        speculative_model="facebook/opt-125m",
        # Number of tokens drafted per speculation step.
        num_speculative_tokens=5,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
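
A rough way to reason about the ``num_speculative_tokens=5`` setting above is to estimate how many tokens one verification step yields. The sketch below assumes each drafted token is accepted independently with the same probability, a simplification common in the speculative decoding literature. The function name and the acceptance rates are illustrative assumptions, not values reported by vLLM, and the cost of running the draft model is ignored.

.. code-block:: python

    def expected_tokens_per_step(acceptance_rate: float,
                                 num_speculative_tokens: int) -> float:
        """Expected tokens emitted per target-model pass (hypothetical helper).

        Assumes independent, identical per-token acceptance. The sum
        1 + a + a^2 + ... + a^k has the closed form (1 - a^(k+1)) / (1 - a).
        """
        if not 0.0 <= acceptance_rate <= 1.0:
            raise ValueError("acceptance_rate must be in [0, 1]")
        if acceptance_rate == 1.0:
            # All k drafts accepted, plus one bonus token from the target.
            return float(num_speculative_tokens + 1)
        return (1.0 - acceptance_rate ** (num_speculative_tokens + 1)) / (
            1.0 - acceptance_rate
        )


    # Assumed acceptance rates, chosen only to show the trend.
    for rate in (0.5, 0.8, 0.95):
        tokens = expected_tokens_per_step(rate, 5)
        print(f"acceptance_rate={rate}: {tokens:.2f} tokens per step")

When the draft model rarely agrees with the main model, the estimate stays close to one token per step, which is one reason speculation does not always reduce latency.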

To do the same in online mode, launch the server:

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
    --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
    --num_speculative_tokens 5 --gpu_memory_utilization 0.8

Then use a client:

.. code-block:: python

    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    # Completion API
    stream = False
    completion = client.completions.create(
        model=model,
        prompt="The future of AI is",
        echo=False,
        n=1,
        stream=stream,
    )

    print("Completion results:")
    if stream:
        for c in completion:
            print(c)
    else:
        print(completion)
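
Since the benefit of speculative decoding depends on the workload, it can help to measure the gaps between streamed chunks with and without speculation. The helper below is a minimal sketch: ``inter_token_gaps`` and ``fake_stream`` are hypothetical names, and the fake stream only stands in for a server response so that the example runs on its own.

.. code-block:: python

    import time
    from typing import Iterable, Iterator, List


    def inter_token_gaps(chunks: Iterable[object]) -> List[float]:
        """Return the seconds elapsed between consecutive streamed chunks."""
        gaps: List[float] = []
        previous = None
        for _ in chunks:
            now = time.perf_counter()
            if previous is not None:
                gaps.append(now - previous)
            previous = now
        return gaps


    def fake_stream(n: int, delay_s: float) -> Iterator[str]:
        # Hypothetical stand-in for a streamed completion.
        for i in range(n):
            time.sleep(delay_s)
            yield f"chunk-{i}"


    gaps = inter_token_gaps(fake_stream(n=5, delay_s=0.01))
    print(f"mean gap: {sum(gaps) / len(gaps) * 1000:.1f} ms over {len(gaps)} gaps")

Against the server above, the same helper could be given the iterator returned by ``client.completions.create(..., stream=True)``. A chunk is not guaranteed to carry exactly one token, so the gaps are only an approximation of inter-token latency.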

Speculating by matching n-grams in the prompt
---------------------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information, read `this thread <https://x.com/joao_gante/status/1747322413006643259>`_.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        # "[ngram]" selects prompt n-gram matching instead of a draft model.
        speculative_model="[ngram]",
        num_speculative_tokens=5,
        # Maximum n-gram window size used for the prompt lookup.
        ngram_prompt_lookup_max=4,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
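
The idea behind n-gram proposals can be sketched in a few lines. This toy is not vLLM's proposer: ``propose_from_prompt`` is a hypothetical function, and details such as preferring the longest n-gram and the most recent match are assumptions made for illustration.

.. code-block:: python

    from typing import List


    def propose_from_prompt(tokens: List[int], max_ngram: int,
                            num_speculative_tokens: int) -> List[int]:
        """Propose draft tokens by finding an earlier copy of the sequence tail."""
        # Try the longest n-gram first, since longer matches are more specific.
        for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
            tail = tokens[-n:]
            # Scan right to left so the most recent earlier occurrence wins.
            for start in range(len(tokens) - n - 1, -1, -1):
                if tokens[start:start + n] == tail:
                    # Propose whatever followed the earlier occurrence.
                    end = start + n + num_speculative_tokens
                    return tokens[start + n:end]
        # No earlier occurrence of the tail: nothing to propose this step.
        return []


    # The tail [1, 2, 3] also appears at the start, followed by 4, 5, ...
    history = [1, 2, 3, 4, 5, 1, 2, 3]
    print(propose_from_prompt(history, max_ngram=4, num_speculative_tokens=5))
    # [4, 5, 1, 2, 3]

Proposals like these cost no extra model forward pass, so the approach tends to suit inputs where the output repeats spans of the prompt. The main model still verifies every proposed token.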

Speculating using MLP speculators
---------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information, see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
`this technical report <https://arxiv.org/abs/2404.19124>`_.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tensor_parallel_size=4,
        speculative_model="ibm-fms/llama3-70b-accelerator",
        speculative_draft_tensor_parallel_size=1,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note that these speculative models currently need to be run without tensor parallelism, although
the main model can be run with tensor parallelism (see the example above). Since the
speculative models are relatively small, we still see significant speedups. This
limitation will be fixed in a future release.

A variety of speculative models of this type are available on the Hugging Face Hub:

* `llama-13b-accelerator <https://huggingface.co/ibm-fms/llama-13b-accelerator>`_
* `llama3-8b-accelerator <https://huggingface.co/ibm-fms/llama3-8b-accelerator>`_
* `codellama-34b-accelerator <https://huggingface.co/ibm-fms/codellama-34b-accelerator>`_
* `llama2-70b-accelerator <https://huggingface.co/ibm-fms/llama2-70b-accelerator>`_
* `llama3-70b-accelerator <https://huggingface.co/ibm-fms/llama3-70b-accelerator>`_
* `granite-3b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator>`_
* `granite-8b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator>`_
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_


Resources for vLLM contributors
-------------------------------

* `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_
* `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_
* `Information on batch expansion <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_
* `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_