# Generative Models

vLLM provides first-class support for generative models, which cover most LLMs.

In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through [Sampler][vllm.model_executor.layers.sampler.Sampler] to obtain the final text.

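The step from log probabilities to a chosen token can be illustrated with a small, library-free sketch. This is not vLLM's `Sampler` implementation; the function names, the three-token vocabulary, and the scores are all made up for illustration. It only shows the general idea of greedy selection versus temperature-scaled sampling.

```python
import math
import random


def log_softmax(logits):
    """Convert raw scores into log probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sample_token(logprobs, temperature=1.0, rng=random):
    """Pick a token index from log probabilities.

    In this sketch, temperature == 0 is treated as greedy decoding (argmax).
    """
    if temperature == 0:
        return max(range(len(logprobs)), key=lambda i: logprobs[i])
    # Rescale by the temperature, then renormalize into probabilities.
    scaled = log_softmax([lp / temperature for lp in logprobs])
    weights = [math.exp(lp) for lp in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical vocabulary and made-up scores.
vocab = ["Alice", "Bob", "Carol"]
logprobs = log_softmax([2.0, 1.0, 0.1])

greedy = vocab[sample_token(logprobs, temperature=0)]
sampled = vocab[sample_token(logprobs, temperature=0.8, rng=random.Random(0))]
print(f"greedy: {greedy}, sampled: {sampled}")
```

Greedy selection always returns the highest-scoring token, while a positive temperature lets lower-scoring tokens be chosen some of the time.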
## Configuration

### Model Runner (`--runner`)

Run a model in generation mode via the option `--runner generate`.

!!! tip
    There is no need to set this option in the vast majority of cases as vLLM can automatically
    detect the model runner to use via `--runner auto`.

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
See [configuration](../api/summary.md#configuration) for a list of options when initializing the model.

### `LLM.generate`

The [generate][vllm.LLM.generate] method is available to all generative models in vLLM.
It is similar to [its counterpart in HF Transformers](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate),
except that tokenization and detokenization are also performed automatically.

```python
from vllm import LLM

# Load the model once; the same instance can serve many prompts.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate("Hello, my name is")

# Each output pairs the original prompt with its generated completions.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

You can optionally control the language generation by passing [SamplingParams][vllm.SamplingParams].
For example, you can use greedy sampling by setting `temperature=0`:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# temperature=0 selects greedy sampling.
params = SamplingParams(temperature=0)
outputs = llm.generate("Hello, my name is", params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

!!! important
    By default, vLLM uses the sampling parameters recommended by the model creator, applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this gives the best results when [SamplingParams][vllm.SamplingParams] is not specified.

    However, if vLLM's default sampling parameters are preferred, pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.

A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>

### `LLM.beam_search`

The [beam_search][vllm.LLM.beam_search] method implements [beam search](https://huggingface.co/docs/transformers/en/generation_strategies#beam-search) on top of [generate][vllm.LLM.generate].
For example, to search using 5 beams and output at most 50 tokens:

```python
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="facebook/opt-125m")
# Keep 5 candidate sequences per step and stop after at most 50 tokens.
params = BeamSearchParams(beam_width=5, max_tokens=50)
outputs = llm.beam_search([{"prompt": "Hello, my name is "}], params)

for output in outputs:
    # The first sequence of each output is printed here.
    generated_text = output.sequences[0].text
    print(f"Generated text: {generated_text!r}")
```
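
For readers unfamiliar with beam search itself, the following toy sketch shows the general technique on a hand-written next-token table. It is not how vLLM implements `LLM.beam_search`; the `NEXT` table, its probabilities, and the `beam_search` helper are hypothetical and exist only to show how keeping several candidates per step can find a more probable sequence than greedy decoding.

```python
import math

# Hypothetical next-token model: maps the last token to candidate
# continuations with made-up probabilities.
NEXT = {
    "<s>": {"Hello": 0.6, "Hi": 0.4},
    "Hello": {"world": 0.5, "there": 0.5},
    "Hi": {"there": 0.9, "world": 0.1},
    "world": {"</s>": 1.0},
    "there": {"</s>": 1.0},
}


def beam_search(beam_width, max_tokens):
    """Return the surviving beams as (cumulative log prob, tokens) pairs."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                # Finished sequences are carried over unchanged.
                candidates.append((score, tokens))
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the highest-scoring candidates for the next step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


best_score, best_tokens = beam_search(beam_width=2, max_tokens=3)[0]
greedy_score, greedy_tokens = beam_search(beam_width=1, max_tokens=3)[0]
print(best_tokens, round(math.exp(best_score), 2))
print(greedy_tokens, round(math.exp(greedy_score), 2))
```

With a beam width of 1 the search commits to "Hello" because it is the most likely first token, and ends with a total probability of 0.30. With a width of 2 it also keeps "Hi" alive and finds the sequence with probability 0.36.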

### `LLM.chat`

The [chat][vllm.LLM.chat] method implements chat functionality on top of [generate][vllm.LLM.generate].
In particular, it accepts input similar to the [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.

!!! important
    In general, only instruction-tuned models have a chat template.
    Base models may perform poorly as they are not trained to respond to chat conversations.

??? code

    ```python
    from vllm import LLM

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Hello"
        },
        {
            "role": "assistant",
            "content": "Hello! How can I assist you today?"
        },
        {
            "role": "user",
            "content": "Write an essay about the importance of higher education.",
        },
    ]
    outputs = llm.chat(conversation)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    ```

A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>

If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template:

```python
from vllm.entrypoints.chat_utils import load_chat_template

# You can find a list of existing chat templates under `examples/`
custom_template = load_chat_template(chat_template="<path_to_template>")
print("Loaded chat template:", custom_template)

outputs = llm.chat(conversation, chat_template=custom_template)
```

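To make the role of a chat template more concrete, here is a minimal sketch of what "applying a template" means: a list of role/content messages is flattened into a single prompt string. The `render_chat` helper and the `<|role|>` / `<|end|>` markers are invented for illustration; real models ship their own template, and vLLM applies that one rather than anything like this function.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten role/content messages into one prompt string.

    The <|role|> ... <|end|> markers are a made-up format, not the
    template of any particular model.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Cue the model to continue as the assistant.
        parts.append("<|assistant|>\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
]
prompt = render_chat(conversation)
print(prompt)
```

A base model has never seen markers like these during training, which is one way to see why it may respond poorly to chat-formatted prompts.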
## Online Serving

Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
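
The difference between the two endpoints shows up mainly in the request body. The sketch below builds (but does not send) one request of each kind using only the standard library. It assumes a server listening at `http://localhost:8000` and follows the OpenAI-style `/v1/completions` and `/v1/chat/completions` paths; the address, the `build_request` helper, and the `max_tokens` value are illustrative assumptions, so adjust them to your deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed server address


def build_request(path, payload):
    """Create a JSON POST request without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Completions: a plain text prompt, like LLM.generate.
completion_req = build_request(
    "/completions",
    {"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16},
)

# Chat: a list of role/content messages, like LLM.chat.
chat_req = build_request(
    "/chat/completions",
    {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)

print(completion_req.full_url)
print(chat_req.full_url)
# To send a request against a running server:
#   urllib.request.urlopen(completion_req)
```

The `model` field should match the model the server was started with.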