"vllm/vscode:/vscode.git/clone" did not exist on "67b4221a61ace91a79aff507df0a95a01978300e"
# Structured Outputs

vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
[guidance](https://github.com/guidance-ai/llguidance) as backends.
This document shows you some examples of the different options that are
available to generate structured outputs.

!!! warning
    If you are still using the following deprecated API fields which were removed in v0.12.0, please update your code to use `structured_outputs` as demonstrated in the rest of this document:

    - `guided_json` -> `{"structured_outputs": {"json": ...}}` or `StructuredOutputsParams(json=...)`
    - `guided_regex` -> `{"structured_outputs": {"regex": ...}}` or `StructuredOutputsParams(regex=...)`
    - `guided_choice` -> `{"structured_outputs": {"choice": ...}}` or `StructuredOutputsParams(choice=...)`
    - `guided_grammar` -> `{"structured_outputs": {"grammar": ...}}` or `StructuredOutputsParams(grammar=...)`
    - `guided_whitespace_pattern` -> `{"structured_outputs": {"whitespace_pattern": ...}}` or `StructuredOutputsParams(whitespace_pattern=...)`
    - `structural_tag` -> `{"structured_outputs": {"structural_tag": ...}}` or `StructuredOutputsParams(structural_tag=...)`
    - `guided_decoding_backend` -> Remove this field from your request
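
The mapping above can be applied mechanically to an existing request. The following is a minimal sketch, not part of vLLM: `migrate_extra_body` is a hypothetical helper that rewrites a legacy `extra_body` dictionary into the `structured_outputs` form and drops `guided_decoding_backend`.

```python
# Mapping taken from the deprecation list above:
# legacy field name -> key inside "structured_outputs".
LEGACY_TO_STRUCTURED = {
    "guided_json": "json",
    "guided_regex": "regex",
    "guided_choice": "choice",
    "guided_grammar": "grammar",
    "guided_whitespace_pattern": "whitespace_pattern",
    "structural_tag": "structural_tag",
}


def migrate_extra_body(extra_body: dict) -> dict:
    """Hypothetical helper: return a copy of `extra_body` that uses `structured_outputs`."""
    migrated = {}
    structured = dict(extra_body.get("structured_outputs", {}))
    for key, value in extra_body.items():
        if key == "structured_outputs":
            continue  # Already copied above.
        if key == "guided_decoding_backend":
            continue  # The list above says to remove this field from the request.
        if key in LEGACY_TO_STRUCTURED:
            structured[LEGACY_TO_STRUCTURED[key]] = value
        else:
            migrated[key] = value  # Unrelated fields are kept as they are.
    if structured:
        migrated["structured_outputs"] = structured
    return migrated


old = {"guided_choice": ["positive", "negative"], "guided_decoding_backend": "xgrammar"}
print(migrate_extra_body(old))
# {'structured_outputs': {'choice': ['positive', 'negative']}}
```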

## Online Serving (OpenAI API)

You can generate structured outputs using OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.

The following parameters are supported, which must be added as extra parameters:

- `choice`: the output will be exactly one of the choices.
- `regex`: the output will follow the regex pattern.
- `json`: the output will follow the JSON schema.
- `grammar`: the output will follow the context-free grammar.
- `structural_tag`: the output will follow a JSON schema within a set of specified tags in the generated text.

You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.

Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the
`--structured-outputs-config.backend` flag to `vllm serve`. The default backend is `auto`,
which will try to choose an appropriate backend based on the details of the
request. You may also choose a specific backend, along with
some options. A full set of options is available in the `vllm serve --help`
text.
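
As a rough illustration of the flag described above, the sketch below assembles a `vllm serve` command line with an explicit backend. `build_serve_command` is a hypothetical helper, and the model name is only a placeholder borrowed from the Offline Inference section.

```python
import shlex


def build_serve_command(model: str, backend: str = "auto") -> list[str]:
    """Hypothetical helper: assemble a `vllm serve` command with an explicit backend."""
    return [
        "vllm",
        "serve",
        model,
        "--structured-outputs-config.backend",
        backend,
    ]


# "xgrammar" is one of the backends named at the top of this page.
command = build_serve_command("HuggingFaceTB/SmolLM2-1.7B-Instruct", backend="xgrammar")
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where vLLM is installed.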

Let's look at an example for each case, starting with `choice`, the simplest one:

??? code

    ```python
    from openai import OpenAI
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="-",
    )
    model = client.models.list().data[0].id

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
        ],
        extra_body={"structured_outputs": {"choice": ["positive", "negative"]}},
    )
    print(completion.choices[0].message.content)
    ```
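
In the OpenAI Python client, the keys of `extra_body` are merged into the top level of the JSON request body. The sketch below shows what the equivalent raw request could look like when built with the standard library only. The URL assumes the local server from the example above, `my-model` is a placeholder, and `post_chat` is defined but not called because it needs a running server.

```python
import json
import urllib.request

# Assumption: the server from the example above is listening on localhost:8000.
CHAT_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_body(model: str, prompt: str, choices: list[str]) -> dict:
    """Build the JSON body; `structured_outputs` sits next to `model` and `messages`."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "structured_outputs": {"choice": choices},
    }


def post_chat(body: dict) -> dict:
    """Send the request with the standard library (needs a running server)."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_chat_body(
    model="my-model",  # Placeholder; use an id returned by the /v1/models endpoint.
    prompt="Classify this sentiment: vLLM is wonderful!",
    choices=["positive", "negative"],
)
print(json.dumps(body, indent=2))
```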

The next example shows how to use the `regex` parameter. The supported regex syntax depends on the structured output backend: for example, `xgrammar`, `guidance`, and `outlines` use Rust-style regex, while `lm-format-enforcer` uses Python's `re` module. The goal here is to generate an email address from a simple regex template:

??? code

    ```python
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
            }
        ],
        extra_body={"structured_outputs": {"regex": r"\w+@\w+\.com\n"}, "stop": ["\n"]},
    )
    print(completion.choices[0].message.content)
    ```
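
It can help to check on the client side which strings the pattern actually admits. The sketch below uses Python's `re` module, which for this simple pattern is assumed to agree with the Rust-style syntax mentioned above. The trailing `\n` is left out of the check on the assumption that the stop sequence is not included in the returned text.

```python
import re

# Same pattern as in the request above, without the trailing newline
# (assumption: the "\n" stop sequence is excluded from the returned text).
EMAIL_PATTERN = re.compile(r"\w+@\w+\.com")


def matches_email_pattern(text: str) -> bool:
    """Return True if the whole string matches the simplified email pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None


print(matches_email_pattern("alanturing@enigma.com"))   # True
print(matches_email_pattern("alan.turing@enigma.com"))  # False: "." is not matched by \w
```

Note that the address given as an example result in the prompt contains a dot before the `@`, which `\w+` does not allow, so the constrained output cannot reproduce it exactly.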

One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats.
For this we can use the `json` parameter in two different ways:

- Using a [JSON Schema](https://json-schema.org/) directly
- Defining a [Pydantic model](https://docs.pydantic.dev/latest/) and then extracting the JSON Schema from it (which is normally an easier option).

The next example passes the JSON Schema extracted from a Pydantic model through the `response_format` parameter:

??? code

    ```python
    from pydantic import BaseModel
    from enum import Enum

    class CarType(str, Enum):
        sedan = "sedan"
        suv = "SUV"
        truck = "Truck"
        coupe = "Coupe"

    class CarDescription(BaseModel):
        brand: str
        model: str
        car_type: CarType

    json_schema = CarDescription.model_json_schema()

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "car-description",
                "schema": json_schema,
            },
        },
    )
    print(completion.choices[0].message.content)
    ```
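
The other option from the list above, writing the JSON Schema by hand and passing it through the `json` parameter, might look like the following sketch. The schema is a hand-written approximation of the `CarDescription` model, not the exact output of `model_json_schema()`, and `check_car` is a hypothetical helper that performs only a minimal check with the standard library.

```python
import json

# Hand-written schema mirroring the CarDescription model above (an approximation).
car_schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "model": {"type": "string"},
        "car_type": {"type": "string", "enum": ["sedan", "SUV", "Truck", "Coupe"]},
    },
    "required": ["brand", "model", "car_type"],
}

# Keyword arguments for `client.chat.completions.create(...)`; nothing is sent here.
request_kwargs = {
    "messages": [
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    "extra_body": {"structured_outputs": {"json": car_schema}},
}


def check_car(content: str) -> dict:
    """Minimal stdlib check of a response; a full JSON Schema validator would be stricter."""
    car = json.loads(content)
    missing = [key for key in car_schema["required"] if key not in car]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if car["car_type"] not in car_schema["properties"]["car_type"]["enum"]:
        raise ValueError(f"unexpected car_type: {car['car_type']!r}")
    return car


# Illustrative response text, not real model output.
print(check_car('{"brand": "Toyota", "model": "Supra", "car_type": "Coupe"}'))
```

With a running server, the request would be sent as `client.chat.completions.create(model=model, **request_kwargs)`.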

!!! tip
    While not strictly necessary, normally it's better to indicate in the prompt the
    JSON schema and how the fields should be populated. This can improve the
    results notably in most cases.
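
One possible way to follow this tip is to append the schema to the prompt text. In the sketch below, `build_prompt` is a hypothetical helper and the instruction wording is only an example. The schema is a stand-in that could be replaced by `CarDescription.model_json_schema()`.

```python
import json


def build_prompt(task: str, schema: dict) -> str:
    """Append the JSON schema and a short instruction to the task description."""
    return (
        f"{task}\n\n"
        "Answer with a single JSON object that follows this JSON schema:\n"
        f"{json.dumps(schema, indent=2)}"
    )


# Stand-in schema; with the Pydantic example above you could pass
# `CarDescription.model_json_schema()` instead.
schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "model": {"type": "string"},
        "car_type": {"type": "string", "enum": ["sedan", "SUV", "Truck", "Coupe"]},
    },
    "required": ["brand", "model", "car_type"],
}

prompt = build_prompt(
    "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
    schema,
)
print(prompt)
```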

Finally, we have the `grammar` option, which is probably the most
difficult to use but is also very powerful. It allows us to define complete
languages such as SQL queries. It works by using a context-free EBNF grammar.
As an example, we can use it to define a specific format of simplified SQL queries:

??? code

    ```python
    simplified_sql_grammar = """
        root ::= select_statement

        select_statement ::= "SELECT " column " from " table " where " condition

        column ::= "col_1 " | "col_2 "

        table ::= "table_1 " | "table_2 "

        condition ::= column "= " number

        number ::= "1 " | "2 "
    """

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
            }
        ],
        extra_body={"structured_outputs": {"grammar": simplified_sql_grammar}},
    )
    print(completion.choices[0].message.content)
    ```
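
Because this grammar has no recursion, the language it describes is finite and can be enumerated by hand. The sketch below does so in plain Python, independently of vLLM, to show which outputs the grammar permits.

```python
from itertools import product

# Terminals copied from the grammar above (including their trailing spaces).
columns = ["col_1 ", "col_2 "]
tables = ["table_1 ", "table_2 "]
numbers = ["1 ", "2 "]

# select_statement ::= "SELECT " column " from " table " where " condition
# condition        ::= column "= " number
statements = [
    "SELECT " + column + " from " + table + " where " + cond_column + "= " + number
    for column, table, cond_column, number in product(columns, tables, columns, numbers)
]

print(len(statements))  # 16
for statement in statements[:3]:
    print(repr(statement))
```

Every generated string uses only `col_1`, `col_2`, `table_1` and `table_2`, so the column and table names mentioned in the prompt (`username`, `email`, `users`) cannot appear in the constrained output. The doubled spaces come from literals such as `"col_1 "` being followed by `" from "`.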

See also: [full example](../examples/online_serving/structured_outputs.md)

## Reasoning Outputs

You can also use structured outputs together with [reasoning outputs](reasoning_outputs.md) for reasoning models.

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r1
```

Note that you can use reasoning with any of the structured outputs features described above. The following example uses a JSON schema:

??? code

    ```python
    from pydantic import BaseModel


    class People(BaseModel):
        name: str
        age: int


    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the name and age of one random person.",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "people",
                "schema": People.model_json_schema()
            }
        },
    )
    print("reasoning: ", completion.choices[0].message.reasoning)
    print("content: ", completion.choices[0].message.content)
    ```
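
When post-processing such a response, the reasoning text and the structured answer are handled separately. The sketch below uses a stand-in object instead of a real `completion.choices[0].message`, and `unpack_message` is a hypothetical helper. It assumes that the `reasoning` field may be absent or `None`, and that `content` holds only the JSON answer.

```python
import json
from types import SimpleNamespace
from typing import Optional


def unpack_message(message) -> tuple[Optional[str], dict]:
    """Return (reasoning, parsed_content) from a chat completion message."""
    # Assumption: `reasoning` may be absent or None, so read it defensively.
    reasoning = getattr(message, "reasoning", None)
    # With structured outputs, `content` is expected to hold only the JSON answer.
    parsed = json.loads(message.content)
    return reasoning, parsed


# Stand-in for `completion.choices[0].message`; the values are illustrative only.
fake_message = SimpleNamespace(
    reasoning="The user wants a random person. I will pick a name and an age.",
    content='{"name": "Ada", "age": 36}',
)
reasoning, person = unpack_message(fake_message)
print("reasoning:", reasoning)
print("content:", person)
```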

See also: [full example](../examples/online_serving/structured_outputs.md)

!!! note
    When using Qwen3 Coder models with reasoning enabled, structured outputs might become disabled if the reasoning content does not get parsed into the `reasoning` field separately (v0.11.2+).
    To use both features together, you must explicitly enable structured outputs in reasoning mode.
    To do so, add the following flag when starting the vLLM server: `--structured-outputs-config.enable_in_reasoning=True`.
    See also: [Reasoning Outputs](reasoning_outputs.md) documentation.

## Experimental Automatic Parsing (OpenAI API)

This section covers the OpenAI beta wrapper over the `client.chat.completions.create()` method that provides richer integrations with Python-specific types.

At the time of writing (`openai==1.54.4`), this is a "beta" feature in the OpenAI client library. Code reference can be found [here](https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104).

For the following examples, vLLM was set up using `vllm serve meta-llama/Llama-3.1-8B-Instruct`

Here is a simple example demonstrating how to get structured output using Pydantic models:

??? code

    ```python
    from pydantic import BaseModel
    from openai import OpenAI

    class Info(BaseModel):
        name: str
        age: int

    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
    model = client.models.list().data[0].id
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
        ],
        response_format=Info,
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    print("Name:", message.parsed.name)
    print("Age:", message.parsed.age)
    ```

```console
ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
Name: Cameron
Age: 28
```

Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:

??? code

    ```python
    from pydantic import BaseModel
    from openai import OpenAI

    class Step(BaseModel):
        explanation: str
        output: str

    class MathResponse(BaseModel):
        steps: list[Step]
        final_answer: str

    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful expert math tutor."},
            {"role": "user", "content": "Solve 8x + 31 = 2."},
        ],
        response_format=MathResponse,
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    for i, step in enumerate(message.parsed.steps):
        print(f"Step #{i}:", step)
    print("Answer:", message.parsed.final_answer)
    ```

Output:

```console
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
Answer: x = -29/8
```

An example of using `structural_tag` can be found here: [examples/online_serving/structured_outputs](../../examples/online_serving/structured_outputs)

## Offline Inference

Offline inference supports the same types of structured outputs.
To use it, configure structured outputs with the `StructuredOutputsParams` class inside `SamplingParams`.
The main options available in `StructuredOutputsParams` are:

- `json`
- `regex`
- `choice`
- `grammar`
- `structural_tag`

These parameters can be used in the same way as the parameters from the Online
Serving examples above. One example for the usage of the `choice` parameter is
shown below:

??? code

    ```python
    from vllm import LLM, SamplingParams
    from vllm.sampling_params import StructuredOutputsParams

    llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

    structured_outputs_params = StructuredOutputsParams(choice=["Positive", "Negative"])
    sampling_params = SamplingParams(structured_outputs=structured_outputs_params)
    outputs = llm.generate(
        prompts="Classify this sentiment: vLLM is wonderful!",
        sampling_params=sampling_params,
    )
    print(outputs[0].outputs[0].text)
    ```
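
To reuse a request from the Online Serving section offline, the `structured_outputs` dictionary can be separated from the other request options. The sketch below is one possible approach: `split_extra_body` is a hypothetical helper, and it assumes that every remaining key (here only `stop`) is a valid `SamplingParams` argument. The vLLM import is guarded so that the snippet also runs where vLLM is not installed.

```python
def split_extra_body(extra_body: dict) -> tuple[dict, dict]:
    """Hypothetical helper: separate structured-output options from other sampling options."""
    structured_kwargs = dict(extra_body.get("structured_outputs", {}))
    sampling_kwargs = {
        key: value for key, value in extra_body.items() if key != "structured_outputs"
    }
    return structured_kwargs, sampling_kwargs


# The `extra_body` from the regex example in the Online Serving section.
extra_body = {"structured_outputs": {"regex": r"\w+@\w+\.com\n"}, "stop": ["\n"]}
structured_kwargs, sampling_kwargs = split_extra_body(extra_body)
print(structured_kwargs)
print(sampling_kwargs)

try:
    from vllm import SamplingParams
    from vllm.sampling_params import StructuredOutputsParams
except ImportError:
    SamplingParams = None  # vLLM is not installed in this environment.

if SamplingParams is not None:
    # Assumption: all keys in `sampling_kwargs` are valid SamplingParams arguments.
    sampling_params = SamplingParams(
        structured_outputs=StructuredOutputsParams(**structured_kwargs),
        **sampling_kwargs,
    )
```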

See also: [full example](../examples/online_serving/structured_outputs.md)