@@ -10,7 +10,9 @@ Reasoning models return a additional `reasoning_content` field in their outputs,
...
@@ -10,7 +10,9 @@ Reasoning models return a additional `reasoning_content` field in their outputs,
vLLM currently supports the following reasoning models:
vLLM currently supports the following reasoning models:
-[DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d)(`deepseek_r1`, which looks for `<think> ... </think>`)
| Model Series | Parser Name | Structured Output Support |
@@ -78,11 +80,51 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
...
@@ -78,11 +80,51 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
Please note that it is not compatible with the OpenAI Python client library. You can use the `requests` library to make streaming requests. You could checkout the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py).
Please note that it is not compatible with the OpenAI Python client library. You can use the `requests` library to make streaming requests. You could checkout the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py).
## Structured output
The reasoning content is also available in the structured output. The structured output engine like `xgrammar` will use the reasoning content to generate structured output.
```python
fromopenaiimportOpenAI
frompydanticimportBaseModel
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key="EMPTY"
openai_api_base="http://localhost:8000/v1"
client=OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models=client.models.list()
model=models.data[0].id
classPeople(BaseModel):
name:str
age:int
json_schema=People.model_json_schema()
prompt=("Generate a JSON with the name and age of one random person.")
The structured output engine like xgrammar will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
The structured output engine like `xgrammar` will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
Finally, you can enable reasoning for the model by using the `--enable-reasoning` and `--reasoning-parser` flags.
Finally, you can enable reasoning for the model by using the `--enable-reasoning` and `--reasoning-parser` flags.