Unverified commit 49314869, authored by WeiQing Chen, committed by GitHub

[Doc] Added warning of speculating with draft model (#22047)


Signed-off-by: Dilute-l <dilu2333@163.com>
Co-authored-by: Dilute-l <dilu2333@163.com>
parent 0f81b310
@@ -15,6 +15,10 @@ Speculative decoding is a technique which improves inter-token latency in memory
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
!!! warning
    In vllm v0.10.0, speculative decoding with a draft model is not supported.
    If you use the following code, you will get a `NotImplementedError`.
??? code
```python
......