@@ -27,7 +27,7 @@ Easy, fast, and cheap LLM serving for everyone
...
-[2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.
@@ -58,11 +58,10 @@ Next, you need to rewrite the :code:`forward` methods of your model by following
...
+    positions: torch.Tensor,
+    kv_caches: List[KVCache],
+    input_metadata: InputMetadata,
+    cache_events: Optional[List[torch.cuda.Event]],
+) -> Optional[SamplerOutput]:
3. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
4. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture (see the sketch after the note below).
.. note::
    Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
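
For illustration, here is a minimal sketch of an attention module rewritten along these lines. It is not code from the vLLM source: the module name :code:`MyAttention`, its projection layers, and the exact import paths, constructor arguments, and call signature of :code:`PagedAttentionWithRoPE` are assumptions that may differ across vLLM versions.

.. code-block:: python

    from typing import Optional, Tuple

    import torch
    import torch.nn as nn

    # Assumed import paths; check the vLLM version you are targeting.
    from vllm.model_executor.input_metadata import InputMetadata
    from vllm.model_executor.layers.attention import PagedAttentionWithRoPE

    KVCache = Tuple[torch.Tensor, torch.Tensor]  # (key_cache, value_cache) for one layer


    class MyAttention(nn.Module):
        """Hypothetical attention module rewritten for vLLM (step 4)."""

        def __init__(self, hidden_size: int, num_heads: int) -> None:
            super().__init__()
            head_dim = hidden_size // num_heads
            self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
            self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            # The original attention (and its rotary embedding) is replaced by
            # PagedAttentionWithRoPE, which reads and writes the paged KV cache.
            # Arguments (num_heads, head_size, scale, rotary_dim) follow the early
            # vLLM interface and may differ in newer releases.
            self.attn = PagedAttentionWithRoPE(
                num_heads, head_dim, head_dim**-0.5, rotary_dim=head_dim)

        def forward(
            self,
            positions: torch.Tensor,      # flattened positions, shape [num_tokens]
            hidden_states: torch.Tensor,  # flattened, shape [num_tokens, hidden_size]
            kv_cache: KVCache,            # this layer's (key, value) cache
            input_metadata: InputMetadata,
            cache_event: Optional[torch.cuda.Event],
        ) -> torch.Tensor:
            # Inputs are flattened over all sequences (step 3); there is no batch
            # dimension, and the per-sequence bookkeeping lives in input_metadata.
            q, k, v = self.qkv_proj(hidden_states).chunk(3, dim=-1)
            k_cache, v_cache = kv_cache
            attn_output = self.attn(positions, q, k, v, k_cache, v_cache,
                                    input_metadata, cache_event)
            return self.o_proj(attn_output)

A decoder layer would then invoke this module once per forward pass, passing its own entry from :code:`kv_caches` and the matching optional entry from :code:`cache_events`.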