Unverified Commit 1af090b5 authored by Zhuohan Li, committed by GitHub

Bump up version to v0.3.0 (#2656)

parent 3dad9444
@@ -46,7 +46,7 @@ vLLM is fast with:
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629), FP8 KV Cache
 - Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -57,6 +57,8 @@ vLLM is flexible and easy to use with:
 - Streaming outputs
 - OpenAI-compatible API server
 - Support NVIDIA GPUs and AMD GPUs
+- (Experimental) Prefix caching support
+- (Experimental) Multi-lora support

 vLLM seamlessly supports many Hugging Face models, including the following architectures:
......
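The README hunk above advertises the new FP8 KV cache option alongside the existing GPTQ/AWQ/SqueezeLLM quantization paths. As a rough illustration of how these knobs are reached from the Python API, here is a minimal sketch; the checkpoint name is a placeholder, and the `quantization` / `kv_cache_dtype` arguments are assumptions about the engine arguments shipped in this release rather than text taken from the commit.

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint; "fp8_e5m2" asks the engine to keep the
# KV cache in FP8. Both keyword arguments are assumptions about this
# release's API, and the model name is only a placeholder.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",
    kv_cache_dtype="fp8_e5m2",
)

outputs = llm.generate(
    ["vLLM is"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```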
@@ -31,7 +31,7 @@ vLLM is fast with:
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
 * Fast model execution with CUDA/HIP graph
-* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
 * Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -42,6 +42,8 @@ vLLM is flexible and easy to use with:
 * Streaming outputs
 * OpenAI-compatible API server
 * Support NVIDIA GPUs and AMD GPUs
+* (Experimental) Prefix caching support
+* (Experimental) Multi-lora support

 For more information, check out the following:
......
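Both documentation hunks also flag the experimental prefix caching and multi-LoRA support. Below is a hedged sketch of how the multi-LoRA path might be exercised, assuming the `LoRARequest` helper under `vllm.lora.request` and the `enable_lora` engine argument; the base model, adapter name, and adapter path are placeholders, not values from this commit.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest  # assumed location of the experimental helper

# The base model is loaded once; adapters are attached per request.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Translate to SQL: how many users signed up today?"],
    SamplingParams(max_tokens=64),
    # (adapter name, integer id, local path) -- all placeholders
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora"),
)
print(outputs[0].outputs[0].text)
```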
@@ -8,7 +8,7 @@ from vllm.entrypoints.llm import LLM
 from vllm.outputs import CompletionOutput, RequestOutput
 from vllm.sampling_params import SamplingParams

-__version__ = "0.2.7"
+__version__ = "0.3.0"

 __all__ = [
     "LLM",
......
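The final hunk bumps `__version__` in the package root from 0.2.7 to 0.3.0; the symbols visible in the hunk (`LLM`, `SamplingParams`, `CompletionOutput`, `RequestOutput`) are the ones used in the basic offline-inference flow. A small post-upgrade sanity check might look like the following, with the model name chosen only as a small placeholder:

```python
import vllm
from vllm import LLM, SamplingParams

# Version string set by this commit.
assert vllm.__version__ == "0.3.0"

llm = LLM(model="facebook/opt-125m")  # small placeholder model
for out in llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```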