<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="./docs/source/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="./docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/"><b>Documentation</b></a> | <a href=""><b>Blog</b></a> |

</p>

---

*Latest News* 🔥

- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post]().

---

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Dynamic batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more (see the sketch after this list)
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
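
As a quick illustration, the snippet below runs parallel sampling and beam search through the same `generate` call. It is a minimal sketch rather than the canonical quickstart: the model name is an arbitrary example, and field names such as `use_beam_search` may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Load any supported HuggingFace model; facebook/opt-125m is an arbitrary example.
# Tensor parallelism for distributed inference is exposed here as well
# (e.g. tensor_parallel_size=2 to shard the model across two GPUs).
llm = LLM(model="facebook/opt-125m")

# Parallel sampling: draw 3 independent completions per prompt.
parallel = SamplingParams(n=3, temperature=0.8, max_tokens=32)

# Beam search: keep the 4 highest-scoring beams (beam search implies
# greedy scoring, i.e. temperature=0).
beam = SamplingParams(n=4, use_beam_search=True, temperature=0.0, max_tokens=32)

for params in (parallel, beam):
    for request_output in llm.generate(["The capital of France is"], params):
        for completion in request_output.outputs:
            print(completion.text)
```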

Install vLLM with pip or [from source](https://llm-serving-cacheflow.readthedocs-hosted.com/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```

## Getting Started

Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html)
- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)
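
The OpenAI-compatible API server can be exercised with any standard HTTP client once it is running. The snippet below is a hypothetical sketch: it assumes a server already started locally (for example via `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m`; see the Quickstart for the exact command) and listening on the default port 8000.

```python
# Hypothetical sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server was started beforehand and serves facebook/opt-125m.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # must match the served model
        "prompt": "San Francisco is a",
        "max_tokens": 16,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])
```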

## Performance

vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x in terms of throughput.
For details, check out our [blog post]().

<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="./docs/source/assets/figures/perf_a10g_n1_dark.png">
  <img src="./docs/source/assets/figures/perf_a10g_n1_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="./docs/source/assets/figures/perf_a100_n1_dark.png">
  <img src="./docs/source/assets/figures/perf_a100_n1_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 1 output completion. </em>
</p>

<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="./docs/source/assets/figures/perf_a10g_n3_dark.png">
  <img src="./docs/source/assets/figures/perf_a10g_n3_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="./docs/source/assets/figures/perf_a100_n3_dark.png">
  <img src="./docs/source/assets/figures/perf_a100_n3_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 3 output completions. </em>
</p>

## Contributing

We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.