<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://github.com/vllm-project/vllm/discussions"><b>Discussions</b></a> |

</p>

---

*Latest News* 🔥

- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Dynamic batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more (see the sketch after this list)
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
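
The decoding and parallelism options above are exposed through vLLM's Python API. As a minimal sketch (the prompts, model name, and sampling settings below are illustrative placeholders, not recommendations; see the [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) for the authoritative walkthrough), offline batched generation with parallel sampling looks roughly like this:

```python
from vllm import LLM, SamplingParams

# Illustrative prompts; any batch of strings works.
prompts = ["Hello, my name is", "The future of AI is"]

# n=2 requests two parallel samples per prompt; beam search and other
# decoding algorithms are configured through SamplingParams as well.
sampling_params = SamplingParams(n=2, temperature=0.8, top_p=0.95, max_tokens=64)

# The model is pulled from the HuggingFace Hub; setting tensor_parallel_size > 1
# would shard it across multiple GPUs for distributed inference.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    for candidate in output.outputs:
        print(f"  Generated: {candidate.text!r}")
```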

vLLM seamlessly supports many HuggingFace models, including the following architectures:

- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- LLaMA (`lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)

Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```

## Getting Started

Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
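
Following the pointers above, one common path is to stand up the OpenAI-compatible API server and query it over HTTP. A rough sketch is shown below (the model name, port, and request body are illustrative, and exact flags and defaults may vary by version, so consult the Quickstart):

```bash
# Launch the OpenAI-compatible server with an illustrative model.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m

# In another shell, send an OpenAI-style completion request
# (the server listens on port 8000 by default).
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "San Francisco is a", "max_tokens": 32}'
```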

## Performance

In terms of throughput, vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x.
For details, check out our [blog post](https://vllm.ai).

<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 1 output completion. </em>
</p>

<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 3 output completions. </em>
</p>

## Contributing

We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.