"test/vscode:/vscode.git/clone" did not exist on "78eaf2b80d39277a59ff600573949740439259d3"
README.md 6.12 KB
Newer Older
Zhuohan Li's avatar
Zhuohan Li committed
1
2
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://github.com/vllm-project/vllm/discussions"><b>Discussions</b></a> |
</p>
---
*Latest News* 🔥
- [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv!
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
- [2023/06] Serving vLLM on any cloud with SkyPilot. Check out the one-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start a vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development in the cloud.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Optimized CUDA kernels
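The PagedAttention idea above can be illustrated with a small, purely hypothetical sketch (not vLLM's actual implementation): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical blocks to physical ones, so memory is allocated on demand rather than reserved up front.

```python
# Toy sketch of PagedAttention-style KV-cache paging (illustration only,
# not vLLM's real code). A global pool hands out fixed-size physical
# blocks; each sequence maps logical block indices to physical ones.
BLOCK_SIZE = 4  # tokens stored per KV block

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block):
        self.free.append(block)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def free_blocks(self):
        for block in self.block_table:
            self.pool.release(block)
        self.block_table.clear()

pool = BlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(6):  # 6 tokens -> ceil(6 / 4) = 2 blocks
    seq.append_token()
print(len(seq.block_table))  # 2
print(len(pool.free))        # 6
```

Because blocks are freed back to the shared pool the moment a request finishes, many more concurrent sequences fit in the same GPU memory than with contiguous per-request allocation.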
vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
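To make the decoding algorithms above concrete, here is a minimal, self-contained beam search over a made-up next-token log-probability table (an illustration of the algorithm only; vLLM runs these decoding strategies on real models):

```python
# Toy beam search over a hypothetical next-token scorer (illustration
# only). At each step, every beam is extended by every candidate token,
# and only the `beam_width` highest-scoring sequences are kept.
import math

# A made-up table: log-probability of the next token given the previous one.
LOGPROBS = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a":   {"a": math.log(0.1), "b": math.log(0.9)},
    "b":   {"a": math.log(0.5), "b": math.log(0.5)},
}

def beam_search(steps, beam_width):
    beams = [(["<s>"], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in LOGPROBS[seq[-1]].items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_seq, best_score = beam_search(steps=2, beam_width=2)[0]
print(best_seq)  # ['<s>', 'a', 'b']
```

Parallel sampling, by contrast, draws several independent completions from the same prompt; vLLM's KV-cache sharing makes both strategies cheap because the prompt's cache is reused across beams and samples.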
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila (`BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan (`baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Chat`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
```bash
pip install vllm
```

## Getting Started

Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

## Performance

vLLM outperforms Hugging Face Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x, in terms of throughput.
For details, check out our [blog post](https://vllm.ai).
<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 1 output completion. </em>
</p>
<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 3 output completions. </em>
</p>
## Contributing
We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention}, 
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```