<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://github.com/vllm-project/vllm/discussions"><b>Discussions</b></a> |

</p>

---

*Latest News* 🔥
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
- [2023/06] Serving vLLM on any cloud with SkyPilot. Check out a 1-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start the vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development on the cloud.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
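
To give a feel for the OpenAI-compatible API server, here is a minimal sketch of querying a locally running vLLM server with the `openai` Python package (pre-1.0 interface). It assumes the server has already been launched separately, e.g. with `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m`; the model name, host, and port are examples only.

```python
# Minimal sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# and that the `openai` package (pre-1.0 interface) is installed.
import openai

openai.api_key = "EMPTY"                      # the local server does not check API keys
openai.api_base = "http://localhost:8000/v1"  # default vLLM server address

completion = openai.Completion.create(
    model="facebook/opt-125m",   # must match the model the server was launched with
    prompt="San Francisco is a",
    max_tokens=32,
    temperature=0.8,
)
print(completion.choices[0].text)
```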

vLLM seamlessly supports many Hugging Face models, including the following architectures:

- Aquila (`BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan (`baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Chat`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)

Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```
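
Once installed, a minimal offline-inference sketch looks like the following; the prompts and the `facebook/opt-125m` model are placeholders, and the Quickstart linked below covers the full API.

```python
# Minimal offline-inference sketch. The model and prompts are examples; any of the
# supported architectures listed above should work the same way.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # weights are downloaded from Hugging Face
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```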

## Getting Started

Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
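
For a slightly richer taste of the decoding algorithms and tensor parallelism mentioned above, the sketch below runs beam search on a model sharded across two GPUs. The argument names (`use_beam_search`, `best_of`, `tensor_parallel_size`) follow the documented API but may evolve, so treat the Quickstart as the authoritative reference.

```python
# Hedged sketch: beam search with 3 beams on a model sharded across 2 GPUs.
from vllm import LLM, SamplingParams

# Beam search in vLLM expects temperature=0; best_of sets the beam width and
# n controls how many of the finished beams are returned.
sampling_params = SamplingParams(
    n=3,
    best_of=3,
    use_beam_search=True,
    temperature=0.0,
    max_tokens=64,
)

# tensor_parallel_size=2 splits the model across 2 GPUs (requires 2 visible GPUs).
llm = LLM(model="lmsys/vicuna-13b-v1.3", tensor_parallel_size=2)

for output in llm.generate(["The key idea behind PagedAttention is"], sampling_params):
    for candidate in output.outputs:
        print(candidate.text)
```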

## Performance

vLLM outperforms Hugging Face Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x in terms of throughput.
For details, check out our [blog post](https://vllm.ai).

<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 1 output completion. </em>
</p>

<p align="center">
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_light.png" width="45%">
  </picture>
  <picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_dark.png">
  <img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_light.png" width="45%">
  </picture>
  <br>
  <em> Serving throughput when each request asks for 3 output completions. </em>
</p>

## Contributing

We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.