<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

---

The first vLLM meetup of 2025 is happening on Wednesday, January 22nd, with Google Cloud in San Francisco! We will talk about vLLM's performant V1 architecture, the Q1 roadmap, and Google Cloud's innovations around vLLM: networking, Cloud Run, Vertex, and TPU! [Register Now](https://lu.ma/zep56hui)

---

*Latest News* 🔥
- [2024/12] vLLM joins the [PyTorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
- [2024/10] We have just created a developer Slack ([slack.vllm.ai](https://slack.vllm.ai)) focused on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
- [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing).
- [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing).
- [2024/01] We hosted [the second vLLM meetup](https://lu.ma/ygxbpzhl) with IBM! Please find the meetup slides [here](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing).
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) with a16z! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
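
The idea behind PagedAttention, storing each request's key/value cache in fixed-size blocks that are allocated on demand and tracked through a per-request block table, can be illustrated with a toy sketch. This is not vLLM's implementation: `ToyBlockAllocator`, `ToySequence`, and the `BLOCK_SIZE` value are hypothetical names and numbers chosen only for illustration, and no attention computation is performed.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, an assumption)


@dataclass
class ToyBlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool (hypothetical helper)."""

    num_blocks: int
    free_blocks: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV-cache blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class ToySequence:
    """Tracks one request's logical-to-physical block mapping (hypothetical helper)."""

    allocator: ToyBlockAllocator
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self) -> None:
        # A new physical block is requested only when the previous one is full,
        # so at most BLOCK_SIZE - 1 slots are ever left unused per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return every block to the pool so other requests can reuse them.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = ToyBlockAllocator(num_blocks=8)
seq = ToySequence(allocator)
for _ in range(40):
    seq.append_token()

blocks_used = len(seq.block_table)
wasted_slots = blocks_used * BLOCK_SIZE - seq.num_tokens
```

In this sketch a 40-token sequence occupies three blocks instead of a buffer pre-sized for the maximum sequence length, and the blocks do not need to be contiguous in memory.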

**Performance benchmark**: We include a performance benchmark at the end of [our blog post](https://blog.vllm.ai/2024/09/05/perf-update.html). It compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [SGLang](https://github.com/sgl-project/sglang), and [LMDeploy](https://github.com/InternLM/lmdeploy)). The implementation is in the [nightly-benchmarks folder](.buildkite/nightly-benchmarks/), and you can [reproduce](https://github.com/vllm-project/vllm/issues/8176) this benchmark using our one-click runnable script.

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron
- Prefix caching support
- Multi-LoRA support
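
As a minimal sketch of what the OpenAI-compatible API server means in practice, the snippet below builds a completions request with the Python standard library only. It assumes a server has already been started (for example with `vllm serve <model>`) and is listening at `http://localhost:8000`; the base URL, the model name in the usage note, and the helper names `build_completion_request` and `complete` are illustrative assumptions, not part of vLLM.

```python
import json
import urllib.request

# Assumption: a vLLM OpenAI-compatible server is reachable at this address.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(model, prompt, max_tokens=32, temperature=0.0):
    """Build a POST request for the OpenAI-style /completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the first completion's text (needs a running server)."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("facebook/opt-125m", "San Francisco is a")` would return the generated continuation. Because the endpoint follows the OpenAI API shape, existing OpenAI client libraries can typically be pointed at the same base URL instead.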

vLLM seamlessly supports most popular open-source models on Hugging Face, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started

Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):

```bash
pip install vllm
```
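
Once installed, offline batch inference can look roughly like the sketch below. It uses the `LLM` and `SamplingParams` classes exported by the `vllm` package; the model name, the sampling values, and the `generate_texts` helper are placeholder assumptions, and actually running it requires supported hardware and downloads the model weights.

```python
PROMPTS = ["Hello, my name is", "The capital of France is"]


def generate_texts(prompts, model="facebook/opt-125m"):
    """Generate one completion per prompt with vLLM's offline engine."""
    # Imported lazily so this file can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    # Illustrative sampling settings (assumptions, tune for your use case).
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    llm = LLM(model=model)  # loads the model weights onto the device
    outputs = llm.generate(prompts, sampling_params)
    # Each result holds the prompt plus one or more generated candidates.
    return [output.outputs[0].text for output in outputs]
```

Calling `generate_texts(PROMPTS)` returns a list of generated strings in the same order as the prompts.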

Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation/index.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)

## Contributing

We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.

## Sponsors

vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!

<!-- Note: Please sort them in alphabetical order. -->
<!-- Note: Please keep these consistent with docs/source/community/sponsors.md -->
Cash Donations:
- a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund

Compute Resources:
- AMD
- Anyscale
- AWS
- Crusoe Cloud
- Databricks
- DeepInfra
- Google Cloud
- Lambda Lab
- Nebius
- Novita AI
- NVIDIA
- Replicate
- Roblox
- RunPod
- Trainy
- UC Berkeley
- UC San Diego

Slack Sponsor: Anyscale

We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

* For technical questions and feature requests, please use GitHub issues or discussions.
* For discussing with fellow users, please use Discord.
* For coordinating contributions and development, please use Slack.
* For security disclosures, please use GitHub's security advisory feature.
* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.

## Media Kit

* If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).