README.md 4.72 KB
Newer Older
1
<!-- markdownlint-disable MD001 MD041 -->
Zhuohan Li's avatar
Zhuohan Li committed
2
3
<p align="center">
  <picture>
4
5
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
Zhuohan Li's avatar
Zhuohan Li committed
6
7
  </picture>
</p>
Woosuk Kwon's avatar
Woosuk Kwon committed
8

Zhuohan Li's avatar
Zhuohan Li committed
9
10
11
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
Woosuk Kwon's avatar
Woosuk Kwon committed
12

Zhuohan Li's avatar
Zhuohan Li committed
13
<p align="center">
14
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
Zhuohan Li's avatar
Zhuohan Li committed
15
</p>
Woosuk Kwon's avatar
Woosuk Kwon committed
16

17
18
🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
19
20

---
21

22
## About
23

Woosuk Kwon's avatar
Woosuk Kwon committed
24
vLLM is a fast and easy-to-use library for LLM inference and serving.
25

26
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
27

Zhuohan Li's avatar
Zhuohan Li committed
28
vLLM is fast with:
29

Zhuohan Li's avatar
Zhuohan Li committed
30
- State-of-the-art serving throughput
31
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
32
- Continuous batching of incoming requests
33
- Fast model execution with CUDA/HIP graph
34
35
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
Simon Mo's avatar
Simon Mo committed
36
37
- Speculative decoding
- Chunked prefill
Zhuohan Li's avatar
Zhuohan Li committed
38
39
40

vLLM is flexible and easy to use with:

41
- Seamless integration with popular Hugging Face models
Zhuohan Li's avatar
Zhuohan Li committed
42
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
43
- Tensor, pipeline, data and expert parallelism support for distributed inference
44
45
- Streaming outputs
- OpenAI-compatible API server
46
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
Simon Mo's avatar
Simon Mo committed
47
- Prefix caching support
48
- Multi-LoRA support
49

50
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
51

52
- Transformer-like LLMs (e.g., Llama)
53
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
54
- Embedding Models (e.g., E5-Mistral)
55
56
57
58
59
- Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started
60

61
Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
Zhuohan Li's avatar
Zhuohan Li committed
62
63
64
65
66

```bash
pip install vllm
```

67
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
68

69
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
70
71
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
Zhuohan Li's avatar
Zhuohan Li committed
72

73
## Contributing
74

75
We welcome and value any contributions and collaborations.
Reid's avatar
Reid committed
76
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
Woosuk Kwon's avatar
Woosuk Kwon committed
77
78
79
80

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
81

Woosuk Kwon's avatar
Woosuk Kwon committed
82
83
```bibtex
@inproceedings{kwon2023efficient,
84
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
Woosuk Kwon's avatar
Woosuk Kwon committed
85
86
87
88
89
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```
90
91
92

## Contact Us

93
<!-- --8<-- [start:contact-us] -->
94
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
95
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
96
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
97
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
98
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
99
<!-- --8<-- [end:contact-us] -->
Simon Mo's avatar
Simon Mo committed
100
101
102

## Media Kit

103
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)