<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
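
To make the idea behind paged key/value memory more concrete, here is a minimal toy sketch of a block-table allocator. It is not vLLM's implementation: the class `PagedKVCache`, its methods, and the block sizes are hypothetical names chosen for illustration, and it only tracks which fixed-size blocks each sequence owns rather than storing any tensors.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block-table allocator (illustrative only, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Pool of free physical block ids; any free block can serve any sequence.
        self.free_blocks: List[int] = list(range(num_blocks))
        # Per-sequence block table: logical block index -> physical block id.
        self.block_tables: Dict[str, List[int]] = {}
        # Number of tokens stored so far for each sequence.
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one new token and return its physical block id."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count == len(table) * self.block_size:
            # Every allocated block is full (or none exists yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.num_tokens[seq_id] = count + 1
        return table[count // self.block_size]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")  # three tokens need two blocks of size 2
cache.append_token("b")      # "b" takes the next free block
print(cache.block_tables)    # {'a': [0, 1], 'b': [2]}
cache.free("a")
print(cache.free_blocks)     # [3, 0, 1]
```

Because blocks are allocated on demand and returned as soon as a sequence finishes, memory is not reserved up front for the maximum sequence length, which is the intuition this sketch tries to convey.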

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPUs, plus hardware plugins such as Intel Gaudi, IBM Spyre, and Huawei Ascend
- Prefix caching support
- Multi-LoRA support
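
As a small illustration of the OpenAI-compatible API server, the sketch below builds a completions request using only the Python standard library. It assumes a vLLM server is already running and reachable at `http://localhost:8000`; the helper `build_completion_request` and the model name `facebook/opt-125m` are placeholders for illustration, not values taken from this README.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000", max_tokens=32):
    """Build (but do not send) a POST request in the OpenAI completions format."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The base URL and model name are assumptions; adjust them to your deployment.
request = build_completion_request("San Francisco is a", model="facebook/opt-125m")
print(request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Since the server follows the OpenAI request format, existing OpenAI client libraries can usually be pointed at the same base URL instead of hand-building requests.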

vLLM seamlessly supports most popular open-source models on Hugging Face, including:

- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started

Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):

```bash
pip install vllm
```
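
Once installed, offline inference can be sketched roughly as follows. The snippet wraps vLLM's `LLM` and `SamplingParams` classes in two helpers; the helper names `generate_texts` and `format_results`, the model `facebook/opt-125m`, and the sampling values are illustrative assumptions. Treat the Quickstart as the authoritative reference.

```python
from typing import List, Tuple


def generate_texts(
    prompts: List[str],
    model: str = "facebook/opt-125m",  # placeholder model, an assumption
    max_tokens: int = 32,
) -> List[Tuple[str, str]]:
    """Run offline batched generation and return (prompt, completion) pairs."""
    # Imported lazily so the helpers can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=max_tokens)
    llm = LLM(model=model)
    outputs = llm.generate(prompts, sampling_params)
    # Each result carries the original prompt and one or more completions.
    return [(output.prompt, output.outputs[0].text) for output in outputs]


def format_results(results: List[Tuple[str, str]]) -> str:
    """Render (prompt, completion) pairs one per line for quick inspection."""
    return "\n".join(f"{prompt!r} -> {text!r}" for prompt, text in results)
```

On a machine with vLLM and supported hardware, `print(format_results(generate_texts(["Hello, my name is"])))` would load the model and print the sampled completion; the first call also downloads the model weights.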

Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.

- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)

## Contributing

We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)