README.md 5.56 KB
Newer Older
1
<!-- markdownlint-disable MD001 MD041 -->
Zhuohan Li's avatar
Zhuohan Li committed
2
3
<p align="center">
  <picture>
4
5
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
Zhuohan Li's avatar
Zhuohan Li committed
6
7
  </picture>
</p>
Woosuk Kwon's avatar
Woosuk Kwon committed
8

Zhuohan Li's avatar
Zhuohan Li committed
9
10
11
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
Woosuk Kwon's avatar
Woosuk Kwon committed
12

Zhuohan Li's avatar
Zhuohan Li committed
13
<p align="center">
14
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
Zhuohan Li's avatar
Zhuohan Li committed
15
</p>
Woosuk Kwon's avatar
Woosuk Kwon committed
16

17
🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
18
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
19
20

---
21

22
## About
23

Woosuk Kwon's avatar
Woosuk Kwon committed
24
vLLM is a fast and easy-to-use library for LLM inference and serving.
25

Michael Goin's avatar
Michael Goin committed
26
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has grown into one of the most active open-source AI projects built and maintained by a diverse community of many dozens of academic institutions and companies from over 2000 contributors.
27

Zhuohan Li's avatar
Zhuohan Li committed
28
vLLM is fast with:
29

Zhuohan Li's avatar
Zhuohan Li committed
30
- State-of-the-art serving throughput
31
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
Michael Goin's avatar
Michael Goin committed
32
33
34
35
36
37
38
39
- Continuous batching of incoming requests, chunked prefill, prefix caching
- Fast and flexible model execution with piecewise and full CUDA/HIP graphs
- Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and [more](https://docs.vllm.ai/en/latest/features/quantization/index.html)
- Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton
- Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL
- Speculative decoding including n-gram, suffix, EAGLE, DFlash
- Automatic kernel generation and graph-level transformations using torch.compile
- Disaggregated prefill, decode, and encode
Zhuohan Li's avatar
Zhuohan Li committed
40
41
42

vLLM is flexible and easy to use with:

43
- Seamless integration with popular Hugging Face models
Zhuohan Li's avatar
Zhuohan Li committed
44
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
Michael Goin's avatar
Michael Goin committed
45
- Tensor, pipeline, data, expert, and context parallelism for distributed inference
46
- Streaming outputs
Michael Goin's avatar
Michael Goin committed
47
48
49
50
51
- Generation of structured outputs using xgrammar or guidance
- Tool calling and reasoning parsers
- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
- Efficient multi-LoRA support for dense and MoE layers
- Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.
52

53
vLLM seamlessly supports 200+ model architectures on Hugging Face, including:
54

Michael Goin's avatar
Michael Goin committed
55
56
57
58
59
60
- Decoder-only LLMs (e.g., Llama, Qwen, Gemma)
- Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)
- Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)
- Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)
- Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)
- Reward and classification models (e.g., Qwen-Math)
61
62
63
64

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started
65

Michael Goin's avatar
Michael Goin committed
66
Install vLLM with [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`:
Zhuohan Li's avatar
Zhuohan Li committed
67
68

```bash
Michael Goin's avatar
Michael Goin committed
69
uv pip install vllm
Zhuohan Li's avatar
Zhuohan Li committed
70
71
```

Michael Goin's avatar
Michael Goin committed
72
73
Or [build from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source) for development.

74
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
75

76
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
77
78
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
Zhuohan Li's avatar
Zhuohan Li committed
79

80
## Contributing
81

82
We welcome and value any contributions and collaborations.
Reid's avatar
Reid committed
83
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
Woosuk Kwon's avatar
Woosuk Kwon committed
84
85
86
87

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
88

Woosuk Kwon's avatar
Woosuk Kwon committed
89
90
```bibtex
@inproceedings{kwon2023efficient,
91
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
Woosuk Kwon's avatar
Woosuk Kwon committed
92
93
94
95
96
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```
97
98
99

## Contact Us

100
<!-- --8<-- [start:contact-us] -->
101
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
102
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
103
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
104
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
105
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
106
<!-- --8<-- [end:contact-us] -->
Simon Mo's avatar
Simon Mo committed
107
108
109

## Media Kit

110
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)