README.md 5.56 KB
Newer Older
1
<!-- markdownlint-disable MD001 MD041 -->
Zhuohan Li's avatar
Zhuohan Li committed
2
3
<p align="center">
  <picture>
4
5
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
Zhuohan Li's avatar
Zhuohan Li committed
6
7
  </picture>
</p>
Woosuk Kwon's avatar
Woosuk Kwon committed
8

Zhuohan Li's avatar
Zhuohan Li committed
9
10
11
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
Woosuk Kwon's avatar
Woosuk Kwon committed
12

Zhuohan Li's avatar
Zhuohan Li committed
13
<p align="center">
14
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
Zhuohan Li's avatar
Zhuohan Li committed
15
</p>
Woosuk Kwon's avatar
Woosuk Kwon committed
16

17
18
🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
19
20

---
21

22
## About
23

Woosuk Kwon's avatar
Woosuk Kwon committed
24
vLLM is a fast and easy-to-use library for LLM inference and serving.
25

Michael Goin's avatar
Michael Goin committed
26
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has grown into one of the most active open-source AI projects built and maintained by a diverse community of many dozens of academic institutions and companies from over 2000 contributors.
27

Zhuohan Li's avatar
Zhuohan Li committed
28
vLLM is fast with:
29

Zhuohan Li's avatar
Zhuohan Li committed
30
- State-of-the-art serving throughput
31
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
Michael Goin's avatar
Michael Goin committed
32
33
34
35
36
37
38
39
- Continuous batching of incoming requests, chunked prefill, prefix caching
- Fast and flexible model execution with piecewise and full CUDA/HIP graphs
- Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and [more](https://docs.vllm.ai/en/latest/features/quantization/index.html)
- Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton
- Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL
- Speculative decoding including n-gram, suffix, EAGLE, DFlash
- Automatic kernel generation and graph-level transformations using torch.compile
- Disaggregated prefill, decode, and encode
Zhuohan Li's avatar
Zhuohan Li committed
40
41
42

vLLM is flexible and easy to use with:

43
- Seamless integration with popular Hugging Face models
Zhuohan Li's avatar
Zhuohan Li committed
44
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
Michael Goin's avatar
Michael Goin committed
45
- Tensor, pipeline, data, expert, and context parallelism for distributed inference
46
- Streaming outputs
Michael Goin's avatar
Michael Goin committed
47
48
49
50
51
- Generation of structured outputs using xgrammar or guidance
- Tool calling and reasoning parsers
- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
- Efficient multi-LoRA support for dense and MoE layers
- Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.
52

Michael Goin's avatar
Michael Goin committed
53
vLLM seamlessly supports 200+ model architectures on HuggingFace, including:
54

Michael Goin's avatar
Michael Goin committed
55
56
57
58
59
60
- Decoder-only LLMs (e.g., Llama, Qwen, Gemma)
- Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)
- Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)
- Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)
- Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)
- Reward and classification models (e.g., Qwen-Math)
61
62
63
64

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started
65

Michael Goin's avatar
Michael Goin committed
66
Install vLLM with [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`:
Zhuohan Li's avatar
Zhuohan Li committed
67
68

```bash
Michael Goin's avatar
Michael Goin committed
69
uv pip install vllm
Zhuohan Li's avatar
Zhuohan Li committed
70
71
```

Michael Goin's avatar
Michael Goin committed
72
73
Or [build from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source) for development.

74
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
75

76
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
77
78
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
Zhuohan Li's avatar
Zhuohan Li committed
79

80
## Contributing
81

82
We welcome and value any contributions and collaborations.
Reid's avatar
Reid committed
83
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
Woosuk Kwon's avatar
Woosuk Kwon committed
84
85
86
87

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
88

Woosuk Kwon's avatar
Woosuk Kwon committed
89
90
```bibtex
@inproceedings{kwon2023efficient,
91
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
Woosuk Kwon's avatar
Woosuk Kwon committed
92
93
94
95
96
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```
97
98
99

## Contact Us

100
<!-- --8<-- [start:contact-us] -->
101
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
102
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
103
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
104
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
105
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
106
<!-- --8<-- [end:contact-us] -->
Simon Mo's avatar
Simon Mo committed
107
108
109

## Media Kit

110
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)