<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

This repository adds KV cache pruning, a model compression feature, on top of the official vLLM.

Supported pruning methods:

- [**SNAPKV**](https://arxiv.org/pdf/2404.14469)
- [**COMPACTOR**](https://arxiv.org/pdf/2507.08143)
- [**CRITICALADAKV**](https://arxiv.org/pdf/2502.03805)


vLLM seamlessly supports most popular open-source models on HuggingFace, including:

- Transformer-like LLMs (e.g., Qwen3/Llama)

## Environment

```bash
cd vllm
python use_existing_torch.py
# Then add torch to the `requires` list in pyproject.toml
export SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM="0.6.0"
pip install -e . --no-build-isolation -v -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/
```
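
After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch, not part of this repository: the helper name `installed_versions` is hypothetical, and the package list is an assumption based on the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names are assumptions taken from the install steps above.
    for name, ver in installed_versions(["vllm", "torch", "triton", "numpy"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

With the commands above, `numpy` would be expected to report `1.26.4`; the other versions depend on the local build.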

More related libraries:

- flash_attn-2.8.3+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- torchvision-0.24.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- triton-3.3.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl

This project is compatible with triton-3.1.0, triton-3.3.0, and triton-3.5.1. However, for triton-3.5.1, when the underlying environment uses clang 17 and LLVM 22.0, the following modifications are required due to triton's own compatibility issues:

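In `/usr/local/lib/python3.10/dist-packages/triton/backends/amd/compiler.py`, locate the `make_llir(src, metadata, options)` function within the `HIPBackend(BaseBackend)` class. Replace `return str(llvm_mod)` with the following (the snippet uses the `re` module, so make sure `import re` is present at the top of the file):
```python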
# compatibility fix for clang 17 + LLVM 22.0

llir = str(llvm_mod)
llir = re.sub(r"getelementptr inbounds\s+nuw\s+", "getelementptr inbounds ", llir)
llir = re.sub(r"getelementptr\s+nuw\s+", "getelementptr ", llir)
llir = re.sub(r"getelementptr inbounds\s+nusw\s+", "getelementptr inbounds ", llir)
llir = re.sub(r"getelementptr\s+nusw\s+", "getelementptr ", llir)
return llir
```
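
To see what these substitutions do, the same four patterns can be applied to a sample line of LLVM IR outside of Triton. This is a standalone sketch for illustration only: the function name `strip_gep_flags` and the sample IR lines are made up here and do not come from the Triton source.

```python
import re


def strip_gep_flags(llir: str) -> str:
    """Drop `nuw` / `nusw` flags from getelementptr instructions (same regexes as the patch)."""
    llir = re.sub(r"getelementptr inbounds\s+nuw\s+", "getelementptr inbounds ", llir)
    llir = re.sub(r"getelementptr\s+nuw\s+", "getelementptr ", llir)
    llir = re.sub(r"getelementptr inbounds\s+nusw\s+", "getelementptr inbounds ", llir)
    llir = re.sub(r"getelementptr\s+nusw\s+", "getelementptr ", llir)
    return llir


# Hypothetical IR line carrying a flag that the patch is meant to remove.
sample = "%p = getelementptr inbounds nuw i8, ptr %base, i64 4"
print(strip_gep_flags(sample))
# -> %p = getelementptr inbounds i8, ptr %base, i64 4
```

Lines without these flags pass through unchanged, so the patch only touches `getelementptr` instructions that carry them.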

## Quick Start
Basic Chat Generation with Compression:
```bash
python test.py --schedule pdtriton
```

`test.py`:

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# Usage: PYTHONPATH=/home/vllm-project/vllm python test.py --schedule pdtriton

from __future__ import annotations

import argparse
import os
from multiprocessing import freeze_support


def _apply_kvprune_attention_env(schedule: str | None) -> None:
    """Map CLI -> VLLM_KVPRUNE_ATTENTION_SCHEDULE (fa_triton | pdtriton | pdfa)."""
    if not schedule:
        return
    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = schedule


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--schedule",
        type=str,
        default="pdtriton",
        choices=("fa_triton", "pdtriton", "pdfa"),
        help=(
            "fa_triton=FA prefill + Triton decode; "
            "pdtriton=Triton prefill + Triton decode; "
            "pdfa=FA prefill + FA decode (page KV writing is Triton)"
        ),
    )
    args, _unknown = parser.parse_known_args()
    _apply_kvprune_attention_env(args.schedule)

    from transformers import AutoTokenizer

    from vllm import CompressionParams, LLM, SamplingParams

    model_id = "Qwen/Qwen3-8B"

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.05,
        max_tokens=512,
    )

    llm = LLM(
        model=model_id,
        tensor_parallel_size=4,
        max_model_len=8192,
        gpu_memory_utilization=0.85,
        kvprune_compression=True,  # set to False to run without KV cache pruning
    )

    prompt = (
        "Write a 200-word English prompt for a creative writing task. The prompt should be "
        "a single coherent paragraph without any bullet points, numbered lists, or markdown "
        "formatting. It should describe a specific scenario, character, or conflict, and end "
        "with a clear question that invites the writer to continue the story. Do not use any "
        "special symbols or line breaks. The tone can be mysterious, tense, or reflective. "
        "After the paragraph, include the question on the same line directly following the "
        "period, without hitting enter."
    )

    messages = [{"role": "user", "content": prompt}]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,  # Qwen3 chat-template switch for thinking mode
    )

    compression = [
        CompressionParams(
            compression_ratio=0.5,
            compression_method="snapkv",
        ),
    ]

    outputs = llm.generate(
        [text],
        sampling_params=sampling_params,
        compression=compression,
    )

    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")


if __name__ == "__main__":
    freeze_support()
    main()
```

Setting `kvprune_compression=True` enables pruning. It also disables the CUDA graph mode of the vLLM v1 engine, with the aim of reducing inference time and minimizing the engine's GPU memory usage.

If a DCU kernel error occurs, run `export HIP_LAUNCH_BLOCKING=1` before the test command to work around instability in the DCU Triton kernels. A long-term fix requires stability improvements on the Triton compiler side.
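
The three `--schedule` values accepted by `test.py` can be summarized as a small lookup table. The sketch below is illustrative only: `SCHEDULES` and `select_schedule` are hypothetical names, and the backend pairs are taken from the `--schedule` help text. Like `test.py`, it sets `VLLM_KVPRUNE_ATTENTION_SCHEDULE` before `vllm` would be imported; whether that ordering is strictly required is an assumption.

```python
from __future__ import annotations

import os

# Schedule name -> (prefill backend, decode backend), per the --schedule help text.
SCHEDULES = {
    "fa_triton": ("FA", "Triton"),
    "pdtriton": ("Triton", "Triton"),
    "pdfa": ("FA", "FA"),  # page KV writing still uses Triton
}


def select_schedule(name: str) -> tuple[str, str]:
    """Validate the schedule name, export it, and return its backend pair."""
    if name not in SCHEDULES:
        raise ValueError(f"unknown schedule {name!r}; expected one of {sorted(SCHEDULES)}")
    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = name
    return SCHEDULES[name]


prefill, decode = select_schedule("pdtriton")
print(f"prefill={prefill}, decode={decode}")
# -> prefill=Triton, decode=Triton
```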


To evaluate on the RULER datasets:

```bash
rm -rf ~/.triton ~/.cache/torch /tmp/triton* /tmp/torch*
export PYTHONPATH=/home/vllm-project/vllm
export HIP_LAUNCH_BLOCKING=1

python vllm/tests/kvprune/evaluate/eval_ruler.py \
  --tensor-parallel-size 4 \
  --dataset-parquet ruler/4096/test-00000-of-00001.parquet \
  --dataset-split train \
  --model Qwen/Qwen3-8B \
  --compression-method snapkv \
  --seq-compression-ratio 0.5 \
  --attention-schedule pdtriton
```
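
To compare attention schedules on the same dataset, the command above can be generated programmatically. This is a rough sketch: `build_ruler_cmd` is a hypothetical helper, and it assumes that `eval_ruler.py` accepts the same schedule names as `test.py` (`fa_triton`, `pdtriton`, `pdfa`), which the command above only confirms for `pdtriton`. It prints the commands rather than running them.

```python
from __future__ import annotations

import shlex


def build_ruler_cmd(schedule: str, ratio: float, method: str = "snapkv") -> list[str]:
    """Build the eval_ruler.py argument list shown above for one configuration."""
    return [
        "python", "vllm/tests/kvprune/evaluate/eval_ruler.py",
        "--tensor-parallel-size", "4",
        "--dataset-parquet", "ruler/4096/test-00000-of-00001.parquet",
        "--dataset-split", "train",
        "--model", "Qwen/Qwen3-8B",
        "--compression-method", method,
        "--seq-compression-ratio", str(ratio),
        "--attention-schedule", schedule,
    ]


# Schedule names other than pdtriton are an assumption for this script.
for schedule in ("fa_triton", "pdtriton", "pdfa"):
    print(shlex.join(build_ruler_cmd(schedule, 0.5)))
```

Each printed line can be pasted into a shell after the `export` lines above, or passed as a list to `subprocess.run`.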

## Contributing

We welcome and value any contributions and collaborations.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)