<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

This repository adds KV cache pruning (a model compression feature) on top of the official vLLM.

Supported pruning methods:

- [**SNAPKV**](https://arxiv.org/pdf/2404.14469)
- [**COMPACTOR**](https://arxiv.org/pdf/2507.08143)
- [**CRITICALADAKV**](https://arxiv.org/pdf/2502.03805)
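
For readers new to KV cache pruning, the selection step described in the SnapKV paper can be sketched as follows. This is a minimal, single-head illustration in plain Python, not the implementation used in this repository; the function name `snapkv_keep_indices` and its parameters are hypothetical. It assumes you already have the attention weights for one head as a square list of lists.

```python
def snapkv_keep_indices(attn, window, budget, kernel=3):
    """Choose which prompt positions to keep in the KV cache for one head.

    attn:   attention weights, attn[q][k] is the weight query q puts on key k.
    window: number of trailing prompt tokens used as the observation window.
    budget: number of prefix (pre-window) positions to retain.
    kernel: width of the max-pooling window applied to the votes.
    """
    seq_len = len(attn)
    prefix_len = seq_len - window
    if prefix_len <= budget:
        return list(range(seq_len))  # nothing to prune

    # 1. Vote: total attention each prefix key receives from the window queries.
    votes = [
        sum(attn[q][k] for q in range(prefix_len, seq_len))
        for k in range(prefix_len)
    ]

    # 2. Pool: max over neighbouring positions so kept tokens form clusters.
    half = kernel // 2
    pooled = [max(votes[max(0, k - half): k + half + 1]) for k in range(prefix_len)]

    # 3. Select the top-`budget` prefix positions; the window is always kept.
    top = sorted(range(prefix_len), key=lambda k: pooled[k], reverse=True)[:budget]
    return sorted(top) + list(range(prefix_len, seq_len))
```

The returned indices are the key/value positions that would stay in the cache; everything else in the prefix is dropped. A real implementation works on batched tensors across all heads and layers.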


vLLM seamlessly supports most popular open-source models on HuggingFace, including:

- Transformer-like LLMs (e.g., Qwen3/Llama)

## Env

```bash
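cd vllm
# Strip the pinned torch requirements so the build uses the torch already installed
python use_existing_torch.py
# Then add torch to the `requires` list in pyproject.toml
# Tell setuptools-scm which version to report instead of deriving it from git
export SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM="0.6.0"
# Editable install against the current environment (no isolated build env)
pip install -e . --no-build-isolation -v -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/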
```

Other related libraries (prebuilt wheels):

- `flash_attn-2.8.3+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl`
- `torchvision-0.24.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl`
- `triton-3.5.1+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl`
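
After installation, it can help to confirm which versions actually ended up in the environment, since the build relies on an existing torch and a pinned numpy. The snippet below is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list is simply the set of names mentioned in this README.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["vllm", "torch", "torchvision", "triton", "flash_attn", "numpy"]
    for name, ver in installed_versions(names).items():
        print(f"{name:12s} {ver or 'not installed'}")
```

If `numpy` does not report `1.26.4`, or one of the wheels above shows as not installed, rerun the corresponding `pip install` step.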

## Quick Start

Basic chat generation with compression:

```bash
python test.py --schedule pdtriton
```

`test.py`:

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# Usage: PYTHONPATH=/home/vllm-project/vllm python test.py --schedule pdtriton

from __future__ import annotations

import argparse
import os
import sys
from multiprocessing import freeze_support


def _apply_kvprune_attention_env(schedule: str | None) -> None:
    """Map CLI -> VLLM_KVPRUNE_ATTENTION_SCHEDULE (fa_triton | pdtriton | pdfa)."""
    if not schedule:
        return
    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = schedule


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--schedule",
        type=str,
        default="pdtriton",
        choices=("fa_triton", "pdtriton", "pdfa"),
        help=(
            "fa_triton=FA prefill + Triton decode; "
            "pdtriton=Triton prefill + Triton decode; "
            "pdfa=FA prefill + FA decode (page KV writing is Triton)"
        ),
    )
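    # Tolerate unrecognized arguments instead of exiting on them.
    args, _unknown = parser.parse_known_args()
    # Export the schedule before vllm is imported below.
    _apply_kvprune_attention_env(args.schedule)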

    from transformers import AutoTokenizer

    from vllm import CompressionParams, LLM, SamplingParams

    model_id = "Qwen/Qwen3-8B"

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.05,
        max_tokens=512,
    )

    llm = LLM(
        model=model_id,
        tensor_parallel_size=4,
        max_model_len=8192,
        gpu_memory_utilization=0.85,
        kvprune_compression=True,
    )

    prompt = (
        "Write a 200-word English prompt for a creative writing task. The prompt should be "
        "a single coherent paragraph without any bullet points, numbered lists, or markdown "
        "formatting. It should describe a specific scenario, character, or conflict, and end "
        "with a clear question that invites the writer to continue the story. Do not use any "
        "special symbols or line breaks. The tone can be mysterious, tense, or reflective. "
        "After the paragraph, include the question on the same line directly following the "
        "period, without hitting enter."
    )

    messages = [{"role": "user", "content": prompt}]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,  # Qwen3 chat-template switch; set to False to disable thinking mode
    )

    compression = [
        CompressionParams(
            compression_ratio=0.5,
            compression_method="snapkv",
        ),
    ]

    outputs = llm.generate(
        [text],
        sampling_params=sampling_params,
        compression=compression,
    )

    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")


if __name__ == "__main__":
    freeze_support()
    main()
```
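
To get a rough feel for what `compression_ratio=0.5` could save, the following back-of-the-envelope sketch estimates KV cache size before and after pruning. Several things here are assumptions rather than facts from this README: the Qwen3-8B shape values (36 layers, 8 KV heads, head dimension 128) should be checked against the model's `config.json`, the cache is assumed to be stored in 2-byte fp16/bf16, and `compression_ratio` is interpreted as the fraction of KV entries pruned. The helper names are hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes needed to store keys and values for `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per token, layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


def retained_tokens(num_tokens, compression_ratio):
    """Tokens left in the cache, ASSUMING the ratio is the fraction pruned."""
    return num_tokens - int(num_tokens * compression_ratio)


if __name__ == "__main__":
    # Assumed Qwen3-8B shape; verify against the model's config.json.
    shape = dict(num_layers=36, num_kv_heads=8, head_dim=128)
    full = kv_cache_bytes(8192, **shape)
    pruned = kv_cache_bytes(retained_tokens(8192, 0.5), **shape)
    print(f"full:   {full // 2**20} MiB")
    print(f"pruned: {pruned // 2**20} MiB")
```

Under these assumptions, a single 8192-token sequence needs 1152 MiB of KV cache unpruned and 576 MiB at a ratio of 0.5, summed over all tensor-parallel ranks. Actual savings depend on how the selected method and attention schedule apply pruning.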

## Contributing

We welcome and value any contributions and collaborations.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)