🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more. For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

KV cache pruning, a form of model compression, has been added to the official vLLM. vLLM can prune the KV cache with:

- [**SNAPKV**](https://arxiv.org/pdf/2404.14469)
- [**COMPACTOR**](https://arxiv.org/pdf/2507.08143)
- [**CRITICALADAKV**](https://arxiv.org/pdf/2502.03805)

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

- Transformer-like LLMs (e.g., Qwen3/Llama)

## Env

```bash
cd vllm
python use_existing_torch.py
# Then add torch to `requires` in pyproject.toml
export SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM="0.6.0"
pip install -e . --no-build-isolation -v -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/
```

More related libraries:

- flash_attn-2.8.3+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- torchvision-0.24.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- triton-3.3.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl

This project is compatible with triton-3.1.0, triton-3.3.0, and triton-3.5.1. However, for triton-3.5.1, when the underlying environment uses clang 17 and LLVM 22.0, the following modification is required due to Triton's own compatibility issues.

In `/usr/local/lib/python3.10/dist-packages/triton/backends/amd/compiler.py`, locate the `make_llir(src, metadata, options)` function within the `HIPBackend(BaseBackend)` class.
Replace `return str(llvm_mod)` with:

```python
# compatibility fix for clang 17 + LLVM 22.0
# Strips the `nuw` / `nusw` flags from getelementptr instructions.
# Uses the `re` module; add `import re` to compiler.py if it is not already imported.
llir = str(llvm_mod)
llir = re.sub(r"getelementptr inbounds\s+nuw\s+", "getelementptr inbounds ", llir)
llir = re.sub(r"getelementptr\s+nuw\s+", "getelementptr ", llir)
llir = re.sub(r"getelementptr inbounds\s+nusw\s+", "getelementptr inbounds ", llir)
llir = re.sub(r"getelementptr\s+nusw\s+", "getelementptr ", llir)
return llir
```

## Quick Start

Basic Chat Generation with Compression:

```bash
python test.py --schedule pdtriton
```

`test.py`:

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# PYTHONPATH=/home/vllm-project/vllm python test.py --schedule pdtriton
from __future__ import annotations

import argparse
import os
from multiprocessing import freeze_support


def _apply_kvprune_attention_env(schedule: str | None) -> None:
    """Map CLI -> VLLM_KVPRUNE_ATTENTION_SCHEDULE (fa_triton | pdtriton | pdfa)."""
    if not schedule:
        return
    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = schedule


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--schedule",
        type=str,
        default="pdtriton",
        choices=("fa_triton", "pdtriton", "pdfa"),
        help=(
            "fa_triton=FA prefill + Triton decode; "
            "pdtriton=Triton prefill + Triton decode; "
            "pdfa=FA prefill + FA decode (page KV writing is Triton)"
        ),
    )
    args, _unknown = parser.parse_known_args()
    _apply_kvprune_attention_env(args.schedule)

    # Imported here, after the schedule environment variable has been set.
    from transformers import AutoTokenizer

    from vllm import CompressionParams, LLM, SamplingParams

    model_id = "Qwen/Qwen3-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.05,
        max_tokens=512,
    )

    llm = LLM(
        model=model_id,
        tensor_parallel_size=4,
        max_model_len=8192,
        gpu_memory_utilization=0.85,
        kvprune_compression=True,  # True or False
    )

    prompt = (
        "Write a 200-word English prompt for a creative writing task. The prompt should be "
        "a single coherent paragraph without any bullet points, numbered lists, or markdown "
        "formatting. It should describe a specific scenario, character, or conflict, and end "
        "with a clear question that invites the writer to continue the story. Do not use any "
        "special symbols or line breaks. The tone can be mysterious, tense, or reflective. "
        "After the paragraph, include the question on the same line directly following the "
        "period, without hitting enter."
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,  # Qwen3 chat-template switch for thinking mode
    )

    # One CompressionParams entry for the single prompt passed to generate().
    compression = [
        CompressionParams(
            compression_ratio=0.5,
            compression_method="snapkv",
        ),
    ]

    outputs = llm.generate(
        [text],
        sampling_params=sampling_params,
        compression=compression,
    )
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")


if __name__ == "__main__":
    freeze_support()
    main()
```

Setting `kvprune_compression=True` enables pruning. It also disables the CUDA graph mode of the vLLM v1 engine, with the aim of reducing inference time and minimizing the engine's GPU memory usage.

If a DCU kernel error occurs, run `export HIP_LAUNCH_BLOCKING=1` before the test command to work around the instability of the DCU Triton kernel. In the long term, this stability needs to be fundamentally improved by Triton compilation engineers.

To test on the RULER datasets:

```bash
rm -rf ~/.triton ~/.cache/torch /tmp/triton* /tmp/torch*
export PYTHONPATH=/home/vllm-project/vllm
export HIP_LAUNCH_BLOCKING=1
python vllm/tests/kvprune/evaluate/eval_ruler.py \
    --tensor-parallel-size 4 \
    --dataset-parquet ruler/4096/test-00000-of-00001.parquet \
    --dataset-split train \
    --model Qwen/Qwen3-8B \
    --compression-method snapkv \
    --seq-compression-ratio 0.5 \
    --attention-schedule pdtriton
```

## Contributing

We welcome and value any contributions and collaborations.
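
If you plan to touch the Triton compatibility patch described in the Env section, it may help to check the regex rewrite in isolation before editing `compiler.py`. The following is a minimal sketch, not part of Triton or of this project: the helper name `strip_gep_flags` and the sample IR line are illustrative assumptions, while the four substitutions are copied from the patch.

```python
import re

# The four substitutions from the compatibility patch, applied in the same order.
_GEP_FLAG_PATTERNS = [
    (r"getelementptr inbounds\s+nuw\s+", "getelementptr inbounds "),
    (r"getelementptr\s+nuw\s+", "getelementptr "),
    (r"getelementptr inbounds\s+nusw\s+", "getelementptr inbounds "),
    (r"getelementptr\s+nusw\s+", "getelementptr "),
]


def strip_gep_flags(llir: str) -> str:
    """Hypothetical helper: drop `nuw` / `nusw` flags from getelementptr lines."""
    for pattern, replacement in _GEP_FLAG_PATTERNS:
        llir = re.sub(pattern, replacement, llir)
    return llir


if __name__ == "__main__":
    # Made-up IR line, used only to exercise the substitutions.
    sample = "%p = getelementptr inbounds nuw i8, ptr %base, i64 4"
    print(strip_gep_flags(sample))
```

Lines without either flag pass through unchanged. Note that a single pass of these four patterns leaves `nuw` in place when it directly follows `nusw` (for example `getelementptr nusw nuw`); whether such IR appears in practice is not stated here, so treat that as something to verify on your own toolchain.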

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)