<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built a vLLM website to help you get started with vLLM. Visit [vllm.ai](https://vllm.ai) to learn more.
For events, visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

This repository adds KV cache pruning, a model-compression feature, to the official vLLM.

Supported pruning methods:

- [**SNAPKV**](https://arxiv.org/pdf/2404.14469)
- [**COMPACTOR**](https://arxiv.org/pdf/2507.08143)
- [**CRITICALADAKV**](https://arxiv.org/pdf/2502.03805)
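
For intuition, the idea behind attention-based KV cache pruning can be sketched in plain Python. The following is a simplified, single-head illustration in the spirit of SnapKV: score each earlier prompt position by the attention it receives from the last few prompt tokens, smooth the scores with max pooling, and keep the top-scoring positions plus the observation window. It is not the kernel used in this repository; the function name and the `window`, `budget`, and `pool` parameters are assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def snapkv_style_keep_indices(queries, keys, window, budget, pool=3):
    """Return the sorted prompt positions to keep for one attention head.

    queries, keys: lists of equal-length float vectors, one per prompt token.
    window: number of trailing prompt tokens used as the observation window.
    budget: number of positions to keep from the part before the window.
    pool:   width of the max-pooling filter applied to the scores.
    """
    n = len(keys)
    prefix = n - window
    if prefix <= budget:
        return list(range(n))  # nothing to prune

    scale = 1.0 / math.sqrt(len(keys[0]))
    votes = [0.0] * prefix
    for offset, q in enumerate(queries[prefix:]):
        q_pos = prefix + offset
        # Causal attention: the query at q_pos sees keys 0..q_pos.
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[: q_pos + 1]]
        probs = softmax(logits)
        for i in range(prefix):
            votes[i] += probs[i]

    # Max pooling keeps neighbours of a strongly attended position together.
    half = pool // 2
    pooled = [max(votes[max(0, i - half): i + half + 1]) for i in range(prefix)]

    top = sorted(range(prefix), key=lambda i: pooled[i], reverse=True)[:budget]
    return sorted(top) + list(range(prefix, n))


# Toy input: 10 tokens, only position 2 has a key that matches the window queries.
keys = [[0.0, 0.0] for _ in range(10)]
keys[2] = [5.0, 0.0]
queries = [[0.0, 0.0] for _ in range(8)] + [[5.0, 0.0], [5.0, 0.0]]

print(snapkv_style_keep_indices(queries, keys, window=2, budget=3))
# [1, 2, 3, 8, 9]
```

Position 2 and its pooled neighbours survive, together with the two window tokens; every other position would be dropped from the cache.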


vLLM seamlessly supports most popular open-source models on HuggingFace, including:

- Transformer-like LLMs (e.g., Qwen3/Llama)

## Env

```bash
cd vllm
python use_existing_torch.py
# then add torch to `requires` in pyproject.toml
export SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM="0.6.0"
pip install -e . --no-build-isolation -v -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/
```

Additional related packages (prebuilt wheels):

- flash_attn-2.8.3+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- torchvision-0.24.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- triton-3.3.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl

This project is compatible with triton-3.1.0, triton-3.3.0, and triton-3.5.1. With triton-3.5.1, however, an environment that uses clang 17 and LLVM 22.0 requires the following modification because of Triton's own compatibility issues.

In `/usr/local/lib/python3.10/dist-packages/triton/backends/amd/compiler.py`, locate the `make_llir(src, metadata, options)` function in the `HIPBackend(BaseBackend)` class and replace `return str(llvm_mod)` with:

```python
# compatibility fix for clang 17 + LLVM 22.0
# (uses the `re` module; add `import re` at the top of compiler.py if it is missing)
llir = str(llvm_mod)
llir = re.sub(r"getelementptr inbounds\s+nuw\s+", "getelementptr inbounds ", llir)
llir = re.sub(r"getelementptr\s+nuw\s+", "getelementptr ", llir)
llir = re.sub(r"getelementptr inbounds\s+nusw\s+", "getelementptr inbounds ", llir)
llir = re.sub(r"getelementptr\s+nusw\s+", "getelementptr ", llir)
return llir
```
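
To see what this patch does before editing an installed package, the same four substitutions can be run on a sample line of LLVM IR. This is a minimal standalone sketch; the helper name `strip_gep_wrap_flags` and the sample IR line are made up for illustration and are not part of Triton.

```python
import re


def strip_gep_wrap_flags(llir: str) -> str:
    """Remove the `nuw` / `nusw` flags from getelementptr instructions."""
    llir = re.sub(r"getelementptr inbounds\s+nuw\s+", "getelementptr inbounds ", llir)
    llir = re.sub(r"getelementptr\s+nuw\s+", "getelementptr ", llir)
    llir = re.sub(r"getelementptr inbounds\s+nusw\s+", "getelementptr inbounds ", llir)
    llir = re.sub(r"getelementptr\s+nusw\s+", "getelementptr ", llir)
    return llir


sample = "%p = getelementptr inbounds nuw i8, ptr %base, i64 4"
print(strip_gep_wrap_flags(sample))
# %p = getelementptr inbounds i8, ptr %base, i64 4
```

Only the flag is removed; the instruction, its types, and its operands are left unchanged.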

## Quick Start

Basic chat generation with compression:

```bash
python test.py --schedule pdtriton
```

`test.py`:

```python 
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# PYTHONPATH=/home/vllm-project/vllm python test.py --schedule pdtriton 

from __future__ import annotations

import argparse
import os
import sys
from multiprocessing import freeze_support


def _apply_kvprune_attention_env(schedule: str | None) -> None:
    """Map CLI -> VLLM_KVPRUNE_ATTENTION_SCHEDULE (fa_triton | pdtriton | pdfa)."""
    if not schedule:
        return
    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = schedule


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--schedule",
        type=str,
        default="pdtriton",
        choices=("fa_triton", "pdtriton", "pdfa"),
        help=(
            "fa_triton=FA prefill + Triton decode; "
            "pdtriton=Triton prefill + Triton decode; "
            "pdfa=FA prefill + FA decode (page KV writing is Triton)"
        ),
    )
    args, _unknown = parser.parse_known_args()
    _apply_kvprune_attention_env(args.schedule)

    from transformers import AutoTokenizer

    from vllm import CompressionParams, LLM, SamplingParams

    model_id = "Qwen/Qwen3-8B"

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.05,
        max_tokens=512,
    )

    llm = LLM(
        model=model_id,
        tensor_parallel_size=4,
        max_model_len=8192,
        gpu_memory_utilization=0.85,
        kvprune_compression=True,  # enable (True) or disable (False) KV cache pruning
    )

    prompt = (
        "Write a 200-word English prompt for a creative writing task. The prompt should be "
        "a single coherent paragraph without any bullet points, numbered lists, or markdown "
        "formatting. It should describe a specific scenario, character, or conflict, and end "
        "with a clear question that invites the writer to continue the story. Do not use any "
        "special symbols or line breaks. The tone can be mysterious, tense, or reflective. "
        "After the paragraph, include the question on the same line directly following the "
        "period, without hitting enter."
    )

    messages = [{"role": "user", "content": prompt}]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True, # True
    )

    compression = [
        CompressionParams(
            compression_ratio=0.5,
            compression_method="snapkv",
        ),
    ]

    outputs = llm.generate(
        [text],
        sampling_params=sampling_params,
        compression=compression,
    )

    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")


if __name__ == "__main__":
    freeze_support()
    main()
```
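
The three `--schedule` values in `test.py` pair a prefill backend with a decode backend and are passed on through the `VLLM_KVPRUNE_ATTENTION_SCHEDULE` environment variable. The sketch below restates that mapping as a lookup table and checks the variable before launching. `describe_schedule` is a hypothetical helper, and falling back to `pdtriton` when the variable is unset only mirrors the CLI default in `test.py`; it is an assumption about, not a statement of, the engine's own default.

```python
import os

# (prefill backend, decode backend), as described by the --schedule help text.
SCHEDULES = {
    "fa_triton": ("FA", "Triton"),
    "pdtriton": ("Triton", "Triton"),
    "pdfa": ("FA", "FA"),
}


def describe_schedule(env=None):
    """Return a readable description of the selected attention schedule."""
    env = os.environ if env is None else env
    # Assumption: an unset variable is treated like the CLI default in test.py.
    name = env.get("VLLM_KVPRUNE_ATTENTION_SCHEDULE", "pdtriton")
    if name not in SCHEDULES:
        raise ValueError(f"unknown schedule {name!r}; expected one of {sorted(SCHEDULES)}")
    prefill, decode = SCHEDULES[name]
    return f"{name}: {prefill} prefill + {decode} decode"


print(describe_schedule({"VLLM_KVPRUNE_ATTENTION_SCHEDULE": "pdfa"}))
# pdfa: FA prefill + FA decode
```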

`kvprune_compression=True` enables KV cache pruning. It also disables the CUDA graph mode of the vLLM v1 engine to reduce inference time and to minimize the engine's GPU memory usage.

If a DCU kernel error occurs, run `export HIP_LAUNCH_BLOCKING=1` before the test command to work around the instability of the DCU Triton kernels. In the long term, stability has to be improved at the Triton compiler level.
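
The variable can also be set from inside the script instead of the shell. This is a minimal sketch, assuming the HIP runtime reads `HIP_LAUNCH_BLOCKING` when it initializes, so the variable has to be set before `torch` or `vllm` is imported; `enable_blocking_launches` is a hypothetical helper, not part of vLLM.

```python
import os


def enable_blocking_launches(env=None):
    """Set HIP_LAUNCH_BLOCKING=1 unless a value has already been chosen."""
    target = os.environ if env is None else env
    target.setdefault("HIP_LAUNCH_BLOCKING", "1")
    return target


# Call this at the top of the script, before importing torch or vllm.
enable_blocking_launches()
```

Using `setdefault` keeps any value that was already exported in the shell, so the script does not override an explicit choice.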


To evaluate on the RULER datasets:

```
rm -rf ~/.triton ~/.cache/torch /tmp/triton* /tmp/torch*
export PYTHONPATH=/home/vllm-project/vllm
export HIP_LAUNCH_BLOCKING=1

python vllm/tests/kvprune/evaluate/eval_ruler.py \
  --tensor-parallel-size 4 \
  --dataset-parquet ruler/4096/test-00000-of-00001.parquet \
  --dataset-split train \
  --model Qwen/Qwen3-8B \
  --compression-method snapkv \
  --seq-compression-ratio 0.5 \
  --attention-schedule pdtriton
```

## Contributing

We welcome and value any contributions and collaborations.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)