@@ -21,149 +21,59 @@ For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
## About
## About
KV cache pruning, a model compression feature, has been added on top of the official vLLM.
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM with pruning provides:
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
vLLM is flexible and easy to use with:
- Transformer-like LLMs (e.g., Qwen3/Llama)
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
## Env
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
```bash
- Transformer-like LLMs (e.g., Llama)
cd vllm
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
## Getting Started
Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
```bash
pip install vllm
```
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
## Contributing
We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
## Citation
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```
## Contact Us
<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->
## Media Kit
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)
- Lint/format: Ruff is configured in `pyproject.toml`. If installed, run `ruff check .` and `ruff format .` (cache is ignored via `.gitignore`).
## Testing Guidelines
- Framework: `pytest`. Prefer parameterized tests for kernels and keep GPU tests deterministic (seed RNGs; `torch.cuda.synchronize()` before assertions when needed).
- When changing kernels or compression logic, add/extend a focused regression test and, when feasible, compare against a reference backend (e.g., FlashAttention).
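The guidelines above (seeded inputs, comparison against a reference implementation) can be illustrated with a small, dependency-free sketch. It does not use the project's kernels, `pytest`, or PyTorch: `reference_attention`, `streaming_attention`, and `check_against_reference` are hypothetical names, and the streaming variant merely stands in for an optimized kernel under test.

```python
import math
import random


def reference_attention(q, keys, values):
    """Plain softmax attention for one query vector (the trusted reference)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention(q, keys, values):
    """Online-softmax attention: one pass over the keys, rescaling as the max grows."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = sum(a * b for a, b in zip(q, k)) * scale
        new_max = max(running_max, s)
        correction = math.exp(running_max - new_max)  # 0.0 on the first step
        p = math.exp(s - new_max)
        denom = denom * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def check_against_reference(seed, num_tokens, head_dim, tol=1e-9):
    """Build seeded inputs and compare the implementation under test to the reference."""
    rng = random.Random(seed)  # fixed seed keeps the test deterministic
    q = [rng.gauss(0, 1) for _ in range(head_dim)]
    keys = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(num_tokens)]
    values = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(num_tokens)]
    ref = reference_attention(q, keys, values)
    out = streaming_attention(q, keys, values)
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(ref, out))
```

In a real test suite the same pattern would typically be wrapped in `pytest.mark.parametrize` over sequence lengths and head dimensions, with the GPU kernel in place of `streaming_attention`.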
## Commit & Pull Request Guidelines
- Commits in history are short and imperative (e.g., “fix plot”, “update package layout”); keep subjects concise and scoped.
- PRs should include: a clear description, reproduction commands, expected correctness/perf impact, GPU/CUDA details for kernel changes, and new/updated tests. Add plots/screenshots when changing benchmarks or figures.
## Environment & Configuration Tips
- Requires an NVIDIA CUDA GPU; ensure compatible versions of PyTorch, Triton, and `flash-attn`.
- Kernel constraint: `head_dim` (`D`) must be a power of two; new model configs may trigger autotuning on first use.
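The power-of-two constraint on `head_dim` can be checked up front before a model config reaches the kernels. The helper names below (`is_power_of_two`, `validate_head_dim`) are hypothetical and not part of the repository; this is only a sketch of the check.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; a power of two has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def validate_head_dim(head_dim: int) -> int:
    """Return head_dim unchanged, or raise if the kernels could not handle it."""
    if not is_power_of_two(head_dim):
        raise ValueError(f"head_dim must be a power of two, got {head_dim}")
    return head_dim
```

For example, `validate_head_dim(128)` passes, while `validate_head_dim(96)` raises a `ValueError`.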
**Nearly zero-overhead KV-cache compression in a minimal vLLM-style engine**
Long-context LLMs quickly become bottlenecked by the key–value (KV) cache: memory usage and bandwidth both scale linearly with the number of tokens. **compactor-vllm** is a small, simple inference engine that makes long-context inference more practical by combining:
- **Paged KV cache manager** – for efficient memory allocation and management
- **Custom Triton kernels** – for sparse (and dense) variable-length attention and fast KV compression
- **Training-free KV compression** – out-of-the-box, with the Compactor compression method.
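To make the linear scaling of the KV cache concrete, here is a rough sizing sketch. The formula is the standard one for a transformer that stores one key and one value vector per layer, per KV head, per token; the function name and the example configuration (32 layers, 8 KV heads, `head_dim` 128, 2-byte elements) are assumptions chosen for illustration, not figures from this project.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor of 2: one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * num_tokens


# Hypothetical model configuration, for illustration only.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)

for tokens in (4_000, 32_000, 100_000):
    gib = kv_cache_bytes(tokens, **config) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache")
```

With this assumed configuration each token costs 128 KiB, so doubling the context doubles the cache, which is why dropping a fraction of the cached tokens translates directly into memory savings.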
## Key Features
### 🚀 Speed
Custom Triton kernels for head-sparse attention outperform FlashAttention2 by up to 45% on long-context tasks, for both compressed and uncompressed KV caches (benchmarked and tuned on H100, L40, A100, H100 NVL, and H200). compactor-vllm is also over 15x faster than NVIDIA's KVPress library for KV cache compression.
### 💾 Memory Efficiency
Achieve up to 50% memory savings while maintaining strong task performance.
### ⚡ Zero-Overhead Compression
KV compression operations are carefully overlapped with the memory-bound portions of the prefill process, so compression adds almost no extra latency.
### ❗ Use Cases
- **Long-document QA** - Reduce memory for 100K+ token contexts
- **Multi-turn conversations** - Compress chat history while maintaining quality
- **RAG systems** - Handle large retrieved contexts efficiently
- **Batch processing** - Increase batch sizes with compressed KV cache
### ⏱️ Coming Soon
- **Prefix Caching**
- **Calibrated Compression** - Automatically determine how much compression your context can tolerate
- **More Models**
- **More Compression Methods**
- **Fine-grained Compression Policies** - Specify which regions of the context to compress (e.g., don't compress the system prompt, but compress few-shot exemplars).
---
## Performance
### Throughput Comparison (50% KV Retention)
At 50% KV retention, compactor-vllm achieves comparable throughput to **uncompressed vLLM** while using significantly less memory (see the first image).
### Memory Usage (60% KV Retention)
On the RULER 4K dataset with an H100 GPU, compactor-vllm reduces peak KV cache memory from 60GB to 36GB – a 40% reduction, as expected.
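The arithmetic behind these figures is straightforward: if the KV cache dominates peak memory and a fraction `retention` of the entries is kept, memory scales by that fraction. The function names below are illustrative only.

```python
import math


def compressed_kv_memory_gb(full_gb: float, retention: float) -> float:
    """KV cache memory after keeping `retention` (0..1) of the entries."""
    if not 0.0 <= retention <= 1.0:
        raise ValueError("retention must be between 0 and 1")
    return full_gb * retention


def memory_reduction(retention: float) -> float:
    """Fraction of KV cache memory saved at a given retention."""
    return 1.0 - retention


# Figures quoted above: 60 GB uncompressed, 60% retention.
after = compressed_kv_memory_gb(60.0, 0.60)
assert math.isclose(after, 36.0)
assert math.isclose(memory_reduction(0.60), 0.40)
```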
**COMPACTOR_TRITON**: Custom sparse variable-length attention kernel optimized for long contexts and compressed KV caches. It was developed to support prefix caching (coming soon!).
1. Implement `*ForCausalLM` under `models/` using shared `layers/`
2. Register in `MODEL_REGISTRY` with the appropriate `model_type` key
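The two steps above could look roughly like the following. This is a sketch under stated assumptions: the source does not show how `MODEL_REGISTRY` is defined, so the plain-dict registry, the `register_model` decorator, the `build_model` helper, and the `MyModelForCausalLM` class are all hypothetical stand-ins rather than the repository's actual API.

```python
# Assumption: the registry maps a config's `model_type` string to a model class.
MODEL_REGISTRY: dict[str, type] = {}


def register_model(model_type: str):
    """Hypothetical decorator that registers a class under `model_type`."""
    def decorator(cls: type) -> type:
        if model_type in MODEL_REGISTRY:
            raise ValueError(f"model_type {model_type!r} is already registered")
        MODEL_REGISTRY[model_type] = cls
        return cls
    return decorator


@register_model("my_model")
class MyModelForCausalLM:
    """Placeholder model; a real one would be built from the shared layers."""

    def __init__(self, config: dict):
        self.config = config


def build_model(config: dict):
    """Look up the class for config['model_type'] and instantiate it."""
    model_type = config["model_type"]
    try:
        cls = MODEL_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"unsupported model_type: {model_type!r}") from None
    return cls(config)
```

Keying the registry on `model_type` means the engine can select the implementation directly from a model's config without any per-model branching.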
---
## Testing
Run kernel and component tests:
```bash
pytest tests/
```
---
## Project Structure
```
compactor_vllm/
├── core/                # Engine, scheduler, memory management
│   ├── llm_engine.py
│   ├── model_runner.py
│   ├── scheduler.py
│   └── memory_manager.py
├── compression/         # Compression methods and configuration
│   ├── compactor.py
│   ├── snapkv.py
│   └── compression_params.py
├── attention/           # Attention kernels and backends
│   ├── sparse_varlen_kernel.py
│   └── sparse_decode_kernel.py
├── kv_cache/            # Paged KV cache implementation
│   ├── page_table.py
│   └── store_kv_cache.py
├── layers/              # Model layers
│   ├── attention.py
│   ├── moe.py
│   └── ...
├── models/              # Model implementations
│   ├── llama.py
│   ├── qwen3.py
│   └── ...
├── utils/               # Utilities and helpers
└── triton_kernels/      # Fast MOE kernels from Triton Lang repo
```
---
## Citation
If you use compactor-vllm or the Compactor method in your research, please cite:
```bibtex
@article{chari2025compactor,
title={Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores},
author={Vivek Chari and Benjamin Van Durme},
journal={arXiv preprint arXiv:2507.08143},
year={2025},
url={https://arxiv.org/abs/2507.08143}
}
```
---
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
* See https://github.com/NVIDIA/kvpress for additional compression methods in an easy-to-use format
## MIT License
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"narrativeqa":"You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: \n\n<text>\n{context}\n</text>\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"qasper":"You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nArticle: \n\n<text>\n{context}\n</text>\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"multifieldqa_en":"Read the following text and answer briefly.\n\n<text>\n{context}\n</text>\n\n Now, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"hotpotqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"2wikimqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"musique":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"gov_report":"You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of the report.\n\nSummary:",
"qmsum":"You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n\n<text>\n{context}\n</text>\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {input}\nAnswer:",
"multi_news":"You are given several news passages. Write a one-page summary of all news. \n\nNews:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of all the news.\n\nSummary:",
"trec":"Please determine the type of the question below. Here are some examples of questions.\n\n<text>\n{context}\n</text>\n\n{input}",
"triviaqa":"Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"samsum":"Summarize the dialogue into a few short sentences. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"passage_count":"There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n<text>\n{context}\n</text>\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: ",
"passage_retrieval_en":"Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n<text>\n{context}\n</text>\n\nThe following is an abstract.\n\n{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like \"Paragraph 1\", \"Paragraph 2\", etc.\n\nThe answer is: ",
"narrativeqa":"You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: \n\n<text>\n{context}\n</text>\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"qasper":"You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nArticle: \n\n<text>\n{context}\n</text>\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"multifieldqa_en":"Read the following text and answer briefly.\n\n<text>\n{context}\n</text>\n\n Now, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"hotpotqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"2wikimqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"musique":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"gov_report":"You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of the report.\n\nSummary:",
"qmsum":"You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n\n<text>\n{context}\n</text>\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {input}\nAnswer:",
"multi_news":"You are given several news passages. Write a one-page summary of all news. \n\nNews:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of all the news.\n\nSummary:",
"trec":"Please determine the type of the question below. Here are some examples of questions.\n\n<text>\n{context}\n</text>\n\n{input}",
"triviaqa":"Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"samsum":"Summarize the dialogue into a few short sentences. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"passage_count":"There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n<text>\n{context}\n</text>\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: ",
"passage_retrieval_en":"Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n<text>\n{context}\n</text>\n\nThe following is an abstract.\n\n{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like \"Paragraph 1\", \"Paragraph 2\", etc.\n\nThe answer is: ",