"narrativeqa":"You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: \n\n<text>\n{context}\n</text>\n\nNow, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"qasper":"You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nArticle: \n\n<text>\n{context}\n</text>\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"multifieldqa_en":"Read the following text and answer briefly.\n\n<text>\n{context}\n</text>\n\n Now, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"hotpotqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"2wikimqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"musique":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"gov_report":"You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of the report.\n\nSummary:",
"qmsum":"You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n\n<text>\n{context}\n</text>\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {input}\nAnswer:",
"multi_news":"You are given several news passages. Write a one-page summary of all news. \n\nNews:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of all the news.\n\nSummary:",
"trec":"Please determine the type of the question below. Here are some examples of questions.\n\n<text>\n{context}\n</text>\n\n{input}",
"triviaqa":"Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"samsum":"Summarize the dialogue into a few short sentences. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"passage_count":"There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n<text>\n{context}\n</text>\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: ",
"passage_retrieval_en":"Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n<text>\n{context}\n</text>\n\nThe following is an abstract.\n\n{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like \"Paragraph 1\", \"Paragraph 2\", etc.\n\nThe answer is: ",
- Lint/format: Ruff is configured in `pyproject.toml`. If installed, run `ruff check .` and `ruff format .` (cache is ignored via `.gitignore`).
## Testing Guidelines
- Framework: `pytest`. Prefer parameterized tests for kernels and keep GPU tests deterministic (seed RNGs; `torch.cuda.synchronize()` before assertions when needed).
- When changing kernels or compression logic, add/extend a focused regression test and, when feasible, compare against a reference backend (e.g., FlashAttention).
## Commit & Pull Request Guidelines
- Commits in history are short and imperative (e.g., “fix plot”, “update package layout”); keep subjects concise and scoped.
- PRs should include: a clear description, reproduction commands, expected correctness/perf impact, GPU/CUDA details for kernel changes, and new/updated tests. Add plots/screenshots when changing benchmarks or figures.
## Environment & Configuration Tips
- Requires an NVIDIA CUDA GPU; ensure compatible versions of PyTorch, Triton, and `flash-attn`.
- Kernel constraint: `head_dim` (`D`) must be a power of two; new model configs may trigger autotuning on first use.
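
The `head_dim` constraint can be checked up front, before a kernel is launched. The helper below is a minimal sketch; the names `is_power_of_two` and `validate_head_dim` are hypothetical and are not part of the repository's API.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two (1, 2, 4, 8, ...)."""
    # A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0


def validate_head_dim(head_dim: int) -> int:
    """Raise ValueError if head_dim violates the power-of-two constraint."""
    if not is_power_of_two(head_dim):
        raise ValueError(f"head_dim must be a power of two, got {head_dim}")
    return head_dim
```

For example, `validate_head_dim(128)` returns `128`, while `validate_head_dim(96)` raises `ValueError`.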
**Nearly zero-overhead KV-cache compression in a minimal vLLM-style engine**
Long-context LLMs quickly become bottlenecked by the key–value (KV) cache: memory usage and bandwidth both scale linearly with the number of tokens. **compactor-vllm** is a small, simple inference engine that makes long-context inference more practical by combining:
- **Paged KV cache manager** – for efficient memory allocation and management
- **Custom Triton kernels** – for sparse (and dense) variable-length attention and fast KV compression
- **Training-free KV compression** – out-of-the-box, with the Compactor compression method
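
To illustrate the general idea behind a paged KV cache, here is a toy allocator sketch. It is not the project's implementation: the class name `ToyPagedAllocator`, its methods, and the page size used below are assumptions made purely for illustration. Each sequence owns a list of fixed-size pages, so memory is handed out page by page rather than as one contiguous buffer per sequence.

```python
class ToyPagedAllocator:
    """Hypothetical page allocator: maps sequence ids to lists of page indices."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # indices of unused pages
        self.page_table = {}  # seq_id -> list of page indices owned by the sequence
        self.seq_lens = {}  # seq_id -> number of tokens stored so far

    def append_tokens(self, seq_id: int, num_tokens: int) -> list:
        """Reserve room for num_tokens more tokens and return the sequence's pages."""
        pages = self.page_table.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        needed = -(-new_len // self.page_size)  # ceiling division
        extra = needed - len(pages)
        if extra > len(self.free_pages):
            raise MemoryError("out of KV cache pages")
        for _ in range(extra):
            pages.append(self.free_pages.pop())
        self.page_table[seq_id] = pages
        self.seq_lens[seq_id] = new_len
        return pages

    def free(self, seq_id: int) -> None:
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

With `page_size=16`, a sequence of 20 tokens occupies two pages, and growing it to 33 tokens requires a third. Compression fits naturally into this scheme, since dropping tokens lets a sequence hold fewer pages.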
## Key Features
### 🚀 Speed
Custom Triton kernels for head-sparse attention outperform FlashAttention2 by up to 45% on long-context tasks, for both compressed and uncompressed KV caches (benchmarked and tuned on H100, L40, A100, H100 NVL, and H200). Over 15x faster than NVIDIA's KVPress library for KV cache compression.
### 💾 Memory Efficiency
Achieve up to 50% memory savings while maintaining strong task performance.
### ⚡ Zero-Overhead Compression
KV compression operations are carefully overlapped with the memory-bound portions of the prefill process.
### ❗ Use Cases
- **Long-document QA** - Reduce memory for 100K+ token contexts
- **Multi-turn conversations** - Compress chat history while maintaining quality
- **RAG systems** - Handle large retrieved contexts efficiently
- **Batch processing** - Increase batch sizes with compressed KV cache
### ⏱️ Coming Soon
- **Prefix Caching**
- **Calibrated Compression** - Automatically determine how much compression your context can tolerate
- **More Models**
- **More Compression Methods**
- **Fine-grained Compression Policies** - Specify which regions of the context to compress (e.g., don't compress the system prompt, but compress few-shot exemplars)
---
## Performance
### Throughput Comparison (50% KV Retention)
At 50% KV retention, compactor-vllm achieves comparable throughput to **uncompressed vLLM** while using significantly less memory (see the first image).
### Memory Usage (60% KV Retention)
On the RULER 4K dataset with an H100 GPU, compactor-vllm reduces peak KV cache memory from 60GB to 36GB – a 40% reduction, as expected.
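
The 40% figure follows from the retention ratio, since KV cache size scales linearly with the number of retained tokens. A small sketch of that arithmetic (the function name `expected_kv_memory_gb` is made up for illustration, and the estimate ignores any fixed overhead):

```python
def expected_kv_memory_gb(uncompressed_gb: float, retention: float) -> float:
    """Estimate KV cache size after compression, assuming linear scaling in tokens kept."""
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    return uncompressed_gb * retention


compressed = expected_kv_memory_gb(60.0, 0.60)
reduction = 1.0 - compressed / 60.0
print(f"{compressed:.0f} GB, {reduction:.0%} reduction")  # 36 GB, 40% reduction
```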
**COMPACTOR_TRITON**: A custom sparse variable-length attention kernel optimized for long contexts and compressed KV caches. It was developed to support prefix caching (coming soon).
1. Implement `*ForCausalLM` under `models/` using shared `layers/`
2. Register in `MODEL_REGISTRY` with the appropriate `model_type` key
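
A rough sketch of what these two steps might look like. Only `MODEL_REGISTRY` and the `*ForCausalLM` naming come from the steps above; the `register_model` decorator, the `MyModelForCausalLM` class, and the `"my_model"` key are hypothetical, and the real base classes and constructor signatures under `models/` may differ.

```python
from typing import Callable, Dict, Type

# Stand-in for the project's registry; in the real code base this already exists.
MODEL_REGISTRY: Dict[str, Type] = {}


def register_model(model_type: str) -> Callable[[Type], Type]:
    """Hypothetical decorator that stores a model class under its model_type key."""

    def decorator(cls: Type) -> Type:
        if model_type in MODEL_REGISTRY:
            raise KeyError(f"model_type {model_type!r} is already registered")
        MODEL_REGISTRY[model_type] = cls
        return cls

    return decorator


@register_model("my_model")
class MyModelForCausalLM:
    """Placeholder model; a real one would be assembled from the shared layers."""

    def __init__(self, config: dict) -> None:
        self.config = config


# Look up the class from a config's model_type, as an engine might do at load time.
model_cls = MODEL_REGISTRY["my_model"]
model = model_cls({"hidden_size": 64})
```

Keying the registry on `model_type` keeps model selection a simple dictionary lookup, so adding a model does not require touching the engine code.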
---
## Testing
Run kernel and component tests:
```bash
pytest tests/
```
---
## Project Structure
```
compactor_vllm/
├── core/                 # Engine, scheduler, memory management
│   ├── llm_engine.py
│   ├── model_runner.py
│   ├── scheduler.py
│   └── memory_manager.py
├── compression/          # Compression methods and configuration
│   ├── compactor.py
│   ├── snapkv.py
│   └── compression_params.py
├── attention/            # Attention kernels and backends
│   ├── sparse_varlen_kernel.py
│   └── sparse_decode_kernel.py
├── kv_cache/             # Paged KV cache implementation
│   ├── page_table.py
│   └── store_kv_cache.py
├── layers/               # Model layers
│   ├── attention.py
│   ├── moe.py
│   └── ...
├── models/               # Model implementations
│   ├── llama.py
│   ├── qwen3.py
│   └── ...
├── utils/                # Utilities and helpers
└── triton_kernels/       # Fast MoE kernels from the Triton language repo
```
---
## Citation
If you use compactor-vllm or the Compactor method in your research, please cite:
```bibtex
@article{chari2025compactor,
title={Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores},
author={Vivek Chari and Benjamin Van Durme},
journal={arXiv preprint arXiv:2507.08143},
year={2025},
url={https://arxiv.org/abs/2507.08143}
}
```
---
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
* See https://github.com/NVIDIA/kvpress for additional compression methods in an easy-to-use format
## MIT License
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.