Commit d29c39ca authored by chenzk

vllm kvprune wo:v1.1.0

parent f81ce56b
...@@ -21,149 +21,59 @@ For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
## About
The KV cache pruning model-compression feature has been added to the official vLLM.
vLLM prunes with:
- [**SNAPKV**](https://arxiv.org/pdf/2404.14469)
- [**COMPACTOR**](https://arxiv.org/pdf/2507.08143)
- [**CRITICALADAKV**](https://arxiv.org/pdf/2502.03805)
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Qwen3/Llama)
## Env
```bash
cd vllm
python use_existing_torch.py
# then add torch to `requires` in pyproject.toml
export SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM="0.6.0"
pip install -e . --no-build-isolation -v -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/
```
More related libraries:
- flash_attn-2.8.3+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- torchvision-0.24.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- triton-3.5.1+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
## Quick Start
Basic Chat Generation with Compression:
```
python test.py --schedule pdtriton
```
test.py:
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# PYTHONPATH=/home/vllm-project/vllm python test.py --schedule pdtriton
from __future__ import annotations

import argparse
import os
import sys
from multiprocessing import freeze_support


def _apply_kvprune_attention_env(schedule: str | None) -> None:
    """Map CLI -> VLLM_KVPRUNE_ATTENTION_SCHEDULE (fa_triton | pdtriton | pdfa)."""
    if not schedule:
        return
    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = schedule


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--schedule",
        type=str,
        default="pdtriton",
        choices=("fa_triton", "pdtriton", "pdfa"),
        help=(
            "fa_triton=FA prefill + Triton decode; "
            "pdtriton=Triton prefill + Triton decode; "
            "pdfa=FA prefill + FA decode (page KV writing is Triton);"
        ),
    )
    args, _unknown = parser.parse_known_args()
    _apply_kvprune_attention_env(args.schedule)

    # These imports run after the environment variable has been set.
    from transformers import AutoTokenizer
    from vllm import CompressionParams, LLM, SamplingParams

    model_id = "Qwen/Qwen3-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.05,
        max_tokens=512,
    )
    llm = LLM(
        model=model_id,
        tensor_parallel_size=4,
        max_model_len=8192,
        gpu_memory_utilization=0.85,
        kvprune_compression=True,
    )
    prompt = (
        "Write a 200-word English prompt for a creative writing task. The prompt should be "
        "a single coherent paragraph without any bullet points, numbered lists, or markdown "
        "formatting. It should describe a specific scenario, character, or conflict, and end "
        "with a clear question that invites the writer to continue the story. Do not use any "
        "special symbols or line breaks. The tone can be mysterious, tense, or reflective. "
        "After the paragraph, include the question on the same line directly following the "
        "period, without hitting enter."
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    # One CompressionParams entry per prompt: snapkv at a 0.5 compression ratio.
    compression = [
        CompressionParams(
            compression_ratio=0.5,
            compression_method="snapkv",
        ),
    ]
    outputs = llm.generate(
        [text],
        sampling_params=sampling_params,
        compression=compression,
    )
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")


if __name__ == "__main__":
    freeze_support()
    main()
```
## Contributing
We welcome and value any contributions and collaborations.
## Citation
......
<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>
🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
## Getting Started
Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
```bash
pip install vllm
```
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
## Contributing
We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
## Citation
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```
## Contact Us
<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->
## Media Kit
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)
/.ruff_cache
/.DS_Store
/.idea
*.pyc
# Repository Guidelines
## Project Structure & Module Organization
- `src/compactor_vllm/`: main Python package (minimal vLLM-style engine).
- `core/`: engine loop, scheduling, memory management.
- `attention/`: Triton attention backends and `compile_kernels.py` autotuning helper.
- `compression/`: compression methods (e.g., Compactor, SnapKV) and registries/config.
- `kv_cache/`: paged KV cache + store helpers.
- `models/`, `layers/`, `utils/`: model definitions and reusable building blocks.
- `triton_kernels/`: low-level kernels (treat as vendor-style code; avoid drive-by edits).
- `tests/`: GPU correctness tests (`tests/test_*.py`).
- `evaluate/`: evaluation scripts (RULER/LongBench) and configs (`evaluate/longbench_config/`).
- Repo root: figures/plots used by `README.md`.
## Build, Test, and Development Commands
- `pip install -e .`: editable install for local development.
- `pip install -e ".[evaluate]"`: install optional evaluation dependencies (you may also need `pip install datasets`).
- `pytest tests/`: run unit/kernel tests (expects a CUDA-capable GPU and working `flash-attn`/Triton setup).
- `python -m compactor_vllm.attention.compile_kernels --max-length 16384 --HKV 8 --HQ 32 --D 128 --page-size 128`: pre-autotune Triton kernels (results are cached on disk; avoids first-run autotuning latency).
- `python evaluate/eval_ruler.py --help`: run RULER evaluation (downloads datasets).
- `python evaluate/eval_longbench.py`: run LongBench evaluation (downloads datasets).
## Coding Style & Naming Conventions
- Use 4-space indentation and follow existing patterns in `src/compactor_vllm/` (type hints, `@dataclass` configs, `logging` over `print`).
- Naming: `snake_case` (modules/functions), `PascalCase` (classes), `UPPER_SNAKE_CASE` (constants).
- Lint/format: Ruff is configured in `pyproject.toml`. If installed, run `ruff check .` and `ruff format .` (cache is ignored via `.gitignore`).
## Testing Guidelines
- Framework: `pytest`. Prefer parameterized tests for kernels and keep GPU tests deterministic (seed RNGs; `torch.cuda.synchronize()` before assertions when needed).
- When changing kernels or compression logic, add/extend a focused regression test and, when feasible, compare against a reference backend (e.g., FlashAttention).
## Commit & Pull Request Guidelines
- Commits in history are short and imperative (e.g., “fix plot”, “update package layout”); keep subjects concise and scoped.
- PRs should include: a clear description, reproduction commands, expected correctness/perf impact, GPU/CUDA details for kernel changes, and new/updated tests. Add plots/screenshots when changing benchmarks or figures.
## Environment & Configuration Tips
- Requires an NVIDIA CUDA GPU; ensure compatible versions of PyTorch, Triton, and `flash-attn`.
- Kernel constraint: `head_dim` (`D`) must be a power of two; new model configs may trigger autotuning on first use.
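
The power-of-two constraint on `head_dim` is easy to check before launching a run. A minimal sketch, assuming only that `D` is a positive integer; the helper name `is_power_of_two` is hypothetical and not part of the repository:

```python
def is_power_of_two(n: int) -> bool:
    """Return True when n is a positive power of two (1, 2, 4, 8, ...)."""
    # A power of two has exactly one bit set, so clearing the lowest set
    # bit with n & (n - 1) leaves zero.
    return n > 0 and (n & (n - 1)) == 0


if __name__ == "__main__":
    for head_dim in (64, 96, 128):
        print(head_dim, is_power_of_two(head_dim))
```

A check like this could run while parsing a model config, so that an unsupported head dimension is reported before any kernel is compiled.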
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
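## Overview
compactor-vllm is a minimal inference engine for long-context LLMs that supports training-free KV cache compression. It implements a paged KV cache manager, custom Triton attention kernels optimized for compressed caches, and several compression methods (Compactor, SnapKV).
## Installation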
```bash
pip install -e .
```
Dependencies: Python 3.10+, PyTorch with CUDA, Triton, FlashAttention, Transformers.
## Testing
```bash
pytest tests/
```
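The tests include kernel correctness tests (`test_triton_attention.py`) and KV cache store tests (`test_store_kv.py`).
## Kernel Autotuning
Kernels are autotuned on first use. Pre-tuning is recommended for production: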
```bash
python compactor_vllm/attention/compile_kernels.py --max-length 16384 --HKV 8 --HQ 32 --D 128 --page-size 128
```
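Adjust the parameters to match your model configuration:
- `--HKV`: number of KV heads (the model's `num_key_value_heads`)
- `--HQ`: number of query heads (the model's `num_attention_heads`)
- `--D`: head dimension (must be a power of two)
- `--max-length`: expected maximum sequence length
## Core Architecture
### Execution Flow
1. **LLM (LLMEngine)**: high-level entry point; spawns multiple ModelRunner processes for tensor-parallel inference
2. **ModelRunner**: per-rank execution loop; manages model loading, warmup, and the main inference loop
3. **Scheduler**: manages the sequence lifecycle (waiting → running → finished) and batching
4. **KVCacheManager**: allocates and tracks paged KV cache memory
5. **Attention Layer**: applies compression and computes attention with the selected backend
### Compression Pipeline
Compression happens during the **prefill** phase in the following steps:
1. **Pre-RoPE scoring** (`apply_prerope_compression`): query-agnostic importance scoring (e.g., Compactor's approximate leverage scores)
2. **Post-RoPE scoring** (`apply_postrope_compression`): optional refinement after the rotary position embedding
3. **KV extraction** (`extract_and_store_top_kv`): stores only the highest-scoring KV pairs in the paged cache
Scores are computed asynchronously on a CUDA stream (`STORE_STREAM`) so that they overlap with memory-bound operations.
### Attention Backends
Select via `LLMConfig(attention_backend=...)`:
- **COMPACTOR_TRITON** (default): custom sparse variable-length attention kernel optimized for compressed caches
- **FLASH_ATTENTION**: FlashAttention varlen backend (fallback; less efficient with compression)
The kernels live in `attention/sparse_varlen_kernel.py` (prefill) and `attention/sparse_decode_kernel.py` (decode).
### Paged KV Cache
- **PagedKVCache** (`kv_cache/page_table.py`): global KV cache backed by fixed-size pages
- Each layer has a page table that maps `(batch, kv_head, logical_page)` to a physical page ID
- Pages are allocated from a per-layer free list (a min-heap)
- The page size defaults to 128 tokens (the `kvcache_page_size` config option)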
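
The free-list behaviour described above can be illustrated with a small pure-Python sketch. It assumes a single layer and a hypothetical `PageAllocator` class; the real `PagedKVCache` manages GPU tensors and per-layer page tables, which are omitted here.

```python
import heapq
import math


class PageAllocator:
    """Toy free list for one layer: always hands out the lowest free page ID."""

    def __init__(self, num_pages: int, page_size: int = 128) -> None:
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        heapq.heapify(self.free_pages)
        # Maps (batch, kv_head, logical_page) -> physical page ID.
        self.page_table: dict[tuple[int, int, int], int] = {}

    def pages_needed(self, num_tokens: int) -> int:
        return math.ceil(num_tokens / self.page_size)

    def allocate(self, batch: int, kv_head: int, num_tokens: int) -> list[int]:
        needed = self.pages_needed(num_tokens)
        if needed > len(self.free_pages):
            raise MemoryError("not enough free pages")
        pages = []
        for logical_page in range(needed):
            physical = heapq.heappop(self.free_pages)
            self.page_table[(batch, kv_head, logical_page)] = physical
            pages.append(physical)
        return pages

    def free(self, batch: int, kv_head: int) -> None:
        # Return every page owned by this (batch, kv_head) pair to the heap.
        keys = [k for k in self.page_table if k[:2] == (batch, kv_head)]
        for key in keys:
            heapq.heappush(self.free_pages, self.page_table.pop(key))


if __name__ == "__main__":
    alloc = PageAllocator(num_pages=8)
    print(alloc.allocate(batch=0, kv_head=0, num_tokens=300))
    print(alloc.allocate(batch=0, kv_head=1, num_tokens=100))
    alloc.free(batch=0, kv_head=0)
    print(alloc.allocate(batch=1, kv_head=0, num_tokens=129))
```

Because the free list is a min-heap, pages released by one sequence are the first to be reused, which keeps physical page IDs clustered at the low end.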
### Model Registration
Models are registered in `models/__init__.py` via `MODEL_REGISTRY`:
```python
MODEL_REGISTRY = {
"llama": LlamaForCausalLM,
"qwen3": Qwen3ForCausalLM,
"qwen3_moe": Qwen3MoeForCausalLM,
}
```
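To add a new model:
1. Create a `*ForCausalLM` class in `models/`, using the shared `layers/` (Attention, MoE, etc.)
2. Register it under the matching `model_type` key from the HuggingFace config
### Adding a Compression Method
1. Create a subclass of `BaseCompressionMethod` in `compression/`:
- Implement `pre_rope_scoring(q, k, v, context)` → returns importance scores
- Optionally implement `post_rope_scoring(q, k, v, prerope_scores, context)`
2. Register it in `compression/__init__.py`: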
```python
COMPRESSION_REGISTRY[CompressionMethod.MY_METHOD] = MyCompressionMethod
```
3. Add the enum value to `CompressionMethod` in `compression/compression_config.py`
### Multi-GPU Inference
Tensor-parallel inference uses `torch.distributed` (NCCL). Set `tensor_parallel_size` in `LLMConfig`. The world size must evenly divide `num_key_value_heads`.
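
The divisibility requirement can be checked up front. A minimal sketch, assuming KV heads are split evenly across ranks; `kv_heads_per_rank` is a hypothetical helper, not a function in the repository:

```python
def kv_heads_per_rank(num_key_value_heads: int, tensor_parallel_size: int) -> int:
    """Return how many KV heads each rank would own, or raise if uneven."""
    if tensor_parallel_size <= 0:
        raise ValueError("tensor_parallel_size must be positive")
    if num_key_value_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"tensor_parallel_size={tensor_parallel_size} does not evenly divide "
            f"num_key_value_heads={num_key_value_heads}"
        )
    return num_key_value_heads // tensor_parallel_size


if __name__ == "__main__":
    # A model with 8 KV heads split over 2 ranks.
    print(kv_heads_per_rank(8, 2))
```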
## Directory Structure
```
src/compactor_vllm/
├── attention/ # Triton attention kernels (prefill + decode)
├── compression/ # Compression method implementations
├── config/ # LLMConfig, SamplingParams, enums
├── core/ # LLMEngine, ModelRunner, Scheduler, KVCacheManager
├── kv_cache/ # PagedKVCache, page tables, KV store utilities
├── layers/ # Reusable model layers (Attention, MoE, Linear, etc.)
├── models/ # Model-specific implementations (llama3, qwen3, etc.)
├── utils/ # Sequence dataclass, context management, helpers
└── triton_kernels/ # Fast MoE kernels from the Triton Lang repo
```
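## Important Implementation Details
### Sequence Management
- **Sequence**: dataclass that tracks prompt tokens, generated tokens, sampling parameters, and compression parameters
- **SequenceStatus**: WAITING → RUNNING → FINISHED
- Each sequence gets a unique `seq_id` from an iterator counter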
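### Compression Parameters
- **BatchCompressionParams**: compression method and chunking strategy (applies to the whole batch)
- **SequenceCompressionParams**: per-sequence compression ratio and protected token regions
Protected tokens (head/tail) are never dropped during compression.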
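
One way to picture how a compression ratio and protected regions interact is the sketch below. It is an illustration under stated assumptions, not the repository's kernel: `select_kept_tokens` is a hypothetical helper, `compression_ratio` is treated as the fraction of tokens retained, and protected tokens are assumed to count toward that budget.

```python
def select_kept_tokens(
    scores: list[float],
    compression_ratio: float,
    protected_first_tokens: int = 0,
    protected_last_tokens: int = 0,
) -> list[int]:
    """Return sorted indices of the tokens that would be kept."""
    n = len(scores)
    # Indices that must survive regardless of score.
    protected = set(range(min(protected_first_tokens, n)))
    protected |= set(range(max(n - protected_last_tokens, 0), n))
    # Assumed budget: a fraction of all tokens, never below the protected set.
    budget = max(int(n * compression_ratio), len(protected))
    # Fill the rest of the budget with the highest-scoring unprotected tokens.
    candidates = [i for i in range(n) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = protected | set(candidates[: budget - len(protected)])
    return sorted(kept)


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
    kept = select_kept_tokens(
        scores, 0.5, protected_first_tokens=1, protected_last_tokens=1
    )
    print(kept)
```

In this toy run the first and last tokens survive despite low or middling scores, and the two remaining slots go to the highest-scoring unprotected tokens.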
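### Context Management
`utils/context.py` provides a thread-local `Context` object that stores:
- The current phase (prefill vs. decode)
- The compression context
- Batch mappings, sequence lengths, and cumulative sequence lengths
- CUDA streams for asynchronous operations
Access it with `get_context()`; manage it with `set_context()`/`reset_context()`.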
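
A thread-local context of this kind can be sketched with the standard library. The field names below are illustrative assumptions; only the `get_context()`, `set_context()` and `reset_context()` function names come from the description above.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Context:
    # Illustrative fields only; the real object also carries CUDA streams etc.
    is_prefill: bool = False
    seq_lens: list[int] = field(default_factory=list)


_LOCAL = threading.local()


def get_context() -> Context:
    """Return this thread's context, creating a default one on first use."""
    if not hasattr(_LOCAL, "ctx"):
        _LOCAL.ctx = Context()
    return _LOCAL.ctx


def set_context(ctx: Context) -> None:
    _LOCAL.ctx = ctx


def reset_context() -> None:
    _LOCAL.ctx = Context()


if __name__ == "__main__":
    set_context(Context(is_prefill=True, seq_lens=[512, 1024]))
    print(get_context())
    # Another thread sees its own default context, not the one set above.
    seen = []
    worker = threading.Thread(target=lambda: seen.append(get_context().is_prefill))
    worker.start()
    worker.join()
    print(seen)
    reset_context()
    print(get_context())
```

Keeping the state thread-local means layers can read the current phase without it being threaded through every call signature.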
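### Memory Allocation
KV cache memory is computed during warmup and allocated according to `gpu_memory_utilization`. The engine fails if not enough memory is available.
### Triton Kernels
Kernels use Triton's autotuner and are cached to disk. Because of kernel requirements, `head_dim` must be a power of two.
# Container login
/public/home/lixh6/laibao/ssh/kvpress.sh
Use this script to log in to the container.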
# compactor-vllm
[![arXiv](https://img.shields.io/badge/arXiv-2507.08143-b31b1b.svg)](https://arxiv.org/abs/2507.08143)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
![Token Throughput](vllm_throughput_comparison.png)
**Nearly zero-overhead KV-cache compression in a minimal vLLM-style engine**
Long-context LLMs quickly become bottlenecked by the key–value (KV) cache: memory usage and bandwidth both scale linearly with the number of tokens. **compactor-vllm** is a small, simple inference engine that makes long-context inference more practical by combining:
- **Paged KV cache manager** – for efficient memory allocation and management
- **Custom Triton kernels** – for sparse (and dense) variable-length attention and fast KV compression
- **Training-free KV compression** – out-of-the-box, with the Compactor compression method.
## Key Features
### 🚀 Speed
Custom Triton kernels for head-sparse attention that outperform FlashAttention2 by up to 45% on long-context tasks, for both compressed and uncompressed KV caches (benchmarked and tuned on H100, L40, A100, H100 NVL, and H200). Over 15x faster than NVIDIA's KVPress library for KV cache compression.
### 💾 Memory Efficiency
Achieve up to 50% memory savings while maintaining strong task performance.
### ⚡ Zero-Overhead Compression
KV compression operations are carefully overlapped with memory-bound portions of the prefill process.
### ❗ Use Cases
- **Long-document QA** - Reduce memory for 100K+ token contexts
- **Multi-turn conversations** - Compress chat history while maintaining quality
- **RAG systems** - Handle large retrieved contexts efficiently
- **Batch processing** - Increase batch sizes with compressed KV cache
### ⏱️ Coming Soon
- **Prefix Caching**
- **Calibrated Compression** - automatically determine how much compression your context can tolerate
- **More Models**
- **More Compression Methods**
- **Fine-grained Compression Policies** - Specify which regions of the context to compress (e.g., don't compress the system prompt, but compress few-shot exemplars).
---
## Performance
### Throughput Comparison (50% KV Retention)
At 50% KV retention, compactor-vllm achieves comparable throughput to **uncompressed vLLM** while using significantly less memory (see the first image).
### Memory Usage (60% KV Retention)
On the RULER 4K dataset with an H100 GPU, compactor-vllm reduces peak KV cache memory from 60GB to 36GB – a 40% reduction, as expected.
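
The arithmetic behind these numbers can be reproduced with a back-of-the-envelope estimate. The sketch assumes a dense fp16 cache and Llama-3.1-8B-style dimensions (32 layers, 8 KV heads, head dimension 128); `kv_cache_bytes` is a hypothetical helper, and the token count is made up for illustration.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_element: int = 2,  # fp16 / bf16
    retention: float = 1.0,
) -> int:
    """Estimate KV cache size; the factor 2 covers keys and values."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return int(num_tokens * retention * per_token)


if __name__ == "__main__":
    tokens = 100_000  # illustrative total number of cached tokens
    full = kv_cache_bytes(tokens)
    kept = kv_cache_bytes(tokens, retention=0.6)
    print(f"full: {full / 2**30:.2f} GiB, 60% retention: {kept / 2**30:.2f} GiB")
    print(f"reduction: {1 - kept / full:.0%}")
    # The reported figures have the same ratio: 36 GB is 60% of 60 GB.
    print(f"reported ratio: {36 / 60:.2f}")
```

Since cache size scales linearly with the number of retained tokens, retaining 60% of the KV pairs is expected to leave about 60% of the memory, which matches the reported drop from 60GB to 36GB.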
![Memory Usage](vllm_memory_comparison.png)
### Task Performance (RULER Benchmark, Compactor KV Compression, Query Agnostic)
| KV Discarded | 0% | 25% | 50% | 75% | 95% |
|--------------|-------|-------|-------|-------|-------|
| Llama 3.1-8B | 95.39 | 95.63 | 94.75 | 83.07 | 64.79 |
| Qwen3-8B | 95.01 | 94.57 | 92.29 | 76.48 | 44.69 |
At 50% compression, both models maintain over **97%** of their full-cache performance. Most tasks can tolerate
at least 50% KV compression, and some can tolerate even more! An example of a RULER question:
> A special magic uuid is hidden within the following text. Make sure to memorize it. I will quiz you about the uuid afterwards.
One of the special magic uuids for 3ce915e7-c9d6-463b-8a3c-6f5f5bb5c40c is: 2c9b662e-040a-4aae-92e2-afd996bf10ab.<br> ...<br>
One of the special magic uuids for bde13c1b-2073-4f6d-8d6a-05b343ef2016 is: bee3eb79-1d18-4ee9-86ad-8a8c6bc4123e.
What is the special magic uuid for a93b12cd-1c24-420e-acab-d7e7cc6b66e5 mentioned in the provided text?
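
The 97% figure follows from the table by dividing each 50% score by the corresponding 0% score. A short check, using only the numbers reported above:

```python
# RULER scores copied from the table: (0% discarded, 50% discarded).
scores = {
    "Llama 3.1-8B": (95.39, 94.75),
    "Qwen3-8B": (95.01, 92.29),
}

# Fraction of full-cache performance retained at 50% compression.
relative = {name: half / full for name, (full, half) in scores.items()}

if __name__ == "__main__":
    for name, ratio in relative.items():
        print(f"{name}: {ratio:.1%} of full-cache performance")
```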
### Attention Kernel Performance
Our Triton kernels outperform FlashAttention2 by up to 45% across different sequence lengths
and KV cache sizes, **even for uncompressed caches**:
![img.png](flash_attn_vs_triton_h100.png)
---
## Installation
### From Source
```bash
git clone https://github.com/vnchari/compactor_vllm.git
cd compactor_vllm
pip install -e .
```
### Requirements
- Python 3.10+
- NVIDIA GPU with CUDA support
- PyTorch with CUDA
- Transformers (for model downloading)
- Triton
- FlashAttention
### Autotuning kernels
You can autotune kernels ahead of time instead of at first use. Autotuning results are
automatically cached to disk, so tuning only needs to run once per attention configuration:
```bash
python3 compactor_vllm/attention/compile_kernels.py --max-length 16384 --HKV 8 --HQ 32 --D 128 --page-size 128
```
---
## Quick Start
### Basic Chat Generation with Compression
```python
from compactor_vllm import (
LLM,
LLMConfig,
SamplingParams,
CompressionMethod,
)
from compactor_vllm.compression import (
BatchCompressionParams,
SequenceCompressionParams
)
# Configure the model
config = LLMConfig(
model="Qwen/Qwen3-8B",
max_model_len=40960,
)
llm = LLM(config)
# Set up sampling parameters
sampling = SamplingParams(temperature=0.7, max_new_tokens=256)
# Configure compression
compression = BatchCompressionParams(
compression_method=CompressionMethod.COMPACTOR, # or SNAPKV
)
# Create conversation
messages_batch = [
[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize the main idea of KV cache compression."},
],
]
# Generate with 50% KV retention
sequence_compression = SequenceCompressionParams(compression_ratio=0.5)
answers = llm.generate_chat(
messages_batch=messages_batch,
sampling_params=sampling,
batch_compression_params=compression,
per_sequence_compression_params=sequence_compression
)
print(answers[0])
```
---
## Core Components
### Compression Methods
compactor-vllm supports multiple KV cache compression strategies:
#### **COMPACTOR**
- Query-agnostic compression based on approximate leverage scores
- Training-free and parameter-free
- Maintains strong performance with aggressive compression ratios
#### **SnapKV**
- Query-aware compression using recent-token attention statistics
- Well-suited for scenarios where the question is known at inference-time
#### **None**
- Baseline with no compression
- Standard paged KV cache behavior
### Attention Backends
Choose your attention implementation via `attention_backend`:
```python
from compactor_vllm import LLMConfig, AttentionBackend
config = LLMConfig(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
attention_backend=AttentionBackend.COMPACTOR_TRITON, # Recommended
)
```
**COMPACTOR_TRITON**: Custom sparse variable-length attention kernel optimized for long contexts and compressed KV caches. It was developed to support prefix caching (coming soon!).
**FLASH_ATTENTION**: FlashAttention reference backend.
### Supported Models
Models are registered in `MODEL_REGISTRY` and include:
- **Llama 3 family** – Full support for Meta's Llama 3 models
- **Qwen3** – Dense Qwen3 models
- **Qwen3 MoE** – Mixture-of-Experts Qwen3 variants
Check supported architectures:
```python
from compactor_vllm.models import MODEL_REGISTRY
print(list(MODEL_REGISTRY.keys()))
# ['llama', 'qwen3', 'qwen3_moe']
```
---
## Advanced Usage
### Configuring Compression Ratios
Control how aggressively to compress the KV cache:
```python
from compactor_vllm.compression import SequenceCompressionParams
# Retain 50% of KV cache (discard 50%)
sequence_compression = SequenceCompressionParams(compression_ratio=0.5)
# More aggressive: retain only 25%
sequence_compression = SequenceCompressionParams(compression_ratio=0.25)
```
### Multi-GPU Inference
compactor-vllm supports tensor-parallel inference across multiple GPUs using `torch.distributed`. Specify `tensor_parallel_size` in `LLMConfig`.
### Batch Processing
Process multiple conversations efficiently:
```python
messages_batch = [
[{"role": "user", "content": "Question 1"}],
[{"role": "user", "content": "Question 2"}],
[{"role": "user", "content": "Question 3"}],
]
answers = llm.generate_chat(
messages_batch=messages_batch,
sampling_params=sampling,
batch_compression_params=compression,
)
```
---
## Extending compactor-vllm
### Adding a New Compression Method
1. Create a subclass of `BaseCompressionMethod`:
```python
# compression/my_method.py
from compactor_vllm.compression import BaseCompressionMethod


class MyCompressionMethod(BaseCompressionMethod):
    def pre_rope_scoring(self, q, k, v, context):
        # Implement scoring logic: return importance scores
        ...

    def post_rope_scoring(self, q, k, v, prerope_scores, context):
        # Optional refinement of the pre-RoPE scores
        ...
```
2. Register in `compression/__init__.py`:
```python
from compactor_vllm.compression import COMPRESSION_REGISTRY, CompressionMethod
COMPRESSION_REGISTRY[CompressionMethod.MY_METHOD] = MyCompressionMethod
```
### Adding a New Model Architecture
1. Implement `*ForCausalLM` under `models/` using shared `layers/`
2. Register in `MODEL_REGISTRY` with the appropriate `model_type` key
---
## Testing
Run kernel and component tests:
```bash
pytest tests/
```
---
## Project Structure
```
compactor_vllm/
├── core/ # Engine, scheduler, memory management
│ ├── llm_engine.py
│ ├── model_runner.py
│ ├── scheduler.py
│ └── memory_manager.py
├── compression/ # Compression methods and configuration
│ ├── compactor.py
│ ├── snapkv.py
│ └── compression_params.py
├── attention/ # Attention kernels and backends
│ ├── sparse_varlen_kernel.py
│ └── sparse_decode_kernel.py
├── kv_cache/ # Paged KV cache implementation
│ ├── page_table.py
│ └── store_kv_cache.py
├── layers/ # Model layers
│ ├── attention.py
│ ├── moe.py
│ └── ...
├── models/ # Model implementations
│ ├── llama.py
│ ├── qwen3.py
│ └── ...
├── utils/ # Utilities and helpers
└── triton_kernels/ # Fast MOE kernels from Triton Lang repo
```
---
## Citation
If you use compactor-vllm or the Compactor method in your research, please cite:
```bibtex
@article{chari2025compactor,
title = {Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores},
author = {Vivek Chari and Benjamin Van Durme},
journal = {arXiv preprint arXiv:2507.08143},
year = {2025},
url = {https://arxiv.org/abs/2507.08143}
}
```
---
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
* See https://github.com/NVIDIA/kvpress for additional compression methods in an easy-to-use format
## MIT License
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
import json
import logging

from datasets import concatenate_datasets, load_dataset

from compactor_vllm import (
    LLM,
    LLMConfig,
    SamplingParams,
)
from compactor_vllm.compression import (
    BatchCompressionParams,
    CompressionMethod,
    SequenceCompressionParams,
)
from compactor_vllm.config.engine_config import AttentionBackend
from longbench_metrics import dataset2metric

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s"
    )
    datasets = [
        "narrativeqa",
        "qasper",
        "multifieldqa_en",
        "hotpotqa",
        "2wikimqa",
        "musique",
        "gov_report",
        "qmsum",
        "multi_news",
        "trec",
        "triviaqa",
        "samsum",
        "passage_retrieval_en",
        "passage_count",
        "lcc",
        "repobench-p",
    ]
    dataset = concatenate_datasets(
        [
            load_dataset("THUDM/LongBench", n, split="test", trust_remote_code=True)
            for n in datasets
        ]
    ).shuffle(seed=42)
    # dataset = dataset.take(200)
    prompts = json.load(open("longbench_config/dataset2prompt.json", "r"))
    max_gen_lens = json.load(open("longbench_config/dataset2maxlen.json", "r"))
    tokenizer_kwargs = {"add_generation_prompt": True, "enable_thinking": False}
    dset_names = [
        item["dataset"] if item["dataset"][-2:] != "_e" else item["dataset"][:-2]
        for item in dataset
    ]
    gen_lengths = [max_gen_lens[dset_name] for dset_name in dset_names]
    messages = [
        [
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {"role": "user", "content": prompts[dset_name].format(**item)},
        ]
        for dset_name, item in zip(dset_names, dataset)
    ]
    # model = "Qwen/Qwen3-8B"
    model = "meta-llama/Llama-3.1-8B-Instruct"
    # model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
    config = LLMConfig(
        model,
        max_num_seqs=64,
        gpu_memory_utilization=0.95,
        tensor_parallel_size=2,
        max_model_len=128000,
        attention_backend=AttentionBackend.COMPACTOR_TRITON,
        leverage_sketch_size=32,
    )
    llm = LLM(config)
    responses = llm.generate_chat(
        messages,
        [SamplingParams(max_new_tokens=g, temperature=0.00001) for g in gen_lengths],
        BatchCompressionParams(
            compression_method=CompressionMethod.COMPACTOR,
            do_chunked_compression=False,
            chunk_size=4096,
        ),
        per_sequence_compression_params=[
            SequenceCompressionParams(
                0.25, protected_first_tokens=8, protected_last_tokens=64
            )
        ]
        * len(messages),
        tokenizer_kwargs=tokenizer_kwargs,
        return_sequences=False,
    )
    results = {}
    for dset_name, prediction, item in zip(dset_names, responses, dataset):
        if dset_name not in results:
            results[dset_name] = []
        score = 0.0
        if dset_name in ["trec", "triviaqa", "samsum", "lsht"]:
            prediction = prediction.lstrip("\n").split("\n")[0]
        for ground_truth in item["answers"]:
            score = max(
                score,
                dataset2metric[dset_name](
                    prediction, ground_truth, all_classes=item["all_classes"]
                ),
            )
        results[dset_name].append(score)
    all_sum, all_count = 0, 0
    for task, scores in results.items():
        this_task_sum = sum(scores)
        this_task_count = len(scores)
        print(task, f"{this_task_sum / this_task_count:.2f}")
        all_sum += sum(scores)
        all_count += this_task_count
    print(f"ALL: {all_sum / all_count:.2f}")
import argparse
import logging
import os
import sys
import json
from datetime import datetime
from pathlib import Path
import torch
from datasets import load_dataset
from ruler_metrics import score_function
# Allow running without `pip install -e .` by pointing to `compactor-vllm/src`.
here = Path(__file__).resolve()
repo_root = here.parents[1]
src_dir = repo_root / "src"
if src_dir.is_dir() and str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
from compactor_vllm import (
    LLM,
    LLMConfig,
    SamplingParams,
)  # noqa: E402
from compactor_vllm.compression import (
    BatchCompressionParams,
    CompressionMethod,
    SequenceCompressionParams,
)  # noqa: E402
from compactor_vllm.config.engine_config import AttentionBackend  # noqa: E402


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run RULER evaluation with compactor_vllm."
    )
    parser.add_argument(
        "--log-level",
        type=str,
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Logging level.",
    )
    parser.add_argument(
        "--dataset-length",
        type=str,
        default="4096",
        help="Dataset configuration name.",
    )
    parser.add_argument(
        "--dataset-parquet",
        type=str,
        default=None,
        help=(
            "Optional local Parquet dataset path (single .parquet file or a glob). "
            "If provided, the script will load the dataset from local Parquet instead of "
            "downloading 'simonjegou/ruler'."
        ),
    )
    parser.add_argument(
        "--dataset-split",
        type=str,
        default="test",
        help=(
            "Dataset split to load. For local parquet, this is typically 'train'. "
            "For the online ruler dataset, default is 'test'."
        ),
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Shuffle seed for the dataset.",
    )
    parser.add_argument(
        "--fraction",
        type=float,
        default=1.0,
        help=(
            "Fraction of the dataset to use in (0, 1]. "
            "E.g., 0.1 uses 10%% of the shuffled dataset."
        ),
    )
    parser.add_argument(
        "--model",
        type=str,
        default="meta-llama/Llama-3.1-8B-Instruct",
        help="Model name or path.",
    )
    parser.add_argument(
        "--max-num-seqs",
        type=int,
        default=32,
        help="Maximum number of sequences to batch.",
    )
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=0.95,
        help="Fraction of GPU memory to use.",
    )
    parser.add_argument(
        "--tensor-parallel-size",
        type=int,
        default=1,
        help="Tensor parallelism degree.",
    )
    parser.add_argument(
        "--max-model-len",
        type=int,
        default=40960,
        help="Maximum model context length.",
    )
    parser.add_argument(
        "--enforce-eager",
        action="store_true",
        help="Disable CUDA graph capture and always run in eager mode.",
    )
    backend_choices = [backend.name.lower() for backend in AttentionBackend]
    parser.add_argument(
        "--attention-backend",
        type=str,
        default="compactor_triton",
        choices=backend_choices,
        help=f"Attention backend to use. Choices: {backend_choices}",
    )
    parser.add_argument(
        "--leverage-sketch-size",
type=int,
default=48,
help="Leverage sketch size for compactor attention.",
)
parser.add_argument(
"--max-new-tokens",
type=int,
default=256,
help="Maximum number of new tokens to generate.",
)
parser.add_argument(
"--temperature",
type=float,
default=0.0,
help="Sampling temperature (0 is greedy).",
)
method_choices = [m.name.lower() for m in CompressionMethod]
parser.add_argument(
"--compression-method",
type=str,
default="compactor",
choices=method_choices,
help=f"Compression method. Choices: {method_choices}",
)
parser.add_argument(
"--chunk-size",
type=int,
default=2048,
help="Chunk size for chunked compression.",
)
parser.add_argument(
"--no-chunked-compression",
dest="do_chunked_compression",
action="store_false",
help="Disable leverage chunked compression (enabled by default).",
)
parser.set_defaults(do_chunked_compression=True)
parser.add_argument(
"--seq-compression-ratio",
type=float,
default=0.5,
help="Compression ratio for SequenceCompressionParams.",
)
parser.add_argument(
"--protected-first-tokens",
type=int,
default=8,
help="Number of protected tokens at the beginning of each sequence.",
)
parser.add_argument(
"--extra-protected-last-tokens",
type=int,
default=16,
help=(
"Extra number of protected tokens at the end, in addition to the "
"tokenized length of answer_prefix+question."
),
)
parser.add_argument(
"--tokenizer-add-generation-prompt",
action="store_true",
help="Set tokenizer_kwargs['add_generation_prompt']=True (default False).",
)
parser.add_argument(
"--tokenizer-enable-thinking",
action="store_true",
help="Set tokenizer_kwargs['enable_thinking']=True (default False).",
)
parser.add_argument(
"--no-tokenizer-continue-final-message",
dest="tokenizer_continue_final_message",
action="store_false",
help="Set tokenizer_kwargs['continue_final_message']=False (default True).",
)
parser.set_defaults(tokenizer_continue_final_message=True)
parser.add_argument(
"--results-dir",
type=str,
default="results",
help="Directory to save detailed evaluation results.",
)
return parser.parse_args()
def main(args: argparse.Namespace) -> None:
torch.manual_seed(args.seed)
logging.basicConfig(
level=getattr(logging, args.log_level.upper(), logging.INFO),
format="%(asctime)s %(levelname)s: %(message)s",
)
logger = logging.getLogger(__name__)
if args.dataset_parquet:
logger.info(
"Loading local parquet dataset from %s (split=%s)",
args.dataset_parquet,
args.dataset_split,
)
# datasets supports a file path or glob pattern via data_files.
dataset = load_dataset(
"parquet",
data_files=args.dataset_parquet,
split=args.dataset_split,
)
else:
logger.info(
"Loading dataset %s (length=%s, split=%s)",
"simonjegou/ruler",
args.dataset_length,
args.dataset_split,
)
dataset = load_dataset(
"simonjegou/ruler",
args.dataset_length,
split=args.dataset_split,
)
if args.seed is not None and args.seed >= 0:
logger.info("Shuffling dataset with seed %d", args.seed)
dataset = dataset.shuffle(seed=args.seed)
if not (0 < args.fraction <= 1.0):
raise ValueError("--fraction must be in the interval (0, 1].")
if args.fraction < 1.0:
n_examples = max(1, int(len(dataset) * args.fraction))
logger.info(
"Using %.2f fraction of data: %d / %d examples",
args.fraction,
n_examples,
len(dataset),
)
dataset = dataset.select(range(n_examples))
else:
logger.info("Using full dataset: %d examples", len(dataset))
tokenizer_kwargs = {
"add_generation_prompt": args.tokenizer_add_generation_prompt,
"enable_thinking": args.tokenizer_enable_thinking,
"continue_final_message": args.tokenizer_continue_final_message,
}
messages = [
[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": example["context"] + " " + example["question"],
},
{
"role": "assistant",
"content": example["answer_prefix"],
},
]
for example in dataset
]
attention_backend = AttentionBackend[args.attention_backend.upper()]
compression_method = CompressionMethod[args.compression_method.upper()]
logger.info("Using model: %s", args.model)
model_path = args.model if os.path.isdir(args.model) else None
if model_path is not None:
logger.info("Detected local model path: %s", model_path)
config = LLMConfig(
args.model,
path=model_path,
max_num_seqs=args.max_num_seqs,
gpu_memory_utilization=args.gpu_memory_utilization,
tensor_parallel_size=args.tensor_parallel_size,
max_model_len=args.max_model_len,
enforce_eager=args.enforce_eager,
attention_backend=attention_backend,
leverage_sketch_size=args.leverage_sketch_size,
)
llm = LLM(config)
end_protected_lengths = [
args.extra_protected_last_tokens
+ len(
llm.tokenizer(example["answer_prefix"] + example["question"])["input_ids"]
)
for example in dataset
]
per_sequence_compression_params = [
SequenceCompressionParams(
args.seq_compression_ratio,
protected_first_tokens=args.protected_first_tokens,
protected_last_tokens=end_protected_length,
)
for end_protected_length in end_protected_lengths
]
# Sampling params
sampling_params = SamplingParams(
max_new_tokens=args.max_new_tokens,
temperature=args.temperature,
)
# Batch compression params
batch_compression_params = BatchCompressionParams(
compression_method=compression_method,
do_chunked_compression=args.do_chunked_compression,
chunk_size=args.chunk_size,
)
logger.info("Running generate_chat on %d examples.", len(messages))
responses = llm.generate_chat(
messages,
sampling_params,
batch_compression_params,
per_sequence_compression_params=per_sequence_compression_params,
tokenizer_kwargs=tokenizer_kwargs,
return_sequences=False,
)
logger.info("Scoring responses.")
results = {}
per_example = []
all_sum, all_count = 0.0, 0
for idx, (example, response) in enumerate(zip(dataset, responses)):
task = example["task"]
answer = example["answer"]
score = score_function(
generated=response,
ground_truth=answer,
task_category=task,
)
if task not in results:
results[task] = []
results[task].append(score)
all_sum += score
all_count += 1
per_example.append(
{
"index": idx,
"task": task,
"context": example["context"],
"question": example["question"],
"answer_prefix": example["answer_prefix"],
"ground_truth": answer,
"generated": response,
"score": score,
"compression_params": {
"seq_compression_ratio": args.seq_compression_ratio,
"protected_first_tokens": args.protected_first_tokens,
"protected_last_tokens": end_protected_lengths[idx],
},
}
)
per_task_summary = {}
for task, scores in results.items():
this_task_sum = sum(scores)
this_task_count = len(scores)
avg = this_task_sum / this_task_count
print(task, f"{avg:.3f}")
per_task_summary[task] = {
"avg_score": avg,
"num_examples": this_task_count,
"sum_scores": this_task_sum,
}
overall_avg = all_sum / all_count if all_count > 0 else 0.0
print(f"ALL: {overall_avg:.3f}")
os.makedirs(args.results_dir, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
safe_model_name = args.model.replace("/", "_")
base_name = f"ruler_{args.dataset_length}_{safe_model_name}_{timestamp}"
summary_path = os.path.join(args.results_dir, base_name + "_summary.json")
details_path = os.path.join(args.results_dir, base_name + "_details.jsonl")
logger.info("Saving summary to %s", summary_path)
with open(summary_path, "w", encoding="utf-8") as f:
json.dump(
{
"timestamp": timestamp,
"model": args.model,
"dataset": args.dataset_parquet or "simonjegou/ruler",
"dataset_length": args.dataset_length,
"num_examples": len(dataset),
"overall_avg_score": overall_avg,
"per_task": per_task_summary,
"arguments": vars(args), # all CLI args
},
f,
ensure_ascii=False,
indent=2,
)
logger.info("Saving per-example details to %s", details_path)
with open(details_path, "w", encoding="utf-8") as f:
for row in per_example:
f.write(json.dumps(row, ensure_ascii=False) + "\n")
if __name__ == "__main__":
main(parse_args())
# Example invocation:
# HIP_LAUNCH_BLOCKING=1 TORCHDYNAMO_DISABLE=1 python eval_ruler.py --dataset-parquet /home/laibao/proj/kvpress/compactor-vllm/evaluate/test-00000-of-00001.parquet --dataset-split train --model /mnt/data/llm-models/Qwen3-8B/ --compression-method compactor --seq-compression-ratio 1 --enforce-eager
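The RULER script protects the tail of each prompt from compression by adding a fixed margin to the token count of `answer_prefix + question`. A minimal sketch of that arithmetic follows, assuming a whitespace tokenizer as a stand-in for the model tokenizer; `whitespace_tokenize` and `tail_protection_lengths` are hypothetical names used only here.

```python
from typing import Callable, Dict, List


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real tokenizer: one token per word."""
    return text.split()


def tail_protection_lengths(
    examples: List[Dict[str, str]],
    extra_tokens: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[int]:
    """Per-example count of trailing tokens to exclude from compression."""
    return [
        # The two fields are concatenated without a separator, as in the script.
        extra_tokens + len(tokenize(ex["answer_prefix"] + ex["question"]))
        for ex in examples
    ]


examples = [{"answer_prefix": "Answer:", "question": "What is the key?"}]
print(tail_protection_lengths(examples, extra_tokens=16))
```

With a real subword tokenizer the counts will be larger, which is one reason the script adds an extra margin on top of the measured length.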
{
"narrativeqa": 128,
"qasper": 128,
"multifieldqa_en": 64,
"multifieldqa_zh": 64,
"hotpotqa": 32,
"2wikimqa": 32,
"musique": 32,
"dureader": 128,
"gov_report": 512,
"qmsum": 512,
"multi_news": 512,
"vcsum": 512,
"trec": 64,
"triviaqa": 32,
"samsum": 128,
"lsht": 64,
"passage_count": 32,
"passage_retrieval_en": 32,
"passage_retrieval_zh": 32,
"lcc": 64,
"repobench-p": 64
}
{
"narrativeqa": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: \n\n<text>\n{context}\n</text>\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"qasper": "You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nArticle: \n\n<text>\n{context}\n</text>\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"multifieldqa_en": "Read the following text and answer briefly.\n\n<text>\n{context}\n</text>\n\n Now, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"multifieldqa_zh": "阅读以下文字并用中文简短回答:\n\n<text>\n{context}\n</text>\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{input}\n回答:",
"hotpotqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"2wikimqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"musique": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"dureader": "请基于给定的文章回答下述问题。\n\n文章:\n\n<text>\n{context}\n</text>\n\n请基于上述文章回答下面的问题。\n\n问题:{input}\n回答:",
"gov_report": "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of the report.\n\nSummary:",
"qmsum": "You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n\n<text>\n{context}\n</text>\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {input}\nAnswer:",
"multi_news": "You are given several news passages. Write a one-page summary of all news. \n\nNews:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of all the news.\n\nSummary:",
"vcsum": "下面有一段会议记录,请你阅读后,写一段总结,总结会议的内容。\n会议记录:\n\n<text>\n{context}\n</text>\n\n会议总结:",
"trec": "Please determine the type of the question below. Here are some examples of questions.\n\n<text>\n{context}\n</text>\n\n{input}",
"triviaqa": "Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"samsum": "Summarize the dialogue into a few short sentences. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"lsht": "请判断给定新闻的类别,下面是一些例子。\n\n<text>\n{context}\n</text>\n\n{input}",
"passage_count": "There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n<text>\n{context}\n</text>\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: ",
"passage_retrieval_en": "Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n<text>\n{context}\n</text>\n\nThe following is an abstract.\n\n{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like \"Paragraph 1\", \"Paragraph 2\", etc.\n\nThe answer is: ",
"passage_retrieval_zh": "以下是若干段落文字,以及其中一个段落的摘要。请确定给定的摘要出自哪一段。\n\n<text>\n{context}\n</text>\n\n下面是一个摘要\n\n{input}\n\n请输入摘要所属段落的编号。答案格式必须是\"段落1\"\"段落2\"等格式\n\n答案是:",
"lcc": "Please complete the code given below. \n{context}Next line of code:\n",
"repobench-p": "Please complete the code given below. \n{context}{input}Next line of code:\n"
}
{
"narrativeqa": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: \n\n<text>\n{context}\n</text>\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"qasper": "You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nArticle: \n\n<text>\n{context}\n</text>\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"multifieldqa_en": "Read the following text and answer briefly.\n\n<text>\n{context}\n</text>\n\n Now, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"multifieldqa_zh": "阅读以下文字并用中文简短回答:\n\n<text>\n{context}\n</text>\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{input}\n回答:",
"hotpotqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"2wikimqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"musique": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n\n<text>\n{context}\n</text>\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"dureader": "请基于给定的文章回答下述问题。\n\n文章:\n\n<text>\n{context}\n</text>\n\n请基于上述文章回答下面的问题。\n\n问题:{input}\n回答:",
"gov_report": "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of the report.\n\nSummary:",
"qmsum": "You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n\n<text>\n{context}\n</text>\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {input}\nAnswer:",
"multi_news": "You are given several news passages. Write a one-page summary of all news. \n\nNews:\n\n<text>\n{context}\n</text>\n\nNow, write a one-page summary of all the news.\n\nSummary:",
"vcsum": "下面有一段会议记录,请你阅读后,写一段总结,总结会议的内容。\n会议记录:\n\n<text>\n{context}\n</text>\n\n会议总结:",
"trec": "Please determine the type of the question below. Here are some examples of questions.\n\n<text>\n{context}\n</text>\n\n{input}",
"triviaqa": "Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"samsum": "Summarize the dialogue into a few short sentences. The following are some examples.\n\n<text>\n{context}\n</text>\n\n{input}",
"lsht": "请判断给定新闻的类别,下面是一些例子。\n\n<text>\n{context}\n</text>\n\n{input}",
"passage_count": "There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n<text>\n{context}\n</text>\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: ",
"passage_retrieval_en": "Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n<text>\n{context}\n</text>\n\nThe following is an abstract.\n\n{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like \"Paragraph 1\", \"Paragraph 2\", etc.\n\nThe answer is: ",
"passage_retrieval_zh": "以下是若干段落文字,以及其中一个段落的摘要。请确定给定的摘要出自哪一段。\n\n<text>\n{context}\n</text>\n\n下面是一个摘要\n\n{input}\n\n请输入摘要所属段落的编号。答案格式必须是\"段落1\"\"段落2\"等格式\n\n答案是:",
"lcc": "Please complete the code given below. \n{context}Next line of code:\n",
"repobench-p": "Please complete the code given below. \n{context}{input}Next line of code:\n"
}
import re
import string
from collections import Counter
import jieba
from fuzzywuzzy import fuzz
from rouge import Rouge
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
return re.sub(r"\b(a|an|the)\b", " ", text)
def white_space_fix(text):
return " ".join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return "".join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def normalize_zh_answer(s):
    """Lower text and remove punctuation, extra whitespace."""

    def white_space_fix(text):
        return "".join(text.split())

    def remove_punc(text):
        # Full-width and CJK punctuation, removed in addition to ASCII punctuation.
        cn_punctuation = "!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
        all_punctuation = set(string.punctuation + cn_punctuation)
        return "".join(ch for ch in text if ch not in all_punctuation)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_punc(lower(s)))
def count_score(prediction, ground_truth, **kwargs):
numbers = re.findall(r"\d+", prediction)
right_num = 0
for number in numbers:
if str(number) == str(ground_truth):
right_num += 1
final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
return float(final_score)
def retrieval_score(prediction, ground_truth, **kwargs):
pattern = r"Paragraph (\d+)"
matches = re.findall(pattern, ground_truth)
ground_truth_id = matches[0]
numbers = re.findall(r"\d+", prediction)
right_num = 0
for number in numbers:
if str(number) == str(ground_truth_id):
right_num += 1
final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
return float(final_score)
def retrieval_zh_score(prediction, ground_truth, **kwargs):
pattern = r"段落(\d+)"
matches = re.findall(pattern, ground_truth)
ground_truth_id = matches[0]
numbers = re.findall(r"\d+", prediction)
right_num = 0
for number in numbers:
if str(number) == str(ground_truth_id):
right_num += 1
final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
return float(final_score)
def code_sim_score(prediction, ground_truth, **kwargs):
all_lines = prediction.lstrip("\n").split("\n")
prediction = ""
for line in all_lines:
if ("`" not in line) and ("#" not in line) and ("//" not in line):
prediction = line
break
return fuzz.ratio(prediction, ground_truth) / 100
def classification_score(prediction, ground_truth, **kwargs):
    all_classes = kwargs["all_classes"]
    # Collect every class label that appears in the prediction.
    em_match_list = [name for name in all_classes if name in prediction]
    # Drop labels that are strict substrings of the ground truth. Build a new
    # list instead of removing while iterating, which can skip elements.
    em_match_list = [
        term
        for term in em_match_list
        if not (term in ground_truth and term != ground_truth)
    ]
    if ground_truth in em_match_list:
        return 1.0 / len(em_match_list)
    return 0.0
def rouge_score(prediction, ground_truth, **kwargs):
    rouge = Rouge()
    try:
        scores = rouge.get_scores([prediction], [ground_truth], avg=True)
    except Exception:
        # Score inputs that Rouge cannot handle (e.g. an empty prediction) as 0.
        return 0.0
    return scores["rouge-l"]["f"]
def rouge_zh_score(prediction, ground_truth, **kwargs):
prediction = " ".join(list(jieba.cut(prediction, cut_all=False)))
ground_truth = " ".join(list(jieba.cut(ground_truth, cut_all=False)))
score = rouge_score(prediction, ground_truth)
return score
def f1_score(prediction, ground_truth, **kwargs):
    # Multiset overlap between the two token lists.
    common = Counter(prediction) & Counter(ground_truth)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = 1.0 * num_same / len(prediction)
    recall = 1.0 * num_same / len(ground_truth)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
def qa_f1_score(prediction, ground_truth, **kwargs):
normalized_prediction = normalize_answer(prediction)
normalized_ground_truth = normalize_answer(ground_truth)
prediction_tokens = normalized_prediction.split()
ground_truth_tokens = normalized_ground_truth.split()
return f1_score(prediction_tokens, ground_truth_tokens)
def qa_f1_zh_score(prediction, ground_truth, **kwargs):
prediction_tokens = list(jieba.cut(prediction, cut_all=False))
ground_truth_tokens = list(jieba.cut(ground_truth, cut_all=False))
prediction_tokens = [normalize_zh_answer(token) for token in prediction_tokens]
ground_truth_tokens = [normalize_zh_answer(token) for token in ground_truth_tokens]
prediction_tokens = [token for token in prediction_tokens if len(token) > 0]
ground_truth_tokens = [token for token in ground_truth_tokens if len(token) > 0]
return f1_score(prediction_tokens, ground_truth_tokens)
dataset2metric = {
"narrativeqa": qa_f1_score,
"qasper": qa_f1_score,
"multifieldqa_en": qa_f1_score,
"multifieldqa_zh": qa_f1_zh_score,
"hotpotqa": qa_f1_score,
"2wikimqa": qa_f1_score,
"musique": qa_f1_score,
"dureader": rouge_zh_score,
"gov_report": rouge_score,
"qmsum": rouge_score,
"multi_news": rouge_score,
"vcsum": rouge_zh_score,
"trec": classification_score,
"triviaqa": qa_f1_score,
"samsum": rouge_score,
"lsht": classification_score,
"passage_retrieval_en": retrieval_score,
"passage_count": count_score,
"passage_retrieval_zh": retrieval_zh_score,
"lcc": code_sim_score,
"repobench-p": code_sim_score,
}
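Most of the English QA tasks in `dataset2metric` use a token-level F1 over normalized answers. The sketch below condenses that computation into a single self-contained function so the arithmetic is easy to follow; `normalize` and `token_f1` are illustrative names and do not replace the functions above.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse spaces."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    # Counter intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The Eiffel Tower", "eiffel tower"))  # 1.0
print(token_f1("Paris, France", "Paris"))  # precision 1/2, recall 1
```

In the second call one of two predicted tokens matches, so precision is 0.5 and recall is 1.0, giving an F1 of 2/3. Extra words in a prediction therefore lower the score even when the answer is present.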
# SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import re
from typing import List
import pandas as pd
def string_match_part(preds, refs):
score = (
sum(
[
max([1.0 if r.lower() in pred.lower() else 0.0 for r in ref])
for pred, ref in zip(preds, refs)
]
)
/ len(preds)
* 100
)
return round(score, 2)
def string_match_all(preds, refs):
score = (
sum(
[
sum([1.0 if r.lower() in pred.lower() else 0.0 for r in ref]) / len(ref)
for pred, ref in zip(preds, refs)
]
)
/ len(preds)
* 100
)
return round(score, 2)
def calculate_metrics(df: pd.DataFrame) -> dict:
scores = {}
np_pattern = re.compile(r"[\x00-\x1f]")
df["predicted_answer"] = df["predicted_answer"].apply(
lambda x: np_pattern.sub("", x.strip()).strip()
)
for task, df_task in df.groupby("task"):
task_category = task.split("_")[0]
metric_fn = string_match_part if task_category == "qa" else string_match_all
preds = df_task["predicted_answer"].tolist()
refs = df_task["answer"].tolist()
score = metric_fn(preds, refs)
scores[task] = {"string_match": score}
return scores
def score_function(*, generated, ground_truth: List[str], task_category: str):
np_pattern = re.compile(r"[\x00-\x1f]")
generated = np_pattern.sub("", generated.strip()).strip()
task_category = task_category.split("_")[0]
metric_fn = string_match_part if task_category == "qa" else string_match_all
return metric_fn([generated], [ground_truth])
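The two RULER metrics differ only in how they aggregate substring hits: `string_match_part` gives full credit if any reference appears in the prediction, while `string_match_all` gives the fraction of references found. A single-example sketch of both follows; `match_any` and `match_fraction` are hypothetical names chosen for this illustration.

```python
from typing import List


def match_any(prediction: str, references: List[str]) -> float:
    """100.0 if at least one reference occurs in the prediction, else 0.0."""
    pred = prediction.lower()
    return 100.0 * max(1.0 if ref.lower() in pred else 0.0 for ref in references)


def match_fraction(prediction: str, references: List[str]) -> float:
    """Percentage of references that occur in the prediction."""
    pred = prediction.lower()
    hits = sum(1 for ref in references if ref.lower() in pred)
    return round(100.0 * hits / len(references), 2)


prediction = "The keys are 4821 and 7730."
references = ["4821", "7730", "1111"]
print(match_any(prediction, references))       # 100.0
print(match_fraction(prediction, references))  # 66.67
```

Both return values on a 0 to 100 scale, matching the functions above, so per-task averages built from them are percentages rather than fractions.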
import argparse
import inspect
import logging
import os
import sys
from pathlib import Path
def _maybe_add_src_to_path() -> None:
# Allow running without `pip install -e .` by pointing to `compactor-vllm/src`.
here = Path(__file__).resolve()
repo_root = here.parents[1]
src_dir = repo_root / "src"
if src_dir.is_dir() and str(src_dir) not in sys.path:
sys.path.insert(0, str(src_dir))
_maybe_add_src_to_path()
from compactor_vllm import LLM, LLMConfig, SamplingParams # noqa: E402
from compactor_vllm.compression import ( # noqa: E402
BatchCompressionParams,
CompressionMethod,
SequenceCompressionParams,
)
from compactor_vllm.config.engine_config import AttentionBackend # noqa: E402
def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Minimal smoke test for compactor-vllm (no speculative decoding)."
)
parser.add_argument(
"--model",
type=str,
default=os.environ.get("MODEL", "/mnt/data/llm-models/Qwen3-8B"),
help="Local model directory or HF id. In the container this is usually a local dir.",
)
parser.add_argument(
"--tp",
type=int,
default=int(os.environ.get("TP", "1")),
help="Tensor parallel size (world size).",
)
parser.add_argument(
"--nccl-port",
type=int,
default=int(os.environ.get("NCCL_PORT", "1218")),
help="TCP port for torch.distributed init (only used for NCCL init_method=tcp://localhost:<port>).",
)
parser.add_argument("--max-model-len", type=int, default=2048)
parser.add_argument("--max-num-seqs", type=int, default=2)
parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.9")),
help="Fraction of total GPU memory used for KV cache + activations.",
)
parser.add_argument(
"--attention-backend",
type=str,
default="compactor_triton",
choices=[b.name.lower() for b in AttentionBackend],
)
parser.add_argument(
"--compression-method",
type=str,
default="compactor",
choices=[m.name.lower() for m in CompressionMethod],
)
parser.add_argument(
"--compression-ratio",
type=float,
default=0.8,
help="Sequence-level compression ratio (e.g. 0.8 keeps 80%% of tokens).",
)
parser.add_argument("--chunk-size", type=int, default=512)
parser.add_argument(
"--no-chunked-compression",
dest="do_chunked_compression",
action="store_false",
)
parser.set_defaults(do_chunked_compression=True)
parser.add_argument("--prompt", type=str, default="Introduce yourself in one sentence, then tell me a story of about 200 words.")
parser.add_argument("--max-new-tokens", type=int, default=64)
parser.add_argument(
"--temperature",
type=float,
default=0.0,
help="0.0 = greedy decoding (recommended for smoke tests).",
)
parser.add_argument(
"--tokenizer-enable-thinking",
dest="tokenizer_enable_thinking",
action="store_true",
help="Pass enable_thinking=True to tokenizer.apply_chat_template (if supported).",
)
parser.add_argument(
"--no-tokenizer-enable-thinking",
dest="tokenizer_enable_thinking",
action="store_false",
help="Pass enable_thinking=False to tokenizer.apply_chat_template (if supported).",
)
parser.set_defaults(tokenizer_enable_thinking=False)
parser.add_argument(
"--tokenizer-add-generation-prompt",
dest="tokenizer_add_generation_prompt",
action="store_true",
help="Pass add_generation_prompt=True to tokenizer.apply_chat_template (if supported).",
)
parser.add_argument(
"--no-tokenizer-add-generation-prompt",
dest="tokenizer_add_generation_prompt",
action="store_false",
help="Pass add_generation_prompt=False to tokenizer.apply_chat_template (if supported).",
)
parser.set_defaults(tokenizer_add_generation_prompt=True)
parser.add_argument(
"--tokenizer-continue-final-message",
dest="tokenizer_continue_final_message",
action="store_true",
help="Pass continue_final_message=True to tokenizer.apply_chat_template (if supported).",
)
parser.add_argument(
"--no-tokenizer-continue-final-message",
dest="tokenizer_continue_final_message",
action="store_false",
help="Pass continue_final_message=False to tokenizer.apply_chat_template (if supported).",
)
parser.set_defaults(tokenizer_continue_final_message=False)
parser.add_argument(
"--skip-special-tokens",
dest="skip_special_tokens",
action="store_true",
help="Skip special tokens in output decoding (recommended).",
)
parser.add_argument(
"--no-skip-special-tokens",
dest="skip_special_tokens",
action="store_false",
help="Keep special tokens in output decoding (e.g. <|im_end|>).",
)
parser.set_defaults(skip_special_tokens=True)
parser.add_argument(
"--log-level",
type=str,
default="INFO",
choices=["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"],
)
return parser.parse_args()


def main() -> None:
    args = _parse_args()
    logging.basicConfig(
        level=getattr(logging, args.log_level.upper()),
        format="%(asctime)s - %(levelname)s - %(message)s",
    )
    # Map the CLI strings onto their enum members by name.
    attention_backend = AttentionBackend[args.attention_backend.upper()]
    compression_method = CompressionMethod[args.compression_method.upper()]
    model = args.model
    cfg = LLMConfig(
        model=model,
        path=model,
        tensor_parallel_size=int(args.tp),
        nccl_port=int(args.nccl_port),
        max_model_len=int(args.max_model_len),
        max_num_seqs=int(args.max_num_seqs),
        gpu_memory_utilization=float(args.gpu_memory_utilization),
        enforce_eager=True,
        attention_backend=attention_backend,
        show_progress_bar=False,
    )
    llm = LLM(cfg)
    tokenizer_kwargs = {
        "add_generation_prompt": bool(args.tokenizer_add_generation_prompt),
        "enable_thinking": bool(args.tokenizer_enable_thinking),
        "continue_final_message": bool(args.tokenizer_continue_final_message),
    }
    if tokenizer_kwargs.get("add_generation_prompt") and tokenizer_kwargs.get(
        "continue_final_message"
    ):
        # HF tokenizer API rejects these being simultaneously True.
        tokenizer_kwargs["continue_final_message"] = False
    # Be defensive: only pass kwargs supported by this tokenizer build.
    try:
        params = inspect.signature(llm.tokenizer.apply_chat_template).parameters
        # If the method takes **kwargs, extra keys are accepted as-is. HF
        # tokenizers consume enable_thinking this way, so filtering by
        # parameter name alone would silently drop it.
        accepts_var_kwargs = any(
            p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
        )
        if not accepts_var_kwargs:
            tokenizer_kwargs = {
                k: v for k, v in tokenizer_kwargs.items() if k in params
            }
    except (TypeError, ValueError):
        # Signature not introspectable: pass the kwargs through unchanged.
        pass
    outs = llm.generate_chat(
        [[{"role": "user", "content": args.prompt}]],
        sampling_params=SamplingParams(
            temperature=float(args.temperature),
            max_new_tokens=int(args.max_new_tokens),
        ),
        batch_compression_params=BatchCompressionParams(
            compression_method=compression_method,
            do_chunked_compression=bool(args.do_chunked_compression),
            chunk_size=int(args.chunk_size),
        ),
        per_sequence_compression_params=SequenceCompressionParams(
            compression_ratio=float(args.compression_ratio),
        ),
        tokenizer_kwargs=tokenizer_kwargs,
        detokenizer_kwargs={"skip_special_tokens": bool(args.skip_special_tokens)},
    )
    print(outs[0])
    llm.exit()


if __name__ == "__main__":
    main()
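
The flag handling and keyword filtering in the script above can be tried in isolation, without a model or GPU. The sketch below is illustrative only and is not part of the commit: `build_parser`, `resolve_tokenizer_kwargs`, `strict_template` and `permissive_template` are hypothetical names, and the two template functions are stand-ins for a tokenizer's `apply_chat_template`. It assumes Python 3.9+ for `argparse.BooleanOptionalAction`, which registers `--x` and `--no-x` in one call; the project declares `requires-python = ">= 3.8"`, so the paired `store_true`/`store_false` form used in the script remains the portable choice.

```python
import argparse
import inspect


def build_parser() -> argparse.ArgumentParser:
    """Declare each paired boolean flag with a single add_argument call."""
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction (Python 3.9+) registers both --x and --no-x.
    parser.add_argument(
        "--tokenizer-add-generation-prompt",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    parser.add_argument(
        "--tokenizer-continue-final-message",
        action=argparse.BooleanOptionalAction,
        default=False,
    )
    parser.add_argument(
        "--tokenizer-enable-thinking",
        action=argparse.BooleanOptionalAction,
        default=False,
    )
    return parser


def resolve_tokenizer_kwargs(args: argparse.Namespace, template_fn) -> dict:
    """Build the kwargs dict, resolve the conflict, drop unsupported keys."""
    kwargs = {
        "add_generation_prompt": args.tokenizer_add_generation_prompt,
        "enable_thinking": args.tokenizer_enable_thinking,
        "continue_final_message": args.tokenizer_continue_final_message,
    }
    # Both being True is treated as a conflict; add_generation_prompt wins.
    if kwargs["add_generation_prompt"] and kwargs["continue_final_message"]:
        kwargs["continue_final_message"] = False
    try:
        params = inspect.signature(template_fn).parameters
    except (TypeError, ValueError):
        # Signature not introspectable: pass everything through.
        return kwargs
    # A **kwargs parameter accepts arbitrary keys, so nothing is filtered.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return kwargs
    return {k: v for k, v in kwargs.items() if k in params}


def strict_template(messages, add_generation_prompt=False):
    """Hypothetical template function that accepts one named option only."""
    return messages


def permissive_template(messages, add_generation_prompt=False, **kwargs):
    """Hypothetical template function that accepts extra keys via **kwargs."""
    return messages


if __name__ == "__main__":
    demo_args = build_parser().parse_args(
        ["--tokenizer-continue-final-message", "--tokenizer-enable-thinking"]
    )
    print(resolve_tokenizer_kwargs(demo_args, strict_template))
    print(resolve_tokenizer_kwargs(demo_args, permissive_template))
```

The `VAR_KEYWORD` check is the design point worth noting: filtering purely by parameter name drops any option that a function only accepts through `**kwargs`, which is how some tokenizers receive template variables such as `enable_thinking`.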
Package Version
---------------------------------- ------------------------------------------
accelerate 1.12.0
addict 2.4.0
aiofiles 25.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.2
aiohttp-cors 0.8.1
aiosignal 1.4.0
airportsdata 20250909
amdsmi 24.5.3+02cbffb.dirty
annotated-doc 0.0.4
annotated-types 0.7.0
anyio 4.12.0
apex 1.5.0+das.opt1.dtk25042
astor 0.8.1
async-timeout 5.0.1
attrs 25.4.0
backports.asyncio.runner 1.2.0
blake3 1.0.8
blinker 1.9.0
boto3 1.42.10
botocore 1.42.10
cachetools 6.2.4
certifi 2025.11.12
charset-normalizer 3.4.4
click 8.2.1
cloudpickle 3.1.2
cmake 3.29.0
coloredlogs 15.0.1
colorful 0.5.8
compressed-tensors 0.10.2
contourpy 1.3.2
cryptography 3.4.8
cupy 12.3.0
cycler 0.12.1
datasets 4.4.1
dbus-python 1.2.18
dcu-megatron 0.13.0+das.opt1.dtk25042
deepspeed 0.15.4+das.opt1.dtk25042
depyf 0.18.0
dgl 2.2.1+das.opt1.dtk25042
dill 0.4.0
diskcache 5.6.3
distlib 0.4.0
distro 1.7.0
dnspython 2.8.0
dropout_layer_norm 2.6.1+das.opt1.dtk2504
eft 0.0.7
einops 0.8.1
email-validator 2.3.0
exceptiongroup 1.3.1
fastapi 0.124.4
fastapi-cli 0.0.16
fastapi-cloud-cli 0.6.0
fastar 0.8.0
fastpt 2.1.1+das.dtk25042
fastrlock 0.8.3
filelock 3.20.1
flash_attn 2.6.1+das.opt1.dtk2504.20251216.gbd5c0f0c
flash_mla 1.0.0+das.opt1.dtk2504.20251210.g124c5ef1
Flask 3.1.2
flatbuffers 25.9.23
fonttools 4.61.1
frozenlist 1.8.0
fsspec 2025.12.0
fused_dense_lib 2.6.1+das.opt1.dtk2504
future 1.0.0
gguf 0.17.1
google-api-core 2.28.1
google-auth 2.45.0
googleapis-common-protos 1.72.0
greenlet 3.3.0
grouped-gemm 0.5.0+das.dtk2504
grouped-gemm-int4 0.5.0+das.dtk2504
grpcio 1.76.0
h11 0.16.0
h2 4.3.0
hf-xet 1.2.0
hiredis 3.3.0
hjson 3.1.0
hpack 4.1.0
httpcore 1.0.9
httplib2 0.20.2
httptools 0.7.1
httpx 0.28.1
huggingface-hub 0.36.0
humanfriendly 10.0
humanize 4.14.0
Hypercorn 0.18.0
hyperframe 6.1.0
hypothesis 5.35.1
idna 3.11
importlib_metadata 8.7.0
iniconfig 2.3.0
interegular 0.3.3
itsdangerous 2.2.0
jeepney 0.7.1
Jinja2 3.1.6
jiter 0.12.0
jmespath 1.0.1
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
keyring 23.5.0
kiwisolver 1.4.9
lark 1.2.2
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
libnacl 2.1.0
lightop 0.6.0+das.dtk25042.20251216.g3830d4e2
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.10.12
lmslim 0.3.1+das.opt1.dtk25042.20251202.g07a5af3e
markdown-it-py 4.0.0
MarkupSafe 3.0.3
matplotlib 3.10.8
mdurl 0.1.2
megatron-core 0.13.2
mistral_common 1.8.6
mmcv 2.2.0+das.opt1.dtk25042
mmengine 0.10.7
moe-w8a8 0.0.1+das.dtk2504
moe-w8a8-prefill-gemm 0.0.1+das.dtk2504
more-itertools 8.10.0
mpmath 1.3.0
msgpack 1.1.2
msgspec 0.20.0
multidict 6.7.0
multiprocess 0.70.18
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1
numa 1.4.6
numba 0.61.2
numpy 1.25.0
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
oauthlib 3.2.0
onnxruntime 1.19.2+das.opt1.dtk25042
openai 1.90.0
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python 4.12.0.88
opencv-python-headless 4.12.0.88
opentelemetry-api 1.39.1
opentelemetry-exporter-prometheus 0.60b1
opentelemetry-proto 1.39.1
opentelemetry-sdk 1.39.1
opentelemetry-semantic-conventions 0.60b1
outlines 0.1.11
outlines_core 0.1.26
packaging 25.0
pandas 2.3.3
partial-json-parser 0.2.1.1.post7
peft 0.18.0
pillow 12.0.0
pip 25.3
platformdirs 4.5.1
pluggy 1.6.0
priority 2.0.0
prometheus_client 0.23.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.4.1
proto-plus 1.26.1
protobuf 6.33.2
psutil 7.1.3
py-cpuinfo 9.0.0
py-spy 0.4.1
pyarrow 22.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.2
pybase64 1.4.3
pycountry 24.6.1
pydantic 2.12.5
pydantic_core 2.41.5
pydantic-extra-types 2.10.6
Pygments 2.19.2
PyGObject 3.42.1
PyHive 0.7.0
PyJWT 2.3.0
PyMySQL 1.1.2
pyparsing 3.2.5
pytest 9.0.2
pytest-asyncio 1.3.0
python-apt 2.4.0+ubuntu4
python-dateutil 2.9.0.post0
python-dotenv 1.2.1
python-json-logger 4.0.0
python-multipart 0.0.20
PyTrie 0.4.0
pytz 2025.2
PyYAML 6.0.3
pyzmq 27.1.0
Quart 0.20.0
ray 2.48.0
redis 7.1.0
referencing 0.37.0
regex 2025.11.3
requests 2.32.5
rich 14.2.0
rich-toolkit 0.17.0
rignore 0.7.6
rotary_emb 2.6.1+das.opt1.dtk2504
rpds-py 0.30.0
rsa 4.9.1
runai-model-streamer 0.11.0
runai-model-streamer-s3 0.11.0
s3transfer 0.16.0
safetensors 0.7.0
scipy 1.15.3
SecretStorage 3.3.1
sentencepiece 0.2.1
sentry-sdk 2.47.0
setuptools 80.8.0
setuptools-scm 9.2.2
shellingham 1.5.4
six 1.16.0
smart_open 7.5.0
sniffio 1.3.1
sortedcontainers 2.4.0
SQLAlchemy 2.0.45
starlette 0.50.0
sympy 1.13.1
taskgroup 0.2.2
tensorboardX 2.6.4
tensorizer 2.12.0
termcolor 3.2.0
threadpoolctl 3.6.0
tiktoken 0.12.0
tokenizers 0.22.1
tomli 2.3.0
torch 2.5.1+das.opt1.dtk25042
torchaudio 2.5.1+das.opt1.dtk25042
torchdata 0.8.0
torchvision 0.20.1+das.opt1.dtk25042
tqdm 4.67.1
transformer_engine 2.5.0+das.opt1.dtk25042
transformers 4.57.3
triton 3.1+das.opt1.dtk25042
typer 0.20.0
typer-slim 0.20.0
typing_extensions 4.15.0
typing-inspection 0.4.2
tzdata 2025.3
urllib3 2.6.2
uvicorn 0.38.0
uvloop 0.22.1
virtualenv 20.35.4
vllm 0.9.2+das.opt2.ffcc47b.dtk25042
wadllib 1.3.6
watchfiles 1.1.1
websockets 15.0.1
Werkzeug 3.1.4
wheel 0.37.1
wrapt 2.0.1
wsproto 1.3.2
xentropy_cuda_lib 2.6.1+das.opt1.dtk2504
xgrammar 0.1.19
xxhash 3.6.0
yapf 0.43.0
yarl 1.22.0
zipp 3.23.0
[project]
name = "compactor-vllm"
description = "Fast KV Cache Compression for LLMs"
version = "0.0.1"
dependencies = [
    # "triton>=3.5.0",
    "transformers",
    # "torch>=2.9.0",
    "safetensors",
    "tqdm",
    "flash-attn",
    "pytest"
]
requires-python = ">= 3.8"
authors = [
    {name = "Vivek Chari", email = "viveknchari@gmail.com"},
]

[project.optional-dependencies]
evaluate = ["rouge", "pandas", "fuzzywuzzy"]

[tool.ruff]
exclude = [
    "triton_kernels"
]
from compactor_vllm.compression import CompressionMethod
from compactor_vllm.config.engine_config import AttentionBackend, LLMConfig
from compactor_vllm.config.sampling_params import SamplingParams
from compactor_vllm.core.llm_engine import LLMEngine as _LLMEngine


class LLM(_LLMEngine):
    pass


__all__ = [
    "LLMConfig",
    "LLM",
    "SamplingParams",
    "AttentionBackend",
    "CompressionMethod",
]