# Repository Guidelines ## Project Structure & Module Organization - `src/compactor_vllm/`: main Python package (minimal vLLM-style engine). - `core/`: engine loop, scheduling, memory management. - `attention/`: Triton attention backends and `compile_kernels.py` autotuning helper. - `compression/`: compression methods (e.g., Compactor, SnapKV) and registries/config. - `kv_cache/`: paged KV cache + store helpers. - `models/`, `layers/`, `utils/`: model definitions and reusable building blocks. - `triton_kernels/`: low-level kernels (treat as vendor-style code; avoid drive-by edits). - `tests/`: GPU correctness tests (`tests/test_*.py`). - `evaluate/`: evaluation scripts (RULER/LongBench) and configs (`evaluate/longbench_config/`). - Repo root: figures/plots used by `README.md`. ## Build, Test, and Development Commands - `pip install -e .`: editable install for local development. - `pip install -e ".[evaluate]"`: install optional evaluation dependencies (you may also need `pip install datasets`). - `pytest tests/`: run unit/kernel tests (expects a CUDA-capable GPU and working `flash-attn`/Triton setup). - `python -m compactor_vllm.attention.compile_kernels --max-length 16384 --HKV 8 --HQ 32 --D 128 --page-size 128`: pre-autotune Triton kernels (results are cached on disk; avoids first-run autotuning latency). - `python evaluate/eval_ruler.py --help`: run RULER evaluation (downloads datasets). - `python evaluate/eval_longbench.py`: run LongBench evaluation (downloads datasets). ## Coding Style & Naming Conventions - Use 4-space indentation and follow existing patterns in `src/compactor_vllm/` (type hints, `@dataclass` configs, `logging` over `print`). - Naming: `snake_case` (modules/functions), `PascalCase` (classes), `UPPER_SNAKE_CASE` (constants). - Lint/format: Ruff is configured in `pyproject.toml`. If installed, run `ruff check .` and `ruff format .` (cache is ignored via `.gitignore`). ## Testing Guidelines - Framework: `pytest`. Prefer parameterized tests for kernels and keep GPU tests deterministic (seed RNGs; `torch.cuda.synchronize()` before assertions when needed). - When changing kernels or compression logic, add/extend a focused regression test and, when feasible, compare against a reference backend (e.g., FlashAttention). ## Commit & Pull Request Guidelines - Commits in history are short and imperative (e.g., “fix plot”, “update package layout”); keep subjects concise and scoped. - PRs should include: a clear description, reproduction commands, expected correctness/perf impact, GPU/CUDA details for kernel changes, and new/updated tests. Add plots/screenshots when changing benchmarks or figures. ## Environment & Configuration Tips - Requires an NVIDIA CUDA GPU; ensure compatible versions of PyTorch, Triton, and `flash-attn`. - Kernel constraint: `head_dim` (`D`) must be a power of two; new model configs may trigger autotuning on first use.