AGENTS.md

# Repository Guidelines

## Project Structure & Module Organization

- `src/compactor_vllm/`: main Python package (minimal vLLM-style engine).
  - `core/`: engine loop, scheduling, memory management.
  - `attention/`: Triton attention backends and `compile_kernels.py` autotuning helper.
  - `compression/`: compression methods (e.g., Compactor, SnapKV) and registries/config.
  - `kv_cache/`: paged KV cache + store helpers.
  - `models/`, `layers/`, `utils/`: model definitions and reusable building blocks.
  - `triton_kernels/`: low-level kernels (treat as vendor-style code; avoid drive-by edits).
- `tests/`: GPU correctness tests (`tests/test_*.py`).
- `evaluate/`: evaluation scripts (RULER/LongBench) and configs (`evaluate/longbench_config/`).
- Repo root: figures/plots used by `README.md`.

## Build, Test, and Development Commands

- `pip install -e .`: editable install for local development.
- `pip install -e ".[evaluate]"`: install optional evaluation dependencies (you may also need `pip install datasets`).
- `pytest tests/`: run unit/kernel tests (expects a CUDA-capable GPU and working `flash-attn`/Triton setup).
- `python -m compactor_vllm.attention.compile_kernels --max-length 16384 --HKV 8 --HQ 32 --D 128 --page-size 128`: pre-autotune Triton kernels (results are cached on disk; avoids first-run autotuning latency).
- `python evaluate/eval_ruler.py --help`: run RULER evaluation (downloads datasets).
- `python evaluate/eval_longbench.py`: run LongBench evaluation (downloads datasets).

## Coding Style & Naming Conventions

- Use 4-space indentation and follow existing patterns in `src/compactor_vllm/` (type hints, `@dataclass` configs, `logging` over `print`).
- Naming: `snake_case` (modules/functions), `PascalCase` (classes), `UPPER_SNAKE_CASE` (constants).
- Lint/format: Ruff is configured in `pyproject.toml`. If installed, run `ruff check .` and `ruff format .` (cache is ignored via `.gitignore`).

## Testing Guidelines

- Framework: `pytest`. Prefer parameterized tests for kernels and keep GPU tests deterministic (seed RNGs; `torch.cuda.synchronize()` before assertions when needed).
- When changing kernels or compression logic, add/extend a focused regression test and, when feasible, compare against a reference backend (e.g., FlashAttention).

## Commit & Pull Request Guidelines

- Commits in history are short and imperative (e.g., “fix plot”, “update package layout”); keep subjects concise and scoped.
- PRs should include: a clear description, reproduction commands, expected correctness/perf impact, GPU/CUDA details for kernel changes, and new/updated tests. Add plots/screenshots when changing benchmarks or figures.

## Environment & Configuration Tips

- Requires an NVIDIA CUDA GPU; ensure compatible versions of PyTorch, Triton, and `flash-attn`.
- Kernel constraint: `head_dim` (`D`) must be a power of two; new model configs may trigger autotuning on first use.