## Dynamo KV Block Manager Kernels

This workspace houses CUDA + Rust + Python tooling for shuttling attention
blocks between three commonly used layouts:

1. **Stacked NHD / HND blocks** – `nl * no` tensors per block, each shaped
   `[nt, nh, hd]` (NHD) or `[nh, nt, hd]` (HND).
   - primarily used by vLLM
2. **Operational blocks** – flattened buffers shaped `[nl, no, inner]`,
   where `inner = nt * nh * hd`.
   - primarily used by TensorRT LLM
   - used by Dynamo's KVBM for non-device storage when no layout
     adjustment is needed to translate between different TP world sizes
3. **Universal blocks** – contiguous buffers shaped `[nh, nl, no, nt, hd]`.
   - moves the head dimension to the front
   - well suited to storage blocks shared across different TP world sizes:
     each rank scatters/gathers a slice of the leading dimension, which
     keeps transfers large and contiguous
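
To make the three layouts concrete, the sketch below computes flat element offsets for each one in plain Python. The function names are illustrative and not part of this workspace. The operational `inner` index is treated as opaque because its internal ordering is not spelled out here, and `tp_slice` assumes heads are split evenly across ranks.

```python
def nhd_offset(t, h, d, nh, hd):
    """Flat offset inside one [nt, nh, hd] chunk."""
    return (t * nh + h) * hd + d


def hnd_offset(t, h, d, nt, hd):
    """Flat offset inside one [nh, nt, hd] chunk."""
    return (h * nt + t) * hd + d


def operational_offset(l, o, i, no, inner):
    """Flat offset inside one [nl, no, inner] buffer (i indexes inner)."""
    return (l * no + o) * inner + i


def universal_offset(h, l, o, t, d, nl, no, nt, hd):
    """Flat offset inside one [nh, nl, no, nt, hd] buffer."""
    return (((h * nl + l) * no + o) * nt + t) * hd + d


def tp_slice(rank, tp, nh, nl, no, nt, hd):
    """Half-open element range of a universal block owned by one TP rank.

    Assumption: nh is divisible by tp and heads are split evenly.
    """
    if nh % tp != 0:
        raise ValueError("nh must be divisible by tp")
    heads_per_rank = nh // tp
    per_head = nl * no * nt * hd  # elements stored for a single head
    start = rank * heads_per_rank * per_head
    return start, start + heads_per_rank * per_head
```

Because `nh` is the leading dimension, every rank's share of a universal block is a single contiguous range. For example, with `nh = 8`, `nl = 2`, `no = 2`, `nt = 4`, `hd = 16`, and `tp = 4`, `tp_slice(1, 4, 8, 2, 2, 4, 16)` covers elements `[512, 1024)`, which are exactly heads 2 and 3.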

All kernels are batch-aware: a single launch can process `nb` blocks by
walking flattened pointer tables that the host code prepares ahead of time.
Bindings are provided for both Rust and PyTorch so you can slot the kernels
into existing pipelines without living in CUDA all day.

---

### Layout Cheat Sheet

| Term                | Logical Shape              | Stored As                          | Notes                         |
|---------------------|----------------------------|------------------------------------|-------------------------------|
| NHD block stack     | `[nl][no][nt, nh, hd]`     | list of `nl * no` pointers         | Inner layout = NHD            |
| HND block stack     | `[nl][no][nh, nt, hd]`     | list of `nl * no` pointers         | Inner layout = HND            |
| Operational block   | `[nl, no, inner]`          | contiguous buffer per block        | `inner = nt * nh * hd`        |
| Universal block     | `[nh, nl, no, nt, hd]`     | contiguous buffer per block        | Ideal when all dims are fixed |

> **Pointer prep**
> For each logical block you provide:
> - one universal pointer,
> - `nl * no` pointers for either NHD or HND chunks, and
> - one operational pointer (when needed).
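
As a rough illustration of the pointer prep above, the sketch below flattens per-block chunk addresses into one table. The helper name and the layer-major ordering (`l * no + o` within a block) are assumptions made for this example, and the integers stand in for device addresses such as those returned by `torch.Tensor.data_ptr()`.

```python
def flatten_pointer_table(block_ptrs, nl, no):
    """Flatten nb x (nl * no) chunk addresses into one flat list.

    Assumed ordering: the chunk for block b, layer l, outer index o
    lands at index b * (nl * no) + l * no + o.
    """
    chunks_per_block = nl * no
    table = []
    for b, ptrs in enumerate(block_ptrs):
        # Each block must contribute exactly nl * no chunk pointers.
        if len(ptrs) != chunks_per_block:
            raise ValueError(
                f"block {b}: expected {chunks_per_block} pointers, got {len(ptrs)}"
            )
        table.extend(ptrs)
    return table


# Fake addresses for nb = 2 blocks with nl = 2 and no = 2.
fake_ptrs = [[0x1000 + 0x100 * (b * 4 + i) for i in range(4)] for b in range(2)]
table = flatten_pointer_table(fake_ptrs, nl=2, no=2)
```

Checking the count per block up front catches the most common mismatch between `nb`, `nl`, and `no` before anything reaches the GPU.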

---

### Repository Structure

```
.
├── Cargo.toml              # Rust lib/bin targets
├── build.rs                # NVCC build script (sm80+sm90 by default)
├── cuda/
│   └── tensor_kernels.cu   # Batched CUDA kernels + memcpy fallback
├── src/
│   ├── lib.rs              # Rust facade for the kernels
│   ├── main.rs             # Legacy cudaMemcpyBatchAsync demo (bin)
│   └── tensor_kernels.rs   # FFI wrappers + integration tests
└── run.sh / Dockerfile     # Optional CUDA 12.9 container harness
```

> **Note:** Python bindings (`python.rs`) and tests have been moved to
> `lib/bindings/kvbm/` as part of the integrated `kvbm` wheel.

---

### Building the CUDA Library

The CUDA code is compiled via `nvcc` in `build.rs`. Supported architectures
default to `sm_80` (Ampere) and `sm_90` (Hopper). Override with `CUDA_ARCHS`
for broader compatibility:

```bash
# Default build (sm_80, sm_90)
cargo build

# Broader compatibility across GPU generations
CUDA_ARCHS="80,86,89,90,100" cargo build

# Common architectures:
# 80  = Ampere (A100)
# 86  = Ampere (RTX 30xx)
# 89  = Ada Lovelace (RTX 40xx, L4, L40)
# 90  = Hopper (H100, H200)
# 100 = Blackwell (B100, B200, GB200)
```
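
For orientation, this is roughly how a comma-separated `CUDA_ARCHS` value maps onto `nvcc` `-gencode` flags. It is a Python sketch of the idea, not a transcription of `build.rs`; the function name is hypothetical and the real build script may assemble its flags differently.

```python
import os


def gencode_flags(archs=None):
    """Translate a comma-separated arch list into nvcc -gencode flags.

    Falls back to the CUDA_ARCHS environment variable, then to "80,90",
    mirroring the defaults described above.
    """
    raw = archs if archs is not None else os.environ.get("CUDA_ARCHS", "80,90")
    flags = []
    for arch in (a.strip() for a in raw.split(",")):
        if not arch.isdigit():
            raise ValueError(f"invalid CUDA arch: {arch!r}")
        flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
    return flags
```

Each listed architecture adds one compiled target to the library, so longer lists widen GPU coverage at the cost of build time and binary size.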

> **Prerequisites**
> - CUDA 12.1+ toolkit on PATH
> - `nvcc` and compatible driver
> - Rust stable (1.70+) with `cargo`

For rapid iteration without the Python bindings:

```bash
cargo check
cargo test fused_copy_roundtrip -- --nocapture
```

The unit test synthesizes two blocks on-device, exercises every conversion
path (block ⇄ universal ⇄ operational), and asserts lossless round-trips.

---

### Python Bindings & Tests

> **Note:** The Python bindings and tests have been migrated to the `kvbm` wheel
> at `lib/bindings/kvbm/`. Install and test using that package instead.

#### Install locally

```bash
cd lib/bindings/kvbm
uv pip install -e ".[dev]"
```

This installs the `kvbm` package with all development dependencies including
the CUDA tensor kernels, pytest, and build tools.

#### Validate against PyTorch baselines

```bash
cd lib/bindings/kvbm
pytest tests/
```

Each test synthesizes random CUDA tensors, permutes them using native PyTorch
ops, then compares the kernel output with tolerances tuned per dtype.
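
A baseline of this kind can be written with a stack and a permute. The sketch below is an illustration rather than the test suite's actual code: `reference_universal` is a hypothetical helper, chunks are assumed to be ordered layer-major (`l * no + o`), and it runs on CPU tensors so it can be tried without a GPU.

```python
import torch


def reference_universal(chunks, nl, no, layout="NHD"):
    """Build a [nh, nl, no, nt, hd] tensor from nl * no chunks with PyTorch ops."""
    stacked = torch.stack(chunks)                       # [nl * no, ?, ?, hd]
    stacked = stacked.view(nl, no, *stacked.shape[1:])  # [nl, no, ?, ?, hd]
    if layout == "NHD":      # [nl, no, nt, nh, hd] -> heads to the front
        perm = (3, 0, 1, 2, 4)
    elif layout == "HND":    # [nl, no, nh, nt, hd] -> heads to the front
        perm = (2, 0, 1, 3, 4)
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    return stacked.permute(*perm).contiguous()


nl, no, nt, nh, hd = 2, 2, 4, 3, 8
nhd_chunks = [torch.randn(nt, nh, hd) for _ in range(nl * no)]
# HND chunks hold the same data with the token and head axes swapped.
hnd_chunks = [c.permute(1, 0, 2).contiguous() for c in nhd_chunks]

from_nhd = reference_universal(nhd_chunks, nl, no, "NHD")
from_hnd = reference_universal(hnd_chunks, nl, no, "HND")
# assert_close picks default tolerances based on dtype.
torch.testing.assert_close(from_nhd, from_hnd)
```

Both layouts describe the same data, so they should produce the same universal tensor; a mismatch here usually means the chunks were passed with the wrong `layout` argument.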

#### Python API Sketch

```python
import torch
from kvbm import kernels

blocks = [...]         # list[list[torch.Tensor]] sized nb x (nl*no)
universals = [...]     # list[torch.Tensor] sized nb
operationals = [...]   # list[torch.Tensor] sized nb

kernels.block_to_universal(blocks, universals, layout="NHD")
kernels.universal_to_block(universals, blocks, layout="NHD")

kernels.block_to_operational(blocks, operationals, backend="batch")  # or "async" / "kernel" / "auto"
kernels.operational_to_block(operationals, blocks, backend="auto")
```

All tensors must be CUDA-accessible from the specified device, match the expected
shapes, and be contiguous in those shapes. The bindings validate shapes/dtypes, stage
pointer tables on-device, and launch the appropriate CUDA kernel.
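
The shape rules can be summarized in a few lines of plain Python. This is a sketch of the kind of check involved, not the bindings' actual validation code; both helper names are hypothetical.

```python
def expected_shapes(nl, no, nt, nh, hd, layout="NHD"):
    """Return the shape each layout should have for the given dimensions."""
    if layout == "NHD":
        chunk = (nt, nh, hd)
    elif layout == "HND":
        chunk = (nh, nt, hd)
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    return {
        "chunk": chunk,                         # one of nl * no tensors per block
        "operational": (nl, no, nt * nh * hd),  # flattened inner dimension
        "universal": (nh, nl, no, nt, hd),      # head dimension leading
    }


def check_block(chunk_shapes, universal_shape, nl, no, nt, nh, hd, layout="NHD"):
    """Raise ValueError if one block's chunks or universal buffer look wrong."""
    want = expected_shapes(nl, no, nt, nh, hd, layout)
    if len(chunk_shapes) != nl * no:
        raise ValueError(f"expected {nl * no} chunks, got {len(chunk_shapes)}")
    for i, shape in enumerate(chunk_shapes):
        if tuple(shape) != want["chunk"]:
            raise ValueError(f"chunk {i}: expected {want['chunk']}, got {tuple(shape)}")
    if tuple(universal_shape) != want["universal"]:
        raise ValueError(
            f"universal: expected {want['universal']}, got {tuple(universal_shape)}"
        )
```

With real tensors you would pass `t.shape` for each chunk, and additionally confirm `t.is_cuda`, `t.is_contiguous()`, and a common `t.dtype` before handing anything to the kernels.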

---

### Docker Workflow (Optional)

Need a reproducible environment? The repo includes a CUDA 12.9 container that
installs Rust and builds the project.

```bash
# Build and run the demo binary inside the container
./run.sh

# Or build manually
docker build -t kvbm-kernels .
docker run --rm --gpus all kvbm-kernels
```

To develop interactively with Python, extend the Dockerfile with your preferred
Python distribution and PyTorch wheel.

---

### Troubleshooting

| Symptom                               | Likely Cause / Fix                                                 |
|---------------------------------------|--------------------------------------------------------------------|
| `cudaErrorInvalidValue` on launch     | Pointer counts mismatch (`nb`, `nl`, `no`) or non-contiguous input |
| Wrong values when using HND layout    | Inner tensors not permuted to `[nh, nt, hd]` before passing in     |
| Python bindings complain about dtype  | Mixed precision in a batch; convert tensors to a common dtype      |
| Kernels run slower than expected      | Verify that `CUDA_ARCHS` matches your GPU to avoid JIT at runtime  |

> **Note:** `backend="auto"` prefers the fused kernel, then `cudaMemcpyBatchAsync`, then `cudaMemcpyAsync`. Pass an explicit `backend` to benchmark a specific path.