# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Is

kvbm-kernels is a high-performance CUDA transfer library for batched H2D, D2H, and D2D block copies used by the Dynamo KV cache system. The core API (`vectorized_copy`, `memcpy_batch`) is always available and handles the common case of moving KV cache blocks between host and device without layout changes. Fused permute-and-copy kernels for layout conversion between **Block Stack** (vLLM) and **Universal** (Dynamo storage) formats are feature-gated behind `permute_kernels`.

## Build Commands

```bash
# Default build (auto-detects nvcc -> source build; no nvcc -> stubs)
cargo build

# Build from source with custom GPU architectures
CUDA_ARCHS="80,86,89,90,100" cargo build

# Static linking (embed kernels into binary instead of .so)
cargo build --features static-kernels

# Check compilation without linking
cargo check

# Run CUDA integration tests for core transfer APIs (requires GPU + nvcc)
cargo test --features testing-cuda

# Run all CUDA integration tests including permute kernels
cargo test --features testing-cuda,permute_kernels

# Run a specific test
cargo test --features testing-cuda,permute_kernels fused_copy_roundtrip -- --nocapture --test-threads=1

# Run benchmarks (Llama 3.1 70B KV cache profile)
cargo run --example kvbench --features kvbench
```

**Environment variables**: `CUDA_ARCHS` (comma-separated SM versions), `CUDA_PTX_ARCHS` (PTX targets), `KVBM_REQUIRE_CUDA` (fail if nvcc missing), `CUDA_PATH`/`CUDA_HOME`.
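As an illustration of the `CUDA_ARCHS` format, the sketch below turns a comma-separated list of SM versions into `nvcc` `-gencode` flags. The helper name `gencode_flags` is hypothetical, and how `build.rs` actually consumes the variable is an assumption; only the comma-separated input format comes from this file.

```rust
/// Hypothetical helper (not taken from build.rs): convert a comma-separated
/// list of SM versions such as "80,86,90" into nvcc `-gencode` flags.
fn gencode_flags(cuda_archs: &str) -> Result<Vec<String>, String> {
    cuda_archs
        .split(',')
        .map(str::trim)
        // Tolerate empty entries such as a trailing comma.
        .filter(|arch| !arch.is_empty())
        .map(|arch| {
            // Accept digits plus suffixes like "90a"; reject anything else.
            if arch.chars().all(|c| c.is_ascii_alphanumeric()) {
                Ok(format!("-gencode=arch=compute_{arch},code=sm_{arch}"))
            } else {
                Err(format!("invalid SM version: {arch:?}"))
            }
        })
        .collect()
}
```

With this sketch, `gencode_flags("80,90")` yields one flag per architecture, and an entry such as `8.6` is rejected rather than passed through to the compiler.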

## Architecture

### Two-tier build system (`build.rs`)

The build script selects one of two modes: **FromSource** (nvcc available, compiles CUDA, requires CUDA >= 12.0) or **Stubs** (no nvcc, C stubs that abort on call). Stubs set the `stub_kernels` cfg flag so tests can be conditionally skipped.
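The mode selection described above might look roughly like the following. This is a minimal sketch, not the contents of `build.rs`: the function name `select_build_mode` and its boolean inputs are assumptions, while the two mode names and the `KVBM_REQUIRE_CUDA` behavior come from this file.

```rust
/// The two build modes named in this document.
#[derive(Debug, PartialEq, Eq)]
enum BuildMode {
    FromSource,
    Stubs,
}

/// Hypothetical decision function: `nvcc_found` says whether nvcc was
/// located, `require_cuda` mirrors the KVBM_REQUIRE_CUDA variable.
fn select_build_mode(nvcc_found: bool, require_cuda: bool) -> Result<BuildMode, String> {
    match (nvcc_found, require_cuda) {
        // nvcc present: compile the CUDA sources.
        (true, _) => Ok(BuildMode::FromSource),
        // nvcc missing but CUDA is required: fail the build.
        (false, true) => Err("KVBM_REQUIRE_CUDA is set but nvcc was not found".to_string()),
        // nvcc missing and CUDA is optional: fall back to C stubs.
        (false, false) => Ok(BuildMode::Stubs),
    }
}
```

Keeping the decision in a pure function like this makes the fallback rules easy to unit test without a CUDA toolchain installed.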

### Core transfer API (always available)

These live in `src/tensor_kernels.rs` and work on any device-visible memory (device allocations or pinned host via unified addressing):

- **`vectorized_copy`** — Batched copy of `(src, dst)` pointer pairs. Per-pair runtime alignment detection selects the widest safe vector width (int4/int2/int/char for 16/8/4/1-byte loads).
- **`memcpy_batch`** — Takes HOST arrays of src/dst pointers. Dispatches to `cudaMemcpyBatchAsync` (CUDA 12.9+), falling back to a loop of individual `cudaMemcpyAsync` calls when the batch API is unavailable. Three modes: `BatchedWithFallback`, `FallbackOnly`, `BatchWithoutFallback`.
- **`is_using_stubs`** / **`is_memcpy_batch_available`** — Runtime capability queries.
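The two selection rules above can be sketched on the host side as follows. This is illustrative only. It assumes that the copy length must also be a multiple of the vector width, and that the three modes behave as their names suggest. `widest_vector_width`, `CopyPath`, and `choose_copy_path` are hypothetical names, and the real kernels make the width decision on the device.

```rust
/// Widest vector width (in bytes) that is safe for one (src, dst) pair.
/// Assumption: both addresses and the length must be multiples of the width.
fn widest_vector_width(src: usize, dst: usize, len: usize) -> usize {
    // OR-ing the values keeps every misaligned low bit from any of them.
    let combined = src | dst | len;
    [16usize, 8, 4]
        .iter()
        .copied()
        .find(|&w| combined % w == 0)
        .unwrap_or(1) // byte-wise copy always works
}

/// Mode names as listed in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemcpyBatchMode {
    BatchedWithFallback,
    FallbackOnly,
    BatchWithoutFallback,
}

/// Hypothetical result type: which copy path a call would take.
#[derive(Debug, PartialEq, Eq)]
enum CopyPath {
    Batch, // one cudaMemcpyBatchAsync call
    Loop,  // one cudaMemcpyAsync call per pair
}

/// Assumed semantics of the three modes, inferred from their names.
fn choose_copy_path(mode: MemcpyBatchMode, batch_available: bool) -> Result<CopyPath, String> {
    match (mode, batch_available) {
        (MemcpyBatchMode::FallbackOnly, _) => Ok(CopyPath::Loop),
        (MemcpyBatchMode::BatchedWithFallback, true) => Ok(CopyPath::Batch),
        (MemcpyBatchMode::BatchedWithFallback, false) => Ok(CopyPath::Loop),
        (MemcpyBatchMode::BatchWithoutFallback, true) => Ok(CopyPath::Batch),
        (MemcpyBatchMode::BatchWithoutFallback, false) => {
            Err("batch API unavailable and fallback disabled".to_string())
        }
    }
}
```

In the real library, the `batch_available` input would presumably come from `is_memcpy_batch_available`.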

### Permute kernels (feature-gated: `permute_kernels`)

These fuse layout permutation with copy for non-standard transfer paths:

- **`universal_from_block`** / **`block_from_universal`** — Permute between block stack layout (`nl*no` separate allocations, each `[nt, nh, hd]` NHD or `[nh, nt, hd]` HND) and universal layout (contiguous `[nh, nl, no, nt, hd]`).
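To make the index mapping concrete, here is a CPU reference sketch of the permutation between the two layouts. It is not the CUDA implementation. The type and function names are hypothetical, and the ordering of the `nl*no` block allocations (`layer * no + outer`) is an assumption; only the shapes come from the description above.

```rust
#[derive(Clone, Copy)]
enum Layout {
    Nhd, // block shaped [nt, nh, hd]
    Hnd, // block shaped [nh, nt, hd]
}

#[derive(Clone, Copy)]
struct Dims { nl: usize, no: usize, nh: usize, nt: usize, hd: usize }

/// Flat offset into the contiguous universal tensor [nh, nl, no, nt, hd].
fn universal_offset(d: Dims, h: usize, l: usize, o: usize, t: usize, e: usize) -> usize {
    (((h * d.nl + l) * d.no + o) * d.nt + t) * d.hd + e
}

/// Flat offset inside one block allocation.
fn block_offset(d: Dims, layout: Layout, h: usize, t: usize, e: usize) -> usize {
    match layout {
        Layout::Nhd => (t * d.nh + h) * d.hd + e,
        Layout::Hnd => (h * d.nt + t) * d.hd + e,
    }
}

/// Gather nl*no blocks into one universal buffer.
/// Assumption: the block for (layer, outer) sits at index `l * no + o`.
fn universal_from_blocks(d: Dims, layout: Layout, blocks: &[Vec<f32>]) -> Vec<f32> {
    let mut universal = vec![0.0; d.nh * d.nl * d.no * d.nt * d.hd];
    for l in 0..d.nl {
        for o in 0..d.no {
            let block = &blocks[l * d.no + o];
            for h in 0..d.nh {
                for t in 0..d.nt {
                    for e in 0..d.hd {
                        universal[universal_offset(d, h, l, o, t, e)] =
                            block[block_offset(d, layout, h, t, e)];
                    }
                }
            }
        }
    }
    universal
}

/// Scatter a universal buffer back into nl*no blocks (inverse of the above).
fn blocks_from_universal(d: Dims, layout: Layout, universal: &[f32]) -> Vec<Vec<f32>> {
    let mut blocks = vec![vec![0.0; d.nt * d.nh * d.hd]; d.nl * d.no];
    for l in 0..d.nl {
        for o in 0..d.no {
            for h in 0..d.nh {
                for t in 0..d.nt {
                    for e in 0..d.hd {
                        blocks[l * d.no + o][block_offset(d, layout, h, t, e)] =
                            universal[universal_offset(d, h, l, o, t, e)];
                    }
                }
            }
        }
    }
    blocks
}
```

A reference like this could be used to check GPU roundtrip results element by element on small shapes.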

### Source organization

- `cuda/tensor_kernels.cu` — All CUDA kernels. C++ templates on dtype (F16/BF16/F32/F64) and layout (NHD/HND), exposed via `extern "C"` functions prefixed `kvbm_kernels_launch_*` / `kvbm_kernels_memcpy_batch`.
- `cuda/stubs.c` — Abort-on-call fallbacks for all `extern "C"` symbols.
- `src/tensor_kernels.rs` — Rust FFI wrappers, enums (`TensorDataType`, `BlockLayout`, `MemcpyBatchMode`), and integration tests.
- `examples/kvbench.rs` — Benchmark harness (Llama 3.1 70B profile, CSV output).
- `scripts/plot_roofline.py` — Roofline bandwidth plots from kvbench output.

### Dimension conventions

`nl` = layers, `no` = outer chunks (2: K and V), `nh` = attention heads, `nt` = tokens per block, `hd` = head dimension.
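These dimensions determine buffer sizes directly. The sketch below computes the byte size of one block allocation and of the full universal tensor. The `DType` and `Dims` types are hypothetical stand-ins rather than the crate's `TensorDataType`; the element sizes are the standard widths of F16/BF16/F32/F64.

```rust
/// Hypothetical stand-in for the supported element types.
#[derive(Clone, Copy)]
enum DType { F16, BF16, F32, F64 }

impl DType {
    /// Size of one element in bytes.
    fn size_bytes(self) -> usize {
        match self {
            DType::F16 | DType::BF16 => 2,
            DType::F32 => 4,
            DType::F64 => 8,
        }
    }
}

/// Hypothetical container for the dimensions defined above.
struct Dims { nl: usize, no: usize, nh: usize, nt: usize, hd: usize }

/// Bytes in one block allocation ([nt, nh, hd] or [nh, nt, hd]).
fn block_bytes(d: &Dims, dtype: DType) -> usize {
    d.nt * d.nh * d.hd * dtype.size_bytes()
}

/// Bytes in the universal tensor [nh, nl, no, nt, hd], which equals
/// the combined size of all nl*no block allocations.
fn universal_bytes(d: &Dims, dtype: DType) -> usize {
    d.nl * d.no * block_bytes(d, dtype)
}
```

For example, with illustrative values `nl = 2`, `no = 2`, `nh = 8`, `nt = 16`, `hd = 128` and F16 elements, each block is 32768 bytes and the universal tensor is 131072 bytes.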

### Pointer conventions

All pointer-list parameters (e.g. `universal_ptrs`, `src_ptrs`) must be device-accessible: allocated via `cudaMalloc` (device memory) or `cudaMallocHost` / `cuMemHostRegister` (pinned/registered/page-locked host memory).

### Cargo features

| Feature | Purpose |
|---------|---------|
| `permute_kernels` | Enable fused permute-and-copy kernels (block<->universal) |
| `testing-cuda` | Enable CUDA integration tests |
| `static-kernels` | Link as `.a` instead of `.so` |
| `kvbench` | Enable benchmark example (pulls in `clap`) |

### Test organization

- `tests/stub_build.rs` — Verifies stub behavior (gated on `stub_kernels`).
- `tests/memcpy_batch.rs` — Core transfer API roundtrip tests (H2D + D2H via pinned host memory). Gated on `testing-cuda`.
- `tests/kernel_roundtrip.rs` — Permute kernel roundtrip tests across all dtypes and layouts. Gated on `testing-cuda` + `permute_kernels`.
- Inline tests in `src/tensor_kernels.rs` — Integration tests including `universal_roundtrip`. Gated on `testing-cuda` + `permute_kernels`.