## Dynamo KV Block Manager Kernels GPU kernels for converting KV cache blocks between three memory layouts used by LLM inference frameworks. All conversions run entirely on-device via fused CUDA kernels. ### Dimensions | Symbol | Meaning | Example | |--------|--------------------------------|------------------| | `nb` | Number of blocks in the batch | 1–128 | | `nl` | Number of layers | 32 (Llama-70B) | | `no` | Outer chunks (K and V) | 2 | | `nh` | Number of attention heads | 32 or 64 | | `nt` | Tokens per block | 128 or 256 | | `hd` | Head dimension | 128 | ### Layouts #### Block Stack (NHD or HND) `nl * no` separate GPU allocations per block. Each allocation holds one layer's keys or values. - **NHD shape**: `[nt, nh, hd]` — index: `(nt_idx * nh + nh_idx) * hd + hd_idx` - **HND shape**: `[nh, nt, hd]` — index: `(nh_idx * nt + nt_idx) * hd + hd_idx` Passed to kernels as a flat pointer table of length `nb * nl * no`. #### Operational Single contiguous buffer per block: `[nl, no, inner]` where `inner = nt * nh * hd`. The three innermost dimensions (`nt`, `nh`, `hd`) are fused into one `inner` dimension. When no layout permutation is needed (same TP config, same head layout), block-to-operational is a flat copy — the cheapest conversion. Transforming to/from other layouts requires knowing the constituent dimensions. #### Universal Single contiguous buffer per block: `[nh, nl, no, nt, hd]`. Heads are the outermost dimension so that tensor-parallelism resharding is a contiguous slice along `nh`. A block saved from a TP=4 deployment can be loaded into TP=8 by slicing the head dimension differently. ### Layout Cheat Sheet | Layout | Logical Shape | Stored As | Notes | |---------------------|----------------------------|------------------------------------|-------------------------------| | NHD block stack | `[nl][no][nt, nh, hd]` | list of `nl * no` pointers | Inner layout = NHD | | HND block stack | `[nl][no][nh, nt, hd]` | list of `nl * no` pointers | Inner layout = HND | | Operational block | `[nl, no, inner]` | contiguous buffer per block | `inner = nt * nh * hd` | | Universal block | `[nh, nl, no, nt, hd]` | contiguous buffer per block | Heads outermost for TP slicing | ### Kernel Functions All kernels are batched: a single launch processes `nb` blocks from flat pointer tables prepared by host code. #### Layout permutation kernels | C API | Conversion | |----------------------------------------------|-----------------------------| | `kvbm_kernels_launch_universal_from_block` | Block stack → Universal | | `kvbm_kernels_launch_block_from_universal` | Universal → Block stack | Both accept `layout_value` (NHD=0, HND=1) and `dtype_value` (F16=0, BF16=1, F32=2, F64=3). Internally dispatched to C++ template kernels specialized on dtype and layout. #### Standalone copy utilities | C API | Description | |------------------------------------------|----------------------------------------------------------| | `kvbm_kernels_launch_vectorized_copy` | Adaptive vectorized copy (16/8/4-byte or scalar) across `num_pairs` pointer pairs | | `kvbm_kernels_memcpy_batch` | Batched `cudaMemcpyAsync` from host pointer arrays | | `kvbm_kernels_has_memcpy_batch_async` | Returns `true` if `cudaMemcpyBatchAsync` is available | | `kvbm_kernels_is_stub_build` | Returns `true` if built without CUDA (stub mode) | ### Python Bindings (Planned) Python kernel bindings are not yet implemented. The `lib/bindings/kvbm/` crate currently exposes block manager functionality only. Future work will add Python wrappers for the permute and copy kernels. ### Development ```bash # Default build (auto-detects nvcc → source; no nvcc → stubs) cargo build # Custom GPU architectures CUDA_ARCHS="80,86,89,90,100" cargo build # Static linking cargo build --features static-kernels # Run CUDA integration tests (requires GPU + nvcc) cargo test --features testing-cuda,permute_kernels # Specific test with output cargo test --features testing-cuda,permute_kernels fused_copy_roundtrip -- --nocapture # Python bindings cd lib/bindings/kvbm uv pip install -e ".[dev]" pytest tests/ ``` **Environment variables**: `CUDA_ARCHS` (comma-separated SM versions, default `80,86,89,90,100,120`), `CUDA_PATH`/`CUDA_HOME` (toolkit root), `KVBM_REQUIRE_CUDA` (fail build if nvcc missing). ### Benchmarking ```text root@9eb240f7ded8:/workspace/lib/kvbm-kernels# cargo run --release --example kvbench --features testing-cuda,kvbench -- --num-blocks=1,128 --tokens-per-block=16,64 -- backend vectorized,batched --direction h2d ... Running `/workspace/target/release/examples/kvbench --num-blocks=1,128 --tokens-per-block=16,64 --backend vectorized,batched --direction h2d` KV Cache Transfer Benchmark Model: Llama 3.1 70B (bf16) Layers: 80, KV heads: 8, Head dim: 128, Outer dim: 2 Warmup: 10, Timed: 100 Batch API available: true tokens_per_block: [16, 64] num_blocks: [1, 128] directions: [h2d] patterns: [fc_to_fc, lw_to_fc] backends: [vectorized, batched] Total tests: 16 tokens_per_block,num_blocks,pattern,direction,backend,total_bytes,inner_bytes,copy_size,num_copies,median_ms,bandwidth_gbps --- tokens_per_block=16, inner=32768 bytes (32 KB), block=5242880 bytes (5.0 MB) --- [1/16] tpb=16 N= 1 fc_to_fc h2d vectorized ... 16,1,fc_to_fc,h2d,vectorized,5242880,32768,5242880,1,1.8686,2.81 2.81 GB/s (1.8686 ms) [2/16] tpb=16 N= 1 fc_to_fc h2d batched ... 16,1,fc_to_fc,h2d,batched,5242880,32768,5242880,1,0.2105,24.91 24.91 GB/s (0.2105 ms) [3/16] tpb=16 N= 1 lw_to_fc h2d vectorized ... 16,1,lw_to_fc,h2d,vectorized,5242880,32768,32768,160,0.2171,24.15 24.15 GB/s (0.2171 ms) [4/16] tpb=16 N= 1 lw_to_fc h2d batched ... 16,1,lw_to_fc,h2d,batched,5242880,32768,32768,160,0.2775,18.89 18.89 GB/s (0.2775 ms) [5/16] tpb=16 N=128 fc_to_fc h2d vectorized ... 16,128,fc_to_fc,h2d,vectorized,671088640,32768,5242880,128,26.6097,25.22 25.22 GB/s (26.6097 ms) [6/16] tpb=16 N=128 fc_to_fc h2d batched ... 16,128,fc_to_fc,h2d,batched,671088640,32768,5242880,128,26.6180,25.21 25.21 GB/s (26.6180 ms) [7/16] tpb=16 N=128 lw_to_fc h2d vectorized ... 16,128,lw_to_fc,h2d,vectorized,671088640,32768,32768,20480,26.6034,25.23 25.23 GB/s (26.6034 ms) [8/16] tpb=16 N=128 lw_to_fc h2d batched ... 16,128,lw_to_fc,h2d,batched,671088640,32768,32768,20480,30.3346,22.12 22.12 GB/s (30.3346 ms) --- tokens_per_block=64, inner=131072 bytes (128 KB), block=20971520 bytes (20.0 MB) --- [9/16] tpb=64 N= 1 fc_to_fc h2d vectorized ... 64,1,fc_to_fc,h2d,vectorized,20971520,131072,20971520,1,7.5837,2.77 2.77 GB/s (7.5837 ms) [10/16] tpb=64 N= 1 fc_to_fc h2d batched ... 64,1,fc_to_fc,h2d,batched,20971520,131072,20971520,1,0.8334,25.16 25.16 GB/s (0.8334 ms) [11/16] tpb=64 N= 1 lw_to_fc h2d vectorized ... 64,1,lw_to_fc,h2d,vectorized,20971520,131072,131072,160,0.8407,24.95 24.95 GB/s (0.8407 ms) [12/16] tpb=64 N= 1 lw_to_fc h2d batched ... 64,1,lw_to_fc,h2d,batched,20971520,131072,131072,160,0.9020,23.25 23.25 GB/s (0.9020 ms) [13/16] tpb=64 N=128 fc_to_fc h2d vectorized ... 64,128,fc_to_fc,h2d,vectorized,2684354560,131072,20971520,128,106.3677,25.24 25.24 GB/s (106.3677 ms) [14/16] tpb=64 N=128 fc_to_fc h2d batched ... 64,128,fc_to_fc,h2d,batched,2684354560,131072,20971520,128,106.3199,25.25 25.25 GB/s (106.3199 ms) [15/16] tpb=64 N=128 lw_to_fc h2d vectorized ... 64,128,lw_to_fc,h2d,vectorized,2684354560,131072,131072,20480,106.3158,25.25 25.25 GB/s (106.3158 ms) [16/16] tpb=64 N=128 lw_to_fc h2d batched ... 64,128,lw_to_fc,h2d,batched,2684354560,131072,131072,20480,110.0665,24.39 24.39 GB/s (110.0665 ms) Done. ``` ### Troubleshooting | Symptom | Likely Cause / Fix | |---------------------------------------|--------------------------------------------------------------------| | `cudaErrorInvalidValue` on launch | Pointer counts mismatch (`nb`, `nl`, `no`) or non-contiguous input | | Wrong values when using HND layout | Inner tensors not shaped as `[nh, nt, hd]` before passing in | | Python bindings complain about dtype | Mixed precision in a batch; convert tensors to a common dtype | | Kernels take unexpected time | Verify that `CUDA_ARCHS` matches your GPU to avoid JIT at runtime |