# Migration Guide: block_manager to kvbm-physical Guide for migrating from `dynamo-llm::block_manager` (v1) to `kvbm-physical`. ## Overview `kvbm-physical` is a ground-up rewrite of the physical transfer layer from `lib/llm/src/block_manager/`. The core data flow is the same (register layouts, exchange metadata, execute transfers), but `kvbm-physical` adds block format awareness, richer transfer options, and a cleaner separation between logical tiers and physical handles. Both implementations use the same `vectorized_copy` CUDA kernel. The original embeds it in a `.fatbin` (`lib/llm/src/block_manager/block/transfer/kernels/vectorized_copy.fatbin`) loaded via `cuModuleLoadData`. `kvbm-physical` wraps the same kernel via the `kvbm-kernels` crate with explicit Rust FFI for transparency and testability. ## Type mapping table | Original (block_manager) | kvbm-physical | Notes | |--------------------------|---------------|-------| | `TransportManager` | `TransferManager` | Same role, richer API | | `LayoutHandle` | `LayoutHandle` | Same concept; encoding changed — see LayoutHandle docs for details | | `PhysicalLayout` + builder | `PhysicalLayout` + builder | Same pattern; adds `with_external_device_regions()` | | `LayoutConfig` | `LayoutConfig` | Same fields + optional `num_heads` | | `TransferOptions` | `TransferOptions` | Adds `cuda_stream`, `src_kv_layout`, `dst_kv_layout` | | `TransferCapabilities` | `TransferCapabilities` | Same | | `TransferPreferences` | `TransferPreferences` | Same | | `SerializedLayout` | `SerializedLayout` | Same wire format concept | | `WorkerAddress` | `WorkerAddress` | Same | | `TransferCompleteNotification` (oneshot) | `TransferCompleteNotification` (`Either`/`EventAwaiter`) | Zero-cost sync path | | `BounceBufferSpec` (trait object) | `BounceBuffer` (concrete struct) | Simpler, no heap allocation | | N/A | `LogicalLayoutDescriptor` | **New** — tier bridging | | N/A | `KvBlockLayout` | **New** — block format awareness | | N/A | `KvBlocks` | **New** — grouped blocks with layout override | | `CudaBlockingH2D` / `CudaBlockingD2H` | Removed | Async-only; `.await` for sync behavior | | `OperationalCopyBackend` | Removed | Replaced by `kvbm_kernels` direct FFI | ## What kvbm-physical adds ### LogicalLayoutDescriptor Bridges `LayoutHandle` (physical) to `LogicalLayoutHandle` (G1/G2/G3/G4 tier). This is the key new abstraction for multi-worker coordination: callers say "copy from G1 to G2" while `TransferManager` resolves worker-specific handles. ```rust,ignore // Build descriptor for RDMA exchange let descriptor = manager.build_logical_descriptor(gpu_handle, LogicalLayoutHandle::G1)?; ``` ### KvBlockLayout Five named block formats plus `Custom` and `Unknown`. Enables type-driven kernel selection for transfers between different dimension orderings. ```rust,ignore let needs_permute = src_layout.requires_transform(&dst_layout); ``` ### kvbm-kernels FFI The `kvbm_kernels` crate provides `memcpy_batch` using CUDA 12.9+ batch API with automatic fallback to individual copies. This replaces the fatbin-loading approach with direct Rust FFI. ### Stream pooling 4 H2D + 4 D2H streams with round-robin selection, replacing the original 1+1 stream pair. Reduces contention for concurrent transfers. ### Caller-provided CUDA stream `TransferOptions::cuda_stream` lets the caller pass in a stream. The executor skips event recording; the caller manages synchronization. Useful for layer-wise transfers where all layers must execute on the same stream. ```rust,ignore let stream = manager.context().acquire_h2d_stream(); let options = TransferOptions::builder() .cuda_stream(stream.clone()) .build()?; ``` ### CudaMemPool Device memory pool for kernel temporary allocations (permute buffers, etc.). Configured via `TransferConfig`: ```rust,ignore TransferManager::builder() .cuda_pool_reserve_size(64 * 1024 * 1024) // 64 MiB pre-allocated .cuda_pool_release_threshold(Some(64 * 1024 * 1024)) // free above this .build()?; ``` ### TransferCompleteNotification::aggregate() Compose multiple transfer notifications into one that completes when all are done. Optimizes away the aggregation when all inputs are already complete. ```rust,ignore let combined = TransferCompleteNotification::aggregate( vec![n1, n2, n3], manager.context().event_system(), &tokio::runtime::Handle::current(), )?; combined.await?; ``` ### src/dst kv_layout overrides `TransferOptions` now supports overriding the source and destination block layout interpretation, enabling cross-format transfers without modifying the registered layout. ```rust,ignore let options = TransferOptions::builder() .src_kv_layout(KvBlockLayout::OperationalNHD) .dst_kv_layout(KvBlockLayout::UniversalTP) .build()?; ``` ## What was intentionally removed ### Blocking CUDA strategies `CudaBlockingH2D` and `CudaBlockingD2H` are removed. All transfers are async. For synchronous behavior, just `.await` immediately: ```rust,ignore // v1 (blocking) let result = blocking_h2d_transfer(...); // kvbm-physical (async, but can be used synchronously) let notification = manager.execute_transfer(...)?; notification.await?; ``` ### OperationalCopyBackend enum The `OperationalCopyBackend` enum (which selected between different kernel loading strategies) is removed. `kvbm-physical` uses `kvbm_kernels` direct FFI exclusively, making kernel dispatch transparent. ### Trait object bounce buffer `BounceBufferSpec` (a trait object requiring heap allocation) is replaced by `BounceBuffer`, a concrete struct wrapping a `LayoutHandle` + block IDs: ```rust,ignore // v1 struct MyBounce { layout: PhysicalLayout, blocks: Vec } impl BounceBufferSpec for MyBounce { ... } // kvbm-physical let bounce = BounceBuffer::from_handle(host_handle, vec![0, 1, 2, 3]); ``` ## Migration steps ### 1. Replace TransportManager with TransferManager The builder pattern is the same. `TransferManager::builder()` returns the same kind of fluent builder. ```rust,ignore // v1 let manager = TransportManager::builder() .worker_id(0) .nixl_backend("ucx") .cuda_device_id(0) .build()?; // kvbm-physical let manager = TransferManager::builder() .nixl_backend("ucx") .cuda_device_id(0) .build()?; // worker_id is now derived from the event system ``` ### 2. Replace TransferOptions Add new fields as needed. Existing `layer_range` and `nixl_write_notification` work the same way. ```rust,ignore // v1 let options = TransferOptions::builder() .layer_range(0..16) .build()?; // kvbm-physical (same, with optional new fields) let options = TransferOptions::builder() .layer_range(0..16) .cuda_stream(stream) // new: caller-managed stream .src_kv_layout(layout) // new: format override .build()?; ``` ### 3. Replace BounceBufferSpec with BounceBuffer ```rust,ignore // v1 — trait object let spec: Box = Box::new(MyBounce::new(layout, blocks)); options.bounce_buffer(spec); // kvbm-physical — concrete type let bounce = BounceBuffer::from_handle(host_handle, block_ids); let options = TransferOptions::builder() .bounce_buffer(bounce) .build()?; ``` ### 4. Replace TransferCompleteNotification await pattern The notification now implements `IntoFuture` directly instead of wrapping a oneshot channel. ```rust,ignore // v1 let notification = manager.execute_transfer(...)?; notification.recv().await??; // kvbm-physical let notification = manager.execute_transfer(...)?; notification.await?; ``` ### 5. Add LogicalLayoutDescriptor for multi-worker tier resolution If you coordinate transfers across multiple workers by tier name (G1, G2, etc.), use `LogicalLayoutDescriptor`: ```rust,ignore // Build descriptors that include tier information let g1_desc = manager.build_logical_descriptor(gpu_handle, LogicalLayoutHandle::G1)?; let g2_desc = manager.build_logical_descriptor(host_handle, LogicalLayoutHandle::G2)?; // Remote workers can now resolve "copy G1 to G2" to the correct physical handles ``` ### 6. Consider KvBlockLayout annotations for cross-format transfers If your transfers involve blocks stored in different dimension orderings (e.g., operational NHD from the engine vs. universal TP for storage), annotate with `KvBlockLayout`: ```rust,ignore let options = TransferOptions::builder() .src_kv_layout(KvBlockLayout::OperationalNHD) .dst_kv_layout(KvBlockLayout::UniversalTP) .build()?; ``` This tells the executor to select a permute kernel instead of a direct copy.