# Migration Guide: block_manager to kvbm-physical

Guide for migrating from `dynamo-llm::block_manager` (v1) to `kvbm-physical`.

## Overview

`kvbm-physical` is a ground-up rewrite of the physical transfer layer from `lib/llm/src/block_manager/`. The core data flow is the same (register layouts, exchange metadata, execute transfers), but `kvbm-physical` adds block format awareness, richer transfer options, and a cleaner separation between logical tiers and physical handles.

Both implementations use the same `vectorized_copy` CUDA kernel. The original embeds it in a `.fatbin` (`lib/llm/src/block_manager/block/transfer/kernels/vectorized_copy.fatbin`) loaded via `cuModuleLoadData`. `kvbm-physical` wraps the same kernel via the `kvbm-kernels` crate with explicit Rust FFI for transparency and testability.

## Type mapping table

| Original (block_manager) | kvbm-physical | Notes |
|--------------------------|---------------|-------|
| `TransportManager` | `TransferManager` | Same role, richer API |
| `LayoutHandle` | `LayoutHandle` | Same concept; encoding changed — see LayoutHandle docs for details |
| `PhysicalLayout` + builder | `PhysicalLayout` + builder | Same pattern; adds `with_external_device_regions()` |
| `LayoutConfig` | `LayoutConfig` | Same fields + optional `num_heads` |
| `TransferOptions` | `TransferOptions` | Adds `cuda_stream`, `src_kv_layout`, `dst_kv_layout` |
| `TransferCapabilities` | `TransferCapabilities` | Same |
| `TransferPreferences` | `TransferPreferences` | Same |
| `SerializedLayout` | `SerializedLayout` | Same wire format concept |
| `WorkerAddress` | `WorkerAddress` | Same |
| `TransferCompleteNotification` (oneshot) | `TransferCompleteNotification` (`Either`/`EventAwaiter`) | Zero-cost sync path |
| `BounceBufferSpec` (trait object) | `BounceBuffer` (concrete struct) | Simpler, no heap allocation |
| N/A | `LogicalLayoutDescriptor` | **New** — tier bridging |
| N/A | `KvBlockLayout` | **New** — block format awareness |
| N/A | `KvBlocks` | **New** — grouped blocks with layout override |
| `CudaBlockingH2D` / `CudaBlockingD2H` | Removed | Async-only; `.await` for sync behavior |
| `OperationalCopyBackend` | Removed | Replaced by `kvbm_kernels` direct FFI |

## What kvbm-physical adds

### LogicalLayoutDescriptor

Bridges `LayoutHandle` (physical) to `LogicalLayoutHandle` (G1/G2/G3/G4 tier). This is the key new abstraction for multi-worker coordination: callers say "copy from G1 to G2" while `TransferManager` resolves worker-specific handles.

```rust,ignore
// Build descriptor for RDMA exchange
let descriptor = manager.build_logical_descriptor(gpu_handle, LogicalLayoutHandle::G1)?;
```
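The resolution step can be pictured as a lookup keyed by worker and tier. The sketch below is a minimal model under stated assumptions: `Tier`, `PhysHandle`, `Descriptor`, and `TierRegistry` are hypothetical stand-ins invented for illustration, not types exported by `kvbm-physical`.

```rust
use std::collections::HashMap;

/// The four logical tiers named in this guide.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Tier { G1, G2, G3, G4 }

/// Hypothetical opaque physical handle (stand-in for a real layout handle).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PhysHandle(pub u64);

/// Hypothetical descriptor: which worker owns which handle at which tier.
#[derive(Debug, Clone, Copy)]
pub struct Descriptor {
    pub worker: u32,
    pub tier: Tier,
    pub handle: PhysHandle,
}

/// Maps (worker, tier) pairs to the physical handle registered for them.
#[derive(Default)]
pub struct TierRegistry {
    by_key: HashMap<(u32, Tier), PhysHandle>,
}

impl TierRegistry {
    /// Record a descriptor received from a local or remote worker.
    pub fn register(&mut self, d: Descriptor) {
        self.by_key.insert((d.worker, d.tier), d.handle);
    }

    /// Resolve "copy src tier to dst tier" on one worker into physical handles.
    /// Returns None if either tier has no registered layout for that worker.
    pub fn resolve(&self, worker: u32, src: Tier, dst: Tier) -> Option<(PhysHandle, PhysHandle)> {
        let s = *self.by_key.get(&(worker, src))?;
        let d = *self.by_key.get(&(worker, dst))?;
        Some((s, d))
    }
}
```

The point of the indirection is that callers only name tiers; the per-worker handle values never appear in the calling code.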

### KvBlockLayout

`KvBlockLayout` defines five named block formats plus `Custom` and `Unknown`. It enables type-driven kernel selection for transfers between layouts with different dimension orderings.

```rust,ignore
let needs_permute = src_layout.requires_transform(&dst_layout);
```
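To make "type-driven kernel selection" concrete, here is a minimal sketch that treats a block format as an ordering of axes and picks a kernel by comparing the two orderings. `Axis`, `DimOrder`, `CopyKernel`, and `select_kernel` are hypothetical names for illustration only; the real `KvBlockLayout` variants and their comparison rules may differ.

```rust
/// Hypothetical axis labels for the dimensions of one KV block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Axis { Tokens, Heads, HeadDim }

/// Illustrative stand-in for a block format: an ordering of the three axes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DimOrder(pub [Axis; 3]);

/// The two kinds of copy this sketch distinguishes.
#[derive(Debug, PartialEq, Eq)]
pub enum CopyKernel { Direct, Permute }

impl DimOrder {
    /// A transform is needed whenever the axis orderings differ.
    pub fn requires_transform(&self, dst: &DimOrder) -> bool {
        self.0 != dst.0
    }
}

/// Choose a plain copy when layouts match, a permuting copy otherwise.
pub fn select_kernel(src: &DimOrder, dst: &DimOrder) -> CopyKernel {
    if src.requires_transform(dst) {
        CopyKernel::Permute
    } else {
        CopyKernel::Direct
    }
}
```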

### kvbm-kernels FFI

The `kvbm_kernels` crate provides `memcpy_batch`, which uses the CUDA 12.9+ batch API and automatically falls back to individual copies. This replaces the fatbin-loading approach with direct Rust FFI.
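The batch-then-fallback control flow can be sketched on the CPU. Everything below is a hypothetical model: `CopyOp`, `BatchUnsupported`, `try_batch`, and `memcpy_batch_sketch` are illustrative names, plain slices stand in for device memory, and the `batch_available` flag stands in for whatever capability check the real crate performs.

```rust
/// One copy request: source offset, destination offset, and length in bytes.
#[derive(Debug, Clone, Copy)]
pub struct CopyOp {
    pub src: usize,
    pub dst: usize,
    pub len: usize,
}

/// Hypothetical error for a driver that lacks the batch entry point.
#[derive(Debug)]
pub struct BatchUnsupported;

/// Perform a single copy between the two flat buffers.
fn copy_one(src: &[u8], dst: &mut [u8], op: CopyOp) {
    dst[op.dst..op.dst + op.len].copy_from_slice(&src[op.src..op.src + op.len]);
}

/// Stand-in for a batched submission: all ops in one call, or an error.
fn try_batch(
    src: &[u8],
    dst: &mut [u8],
    ops: &[CopyOp],
    batch_available: bool,
) -> Result<usize, BatchUnsupported> {
    if !batch_available {
        return Err(BatchUnsupported);
    }
    for op in ops {
        copy_one(src, dst, *op);
    }
    Ok(1) // the whole batch counts as one submission
}

/// Try the batch path first; fall back to one submission per copy.
/// Returns the number of submissions issued.
pub fn memcpy_batch_sketch(
    src: &[u8],
    dst: &mut [u8],
    ops: &[CopyOp],
    batch_available: bool,
) -> usize {
    match try_batch(src, dst, ops, batch_available) {
        Ok(submissions) => submissions,
        Err(BatchUnsupported) => {
            for op in ops {
                copy_one(src, dst, *op);
            }
            ops.len()
        }
    }
}
```

Both paths produce the same bytes in the destination; only the number of submissions differs, which is where a batch API saves launch overhead.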

### Stream pooling

`kvbm-physical` keeps four H2D and four D2H streams and selects among them round-robin, replacing the original single H2D/D2H stream pair. This reduces contention between concurrent transfers.
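Round-robin selection over a fixed pool is commonly implemented with an atomic counter. The following is a minimal sketch of that technique, assuming a hypothetical `StreamPool` and a `StreamId` placeholder instead of real CUDA stream objects; the actual pool in `kvbm-physical` may be structured differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Placeholder for a CUDA stream; a real pool would hold driver stream objects.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StreamId(pub usize);

/// Fixed-size pool that hands out streams in round-robin order.
pub struct StreamPool {
    streams: Vec<StreamId>,
    next: AtomicUsize,
}

impl StreamPool {
    /// Create a pool with `count` streams (e.g. 4 for H2D, 4 for D2H).
    pub fn new(count: usize) -> Self {
        assert!(count > 0, "pool needs at least one stream");
        Self {
            streams: (0..count).map(StreamId).collect(),
            next: AtomicUsize::new(0),
        }
    }

    /// Return the next stream in rotation; safe to call from multiple threads.
    pub fn acquire(&self) -> StreamId {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.streams.len();
        self.streams[i]
    }
}
```

With one pool per direction, two concurrent transfers in the same direction land on different streams until the pool wraps around.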

### Caller-provided CUDA stream

`TransferOptions::cuda_stream` lets the caller pass in a stream. When a stream is supplied, the executor skips event recording and the caller is responsible for synchronization. This is useful for layer-wise transfers where all layers must execute on the same stream.

```rust,ignore
let stream = manager.context().acquire_h2d_stream();
let options = TransferOptions::builder()
    .cuda_stream(stream.clone())
    .build()?;
```

### CudaMemPool

Device memory pool for kernel temporary allocations (permute buffers, etc.). Configured via `TransferConfig`:

```rust,ignore
TransferManager::builder()
    .cuda_pool_reserve_size(64 * 1024 * 1024)         // 64 MiB pre-allocated
    .cuda_pool_release_threshold(Some(64 * 1024 * 1024)) // free above this
    .build()?;
```

### TransferCompleteNotification::aggregate()

Compose multiple transfer notifications into one that completes when all are done. Optimizes away the aggregation when all inputs are already complete.

```rust,ignore
let combined = TransferCompleteNotification::aggregate(
    vec![n1, n2, n3],
    manager.context().event_system(),
    &tokio::runtime::Handle::current(),
)?;
combined.await?;
```
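The short-circuit behavior can be illustrated without an async runtime. The sketch below uses threads and `std::sync::mpsc` channels in place of the real event system; `Notification` and `aggregate` here are hypothetical stand-ins that only mirror the idea, not the crate's actual types or signatures.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical notification: either already complete or waiting on a signal.
pub enum Notification {
    Done,
    Pending(Receiver<()>),
}

impl Notification {
    pub fn is_done(&self) -> bool {
        matches!(self, Notification::Done)
    }

    /// Block until the transfer completes (a sync analogue of `.await`).
    pub fn wait(self) -> Result<(), String> {
        match self {
            Notification::Done => Ok(()),
            Notification::Pending(rx) => rx.recv().map_err(|e| e.to_string()),
        }
    }
}

/// Combine notifications into one that completes when all inputs have.
pub fn aggregate(inputs: Vec<Notification>) -> Notification {
    // Keep only the inputs that are still in flight.
    let pending: Vec<Receiver<()>> = inputs
        .into_iter()
        .filter_map(|n| match n {
            Notification::Done => None,
            Notification::Pending(rx) => Some(rx),
        })
        .collect();

    // Fast path: nothing to wait for, so no channel or thread is created.
    if pending.is_empty() {
        return Notification::Done;
    }

    let (tx, rx) = channel();
    thread::spawn(move || {
        for p in pending {
            // If a sender vanished without signaling, drop `tx` so the
            // waiter observes an error instead of hanging.
            if p.recv().is_err() {
                return;
            }
        }
        let _ = tx.send(());
    });
    Notification::Pending(rx)
}
```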

### src/dst kv_layout overrides

`TransferOptions` now supports overriding the source and destination block layout interpretation, enabling cross-format transfers without modifying the registered layout.

```rust,ignore
let options = TransferOptions::builder()
    .src_kv_layout(KvBlockLayout::OperationalNHD)
    .dst_kv_layout(KvBlockLayout::UniversalTP)
    .build()?;
```

## What was intentionally removed

### Blocking CUDA strategies

`CudaBlockingH2D` and `CudaBlockingD2H` are removed. All transfers are async. To get blocking-style behavior, `.await` the returned notification immediately:

```rust,ignore
// v1 (blocking)
let result = blocking_h2d_transfer(...);

// kvbm-physical (async, but can be used synchronously)
let notification = manager.execute_transfer(...)?;
notification.await?;
```

### OperationalCopyBackend enum

The `OperationalCopyBackend` enum (which selected between different kernel loading strategies) is removed. `kvbm-physical` uses `kvbm_kernels` direct FFI exclusively, making kernel dispatch transparent.

### Trait object bounce buffer

`BounceBufferSpec` (a trait object requiring heap allocation) is replaced by `BounceBuffer`, a concrete struct wrapping a `LayoutHandle` + block IDs:

```rust,ignore
// v1
struct MyBounce { layout: PhysicalLayout, blocks: Vec<BlockId> }
impl BounceBufferSpec for MyBounce { ... }

// kvbm-physical
let bounce = BounceBuffer::from_handle(host_handle, vec![0, 1, 2, 3]);
```

## Migration steps

### 1. Replace TransportManager with TransferManager

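The builder pattern is unchanged: `TransferManager::builder()` returns the same kind of fluent builder. The one difference is that `worker_id` is no longer set explicitly, because it is derived from the event system.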

```rust,ignore
// v1
let manager = TransportManager::builder()
    .worker_id(0)
    .nixl_backend("ucx")
    .cuda_device_id(0)
    .build()?;

// kvbm-physical
let manager = TransferManager::builder()
    .nixl_backend("ucx")
    .cuda_device_id(0)
    .build()?;
// worker_id is now derived from the event system
```

### 2. Replace TransferOptions

The existing `layer_range` and `nixl_write_notification` options work the same way as before. Add the new fields only where you need them.

```rust,ignore
// v1
let options = TransferOptions::builder()
    .layer_range(0..16)
    .build()?;

// kvbm-physical (same, with optional new fields)
let options = TransferOptions::builder()
    .layer_range(0..16)
    .cuda_stream(stream)        // new: caller-managed stream
    .src_kv_layout(layout)      // new: format override
    .build()?;
```

### 3. Replace BounceBufferSpec with BounceBuffer

```rust,ignore
// v1 — trait object
let spec: Box<dyn BounceBufferSpec> = Box::new(MyBounce::new(layout, blocks));
options.bounce_buffer(spec);

// kvbm-physical — concrete type
let bounce = BounceBuffer::from_handle(host_handle, block_ids);
let options = TransferOptions::builder()
    .bounce_buffer(bounce)
    .build()?;
```

### 4. Replace TransferCompleteNotification await pattern

The notification now implements `IntoFuture` directly instead of wrapping a oneshot channel, so the `recv()` call and the second `?` are no longer needed.

```rust,ignore
// v1
let notification = manager.execute_transfer(...)?;
notification.recv().await??;

// kvbm-physical
let notification = manager.execute_transfer(...)?;
notification.await?;
```

### 5. Add LogicalLayoutDescriptor for multi-worker tier resolution

If you coordinate transfers across multiple workers by tier name (G1, G2, etc.), use `LogicalLayoutDescriptor`:

```rust,ignore
// Build descriptors that include tier information
let g1_desc = manager.build_logical_descriptor(gpu_handle, LogicalLayoutHandle::G1)?;
let g2_desc = manager.build_logical_descriptor(host_handle, LogicalLayoutHandle::G2)?;

// Remote workers can now resolve "copy G1 to G2" to the correct physical handles
```

### 6. Consider KvBlockLayout annotations for cross-format transfers

If your transfers involve blocks stored in different dimension orderings (e.g., operational NHD from the engine vs. universal TP for storage), annotate with `KvBlockLayout`:

```rust,ignore
let options = TransferOptions::builder()
    .src_kv_layout(KvBlockLayout::OperationalNHD)
    .dst_kv_layout(KvBlockLayout::UniversalTP)
    .build()?;
```

This tells the executor to select a permute kernel instead of a direct copy.
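For intuition about what a permute does compared with a direct copy, here is a CPU-only sketch that reorders a flat `[tokens, heads, head_dim]` buffer into `[heads, tokens, head_dim]`. The function name and the target ordering are assumptions chosen for illustration; this is not the `kvbm_kernels` implementation, and the actual `UniversalTP` format is not specified in this guide.

```rust
/// Reorder a flat [tokens, heads, head_dim] buffer into [heads, tokens, head_dim].
/// Hypothetical helper: the innermost head_dim run stays contiguous, so each
/// (token, head) pair moves as one slice copy.
pub fn permute_nhd_to_hnd(
    src: &[f32],
    tokens: usize,
    heads: usize,
    head_dim: usize,
) -> Vec<f32> {
    assert_eq!(src.len(), tokens * heads * head_dim, "shape mismatch");
    let mut dst = vec![0.0; src.len()];
    for n in 0..tokens {
        for h in 0..heads {
            // Source index: token-major, then head.
            let s = (n * heads + h) * head_dim;
            // Destination index: head-major, then token.
            let d = (h * tokens + n) * head_dim;
            dst[d..d + head_dim].copy_from_slice(&src[s..s + head_dim]);
        }
    }
    dst
}
```

A direct copy would move the buffer byte for byte; the permute changes where each `head_dim` run lands, which is why the executor needs to know both layouts before it can choose a kernel.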