feat: adding kvbm-engine (#6773)

Signed-off-by: Ryan Olson <rolson@nvidia.com>

Unverified commit 008683d6, authored by Ryan Olson on Apr 08, 2026 and committed by GitHub on Apr 08, 2026
20 changed files
--- a/lib/kvbm-engine/bin/bench_engine.rs
+++ b/lib/kvbm-engine/bin/bench_engine.rs
--- a/lib/kvbm-engine/docs/architecture.md
+++ b/lib/kvbm-engine/docs/architecture.md
+# kvbm-engine
+`kvbm-engine` provides distributed coordination primitives for KV Block Management (KVBM).
+It implements a tiered storage model where KV cache blocks flow between GPU memory, host
+DRAM, local disk, and object storage. The crate coordinates leaders (which own block
+metadata and make placement decisions) with workers (which execute data transfers via
+RDMA, NVMe, or object storage APIs).
+## Storage Tier Model
+| Tier | Medium | Latency | Capacity | Description |
+|------|--------|---------|----------|-------------|
+| G1 | GPU HBM | ~ns | Smallest | Active KV cache used by attention kernels |
+| G2 | Pinned DRAM | ~us | Medium | Staging area for RDMA transfers and tier promotion |
+| G3 | NVMe/SSD | ~ms | Large | Persistent warm-block storage |
+| G4 | S3/MinIO | ~100ms | Unlimited | Cold/archival object storage |
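The tier ordering in the table above can be sketched as a small enum. This is illustrative only: `Tier`, `demote`, `promote`, and `medium` are hypothetical names rather than the crate's types, and single-step movement between neighbouring tiers is a simplifying assumption (the offload module also lists a direct G2->G4 path).

```rust
/// Hypothetical model of the four storage tiers; not the crate's actual type.
/// Variant order follows the table: hottest (G1) to coldest (G4).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    G1,
    G2,
    G3,
    G4,
}

impl Tier {
    /// Storage medium as listed in the table.
    fn medium(self) -> &'static str {
        match self {
            Tier::G1 => "GPU HBM",
            Tier::G2 => "Pinned DRAM",
            Tier::G3 => "NVMe/SSD",
            Tier::G4 => "S3/MinIO",
        }
    }

    /// Next colder tier, or `None` when already at G4.
    fn demote(self) -> Option<Tier> {
        match self {
            Tier::G1 => Some(Tier::G2),
            Tier::G2 => Some(Tier::G3),
            Tier::G3 => Some(Tier::G4),
            Tier::G4 => None,
        }
    }

    /// Next hotter tier, or `None` when already at G1.
    fn promote(self) -> Option<Tier> {
        match self {
            Tier::G1 => None,
            Tier::G2 => Some(Tier::G1),
            Tier::G3 => Some(Tier::G2),
            Tier::G4 => Some(Tier::G3),
        }
    }
}
```

Deriving `Ord` lets callers compare tiers directly, so `Tier::G1 < Tier::G4` reads as "G1 is hotter than G4".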
+## Architecture
+```text
+                    +-----------------+
+                    | InstanceLeader  |
+                    |  (find_matches, |
+                    |   BlockAccessor)|
+                    +--------+--------+
+                             |
+               +-------------+-------------+
+               |                           |
+      +--------v--------+        +--------v--------+
+      | CoordinatedWorker|       | CoordinatedWorker|
+      |   (rank 0)       |       |   (rank 1)       |
+      +--------+---------+       +--------+---------+
+               |                           |
+      +--------v--------+        +--------v--------+
+      | PhysicalWorker   |       | PhysicalWorker   |
+      | (TransferManager)|       | (TransferManager)|
+      +-----------------+        +-----------------+
+```
+The leader drives workers through the `ParallelWorkers` trait (`SpmdParallelWorkers`
+for SPMD execution). For onboarding, the leader creates sessions that progress through
+stages: search, hold, prepare (G3->G2), and pull (remote G2->local G2 via RDMA).
+## Modules
+| Module | Purpose |
+|--------|---------|
+| `leader` | Block coordination: matching, onboarding sessions, policy-based scanning |
+| `worker` | Transfer execution: local, RDMA, and object storage data movement |
+| `object` | G4 storage: S3/MinIO client for cold-tier block persistence |
+| `offload` | Tier demotion pipeline: batched G2->G3 and G2->G4 offloading |
+| `runtime` | Shared infrastructure: `KvbmRuntime`, tokio handle, NIXL agent |
+| `pubsub` | Event pub/sub: block-level notifications for cross-instance coordination |
+| `collectives` | NCCL collectives for multi-GPU synchronization (feature-gated) |
+| `testing` | Test utilities: mock workers, in-memory block managers (feature-gated) |
+## Feature Flags
+| Flag | Dependencies | Description |
+|------|-------------|-------------|
+| `default` | `["s3"]` | Default features |
+| `s3` | `aws-sdk-s3`, `aws-config`, `rayon`, `tokio-rayon`, `chrono` | S3/MinIO object storage support |
+| `collectives` | `nixl-sys`, `nccl` | NIXL + NCCL multi-GPU collectives |
+| `nccl` | `cudarc` | NCCL support via cudarc |
+| `testing-nccl` | `collectives` | Enable collectives for tests |
+| `nats` | `async-nats`, `flume` | NATS-based pub/sub transport |
+| `testing` | `kvbm-logical/testing`, `kvbm-physical/testing` | Test utilities and mock infrastructure |
+| `nvtx` | `kvbm-config/nvtx` | NVIDIA Tools Extension profiling markers |
+## Quick Start
+```rust,ignore
+use kvbm_engine::{KvbmRuntime, leader::InstanceLeader};
+// Build runtime from environment
+let runtime = KvbmRuntime::from_env_leader().await?;
+// Create a leader instance
+let leader = InstanceLeader::new(/* ... */);
+// Search for cached blocks
+let result = leader.find_matches(&sequence_hashes)?;
+```
--- a/lib/kvbm-engine/docs/leader.md
+++ b/lib/kvbm-engine/docs/leader.md
+# Leader Module
+The leader module implements block coordination for a single KVBM instance. It owns
+block metadata (via `BlockManager<G2>` and `BlockManager<G3>`), resolves cache lookups,
+and orchestrates multi-stage onboarding sessions that move blocks between storage tiers
+and across instances.
+## Leader Trait
+The `Leader` trait defines the core coordination interface:
+```rust,ignore
+pub trait Leader: Send + Sync {
+    fn find_matches(&self, sequence_hashes: &[SequenceHash]) -> Result<FindMatchesResult>;
+    fn find_matches_with_options(
+        &self, sequence_hashes: &[SequenceHash], options: FindMatchesOptions,
+    ) -> Result<FindMatchesResult>;
+}
+```
+`find_matches` searches for blocks matching the given sequence hashes and returns
+either an immediate result or an async session depending on the staging mode and
+search scope.
+## InstanceLeader
+`InstanceLeader` is the primary implementation of `Leader`. It holds:
+- `BlockManager<G2>` and optional `BlockManager<G3>` for local block registries
+- A `ParallelWorkers` instance for driving transfer execution
+- Session state for active onboarding operations
+- Remote leader connections for cross-instance coordination
+## FindMatchesResult
+The result of `find_matches` is one of two variants:
+- **`Ready`** -- Returned when `search_remote == false` AND `staging_mode == Hold`.
+  Blocks are held in place via RAII without creating a session. The `ReadyResult`
+  directly owns `Vec<ImmutableBlock<G2>>`.
+- **`AsyncSession`** -- Returned when remote search or staging is required. Contains
+  a `SessionId`, a `watch::Receiver<OnboardingStatus>` for progress tracking, and
+  an optional `SessionHandle` for deferred control.
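The rule that selects between the two variants can be written as a single predicate. The sketch below is a minimal model under stated assumptions: `ResultKind` and `result_kind` are hypothetical names, and the real variants carry payloads (held blocks, session id, status receiver) that are omitted here.

```rust
/// Mirrors the three staging modes named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StagingMode {
    Hold,
    Prepare,
    Full,
}

/// Hypothetical payload-free stand-in for `FindMatchesResult`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResultKind {
    Ready,
    AsyncSession,
}

/// `Ready` only for a local-only search in `Hold` mode; every other
/// combination needs remote search or staging, hence a session.
fn result_kind(search_remote: bool, staging_mode: StagingMode) -> ResultKind {
    if !search_remote && staging_mode == StagingMode::Hold {
        ResultKind::Ready
    } else {
        ResultKind::AsyncSession
    }
}
```

For example, `result_kind(false, StagingMode::Hold)` yields `ResultKind::Ready`, while any call with `search_remote == true` yields `ResultKind::AsyncSession`.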
+## StagingMode
+Controls how matched blocks are staged and when the session completes:
+| Mode | Behavior | Session Lifetime |
+|------|----------|-----------------|
+| `Hold` | Blocks remain in their current tiers (G2/G3) on original instances | Stays alive for deferred operations |
+| `Prepare` | G3->G2 staging on all instances; no RDMA pulls | Stays alive after staging completes |
+| `Full` | G3->G2 everywhere, then RDMA pull remote G2->local G2 | Completes when all blocks are in local G2 |
+The progression `Hold -> Prepare -> Full` can be driven incrementally via
+`SessionHandle::prepare()` and `SessionHandle::pull()`.
+## OnboardingStatus State Machine
+```text
+Searching
+    |
+    +---> Holding { local_g2, local_g3, remote_g2, remote_g3, pending_g4, ... }
+    |         |
+    |         +---> (prepare) ---> Preparing { matched, staging_local, staging_remote }
+    |                                  |
+    +---> Preparing ------------------>+
+    |                                  |
+    |                            Prepared { local_g2, remote_g2 }
+    |                                  |
+    |                                  +---> (pull) ---> Staging { matched, ..., pulling }
+    |                                                        |
+    +---> Staging ------------------------------------------>+
+                                                             |
+                                                        Complete { matched_blocks }
+```
+Each status variant carries counters for progress tracking and cost analysis.
+`Holding` includes G4 load tracking (`pending_g4`, `loaded_g4`, `failed_g4`).
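One way to read the diagram is as a transition table. The sketch below encodes that reading with payload-free variants; `Status`, `next_states`, and `can_transition` are hypothetical names, and the edges are an interpretation of the arrows above rather than something taken from the implementation.

```rust
/// Payload-free stand-in for `OnboardingStatus` (counters omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Searching,
    Holding,
    Preparing,
    Prepared,
    Staging,
    Complete,
}

/// States reachable in one step, as read from the diagram.
fn next_states(from: Status) -> &'static [Status] {
    use Status::*;
    match from {
        // The search can resolve into any of the three staging paths.
        Searching => &[Holding, Preparing, Staging],
        // `prepare` moves a held session into G3->G2 staging.
        Holding => &[Preparing],
        Preparing => &[Prepared],
        // `pull` starts the RDMA transfer of remote G2 blocks.
        Prepared => &[Staging],
        Staging => &[Complete],
        Complete => &[],
    }
}

/// True when the diagram has a direct edge `from -> to`.
fn can_transition(from: Status, to: Status) -> bool {
    next_states(from).contains(&to)
}
```

Under this reading, a `Hold` session that is driven step by step visits `Searching`, `Holding`, `Preparing`, `Prepared`, `Staging`, and `Complete` in that order.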
+## SessionHandle
+`SessionHandle` provides deferred control over `Hold` and `Prepare` sessions:
+- `prepare()` -- Trigger G3->G2 staging (status moves `Holding` -> `Preparing` -> `Prepared`)
+- `pull()` -- Trigger RDMA pull of remote G2->local G2 (status moves `Prepared` -> `Staging` -> `Complete`)
+- `cancel()` -- Cancel session and release all held blocks
+Not available for `StagingMode::Full` (which runs to completion automatically).
+## BlockAccessor
+`BlockAccessor` provides a stateless, `Send + Sync` interface for policy-based
+block scanning. Each `find()` call independently searches G2 then G3, acquiring
+blocks via RAII. The companion `PolicyContext` adds result collection via
+`yield_item()` for streaming scan results back to the caller.
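The G2-then-G3 lookup order and the RAII hold can be illustrated with plain standard-library types. This is a sketch under assumptions: `SketchAccessor`, `FoundIn`, `HeldBlock`, and `fill` are hypothetical, hashes are modelled as `u64`, and an `Arc` clone stands in for the real block guard.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Tier in which a block was found (hypothetical; mirrors G2/G3 from the text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FoundIn {
    G2,
    G3,
}

/// Stand-in for a block; holding the `Arc` plays the role of the RAII hold.
type HeldBlock = Arc<Vec<u8>>;

/// Hypothetical accessor over two in-memory registries keyed by a u64 hash.
struct SketchAccessor {
    g2: HashMap<u64, HeldBlock>,
    g3: HashMap<u64, HeldBlock>,
}

/// Builds a registry with one empty block per listed hash.
fn fill(hashes: &[u64]) -> HashMap<u64, HeldBlock> {
    hashes.iter().map(|h| (*h, Arc::new(Vec::new()))).collect()
}

impl SketchAccessor {
    fn with_hashes(g2: &[u64], g3: &[u64]) -> Self {
        SketchAccessor { g2: fill(g2), g3: fill(g3) }
    }

    /// Each call is independent: check G2 first, then fall back to G3.
    /// The returned `Arc` keeps the block alive until the caller drops it.
    fn find(&self, hash: u64) -> Option<(FoundIn, HeldBlock)> {
        if let Some(block) = self.g2.get(&hash) {
            return Some((FoundIn::G2, Arc::clone(block)));
        }
        self.g3.get(&hash).map(|block| (FoundIn::G3, Arc::clone(block)))
    }
}
```

Because `find` takes `&self` and keeps no per-call state, the sketch has the same stateless shape the paragraph describes; dropping the returned value releases the hold.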
--- a/lib/kvbm-engine/docs/object.md
+++ b/lib/kvbm-engine/docs/object.md
+# Object Storage Module
+The object module provides traits and implementations for storing KV cache
+blocks in object storage systems (S3, MinIO). This corresponds to the G4
+(object store) tier in the storage hierarchy.
+## ObjectBlockOps Trait
+The primary trait for block-level object storage operations:
+| Method | Purpose |
+|--------|---------|
+| `has_blocks(keys)` | Check existence and size of blocks |
+| `put_blocks(keys, src_layout, block_ids)` | Upload blocks using logical layout handle |
+| `get_blocks(keys, dst_layout, block_ids)` | Download blocks using logical layout handle |
+| `put_blocks_with_layout(keys, layout, block_ids)` | Upload using resolved physical layout |
+| `get_blocks_with_layout(keys, layout, block_ids)` | Download using resolved physical layout |
+### Logical vs Physical Layout
+The trait offers two APIs for put/get:
+- **Logical** (`put_blocks` / `get_blocks`): Takes a `LogicalLayoutHandle` (G1, G2, G3).
+  Workers resolve this to their own physical layout internally. Used by the leader
+  (which doesn't have physical layouts) and by `CoordinatedWorker`.
+- **Physical** (`put_blocks_with_layout` / `get_blocks_with_layout`): Takes a resolved
+  `PhysicalLayout` directly. Used by `PhysicalWorker` after resolving its handles, and
+  by `S3ObjectBlockClient` which performs the actual I/O.
+## Key Formatting
+Keys map `SequenceHash` values to object storage paths:
+- **`DefaultKeyFormatter`**: Uses the hash's Display representation
+  (e.g., `0:abc123`). Suitable for single-worker scenarios.
+- **`RankPrefixedKeyFormatter`**: Prefixes with worker rank
+  (e.g., `0/0:abc123`). Required for SPMD workers where multiple workers
+  store the same logical block with different physical data.
+The `create_key_formatter(rank)` factory returns the appropriate formatter.
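The two key shapes can be reproduced with a small trait. The sketch below is not the crate's API: `KeySketch`, `PlainKey`, `RankKey`, and `make_formatter` are hypothetical, the hash is passed as its already-rendered Display string, and taking `Option<usize>` for the rank is an assumption about how the factory chooses.

```rust
/// Hypothetical formatter trait; takes the hash's Display string as input.
trait KeySketch {
    fn format_key(&self, hash_display: &str) -> String;
}

/// Single-worker shape: the key is the hash's Display form, e.g. `0:abc123`.
struct PlainKey;

/// SPMD shape: the key is prefixed with the worker rank, e.g. `0/0:abc123`.
struct RankKey {
    rank: usize,
}

impl KeySketch for PlainKey {
    fn format_key(&self, hash_display: &str) -> String {
        hash_display.to_string()
    }
}

impl KeySketch for RankKey {
    fn format_key(&self, hash_display: &str) -> String {
        format!("{}/{}", self.rank, hash_display)
    }
}

/// Assumed selection rule: a rank yields the prefixed formatter,
/// no rank yields the plain one.
fn make_formatter(rank: Option<usize>) -> Box<dyn KeySketch> {
    match rank {
        Some(rank) => Box::new(RankKey { rank }),
        None => Box::new(PlainKey),
    }
}
```

The rank prefix keeps two workers from overwriting each other's objects when they store different physical data for the same logical block.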
+## ObjectLockManager
+Distributed locking protocol for coordinated offloads to prevent duplicate
+uploads:
+```text
+has_meta(hash)
+  → true  → skip (already offloaded)
+  → false → try_acquire_lock(hash)
+              → true  → transfer → create_meta(hash) → release_lock(hash)
+              → false → skip (another instance owns it)
+```
+Uses conditional PUT (`If-None-Match: *`) for lock acquisition with deadline-based
+expiry for stale lock recovery.
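The decision tree above can be exercised against an in-memory stand-in for the object store. Everything in this sketch is hypothetical (`MockStore`, `Outcome`, `offload_once`, the `u64` hashes, and the logical clock used for deadlines); it models the conditional PUT as "insert only if no live lock exists" and is not the S3 implementation.

```rust
use std::collections::{HashMap, HashSet};

/// What one offload attempt did.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    AlreadyOffloaded,
    Uploaded,
    LockedByOther,
}

/// In-memory stand-in for the object store (hypothetical).
#[derive(Default)]
struct MockStore {
    meta: HashSet<u64>,
    /// hash -> lock deadline, in logical time units.
    locks: HashMap<u64, u64>,
}

impl MockStore {
    fn has_meta(&self, hash: u64) -> bool {
        self.meta.contains(&hash)
    }

    /// Models the conditional PUT: fails while an unexpired lock exists,
    /// otherwise takes (or takes over) the lock with a fresh deadline.
    fn try_acquire_lock(&mut self, hash: u64, now: u64, ttl: u64) -> bool {
        match self.locks.get(&hash) {
            Some(&deadline) if deadline > now => false,
            _ => {
                self.locks.insert(hash, now + ttl);
                true
            }
        }
    }

    fn create_meta(&mut self, hash: u64) {
        self.meta.insert(hash);
    }

    fn release_lock(&mut self, hash: u64) {
        self.locks.remove(&hash);
    }
}

/// Runs the protocol once: skip if offloaded, skip if locked, else transfer.
fn offload_once(
    store: &mut MockStore,
    hash: u64,
    now: u64,
    ttl: u64,
    transfer: impl FnOnce(),
) -> Outcome {
    if store.has_meta(hash) {
        return Outcome::AlreadyOffloaded;
    }
    if !store.try_acquire_lock(hash, now, ttl) {
        return Outcome::LockedByOther;
    }
    transfer();
    store.create_meta(hash);
    store.release_lock(hash);
    Outcome::Uploaded
}
```

Writing the metadata before releasing the lock means a later caller either sees the metadata and skips, or sees a live lock and skips, so the upload runs at most once unless a lock expires mid-transfer.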
+## S3 Implementation
+The `s3` submodule (feature-gated behind `s3`) provides:
+- **`S3ObjectBlockClient`**: Implements `ObjectBlockOps` for S3-compatible storage.
+  Supports concurrent uploads/downloads via `rayon` thread pool and contiguous
+  memory fast paths for aligned block data.
+- **`S3LockManager`**: Implements `ObjectLockManager` using S3 conditional writes.
+## Factory Functions
+- **`create_object_client(config, rank)`**: Creates an `Arc<dyn ObjectBlockOps>`
+  from configuration. Selects the backend (S3 or future alternatives) based on
+  `ObjectClientConfig`.
+- **`create_lock_manager(config, instance_id)`**: Creates an
+  `Arc<dyn ObjectLockManager>` for distributed lock coordination.
--- a/lib/kvbm-engine/docs/offload-developer.md
+++ b/lib/kvbm-engine/docs/offload-developer.md
--- a/lib/kvbm-engine/docs/offload.md
+++ b/lib/kvbm-engine/docs/offload.md
+# Offload Module
+The offload module manages the asynchronous transfer of KV cache blocks between storage tiers. It provides a pipeline-based architecture for evaluating, batching, and executing block transfers with full cancellation support.
+## Overview
+Offloading moves blocks from a source tier (e.g., GPU memory) to a destination tier (e.g., host memory, remote storage, or object storage). The pipeline ensures:
+- **Policy-based filtering**: Only blocks meeting criteria are transferred
+- **Batched execution**: Blocks are grouped for efficient transfer
+- **Cancellation support**: Transfers can be cancelled at any point before commitment
+- **Precondition synchronization**: Transfers wait for forward pass completion
+## Pipeline Architecture
+```text
+┌─────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐     ┌──────────────────┐
+│ PolicyEvaluator │────►│ PreconditionAwaiter │────►│       Batcher       │────►│ TransferExecutor │
+└─────────────────┘     └─────────────────────┘     └─────────────────────┘     └──────────────────┘
+                                                             ▲                          ▲
+                                                             │                          │
+                                                    CancellableQueue          CancellableQueue
+                                                             │                          │
+                                                             └──────── CancelSweeper ───┘
+```
+### Stages
+| Stage | Purpose |
+|-------|---------|
+| **PolicyEvaluator** | Filters blocks based on configured policies (frequency, presence, etc.) |
+| **PreconditionAwaiter** | Waits for forward pass completion before proceeding |
+| **Batcher** | Groups containers into batches based on total block count |
+| **TransferExecutor** | Upgrades weak block references to strong ones, then executes the transfer |
+## Container Data Model
+The fundamental unit flowing through the pipeline is an **OffloadContainer**:
+```rust,ignore
+struct OffloadContainer<T: BlockMetadata> {
+    /// The blocks to offload
+    blocks: Vec<SourceBlock<T>>,
+    /// Precondition event (forward pass completion)
+    precondition: Option<EventHandle>,
+    /// Cancellation token
+    cancel_token: CancellationToken,
+}
+```
+Containers are grouped into batches for efficient transfer:
+```rust,ignore
+struct OffloadBatch<T: BlockMetadata> {
+    /// Multiple containers, each independently cancellable
+    containers: Vec<OffloadContainer<T>>,
+}
+```
+### P1: Container is the Unit of Cancellation
+Individual blocks within a container are not independently cancellable. When a container is cancelled, all its blocks are cancelled together.
+### P2: Token Travels with Container
+Each container carries its own `CancellationToken`, cloned from the `TransferHandle` at enqueue time. The token travels with the container through all pipeline stages until upgrade.
+### P3: Upgrade is the Commitment Boundary
+The upgrade step (Weak → Strong) is the point of no return:
+- **Before upgrade**: Containers can be cancelled via sweep or token check
+- **After upgrade**: We own the blocks; cancellation no longer applies
+### P4: Sweep Before Upgrade
+The last cancellation check occurs immediately before upgrade. The `TransferExecutor` calls `batch.sweep_cancelled()` to remove cancelled containers before committing.
+### P5: Flat Map After Upgrade
+After upgrade, all blocks from all containers are consolidated into a single `Vec<ImmutableBlock<T>>` for efficient batch transfer. Per-container identity is lost at this point.
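Principles P3 to P5 can be sketched with the standard library's `Weak`/`Arc` pair. The types below are hypothetical stand-ins: an `AtomicBool` replaces the `CancellationToken`, `String` replaces the block type, and silently skipping a block whose upgrade fails is an assumption of this sketch.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Weak};

/// Hypothetical container: weak block references plus a cancel flag.
struct Container {
    blocks: Vec<Weak<String>>,
    cancelled: Arc<AtomicBool>,
}

/// Hypothetical batch of independently cancellable containers.
struct Batch {
    containers: Vec<Container>,
}

impl Batch {
    /// P4: drop every container whose flag was set before commitment.
    fn sweep_cancelled(&mut self) {
        self.containers
            .retain(|c| !c.cancelled.load(Ordering::Acquire));
    }

    /// P3 + P5: sweep, upgrade Weak -> Arc, and flatten into one Vec.
    /// Container boundaries are gone in the result.
    fn upgrade_and_flatten(mut self) -> Vec<Arc<String>> {
        self.sweep_cancelled();
        self.containers
            .into_iter()
            .flat_map(|c| c.blocks.into_iter())
            .filter_map(|weak| weak.upgrade())
            .collect()
    }
}
```

Until `upgrade_and_flatten` runs, the batch holds no strong references, so a cancelled container costs nothing to discard; afterwards the returned `Vec` owns the blocks and cancellation no longer applies.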
+### P6: PreconditionAwaiter Uses Select
+The precondition awaiter can be cancelled via `select!` on both the precondition event and the cancellation token. If cancelled while waiting, the container is dropped immediately.
+## Configuration
+Pipeline behavior is controlled via `PipelineConfig`:
+| Option | Default | Description |
+|--------|---------|-------------|
+| `batch_config.max_batch_size` | 64 | Maximum blocks per batch |
+| `batch_config.min_batch_size` | 8 | Minimum blocks before flush |
+| `batch_config.flush_interval` | 10ms | Time before flushing partial batch |
+| `policy_timeout` | 100ms | Timeout for policy evaluation |
+| `sweep_interval` | 10ms | Interval for cancel sweeper |
+| `max_concurrent_transfers` | 1 | Concurrent transfer batches |
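The defaults in the table can be captured in a `Default` implementation. The struct layout and field types below (`usize` counts, `Duration` intervals, and the names `BatchConfigSketch` and `PipelineConfigSketch`) are assumptions made for illustration; only the option names and default values come from the table.

```rust
use std::time::Duration;

/// Assumed shape of the `batch_config` group.
#[derive(Debug, Clone, PartialEq)]
struct BatchConfigSketch {
    max_batch_size: usize,
    min_batch_size: usize,
    flush_interval: Duration,
}

/// Assumed shape of the pipeline configuration.
#[derive(Debug, Clone, PartialEq)]
struct PipelineConfigSketch {
    batch_config: BatchConfigSketch,
    policy_timeout: Duration,
    sweep_interval: Duration,
    max_concurrent_transfers: usize,
}

impl Default for BatchConfigSketch {
    fn default() -> Self {
        BatchConfigSketch {
            max_batch_size: 64,
            min_batch_size: 8,
            flush_interval: Duration::from_millis(10),
        }
    }
}

impl Default for PipelineConfigSketch {
    fn default() -> Self {
        PipelineConfigSketch {
            batch_config: BatchConfigSketch::default(),
            policy_timeout: Duration::from_millis(100),
            sweep_interval: Duration::from_millis(10),
            max_concurrent_transfers: 1,
        }
    }
}
```

With this shape, a single option can be overridden through struct update syntax, for example `PipelineConfigSketch { max_concurrent_transfers: 2, ..Default::default() }`.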
+## Usage
+### Enqueueing Blocks
+```rust,ignore
+let handle = pipeline.enqueue(source_blocks, precondition_event);
+// Track progress
+println!("Status: {:?}", handle.status());
+// Wait for completion
+let result = handle.wait().await?;
+```
+### Cancelling a Transfer
+```rust,ignore
+// Request cancellation and wait for confirmation
+handle.cancel().await;
+// All blocks are now released
+```
+## Related Documentation
+- [offload-developer.md](offload-developer.md) - Implementation details and extension rules
--- a/lib/kvbm-engine/docs/onboarding.md
+++ b/lib/kvbm-engine/docs/onboarding.md
--- a/lib/kvbm-engine/docs/runtime.md
+++ b/lib/kvbm-engine/docs/runtime.md
+# Runtime
+The `KvbmRuntime` is the composed shared infrastructure for KVBM operations. It bundles
+the minimal set of components that all downstream managers and services need:
+- **Tokio runtime** -- async execution context (owned or borrowed handle)
+- **Messenger (Velo)** -- distributed RPC for leader/worker communication and peer discovery
+- **NixlAgent** -- RDMA/UCX data transfers (optional, disabled when NIXL config is absent)
+- **EventManager** -- worker coordination and transfer completion notifications (accessed via Messenger)
+## Construction
+Two convenience constructors cover the common cases:
+```rust,ignore
+// Leader role (reads KVBM_* env vars + TOML files)
+let runtime = KvbmRuntime::from_env_leader().await?;
+// Worker role
+let runtime = KvbmRuntime::from_env_worker().await?;
+```
+For tests or custom setups, use the builder:
+```rust,ignore
+let config = KvbmConfig::from_env()?;
+let runtime = KvbmRuntime::builder(config)
+    .with_runtime_handle(Handle::current())   // inject existing tokio runtime
+    .with_messenger(messenger)                // inject pre-built Messenger
+    .with_nixl_agent(agent)                   // inject pre-built NixlAgent
+    .build_leader()
+    .await?;
+```
+`KvbmRuntimeBuilder::from_json(json)` is the primary entrypoint for vLLM's
+`kv_connector_extra_config` dict -- JSON values have highest priority, overriding
+env vars, TOML files, and defaults.
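The layering described above amounts to picking the first source that supplies a value. The helper below is a hypothetical sketch, not the builder's code; the text only states that JSON wins, so ranking env vars above TOML files is an assumption taken from the order in which they are listed.

```rust
/// Resolves one setting from layered sources.
/// Assumed precedence: JSON, then env var, then TOML file, then default.
fn resolve<T>(json: Option<T>, env: Option<T>, toml: Option<T>, default: T) -> T {
    json.or(env).or(toml).unwrap_or(default)
}
```

For instance, `resolve(None, Some(2), Some(3), 4)` returns `2` because no JSON value is present and the env layer is consulted next.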
+## Component access
+| Method              | Returns                      | Notes                                 |
+|---------------------|------------------------------|---------------------------------------|
+| `handle()` / `tokio()` | `tokio::runtime::Handle`  | Borrowed or owned runtime handle      |
+| `messenger()`       | `&Arc<Messenger>`            | Velo RPC                              |
+| `nixl_agent()`      | `Option<&NixlAgent>`        | `None` when NIXL is disabled in config |
+| `event_system()`    | `Arc<velo::EventManager>`   | From Messenger, used for transfer notifications |
+| `config()`          | `&KvbmConfig`               | Full configuration snapshot            |
+## RuntimeHandle
+`RuntimeHandle` is an enum that abstracts over owned (`Arc<Runtime>`) and borrowed
+(`Handle`) tokio runtimes. The builder creates an owned runtime from config when none
+is injected.
--- a/lib/kvbm-engine/docs/session.md
+++ b/lib/kvbm-engine/docs/session.md
--- a/lib/kvbm-engine/docs/testing.md
+++ b/lib/kvbm-engine/docs/testing.md
+# Testing Module
+Test infrastructure for the kvbm-engine crate. Core block and token utilities
+are re-exported from `kvbm_logical::testing` and `kvbm_physical::testing`;
+this module adds engine-specific helpers for transport, sessions, offload
+pipelines, and multi-instance scenarios.
+## Test Helpers
+### TestManagerBuilder / TestRegistryBuilder
+Create test block managers and registries with synthetic physical layouts.
+`TestManagerBuilder` produces a `BlockManager<T>` backed by mock memory.
+`TestRegistryBuilder` produces a `BlockRegistry` pre-populated with hashes.
+Use `populate_manager_with_blocks` and `create_and_populate_manager` to
+quickly set up managers with pre-allocated blocks for testing.
+### MessengerPair
+Creates a pair of connected Velo `Messenger` instances for transport
+testing without a real network. Messages sent through one messenger are
+received by the other, enabling end-to-end session testing in a single
+process.
+```rust,ignore
+let (messenger_a, messenger_b) = create_messenger_pair_tcp().await?;
+```
+### TestSession
+Helper for testing distributed session protocols. Sets up the full session
+infrastructure (dispatch maps, transport, channels) for testing
+`InitiatorSession` / `ResponderSession` / `ControllableSession` interactions.
+### EventsPipelineFixture
+Test fixture for the offload pipeline. Provides pre-configured pipeline
+stages, event managers, and block managers for testing policy evaluation,
+batching, and transfer execution in isolation.
+### MultiInstancePopulator
+Sets up multi-instance distributed test scenarios with multiple leaders,
+workers, and block managers. Populates each instance with configurable
+block patterns for testing cross-instance onboarding.
+```rust,ignore
+let populated = MultiInstancePopulator::builder()
+    .instance_count(3)
+    .blocks_per_instance(100)
+    .build()?
+    .populate()
+    .await?;
+```
+### Physical Test Utilities
+`TestAgent` and `TestAgentBuilder` create mock `NixlAgent` instances for
+testing `TransferManager` without real RDMA hardware. `TransferChecksums`
+provides utilities for verifying transfer correctness.
+### Token Block Helpers
+The `token_blocks` module provides utilities for creating test blocks with
+known token sequences, useful for verifying search and match operations.
+## Writing a New Test
+1. Choose the appropriate fixture for your test scope:
+   - Single-instance transfer → `TestManagerBuilder` + `TestAgent`
+   - Session protocol → `TestSession` + `MessengerPair`
+   - Offload pipeline → `EventsPipelineFixture`
+   - Multi-instance → `MultiInstancePopulator`
+2. Build the fixture and populate with test data
+3. Exercise the code under test
+4. Assert on results and verify cleanup (blocks released, sessions closed)
--- a/lib/kvbm-engine/docs/worker-group.md
+++ b/lib/kvbm-engine/docs/worker-group.md
--- a/lib/kvbm-engine/docs/worker.md
+++ b/lib/kvbm-engine/docs/worker.md
--- a/lib/kvbm-engine/scripts/bench_viewer.html
+++ b/lib/kvbm-engine/scripts/bench_viewer.html
--- a/lib/kvbm-engine/scripts/test-s3.sh
+++ b/lib/kvbm-engine/scripts/test-s3.sh
--- a/lib/kvbm-engine/src/collectives/bootstrap.rs
+++ b/lib/kvbm-engine/src/collectives/bootstrap.rs
--- a/lib/kvbm-engine/src/collectives/mod.rs
+++ b/lib/kvbm-engine/src/collectives/mod.rs
--- a/lib/kvbm-engine/src/collectives/nccl.rs
+++ b/lib/kvbm-engine/src/collectives/nccl.rs
--- a/lib/kvbm-engine/src/collectives/stub.rs
+++ b/lib/kvbm-engine/src/collectives/stub.rs
--- a/lib/kvbm-engine/src/leader/accessor.rs
+++ b/lib/kvbm-engine/src/leader/accessor.rs
--- a/lib/kvbm-engine/src/leader/instance.rs
+++ b/lib/kvbm-engine/src/leader/instance.rs