Commit 008683d6 authored by Ryan Olson, committed by GitHub
feat: adding kvbm-engine (#6773)


Signed-off-by: Ryan Olson <rolson@nvidia.com>
parent cf79c4fc
# kvbm-engine
`kvbm-engine` provides distributed coordination primitives for KV Block Management (KVBM).
It implements a tiered storage model where KV cache blocks flow between GPU memory, host
DRAM, local disk, and object storage. The crate coordinates leaders (which own block
metadata and make placement decisions) with workers (which execute data transfers via
RDMA, NVMe, or object storage APIs).
## Storage Tier Model
| Tier | Medium | Latency | Capacity | Description |
|------|--------|---------|----------|-------------|
| G1 | GPU HBM | ~ns | Smallest | Active KV cache used by attention kernels |
| G2 | Pinned DRAM | ~us | Medium | Staging area for RDMA transfers and tier promotion |
| G3 | NVMe/SSD | ~ms | Large | Persistent warm-block storage |
| G4 | S3/MinIO | ~100ms | Unlimited | Cold/archival object storage |
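
The ordering in the table can be captured in a small enum. This is only a sketch: the `Tier` type and its methods are hypothetical names used for illustration and are not claimed to be exported by `kvbm-engine`.

```rust
/// Hypothetical model of the four tiers; declaration order runs hottest to coldest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    G1,
    G2,
    G3,
    G4,
}

impl Tier {
    /// Storage medium, copied from the table above.
    fn medium(self) -> &'static str {
        match self {
            Tier::G1 => "GPU HBM",
            Tier::G2 => "Pinned DRAM",
            Tier::G3 => "NVMe/SSD",
            Tier::G4 => "S3/MinIO",
        }
    }

    /// True when `self` is a slower, larger tier than `other`.
    fn is_colder_than(self, other: Tier) -> bool {
        self > other
    }
}
```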
## Architecture
```text
                +-----------------+
                | InstanceLeader  |
                | (find_matches,  |
                |  BlockAccessor) |
                +--------+--------+
                         |
           +-------------+-------------+
           |                           |
  +--------v--------+         +--------v--------+
  |CoordinatedWorker|         |CoordinatedWorker|
  |    (rank 0)     |         |    (rank 1)     |
  +--------+--------+         +--------+--------+
           |                           |
  +--------v--------+         +--------v--------+
  | PhysicalWorker  |         | PhysicalWorker  |
  |(TransferManager)|         |(TransferManager)|
  +-----------------+         +-----------------+
```
The leader drives workers through the `ParallelWorkers` trait (`SpmdParallelWorkers`
for SPMD execution). For onboarding, the leader creates sessions that progress through
stages: search, hold, prepare (G3->G2), and pull (remote G2->local G2 via RDMA).
## Modules
| Module | Purpose |
|--------|---------|
| `leader` | Block coordination: matching, onboarding sessions, policy-based scanning |
| `worker` | Transfer execution: local, RDMA, and object storage data movement |
| `object` | G4 storage: S3/MinIO client for cold-tier block persistence |
| `offload` | Tier demotion pipeline: batched G2->G3 and G2->G4 offloading |
| `runtime` | Shared infrastructure: `KvbmRuntime`, tokio handle, NIXL agent |
| `pubsub` | Event pub/sub: block-level notifications for cross-instance coordination |
| `collectives` | NCCL collectives for multi-GPU synchronization (feature-gated) |
| `testing` | Test utilities: mock workers, in-memory block managers (feature-gated) |
## Feature Flags
| Flag | Dependencies | Description |
|------|-------------|-------------|
| `default` | `["s3"]` | Default features |
| `s3` | `aws-sdk-s3`, `aws-config`, `rayon`, `tokio-rayon`, `chrono` | S3/MinIO object storage support |
| `collectives` | `nixl-sys`, `nccl` | NIXL + NCCL multi-GPU collectives |
| `nccl` | `cudarc` | NCCL support via cudarc |
| `testing-nccl` | `collectives` | Enable collectives for tests |
| `nats` | `async-nats`, `flume` | NATS-based pub/sub transport |
| `testing` | `kvbm-logical/testing`, `kvbm-physical/testing` | Test utilities and mock infrastructure |
| `nvtx` | `kvbm-config/nvtx` | NVIDIA Tools Extension profiling markers |
## Quick Start
```rust,ignore
use kvbm_engine::{KvbmRuntime, leader::InstanceLeader};
// Build runtime from environment
let runtime = KvbmRuntime::from_env_leader().await?;
// Create a leader instance
let leader = InstanceLeader::new(/* ... */);
// Search for cached blocks
let result = leader.find_matches(&sequence_hashes)?;
```
# Leader Module
The leader module implements block coordination for a single KVBM instance. It owns
block metadata (via `BlockManager<G2>` and `BlockManager<G3>`), resolves cache lookups,
and orchestrates multi-stage onboarding sessions that move blocks between storage tiers
and across instances.
## Leader Trait
The `Leader` trait defines the core coordination interface:
```rust,ignore
pub trait Leader: Send + Sync {
    fn find_matches(&self, sequence_hashes: &[SequenceHash]) -> Result<FindMatchesResult>;

    fn find_matches_with_options(
        &self,
        sequence_hashes: &[SequenceHash],
        options: FindMatchesOptions,
    ) -> Result<FindMatchesResult>;
}
```
`find_matches` searches for blocks matching the given sequence hashes and returns
either an immediate result or an async session depending on the staging mode and
search scope.
## InstanceLeader
`InstanceLeader` is the primary implementation of `Leader`. It holds:
- `BlockManager<G2>` and optional `BlockManager<G3>` for local block registries
- A `ParallelWorkers` instance for driving transfer execution
- Session state for active onboarding operations
- Remote leader connections for cross-instance coordination
## FindMatchesResult
The result of `find_matches` is one of two variants:
- **`Ready`** -- Returned when `search_remote == false` AND `staging_mode == Hold`.
Blocks are held in place via RAII without creating a session. The `ReadyResult`
directly owns `Vec<ImmutableBlock<G2>>`.
- **`AsyncSession`** -- Returned when remote search or staging is required. Contains
a `SessionId`, a `watch::Receiver<OnboardingStatus>` for progress tracking, and
an optional `SessionHandle` for deferred control.
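
The selection rule between the two variants can be written as a single predicate. The sketch below uses simplified stand-in types: `Options` and `ResultKind` are hypothetical, and the real variants carry the payloads described above rather than being plain tags.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StagingMode {
    Hold,
    Prepare,
    Full,
}

/// Hypothetical tag for which `FindMatchesResult` variant would be returned.
#[derive(Debug, PartialEq, Eq)]
enum ResultKind {
    Ready,
    AsyncSession,
}

/// Hypothetical stand-in for the two options that drive the decision.
struct Options {
    search_remote: bool,
    staging_mode: StagingMode,
}

/// `Ready` only for a local-only search in `Hold` mode; a session otherwise.
fn result_kind(opts: &Options) -> ResultKind {
    if !opts.search_remote && opts.staging_mode == StagingMode::Hold {
        ResultKind::Ready
    } else {
        ResultKind::AsyncSession
    }
}
```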
## StagingMode
Controls how matched blocks are staged and when the session completes:
| Mode | Behavior | Session Lifetime |
|------|----------|-----------------|
| `Hold` | Blocks remain in their current tiers (G2/G3) on original instances | Stays alive for deferred operations |
| `Prepare` | G3->G2 staging on all instances; no RDMA pulls | Stays alive after staging completes |
| `Full` | G3->G2 everywhere, then RDMA pull remote G2->local G2 | Completes when all blocks are in local G2 |

The progression `Hold -> Prepare -> Full` can be driven incrementally via
`SessionHandle::prepare()` and `SessionHandle::pull()`.
## OnboardingStatus State Machine
```text
Searching
    |
    +---> Holding { local_g2, local_g3, remote_g2, remote_g3, pending_g4, ... }
    |         |
    |         +---> (prepare) ---> Preparing { matched, staging_local, staging_remote }
    |                                  |
    +---> Preparing ------------------>+
    |                                  |
    |                              Prepared { local_g2, remote_g2 }
    |                                  |
    |                                  +---> (pull) ---> Staging { matched, ..., pulling }
    |                                                        |
    +---> Staging ------------------------------------------>+
                                                             |
                                                         Complete { matched_blocks }
```
Each status variant carries counters for progress tracking and cost analysis.
`Holding` includes G4 load tracking (`pending_g4`, `loaded_g4`, `failed_g4`).
## SessionHandle
`SessionHandle` provides deferred control over `Hold` and `Prepare` sessions:
- `prepare()` -- Trigger G3->G2 staging (Hold -> Prepare transition)
- `pull()` -- Trigger RDMA pull of remote G2->local G2 (Prepare -> Complete)
- `cancel()` -- Cancel session and release all held blocks

`SessionHandle` is not available for `StagingMode::Full`, which runs to completion automatically.
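
The deferred-control flow can be pictured as a small synchronous state machine. This is a toy model, not the crate's API: `ToySession`, `Stage`, and `InvalidTransition` are hypothetical, the intermediate `Preparing` and `Staging` states are collapsed, and rejecting `pull()` before `prepare()` (as well as treating `cancel()` after completion as a no-op) is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Holding,
    Prepared,
    Complete,
    Cancelled,
}

#[derive(Debug, PartialEq, Eq)]
struct InvalidTransition {
    from: Stage,
    op: &'static str,
}

/// Hypothetical, synchronous stand-in for a session started in `Hold` mode.
struct ToySession {
    stage: Stage,
}

impl ToySession {
    fn new_hold() -> Self {
        ToySession { stage: Stage::Holding }
    }

    /// Hold -> Prepare: models G3->G2 staging.
    fn prepare(&mut self) -> Result<(), InvalidTransition> {
        match self.stage {
            Stage::Holding => {
                self.stage = Stage::Prepared;
                Ok(())
            }
            from => Err(InvalidTransition { from, op: "prepare" }),
        }
    }

    /// Prepare -> Complete: models the RDMA pull of remote G2 into local G2.
    fn pull(&mut self) -> Result<(), InvalidTransition> {
        match self.stage {
            Stage::Prepared => {
                self.stage = Stage::Complete;
                Ok(())
            }
            from => Err(InvalidTransition { from, op: "pull" }),
        }
    }

    /// Releases held blocks; assumed to be a no-op once the session is complete.
    fn cancel(&mut self) {
        if self.stage != Stage::Complete {
            self.stage = Stage::Cancelled;
        }
    }
}
```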
## BlockAccessor
`BlockAccessor` provides a stateless, `Send + Sync` interface for policy-based
block scanning. Each `find()` call independently searches G2 then G3, acquiring
blocks via RAII. The companion `PolicyContext` adds result collection via
`yield_item()` for streaming scan results back to the caller.
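
The G2-then-G3 lookup order can be sketched with two in-memory maps. Everything here is a stand-in: `ToyAccessor`, `Hit`, and `scan` are hypothetical, block handles are plain `usize` IDs instead of RAII guards, and collecting into a `Vec` only approximates what `yield_item()` is described as doing.

```rust
use std::collections::HashMap;

/// Stand-in for the crate's `SequenceHash`.
type SequenceHash = u64;

/// Which tier satisfied the lookup, with a toy block ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Hit {
    G2(usize),
    G3(usize),
}

/// Hypothetical accessor over two per-tier registries.
struct ToyAccessor {
    g2: HashMap<SequenceHash, usize>,
    g3: HashMap<SequenceHash, usize>,
}

impl ToyAccessor {
    /// Each call is independent: try G2 first, fall back to G3.
    fn find(&self, hash: SequenceHash) -> Option<Hit> {
        self.g2
            .get(&hash)
            .copied()
            .map(Hit::G2)
            .or_else(|| self.g3.get(&hash).copied().map(Hit::G3))
    }
}

/// A trivial scan policy: collect every hit, skipping misses.
fn scan(accessor: &ToyAccessor, hashes: &[SequenceHash]) -> Vec<Hit> {
    hashes.iter().filter_map(|h| accessor.find(*h)).collect()
}
```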
# Object Storage Module
The object module provides traits and implementations for storing KV cache
blocks in object storage systems (S3, MinIO). This corresponds to the G4
(object store) tier in the storage hierarchy.
## ObjectBlockOps Trait
The primary trait for block-level object storage operations:
| Method | Purpose |
|--------|---------|
| `has_blocks(keys)` | Check existence and size of blocks |
| `put_blocks(keys, src_layout, block_ids)` | Upload blocks using logical layout handle |
| `get_blocks(keys, dst_layout, block_ids)` | Download blocks using logical layout handle |
| `put_blocks_with_layout(keys, layout, block_ids)` | Upload using resolved physical layout |
| `get_blocks_with_layout(keys, layout, block_ids)` | Download using resolved physical layout |
### Logical vs Physical Layout
The trait offers two APIs for put/get:
- **Logical** (`put_blocks` / `get_blocks`): Takes a `LogicalLayoutHandle` (G1, G2, G3).
Workers resolve this to their own physical layout internally. Used by the leader
(which doesn't have physical layouts) and by `CoordinatedWorker`.
- **Physical** (`put_blocks_with_layout` / `get_blocks_with_layout`): Takes a resolved
`PhysicalLayout` directly. Used by `PhysicalWorker` after resolving its handles, and
by `S3ObjectBlockClient` which performs the actual I/O.
## Key Formatting
Keys map `SequenceHash` values to object storage paths:
- **`DefaultKeyFormatter`**: Uses the hash's Display representation
(e.g., `0:abc123`). Suitable for single-worker scenarios.
- **`RankPrefixedKeyFormatter`**: Prefixes with worker rank
(e.g., `0/0:abc123`). Required for SPMD workers where multiple workers
store the same logical block with different physical data.

The `create_key_formatter(rank)` factory returns the appropriate formatter.
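
The two key shapes can be illustrated with plain string formatting. The functions below are hypothetical helpers, not the crate's formatter types, and they take the hash's Display form as a `&str` to stay self-contained.

```rust
/// Key as produced by a default-style formatter: the hash's Display form unchanged.
fn default_key(hash_display: &str) -> String {
    hash_display.to_string()
}

/// Key as produced by a rank-prefixed formatter: `<rank>/<hash>`.
fn rank_prefixed_key(rank: usize, hash_display: &str) -> String {
    format!("{}/{}", rank, hash_display)
}
```

With the rank prefix, two SPMD workers storing the same logical block write to distinct objects (`0/0:abc123` and `1/0:abc123`) instead of overwriting each other.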
## ObjectLockManager
Distributed locking protocol for coordinated offloads to prevent duplicate
uploads:
```text
has_meta(hash)
  → true  → skip (already offloaded)
  → false → try_acquire_lock(hash)
              → true  → transfer → create_meta(hash) → release_lock(hash)
              → false → skip (another instance owns it)
```
Uses conditional PUT (`If-None-Match: *`) for lock acquisition with deadline-based
expiry for stale lock recovery.
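
The protocol can be exercised against an in-memory store. This sketch is an assumption-laden model: `ToyStore`, `Outcome`, and `offload_once` are hypothetical, the conditional PUT is modeled by `HashSet::insert` (which succeeds only when the key is absent), and deadline-based expiry of stale locks is left out.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for the object store's meta and lock objects.
#[derive(Default)]
struct ToyStore {
    meta: HashSet<u64>,
    locks: HashSet<u64>,
}

impl ToyStore {
    fn has_meta(&self, hash: u64) -> bool {
        self.meta.contains(&hash)
    }

    /// Models a conditional PUT with `If-None-Match: *`: true only if no lock existed.
    fn try_acquire_lock(&mut self, hash: u64) -> bool {
        self.locks.insert(hash)
    }

    fn create_meta(&mut self, hash: u64) {
        self.meta.insert(hash);
    }

    fn release_lock(&mut self, hash: u64) {
        self.locks.remove(&hash);
    }
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    AlreadyOffloaded,
    Uploaded,
    OwnedByOther,
}

/// One pass through the flow: check meta, take the lock, transfer, publish meta, unlock.
fn offload_once(store: &mut ToyStore, hash: u64, transfer: impl FnOnce()) -> Outcome {
    if store.has_meta(hash) {
        return Outcome::AlreadyOffloaded;
    }
    if !store.try_acquire_lock(hash) {
        return Outcome::OwnedByOther;
    }
    transfer();
    store.create_meta(hash);
    store.release_lock(hash);
    Outcome::Uploaded
}
```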
## S3 Implementation
The `s3` submodule (feature-gated behind `s3`) provides:
- **`S3ObjectBlockClient`**: Implements `ObjectBlockOps` for S3-compatible storage.
Supports concurrent uploads/downloads via `rayon` thread pool and contiguous
memory fast paths for aligned block data.
- **`S3LockManager`**: Implements `ObjectLockManager` using S3 conditional writes.
## Factory Functions
- **`create_object_client(config, rank)`**: Creates an `Arc<dyn ObjectBlockOps>`
from configuration. Selects the backend (S3 or future alternatives) based on
`ObjectClientConfig`.
- **`create_lock_manager(config, instance_id)`**: Creates an
`Arc<dyn ObjectLockManager>` for distributed lock coordination.
# Offload Module
The offload module manages the asynchronous transfer of KV cache blocks between storage tiers. It provides a pipeline-based architecture for evaluating, batching, and executing block transfers with full cancellation support.
## Overview
Offloading moves blocks from a source tier (e.g., GPU memory) to a destination tier (e.g., host memory, remote storage, or object storage). The pipeline ensures:
- **Policy-based filtering**: Only blocks meeting criteria are transferred
- **Batched execution**: Blocks are grouped for efficient transfer
- **Cancellation support**: Transfers can be cancelled at any point before commitment
- **Precondition synchronization**: Transfers wait for forward pass completion
## Pipeline Architecture
```text
┌─────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐     ┌──────────────────┐
│ PolicyEvaluator │────►│ PreconditionAwaiter │────►│       Batcher       │────►│ TransferExecutor │
└─────────────────┘     └─────────────────────┘     └─────────────────────┘     └──────────────────┘
         ▲                          ▲
         │                          │
 CancellableQueue           CancellableQueue
         │                          │
         └──────── CancelSweeper ───┘
```
### Stages
| Stage | Purpose |
|-------|---------|
| **PolicyEvaluator** | Filters blocks based on configured policies (frequency, presence, etc.) |
| **PreconditionAwaiter** | Waits for forward pass completion before proceeding |
| **Batcher** | Groups containers into batches based on total block count |
| **TransferExecutor** | Upgrades blocks and executes the actual transfer |
## Container Data Model
The fundamental unit flowing through the pipeline is an **OffloadContainer**:
```rust,ignore
struct OffloadContainer<T: BlockMetadata> {
    /// The blocks to offload
    blocks: Vec<SourceBlock<T>>,
    /// Precondition event (forward pass completion)
    precondition: Option<EventHandle>,
    /// Cancellation token
    cancel_token: CancellationToken,
}
```
Containers are grouped into batches for efficient transfer:
```rust,ignore
struct OffloadBatch<T: BlockMetadata> {
    /// Multiple containers, each independently cancellable
    containers: Vec<OffloadContainer<T>>,
}
```
### P1: Container is the Unit of Cancellation
Individual blocks within a container are not independently cancellable. When a container is cancelled, all its blocks are cancelled together.
### P2: Token Travels with Container
Each container carries its own `CancellationToken`, cloned from the `TransferHandle` at enqueue time. The token travels with the container through all pipeline stages until upgrade.
### P3: Upgrade is the Commitment Boundary
The upgrade step (Weak → Strong) is the point of no return:
- **Before upgrade**: Containers can be cancelled via sweep or token check
- **After upgrade**: We own the blocks; cancellation no longer applies
### P4: Sweep Before Upgrade
The last cancellation check occurs immediately before upgrade. The `TransferExecutor` calls `batch.sweep_cancelled()` to remove cancelled containers before committing.
### P5: Flat Map After Upgrade
After upgrade, all blocks from all containers are consolidated into a single `Vec<ImmutableBlock<T>>` for efficient batch transfer. Per-container identity is lost at this point.
### P6: PreconditionAwaiter Uses Select
The precondition awaiter can be cancelled via `select!` on both the precondition event and the cancellation token. If cancelled while waiting, the container is dropped immediately.
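
Principles P1 and P3 through P5 can be demonstrated with standard-library types. This is a sketch under assumptions: `Container` and `Batch` are hypothetical, the cancellation token is an `Arc<AtomicBool>`, the crate's weak and strong block handles are replaced by `std::sync::Weak` and `Arc`, and silently skipping blocks that fail to upgrade is a guess rather than documented behavior.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Weak};

/// Hypothetical container: weak block handles plus a shared cancellation flag.
struct Container {
    blocks: Vec<Weak<u32>>,
    cancelled: Arc<AtomicBool>,
}

struct Batch {
    containers: Vec<Container>,
}

impl Batch {
    /// P4: drop whole containers whose token is set (P1); returns how many were removed.
    fn sweep_cancelled(&mut self) -> usize {
        let before = self.containers.len();
        self.containers
            .retain(|c| !c.cancelled.load(Ordering::Acquire));
        before - self.containers.len()
    }

    /// P3 + P5: upgrade Weak -> Strong and flatten; container identity is gone afterwards.
    fn upgrade(self) -> Vec<Arc<u32>> {
        self.containers
            .into_iter()
            .flat_map(|c| c.blocks.into_iter())
            // Assumption: a block whose owner has already gone away is skipped.
            .filter_map(|weak| weak.upgrade())
            .collect()
    }
}
```

Because `upgrade` consumes the batch, nothing can sweep or cancel it afterwards, which mirrors the commitment boundary described in P3.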
## Configuration
Pipeline behavior is controlled via `PipelineConfig`:
| Option | Default | Description |
|--------|---------|-------------|
| `batch_config.max_batch_size` | 64 | Maximum blocks per batch |
| `batch_config.min_batch_size` | 8 | Minimum blocks before flush |
| `batch_config.flush_interval` | 10ms | Time before flushing partial batch |
| `policy_timeout` | 100ms | Timeout for policy evaluation |
| `sweep_interval` | 10ms | Interval for cancel sweeper |
| `max_concurrent_transfers` | 1 | Concurrent transfer batches |
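
The defaults in the table can be mirrored in a stand-in struct, which also shows how a single option might be overridden. The field names and values come from the table; the field types (`usize`, `Duration`) and the `Default` implementations are assumptions, and these structs are not the crate's own definitions.

```rust
use std::time::Duration;

/// Stand-in for the batching options; types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct BatchConfig {
    max_batch_size: usize,
    min_batch_size: usize,
    flush_interval: Duration,
}

impl Default for BatchConfig {
    fn default() -> Self {
        BatchConfig {
            max_batch_size: 64,
            min_batch_size: 8,
            flush_interval: Duration::from_millis(10),
        }
    }
}

/// Stand-in for the pipeline options; types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct PipelineConfig {
    batch_config: BatchConfig,
    policy_timeout: Duration,
    sweep_interval: Duration,
    max_concurrent_transfers: usize,
}

impl Default for PipelineConfig {
    fn default() -> Self {
        PipelineConfig {
            batch_config: BatchConfig::default(),
            policy_timeout: Duration::from_millis(100),
            sweep_interval: Duration::from_millis(10),
            max_concurrent_transfers: 1,
        }
    }
}
```

With struct update syntax, `PipelineConfig { max_concurrent_transfers: 4, ..PipelineConfig::default() }` changes one option and keeps the remaining defaults.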
## Usage
### Enqueueing Blocks
```rust,ignore
let handle = pipeline.enqueue(source_blocks, precondition_event);
// Track progress
println!("Status: {:?}", handle.status());
// Wait for completion
let result = handle.wait().await?;
```
### Cancelling a Transfer
```rust,ignore
// Request cancellation and wait for confirmation
handle.cancel().await;
// All blocks are now released
```
## Related Documentation
- [offload-developer.md](offload-developer.md) - Implementation details and extension rules
# Runtime
The `KvbmRuntime` is the composed shared infrastructure for KVBM operations. It bundles
the minimal set of components that all downstream managers and services need:
- **Tokio runtime** -- async execution context (owned or borrowed handle)
- **Messenger (Velo)** -- distributed RPC for leader/worker communication and peer discovery
- **NixlAgent** -- RDMA/UCX data transfers (optional, disabled when NIXL config is absent)
- **EventManager** -- worker coordination and transfer completion notifications (accessed via Messenger)
## Construction
Two quick constructors cover the common case:
```rust,ignore
// Leader role (reads KVBM_* env vars + TOML files)
let runtime = KvbmRuntime::from_env_leader().await?;
// Worker role
let runtime = KvbmRuntime::from_env_worker().await?;
```
For tests or custom setups, use the builder:
```rust,ignore
let config = KvbmConfig::from_env()?;
let runtime = KvbmRuntime::builder(config)
    .with_runtime_handle(Handle::current()) // inject existing tokio runtime
    .with_messenger(messenger)              // inject pre-built Messenger
    .with_nixl_agent(agent)                 // inject pre-built NixlAgent
    .build_leader()
    .await?;
```
`KvbmRuntimeBuilder::from_json(json)` is the primary entrypoint for vLLM's
`kv_connector_extra_config` dict -- JSON values have highest priority, overriding
env vars, TOML files, and defaults.
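
The layering can be pictured as a chain of optional values. This is a sketch, not the crate's loader: `Sources` and `resolve` are hypothetical, and the relative order of env vars over TOML files is an assumption read from the order in which the sources are listed above; only JSON being highest is stated outright.

```rust
/// One optional value per configuration source (hypothetical).
struct Sources<T> {
    json: Option<T>,
    env: Option<T>,
    toml: Option<T>,
}

/// First source that provides a value wins: JSON, then env, then TOML, then the default.
fn resolve<T>(sources: Sources<T>, default: T) -> T {
    sources
        .json
        .or(sources.env)
        .or(sources.toml)
        .unwrap_or(default)
}
```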
## Component access
| Method | Returns | Notes |
|---------------------|------------------------------|---------------------------------------|
| `handle()` / `tokio()` | `tokio::runtime::Handle` | Borrowed or owned runtime handle |
| `messenger()` | `&Arc<Messenger>` | Velo RPC |
| `nixl_agent()` | `Option<&NixlAgent>` | `None` when NIXL disabled in config |
| `event_system()` | `Arc<velo::EventManager>` | From Messenger, used for transfer notifications |
| `config()` | `&KvbmConfig` | Full configuration snapshot |
## RuntimeHandle
`RuntimeHandle` is an enum that abstracts over owned (`Arc<Runtime>`) and borrowed
(`Handle`) tokio runtimes. The builder creates an owned runtime from config when none
is injected.
# Testing Module
Test infrastructure for the kvbm-engine crate. Core block and token utilities
are re-exported from `kvbm_logical::testing` and `kvbm_physical::testing`;
this module adds engine-specific helpers for transport, sessions, offload
pipelines, and multi-instance scenarios.
## Test Helpers
### TestManagerBuilder / TestRegistryBuilder
Create test block managers and registries with synthetic physical layouts.
`TestManagerBuilder` produces a `BlockManager<T>` backed by mock memory.
`TestRegistryBuilder` produces a `BlockRegistry` pre-populated with hashes.
Use `populate_manager_with_blocks` and `create_and_populate_manager` to
quickly set up managers with pre-allocated blocks for testing.
### MessengerPair
Creates a pair of connected Velo `Messenger` instances for transport
testing without a real network. Messages sent through one messenger are
received by the other, enabling end-to-end session testing in a single
process.
```rust,ignore
let (messenger_a, messenger_b) = create_messenger_pair_tcp().await?;
```
### TestSession
Helper for testing distributed session protocols. Sets up the full session
infrastructure (dispatch maps, transport, channels) for testing
`InitiatorSession` / `ResponderSession` / `ControllableSession` interactions.
### EventsPipelineFixture
Test fixture for the offload pipeline. Provides pre-configured pipeline
stages, event managers, and block managers for testing policy evaluation,
batching, and transfer execution in isolation.
### MultiInstancePopulator
Sets up multi-instance distributed test scenarios with multiple leaders,
workers, and block managers. Populates each instance with configurable
block patterns for testing cross-instance onboarding.
```rust,ignore
let populated = MultiInstancePopulator::builder()
    .instance_count(3)
    .blocks_per_instance(100)
    .build()?
    .populate()
    .await?;
```
### Physical Test Utilities
`TestAgent` and `TestAgentBuilder` create mock `NixlAgent` instances for
testing `TransferManager` without real RDMA hardware. `TransferChecksums`
provides utilities for verifying transfer correctness.
### Token Block Helpers
The `token_blocks` module provides utilities for creating test blocks with
known token sequences, useful for verifying search and match operations.
## Writing a New Test
1. Choose the appropriate fixture for your test scope:
- Single-instance transfer → `TestManagerBuilder` + `TestAgent`
- Session protocol → `TestSession` + `MessengerPair`
- Offload pipeline → `EventsPipelineFixture`
- Multi-instance → `MultiInstancePopulator`
2. Build the fixture and populate with test data
3. Exercise the code under test
4. Assert on results and verify cleanup (blocks released, sessions closed)