Commit 16a28058, authored by Yan Ru Pei, committed by GitHub

chore: remove stale indexer benchmarking results (#6233)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent cb88fdc7
@@ -178,7 +178,7 @@ The main KV-aware routing arguments:
- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. For example, with a value of 0.8 (default) and max tree size of 1048576, the router will prune down to approximately 838860 blocks when the threshold is exceeded. Defaults to 0.8 when `--no-kv-events` is used. This creates headroom before the next pruning cycle.
- `--router-event-threads`: Number of event processing threads for the KV indexer. When set to 1 (default), the router uses a single-threaded radix tree with channel-based event processing, which supports TTL-based expiration and pruning. When set to a value greater than 1, the router uses a concurrent radix tree with a thread pool of the specified size for higher event throughput. Note: the concurrent indexer does not support TTL/pruning (`--router-ttl`, `--router-max-tree-size`, `--router-prune-target-ratio` are ignored when `--router-event-threads > 1`). Can be set via `DYN_ROUTER_EVENT_THREADS` env var.
- `--router-event-threads`: Number of event processing threads for the KV indexer. When set to 1 (default), the router uses a single-threaded radix tree with channel-based event processing, which supports TTL-based expiration and pruning. When set to a value greater than 1, the router uses a concurrent radix tree with a thread pool of the specified size for higher event throughput. Note: the concurrent indexer does not support TTL/pruning (`--router-ttl`, `--router-max-tree-size`, `--router-prune-target-ratio` are ignored when `--router-event-threads > 1`). Can be set via `DYN_ROUTER_EVENT_THREADS` env var. For details on the underlying index data structures (`RadixTree`, `ConcurrentRadixTree`, `PositionalIndexer`) and their concurrency model (inline reads, sticky-routed writes via thread pool), see the [KV Router Index documentation](../../../../lib/kv-router/README.md).
>[!Note]
> **State persistence** depends on the event transport mode:
@@ -431,5 +431,6 @@ curl http://localhost:8000/busy_threshold
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[KV Router Index Data Structures](../../../../lib/kv-router/README.md)**: `RadixTree`, `ConcurrentRadixTree`, and `PositionalIndexer` internals and concurrency model
- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
- **[KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md)**: Integrate custom inference engines with KV-aware routing
# KV Router Index Data Structures
# ⚡ FlashIndexer — KV Router Index Data Structures
This document explains the KV cache index implementations: `RadixTree`, `ConcurrentRadixTree`, and `PositionalIndexer` (NestedMap).
This document explains the KV cache index implementations: `RadixTree` (and its concurrent variant `ConcurrentRadixTree`) and `PositionalIndexer` (NestedMap).
The concurrent indexers achieve a combined throughput of over **10 million events + requests per second** with **p99 latency under 10 microseconds**.
## Motivation: The Four Block Identifiers
@@ -27,8 +29,8 @@ LocalBlockHash = hash(tokens) = 0xABCD1234
Sequence A: [block0, block1, block2]
Sequence B: [block0', block1', block2] // block2 has same content but different prefix
block2 in A: seq_hash = hash(hash(hash(block0) + block1) + block2) = 0x1111
block2 in B: seq_hash = hash(hash(hash(block0') + block1') + block2) = 0x2222
block2 in A: seq_hash = hash(hash(hash(block0) || block1) || block2) = 0x1111
block2 in B: seq_hash = hash(hash(hash(block0') || block1') || block2) = 0x2222
```
**Computation**: `seq_hash[i] = hash(seq_hash[i-1] || local_hash[i])` where `seq_hash[0] = local_hash[0]`
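
The recurrence above can be sketched in a few lines of Rust. This is a minimal illustration, not the router's implementation: the helper names `local_hash` and `sequence_hashes` are hypothetical, and `DefaultHasher` from the standard library stands in for whatever hash function the router actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash of a block's own token content (hypothetical stand-in for LocalBlockHash).
fn local_hash(tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    tokens.hash(&mut h);
    h.finish()
}

/// Rolling sequence hashes:
/// seq[0] = local[0], seq[i] = hash(seq[i-1] || local[i]).
fn sequence_hashes(local: &[u64]) -> Vec<u64> {
    let mut out: Vec<u64> = Vec::with_capacity(local.len());
    for (i, &lh) in local.iter().enumerate() {
        let sh = if i == 0 {
            lh
        } else {
            // Feed the previous sequence hash, then this block's local hash.
            let mut h = DefaultHasher::new();
            out[i - 1].hash(&mut h);
            lh.hash(&mut h);
            h.finish()
        };
        out.push(sh);
    }
    out
}
```

Calling `sequence_hashes` on two sequences that end in the same block but start with different blocks yields different final entries, which is the property the example with Sequence A and Sequence B relies on.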
@@ -53,7 +55,7 @@ block2 in B: seq_hash = hash(hash(hash(block0') + block1') + block2) = 0x2222
**Why**: The router needs to know which workers can serve a request based on their cached blocks.
### 4. Position (`u64`)
### 4. Position (`usize`)
**What**: The block's index in the sequence (0, 1, 2, ...).
@@ -65,13 +67,13 @@ block2 in B: seq_hash = hash(hash(hash(block0') + block1') + block2) = 0x2222
Both data structures support three operations:
| Operation | Description | Hot Path? |
|-----------|-------------|-----------|
| `store_blocks` | Add blocks for a worker | No (background) |
| `remove_blocks` | Remove blocks for a worker | No (background) |
| `find_matches` | Find workers with matching prefix | **Yes** (per-request) |
| Operation | Description |
|-----------|-------------|
| `store_blocks` | Add blocks for a worker (background) |
| `remove_blocks` | Remove blocks for a worker (background) |
| `find_matches` | Find workers with matching prefix (per-request) |
The key insight: **reads (find_matches) are far more frequent than writes (store/remove)**. This motivates different structural tradeoffs.
**Read vs write cost and frequency**: When the radix tree has little or no shared prefix for a request, `find_matches` can exit after a single root-level lookup (first block miss)—reads then do less work than writes (which traverse and update multiple nodes). Vice versa, with large prefix overlap, reads traverse deeper and can be invoked more often than writes. We consider both extremes; the proposed data structures are designed to handle them.
---
@@ -148,6 +150,36 @@ Where W = number of workers.
---
## ConcurrentRadixTree: Thread-Safe Variant
`ConcurrentRadixTree` adapts the `RadixTree` for concurrent access. The key change is replacing `Rc<RefCell<>>` with `Arc<RwLock<>>` per node, and using a `DashMap` for the per-worker lookup table:
```
ConcurrentRadixTree
├── root: SharedBlock (Arc<RwLock<Block>>)
└── lookup: DashMap<Worker, RwLock<HashMap<SeqHash, SharedBlock>>>
```
The `DashMap` distributes lock contention across shards, while each worker's block map is behind its own `RwLock`. This means `find_matches` only takes read locks — on the tree nodes and on the lookup — so multiple reads can proceed in parallel without blocking each other.
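
To make the read path concrete, here is a simplified sketch of a tree whose nodes sit behind `Arc<RwLock<...>>`. It is an assumption-laden illustration: the `Block` fields, the `store` and `find_matches` signatures, and the use of plain `u64` for hashes and worker ids are all hypothetical, the per-worker lookup table is omitted, and `store` locks one node at a time rather than doing true hand-over-hand locking.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};

type SharedBlock = Arc<RwLock<Block>>;

/// Hypothetical node layout: children keyed by local hash, plus owning workers.
#[derive(Default)]
struct Block {
    children: HashMap<u64, SharedBlock>,
    workers: HashSet<u64>,
}

/// Write path: takes write locks on each node it touches.
fn store(root: &SharedBlock, hashes: &[u64], worker: u64) {
    let mut node = Arc::clone(root);
    for &h in hashes {
        let child = {
            let mut guard = node.write().unwrap();
            Arc::clone(guard.children.entry(h).or_default())
        };
        child.write().unwrap().workers.insert(worker);
        node = child;
    }
}

/// Read path: only read locks, so concurrent readers do not block each other.
/// Returns, per worker, how many leading blocks of the query it holds.
fn find_matches(root: &SharedBlock, hashes: &[u64]) -> HashMap<u64, usize> {
    let mut depth = HashMap::new();
    let mut node = Arc::clone(root);
    for (i, h) in hashes.iter().enumerate() {
        let next = match node.read().unwrap().children.get(h) {
            Some(child) => Arc::clone(child),
            None => break, // first miss ends the walk
        };
        for &w in next.read().unwrap().workers.iter() {
            depth.insert(w, i + 1);
        }
        node = next;
    }
    depth
}
```

Because `find_matches` never calls `write()`, any number of request threads can traverse the tree at once; they only wait when a writer holds the lock on the specific node they are visiting.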
Writes (`store_blocks`, `remove_blocks`) take write locks on the affected nodes using hand-over-hand locking (parent before child). To avoid write–write contention, `ConcurrentRadixTree` is designed to be wrapped in a `ThreadPoolIndexer`, which uses per-worker sticky routing: each `WorkerId` is assigned to a dedicated OS thread via a `DashMap<WorkerId, usize>` mapping, and events are dispatched through per-thread `flume` channels. Since KV events for a given worker always land on the same thread, writes to that worker's subtree are serialized without cross-thread locking.
```
                                                 ┌──────────────────────────────────┐
find_matches() ────────────────────────────────→ │ Arc<ConcurrentRadixTree>         │ ← reads go inline
                                                 │                                  │
KV events ──→ flume[0] ──→ thread 0 (W0, W3) ──→ │                                  │
          ──→ flume[1] ──→ thread 1 (W1, W4) ──→ │                                  │ ← writes via sticky
          ──→ flume[2] ──→ thread 2 (W2, W5) ──→ │                                  │   worker assignment
                                                 └──────────────────────────────────┘
```
This same pattern — inline reads on the caller thread, sticky-routed writes through a thread pool — is shared with `PositionalIndexer` (see below). Both implement the `SyncIndexer` trait and are wrapped in `ThreadPoolIndexer`.
One trade-off: `ConcurrentRadixTree` drops the `recent_uses` frequency tracking from `RadixTree`, keeping `find_matches` fully read-only (no mutable state updates on the read path).
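
The sticky routing described above can be sketched with standard-library pieces. This is not the `ThreadPoolIndexer` code: `StickyPool`, `Event`, and the round-robin assignment policy are hypothetical, `std::sync::mpsc` replaces `flume`, and a plain `HashMap` replaces the `DashMap<WorkerId, usize>`. Each thread only records the events it receives, where a real indexer would apply them to the tree.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

type WorkerId = u64;

/// Hypothetical event; a real KV event carries stored or removed block hashes.
struct Event {
    worker: WorkerId,
    payload: u64,
}

/// Pins each worker id to one thread the first time it is seen.
struct StickyPool {
    senders: Vec<Sender<Event>>,
    handles: Vec<JoinHandle<Vec<(WorkerId, u64)>>>,
    assignment: HashMap<WorkerId, usize>,
    next: usize,
}

impl StickyPool {
    fn new(threads: usize) -> Self {
        let mut senders = Vec::new();
        let mut handles = Vec::new();
        for _ in 0..threads {
            let (tx, rx) = channel::<Event>();
            senders.push(tx);
            // Each thread drains its own channel in arrival order.
            handles.push(thread::spawn(move || {
                rx.iter().map(|e| (e.worker, e.payload)).collect::<Vec<_>>()
            }));
        }
        StickyPool { senders, handles, assignment: HashMap::new(), next: 0 }
    }

    /// Sends the event to the worker's pinned thread and returns that thread index.
    fn dispatch(&mut self, event: Event) -> usize {
        let n = self.senders.len();
        let next = &mut self.next;
        let idx = *self.assignment.entry(event.worker).or_insert_with(|| {
            let i = *next % n; // assumed policy: round-robin on first sight
            *next += 1;
            i
        });
        self.senders[idx].send(event).expect("indexer thread is alive");
        idx
    }

    /// Closes the channels and returns what each thread processed.
    fn shutdown(self) -> Vec<Vec<(WorkerId, u64)>> {
        drop(self.senders);
        self.handles.into_iter().map(|h| h.join().unwrap()).collect()
    }
}
```

Since every event for a given worker goes through the same channel, the owning thread sees that worker's events in the order they were dispatched, which is what serializes writes to that worker's blocks.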
---
## PositionalIndexer (NestedMap): Position-First HashMap Index
### Structure
@@ -250,9 +282,9 @@ Query: [b0, b1, b2, ..., b63, b64, ..., b127, ...]
|-----------|------|-------|
| store_blocks (N blocks) | O(N) | O(N) entries |
| remove_blocks (N blocks) | O(N) | - |
| find_matches (depth D) | O(D/J + J×W) | O(W) |
| find_matches (depth D) | O(D/J) | O(W) |
Where J = jump_size, W = number of workers. The jump optimization reduces D iterations to D/J jumps plus occasional scans.
Where J = jump_size, W = number of workers. The jump optimization reduces D sequential lookups to D/J jumps, with occasional linear scans over skipped positions when workers drop out at a jump point.
---
@@ -269,17 +301,6 @@ Where J = jump_size, W = number of workers. The jump optimization reduces D iter
| **Memory** | Higher (Rc/Arc overhead per node) | Lower (flat entries) |
| **Cache locality** | Poor (pointer chasing) | Better (position-first) |
### Benchmark Results (1M blocks, depth 1024, 128 workers)
| Operation | RadixTree | NestedMap | Winner |
|-----------|-----------|-----------|--------|
| STORE_BLOCK | 90µs | 98µs | RadixTree (1.1x) |
| REMOVE_BLOCK | 91µs | 233µs | RadixTree (2.5x) |
| FIND_MATCHES (HIT) | 227µs | **44µs** | **NestedMap (5.2x)** |
| FIND_MATCHES (PARTIAL) | 216µs | **44µs** | **NestedMap (4.9x)** |
**Recommendation**: Use NestedMap for read-heavy workloads (typical router usage).
---
## Why Position Matters for PositionalIndexer
@@ -297,36 +318,3 @@ let workers_at_64 = index.get(&(64, local_hashes[64])); // O(1) lookup
let workers_at_128 = index.get(&(128, local_hashes[128])); // O(1) lookup
// Skip positions 1-63, 65-127 entirely!
```
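
A fuller sketch of the jump search follows. It is a simplified model under stated assumptions, not the `PositionalIndexer` code: the `Index` alias, `workers_at`, and this `find_matches` signature are hypothetical, entries are keyed only by `(position, local_hash)` with sequence hashes ignored, and a worker present at position P is assumed to hold every earlier position of the same sequence.

```rust
use std::collections::{HashMap, HashSet};

type Worker = u64;
/// Hypothetical flat index: (position, local_hash) -> workers holding that block.
type Index = HashMap<(usize, u64), HashSet<Worker>>;

/// Workers from `active` that also hold the query block at `pos`.
fn workers_at(index: &Index, pos: usize, hashes: &[u64], active: &HashSet<Worker>) -> HashSet<Worker> {
    index
        .get(&(pos, hashes[pos]))
        .map(|ws| ws.intersection(active).copied().collect::<HashSet<Worker>>())
        .unwrap_or_default()
}

/// Returns, per worker, the number of leading query blocks it matches.
fn find_matches(index: &Index, hashes: &[u64], jump: usize) -> HashMap<Worker, usize> {
    let mut depth = HashMap::new();
    if hashes.is_empty() || jump == 0 {
        return depth;
    }
    let mut active = match index.get(&(0, hashes[0])) {
        Some(ws) => ws.clone(),
        None => return depth, // first block miss: nothing matches
    };
    let mut pos = 0; // last position at which all of `active` is known to match
    while pos + 1 < hashes.len() && !active.is_empty() {
        let target = (pos + jump).min(hashes.len() - 1);
        let at_target = workers_at(index, target, hashes, &active);
        if at_target.len() == active.len() {
            pos = target; // nobody dropped out: skip the positions in between
            continue;
        }
        // Someone dropped out in (pos, target]: scan linearly to find where.
        for p in pos + 1..=target {
            let still = workers_at(index, p, hashes, &active);
            for w in active.difference(&still) {
                depth.insert(*w, p); // matched positions 0..p, i.e. p blocks
            }
            active = still;
        }
        pos = target;
    }
    for w in active {
        depth.insert(w, pos + 1);
    }
    depth
}
```

When all candidate workers survive each jump, the loop performs roughly D/J lookups; the linear scan runs only over a skipped range in which the worker set shrank.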
---
## SeqEntry Optimization
The innermost level uses an enum to avoid HashMap allocation in the common case:
```rust
enum SeqEntry {
    // Common: one prefix leads to this (position, local_hash)
    Single(SeqHash, HashSet<Worker>),
    // Rare: different prefixes converge on same (position, local_hash)
    Multi(HashMap<SeqHash, HashSet<Worker>>),
}
```
**When does Multi occur?**
Only when two different sequences have:
1. Same local block content at position P
2. Different prefix histories (different seq_hash)
Example:
```
Sequence A: [tok1, tok2, tok3] → positions 0,1,2
Sequence B: [tok4, tok5, tok3] → positions 0,1,2
                         ^^^^
                         Same local content at pos=2
                         but different seq_hash!
```
This is rare in practice, so `Single` saves ~48 bytes per entry.
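
One way the promotion from `Single` to `Multi` could work is sketched below. The `new`, `insert`, and `workers` methods are hypothetical additions for illustration, and `SeqHash` and `Worker` are assumed to be `u64` aliases; only the enum shape comes from the section above.

```rust
use std::collections::{HashMap, HashSet};

type SeqHash = u64; // assumed alias
type Worker = u64; // assumed alias

enum SeqEntry {
    Single(SeqHash, HashSet<Worker>),
    Multi(HashMap<SeqHash, HashSet<Worker>>),
}

impl SeqEntry {
    fn new(seq: SeqHash, worker: Worker) -> Self {
        SeqEntry::Single(seq, HashSet::from([worker]))
    }

    /// Adds a worker; promotes to `Multi` only when a second seq_hash appears.
    fn insert(&mut self, seq: SeqHash, worker: Worker) {
        match self {
            SeqEntry::Single(existing, workers) if *existing == seq => {
                workers.insert(worker);
            }
            SeqEntry::Single(existing, workers) => {
                // A different prefix converged on this (position, local_hash).
                let mut map = HashMap::new();
                map.insert(*existing, std::mem::take(workers));
                map.insert(seq, HashSet::from([worker]));
                *self = SeqEntry::Multi(map);
            }
            SeqEntry::Multi(map) => {
                map.entry(seq).or_default().insert(worker);
            }
        }
    }

    /// Workers registered under a specific seq_hash, if any.
    fn workers(&self, seq: SeqHash) -> Option<&HashSet<Worker>> {
        match self {
            SeqEntry::Single(s, w) if *s == seq => Some(w),
            SeqEntry::Single(..) => None,
            SeqEntry::Multi(map) => map.get(&seq),
        }
    }
}
```

In this sketch the common path never allocates a `HashMap`; the map is built only on the first conflicting `seq_hash`, and the existing worker set is moved into it rather than copied.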