@@ -88,7 +88,7 @@ To get a feel for how KV Cache management works on a single worker with KV Cache
1. **Request tokenization**: The incoming prompt is converted into tokens
2. **Block partitioning**: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
3. **Block hashing**: Each block of tokens is hashed to create a unique identifier. When a LoRA adapter is active, the adapter name is incorporated into the hash so that blocks cached under different adapters produce distinct identifiers.
4. **Cache lookup**:
   - For each block, the system checks if a matching block already exists in the KV cache
   - If a match is found, the existing KV cache block is reused
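
The partition, hash, and lookup steps above can be sketched in a few lines. This is a minimal illustration and not the engine's implementation: the helper names (`hash_block`, `block_hashes`, `count_reusable_blocks`), the BLAKE2b stand-in hash, the tiny `BLOCK_SIZE`, and the choice to skip a trailing partial block are all assumptions made for the example.

```python
import hashlib
from typing import Optional

BLOCK_SIZE = 4  # tiny for readability; the text mentions 16 or 64 in practice


def hash_block(tokens: list[int], lora_name: Optional[str] = None) -> int:
    """Stand-in 64-bit hash of one block; the adapter name is part of the input."""
    payload = repr((lora_name, tuple(tokens))).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "little")


def block_hashes(tokens: list[int], lora_name: Optional[str] = None) -> list[int]:
    """Partition into full blocks and hash each (assumption: partial tail is skipped)."""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    return [hash_block(tokens[i:i + BLOCK_SIZE], lora_name)
            for i in range(0, full, BLOCK_SIZE)]


def count_reusable_blocks(hashes: list[int], cache: set[int]) -> int:
    """Count leading blocks already in the cache, stopping at the first miss."""
    reused = 0
    for h in hashes:
        if h not in cache:
            break
        reused += 1
    return reused


# Two blocks are cached for the base model.
cache = set(block_hashes(list(range(8))))
# A new request shares that prefix and adds one new block.
request = list(range(8)) + [99, 100, 101, 102]
base_hits = count_reusable_blocks(block_hashes(request), cache)
lora_hits = count_reusable_blocks(block_hashes(request, "adapter-a"), cache)
```

In this sketch the base-model request reuses both cached blocks, while the same tokens sent under `adapter-a` reuse none, because the adapter name changes every block identifier.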
- Check that the LoRA is loaded on the worker handling your request
- For disaggregated serving, ensure both prefill and decode workers have the LoRA
## KV Cache-Aware LoRA Routing
When KV-aware routing is enabled, the router automatically accounts for LoRA adapter identity when computing block hashes. This means:
- **Distinct hash spaces per adapter**: Blocks cached under adapter `A` will never be confused with blocks cached under adapter `B` or the base model, even if the token sequences are identical. The adapter name is mixed into the `LocalBlockHash` computation.
- **Automatic prefix sharing within the same adapter**: Requests targeting the same LoRA adapter benefit from KV cache prefix matching just like base model requests do.
- **No configuration required**: The LoRA name is propagated automatically through KV events (`BlockStored`) from the engine to the router. The router uses the `lora_name` field on events to route LoRA requests to workers that have matching cached blocks.
This works end-to-end across the publisher pipeline, the KV consolidator (for deduplication), and the routing query path.
- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`)
- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
- **`lora_id`**: LoRA adapter ID (0 if not using LoRA)
- **`lora_name`**: LoRA adapter name string (omit or `None` for base model). When set, the adapter name is incorporated into block hash computation so that blocks for different LoRA adapters (or the base model) are never conflated.
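
To make the cumulative nature of `block_hashes` and the role of `parent_hash` concrete, here is a small sketch. It is illustrative only: `BlockStoredSketch` and `sequence_hashes` are hypothetical names, the chaining scheme and the BLAKE2b hash are stand-ins, and a real engine computes these hashes with its own algorithm.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def _h64(data: bytes) -> int:
    """Stand-in 64-bit hash (not the engine's algorithm)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


@dataclass
class BlockStoredSketch:
    """Hypothetical container mirroring the fields listed above."""
    block_hashes: list[int]
    num_block_tokens: list[int]
    parent_hash: Optional[int]
    lora_id: int = 0
    lora_name: Optional[str] = None


def sequence_hashes(blocks: list[list[int]],
                    parent_hash: Optional[int] = None) -> list[int]:
    """Chain each block's hash to its parent so the hash covers the whole prefix."""
    out: list[int] = []
    prev = parent_hash
    for block in blocks:
        prev = _h64(repr((prev, tuple(block))).encode("utf-8"))
        out.append(prev)
    return out


blocks = [[1, 2], [3, 4], [5, 6]]
hashes = sequence_hashes(blocks)

# An event that stores the second and third blocks points at the first block
# through parent_hash; only the very first block of a sequence has no parent.
event = BlockStoredSketch(
    block_hashes=hashes[1:],
    num_block_tokens=[len(b) for b in blocks[1:]],
    parent_hash=hashes[0],
)
```

Because each hash depends on its parent, two sequences that share a prefix share the leading hashes, and identical token blocks at different positions get different hashes.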
For `BlockRemoved` events:
- **`block_hashes`**: List of sequence block hashes being evicted
Publish a block-stored event. Event IDs are managed internally. When `lora_name` is provided, the adapter name is mixed into block hash computation so blocks cached under different adapters produce distinct hashes.
@@ -10,13 +10,14 @@ Every cached KV block in a distributed LLM system needs four pieces of informati
### 1. Local Block Hash (`LocalBlockHash`, u64)
**What**: Hash of the tokens *within* a single block (e.g., 64 tokens), optionally including LoRA adapter name and multimodal metadata.
**Why**: Identifies the content of this specific block, independent of context. Two blocks with the same tokens (and same LoRA adapter) have the same local hash. When a LoRA adapter name is provided, it is length-prefixed and appended to the byte buffer before hashing, ensuring that blocks under different adapters (or the base model) always produce distinct hashes.
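
The length-prefix-then-append step can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `LocalBlockHash` code: the function name `local_block_hash`, the little-endian `u32` token encoding, the 8-byte length prefix, and the BLAKE2b hash are all choices made for the example.

```python
import hashlib
import struct
from typing import Optional


def local_block_hash(tokens: list[int], lora_name: Optional[str] = None) -> int:
    """Hash one block's tokens, with an optional length-prefixed adapter name."""
    buf = bytearray()
    for t in tokens:
        buf += struct.pack("<I", t)           # assumed encoding: u32, little-endian
    if lora_name is not None:
        name = lora_name.encode("utf-8")
        buf += struct.pack("<Q", len(name))   # assumed prefix width: 8 bytes
        buf += name
    digest = hashlib.blake2b(bytes(buf), digest_size=8).digest()
    return int.from_bytes(digest, "little")   # 64-bit result, matching the u64 type


tokens = list(range(64))
base = local_block_hash(tokens)
adapter_a = local_block_hash(tokens, "adapter-a")
adapter_b = local_block_hash(tokens, "adapter-b")
```

The length prefix marks where the adapter name begins and ends inside the buffer, so the name cannot run into any other variable-length field that is appended alongside it.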
> In practice, the `ExternalSequenceBlockHash` may come directly from the inference engine (e.g., vLLM, TensorRT-LLM) using a rolling hash algorithm that we don't know or control. The engine computes these hashes internally and reports them via KV cache events.
>
> **LoRA identity**: The engine is responsible for incorporating the LoRA adapter identity into the `ExternalSequenceBlockHash` before emitting KV events. Dynamo does not add LoRA information at the router layer. For example, vLLM does this via `_gen_lora_extra_hash_keys`, which appends the LoRA ID as extra keys when calling `hash_block_tokens(..., extra_keys)`. Any engine integrating with the KV router must follow the same convention to ensure correct cache isolation between LoRA adapters.
>
> **Implications for index implementations:**
>
> - **RadixTree**: Can handle engine-provided hashes because it traverses the tree structure using `LocalBlockHash` for navigation and only uses `ExternalSequenceBlockHash` as an opaque identifier for lookups. It doesn't need to recompute hashes.