Commit 3b6dbef2 authored by Chi McIsaac, committed by GitHub

feat: Update docs to indicate need to use consistent hashing for KV events in backend engines (#2981)
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Co-authored-by: Yan Ru Pei <yanrpei@gmail.com>
parent 00061061
@@ -125,8 +125,8 @@ for i in $(seq 1 $NUM_WORKERS); do
"${EXTRA_ARGS[@]}"
else
echo "[Worker-$i] Using GPUs: $GPU_DEVICES"
- # Run vLLM engine (exec with env for proper syntax)
- exec env CUDA_VISIBLE_DEVICES=$GPU_DEVICES python -m dynamo.vllm \
+ # Run vLLM engine with PYTHONHASHSEED=0 for deterministic event IDs in KV-aware routing
+ exec env PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=$GPU_DEVICES python -m dynamo.vllm \
--model "$MODEL_PATH" \
--endpoint dyn://test.vllm.generate \
--tensor-parallel-size $TENSOR_PARALLEL_SIZE \
......
@@ -237,4 +237,4 @@ We currently provide deployment examples for Kubernetes and SLURM.
- **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**
## SLURM
- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
\ No newline at end of file
@@ -168,6 +168,18 @@ See `args.py` for the full list of configuration options and their defaults.
The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
### Hashing Consistency for KV Events
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
```bash
vllm serve ... --enable-prefix-caching --prefix-caching-hash-algo sha256
```
See the high-level notes in [KV Cache Routing](../../../docs/architecture/kv_cache_routing.md) on deterministic event IDs.
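
The effect of `PYTHONHASHSEED` can be checked with a small standalone sketch. It is not part of Dynamo or vLLM; the helper names `builtin_hash_in_subprocess` and `sha256_block_id` are hypothetical and exist only for illustration. The sketch computes Python's built-in `hash()` of a string in fresh interpreter processes, with and without a fixed seed, and compares it to a SHA-256 digest, which does not depend on the seed.

```python
import hashlib
import os
import subprocess
import sys


def builtin_hash_in_subprocess(text, seed=None):
    """Return hash(text) as computed by a freshly started Python interpreter.

    Hypothetical helper: each call spawns a new process, mimicking separate workers.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean, unseeded state
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def sha256_block_id(text):
    """Return a content-derived ID that is identical in every process."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    key = "some prefix tokens"
    # Unseeded: str hashes are randomized per process, so these usually differ.
    print("unseeded:", builtin_hash_in_subprocess(key), builtin_hash_in_subprocess(key))
    # Seeded: both processes agree on the value.
    print("seed=0:  ", builtin_hash_in_subprocess(key, 0), builtin_hash_in_subprocess(key, 0))
    # SHA-256: independent of PYTHONHASHSEED altogether.
    print("sha256:  ", sha256_block_id(key)[:16])
```

Under these assumptions, the two seeded values match across processes, which is the property the router needs from engine-emitted IDs.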
## Request Migration
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
......
@@ -203,6 +203,10 @@ The two types of events are:
The publisher can be initialized and used through C bindings or Python bindings.
### Deterministic Event IDs
For KV-aware routing to work across multiple workers and restarts, engines must emit deterministic block identifiers in KV events. Ensure all workers use identical engine versions/configuration so that block IDs for the same token content remain consistent. If your engine relies on Python's builtin `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect. The router recomputes local block hashes from tokens for matching, but parent/child links and removals depend on engine-provided IDs being stable.
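
To illustrate what "deterministic block identifiers" with stable parent/child links could look like, here is a minimal sketch. It assumes a chained SHA-256 scheme in which each block's ID is derived from its parent's digest plus its own tokens. This is not the hashing used by Dynamo, vLLM, or any other engine; `chained_block_ids` and its parameters are hypothetical.

```python
import hashlib
import struct


def chained_block_ids(tokens, block_size):
    """Return one deterministic ID per full block of `tokens`.

    Hypothetical scheme: ID_i = sha256(digest_{i-1} || tokens of block i),
    so an ID encodes the whole prefix, not just the block's own tokens.
    """
    ids = []
    parent = b""  # the first block has no parent
    full_len = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        # Pack token IDs as little-endian unsigned 32-bit integers.
        payload = parent + struct.pack(f"<{len(block)}I", *block)
        digest = hashlib.sha256(payload).digest()
        ids.append(digest.hex()[:16])
        parent = digest
    return ids


if __name__ == "__main__":
    a = chained_block_ids([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
    b = chained_block_ids([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)
    print(a[0] == b[0])  # shared first block, same ID
    print(a[1] == b[1])  # diverging second block, different ID
```

Because the IDs depend only on token content, two workers (or the same worker after a restart) would emit the same ID for the same prefix, so stored and removed events refer to the same tree nodes.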
### KVIndexer
The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
......