".github/git@developer.sourcefind.cn:OpenDAS/Uni-Core.git" did not exist on "552e8b25081037774ad23b7c6d653d7aa3f55399"
Commit 2153ee81 authored by Hongkuan Zhou, committed by GitHub

fix: inconsistent router args (#94)


Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
parent b4281383
@@ -79,17 +79,17 @@ TRT_LOG=DEBUG http --port 8181
### Processor
-Processor routes the requests to the (decode) workers. Three scheduling strategies are supported: 1. random, 2. round-robin, 3. kv-aware.
+Processor routes the requests to the (decode) workers. Three scheduling strategies are supported: 1. random, 2. round-robin, 3. kv (see [Kv Router](#kv-router)).
```
-# Processor must take the same args as the (decoer) worker
+# Processor must take the same args as the (decoder) worker
# This is temporary until we communicate the ModelDeploymentCard over etcd
RUST_LOG=info python3 processor.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--block-size 64 \
--max-model-len 16384 \
-<--random-router / --round-robin-router / --kv-router>
+--router <random/round-robin/kv>
```
Alternatively, the processor can be bypassed by directly hitting the worker endpoints:
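The random and round-robin strategies the processor supports can be sketched as follows (a minimal illustration; `make_router` and the worker names are hypothetical, not code from this repo):

```python
import itertools
import random

# Illustrative sketch of the processor's scheduling strategies.
# Strategy names match the --router choices; everything else is assumed.
def make_router(strategy, workers):
    if strategy == "random":
        return lambda: random.choice(workers)
    if strategy == "round-robin":
        cycle = itertools.cycle(workers)
        return lambda: next(cycle)
    if strategy == "kv":
        raise NotImplementedError("kv routing needs the KV Router component")
    raise ValueError(f"unknown strategy: {strategy}")

rr = make_router("round-robin", ["decode-0", "decode-1", "decode-2"])
print([rr() for _ in range(4)])  # → ['decode-0', 'decode-1', 'decode-2', 'decode-0']
```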
@@ -113,14 +113,14 @@ CUDA_VISIBLE_DEVICES=1 python3 routerless/worker.py \
--kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}'
```
-### kv router
+### Kv Router
The KV Router is a component that aggregates KV Events from all the workers and maintains
a prefix tree of the cached tokens. It makes decisions on which worker to route requests
to based on the length of the prefix match and the load on the workers.
There are three steps needed to enable the kv router:
-1. Use `--kv-router` in the processor.
-2. Use `--kv-router` and `--enable-prefix-caching` in all the (decode) workers.
+1. Use `--router kv` in the processor.
+2. Use `--router kv` and `--enable-prefix-caching` in all the (decode) workers.
3. Launch the kv router in a separate terminal.
```
RUST_LOG=info python3 kv_router.py \
@@ -185,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0 python3 worker.py \
--enforce-eager \
--block-size 64 \
--max-model-len 16384 \
-<optional kv router args: --kv-router --enable-prefix-caching>
+<optional kv router args: --router kv --enable-prefix-caching>
```
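The KV Router's decision rule described above (longest prefix match vs. worker load) can be sketched like this; the scoring formula and all names are illustrative assumptions, not the implementation in `kv_router.py`:

```python
# Hypothetical sketch of kv-aware routing: score each worker by how many
# leading prompt blocks it already has cached, penalized by its load.
def longest_cached_prefix(cached_blocks, prompt_blocks):
    """Count leading prompt blocks already cached on a worker."""
    n = 0
    for cached, wanted in zip(cached_blocks, prompt_blocks):
        if cached != wanted:
            break
        n += 1
    return n

def pick_worker(workers, prompt_blocks, load_weight=0.5):
    """workers: {worker_id: {"cached": [block hashes], "load": int}}"""
    def score(worker_id):
        w = workers[worker_id]
        return longest_cached_prefix(w["cached"], prompt_blocks) - load_weight * w["load"]
    return max(workers, key=score)

workers = {
    "w0": {"cached": ["a", "b"], "load": 0},
    "w1": {"cached": ["a", "b", "c"], "load": 1},
}
print(pick_worker(workers, ["a", "b", "c", "d"]))  # → w1 (score 2.5 beats 2.0)
```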
#### Disaggregated
@@ -213,7 +213,7 @@ CUDA_VISIBLE_DEVICES=1 python3 worker.py \
--block-size 64 \
--max-num-batched-tokens 16384 \
--max-model-len 16384 \
-<optional kv router args: --kv-router --enable-prefix-caching>
+<optional kv router args: --router kv --enable-prefix-caching>
<optional disaggregated router args: --conditional-disagg --custom-disagg-router --max-local-prefill-length <length>>
```
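The conditional disaggregation flags above suggest a local-vs-remote prefill decision keyed on prompt length. A minimal sketch, assuming `--max-local-prefill-length` is a simple threshold (the actual router logic is not shown in this diff):

```python
# Hypothetical: prefill short prompts on the decode worker itself and
# hand longer ones to a dedicated prefill worker. Threshold semantics
# are an assumption based on the flag name --max-local-prefill-length.
def should_prefill_locally(prompt_len: int, max_local_prefill_length: int) -> bool:
    return prompt_len <= max_local_prefill_length

print(should_prefill_locally(512, 16384))    # → True
print(should_prefill_locally(32768, 16384))  # → False
```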
@@ -137,7 +137,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
    else:
        prefill_client = None
-    if engine_args.kv_router:
+    if engine_args.router == "kv":
        # TODO: do we need these env vars?
        VLLM_WORKER_ID = endpoint.lease_id()
        os.environ["VLLM_WORKER_ID"] = str(VLLM_WORKER_ID)
@@ -158,7 +158,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
else "vllm"
)
if engine_args.kv_router:
if engine_args.router == "kv":
engine_client.set_metrics_publisher(metrics_publisher)
# Initially send dummy metrics to kick start,
@@ -197,7 +197,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
            ).generate
        )
    ]
-    if engine_args.kv_router:
+    if engine_args.router == "kv":
        endpoints.append(metrics_publisher.create_endpoint(component))
    await asyncio.gather(*endpoints)
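The worker-side checks change from a boolean `engine_args.kv_router` to the string comparison `engine_args.router == "kv"`. A sketch of how the unified `--router` flag might be declared with argparse (the repo's actual parser setup is not part of this diff, so the names here are assumptions):

```python
import argparse

# Sketch of the unified --router flag this commit introduces; the
# choices mirror the three scheduling strategies.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--router",
    choices=["random", "round-robin", "kv"],
    default="random",
    help="scheduling strategy for routing requests to decode workers",
)

args = parser.parse_args(["--router", "kv"])
if args.router == "kv":
    # mirrors the engine_args.router == "kv" checks in worker.py
    print("kv routing enabled")  # → kv routing enabled
```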