Commit 2153ee81 authored by Hongkuan Zhou, committed by GitHub

fix: inconsistent router args (#94)


Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
parent b4281383
@@ -79,17 +79,17 @@ TRT_LOG=DEBUG http --port 8181
 ### Processor
-Processor routes the requests to the (decode) workers. Three scheduling strategies are supported: 1. random, 2. round-robin, 3. kv-aware.
+Processor routes the requests to the (decode) workers. Three scheduling strategies are supported: 1. random, 2. round-robin, 3. kv (see [Kv Router](#kv-router)).
 ```
-# Processor must take the same args as the (decoer) worker
+# Processor must take the same args as the (decoder) worker
 # This is temporary until we communicate the ModelDeploymentCard over etcd
 RUST_LOG=info python3 processor.py \
     --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
     --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
     --block-size 64 \
     --max-model-len 16384 \
-    <--random-router / --round-robin-router / --kv-router>
+    --router <random/round-robin/kv>
 ```
 Alternatively, the processor can be bypassed by directly hitting the worker endpoints:
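The three strategies the unified `--router` flag selects between can be sketched as follows. This is an illustrative model, not the actual `processor.py` implementation; the `Router` class and worker IDs are assumptions, and the `kv` branch is stubbed out since the real decision is delegated to the KV Router.

```python
import itertools
import random

class Router:
    """Hypothetical sketch of the processor's --router dispatch."""

    def __init__(self, strategy: str, worker_ids: list[int]):
        if strategy not in ("random", "round-robin", "kv"):
            raise ValueError(f"unknown --router value: {strategy}")
        self.strategy = strategy
        self.worker_ids = worker_ids
        self._rr = itertools.cycle(worker_ids)  # round-robin state

    def choose(self, token_ids: list[int]) -> int:
        if self.strategy == "random":
            return random.choice(self.worker_ids)
        if self.strategy == "round-robin":
            return next(self._rr)
        # "kv" defers to the KV Router, which scores workers by prefix
        # overlap and load; modeled here as a stub picking the first worker.
        return self.worker_ids[0]

router = Router("round-robin", [0, 1, 2])
print([router.choose([]) for _ in range(4)])  # cycles: [0, 1, 2, 0]
```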
@@ -113,14 +113,14 @@ CUDA_VISIBLE_DEVICES=1 python3 routerless/worker.py \
     --kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}'
 ```
-### kv router
+### Kv Router
 The KV Router is a component that aggregates KV Events from all the workers and maintains
 a prefix tree of the cached tokens. It makes decisions on which worker to route requests
 to based on the length of the prefix match and the load on the workers.
 There are three steps needed to enable the kv router:
-1. Use `--kv-router` in the processor.
-2. Use `--kv-router` and `--enable-prefix-caching` in all the (decode) workers.
+1. Use `--router kv` in the processor.
+2. Use `--router kv` and `--enable-prefix-caching` in all the (decode) workers.
 3. Launch the kv router in a separate terminal.
 ```
 RUST_LOG=info python3 kv_router.py \
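The KV Router's routing rule described above (longest cached-prefix match, balanced against worker load) can be sketched roughly as below. The flat prefix lists, the load-penalty weight, and the worker dictionary layout are assumptions for illustration; `kv_router.py` maintains a real prefix tree fed by KV Events.

```python
def prefix_match(cached_tokens: list[int], request_tokens: list[int]) -> int:
    """Length of the common leading token span."""
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

def pick_worker(workers: dict[int, dict], request_tokens: list[int]) -> int:
    # Score each worker: matched prefix length minus a load penalty.
    # The 0.5 weight is an assumed tuning knob, not the real scoring function.
    def score(wid: int) -> float:
        w = workers[wid]
        return prefix_match(w["cached"], request_tokens) - 0.5 * w["load"]
    return max(workers, key=score)

workers = {
    0: {"cached": [1, 2, 3, 4], "load": 2},
    1: {"cached": [1, 2], "load": 0},
}
# Worker 0's longer cached prefix (4 tokens) outweighs its higher load.
print(pick_worker(workers, [1, 2, 3, 4, 5]))  # 0
```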
@@ -185,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0 python3 worker.py \
     --enforce-eager \
     --block-size 64 \
     --max-model-len 16384 \
-    <optional kv router args: --kv-router --enable-prefix-caching>
+    <optional kv router args: --router kv --enable-prefix-caching>
 ```
 #### Disaggregated
@@ -213,7 +213,7 @@ CUDA_VISIBLE_DEVICES=1 python3 worker.py \
     --block-size 64 \
     --max-num-batched-tokens 16384 \
     --max-model-len 16384 \
-    <optional kv router args: --kv-router --enable-prefix-caching>
+    <optional kv router args: --router kv --enable-prefix-caching>
     <optional disaggregated router args: --conditional-disagg --custom-disagg-router --max-local-prefill-length <length>>
 ```
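The `--conditional-disagg` / `--max-local-prefill-length` pair suggests a decision rule like the following: prefill on the local decode worker when the uncached portion of the prompt is short, and dispatch to a remote prefill worker otherwise. This decision function is an assumption sketched from the flag names, not the repository's actual disaggregated router logic.

```python
def should_prefill_locally(prompt_len: int,
                           cached_prefix_len: int,
                           max_local_prefill_length: int) -> bool:
    # Only tokens not already covered by the KV cache need prefill.
    remaining = prompt_len - cached_prefix_len
    return remaining <= max_local_prefill_length

# Long prompt, but almost fully cached: the 500 remaining tokens fit locally.
print(should_prefill_locally(10000, 9500, 1000))  # True
# Cold prompt: the full 10k-token prefill goes to the prefill worker.
print(should_prefill_locally(10000, 0, 1000))     # False
```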
@@ -137,7 +137,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
     else:
         prefill_client = None
-    if engine_args.kv_router:
+    if engine_args.router == "kv":
         # TODO: do we need these env vars?
         VLLM_WORKER_ID = endpoint.lease_id()
         os.environ["VLLM_WORKER_ID"] = str(VLLM_WORKER_ID)
@@ -158,7 +158,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
         else "vllm"
     )
-    if engine_args.kv_router:
+    if engine_args.router == "kv":
         engine_client.set_metrics_publisher(metrics_publisher)
         # Initially send dummy metrics to kick start,
@@ -197,7 +197,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
             ).generate
         )
     ]
-    if engine_args.kv_router:
+    if engine_args.router == "kv":
         endpoints.append(metrics_publisher.create_endpoint(component))
     await asyncio.gather(*endpoints)
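The argument-side of this change, replacing the three mutually exclusive boolean flags with one `--router` choice shared by the processor and the workers, can be sketched with `argparse`. The parser below is illustrative of the flag shape the diff shows, not the repository's actual argument wiring.

```python
import argparse

# One --router choice replaces --random-router / --round-robin-router /
# --kv-router, so every component parses the same argument. The default
# value here is an assumption.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--router",
    choices=["random", "round-robin", "kv"],
    default="random",
    help="request scheduling strategy",
)

args = parser.parse_args(["--router", "kv"])
print(args.router)  # kv
```

Code then branches on the string value (`if engine_args.router == "kv":`) instead of a per-strategy boolean, which is what the worker-side hunks above change.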