".github/git@developer.sourcefind.cn:OpenDAS/Uni-Core.git" did not exist on "552e8b25081037774ad23b7c6d653d7aa3f55399"
Commit 2153ee81 authored by Hongkuan Zhou, committed by GitHub

fix: inconsistent router args (#94)


Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
parent b4281383
@@ -79,17 +79,17 @@ TRT_LOG=DEBUG http --port 8181
### Processor
-Processor routes the requests to the (decode) workers. Three scheduling strategies are supported: 1. random, 2. round-robin, 3. kv-aware.
+Processor routes the requests to the (decode) workers. Three scheduling strategies are supported: 1. random, 2. round-robin, 3. kv (see [Kv Router](#kv-router)).
```
-# Processor must take the same args as the (decoer) worker
+# Processor must take the same args as the (decoder) worker
# This is temporary until we communicate the ModelDeploymentCard over etcd
RUST_LOG=info python3 processor.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--block-size 64 \
--max-model-len 16384 \
-<--random-router / --round-robin-router / --kv-router>
+--router <random/round-robin/kv>
```
Alternatively, the processor can be bypassed by directly hitting the worker endpoints:
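The random and round-robin strategies the processor supports can be sketched as follows (a minimal illustration; `make_router` and the worker names are hypothetical, not code from this repo):

```python
import itertools
import random

# Illustrative sketch of the processor's scheduling strategies.
# Strategy names match the --router choices; everything else is assumed.
def make_router(strategy, workers):
    if strategy == "random":
        return lambda: random.choice(workers)
    if strategy == "round-robin":
        cycle = itertools.cycle(workers)
        return lambda: next(cycle)
    if strategy == "kv":
        raise NotImplementedError("kv routing needs the KV Router component")
    raise ValueError(f"unknown strategy: {strategy}")

rr = make_router("round-robin", ["decode-0", "decode-1", "decode-2"])
print([rr() for _ in range(4)])  # → ['decode-0', 'decode-1', 'decode-2', 'decode-0']
```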
@@ -113,14 +113,14 @@ CUDA_VISIBLE_DEVICES=1 python3 routerless/worker.py \
--kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}'
```
-### kv router
+### Kv Router
The KV Router is a component that aggregates KV Events from all the workers and maintains
a prefix tree of the cached tokens. It makes decisions on which worker to route requests
to based on the length of the prefix match and the load on the workers.
There are three steps needed to enable the kv router:
-1. Use `--kv-router` in the processor.
-2. Use `--kv-router` and `--enable-prefix-caching` in all the (decode) workers.
+1. Use `--router kv` in the processor.
+2. Use `--router kv` and `--enable-prefix-caching` in all the (decode) workers.
3. Launch the kv router in a separate terminal.
```
RUST_LOG=info python3 kv_router.py \
@@ -185,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0 python3 worker.py \
--enforce-eager \
--block-size 64 \
--max-model-len 16384 \
-<optional kv router args: --kv-router --enable-prefix-caching>
+<optional kv router args: --router kv --enable-prefix-caching>
```
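The KV Router's decision rule described above (longest prefix match vs. worker load) can be sketched like this; the scoring formula and all names are illustrative assumptions, not the implementation in `kv_router.py`:

```python
# Hypothetical sketch of kv-aware routing: score each worker by how many
# leading prompt blocks it already has cached, penalized by its load.
def longest_cached_prefix(cached_blocks, prompt_blocks):
    """Count leading prompt blocks already cached on a worker."""
    n = 0
    for cached, wanted in zip(cached_blocks, prompt_blocks):
        if cached != wanted:
            break
        n += 1
    return n

def pick_worker(workers, prompt_blocks, load_weight=0.5):
    """workers: {worker_id: {"cached": [block hashes], "load": int}}"""
    def score(worker_id):
        w = workers[worker_id]
        return longest_cached_prefix(w["cached"], prompt_blocks) - load_weight * w["load"]
    return max(workers, key=score)

workers = {
    "w0": {"cached": ["a", "b"], "load": 0},
    "w1": {"cached": ["a", "b", "c"], "load": 1},
}
print(pick_worker(workers, ["a", "b", "c", "d"]))  # → w1 (score 2.5 beats 2.0)
```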
#### Disaggregated
@@ -213,7 +213,7 @@ CUDA_VISIBLE_DEVICES=1 python3 worker.py \
--block-size 64 \
--max-num-batched-tokens 16384 \
--max-model-len 16384 \
-<optional kv router args: --kv-router --enable-prefix-caching>
+<optional kv router args: --router kv --enable-prefix-caching>
<optional disaggregated router args: --conditional-disagg --custom-disagg-router --max-local-prefill-length <length>>
```
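The conditional disaggregation flags above suggest a local-vs-remote prefill decision keyed on prompt length. A minimal sketch, assuming `--max-local-prefill-length` is a simple threshold (the actual router logic is not shown in this diff):

```python
# Hypothetical: prefill short prompts on the decode worker itself and
# hand longer ones to a dedicated prefill worker. Threshold semantics
# are an assumption based on the flag name --max-local-prefill-length.
def should_prefill_locally(prompt_len: int, max_local_prefill_length: int) -> bool:
    return prompt_len <= max_local_prefill_length

print(should_prefill_locally(512, 16384))    # → True
print(should_prefill_locally(32768, 16384))  # → False
```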
@@ -137,7 +137,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
    else:
        prefill_client = None
-    if engine_args.kv_router:
+    if engine_args.router == "kv":
        # TODO: do we need these env vars?
        VLLM_WORKER_ID = endpoint.lease_id()
        os.environ["VLLM_WORKER_ID"] = str(VLLM_WORKER_ID)
@@ -158,7 +158,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
else "vllm"
)
if engine_args.kv_router:
if engine_args.router == "kv":
engine_client.set_metrics_publisher(metrics_publisher)
# Initially send dummy metrics to kick start,
@@ -197,7 +197,7 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
            ).generate
        )
    ]
-    if engine_args.kv_router:
+    if engine_args.router == "kv":
        endpoints.append(metrics_publisher.create_endpoint(component))
    await asyncio.gather(*endpoints)
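The worker-side checks change from a boolean `engine_args.kv_router` to the string comparison `engine_args.router == "kv"`. A sketch of how the unified `--router` flag might be declared with argparse (the repo's actual parser setup is not part of this diff, so the names here are assumptions):

```python
import argparse

# Sketch of the unified --router flag this commit introduces; the
# choices mirror the three scheduling strategies.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--router",
    choices=["random", "round-robin", "kv"],
    default="random",
    help="scheduling strategy for routing requests to decode workers",
)

args = parser.parse_args(["--router", "kv"])
if args.router == "kv":
    # mirrors the engine_args.router == "kv" checks in worker.py
    print("kv routing enabled")  # → kv routing enabled
```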