Commit 30c5a79f authored by Hongkuan Zhou, committed by GitHub

feat: unified entry point for vllm-nixl (#83)


Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
parent 2340751b
......@@ -49,115 +49,93 @@ All of the commands below are run inside the same container.
## Run deployment
This figure shows an overview of the major components to deploy:
```
+----------------+
+------| prefill worker |-------+
notify | | (optional) | |
finished | +----------------+ | pull
v v
+------+ +-----------+ +------------------+ push +---------------+
| HTTP |----->| processor |----->| decode/monolith |------------>| prefill queue |
| |<-----| |<-----| worker | (if disagg) | (optional) |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| (optional) |
+------------------+
```
Add model to dynamo and start http server.
```
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B dynamo-init.process.chat/completions
TRT_LOG=DEBUG http --port 8181
```
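Once the model is registered and the HTTP server is up, you can sanity-check the deployment with a quick client call. The sketch below is illustrative only: it assumes the frontend exposes an OpenAI-compatible `/v1/chat/completions` route on the port passed to `http` (8181 here) and that the `requests` package is installed.
```
# Minimal smoke test (assumption: OpenAI-compatible route at /v1/chat/completions on port 8181).
import requests

resp = requests.post(
    "http://localhost:8181/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```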
### Processor
Processor routes the requests to the (decode) workers. Three scheduling strategies are supported: random, round-robin, and kv-aware.
```
# Processor must take the same args as the (decode) worker
# This is temporary until we communicate the ModelDeploymentCard over etcd
RUST_LOG=info python3 processor.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--block-size 64 \
--max-model-len 16384 \
<--random-router / --round-robin-router / --kv-router>
```
### Router-less Deployment
Router-less deployment runs without the kv router and the disaggregated router.
For router-less deployment, the client should directly hit the vllm.generate endpoint:
```
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B dynamo-init.vllm.generate
```
#### Monolithic
Alternatively, the processor can be bypassed by directly hitting the worker endpoints:
```
cd /workspace/examples/python_rs/llm/vllm_nixl
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B dynamo-init.vllm.generate
# monolithic
CUDA_VISIBLE_DEVICES=0 python3 routerless/worker.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--enforce-eager
```
#### Disaggregated
In disaggregated router-less deployment, the decode worker will directly send requests to a random prefill worker. All the requests will be sent to prefill worker(s) for remote prefill.
In terminal 1:
```
cd /workspace/examples/python_rs/llm/vllm_nixl
# disaggregated
CUDA_VISIBLE_DEVICES=0 python routerless/prefill_worker.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--enforce-eager \
--block-size 64 \
--kv-transfer-config \
'{"kv_connector":"DynamoNixlConnector"}'
```
In terminal 2:
```
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=1,2 python3 routerless/worker.py \
--remote-prefill \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--enforce-eager \
--block-size 64 \
--tensor-parallel-size 2 \
--kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}'
```
### Router-based Deployment
Router-based deployment uses the kv router to schedule each request to the best decode worker and the disaggregated router to decide whether to prefill locally or remotely. Remote prefill requests are sent to a global prefill queue to balance the prefill load.
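The queue decouples which decode worker produced a prefill request from which prefill worker executes it: any idle prefill worker can pull the next request. The asyncio sketch below only illustrates this pull-based balancing; the actual implementation uses a NATS-backed queue (see `NATS_SERVER` in the worker code), and none of the names here are the real API.
```
# Conceptual sketch of a shared prefill queue; not the NATS-based implementation.
import asyncio

async def decode_worker(name: str, queue: asyncio.Queue) -> None:
    # Decode workers push remote-prefill requests instead of prefilling locally.
    for i in range(3):
        await queue.put(f"{name}-request-{i}")

async def prefill_worker(name: str, queue: asyncio.Queue) -> None:
    # Any idle prefill worker pulls the next request, so load balances naturally.
    while True:
        req = await queue.get()
        print(f"{name} prefilling {req}")
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(prefill_worker(f"prefill-{i}", queue)) for i in range(2)]
    await asyncio.gather(decode_worker("decode-0", queue), decode_worker("decode-1", queue))
    await queue.join()  # wait until every queued prefill has been handled
    for w in workers:
        w.cancel()

asyncio.run(main())
```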
### KV Router
For router deployment, the client should hit the endpoint of the processor:
```
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B dynamo-init.process.chat/completions
```
To launch disaggregated vllm deployment, there are four major components:
1. Processor
2. KV Router
3. Disaggregated Router
4. Prefill and Decode Workers
#### Processor
```
# Processor must take the same args as the worker
# This is temporary until we communicate the ModelDeploymentCard over etcd
# Currently only block-size=64 is supported
cd /workspace/examples/python_rs/llm/vllm_nixl
RUST_LOG=info python3 router/processor.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--enable-prefix-caching \
--block-size 64 \
--max-model-len 16384
```
#### KV Router
The KV Router is a component that aggregates KV Events from all the workers and maintains a prefix tree of the cached tokens. It makes decisions on which worker to route requests to based on the length of the prefix match and the load on the workers.
To launch the KV Router, run the following command:
```
RUST_LOG=info python3 router/kv_router.py \
--routing-strategy prefix \
--model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--min-workers 1
```
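The kv router's decisions are block-granular: a prompt only "hits" the cache in whole KV blocks, which is why the router and the workers must agree on `--block-size`. The snippet below is a simplified illustration of block-aligned prefix matching, not the actual `KvIndexer` implementation; `cached_blocks` is a made-up stand-in for the indexer's prefix tree.
```
# Illustration only: block-aligned prefix matching (not the real KvIndexer logic).
def matched_blocks(prompt_tokens, cached_blocks, block_size=64):
    """Count how many leading KV blocks of the prompt are already cached on a worker."""
    count = 0
    for start in range(0, len(prompt_tokens) - block_size + 1, block_size):
        block = tuple(prompt_tokens[start:start + block_size])
        if block not in cached_blocks:
            break  # the match must be a contiguous prefix
        count += 1
    return count

# Example: the first two 64-token blocks of the prompt are cached, the third is not.
prompt = list(range(200))
cache = {tuple(range(0, 64)), tuple(range(64, 128))}
print(matched_blocks(prompt, cache))  # 2
```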
There is also a custom router that uses a cost function defined in python to make routing decisions. To launch the custom router, run the following command:
```
RUST_LOG=info python3 router/kv_router.py \
--routing-strategy prefix \
--model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--custom-router \
--min-workers 1
```
There are three steps needed to enable the kv router:
1. Use `--kv-router` in the processor.
2. Use `--kv-router` and `--enable-prefix-caching` in all the (decode) workers.
3. Launch the kv router in a separate terminal.
```
RUST_LOG=info python3 kv_router.py \
--routing-strategy prefix \
--model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--block-size 64 \
--min-workers 1
```
where `--min-workers` is the number of (decode) workers.
There is also a python-based customized router that can be enabled by `--custom-router`.
You can choose only the prefix strategy for now:
- `prefix`: Route requests to the worker that has the longest prefix match.
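The cost function of the custom router is user-defined; the shipped `CustomRouter` is built on a `KvIndexer` and a `KvMetricsAggregator`. The sketch below only illustrates the kind of trade-off such a function can encode (prefix overlap vs. load); the field names and weights are assumptions, not the actual implementation.
```
# Hypothetical scoring rule in the spirit of the custom router; names and weights are
# illustrative only.
def score(overlap_blocks, active_blocks, total_blocks):
    """Higher is better: reward cached prefix blocks, penalize busy workers."""
    load = active_blocks / max(total_blocks, 1)
    return overlap_blocks - 2.0 * load  # the weight 2.0 is arbitrary

def pick_worker(candidates):
    """candidates: worker_id -> (overlap_blocks, active_blocks, total_blocks)."""
    return max(candidates, key=lambda wid: score(*candidates[wid]))

# Worker 2 has more cached prefix and less load, so it is chosen.
print(pick_worker({1: (0, 800, 1000), 2: (6, 200, 1000)}))  # 2
```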
### Disaggregated Router
The disaggregated router determines whether a request should be sent to a
remote prefill engine or a local prefill engine for prefilling based on the
......@@ -185,20 +163,39 @@ There are two types of disaggregated router implementations:
kv router as the rust kv router does not report kv cache hit ratio.
To use the python disaggregated router, add the following arguments when launching the decode worker:
```
python worker.py \
--custom-disagg-router \
--max-local-prefill-length <length> \
--max-remote-prefill-cache-hit-ratio <ratio>
```
#### Workers
To enable the disaggregated router, add the following arguments to the decode workers:
```
python worker.py \
...
--conditional-disagg \
<optional: --custom-disagg-router> \
--max-local-prefill-length <length>
```
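The intent of these thresholds (`--max-local-prefill-length` and, with the python router, `--max-remote-prefill-cache-hit-ratio`) can be pictured as: prefill locally when the prompt is short or mostly cached already, otherwise send it to a remote prefill engine. The sketch below is only an approximation of that rule; the exact policy lives in `PyDisaggregatedRouter` and may differ.
```
# Approximation of the disaggregated-router decision; the real policy is in
# PyDisaggregatedRouter and may differ in detail.
def prefill_remotely(prefill_length, cache_hit_ratio,
                     max_local_prefill_length, max_remote_prefill_cache_hit_ratio):
    if prefill_length <= max_local_prefill_length:
        return False  # short prefills are cheap enough to run on the decode engine
    if cache_hit_ratio > max_remote_prefill_cache_hit_ratio:
        return False  # mostly-cached prompts gain little from remote prefill
    return True       # long, mostly-uncached prompts go to the prefill workers

# A long prompt with a cold cache is prefilled remotely.
print(prefill_remotely(8192, 0.1, 1024, 0.5))  # True
```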
### Worker
#### Monolithic
Only the kv router is supported for monolithic deployment.
```
CUDA_VISIBLE_DEVICES=0 python3 worker.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--enforce-eager \
--block-size 64 \
--max-model-len 16384 \
<optional kv router args: --kv-router --enable-prefix-caching>
```
#### Disaggregated
Both the kv router and the disaggregated router are supported, and they can be turned on/off individually.
```
# start prefill worker in one terminal
# Note: prefix caching is not supported in the prefill for now
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=0 python3 prefill_worker.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--enforce-eager \
--kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}' \
......@@ -206,25 +203,18 @@ CUDA_VISIBLE_DEVICES=0 python3 router/prefill_worker.py \
--max-num-batched-tokens 16384 \
--max-model-len 16384
# start decode worker in another terminal
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=1 python3 worker.py \
--remote-prefill \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--enforce-eager \
--tensor-parallel-size 1 \
--kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}' \
--enable-prefix-caching \
--block-size 64 \
--max-num-batched-tokens 16384 \
--max-model-len 16384 \
<optional kv router args: --kv-router --enable-prefix-caching>
<optional disaggregated router args: --conditional-disagg --custom-disagg-router --max-local-prefill-length <length>>
```
Alternatively, we also provide a script to launch all workers in one go (with the python customized router):
```
# TODO: change to dynamo-deploy functionality
./start_single_node.sh
# Usage: [--model <model>] [--p_tensor_parallel_size <size>] [--d_tensor_parallel_size <size>] [--max_model_len <len>] [--max_num_batched_tokens <tokens>] [--max_num_seqs <seqs>] [--gpu_memory_utilization <utilization>] [--enable_chunked_prefill <True/False>] [--num_p <p>] [--num_d <d>]
```
### Common Issues
......@@ -294,7 +284,6 @@ pkill -9 -f python
- [ ] Add etcd for discovery
- [ ] Multi-node deployment support
- [ ] Enable chunked prefill
- [ ] Support mixed tp
- [ ] Process many remote prefill in one iteration
- [ ] Support recompute preemption
- [ ] Make sure decode does not preempt blocks before xfer finishes
......@@ -304,6 +293,7 @@ pkill -9 -f python
- [ ] Support pp > 1
- [ ] Check why adding extra seed input is crashing vllm with remote prefill
- [ ] Unified worker for both prefill and decode
- [x] Support mixed tp
- [x] Require sending two parallel requests to start decode for the first time
- [x] Concurrency > 2 is not working
- [x] Parse cmdline args
......
......@@ -173,13 +173,14 @@ async def worker(runtime: DistributedRuntime, args: Namespace):
endpoint = router_component.endpoint("generate")
if args.custom_router:
indexer = KvIndexer(kv_listener)
indexer = KvIndexer(kv_listener, args.block_size)
metrics_aggregator = KvMetricsAggregator(kv_listener)
await endpoint.serve_endpoint(
CustomRouter(indexer, metrics_aggregator).generate
)
else:
router = KvRouter(runtime, kv_listener)
# TODO Read block_size from MDC
router = KvRouter(runtime, kv_listener, args.block_size)
await endpoint.serve_endpoint(Router(router, args.routing_strategy).generate)
......@@ -208,6 +209,11 @@ if __name__ == "__main__":
default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
help="Model that is being served",
)
parser.add_argument(
"--block-size",
type=int,
help="KV block size",
)
parser.add_argument(
"--custom-router",
type=bool,
......
......@@ -62,6 +62,7 @@ class Processor(ProcessMixIn):
)
self.router_client = router_client
self.workers_client = workers_client
self.router_mode = engine_args.router
def _create_tokenizer(self, engine_args: AsyncEngineArgs) -> AnyTokenizer:
"""Create a TokenizerGroup using engine arguments similar to VLLM's approach"""
......@@ -91,6 +92,8 @@ class Processor(ProcessMixIn):
engine_prompt,
sampling_params,
) = await self._parse_raw_request(raw_request)
if self.router_mode == "kv":
worker_id_generator: AsyncIterator = await self.router_client.generate(
Tokens(tokens=engine_prompt["prompt_token_ids"]).model_dump_json()
)
......@@ -118,6 +121,22 @@ class Processor(ProcessMixIn):
).model_dump_json(),
int(worker_id),
)
elif self.router_mode == "random":
engine_generator = await self.workers_client.random(
vLLMGenerateRequest(
engine_prompt=engine_prompt,
sampling_params=sampling_params,
request_id=request_id,
).model_dump_json()
)
elif self.router_mode == "round-robin":
engine_generator = await self.workers_client.round_robin(
vLLMGenerateRequest(
engine_prompt=engine_prompt,
sampling_params=sampling_params,
request_id=request_id,
).model_dump_json()
)
output = self._generate_responses(engine_generator, request_type)
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# default values
model=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
p_tensor_parallel_size=1
d_tensor_parallel_size=1
max_model_len=16384
max_num_batched_tokens=16384
max_num_seqs=1024
gpu_memory_utilization=0.9
enable_chunked_prefill=False
block_size=64
num_p=2
num_d=2
total_rank=$((num_p + num_d))
curr_kv_rank=0
# Function to display usage
usage() {
echo "Usage: $0 [--model <model>] [--p_tensor_parallel_size <size>] [--d_tensor_parallel_size <size>] [--max_model_len <len>] [--max_num_batched_tokens <tokens>] [--max_num_seqs <seqs>] [--gpu_memory_utilization <utilization>] [--enable_chunked_prefill <True/False>] [--num_p <p>] [--num_d <d>]"
exit 1
}
# Parse the command-line arguments
while [[ $# -gt 0 ]]; do
case "$1" in
--model)
model="$2"
shift 2
;;
--p_tensor_parallel_size)
p_tensor_parallel_size="$2"
shift 2
;;
--d_tensor_parallel_size)
d_tensor_parallel_size="$2"
shift 2
;;
--max_model_len)
max_model_len="$2"
shift 2
;;
--max_num_batched_tokens)
max_num_batched_tokens="$2"
shift 2
;;
--max_num_seqs)
max_num_seqs="$2"
shift 2
;;
--gpu_memory_utilization)
gpu_memory_utilization="$2"
shift 2
;;
--enable_chunked_prefill)
enable_chunked_prefill="$2"
shift 2
;;
--num_p)
num_p="$2"
shift 2
;;
--num_d)
num_d="$2"
shift 2
;;
--total_rank)
total_rank="$2"
shift 2
;;
--curr_kv_rank)
curr_kv_rank="$2"
shift 2
;;
--block_size)
block_size="$2"
shift 2
;;
*)
usage
;;
esac
done
# rank here is GPU rank
curr_rank=0
echo "total rank: "${total_rank}
for (( i=1; i<=num_d; i++ )); do
cuda_devices=$(seq $curr_rank $(($curr_rank + $d_tensor_parallel_size - 1)))
cuda_devices=$(echo $cuda_devices | tr ' ' ',')
echo "starting gpu rank "${cuda_devices}" (decode)"
CUDA_VISIBLE_DEVICES=${cuda_devices} python3 worker.py \
--remote-prefill \
--model ${model} \
--max-model-len ${max_model_len} \
--max-num-batched-tokens ${max_num_batched_tokens} \
--enable-chunked-prefill ${enable_chunked_prefill} \
--gpu-memory-utilization ${gpu_memory_utilization} \
--enforce-eager \
--enable-prefix-caching \
--tensor-parallel-size ${d_tensor_parallel_size} \
--block-size ${block_size} \
--kv-transfer-config '{"kv_connector":"dynamoNixlConnector"}' & disown
curr_rank=$((curr_rank + d_tensor_parallel_size))
curr_kv_rank=$((curr_kv_rank + 1))
done
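# Prefill workers continue from the next unused GPU rank after the decode workers.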
for (( i=1; i<=num_p; i++ )); do
cuda_devices=$(seq $curr_rank $(($curr_rank + $p_tensor_parallel_size - 1)))
cuda_devices=$(echo $cuda_devices | tr ' ' ',')
echo "starting gpu rank "${cuda_devices}" (prefill)"
CUDA_VISIBLE_DEVICES=${cuda_devices} python3 prefill_worker.py \
--model ${model} \
--max-model-len ${max_model_len} \
--max-num-batched-tokens ${max_num_batched_tokens} \
--enable-chunked-prefill ${enable_chunked_prefill} \
--gpu-memory-utilization ${gpu_memory_utilization} \
--enforce-eager \
--tensor-parallel-size ${p_tensor_parallel_size} \
--block-size ${block_size} \
--kv-transfer-config '{"kv_connector":"dynamoNixlConnector"}' & disown
curr_rank=$((curr_rank + p_tensor_parallel_size))
curr_kv_rank=$((curr_kv_rank + 1))
done
......@@ -20,6 +20,13 @@ from vllm.utils import FlexibleArgumentParser
def parse_vllm_args() -> AsyncEngineArgs:
parser = FlexibleArgumentParser()
parser.add_argument(
"--router",
type=str,
choices=["random", "round-robin", "kv"],
default="random",
help="Router type to use for scheduling requests to workers",
)
parser.add_argument(
"--remote-prefill", action="store_true", help="Enable remote prefill"
)
......@@ -49,6 +56,7 @@ def parse_vllm_args() -> AsyncEngineArgs:
parser = AsyncEngineArgs.add_cli_args(parser)
args = parser.parse_args()
engine_args = AsyncEngineArgs.from_cli_args(args)
engine_args.router = args.router
engine_args.remote_prefill = args.remote_prefill
engine_args.conditional_disagg = args.conditional_disagg
engine_args.custom_disagg_router = args.custom_disagg_router
......
......@@ -54,10 +54,6 @@ class RequestHandler:
do_remote_prefill # remote prefill is still controlled by the router
)
self.disaggregated_router = disaggregated_router
if do_remote_prefill:
assert (
disaggregated_router is not None
), "Disaggregated router is required for remote prefill"
self._prefill_queue_nats_server = os.getenv(
"NATS_SERVER", "nats://localhost:4222"
......@@ -90,6 +86,7 @@ class RequestHandler:
else:
# always prefill remotely if no disaggregated router is provided
disagg_router_decision = True
if self.do_remote_prefill and disagg_router_decision:
remote_prefill_params = RemotePrefillParams(
is_remote_prefill=True,
......@@ -130,13 +127,17 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
endpoint = component.endpoint("generate")
if engine_args.remote_prefill:
prefill_client = (
await runtime.namespace("dynamo-init")
.component("prefill")
.endpoint("generate")
.client()
)
else:
prefill_client = None
if engine_args.kv_router:
# TODO: do we need these env vars?
VLLM_WORKER_ID = endpoint.lease_id()
os.environ["VLLM_WORKER_ID"] = str(VLLM_WORKER_ID)
......@@ -156,14 +157,8 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
if engine_args.served_model_name is not None
else "vllm"
)
disaggregated_router = PyDisaggregatedRouter(
runtime,
served_model_name,
custom_disagg_router=engine_args.custom_disagg_router,
max_local_prefill_length=engine_args.max_local_prefill_length,
max_remote_prefill_cache_hit_ratio=engine_args.max_remote_prefill_cache_hit_ratio,
)
if engine_args.kv_router:
engine_client.set_metrics_publisher(metrics_publisher)
# Initially send dummy metrics to kick start,
......@@ -175,28 +170,43 @@ async def worker(runtime: DistributedRuntime, engine_args: AsyncEngineArgs):
1024,
)
if engine_args.remote_prefill:
metadata = engine_client.nixl_metadata
metadata_store = NixlMetadataStore("dynamo-init", runtime)
await metadata_store.put(metadata.engine_id, metadata)
await asyncio.gather(
if engine_args.conditional_disagg:
disaggregated_router = PyDisaggregatedRouter(
runtime,
served_model_name,
custom_disagg_router=engine_args.custom_disagg_router,
max_local_prefill_length=engine_args.max_local_prefill_length,
max_remote_prefill_cache_hit_ratio=engine_args.max_remote_prefill_cache_hit_ratio,
)
else:
disaggregated_router = None
endpoints = [
endpoint.serve_endpoint(
RequestHandler(
model_name=served_model_name,
engine_client=engine_client,
prefill_client=prefill_client,
do_remote_prefill=True,
do_remote_prefill=engine_args.remote_prefill,
disaggregated_router=disaggregated_router,
).generate
),
metrics_publisher.create_endpoint(component),
)
]
if engine_args.kv_router:
endpoints.append(metrics_publisher.create_endpoint(component))
await asyncio.gather(*endpoints)
if __name__ == "__main__":
uvloop.install()
engine_args = parse_vllm_args()
if engine_args.remote_prefill:
if engine_args.enable_chunked_prefill is not False:
print("Chunked prefill is not supported yet, setting to False")
engine_args.enable_chunked_prefill = False
......