[BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161)

Fixes https://github.com/vllm-project/vllm/issues/9688, https://github.com/vllm-project/vllm/issues/11086, and #12487.

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

Commit cb3e73e4 (unverified), authored Feb 01, 2025 by fade_away; committed by GitHub Feb 01, 2025

Showing 2 changed files with 4 additions and 3 deletions

vllm/worker/model_runner.py +4 -0

vllm/worker/worker.py +0 -3

--- a/vllm/worker/model_runner.py
+++ b/vllm/worker/model_runner.py
@@ -1346,6 +1346,10 @@ class GPUModelRunnerBase(ModelRunnerBase[TModelInputForGPU]):

            self.execute_model(model_input, kv_caches, intermediate_tensors)
            torch.cuda.synchronize()
+            if self.lora_config:
+                # Remove dummy loras.
+                assert self.lora_manager is not None
+                self.remove_all_loras()
            return

    def remove_all_loras(self):

--- a/vllm/worker/worker.py
+++ b/vllm/worker/worker.py
@@ -264,10 +264,7 @@ class Worker(LocalOrDistributedWorkerBase):
               f"{(available_kv_cache_memory / GiB_bytes):.2f}GiB.")

        logger.info(msg)
-
        # Final cleanup
-        if self.model_runner.lora_manager:
-            self.model_runner.remove_all_loras()
        gc.collect()

        return num_gpu_blocks, num_cpu_blocks
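
Note on the change: the diff moves removal of the dummy LoRAs out of the caller (the `Worker`) and into the runner's profiling path itself, so cleanup no longer depends on what the caller can see through `self.model_runner.lora_manager`. The commit message does not state the root cause, so the following is only a toy sketch under an assumption: that a wrapping runner (as might be used for multi-step scheduling) exposes `lora_manager = None` while an inner runner holds the real manager. All classes and functions below (`ToyLoRAManager`, `ToyRunner`, `ToyWrapper`, `caller_side_cleanup`, `leftover_after_profiling`) are hypothetical stand-ins, not vLLM APIs.

```python
class ToyLoRAManager:
    """Hypothetical stand-in for a LoRA manager; not vLLM's real class."""

    def __init__(self):
        self.adapters = {}


class ToyRunner:
    """Hypothetical runner that registers dummy adapters while profiling."""

    def __init__(self, lora_enabled, max_loras=2):
        self.lora_manager = ToyLoRAManager() if lora_enabled else None
        self.max_loras = max_loras

    def remove_all_loras(self):
        if self.lora_manager is not None:
            self.lora_manager.adapters.clear()

    def profile_run(self, cleanup_inside):
        if self.lora_manager is not None:
            for lora_id in range(1, self.max_loras + 1):
                self.lora_manager.adapters[lora_id] = "dummy"
        # (a dummy forward pass would run here)
        if cleanup_inside:
            # Pattern after this commit: the code that created the
            # dummies also removes them.
            self.remove_all_loras()


class ToyWrapper:
    """Assumed wrapper: delegates profiling, exposes no manager of its own."""

    def __init__(self, inner):
        self.inner = inner
        self.lora_manager = None

    def profile_run(self, cleanup_inside):
        self.inner.profile_run(cleanup_inside)


def caller_side_cleanup(runner):
    # Pattern before this commit: the caller checks the runner it holds.
    if runner.lora_manager:
        runner.remove_all_loras()


def leftover_after_profiling(cleanup_inside):
    inner = ToyRunner(lora_enabled=True)
    outer = ToyWrapper(inner)
    outer.profile_run(cleanup_inside)
    if not cleanup_inside:
        caller_side_cleanup(outer)  # skipped: outer.lora_manager is None
    return sorted(inner.lora_manager.adapters)


print(leftover_after_profiling(cleanup_inside=False))  # [1, 2]
print(leftover_after_profiling(cleanup_inside=True))   # []
```

Under that assumption, caller-side cleanup silently leaves the dummy adapters registered, while cleanup inside the profiling method removes them regardless of how the runner is wrapped.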