Feat: Clear cache during weight loading to prevent OOM on GPUs with <=8GB VRAM

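This change calls torch.cuda.empty_cache() before each tensor is loaded in load_cur_state_dict, releasing unused blocks held by PyTorch's CUDA caching allocator. This mitigates memory fragmentation during weight loading and is particularly beneficial for low-VRAM GPUs.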

Commit cea07d19 (unverified), authored Feb 24, 2025 by Yuhao Tsui; committed by GitHub Feb 24, 2025

Showing 1 changed file with 1 addition and 0 deletions

ktransformers/util/utils.py +1 -0

--- a/ktransformers/util/utils.py
+++ b/ktransformers/util/utils.py
@@ -70,6 +70,7 @@ def load_cur_state_dict(module: nn.Module, gguf_loader: GGUFLoader, prefix: str
            target_dtype = torch.get_default_dtype()
            device = get_device(translated_key[:translated_key.rfind(".")], gguf_loader.tensor_device_map)
            print(f"loading {translated_key} to {device}")
+            torch.cuda.empty_cache()
            # device = "cpu" if "embd" in translated_key else "cuda"
            weights = gguf_loader.load_gguf_tensor(translated_key, device = device).to(dtype = target_dtype)
            set_param(module, name, weights)
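
The one-line change above follows a general pattern: release the allocator's unused cached blocks immediately before each large per-tensor allocation. The following is a minimal sketch of that pattern in isolation, not the ktransformers implementation. The names `release_cached_blocks`, `load_tensors`, and `load_one` are hypothetical and introduced only for illustration; the only real PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.empty_cache()`.

```python
from typing import Any, Callable, Dict, Iterable

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def release_cached_blocks() -> None:
    """Release unused cached CUDA blocks; do nothing if CUDA is unavailable."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def load_tensors(
    keys: Iterable[str],
    load_one: Callable[[str], Any],  # hypothetical per-tensor loader
    clear_cache: Callable[[], None] = release_cached_blocks,
) -> Dict[str, Any]:
    """Load tensors one at a time, clearing the cache before each load."""
    loaded: Dict[str, Any] = {}
    for key in keys:
        # Free cached-but-unused blocks first so the next allocation is less
        # likely to fail because of fragmentation.
        clear_cache()
        loaded[key] = load_one(key)
    return loaded
```

Note that `torch.cuda.empty_cache()` only releases memory that no live tensor occupies, so it cannot help when the weights themselves exceed available VRAM. Calling it once per tensor also makes the allocator request memory from the driver again on the next allocation, which may slow loading somewhat; that cost is the trade-off for a lower peak of reserved memory.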