info="Controls the chunk size for applying rotary embeddings. Larger values may improve performance but increase memory usage. Only effective if 'rotary_chunk' is checked.",
info="Controls the chunk size for applying rotary embeddings. Larger values may improve performance but increase memory usage. Only effective if 'rotary_chunk' is checked.",
)
unload_modules=gr.Checkbox(
label="Unload Modules",
value=False,
info="Unload modules (T5, CLIP, DIT, etc.) after inference to reduce GPU/CPU memory usage",
)
clean_cuda_cache=gr.Checkbox(
label="Clean CUDA Memory Cache",
value=False,
...
@@ -883,6 +943,12 @@ def main():
value=1.0,
info="Controls how much of the DiT model is offloaded to the CPU",
)
t5_cpu_offload=gr.Checkbox(
label="T5 CPU Offloading",
value=False,
info="Offload the T5 Encoder model to CPU to reduce GPU memory usage",
...
@@ -1088,14 +1164,16 @@ if __name__ == "__main__":
default="wan2.1",
default="wan2.1",
help="Model class to use",
help="Model class to use",
)
)
parser.add_argument("--model_size",type=str,required=True,choices=["14b","1.3b"],help="Model type to use")
parser.add_argument("--task",type=str,required=True,choices=["i2v","t2v"],help="Specify the task type. 'i2v' for image-to-video translation, 't2v' for text-to-video generation.")
parser.add_argument("--task",type=str,required=True,choices=["i2v","t2v"],help="Specify the task type. 'i2v' for image-to-video translation, 't2v' for text-to-video generation.")
...
@@ -15,7 +15,7 @@ This project contains two main demo files:
- Python 3.10+ (recommended)
- CUDA 12.4+ (recommended)
- At least 8GB GPU VRAM
- At least 16GB system memory (preferably at least 32GB)
- At least a 128GB SSD (**💾 Strongly recommended: store model files on an SSD; with "lazy loading" startup, this significantly improves model loading speed and inference performance**)
### Install Dependencies
...
@@ -80,8 +80,9 @@ vim run_gradio.sh
bash run_gradio.sh
# 3. Or start with parameters (recommended)
bash run_gradio.sh --task i2v --lang en --model_size 14b --port 8032
# bash run_gradio.sh --task t2v --lang en --port 8032
Lightx2v implements an advanced parameter offloading mechanism designed for large-model inference under limited hardware resources. The system strikes an excellent speed-memory balance by intelligently managing model weights across different tiers of the memory hierarchy.
**Core Features:**
- **Block/Phase Offloading**: Efficiently manages model weights in block or phase units for optimal memory usage (see the sketch after this list)
  - **Block**: The basic computational unit of a Transformer model, comprising a complete Transformer layer (self-attention, cross-attention, feed-forward network, etc.); blocks serve as the coarser memory-management unit
  - **Phase**: A finer-grained computational stage within a block, corresponding to an individual component (such as self-attention, cross-attention, or the feed-forward network); phases allow more precise memory control
- **Multi-level Storage Support**: GPU → CPU → Disk hierarchy with intelligent caching
- **Asynchronous Operations**: Uses CUDA streams to overlap computation and data transfer
- **Disk/NVMe Serialization**: Supports secondary storage when memory is insufficient
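To make the block/phase granularity concrete, here is a minimal Python sketch of how such a hierarchy could be modeled. The class names and the `to_device` helper are illustrative assumptions, not Lightx2v's actual API:

```python
import torch
import torch.nn as nn

class Phase(nn.Module):
    """One computational stage inside a block (e.g. self-attention,
    cross-attention, or the feed-forward network); the finest-grained
    unit the offloader moves between devices."""
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module

    def to_device(self, device, non_blocking: bool = True):
        # Move only this phase's weights; with pinned CPU memory and
        # non_blocking=True the copy can overlap GPU computation.
        self.module.to(device, non_blocking=non_blocking)

class Block(nn.Module):
    """A complete Transformer layer, managed as an ordered list of phases."""
    def __init__(self, phases: list[Phase]):
        super().__init__()
        self.phases = nn.ModuleList(phases)

    def to_device(self, device):
        for phase in self.phases:
            phase.to_device(device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for phase in self.phases:
            x = phase.module(x)
        return x
```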
## 🎯 Offloading Strategies
### Strategy 1: GPU-CPU Block/Phase Offloading
**Applicable Scenarios**: GPU VRAM insufficient but system memory adequate
**Working Principle**: Manages model weights in block or phase units between GPU and CPU memory, utilizing CUDA streams to overlap computation and data transfer. Blocks contain complete Transformer layers, while phases are individual computational components within blocks.
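The computation/transfer overlap can be sketched as follows, reusing the hypothetical `to_device` helper above. This is a simplified illustration of the general CUDA-stream prefetching pattern, not Lightx2v's actual implementation; production code would also pin CPU memory and recycle GPU buffers:

```python
import torch

def run_with_block_offload(blocks, hidden_states):
    """Compute block i on the default stream while prefetching
    block i+1's weights on a separate copy stream."""
    copy_stream = torch.cuda.Stream()
    device = hidden_states.device

    # Prefetch the first block's weights before the loop starts.
    with torch.cuda.stream(copy_stream):
        blocks[0].to_device(device)

    for i, block in enumerate(blocks):
        # Block until this block's weights have finished copying.
        torch.cuda.current_stream().wait_stream(copy_stream)

        # Start copying the next block while the current one computes.
        if i + 1 < len(blocks):
            with torch.cuda.stream(copy_stream):
                blocks[i + 1].to_device(device)

        hidden_states = block(hidden_states)

        # Evict the finished block back to CPU to free VRAM
        # (simplified: a real implementation reuses GPU buffers).
        block.to_device(torch.device("cpu"))

    return hidden_states
```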
### Strategy 2: Disk-CPU-GPU Three-level Offloading
**Applicable Scenarios**: Both GPU VRAM and system memory insufficient
**Working Principle**: Builds on Strategy 1 by introducing disk storage, forming a three-level hierarchy (Disk → CPU → GPU). The CPU still serves as a cache pool, but its size is configurable, which suits devices with limited CPU memory.
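The three-level hierarchy can be sketched as a bounded CPU cache sitting in front of disk. Everything below (the file layout, `DiskBackedCache`, `max_cached`) is an illustrative assumption; only the Disk → CPU → GPU flow comes from the description above:

```python
import collections
import torch

class DiskBackedCache:
    """LRU cache of CPU-resident weights backed by per-block files on
    disk; when the cache exceeds its budget, the least recently used
    block is dropped (it can always be re-read from disk)."""
    def __init__(self, weight_dir: str, max_cached: int = 4):
        self.weight_dir = weight_dir
        self.max_cached = max_cached
        self.cache = collections.OrderedDict()  # block_id -> state_dict

    def get(self, block_id: int) -> dict:
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark as recently used
            return self.cache[block_id]
        # Cache miss: read this block's weights from disk into CPU memory.
        state = torch.load(f"{self.weight_dir}/block_{block_id}.pt",
                           map_location="cpu")
        self.cache[block_id] = state
        if len(self.cache) > self.max_cached:
            self.cache.popitem(last=False)  # evict the LRU block
        return state
```

On a cache hit the GPU upload can start immediately; on a miss the disk read adds latency, which is why overlapping disk reads with computation (e.g. via background worker threads) matters for this strategy.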
Solution: Increase max_memory or decrease num_disk_workers
```
3. **Loading Timeout**
```
Solution: Check disk performance, optimize file system
```
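For orientation, here is a hypothetical configuration dictionary showing how such knobs might fit together. Only `max_memory` and `num_disk_workers` are mentioned in this document; every other key is an assumption for illustration:

```python
# Hypothetical offload configuration (illustrative only; key names other
# than max_memory and num_disk_workers are assumptions).
offload_config = {
    "offload_granularity": "phase",    # assumed: "block" or "phase"
    "max_memory": 16,                  # CPU cache budget (units assumed to be GB)
    "num_disk_workers": 2,             # background threads reading weights from disk
    "weight_dir": "/path/to/offload",  # assumed location of serialized block files
}
```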
**Note**: This offloading mechanism is designed specifically for Lightx2v. It takes full advantage of modern hardware's asynchronous compute capabilities, significantly lowering the hardware barrier for large-model inference.