💡 Refer to the [Model Structure Documentation](https://lightx2v-en.readthedocs.io/en/latest/deploy_guides/model_structure.html) to quickly get started with LightX2V
LightX2V implements an advanced parameter offload mechanism specifically designed for efficient large-model inference under limited hardware resources. The system provides an excellent speed-memory balance by intelligently managing model weights across different memory hierarchies, dynamically scheduling them between GPU, CPU, and disk storage.
**Core Features:**

- **Block/Phase Granularity Management**: Supports both Block and Phase offload granularities for flexible memory control
  - **Block**: The basic computational unit of Transformer models; a complete Transformer layer (self-attention, cross-attention, feed-forward network, etc.) serving as the larger management unit, suitable for memory-sufficient environments
  - **Phase**: Finer-grained computational stages within a block; individual components (such as self-attention, cross-attention, or the feed-forward network) serving as the management unit, providing more precise memory control for memory-constrained deployments
- **Multi-tier Storage Architecture**: GPU → CPU → Disk three-tier storage hierarchy with intelligent caching strategies
- **Asynchronous Parallel Processing**: CUDA stream-based asynchronous computation and data transfer to maximize hardware utilization
- **Persistent Storage Support**: SSD/NVMe disk storage for ultra-large model inference when memory is insufficient
## 🎯 Offload Strategies

### Strategy 1: GPU-CPU Block/Phase Offload
**Use Case**: Insufficient GPU VRAM but sufficient system memory.

**How It Works**: Manages model weights in Block or Phase units between GPU and CPU memory, using CUDA streams to overlap computation with data transfer. Blocks contain complete Transformer layers, while Phases are the individual computational components within a block.
**Granularity Selection Guide**:

- **Block Granularity**: Larger management unit (a complete Transformer layer); suitable for memory-sufficient environments, reducing management overhead and improving overall performance
- **Phase Granularity**: Finer-grained management unit (individual components within a layer); suitable for memory-constrained environments, providing more flexible memory control and better resource utilization
**Key Features** (a simplified sketch of the stream scheduling follows this list):

- **Asynchronous Transfer**: Uses three CUDA streams with different priorities to overlap computation and transfer
  - Compute stream (priority=-1): High priority, handles the current computation
  - GPU load stream (priority=0): Medium priority, handles CPU → GPU prefetching
  - CPU load stream (priority=0): Medium priority, handles GPU → CPU offloading
- **Prefetch Mechanism**: Preloads the next block/phase onto the GPU ahead of time
- **Intelligent Caching**: Maintains a weight cache in CPU memory
- **Stream Synchronization**: Ensures correctness between data transfer and computation
- **Swap Operation**: Rotates block/phase buffers after computation for continuous execution
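The following minimal PyTorch sketch illustrates the three-stream schedule (prefetch, compute, offload). It is a simplified illustration with assumed names and buffer handling, not LightX2V's actual implementation; pinned CPU memory is assumed so the `non_blocking` copies can truly overlap with compute.

```python
import torch

# Simplified illustration of the three-stream schedule described above.
# Not LightX2V's actual code; assumes block weights fit two-at-a-time on the GPU.
compute_stream = torch.cuda.Stream(priority=-1)   # high priority: runs the current block
gpu_load_stream = torch.cuda.Stream(priority=0)   # prefetches the next block (CPU -> GPU)
cpu_load_stream = torch.cuda.Stream(priority=0)   # offloads the finished block (GPU -> CPU)

def run_blocks(blocks, x):
    cur = blocks[0].to("cuda")                     # load the first block synchronously
    for i in range(len(blocks)):
        nxt = None
        if i + 1 < len(blocks):
            with torch.cuda.stream(gpu_load_stream):
                # Prefetch the next block while the current one computes.
                nxt = blocks[i + 1].to("cuda", non_blocking=True)

        with torch.cuda.stream(compute_stream):
            x = cur(x)                             # compute on the resident block

        with torch.cuda.stream(cpu_load_stream):
            cpu_load_stream.wait_stream(compute_stream)
            blocks[i] = cur.to("cpu", non_blocking=True)   # offload once compute is done

        compute_stream.wait_stream(gpu_load_stream)        # swap: next block becomes current
        cur = nxt
    torch.cuda.synchronize()
    return x
```

Because the compute stream has the highest priority, transfers never delay the block currently executing, while the two load streams keep the next block's weights in flight.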
### Strategy 2: Disk-CPU-GPU Three-Level Offload

**Use Case**: Both GPU VRAM and system memory are insufficient.

**How It Works**: Builds on Strategy 1 by introducing a disk storage layer, forming a Disk → CPU → GPU three-tier hierarchy. The CPU serves as a cache pool with configurable size, making this strategy suitable for devices with limited CPU memory.
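The sketch below gives a rough idea of a CPU cache pool sitting between disk and GPU. It is hypothetical: the file layout (`block_{id}.pt`), class name, and eviction policy are illustrative assumptions, not LightX2V's actual implementation.

```python
from collections import OrderedDict
import torch

# Hypothetical sketch of the Disk -> CPU -> GPU path: the CPU holds a bounded
# cache of block weights; anything not cached is read from disk on demand.
class CpuCachePool:
    def __init__(self, max_blocks, weight_dir):
        self.max_blocks = max_blocks          # configurable CPU cache size
        self.weight_dir = weight_dir
        self.cache = OrderedDict()            # block_id -> dict of CPU tensors

    def fetch_to_gpu(self, block_id):
        if block_id not in self.cache:        # miss: load this block's weights from disk
            weights = torch.load(f"{self.weight_dir}/block_{block_id}.pt",
                                 map_location="cpu")
            if len(self.cache) >= self.max_blocks:
                self.cache.popitem(last=False)  # evict the least recently used block
            self.cache[block_id] = weights
        self.cache.move_to_end(block_id)
        # hit or freshly loaded: copy the block's tensors to the GPU
        return {k: v.to("cuda", non_blocking=True) for k, v in self.cache[block_id].items()}
```

In the actual system, the cache budget and the number of disk workers are configurable (see the tuning parameters in the troubleshooting section below), and disk reads are handled by background workers rather than inline.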
For memory-constrained devices, a progressive offload strategy is recommended (an example configuration follows this list):

1. **Step 1**: Enable only `cpu_offload`; disable `t5_cpu_offload`, `clip_cpu_offload`, and `vae_cpu_offload`
2. **Step 2**: If memory is still insufficient, gradually enable CPU offload for T5, CLIP, and VAE
3. **Step 3**: If memory is still not enough, consider quantization + CPU offload, or enable `lazy_load`
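As a rough illustration of Step 1, the offload-related fields in a config might look like the snippet below. The flag names are the ones mentioned above, but the full schema should be taken from the configs in the official repository; anything beyond those names is an assumption.

```python
# Illustrative Step-1 settings only; mirror configs/offload for the authoritative schema.
offload_config = {
    "cpu_offload": True,             # Step 1: offload the DiT blocks/phases to CPU
    "offload_granularity": "block",  # switch to "phase" on tighter memory budgets
    "t5_cpu_offload": False,         # Step 2: enable these only if memory is still short
    "clip_cpu_offload": False,
    "vae_cpu_offload": False,
    "lazy_load": False,              # Step 3: last resort, together with quantization
}
```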
**Practical Experience**:

- **RTX 4090 24GB + 14B Model**: Usually enabling `cpu_offload` alone is enough; manually set the other component offload options to `false` and use the FP8 quantized version
- **Smaller-Memory GPUs**: Combine quantization, CPU offload, and lazy loading
- **Quantization Schemes**: Refer to the [Quantization Documentation](../method_tutorials/quantization.md) to select an appropriate quantization strategy

Detailed configuration files can be found in the [Official Configuration Repository](https://github.com/ModelTC/lightx2v/tree/main/configs/offload).
## 🎯 Deployment Strategy Recommendations

- 🔄 GPU-CPU Block/Phase Offload: For insufficient GPU VRAM (e.g., RTX 3090/4090 24GB) with sufficient system memory (64GB or more)
  - Advantages: Balances performance and memory usage, suitable for medium-scale model inference
- 💾 Disk-CPU-GPU Three-Level Offload: For limited GPU VRAM (e.g., RTX 3060 8GB) combined with limited system memory (16-32GB)
  - Advantages: Supports ultra-large model inference with the lowest hardware threshold
- 🚫 No Offload: For high-end hardware configurations pursuing the best inference performance
  - Advantages: Maximizes computational efficiency, suitable for latency-sensitive applications

**Configuration File Reference**:

- **Wan2.1 Series Models**: Refer to the [offload config files](https://github.com/ModelTC/lightx2v/tree/main/configs/offload)
- **Wan2.2 Series Models**: Refer to the [wan22 config files](https://github.com/ModelTC/lightx2v/tree/main/configs/wan22) with the `4090` suffix
## 🔍 Troubleshooting

### Common Issues and Solutions

1. **Disk I/O Performance Bottleneck**
   - Symptoms: Slow model loading, high inference latency
   - Solutions:
     - Upgrade to NVMe SSD storage
     - Increase the `num_disk_workers` value
     - Optimize file system configuration
2. **Memory Buffer Overflow**
   - Symptoms: Insufficient system memory, abnormal program exit
   - Solutions:
     - Increase the `max_memory` value
     - Decrease the `num_disk_workers` value
     - Set `offload_granularity` to `"phase"`
3. **Model Loading Timeout**
   - Symptoms: Timeout errors during model loading
   - Solutions:
     - Check disk read/write performance
     - Optimize file system parameters
     - Verify storage device health
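For reference, the tuning knobs mentioned in the list above might appear together in a disk-offload config roughly as follows; the values are illustrative, and the exact field names and units should be checked against the configs under `configs/offload`.

```python
# Illustrative values only; tune against the official configs/offload examples.
disk_offload_config = {
    "cpu_offload": True,
    "lazy_load": True,               # stream weights from disk on demand
    "offload_granularity": "phase",  # finer granularity eases memory pressure
    "num_disk_workers": 2,           # lower this if CPU memory overflows
    "max_memory": 16,                # CPU cache budget (assumed to be in GB here)
}
```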
## 📚 Technical Summary
LightX2V's offload mechanism is designed for modern AI inference scenarios, fully leveraging the GPU's asynchronous computing capabilities and a multi-level storage architecture. Through intelligent weight management and efficient parallel processing, it significantly lowers the hardware threshold for large model inference and provides developers with flexible, efficient deployment options.
## 📊 Quantization Scheme Overview

LightX2V supports quantized inference for the DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering model precision.

### DIT Model Quantization

LightX2V supports quantized inference for the linear layers in DIT through multiple matrix-multiplication schemes, configured via the `mm_type` parameter: `w8a8-int8`, `w8a8-fp8`, `w8a8-fp8block`, `w8a8-mxfp8`, and `w4a4-nvfp4`. The T5 and CLIP encoders can also be quantized to further improve inference performance.

Quantized models can be downloaded from the [LightX2V Official Model Repository](https://huggingface.co/lightx2v); refer to the [Model Structure Documentation](../deploy_guides/model_structure.md) for details. Alternatively, use LightX2V's convert tool to produce quantized models (see the [Model Conversion Documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme.md)), or the quantization tool [LLMC](https://github.com/ModelTC/llmc/blob/main/docs/en/source/backend/lightx2v.md).

---
## 📥 Loading Quantized Models for Inference

### DIT Model Configuration

Write the path of the converted quantized weights into the `dit_quantized_ckpt` field of the [configuration file](https://github.com/ModelTC/lightx2v/blob/main/configs/quantization). By pointing `--config_json` at that config file, the quantized model is loaded for inference. [Here](https://github.com/ModelTC/lightx2v/tree/main/scripts/quantization) are some running scripts for reference.

> 💡 **Tip**: When a T5 quantized model exists in the script's specified `model_path` (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`), `t5_quantized_ckpt` does not need to be specified separately.

> 💡 **Tip**: When a CLIP quantized model exists in the script's specified `model_path` (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`), `clip_quantized_ckpt` does not need to be specified separately.
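Putting the pieces above together, the quantization-related fields of a config file might look roughly like the snippet below. The field names `dit_quantized_ckpt`, `t5_quantized_ckpt`, and `clip_quantized_ckpt` come from this document, but the exact `mm_type` strings and any other fields should be copied from the official configs; the values here are placeholders.

```python
import json

# Placeholder values; copy the real field values from configs/quantization.
quant_config = {
    "mm_type": "w8a8-fp8",                        # placeholder for one of the schemes above
    "dit_quantized_ckpt": "/path/to/dit_quant",   # converted DIT weights
    # "t5_quantized_ckpt" / "clip_quantized_ckpt" can be omitted when the quantized
    # T5/CLIP files already live in model_path (see the tips above).
}
with open("wan_quant_config.json", "w") as f:
    json.dump(quant_config, f, indent=2)          # pass this file's path via --config_json
```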
### Performance Optimization Strategy

If memory is insufficient, you can combine quantization with parameter offloading to further reduce memory usage; refer to the [Parameter Offload Documentation](../method_tutorials/offload.md):

> - **Wan2.1 Configuration**: Refer to the [offload config files](https://github.com/ModelTC/LightX2V/tree/main/configs/offload)
> - **Wan2.2 Configuration**: Refer to the [wan22 config files](https://github.com/ModelTC/LightX2V/tree/main/configs/wan22) with the `4090` suffix

---

### Custom Quantization Kernels

LightX2V supports custom quantization kernels that can be extended in the following ways:

1. **Register a New mm_type**: Add new quantization classes in `mm_weight.py`
2. **Implement Quantization Functions**: Define quantization methods for weights and activations
3. **Integrate Compute Kernels**: Use custom matrix multiplication implementations
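As a rough, hypothetical sketch of these three steps (the actual base classes and registration mechanism in `mm_weight.py` may differ; every name below is an assumption), a custom int8 scheme could pair per-channel weight quantization with a custom matmul:

```python
import torch

# Hypothetical sketch only: the real interface in mm_weight.py may differ.
MM_WEIGHT_REGISTER = {}

def register_mm_weight(name):
    def wrap(cls):
        MM_WEIGHT_REGISTER[name] = cls   # step 1: register a new mm_type
        return cls
    return wrap

@register_mm_weight("w8a8-int8-custom")
class MMWeightInt8Custom:
    def load(self, weight: torch.Tensor):
        # Step 2: quantize weights per output channel to int8.
        self.scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
        self.qweight = torch.round(weight / self.scale).to(torch.int8)

    def apply(self, x: torch.Tensor) -> torch.Tensor:
        # Step 2: quantize activations dynamically per token.
        x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
        qx = torch.round(x / x_scale).to(torch.int8)
        # Step 3: dispatch to a custom matmul kernel; a dequantized matmul is
        # used here as a readable fallback.
        return (qx.float() * x_scale) @ (self.qweight.float() * self.scale).T
```

In a real extension, step 3 would normally call a fused int8/fp8 GEMM kernel instead of the dequantized fallback shown here.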