# Model Quantization Techniques

## 📖 Overview

LightX2V supports quantized inference for the DIT, T5, and CLIP models: by lowering the numerical precision of weights and activations, it reduces memory usage and improves inference speed.

---

## 🔧 Quantization Modes

| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
|--------------|----------|----------|----------|----------|
| `fp8-vllm` | FP8 channel symmetric | FP8 channel dynamic symmetric | [VLLM](https://github.com/vllm-project/vllm) | H100/H200/H800, RTX 40 series, etc. |
| `int8-vllm` | INT8 channel symmetric | INT8 channel dynamic symmetric | [VLLM](https://github.com/vllm-project/vllm) | A100/A800, RTX 30/40 series, etc.  |
| `fp8-sgl` | FP8 channel symmetric | FP8 channel dynamic symmetric | [SGL](https://github.com/sgl-project/sglang/tree/main/sgl-kernel) | H100/H200/H800, RTX 40 series, etc. |
| `int8-sgl` | INT8 channel symmetric | INT8 channel dynamic symmetric | [SGL](https://github.com/sgl-project/sglang/tree/main/sgl-kernel) | A100/A800, RTX 30/40 series, etc.  |
| `fp8-q8f` | FP8 channel symmetric | FP8 channel dynamic symmetric | [Q8-Kernels](https://github.com/KONAKONA666/q8_kernels) | RTX 40 series, L40S, etc. |
| `int8-q8f` | INT8 channel symmetric | INT8 channel dynamic symmetric | [Q8-Kernels](https://github.com/KONAKONA666/q8_kernels) | RTX 40 series, L40S, etc. |
| `int8-torchao` | INT8 channel symmetric | INT8 channel dynamic symmetric | [TorchAO](https://github.com/pytorch/ao) | A100/A800, RTX 30/40 series, etc. |
| `int4-g128-marlin` | INT4 group symmetric | FP16 | [Marlin](https://github.com/IST-DASLab/marlin) | H200/H800/A100/A800, RTX 30/40 series, etc. |
| `fp8-b128-deepgemm` | FP8 block symmetric | FP8 group symmetric | [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) | H100/H200/H800, RTX 40 series, etc. |
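
To make the table's terminology concrete, below is a minimal PyTorch sketch of "channel symmetric" weight quantization and "channel dynamic symmetric" activation quantization (one scale per weight output channel, activation scales computed on the fly per row). This is illustrative reference code only, not the fused low-precision kernels the backends above actually run:

```python
import torch

def quantize_weight_per_channel_int8(w: torch.Tensor):
    """Per-output-channel symmetric INT8 quantization of w [out, in].

    One scale per output channel; the zero-point is fixed at 0
    (symmetric), so dequantization is simply q * scale.
    """
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def quantize_activation_per_token_int8(x: torch.Tensor):
    """Dynamic symmetric INT8 quantization of activations.

    Scales are computed on the fly from each incoming batch ("dynamic"),
    so no calibration data is required.
    """
    scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

# Dequantized reference matmul; real kernels fuse this and stay in INT8/FP8.
w = torch.randn(256, 512)
x = torch.randn(4, 512)
qw, sw = quantize_weight_per_channel_int8(w)
qx, sx = quantize_activation_per_token_int8(x)
y = (qx.float() * sx) @ (qw.float() * sw).t()  # approximates x @ w.t()
```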

---

## 🔧 Obtaining Quantized Models

### Method 1: Download Pre-Quantized Models

Download pre-quantized models from LightX2V model repositories:

**DIT Models**

Download pre-quantized DIT models from [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models):

```bash
# Download DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
    --local-dir ./models \
    --include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```
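
Equivalently, the same file can be fetched from Python with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Fetch the FP8 DIT checkpoint into ./models (same file as the CLI command above)
hf_hub_download(
    repo_id="lightx2v/Wan2.1-Distill-Models",
    filename="wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors",
    local_dir="./models",
)
```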

**Encoder Models**

Download pre-quantized T5 and CLIP models from [Encoders-LightX2V](https://huggingface.co/lightx2v/Encoders-Lightx2v):

```bash
# Download T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_t5_umt5-xxl-enc-fp8.pth"

# Download CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```

### Method 2: Self-Quantize Models

For detailed quantization tool usage, refer to: [Model Conversion Documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md)

---

## 🚀 Using Quantized Models

### DIT Model Quantization

#### Supported Quantization Modes

DIT quantization modes (`dit_quant_scheme`) support: `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, `fp8-b128-deepgemm`

#### Configuration Example

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model"  // Optional
}
```

> 💡 **Tip**: `dit_quantized_ckpt` is optional. When there is only one DIT model under the script's `model_path`, it does not need to be specified separately.

### T5 Model Quantization

#### Supported Quantization Modes

T5 quantization modes (`t5_quant_scheme`) support: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`

#### Configuration Example

```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model"  // Optional
}
```

> 💡 **Tip**: `t5_quantized_ckpt` is optional. When a quantized T5 model already exists under the script's `model_path` (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`), it does not need to be specified separately.

### CLIP Model Quantization

#### Supported Quantization Modes

CLIP quantization modes (`clip_quant_scheme`) support: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`

#### Configuration Example

```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model"  // Optional
}
```

> 💡 **Tip**: `clip_quantized_ckpt` is optional. When a quantized CLIP model already exists under the script's `model_path` (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`), it does not need to be specified separately.
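
Since the three components accept different scheme sets, a typo in a config file may only surface at load time. Below is a minimal sanity-check sketch; the config path is hypothetical, and the scheme sets simply mirror the lists above:

```python
import json

# Supported schemes per component, as listed in this document.
SUPPORTED = {
    "dit":  {"fp8-vllm", "int8-vllm", "fp8-sgl", "int8-sgl", "fp8-q8f",
             "int8-q8f", "int8-torchao", "int4-g128-marlin", "fp8-b128-deepgemm"},
    "t5":   {"int8-vllm", "fp8-sgl", "int8-q8f", "fp8-q8f", "int8-torchao"},
    "clip": {"int8-vllm", "fp8-sgl", "int8-q8f", "fp8-q8f", "int8-torchao"},
}

with open("wan_i2v_quant.json") as f:  # hypothetical path to your config
    cfg = json.load(f)

for comp, schemes in SUPPORTED.items():
    if cfg.get(f"{comp}_quantized"):
        scheme = cfg.get(f"{comp}_quant_scheme")
        if scheme not in schemes:
            raise ValueError(f"Unsupported {comp} quant scheme: {scheme!r}")
```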

### Performance Optimization Strategy

If memory is still insufficient, you can additionally enable parameter offloading to reduce memory usage further (a combined-config sketch follows the references below). Refer to the [Parameter Offload Documentation](../method_tutorials/offload.md):

> - **Wan2.1 Configuration**: Refer to [offload config files](https://github.com/ModelTC/LightX2V/tree/main/configs/offload)
> - **Wan2.2 Configuration**: Refer to [wan22 config files](https://github.com/ModelTC/LightX2V/tree/main/configs/wan22) with `4090` suffix
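
As a rough sketch, a combined quantization + offload config might look like the following. The offload keys here (`cpu_offload`, `offload_granularity`) are assumptions drawn from the offload documentation; verify the exact key names against the config files referenced above:

```python
import json

# Hypothetical combined config: FP8 quantization for DIT/T5/CLIP plus offload.
# The two offload keys below are assumptions based on the linked offload docs;
# check the referenced config files for the keys your LightX2V version uses.
cfg = {
    "dit_quantized": True,
    "dit_quant_scheme": "fp8-sgl",
    "t5_quantized": True,
    "t5_quant_scheme": "fp8-sgl",
    "clip_quantized": True,
    "clip_quant_scheme": "fp8-sgl",
    "cpu_offload": True,             # assumed key: enable parameter offload
    "offload_granularity": "block",  # assumed key: offload per transformer block
}

with open("wan_i2v_quant_offload.json", "w") as f:
    json.dump(cfg, f, indent=4)
```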

---

## 📚 Related Resources

### Configuration File Examples
- [INT8 Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v.json)
- [Q8F Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v_q8f.json)
- [TorchAO Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v_torchao.json)

### Run Scripts
- [Quantization Inference Scripts](https://github.com/ModelTC/LightX2V/tree/main/scripts/quantization)

### Tool Documentation
- [Quantization Tool Documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md)
- [LightCompress Quantization Documentation](https://github.com/ModelTC/llmc/blob/main/docs/zh_cn/source/backend/lightx2v.md)

### Model Repositories
- [Wan2.1-LightX2V Quantized Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models)
- [Wan2.2-LightX2V Quantized Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models)
- [Encoders Quantized Models](https://huggingface.co/lightx2v/Encoders-Lightx2v)

---

After reading this document, you should be able to:

- ✅ Understand the quantization schemes supported by LightX2V
- ✅ Select an appropriate quantization strategy for your hardware
- ✅ Configure quantization parameters correctly
- ✅ Obtain and use quantized models
- ✅ Optimize inference performance and memory usage

If you have other questions, feel free to ask in [GitHub Issues](https://github.com/ModelTC/LightX2V/issues).