# Quantization

lightx2v supports quantized inference for linear layers, including w8a8 (8-bit weights and activations) and fp8 matrix multiplication.

### Run Quantized Inference

```shell
# Modify the path in the script
bash scripts/run_wan_t2v_save_quant.sh
```

The script contains two commands:

#### Save Quantization Weights

Set the `RUNNING_FLAG` environment variable to `save_naive_quant`, and point `--config_json` at the corresponding `json` file: `${lightx2v_path}/configs/wan_t2v_save_quant.json`. In this file, `quant_model_path` specifies the path where the quantized model will be saved.
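
To confirm where the quantized weights will be written, you can inspect the config directly (a quick check, assuming `lightx2v_path` is set as in the scripts):

```shell
# Print the output path for the quantized weights from the config
grep quant_model_path ${lightx2v_path}/configs/wan_t2v_save_quant.json
```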

#### Load Quantization Weights and Inference

Set the `RUNNING_FLAG` environment variable to `infer`, and set `--config_json` to the same `json` file as in the previous step.
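
Schematically, the two commands differ only in the value of `RUNNING_FLAG` (a sketch, not the literal script contents; `INFER_CMD` is a stand-in for the inference command already present in the script):

```shell
# Stand-in for the inference command defined in scripts/run_wan_t2v_save_quant.sh
INFER_CMD="..."

# Step 1: quantize the weights and save them to quant_model_path
export RUNNING_FLAG=save_naive_quant
$INFER_CMD --config_json ${lightx2v_path}/configs/wan_t2v_save_quant.json

# Step 2: load the saved quantized weights and run inference
export RUNNING_FLAG=infer
$INFER_CMD --config_json ${lightx2v_path}/configs/wan_t2v_save_quant.json
```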

### Start Quantization Service

After the quantized weights have been saved, start the service the same way as the loading step above: set the `RUNNING_FLAG` environment variable to `infer`, and set `--config_json` to the `json` file from the first step.

For example, modify the `scripts/start_server.sh` script as follows:

```shell
export RUNNING_FLAG=infer

python -m lightx2v.api_server \
    --model_cls wan2.1 \
    --task t2v \
    --model_path $model_path \
    --config_json ${lightx2v_path}/configs/wan_t2v_save_quant.json \
    --port 8000
```
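
Then launch the modified script to start the service (assuming it is run directly, like the other scripts above):

```shell
bash scripts/start_server.sh
```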