# Quantization

lightx2v supports quantized inference for linear layers, including w8a8 (8-bit weights and activations) and fp8 matrix multiplication.

### Run Quantized Inference

```shell
# Modify the path in the script
bash scripts/run_wan_t2v_save_quant.sh
```

The script contains two commands:

#### Save Quantization Weights

Set the `RUNNING_FLAG` environment variable to `save_naive_quant`, and point `--config_json` at the corresponding `json` file: `${lightx2v_path}/configs/wan_t2v_save_quant.json`. In this file, `quant_model_path` specifies the path where the quantized model will be saved.
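
To confirm where the quantized weights will be written, you can inspect the config directly (a quick check, assuming `lightx2v_path` is set as in the scripts):

```shell
# Print the output path for the quantized weights from the config
grep quant_model_path ${lightx2v_path}/configs/wan_t2v_save_quant.json
```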

#### Load Quantization Weights and Inference

Set the `RUNNING_FLAG` environment variable to `infer`, and set `--config_json` to the same `json` file as in the previous step.
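
Schematically, the two commands differ only in the value of `RUNNING_FLAG` (a sketch, not the literal script contents; `INFER_CMD` is a stand-in for the inference command already present in the script):

```shell
# Stand-in for the inference command defined in scripts/run_wan_t2v_save_quant.sh
INFER_CMD="..."

# Step 1: quantize the weights and save them to quant_model_path
export RUNNING_FLAG=save_naive_quant
$INFER_CMD --config_json ${lightx2v_path}/configs/wan_t2v_save_quant.json

# Step 2: load the saved quantized weights and run inference
export RUNNING_FLAG=infer
$INFER_CMD --config_json ${lightx2v_path}/configs/wan_t2v_save_quant.json
```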

### Start Quantization Service

After the quantized weights have been saved, start the service the same way as the loading step above: set the `RUNNING_FLAG` environment variable to `infer`, and set `--config_json` to the `json` file from the first step.

For example, modify the `scripts/start_server.sh` script as follows:

```shell
export RUNNING_FLAG=infer

python -m lightx2v.api_server \
    --model_cls wan2.1 \
    --task t2v \
    --model_path $model_path \
    --config_json ${lightx2v_path}/configs/wan_t2v_save_quant.json \
    --port 8000
```
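
Then launch the modified script to start the service (assuming it is run directly, like the other scripts above):

```shell
bash scripts/start_server.sh
```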