autoround.quantize_and_save(quant_path, format=format)  # quantize and save
```
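The saved checkpoint can typically be reloaded for inference with Hugging Face Transformers. The snippet below is a minimal sketch, assuming `quant_path` points to the directory written above and that the backend matching the chosen format (e.g. auto-round, GPTQModel, or AutoAWQ) is installed.
```python
# Minimal sketch: reload the quantized checkpoint saved above.
# Assumes `quant_path` is the output directory from quantize_and_save and that the
# backend matching the chosen format is installed in the environment.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```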
- Command Line Usage (Gaudi/CPU/Intel GPU/CUDA)
```bash
auto-round \
--model meta-llama/Llama-3.2-1B-Instruct \
--bits 4 \
--group_size 128 \
--format"auto_gptq,auto_awq,auto_round"\
--output_dir ./tmp_autoround
```
- Known Issues
Several limitations currently affect loading offline-quantized models in SGLang; these issues may be resolved in future SGLang releases. If you run into problems, consider using Hugging Face Transformers as an alternative.
1. Mixed-bit Quantization Limitations
Mixed-bit quantization is not fully supported. Due to vLLM's layer fusion (e.g., QKV fusion), applying different bit-widths to components within the same fused layer can lead to compatibility issues.
2. Limited Support for Quantized MoE Models
Quantized MoE models may encounter inference issues due to kernel limitations (e.g., lack of support for quantizing `mlp.gate` layers). To avoid such errors, skip the gate layers when quantizing MoE models (see the sketch after this list).
3. Limited Support for Quantized VLMs
<details>
<summary>VLM failure cases</summary>

Qwen2.5-VL-7B

- `auto_round:auto_gptq` format: accuracy is close to zero.
- GPTQ format: fails with

  ```
  The output size is not aligned with the quantized weight shape
  ```

- `auto_round:auto_awq` and AWQ formats: work as expected.
</details>
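As a concrete illustration of point 2, the snippet below keeps MoE gate layers unquantized via a per-layer override. This is a minimal sketch rather than the project's canonical recipe: it assumes AutoRound accepts a `layer_config` dictionary of per-layer settings (the argument name and accepted keys may differ between releases), and the `mlp.gate` name pattern and the model ID are illustrative and depend on the target architecture.
```python
# Minimal sketch: leave MoE gate (router) layers unquantized by assigning them 16 bits.
# Assumptions: AutoRound exposes a `layer_config` dict of per-layer overrides (name and
# keys may vary across auto-round releases), and gate layers in the chosen model end
# with "mlp.gate" (the pattern and the model ID below are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen1.5-MoE-A2.7B"  # illustrative MoE checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Collect every gate/router projection and pin it to 16 bits so it is skipped.
layer_config = {
    name: {"bits": 16}
    for name, _ in model.named_modules()
    if name.endswith("mlp.gate")
}

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, layer_config=layer_config)
autoround.quantize_and_save("./tmp_autoround_moe", format="auto_round")
```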
#### Using [GPTQModel](https://github.com/ModelCloud/GPTQModel)