Note that only 4-bit models are supported for now. Furthermore, it is recommended to deactivate the exllama kernels if you are fine-tuning a quantized model with PEFT.
You can find the benchmark of these kernels [here](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)
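
For instance, a minimal sketch of loading an already-quantized GPTQ checkpoint with the exllama kernels deactivated via `disable_exllama` (the checkpoint id below is only a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder: any GPTQ-quantized checkpoint

# disable_exllama=True turns off the exllama kernels, which is the recommended
# setting when you plan to fine-tune the quantized model with PEFT.
quantization_config = GPTQConfig(bits=4, disable_exllama=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
```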
#### Fine-tune a quantized model
With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
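
As a rough sketch (the LoRA hyperparameters and target modules below are placeholders, not recommendations), attaching adapters with PEFT to the quantized model loaded above could look like this:

```python
from peft import LoraConfig, get_peft_model

# Placeholder LoRA settings; adapt target_modules to the architecture you load.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# `model` is the GPTQ-quantized model loaded with the exllama kernels deactivated (see above).
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```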
@@ -2759,7 +2759,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
logger.warning(
"You passed `quantization_config` to `from_pretrained` but the model you're loading already has a "
"`quantization_config` attribute and has already quantized weights. However, loading attributes"
" (e.g. disable_exllama, use_cuda_fp16, max_input_length, use_exllama_v2) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored."
" (e.g. disable_exllama, use_cuda_fp16, max_input_length) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored."
f"You need optimum > 1.13.2 and auto-gptq > 0.4.2 . Make sure to have that version installed - detected version : optimum {optimum_version} and autogptq {autogptq_version}"
)
self.disable_exllama = True
logger.warning("You have activated exllamav2 kernels. Exllama kernels will be disabled.")
if not self.disable_exllama:
logger.warning(
"""You have activated exllama backend. Note that you can get better inference
speed using exllamav2 kernel by setting `use_exllama_v2=True`. `disable_exllama` will be deprecated