The code in this directory is mainly adapted from @qwopqwop200's [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda), which is itself based on [gptq](https://github.com/IST-DASLab/gptq).
logger.info(f"Ignoring unknown parameter in the quantization configuration: {key}.")
if checkpoint_format_auto_inferred:
    logger.info(
        f"`checkpoint_format` is missing from the quantization configuration and is "
        f"automatically inferred as {normalized[CHECKPOINT_FORMAT_FIELD]}."
    )
# TODO: Remove this and use accelerate.utils.modeling.load_checkpoint_in_model instead, once https://github.com/huggingface/accelerate/pull/2588 is merged and accelerate 0.29 is released.
f"Some weights of the model checkpoint at {checkpoint} were not used when"
f" initializing {model.__class__.__name__}: {unexpected_keys}. This may or may not be an issue - make sure that the checkpoint does not have unnecessary parameters, or that the model definition correctly corresponds to the checkpoint."
"The method exllama_set_max_input_length should be called only when using the exllama backend **with act-order**."
)
uses_exllama = False
for name, submodule in model.named_modules():
    if isinstance(submodule, ExllamaQuantLinear):
        uses_exllama = True

if not uses_exllama:
    raise ValueError(
        f"The function exllama_set_max_input_length was called, but the model (instance of {model.__class__.__name__}) does not use the exllama backend for GPTQ. Another implementation is used (exllamav2, cuda, cuda-old, triton), so the call to exllama_set_max_input_length is unnecessary. Please remove the call to exllama_set_max_input_length or use the exllama v1 backend."
    )
# Otherwise, convert the model to Marlin format first and cache locally.
else:
# Loading the GPTQ checkpoint to do the conversion.
# TODO: Avoid loading the model with wrong QuantLinear, and directly use
# Marlin ones. The repacking can be done directly on the safetensors, just
# as for AWQ checkpoints.
load_checkpoint_in_model(
model,
dtype=torch_dtype,  # This is very hacky but works due to https://github.com/huggingface/accelerate/blob/bd72a5f1a80d5146554458823f8aeda0a9db5297/src/accelerate/utils/modeling.py#L292
checkpoint=current_model_save_name,
device_map=device_map,
offload_state_dict=True,
offload_buffers=True,
)
# Convert model to marlin, repacking weights into Marlin format.
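The repacking itself is not shown in this excerpt. Below is a minimal, generic sketch of the surrounding module-swap pattern; the `convert` callback standing in for the actual Marlin repacking routine is an assumption:

```python
import torch.nn as nn

def swap_modules(model: nn.Module, source_cls: type, convert) -> nn.Module:
    """Illustrative sketch: replace every `source_cls` submodule with `convert(submodule)`."""
    # Collect targets first so the module tree is not mutated while we walk it.
    targets = [
        (name, module)
        for name, module in model.named_modules()
        if isinstance(module, source_cls)
    ]
    for name, module in targets:
        # Resolve the parent module so the child can be replaced in place.
        parent = model.get_submodule(name.rsplit(".", 1)[0]) if "." in name else model
        setattr(parent, name.rsplit(".", 1)[-1], convert(module))
    return model
```

In the real conversion, `convert` would repack each layer's `qweight` and `scales` tensors into the Marlin tile layout, e.g. `swap_modules(model, QuantLinear, marlin_repack)` with the project's actual repacking routine.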