    Add initial support for compressed-tensors checkpoints (#2732) · a7850008
    Daniël de Kok authored
    compressed-tensors is a safetensors extension for sparse, quantized
    tensors. The format is more powerful than the earlier AWQ/GPTQ/FP8
    checkpoint formats, because:
    
    - Different quantizer configurations can be used for different targets.
    - The format can specify input/output quantizers in addition to weight
      quantizers.
    - Layers can be excluded from quantization through a configurable ignore
      list (see the config sketch below).
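
    As a rough illustration, a compressed-tensors checkpoint carries a
    `quantization_config` section in its `config.json` along these lines
    (shown here as a Python dict; the field names follow the
    compressed-tensors format, but the values are made up rather than taken
    from a real checkpoint):

        quantization_config = {
            "quant_method": "compressed-tensors",
            "config_groups": {
                "group_0": {
                    # Quantizer configuration applied only to matching targets.
                    "targets": ["Linear"],
                    # Weight quantizer: 4-bit symmetric INT with group-wise scales.
                    "weights": {
                        "num_bits": 4,
                        "type": "int",
                        "symmetric": True,
                        "strategy": "group",
                        "group_size": 128,
                    },
                    # The input quantizer can be given separately; None means the
                    # activations stay unquantized (i.e. W4A16).
                    "input_activations": None,
                },
            },
            # Modules excluded from quantization.
            "ignore": ["lm_head"],
        }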
    
    This change adds a dependency on the `compressed-tensors` package for
    its configuration parsing and layer matching functionality.
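
    The sketch below is only an illustrative stand-in for that layer
    matching, not the package's API: an entry in `targets` or `ignore`
    matches a module class name such as `Linear`, an exact module path, or a
    regex when prefixed with `re:`.

        import re

        def _matches(name: str, cls: str, patterns: list[str]) -> bool:
            # Illustrative stand-in for the matching that compressed-tensors
            # provides: a pattern matches the module class name ("Linear"),
            # the exact module path, or a regex when prefixed with "re:".
            for pattern in patterns:
                if pattern.startswith("re:"):
                    if re.match(pattern[len("re:"):], name):
                        return True
                elif pattern in (name, cls):
                    return True
            return False

        def uses_group(name: str, cls: str, group: dict, ignore: list[str]) -> bool:
            # A module picks up a group's quantizers when it matches one of
            # the group's targets and is not excluded by the ignore list.
            return _matches(name, cls, group["targets"]) and not _matches(
                name, cls, ignore
            )

    With the config sketched above, a `Linear` module such as
    `model.layers.0.self_attn.q_proj` would be quantized, while `lm_head`
    would be skipped.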
    
    The following types of quantization are supported in this PR:
    
    - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
    - W8A8 and W8A16 FP using FP8-Marlin and CUTLASS kernels (a rough
      kernel-selection sketch follows).
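
    Roughly, the choice of kernel follows from the weight quantizer's type
    and bit width and from whether input activations are quantized. The
    hypothetical helper below only sketches that mapping under the config
    format shown earlier; it does not mirror the actual loader code:

        def select_kernel(weights: dict, input_activations: dict | None) -> str:
            # Hypothetical mapping from quantizer arguments to kernels; the
            # real loader performs more checks (strategy, hardware support, ...).
            if weights["type"] == "int" and weights["num_bits"] in (4, 8):
                if input_activations is None:
                    return "gptq-marlin"   # W4A16 / W8A16 INT
            elif weights["type"] == "float" and weights["num_bits"] == 8:
                if input_activations is None:
                    return "fp8-marlin"    # W8A16 FP (weight-only)
                return "cutlass-fp8"       # W8A8 FP
            raise NotImplementedError("quantization type not supported in this PR")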
    
    Support for other quantization types will be added in subsequent PRs.