1. 23 Jul, 2024 1 commit
    • Add support for repacking AWQ weights for GPTQ-Marlin (#2278) · 9935720c
      Daniël de Kok authored
      * Add support for repacking AWQ weights for GPTQ-Marlin
      
      So far we couldn't support AWQ because virtually all AWQ models use
      asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
      has recently added support for AWQ repacking and AWQ asymmetric
      quantization (zero_point=True).
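
To illustrate the distinction this commit message relies on, here is a minimal sketch of symmetric versus asymmetric 4-bit quantization. It is not the AWQ or GPTQ-Marlin implementation: the function names are hypothetical, it quantizes a single flat list rather than weight groups, and it only shows why a zero point helps when the value range is not centered on zero.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric scheme: no zero point, the integer grid is centered on 0."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize_asymmetric(values, bits=4):
    """Asymmetric scheme: a zero point shifts the grid to cover [min, max]."""
    qmax = 2 ** bits - 1  # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map integer codes back to approximate real values."""
    return [(x - zero_point) * scale for x in q]


def max_error(original, restored):
    """Largest absolute reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))


# A skewed range: mostly positive values with a small negative tail.
weights = [-0.2, 0.0, 0.4, 1.3]

q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, zp = quantize_asymmetric(weights)

err_sym = max_error(weights, dequantize(q_sym, s_sym))
err_asym = max_error(weights, dequantize(q_asym, s_asym, zp))
```

For this skewed input the asymmetric codes use the full 0..15 range, while the symmetric scheme leaves most of its negative half unused.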
      
      This change updates all GPTQ-Marlin kernels from upstream and wires up
      AWQ support. For now, enabling AWQ with Marlin requires running TGI with
      `--quantize gptq`.
      
      * Enable Marlin for supported AWQ configurations by default
      
      This makes the AWQ -> GPTQ repack test redundant, since we are now
      testing this with the regular AWQ test.
  2. 11 Jul, 2024 1 commit
  3. 25 Jun, 2024 1 commit
    • Add support for Marlin 2:4 sparsity (#2102) · f1f98e36
      Daniël de Kok authored
      This change adds support for 2:4 sparsity when using Marlin
      quantization. The 2:4 kernel is used when:
      
      * The quantizer is `marlin`;
      * the quantizer checkpoint format is `marlin_24`.
      
      Fixes #2098.
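
The selection rule listed in this commit can be sketched as a small predicate. The function and argument names below are hypothetical and do not come from the TGI code base. The second helper shows what the 2:4 pattern means in general: in every aligned group of four values, at most two are nonzero.

```python
def uses_marlin_24_kernel(quantizer: str, checkpoint_format: str) -> bool:
    """Both conditions from the commit message must hold."""
    return quantizer == "marlin" and checkpoint_format == "marlin_24"


def is_2_4_sparse(row) -> bool:
    """True if each aligned group of 4 values has at most 2 nonzeros."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )
```

For example, `[0, 1, 0, 2, 3, 0, 0, 4]` satisfies the pattern, while `[1, 2, 3, 0]` does not because its only group holds three nonzero values.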
  4. 14 Jun, 2024 1 commit
    • Add support for GPTQ Marlin (#2052) · 093a27c5
      Daniël de Kok authored
      Add support for GPTQ Marlin kernels
      
      GPTQ Marlin extends the Marlin kernels to support common GPTQ
      configurations:
      
      - bits: 4 or 8
      - groupsize: -1, 32, 64, or 128
      - desc_act: true/false
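
As a sketch, the supported configurations listed above could be checked like this. The constant and function names are hypothetical, and the real kernels may impose further constraints (for example on tensor shapes or GPU capability) that this commit message does not mention.

```python
# Values taken from the list in the commit message.
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


def can_use_gptq_marlin(bits: int, groupsize: int, desc_act: bool) -> bool:
    """Check a GPTQ configuration against the listed combinations.

    desc_act may be either True or False, so it only needs to be a bool.
    """
    return (
        bits in SUPPORTED_BITS
        and groupsize in SUPPORTED_GROUPSIZES
        and isinstance(desc_act, bool)
    )
```

Under these assumptions, `can_use_gptq_marlin(4, 128, False)` is accepted, while a 3-bit checkpoint or a group size of 16 is rejected.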
      
      Using the GPTQ Marlin kernels requires repacking the parameters into the
      Marlin quantizer format.
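
To give a sense of what packing quantized parameters involves, here is a generic sketch that stores eight 4-bit values in one 32-bit word and reads them back. It does not reproduce the Marlin layout, which is defined by the vendored kernels. The helper names and the lowest-nibble-first order are assumptions made for illustration only.

```python
def pack_int4(values):
    """Pack 4-bit values into 32-bit words, lowest nibble first."""
    if len(values) % 8 != 0:
        raise ValueError("need a multiple of 8 values")
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            if not 0 <= v <= 15:
                raise ValueError("value does not fit in 4 bits")
            word |= v << (4 * j)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover eight 4-bit values per word."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]
```

Repacking between formats then amounts to unpacking with the source order and packing again with the target order.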
      
      The kernels were contributed by Neural Magic to vLLM. We vendor them
      here for convenience.