server/text_generation_server/utils/quantization.py · 34f7dcfd80bc760c5e7c8479bdb34303c93272a9 · OpenDAS / text-generation-inference

Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300) · 34f7dcfd

Daniël de Kok authored Jul 31, 2024

The `GPTQWeightsLoader` was structured like this in pseudocode:

if marlin:
  Set up tensors in a way that GPTQ-Marlin expects
else:
  Set up tensors in a way that ExLlama/GPTQ/AWQ expect

However, the GPTQ-Marlin implementation details should really be in the
`marlin` module. So move the former branch out into a separate
`GPTQMarlinWeightsLoader`.
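
The shape of this refactor can be sketched roughly as follows. This is a minimal illustration, not the actual code from the repository: the `WeightsLoader` interface, the `get_weights` signature, the dictionary return values, and the `get_loader` helper are all assumptions made up for the example. The point it shows is that the marlin/non-marlin decision is made once, when a loader is selected, instead of inside a single loader's method.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class WeightsLoader(ABC):
    """Hypothetical minimal loader interface (not the real TGI API)."""

    @abstractmethod
    def get_weights(self, prefix: str) -> dict:
        ...


@dataclass
class GPTQWeightsLoader(WeightsLoader):
    bits: int
    groupsize: int

    def get_weights(self, prefix: str) -> dict:
        # After the split, only the ExLlama/GPTQ/AWQ layout is handled here.
        # The dict is a stand-in for real tensor setup.
        return {"prefix": prefix, "layout": "gptq", "bits": self.bits}


@dataclass
class GPTQMarlinWeightsLoader(WeightsLoader):
    bits: int
    groupsize: int

    def get_weights(self, prefix: str) -> dict:
        # GPTQ-Marlin specific setup lives in its own class, which could
        # then sit next to the rest of the Marlin code.
        return {"prefix": prefix, "layout": "gptq-marlin", "bits": self.bits}


def get_loader(use_marlin: bool, bits: int, groupsize: int) -> WeightsLoader:
    # Hypothetical selection helper: the branch that used to live inside
    # the loader is evaluated once, at construction time.
    if use_marlin:
        return GPTQMarlinWeightsLoader(bits, groupsize)
    return GPTQWeightsLoader(bits, groupsize)
```

With this layout, callers only depend on the shared interface, for example `get_loader(True, 4, 128).get_weights("model.layers.0")`, and neither loader needs to know about the other's tensor layout.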


quantization.py 7.29 KB
