    Add support for exl2 quantization · 36dd1601
    Daniël de Kok authored
    Mostly straightforward changes to existing code:
    
    * Wrap quantizer parameters in a small wrapper to avoid passing
      around untyped tuples and needing to repack them as a dict.
    * Move scratch space computation to warmup, because we need the
      maximum input sequence length to avoid allocating huge
      scratch buffers that OOM.
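
The two changes above can be illustrated with a minimal, library-free sketch. It is not the code from this commit: `QuantParams`, `ScratchSpace`, their fields, and the size formula are hypothetical names and assumptions chosen only to show the pattern (a typed wrapper instead of a tuple, and a buffer sized once the maximum input length is known).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuantParams:
    """Hypothetical typed wrapper replacing an untyped parameter tuple."""

    bits: int
    group_size: int
    desc_act: bool = False


class ScratchSpace:
    """Hypothetical scratch buffer that is sized during warmup, not at load time."""

    def __init__(self, hidden_size: int, bytes_per_element: int = 2) -> None:
        self.hidden_size = hidden_size
        self.bytes_per_element = bytes_per_element
        # Nothing is allocated until the maximum input length is known.
        self.buffer: Optional[bytearray] = None

    def required_bytes(self, max_input_length: int) -> int:
        # Assumed formula for illustration only; a real kernel has its own sizing rule.
        return max_input_length * self.hidden_size * self.bytes_per_element

    def warmup(self, max_input_length: int) -> None:
        # Allocate exactly what the longest expected input needs.
        self.buffer = bytearray(self.required_bytes(max_input_length))


params = QuantParams(bits=4, group_size=128)
scratch = ScratchSpace(hidden_size=8)
scratch.warmup(max_input_length=16)
```

Callers read `params.bits` rather than indexing into a tuple, and the buffer stays unallocated until `warmup` supplies the real upper bound, so it is never sized for a worst-case guess.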
linear.py 7.37 KB