    Improve the handling of quantized weights (#2250) · ba291dad
    Daniël de Kok authored
    * Improve the handling of quantized weights
    
    Handling of quantized weights was split between two mechanisms:
    
    - For quantized checkpoints, we used the new weight loader
      infrastructure.
    - For quantization while loading (EETQ, FP8, bitsandbytes), we
      instead relied on conditionals in `get_linear` (see the sketch
      after this list).
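
    A minimal sketch of that old pattern (names, signature, and branches
    are illustrative only, not TGI's exact code): a raw tensor and the
    `quantizer` string travel into `get_linear`, which dispatches on the
    string.

    ```python
    from typing import Optional

    import torch
    from torch import nn


    def get_linear(weight: torch.Tensor, bias: Optional[torch.Tensor],
                   quantizer: Optional[str]) -> nn.Module:
        # Every quantization method needs its own branch on the string.
        if quantizer is None:
            linear = nn.Linear(weight.shape[1], weight.shape[0], bias=bias is not None)
            with torch.no_grad():
                linear.weight.copy_(weight)
                if bias is not None:
                    linear.bias.copy_(bias)
            return linear
        if quantizer in ("eetq", "fp8", "bitsandbytes"):
            # Quantize the raw tensor on the fly and wrap it in a custom layer.
            raise NotImplementedError(f"sketch does not implement {quantizer!r}")
        raise ValueError(f"Unknown quantization method: {quantizer}")
    ```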
    
    Weight loaders support context managers to selectively load
    particular layers with different weight loaders. This is useful
    for models like Idefics2 AWQ, which uses a quantized text model
    but unquantized vision and connector models. However, the context
    manager would be overridden by `get_linear`, which string-checks
    `quantizer`. Also, the context manager did not work with
    EETQ, FP8, and bitsandbytes.
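
    A hedged sketch of the context-manager idea (class and method names
    such as `Weights`, `use_loader`, `DefaultLoader`, and `AwqLoader` are
    assumptions for illustration, not necessarily TGI's API): a weights
    container holds the active loader, and a context manager temporarily
    swaps it so that, for example, the vision tower loads unquantized
    while the text model uses the quantized loader.

    ```python
    from contextlib import contextmanager

    import torch


    class DefaultLoader:
        """Loads plain, unquantized tensors."""

        def load(self, tensor: torch.Tensor):
            return tensor


    class AwqLoader:
        """Stand-in for a quantized weight loader (real logic omitted)."""

        def load(self, tensor: torch.Tensor):
            return tensor  # dequantization/unpacking would happen here


    class Weights:
        def __init__(self, tensors: dict, loader):
            self.tensors = tensors
            self.loader = loader

        @contextmanager
        def use_loader(self, loader):
            # Temporarily override the active loader for a sub-model.
            previous, self.loader = self.loader, loader
            try:
                yield
            finally:
                self.loader = previous

        def get_weight(self, name: str):
            return self.loader.load(self.tensors[name])


    weights = Weights(
        {"text.q_proj": torch.randn(4, 4), "vision.proj": torch.randn(4, 4)},
        AwqLoader(),
    )

    q_proj = weights.get_weight("text.q_proj")  # quantized text model

    with weights.use_loader(DefaultLoader()):
        vision_proj = weights.get_weight("vision.proj")  # unquantized vision model
    ```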
    
    This change migrates all quantizers to the weight loader infrastructure.
    This has several benefits:
    
    - We can use context managers with all quantizers.
    - All the implementation details move down to the quantizer layers;
      `get_linear` no longer needs to know how to handle quantizer-specific
      linear layers.
    - All quantizer weights are strongly typed, so we don't pass around
      raw tensors (see the sketch after this list).
    - We don't have to pass around the `quantizer` string everywhere.
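
    A small sketch of what strongly typed quantizer weights can look like
    (class names such as `UnquantizedWeight` and `Fp8Weight` are
    illustrative assumptions): each weight type knows how to build its own
    linear layer, so a generic `get_linear` only delegates and needs no
    string checks.

    ```python
    from dataclasses import dataclass
    from typing import Optional

    import torch
    from torch import nn


    @dataclass
    class UnquantizedWeight:
        weight: torch.Tensor

        def get_linear(self, bias: Optional[torch.Tensor]) -> nn.Module:
            out_features, in_features = self.weight.shape
            linear = nn.Linear(in_features, out_features, bias=bias is not None)
            with torch.no_grad():
                linear.weight.copy_(self.weight)
                if bias is not None:
                    linear.bias.copy_(bias)
            return linear


    @dataclass
    class Fp8Weight:
        weight: torch.Tensor
        scale: torch.Tensor

        def get_linear(self, bias: Optional[torch.Tensor]) -> nn.Module:
            # An FP8 matmul module wrapping `weight` and `scale` would be built here.
            raise NotImplementedError("sketch only")


    def get_linear(weight, bias=None) -> nn.Module:
        # No string checks: the weight object builds its own layer.
        return weight.get_linear(bias)
    ```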
    
    * Exclude non-MLP layers when using FP8 quantization with Llama