    Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f
    Daniël de Kok authored
    
    
    * Add support for compressed-tensors w8a8 int checkpoints
    
    This change adds a loader for w8a8 int checkpoints. A large benefit of
    int8 support is that the corresponding CUTLASS matmul kernels also work
    on compute capability 7.5 (Turing-generation GPUs such as the T4).
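
    For illustration (this is not code from the change itself): a
    compressed-tensors w8a8 int checkpoint typically declares its scheme in
    the model's config.json under `quantization_config`, which is what a
    loader keys off. The sketch below shows what such a config can look
    like; exact keys and values vary per checkpoint.

    ```python
    # Sketch of a compressed-tensors w8a8 int quantization config, expressed
    # as a Python dict. Illustrative only; contents vary per checkpoint.
    w8a8_int_config = {
        "quant_method": "compressed-tensors",
        "config_groups": {
            "group_0": {
                "targets": ["Linear"],
                "weights": {
                    "num_bits": 8,
                    "type": "int",
                    "symmetric": True,
                    "strategy": "channel",  # one scale per output channel
                },
                "input_activations": {
                    "num_bits": 8,
                    "type": "int",
                    "symmetric": True,
                    "dynamic": True,   # activation scales computed at runtime
                    "strategy": "token",  # one scale per token
                },
            }
        },
    }
    ```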
    
    Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
    
    |     Tasks     |Version|     Filter     |n-shot|        Metric         |Dir|Value |±  |Stderr|
    |---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
    |gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
    |               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
    |ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
    |               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
    |               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
    |               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|
    
    These results are in the same ballpark as vLLM.
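
    As a rough sketch of how such numbers can be reproduced with the
    lm-evaluation-harness Python API (the exact invocation is not recorded
    here; the endpoint, port, and arguments below are assumptions, with a
    TGI instance serving the checkpoint through its OpenAI-compatible
    completions API):

    ```python
    import lm_eval

    # Assumption: TGI is serving the w8a8 checkpoint on localhost:3000.
    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=(
            "model=neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8,"
            "base_url=http://localhost:3000/v1/completions"
        ),
        # Harness task defaults: gsm8k_cot_llama is 8-shot, ifeval is 0-shot.
        tasks=["gsm8k_cot_llama", "ifeval"],
    )
    print(results["results"])
    ```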
    
    As usual, lots of thanks to Neural Magic/vLLM for the kernels.
    
    * Always use dynamic input quantization for w8a8 int
    
    It's far less flaky and gives better output.
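
    To illustrate what dynamic input quantization means here, a minimal
    PyTorch sketch (not the actual kernel path, which is fused into the
    marlin-kernels/CUTLASS int8 GEMM): the activation scale is derived per
    token from the runtime absolute maximum rather than from a fixed,
    offline-calibrated scale.

    ```python
    import torch

    def dynamic_per_token_int8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """Quantize activations to int8 with one scale per token (row).

        Minimal sketch; the production path computes this inside fused kernels.
        """
        # One scale per row, from the runtime absolute maximum of that row.
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
        return q, scale

    # The int8 GEMM multiplies q with the int8 weights and rescales the int32
    # accumulator by (per_token_activation_scale * per_channel_weight_scale).
    ```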
    
    * Use marlin-kernels 0.3.5
    
    * Fix a typo
    Co-authored-by: drbh <david.richard.holtz@gmail.com>
    
    * Small fixes
    
    ---------
    Co-authored-by: drbh <david.richard.holtz@gmail.com>