Add support for scalar FP8 weight scales (#2550) · c29dc89c
Daniël de Kok authored
* Add support for scalar FP8 weight scales (see the first sketch below)
    
    * Support LLM compressor FP8 checkpoints on H100
    
On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, FP8 quantization was not being picked up for models quantized
with LLM Compressor. This change adds enough config parsing to detect
whether a model has FP8-quantized weights (see the second sketch below).
    
    * Remove stray debug print
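
The first change normalizes weight scales at load time. Here is a minimal
sketch of the idea, assuming downstream FP8 kernels want one scale per
output row; `_expand_scalar_scale` is a hypothetical helper, not the
actual function from fp8.py:

```python
import torch


def _expand_scalar_scale(weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: downstream FP8 kernels expect one scale per
    # output row (per-channel). Some checkpoints store a single scalar
    # scale for the whole weight tensor instead; broadcast it so both
    # layouts look identical to the rest of the loading code.
    if scale.numel() == 1:
        return scale.reshape(1).expand(weight.shape[0]).contiguous()
    # Already per-channel: flatten away any extra dimensions.
    return scale.reshape(-1)
```

Normalizing here keeps the scalar and per-channel cases on a single code
path for everything that runs after weight loading.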
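For the H100 path, the loader needs to recognize FP8 checkpoints from the
model config alone. A minimal sketch of such detection, assuming the
checkpoint's config.json carries a `quantization_config` section in the
compressed-tensors style; the helper name and the exact keys checked are
assumptions, not the code from this commit:

```python
import json


def _has_fp8_weights(config_path: str) -> bool:
    # Hypothetical sketch: LLM Compressor checkpoints describe their
    # scheme in `quantization_config` inside config.json. Treat any
    # 8-bit float weight scheme as FP8.
    with open(config_path) as f:
        config = json.load(f)
    quant = config.get("quantization_config", {})
    if quant.get("quant_method") == "fbgemm_fp8":
        return True
    for group in quant.get("config_groups", {}).values():
        weights = group.get("weights", {})
        if weights.get("type") == "float" and weights.get("num_bits") == 8:
            return True
    return False
```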
fp8.py