[Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

Commit fa0050db, authored Jan 15, 2025 by Elfie Guo; committed by GitHub Jan 16, 2025

Showing with 2 additions and 1 deletion

vllm/model_executor/layers/quantization/fp8.py +2 -1

--- a/vllm/model_executor/layers/quantization/fp8.py
+++ b/vllm/model_executor/layers/quantization/fp8.py
@@ -355,7 +355,8 @@ class Fp8LinearMethod(LinearMethodBase):
            input_scale=layer.input_scale,
            bias=bias,
            cutlass_fp8_supported=self.cutlass_fp8_supported,
-            use_per_token_if_dynamic=False)
+            # Default to using per_token quantization if cutlass is supported
+            use_per_token_if_dynamic=self.cutlass_fp8_supported)
 class Fp8MoEMethod(FusedMoEMethodBase):
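
For readers unfamiliar with the flag, the following is a minimal pure-Python sketch of what switching `use_per_token_if_dynamic` from `False` to `self.cutlass_fp8_supported` means for dynamic activation scales. It is not vLLM's implementation: the helper names `per_tensor_scale`, `per_token_scales`, and `choose_scales` are hypothetical, activations are modeled as a plain list of token rows, and the only fact taken from outside the diff is that the largest finite value of the float8 e4m3fn format is 448.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def per_tensor_scale(x):
    """Hypothetical helper: one scale shared by every token in the batch."""
    amax = max(abs(v) for row in x for v in row)
    return amax / FP8_E4M3_MAX


def per_token_scales(x):
    """Hypothetical helper: one scale per token (row) of the activations."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]


def choose_scales(x, cutlass_fp8_supported):
    """Return one scale per token, mirroring the flag wiring in the diff."""
    # After this commit the flag simply follows CUTLASS support.
    use_per_token_if_dynamic = cutlass_fp8_supported
    if use_per_token_if_dynamic:
        return per_token_scales(x)
    # Per-tensor fallback: every token shares the same scale.
    return [per_tensor_scale(x)] * len(x)


# Two tokens with very different magnitudes.
activations = [[0.5, -1.0, 0.25], [112.0, -448.0, 56.0]]

tensor_scales = choose_scales(activations, cutlass_fp8_supported=False)
token_scales = choose_scales(activations, cutlass_fp8_supported=True)
```

In this sketch the small-magnitude first token gets its own, much smaller scale under per-token quantization, so it can use the full FP8 range instead of being squeezed by the outlier in the second token. That is the usual motivation for preferring per-token scales when the kernel supports them.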