Unverified commit fa0050db authored by Elfie Guo, committed by GitHub

[Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651)


Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
parent cd9d06fb
@@ -355,7 +355,8 @@ class Fp8LinearMethod(LinearMethodBase):
             input_scale=layer.input_scale,
             bias=bias,
             cutlass_fp8_supported=self.cutlass_fp8_supported,
-            use_per_token_if_dynamic=False)
+            # Default to using per_token quantization if cutlass is supported
+            use_per_token_if_dynamic=self.cutlass_fp8_supported)
 class Fp8MoEMethod(FusedMoEMethodBase):
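
For readers unfamiliar with the distinction, the following is a minimal pure-Python sketch of per-tensor versus per-token dynamic scaling. It is not the vLLM implementation: the helper names `per_tensor_scale`, `per_token_scales`, and `choose_scales` are hypothetical, the activations are plain nested lists, and the constant 448.0 assumes the float8 e4m3fn format. Only the flag names `cutlass_fp8_supported` and `use_per_token_if_dynamic` come from the diff.

```python
# Illustrative sketch only; helper names are hypothetical, not vLLM APIs.
FP8_E4M3_MAX = 448.0  # largest finite value, assuming float8 e4m3fn


def per_tensor_scale(x):
    """One scale shared by every token (row) of the activation matrix."""
    amax = max(abs(v) for row in x for v in row)
    return amax / FP8_E4M3_MAX


def per_token_scales(x):
    """One scale per token (row), so a large row does not crush small ones."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]


def choose_scales(x, cutlass_fp8_supported):
    """Mimic the new default: per-token scales only when cutlass is supported."""
    use_per_token_if_dynamic = cutlass_fp8_supported  # the line changed in the diff
    if use_per_token_if_dynamic:
        return per_token_scales(x)
    # Fallback: broadcast the single per-tensor scale to every row.
    return [per_tensor_scale(x)] * len(x)


# Two tokens: one with a large outlier, one with small values.
activations = [[1.0, -448.0], [0.5, 2.0]]
print(choose_scales(activations, cutlass_fp8_supported=True))
print(choose_scales(activations, cutlass_fp8_supported=False))
```

With per-token scaling, the second row gets its own smaller scale and so keeps more resolution after quantization. With per-tensor scaling, both rows share the scale set by the outlier in the first row.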