Fix issues in fused_attn_bwd (#1574)

* fix dtypes of fused_attn_bwd in CP+A2A

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix dtypes of fused_attn_bwd in CP+P2P

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix amax_per_step

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* clone scaling factors of fwd quantizers

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix fwd quantizers of CP+P2P

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor change

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* dequantize fp8 out in CP unit test

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* delete redundant None in FusedAttnFunc bwd

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>