[Common/PyTorch] Grouped GEMM via multi-stream cuBLAS (#853)
* GroupedGEMM via multi-stream cublas * fix A/B is nullptr while D is not nullptr * add fp8 grouped gemm * register with TorchScript * add the GroupedLinear layer --------- Signed-off-by:Xin Yao <xiny@nvidia.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Jiang Shao <jiangs@nvidia.com> Co-authored-by:
Qi Zhang <qizhang@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
Showing
This diff is collapsed.
Please register or sign in to comment