Luise/gbn optimization (#105)
* GroupBN: Reduced buffering for better hiding calculations in some loops of length OUTER_LOOPS
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50