-
luise.chen authored
* GroupBN: Reduced buffering for better hiding calculations in some loops of length OUTER_LOOPS * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50 * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50
56c283b6