    Luise/gbn optimization (#105) · 56c283b6
    luise.chen authored
    * GroupBN: Reduced buffering to better hide calculations in some loops of length OUTER_LOOPS
    
    * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels to improve resnet50 performance
    
    * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for a ~10% end-to-end (E2E) improvement on resnet50
batch_norm.py 12.7 KB