    Batchnorm splitk single kernel (#771) · 8f5cafaf
    Qianfeng authored
    * Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]
    
    * Add CountDataType as template parameter in blockwise_welford
    
    * Add utility/get_shift.hpp
    
    * Add BatchNorm multiblock single-kernel implementation
    
    * Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a
    
    * Renaming in device_batchnorm_forward_impl.hpp
    
    * Tiny fix in the batchnorm_fwd profiler
    
    * Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"
    
    This reverts commit d16d00919c43f10759e7b4e4d112125221ed9064.
    
    * Use the old two-kernel batchnorm multiblock method for gfx1030
    
    * Use the old two-kernel batchnorm multiblock method for gfx908
    
    * Use the single-kernel batchnorm multiblock method only for gfx90a
    
    * Remove get_wave_id() from utility/get_id.hpp since it is not used
    
    * Enable testing of the running mean/variance update and of saving mean/invvariance in the examples
    
    * Fix copyright wording
    
    * Remove unneeded include in utility/get_id.hpp
    
    * Add comments to workgroup_synchronization.hpp
    
    * Remove unused code in gridwise_multiblock_batchnorm_forward.hpp
    
    * Renaming in the kernels
    
    * Remove unused kernel file
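
For readers unfamiliar with the multiblock approach mentioned above, here is a minimal host-side sketch of how per-block Welford partial results (mean, sum of squared deviations, count) can be merged, with the count type as its own template parameter. This is an illustration only, not the code from this commit: `WelfordState`, `welford_update`, and `welford_merge` are hypothetical names, and the real blockwise_welford runs on the GPU with a different interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical partial-statistics record (not taken from the commit).
// CountDataType is kept separate from the value type T.
template <typename T, typename CountDataType>
struct WelfordState
{
    T mean{0};
    T m2{0}; // running sum of squared deviations from the mean
    CountDataType count{0};
};

// Standard single-sample Welford update.
template <typename T, typename CountDataType>
void welford_update(WelfordState<T, CountDataType>& s, T x)
{
    s.count += 1;
    const T delta = x - s.mean;
    s.mean += delta / static_cast<T>(s.count);
    s.m2 += delta * (x - s.mean);
}

// Merge two partial results, as would be needed when several blocks
// each reduce a slice of the same channel.
template <typename T, typename CountDataType>
WelfordState<T, CountDataType> welford_merge(const WelfordState<T, CountDataType>& a,
                                             const WelfordState<T, CountDataType>& b)
{
    WelfordState<T, CountDataType> r;
    r.count = a.count + b.count;
    if(r.count == 0)
        return r; // both inputs empty

    const T nb_over_n = static_cast<T>(b.count) / static_cast<T>(r.count);
    const T delta     = b.mean - a.mean;

    r.mean = a.mean + delta * nb_over_n;
    r.m2   = a.m2 + b.m2 + delta * delta * static_cast<T>(a.count) * nb_over_n;
    return r;
}
```

The variance is `m2 / count` after the final merge. Merging the states built from {1, 2, 3} and {4, 5, 6} gives the same mean and `m2` as a single pass over all six values.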
get_shift.hpp 348 Bytes