    Layernorm welford (#346) · 0bd6b842
    rocking5566 authored
    
    
    * Add threadwise and blockwise welford
    
    * Rename gridwise op, prepare to add welford version
    
    * implement welford and integrate welford into layernorm
    
    * Take care of tail loop
    
    * Fix bug when ThreadSliceK > 1
    
    * Fix bug when merging two empty sets
    
    * Rename clip to clamp
    
    * 1. Fix type of count
    2. Remove useless static_assert
    
    * Do not inherit Reduction::Argument
    
    * [What] replace __syncthreads() with block_sync_lds()
    [Why] __syncthreads() might wait on both lgkmcnt(0) and vmcnt(0)
    
    * Add y stride
    
    * Rename.
    DeviceLayernorm -> DeviceLayernormImpl
    DeviceNormalization2 -> DeviceLayernorm
    
    * Move literals ""_uz and ""_zu into namespace 'literals'
    
    * Move namespace 'literals' to 'ck::literals'

    Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
    Co-authored-by: Chao Liu <chao.liu2@amd.com>
profile_layernorm_impl.hpp 9.76 KB