    Standalone layernorm (#315) · 7f216620
    rocking5566 authored
    
    
    * Implement layernorm kernel and deviceOp
    
    * verify gpu kernel with host code
    
    * 1. Separate gamma and beta from affine
    2. Check if argument is valid
    
    * clean
    
    * Sync the naming
    
    * Support sweep once mode if we can put k dimension data inside one block
    
    * [What] Get length from upper length.
    [Why] If we get the length directly, we may get the length after padding.
    
    * We only use one block in K dimension.
    Hence, we can simplify the indexing of global R/W.
    
    * Use 1d descriptor for gamma and beta
    
    * Add accElementwiseOp
    
    * Extract layernorm host code
    
    * Support different YVectorDim in GridwiseLayernorm
    
    * Rename XSrcVectorDim to XYSrcVectorDim because we use the same parameter in deviceOp
    
    * Gamma and beta can share the VGPR.
    
    * Add test for fp32 and fp16
    
    * Fix concurrency bug and add a test case that could fail before the fix
    
    * Propagate NaN for layernorm
    Co-authored-by: Chao Liu <chao.liu2@amd.com>
CMakeLists.txt 1.78 KB