test/layernorm/test_layernorm_fp32.cpp · 7f216620896909e254284e418d08f4d20f938a01 · yangql / composable_kernel-1

rocking5566 authored Jul 14, 2022



* Implement layernorm kernel and deviceOp

* verify gpu kernel with host code

* 1. Separate gamma and beta from affine
2. Check if argument is valid

* clean

* Sync the naming

* Support sweep once mode if we can put k dimension data inside one block

* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.

* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.

* Use 1d descriptor for gamma and beta

* Add accElementwiseOp

* Extract layernorm host code

* Support different YVectorDim in GridwiseLayernorm

* Rename XSrcVectorDim to XYSrcVectorDim, because we use the same parameter in deviceOp

* Gamma and beta can share the VGPR.

* Add test for fp32 and fp16

* Fix concurrency bug and add a test case that could fail originally

* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>
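
The commit message mentions verifying the GPU kernel against host code, keeping gamma and beta as separate 1-D inputs, validating arguments, and propagating NaN. As a rough illustration only, a host-side reference could look like the sketch below. It is not the code from this commit: the function name `reference_layernorm`, the row-major `[M, K]` layout, and the default `epsilon` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host reference (not taken from the repository).
// x is row-major with shape [M, K]; gamma and beta are 1-D with length K.
std::vector<float> reference_layernorm(const std::vector<float>& x,
                                       const std::vector<float>& gamma,
                                       const std::vector<float>& beta,
                                       std::size_t M,
                                       std::size_t K,
                                       float epsilon = 1e-5f)
{
    // Basic argument validation (assumed checks, for illustration).
    assert(K > 0);
    assert(x.size() == M * K);
    assert(gamma.size() == K && beta.size() == K);

    std::vector<float> y(M * K);
    for(std::size_t m = 0; m < M; ++m)
    {
        const float* row = x.data() + m * K;

        // Mean over the K (reduced) dimension.
        float mean = 0.f;
        for(std::size_t k = 0; k < K; ++k)
            mean += row[k];
        mean /= static_cast<float>(K);

        // Biased variance over the same dimension.
        float var = 0.f;
        for(std::size_t k = 0; k < K; ++k)
        {
            const float d = row[k] - mean;
            var += d * d;
        }
        var /= static_cast<float>(K);

        // Normalize, then scale by gamma and shift by beta.
        // A NaN anywhere in the row reaches mean and var through the
        // plain sums, so the whole output row becomes NaN.
        const float inv_std = 1.f / std::sqrt(var + epsilon);
        for(std::size_t k = 0; k < K; ++k)
            y[m * K + k] = (row[k] - mean) * inv_std * gamma[k] + beta[k];
    }
    return y;
}
```

A test would typically compare the device output element by element against such a reference within a tolerance, and separately check that a row containing NaN yields NaN on both sides.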


test_layernorm_fp32.cpp 1.74 KB
