Copy into registers first when doing reductions with layernorm and softmax (#1489)
Avoids double global loads. Strided loops are unrolled, which lets the loaded values be stored in an array that the compiler can keep in registers, since every index access is constant. Also updated to handle large reductions, which gives a better result for Stable Diffusion.
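To illustrate the idea, here is a minimal host-side C++ sketch, not the kernel code from this commit. It simulates a block of threads that each copy their strided elements into a fixed-size local array once, then run both layernorm passes (mean, then variance) from those local copies instead of reading the input a second time. The names `block_size`, `items_per_thread`, `load_strided`, and `layernorm_row` are hypothetical, and the sketch assumes the row fits in `block_size * items_per_thread` elements, so it does not cover the large-reduction case.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sizes chosen for illustration; not taken from the commit.
constexpr std::size_t block_size       = 4; // simulated threads per block
constexpr std::size_t items_per_thread = 3; // elements owned by each thread

// Simulates one thread: read its strided elements from the input exactly once.
// The trip count is a compile-time constant, so a compiler can fully unroll
// the loop; every index into `local` is then constant, which is what allows
// a GPU compiler to keep such an array in registers.
std::array<float, items_per_thread> load_strided(const std::vector<float>& row,
                                                 std::size_t tid)
{
    std::array<float, items_per_thread> local{};
    for(std::size_t i = 0; i < items_per_thread; ++i)
    {
        const std::size_t idx = tid + i * block_size;
        local[i]              = idx < row.size() ? row[idx] : 0.0f;
    }
    return local;
}

// Layernorm over one row (no scale or bias), using only the local copies
// after the initial load.
std::vector<float> layernorm_row(const std::vector<float>& row, float eps = 1e-5f)
{
    // Assumption of this sketch: the whole row fits in the local arrays.
    assert(row.size() <= block_size * items_per_thread);
    const std::size_t n = row.size();

    // Single pass over the input: each simulated thread fills its array.
    std::vector<std::array<float, items_per_thread>> regs(block_size);
    for(std::size_t tid = 0; tid < block_size; ++tid)
        regs[tid] = load_strided(row, tid);

    // Pass 1: mean, computed from the local copies.
    float sum = 0.0f;
    for(std::size_t tid = 0; tid < block_size; ++tid)
        for(std::size_t i = 0; i < items_per_thread; ++i)
            if(tid + i * block_size < n)
                sum += regs[tid][i];
    const float mean = sum / static_cast<float>(n);

    // Pass 2: variance, again from the local copies (no second input read).
    float sq = 0.0f;
    for(std::size_t tid = 0; tid < block_size; ++tid)
        for(std::size_t i = 0; i < items_per_thread; ++i)
            if(tid + i * block_size < n)
                sq += (regs[tid][i] - mean) * (regs[tid][i] - mean);
    const float inv_std = 1.0f / std::sqrt(sq / static_cast<float>(n) + eps);

    // Write the normalized values back to their strided positions.
    std::vector<float> out(n);
    for(std::size_t tid = 0; tid < block_size; ++tid)
        for(std::size_t i = 0; i < items_per_thread; ++i)
            if(tid + i * block_size < n)
                out[tid + i * block_size] = (regs[tid][i] - mean) * inv_std;
    return out;
}
```

Calling `layernorm_row({1.0f, 2.0f, 3.0f, 4.0f})` returns four values with zero mean and roughly unit variance. On a GPU the outer `tid` loops would be the parallel threads and the sums would be block-level reductions; the sequential loops here only stand in for that.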