Improve layernorm and reductions performance (#1348)
* Compute mean and variance in the same reduction
* Set block size to numbers divisible by 32 instead of powers of 2
* Also set the global size exactly instead of requiring it to be divisible by the block size
* More exact matching of global/local sizes can help get rid of branching/loops
* Reduce vectors first before doing dpp_reduce
* Explicitly vectorize array operators since the compiler doesn't always vectorize them
* Still use the old for loop when computing at compile time, since neither reinterpret_cast nor all of the vector types are supported there