    Improve layernorm and reductions performance (#1348) · 97a1ed2d
    Paul Fultz II authored
    Compute mean and variance in same reduction
    Set block size to numbers divisible by 32 instead of powers of 2
    Global is also set exactly instead of being divisible by block size
    More exact matching of global/local can help get rid of branching/loops
    Reduce vectors first before doing dpp_reduce
    Explicitly vectorize array operators since the compiler doesn't always vectorize them
    Still uses the old for loop when computing at compile time, since neither reinterpret_cast nor all of the vector types are supported there
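
    The first two points of the message can be illustrated with a small host-side sketch. It is not the code from this commit: the names mean_var, mean_variance_single_pass and round_up_to_32 are hypothetical, and the variance formula E[x^2] - E[x]^2 is an assumption about how a single reduction might produce both values. The rounding helper shows one possible way to pick a block size that is a multiple of 32 rather than a power of 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration only; not taken from the commit.
struct mean_var
{
    double mean;
    double variance;
};

// Accumulate sum(x) and sum(x*x) in the same pass over the data, then
// derive the population variance as E[x^2] - E[x]^2. One pass replaces
// a mean reduction followed by a separate variance reduction.
inline mean_var mean_variance_single_pass(const std::vector<double>& xs)
{
    assert(!xs.empty());
    double sum    = 0.0;
    double sum_sq = 0.0;
    for(double x : xs)
    {
        sum += x;
        sum_sq += x * x;
    }
    const double n    = static_cast<double>(xs.size());
    const double mean = sum / n;
    return {mean, sum_sq / n - mean * mean};
}

// Round n up to the next multiple of 32 (assumed rounding rule),
// instead of rounding up to the next power of two.
inline std::size_t round_up_to_32(std::size_t n)
{
    return ((n + 31) / 32) * 32;
}
```

    For example, a reduction over 300 elements would get a block size of 320 under this rounding rule, where rounding to a power of 2 would give 512. The E[x^2] - E[x]^2 form can lose precision when the mean is large relative to the spread of the data, so a real kernel may need to accumulate in a wider type.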
test_conv_group_add.cpp 2.07 KB