Moved permute out of bwd kernel and cached qy in shared memory
Putting qy in shared memory is a little faster.
Changing the internal memory layout means we can leave the code in its standard shape and change the layout only outside the kernel.