Parallelize CUDA bwd along seqlen_k instead of seqlen_q
This is faster since we only need to do atomic adds on dq, instead of atomic adds on both dk and dv.
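To make the trade-off concrete, here is a rough counting sketch rather than the actual kernel. It assumes the backward pass is tiled into (q block, k block) pairs, that each tile contributes to dq for its q block and to dk and dv for its k block, and that each parallel worker owns one block along the parallelized axis. The function name `count_atomic_updates` and its parameters are hypothetical and are not part of the commit.

```python
def count_atomic_updates(num_q_blocks, num_k_blocks, parallel_over):
    """Count tile-level atomic accumulations into shared gradient buffers.

    Hypothetical model: every (q block, k block) tile contributes to
    dq[q block], dk[k block] and dv[k block]. A worker accumulates privately
    into gradients indexed by the block it owns; gradients indexed by the
    other axis are shared between workers and need atomic adds.
    """
    if parallel_over not in ("seqlen_q", "seqlen_k"):
        raise ValueError("parallel_over must be 'seqlen_q' or 'seqlen_k'")

    atomics = {"dq": 0, "dk": 0, "dv": 0}
    if parallel_over == "seqlen_q":
        # One worker per q block: dq[qb] is private, dk and dv are shared.
        for _qb in range(num_q_blocks):
            for _kb in range(num_k_blocks):
                atomics["dk"] += 1
                atomics["dv"] += 1
    else:
        # One worker per k block: dk[kb] and dv[kb] are private, dq is shared.
        for _kb in range(num_k_blocks):
            for _qb in range(num_q_blocks):
                atomics["dq"] += 1
    return atomics


if __name__ == "__main__":
    for axis in ("seqlen_q", "seqlen_k"):
        counts = count_atomic_updates(4, 8, axis)
        print(axis, counts, "total:", sum(counts.values()))
```

Under these assumptions, 4 q blocks and 8 k blocks give 64 atomic tile updates when parallelizing along seqlen_q and 32 when parallelizing along seqlen_k. The model ignores masking and the cost of each update, so it only illustrates why one shared buffer is cheaper than two.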