Optimize bwd kernel: incremental qdot_max and alpha/integral/etc
Apply the same incremental qdotk_max "trick" to the backward kernel. This removes one loop and improves performance by about 20%.
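For readers unfamiliar with the idea, here is a minimal Python sketch of an incremental running-max update. It is not the kernel code from this commit. It assumes the "trick" is the usual online rescaling used for numerically stable softmax-style sums: rather than one loop to find the maximum score and another to accumulate, the maximum is updated on the fly and the accumulators are rescaled by a correction factor. The function names, the `scale` variable, and the scalar `values` are hypothetical and chosen only for illustration.

```python
import math


def weighted_sum_two_pass(scores, values):
    """Reference version: a separate pass finds the max before accumulating."""
    m = max(scores)  # extra pass over the scores
    denom = 0.0
    num = 0.0
    for s, v in zip(scores, values):
        w = math.exp(s - m)
        denom += w
        num += w * v
    return num / denom


def weighted_sum_one_pass(scores, values):
    """Single pass: keep a running max and rescale the accumulators.

    Assumes every score is finite (illustrative sketch only).
    """
    running_max = -math.inf
    denom = 0.0
    num = 0.0
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Correction factor for what was accumulated under the old max.
        # On the first iteration this is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        denom = denom * scale + w
        num = num * scale + w * v
        running_max = new_max
    return num / denom
```

Both functions return the same result up to floating-point rounding. The one-pass version trades the extra loop for one additional `exp` and two multiplies per element.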