[Refactor] Optimize RMS normalization kernel in rms_norm.py (#333)
- Introduced a new local fragment for squared values to improve performance. - Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency. - Refactored the final multiplication step to operate on the local fragment instead of shared memory. - Added a configuration option to the kernel compilation for better control over TMA lowering. These changes enhance the efficiency and clarity of the RMS normalization implementation.
Showing
Please register or sign in to comment