    [Refactor] Optimize RMS normalization kernel in rms_norm.py (#333) · 85e411c8
    Yu Cheng authored
    - Introduced a new local fragment for squared values to improve performance.
    - Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
    - Refactored the final multiplication step to operate on the local fragment instead of shared memory.
    - Added a configuration option to the kernel compilation for better control over TMA lowering.
    
    These changes enhance the efficiency and clarity of the RMS normalization implementation.
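
For readers unfamiliar with the computation, the steps listed above can be sketched in plain Python. This is a minimal reference sketch, not the kernel from this commit: the function name `rms_norm_rows`, the `eps` value, and the use of Python lists in place of GPU fragments and shared memory are all assumptions made for illustration.

```python
import math


def rms_norm_rows(rows, eps=1e-12):
    """Reference RMS normalization, row by row: y = x / sqrt(mean(x^2) + eps).

    Hypothetical helper for illustration; `eps` is an assumed value.
    """
    out = []
    for row in rows:
        # Stand-in for the local fragment that holds the squared values.
        squared = [v * v for v in row]
        # Row-wise reduction, then the reciprocal square root of the mean.
        inv_rms = 1.0 / math.sqrt(sum(squared) / len(row) + eps)
        # Final multiplication on the local copy of the row, not a shared buffer.
        out.append([v * inv_rms for v in row])
    return out


if __name__ == "__main__":
    print(rms_norm_rows([[3.0, 4.0], [1.0, 1.0]]))
```

After normalization, the mean of the squared values in each row is approximately 1, which is a convenient sanity check when comparing a kernel's output against a reference.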
rms_norm.py 2.89 KB