    [Feature] Add 1D TMA support (#761) · 1774a1aa
    Zhengju Tang authored
    
    
    * [Feature] Add 1D TMA support
    - Check the contiguous conditions of 1D TMA copy
    - Add a new interface and parameter order for `tma_load` and `tma_store` calls
    - Add 1D `tma_store` interface in sm90 template
    - Add elementwise kernel for 1D TMA example
    
    * [Lint]
    
    * [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors
    
    * [Lint]
    
    * [BugFix] 1D TMA load
    
    * [README] Update GDN README for clarity and add acknowledgements (#758)
    
    - Improved formatting and clarity of the GDN kernel implementation description.
    - Updated requirement section to list dependencies in a clearer format.
    - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
    
    * cutlass v4.2.0 supporting cuda 13 (#760)
    
    * [Lint]
    
    * [Lint]
    
    * [MXFP4] Add test for bf16&mxfp4 gemm
    
    * [BugFix]
    
    * [Lint]
    
    ---------
    Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
    Co-authored-by: Johnny <johnnync13@gmail.com>
example_wy_fast_bwd_split.py 22.3 KB