  1. 02 Nov, 2025 1 commit
    • [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb
      Lei Wang authored
      
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * wrap warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
  2. 15 Oct, 2025 1 commit
  3. 11 Oct, 2025 1 commit
    • [Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977) · ddfaac36
      Lei Wang authored
      
      * InjectFenceProxy docs and tests
      
        - annotate proxy fence injector with context comments for async/generic detection
        - add compiler internals doc covering the pass mechanics and link it in docs index
        - repair fence proxy test by fixing descriptor init usage and fence counter logic
      
      * do not consider call_extern as async.
      
      * doc update.
      
      * reduce test size for sparse mla
  4. 24 Aug, 2025 1 commit
    • [Bugfix] Add missing FP8 header include (#752) · cf7be057
      Lei Wang authored
      
      
      * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h
      
      - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
      - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      
      * [Enhancement] Include cuda_fp8.h in gemm_sm90.h
      
      - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      
      * lint fix
      
      * [Refactor] Remove unused tl_shuffle_elect and related functions from common.h
      
      - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
      - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
      - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.
      
      * lint fix
      
      * [Refactor] Update header inclusions in common.h and gemm_sm90.h
      
      - Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
      - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.
      
      * bug fix