    [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
    Zhiwen Mo authored
    * Implement tcgen05.ld instruction support for copying from shared.tmem
      to local.fragment on SM100/Blackwell architecture. Add layout inference
      and lowering logic for tensor memory operations with proper physical
      coordinate range analysis and warpgroup alignment checks.
    
      Changes:
      - Add kTMemLoad and kTMemStore to CopyInst enumeration
      - Implement CheckTMemLoad() and CheckTMemStore() validation functions
      - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
      - Add tmem layout inference in InferLayout() using expandTcgen05Layout
      - Support multiple instruction variants (32dp32b/64b/128b/256b)
      - Add physical layout bounds analysis for tmem coordinates
      - Change clear_accum from bool to PrimExpr in GEMM operations
      - Fix std::optional access checks in layout_inference.cc
      - Add tmem_allocate/deallocate PTX intrinsic support
      - Fix cooperative_groups grid.sync() code generation
    
    * fix
    
    * pipeline fix
    
    * bug fix
    
    * bool fix
gemm.h 6.34 KB