1. 02 Oct, 2025 1 commit
    • [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
      Zhiwen Mo authored
      * Implements tcgen05.ld instruction support for copying from shared.tmem
        to local.fragment on SM100/Blackwell architecture. Adds layout inference
        and lowering logic for tensor memory operations with proper physical
        coordinate range analysis and warpgroup alignment checks.
      
        Changes:
        - Add kTMemLoad and kTMemStore to CopyInst enumeration
        - Implement CheckTMemLoad() and CheckTMemStore() validation functions
        - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
        - Add tmem layout inference in InferLayout() using expandTcgen05Layout
        - Support multiple instruction variants (32dp32b/64b/128b/256b)
        - Add physical layout bounds analysis for tmem coordinates
        - Change clear_accum from bool to PrimExpr in GEMM operations
        - Fix std::optional access checks in layout_inference.cc
        - Add tmem_allocate/deallocate PTX intrinsic support
        - Fix cooperative_groups grid.sync() code generation
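
The `clear_accum` change listed above can be pictured with a plain-Python stand-in. This is only a sketch under assumptions, not TileLang's API: `gemm_tile` and `accumulate_over_k` are hypothetical helpers, and the idea that an expression-valued flag lets the clear decision depend on a loop variable (for example `k == 0`) is a reading of the motivation rather than something the commit message states.

```python
def gemm_tile(a, b, acc, clear_accum):
    """Hypothetical tile GEMM: acc = (0 if clear_accum else acc) + a @ b."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            # When the flag is set, stale accumulator contents are ignored.
            base = 0.0 if clear_accum else acc[i][j]
            acc[i][j] = base + sum(a[i][p] * b[p][j] for p in range(inner))
    return acc


def accumulate_over_k(a_tiles, b_tiles, acc):
    # The flag is computed per iteration, the way a runtime expression
    # (assumed here to be something like `k == 0`) would be evaluated.
    for k, (a, b) in enumerate(zip(a_tiles, b_tiles)):
        gemm_tile(a, b, acc, clear_accum=(k == 0))
    return acc


stale = [[99.0, 99.0], [99.0, 99.0]]  # garbage left in the accumulator
a_tiles = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
b_tiles = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
result = accumulate_over_k(a_tiles, b_tiles, stale)
```

With a compile-time `bool`, the first iteration would need a separate zero-fill or a peeled loop; an expression folds that decision into the loop body.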
      
      * fix
      
      * pipeline fix
      
      * bug fix
      
      * bool fix
  2. 28 Sep, 2025 1 commit
    • [SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43
      Zhiwen Mo authored
      * Update SM100-related utcmma, tmem, and ld/st256 code in src
      * Update SM100-related utcmma, tmem, and ld/st256 code in tilelang
      * Remove deprecated GEMM examples and related README documentation for SM100 architecture support
      * Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
      * Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
      * Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
      * Update README and source files to reflect TCGEN5.MMA terminology changes
      * Refactor CUDA GEMM header for improved readability
  3. 10 Sep, 2025 1 commit
    • [TileOp] Introduce an experimental Python-defined `T.gemm_v2` (#793) · 91a7bb2b
      Lei Wang authored
      * Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability
      
      - Removed deprecated prime factorization functions from `gemm.cc` and `gemm_sp.cc`.
      - Introduced a new `GemmWarpPolicy` class to manage warp policy attributes and methods, improving encapsulation.
      - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
      - Enhanced `GetArchInt` function in `utils.cc` for better readability and type safety.
      - Added new `gemm_v2` function in `gemm.py` for improved GEMM operation with additional parameters and checks.
      
      * Refactor GEMM and frontend legalize operations for improved clarity and functionality
      
      - Updated `gemm_py.h` to include the correct header for GEMM operations.
      - Renamed `FrontendLegalizer` class to `LetInliner` and updated related methods to reflect this change, enhancing code clarity.
      - Modified the pass function from `FrontendLegalize` to `LetInline` for better alignment with its purpose.
      - Updated test cases to utilize the new `gemm_v2` function and adjusted the testing framework for improved output and clarity.
      - Removed obsolete test file `test_tilelang_transform_frontend_legalize.py` to streamline the test suite.
      - Enhanced the `LowerAndLegalize` function to utilize the new `LetInline` pass, improving the overall transformation process.
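
The renamed pass suggests a let-inlining transformation. The sketch below shows the general technique on a toy tuple-based expression tree; it is not the `LetInliner` implementation, which works on the compiler's own IR and may apply conditions not modeled here. The node shapes (`"let"`, `"var"`, `"const"`, `"add"`, `"mul"`) and the function `inline_lets` are hypothetical.

```python
def inline_lets(expr, env=None):
    """Replace every let-bound variable with its (already inlined) value."""
    env = env or {}
    kind = expr[0]
    if kind == "const":
        return expr
    if kind == "var":
        # Bound variables are substituted; free variables are left alone.
        return env.get(expr[1], expr)
    if kind == "let":
        _, name, value, body = expr
        inlined_value = inline_lets(value, env)
        # An inner binding shadows any outer binding of the same name.
        return inline_lets(body, {**env, name: inlined_value})
    # Binary operators: ("add" | "mul", lhs, rhs)
    return (kind, inline_lets(expr[1], env), inline_lets(expr[2], env))


# let x = a + 1 in x * x
program = (
    "let", "x", ("add", ("var", "a"), ("const", 1)),
    ("mul", ("var", "x"), ("var", "x")),
)
inlined = inline_lets(program)
```

After inlining, no `let` nodes remain and `x` is replaced by `a + 1` at both use sites, which is why the name `LetInline` describes the pass more directly than `FrontendLegalize`.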
      
      * Enhance CUDA code generation and testing for GEMM operations
      
      - Added indentation printing in `codegen_cuda.cc` for improved assembly code formatting.
      - Updated `test_tilelang_tilelibrary_gemm.py` to include additional GEMM test cases and shared memory allocation with specified scope.
      - Introduced new `matmul_sr` and `run_gemm_sr` functions for GEMM operations with shared and fragment memory layouts.
      - Refactored layout inference in `mma_macro_generator.py` to improve clarity and correctness in shared memory handling.
      - Enhanced `gemm/__init__.py` to support new GEMM operation combinations and layout inference logic.
      
      These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.
      
      * Refactor GEMM layout and testing for improved clarity and functionality
      
      - Updated `gemm_layouts.cc` to enhance the layout generation logic for transposed and non-transposed GEMM operations.
      - Renamed and modified functions in `test_tilelang_tilelibrary_gemm.py` to reflect changes in GEMM function signatures and improve test coverage.
      - Introduced new GEMM operation combinations in `gemm/__init__.py` to support additional layouts and configurations.
      - Enhanced layout inference in `mma_layout.py` and `mma_macro_generator.py` for better handling of shared memory layouts.
      
      These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.
      
      * Refactor GEMM layout and Python integration for improved functionality
      
      - Updated `gemm_layouts.cc` to correct the order of layout replication and repetition for transposed and non-transposed GEMM operations.
      - Enhanced `gemm_py.cc` to handle block realization more robustly, ensuring correct assignment of global symbols and block attributes.
      - Refactored `inject_pipeline.cc` to streamline buffer read/write region handling, improving clarity and maintainability.
      - Cleaned up test cases in `test_tilelang_tilelibrary_gemm.py` by removing unnecessary print statements and adjusting function calls for better test execution flow.
      
      These changes enhance the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.
      
      * Refactor GEMM layout and testing for improved clarity and functionality
      
      - Updated `gemm_layouts.cc` to enhance layout generation logic for transposed and non-transposed GEMM operations.
      - Improved block realization handling in `gemm_py.cc` for better assignment of global symbols.
      - Streamlined buffer read/write region handling in `inject_pipeline.cc` for clarity.
      - Enhanced test cases in `test_tilelang_tilelibrary_gemm.py` by adjusting function calls and adding new GEMM operation combinations.
      
      These changes improve the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.
      
      * tfloat32 support.
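
For context on the format itself: TF32 keeps float32's sign bit and 8 exponent bits but only 10 of its 23 mantissa bits. The sketch below emulates that precision by truncation; `to_tf32` is a hypothetical helper, truncation is an assumption (hardware conversion may round instead), and nothing here reflects how this commit wires tfloat32 into TileLang.

```python
import struct


def to_tf32(x):
    """Truncate a float to TF32 precision (1 sign, 8 exponent, 10 mantissa bits)."""
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Clear the low 13 of float32's 23 mantissa bits, keeping the top 10.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


kept = to_tf32(1.0 + 2.0 ** -10)     # lowest mantissa bit TF32 can hold
dropped = to_tf32(1.0 + 2.0 ** -11)  # one bit below TF32 precision
```

Here `kept` retains the `2 ** -10` term while `dropped` collapses to `1.0`, which gives a feel for the tolerance that GEMM tests on tfloat32 inputs need compared with full float32.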
      
      * lint fix
      
      * lint fix
      
      * Refactor shared memory allocation in GEMM tests
      
      - Removed unnecessary scope specification in shared memory allocation for matrices A and B in `test_tilelang_tilelibrary_gemm.py`.
      - This change simplifies the allocation process and aligns with the updated GEMM function signatures.