1. 09 May, 2025 7 commits
    • Lei Wang's avatar
      [Typo] Rename `power_of_int` with `pow_of_int` for consistency (#468) · c99b7056
      Lei Wang authored
      * typo fix
      
      * Rename `power_of_int` to `pow_of_int` in math operations and update corresponding Python API reference. Adjusted registration attributes to reflect the new naming convention.
      c99b7056
    • Lei Wang's avatar
      [Feature] Implement fast integer power operation and related API (#466) · 1f5eb492
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * [Feature] Implement fast integer power operation and related API
      
      * Added a new math operation `tl.power_of_int` in `math.cc` for efficient integer exponentiation.
      * Introduced a corresponding Python API `pow_of_int` in `tir/op.py` to facilitate usage in TileLang.
      * Enhanced `common.h` with a template function for integer power calculations.
      * Updated documentation to reflect the new functionality and usage examples.
      1f5eb492
    • Lei Wang's avatar
      [Bugfix] Fix copy region automation for dynamic extent (#465) · 2ffbd369
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * [Refactor] Improve buffer region validation in copy.py
      
      * Added handling for variable extents in buffer_region_to_tile_region function to enhance type checking and error handling.
      * Introduced debug print statements to trace values of region extents and temporary extents during validation.
      * Updated logic to account for variable extent counts when determining alignment of extents.
      
      * [Refactor] Remove debug print statements in buffer_region_to_tile_region function
      
      * Eliminated unnecessary print statements that were used for debugging temporary extents and region extents.
      * Streamlined the code for better readability while maintaining the existing functionality of buffer region validation.
      
      * [Refactor] Clean up whitespace in buffer_region_to_tile_region function
      
      * Removed an unnecessary blank line in the buffer_region_to_tile_region function to improve code readability and maintain consistency in formatting.
      2ffbd369
    • Lei Wang's avatar
      [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) · f41c467c
      Lei Wang authored
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      f41c467c
    • Lei Wang's avatar
      [Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4
      Lei Wang authored
      * [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
      
      * [Refactor] Update warp_specialized_rewriter with license change and code cleanup
      
      * Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
      * Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
      * Added `#include` for `common/collector.h` to support new functionality.
      * Updated file documentation to reflect the correct filename and purpose.
      * Improved overall code readability by removing unnecessary comments and sections.
      
      * [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py
      
      * Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
      * Enhanced documentation for new synchronization functions to clarify usage and parameters.
      * Updated buffer region validation in `copy.py` to ensure type checking for integer values, improving error handling for region extents.
      
      * lint fix
      
      * [Feature] Introduce TMA barrier injection and related utilities
      
      * Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
      * Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
      * Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
      * Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
      * Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.
      
      * lint fix
      d946d1d4
    • Cunxiao Ni's avatar
      [CI] Add elementwise and gemv examples to CI. (#458) · dd7eb488
      Cunxiao Ni authored
      * [CI] Add elementwise and gemv examples to CI.
      
      * fix lint
      
      * test
      
      * fix gemv lint
      
      * fix lint
      dd7eb488
    • Rinne's avatar
      8d5e803e
  2. 08 May, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Update barrier functions and remove argparse in... · b0122d74
      Lei Wang authored
      [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457)
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
      b0122d74
    • Lei Wang's avatar
      [Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9
      Lei Wang authored
      * Add example for warp specialization with flash attention
      
      * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
      * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
      * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
      * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
      
      * Add memory barrier functions to builtin.py
      
      * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
      * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
      * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
      * Improved code organization and readability by adding blank lines for better separation of logical sections.
      
      * Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
      
      * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
      * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
      
      * [Refactor] Update barrier functions and add new example for GEMM with warp specialization
      
      * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
      * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
      * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
      * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
      * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
      
      * lint fix
      
      * Update loop variable checks in SIMT loop and buffer region validation
      
      * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
      * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
      
      * lint fix
      
      * test fix
      a91bc2a9
  3. 07 May, 2025 1 commit
  4. 06 May, 2025 5 commits
    • Lei Wang's avatar
      [Feature] Add cache directory management functions in tilelang.cache (#453) · 0aaef97d
      Lei Wang authored
      * [Feature] Add cache directory management functions in tilelang.cache
      
      * Introduced `get_cache_dir` and `set_cache_dir` functions to manage the kernel cache directory.
      * Updated `KernelCache` class to store cache directory as a `Path` object for improved path handling.
      * Enhanced documentation with examples for new cache directory functions.
      
      * [Refactor] Update cache imports in tilelang.__init__.py
      
      * Added `set_cache_dir` and `get_cache_dir` functions to the import statement for improved cache directory management.
      * This change enhances the accessibility of cache directory management functions within the module.
      0aaef97d
    • Lei Wang's avatar
      [Enhancement] Introduce pass_configs parameter for kernel Caching (#452) · b1ba0cc8
      Lei Wang authored
      * [Enhancement] Introduce pass_configs parameter for kernel compilation
      
      * Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations.
      * Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase.
      * Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output.
      * Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution.
      
      * lint fix
      b1ba0cc8
    • Lei Wang's avatar
      [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP (#450) · 0a8c8b99
      Lei Wang authored
      * [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP
      
      * Introduced TILELANG_CHECK_LAST_ERROR macro to streamline error checking for kernel launches in both CUDA and HIP.
      * Updated kernel launch code in wrapper.py to utilize the new macro, enhancing readability and maintainability.
      * This change improves error reporting by providing detailed messages when kernel execution fails.
      
      * [Refactor] Standardize error message formatting in TILELANG_CHECK_LAST_ERROR macro
      
      * Updated the TILELANG_CHECK_LAST_ERROR macro in both CUDA and HIP implementations to ensure consistent formatting of error messages.
      * Enhanced readability by aligning the error message structure across different platforms, improving maintainability of error handling code.
      0a8c8b99
    • Lei Wang's avatar
      Update requirements.txt (#451) · 025929d8
      Lei Wang authored
      025929d8
    • Lei Wang's avatar
      [Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated the `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      
      * [Feature] Add new examples for warp specialization and TMA integration
      
      * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
      * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
      * Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
      * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.
      
      * lint fix
      
      * Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
      b5faf25a
  5. 03 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Separate warp specialize rewriter and tma barrier injector pass (#447) · fce16b00
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated the `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      fce16b00
  6. 01 May, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Fix safe memory legalization for fragment store (#446) · e46653ac
      Lei Wang authored
      * [Enhancement] Improve layout inference accuracy in ParallelOp (#441)
      
      * Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
      * Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
      
      * [Enhancement] Add error handling macros and refactor loop partitioning logic
      
      * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
      * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
      * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
      * Updated pass configuration management to streamline vectorization control in the optimization process.
      
      * lint fix
      
      * remove debug print
      
      * [Refactor] Update legalize_safe_memory_access.cc to improve memory access handling
      
      * Replaced Apache License header with MIT License.
      * Added logic to handle local buffer conditions in memory access.
      * Introduced IsLocalBuffer function to check buffer scope.
      * Enhanced comments for clarity on memory access operations.
      e46653ac
  7. 30 Apr, 2025 2 commits
    • Lei Wang's avatar
      [Language] Support explicit programming for identified warp groups (#445) · 6972aed7
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      6972aed7
    • dependabot[bot]'s avatar
      Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b (#444) · 0fa03398
      dependabot[bot] authored
      Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.50.0.
      - [Release notes](https://github.com/huggingface/transformers/releases)
      - [Commits](https://github.com/huggingface/transformers/compare/v4.48.0...v4.50.0
      
      )
      
      ---
      updated-dependencies:
      - dependency-name: transformers
        dependency-version: 4.50.0
        dependency-type: direct:production
      ...
      Signed-off-by: default avatardependabot[bot] <support@github.com>
      Co-authored-by: default avatardependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      0fa03398
  8. 29 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Fix layout inference for free fragment buffer (#443) · 2ea45ae9
      Lei Wang authored
      * [Enhancement] Improve layout inference accuracy in ParallelOp (#441)
      
      * Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
      * Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
      
      * [Enhancement] Add error handling macros and refactor loop partitioning logic
      
      * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
      * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
      * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
      * Updated pass configuration management to streamline vectorization control in the optimization process.
      
      * lint fix
      
      * remove debug print
      2ea45ae9
  9. 28 Apr, 2025 1 commit
  10. 27 Apr, 2025 3 commits
  11. 26 Apr, 2025 3 commits
    • Lei Wang's avatar
      [Enhancement] Simplify vectorization process in loop_vectorize.cc and add... · 3c5190e0
      Lei Wang authored
      [Enhancement] Simplify vectorization process in loop_vectorize.cc and add atomic add test (#436) (#439)
      
      * Removed redundant simplification step in vectorization logic to streamline performance.
      * Introduced a new test for atomic addition in TileLang, validating functionality with a reference implementation using PyTorch.
      3c5190e0
    • yyttt6's avatar
      [Bugfix] fix the unexpected keyword error of autotune (#438) · bfb5b0a3
      yyttt6 authored
      * yes
      
      * [Bugfix] fix the unexpected keyword error of autotune
      
      * format
      
      * test
      bfb5b0a3
    • Lei Wang's avatar
      [Language] Support accumulative `T.reduce_sum` (#436) · 6c737768
      Lei Wang authored
      * [Enhancement] Update reduce operations to support clear option in sum and abs sum (#436)
      
      * Modified reduce_sum and reduce_absmax functions to include a clear parameter, allowing for accumulation on existing values.
      * Updated ReduceOp::Lower method to handle initialization and buffer duplication based on the clear flag for sum and abs sum operations.
      * Added new tests for reduce_sum and reduce_max with clear functionality to ensure correctness in various scenarios.
      * Enhanced documentation for reduce functions to clarify the behavior of the clear parameter.
      
      * lint fix
      
      * Update tensor type annotations in test_tilelang_transform_annotate_device_regions.py from Buffer to Tensor
      
      * Update tensor type in reduce sum tests from float16 to float32 for improved precision
      6c737768
  12. 25 Apr, 2025 3 commits
    • Lei Wang's avatar
      [Bugfix] Removed the behavior that treated global -> local as a copy operation. (#435) · 181267c7
      Lei Wang authored
      * [Enhancement] Improve error handling in layout inference and update profiler type in tests
      
      * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B.
      * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy.
      
      * lint fix
      
      * [Refactor] Update OperandTraits to include num_warp_n parameter
      
      * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations.
      * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios.
      
      * lint fix
      
      * [Refactor] Update DispatchInstruction templates to include N parameter
      
      * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations.
      * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
      
      * [Refactor] Simplify store buffer scope checks in pipeline planning
      
      * Removed redundant condition for 'local' scope in the store buffer checks, streamlining the logic for identifying global copy patterns.
      * Enhanced code clarity by reducing complexity in the conditional statements.
      181267c7
    • Lei Wang's avatar
      [Enhancement] Support cute mma tile mxn8ky (#434) · d1c15bc5
      Lei Wang authored
      * [Enhancement] Improve error handling in layout inference and update profiler type in tests
      
      * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B.
      * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy.
      
      * lint fix
      
      * [Refactor] Update OperandTraits to include num_warp_n parameter
      
      * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations.
      * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios.
      
      * lint fix
      
      * [Refactor] Update DispatchInstruction templates to include N parameter
      
      * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations.
      * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
      d1c15bc5
    • Lei Wang's avatar
      [Bugfix] Fix the test data distribution of cumsum (#432) · 3d206235
      Lei Wang authored
      * [Refactor] Adjust layout inference calculations in Gemm and ParallelOp
      
      * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference.
      * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience.
      * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code.
      
      * [Refactor] Clean up layout inference logic in Gemm and ParallelOp
      
      * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process.
      * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality.
      * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts.
      
      * lint fix
      
      * [Enhancement] Improve cumulative sum functionality and annotations handling
      
      * Updated the `cumsum` function to include detailed documentation and error handling for dimension bounds.
      * Modified the `run_cumsum` test to utilize a random tensor supply type for profiling, enhancing test robustness.
      * Added annotations to the fused loop in `loop_fusion_utils.h`, ensuring proper metadata is preserved during loop fusion.
      
      * lint fix
      3d206235
  13. 24 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Remove DeReplicate during parallel loop layout inference (#430) · bb1a5fd8
      Lei Wang authored
      * [Refactor] Adjust layout inference calculations in Gemm and ParallelOp
      
      * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference.
      * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience.
      * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code.
      
      * [Refactor] Clean up layout inference logic in Gemm and ParallelOp
      
      * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process.
      * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality.
      * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts.
      
      * lint fix
      bb1a5fd8
  14. 23 Apr, 2025 2 commits
    • Lei Wang's avatar
      [Layout] Enhance layout inference pass (#427) · 97d63fab
      Lei Wang authored
      * [Enhancement] Improve layout inference in Copy operation (#426)
      
      * Updated the Copy operation to infer layouts at multiple levels (kCommon, kStrict, kFree) for enhanced flexibility in layout optimization.
      * Added detailed documentation for layout inference levels in ParallelOp, clarifying their purposes and use cases.
      * Refactored layout inference logic to accommodate new levels, improving overall robustness and performance in parallel operations.
      
      * lint fix
      97d63fab
    • Lei Wang's avatar
      [Bugfix] Fix a bug for simplifier (#425) · afa74f4e
      Lei Wang authored
      * Update submodule 'tvm' to latest commit f4a8f9b
      
      * lint fix
      afa74f4e
  15. 22 Apr, 2025 7 commits
    • Lei Wang's avatar
    • Lei Wang's avatar
      [Language] Support tile operator `T.cumsum` (#423) · 88747fcd
      Lei Wang authored
      * [Feature] Implement CumSum operation in TileLang
      
      * Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic.
      * Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums.
      * Created tests for CumSum functionality in shared memory and fragment contexts.
      * Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang.
      * Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms.
      
      * lint fix
      88747fcd
    • Yu Cheng's avatar
      [Enhancement] Add TMA+WS support in pipeline planning logic (#422) · ae1e7399
      Yu Cheng authored
      * Introduced logic to check for TMA+WS enablement based on annotations in the pipeline planning stage.
      * Enhanced the handling of `order_anno` and `stage_anno` to determine if TMA+WS is activated, improving flexibility in loop processing.
      * Refactored the existing code to maintain clarity while integrating the new feature.
      ae1e7399
    • FeiyangChen's avatar
      [Bugfix] Fix a simplify issue in tvm (#421) · e6fb01e0
      FeiyangChen authored
      e6fb01e0
    • Yu Cheng's avatar
      [Refactor] Enhance layout inference logic in ParallelOp (#420) · bf27e641
      Yu Cheng authored
      * Updated the layout inference in ParallelOp to improve the selection of source buffers for layout accuracy.
      * Introduced logic to choose the read source buffer based on the number of indices, ensuring more precise layout inference.
      * Refactored the loop handling to maintain clarity and improve the overall robustness of the layout inference process.
      bf27e641
    • Zhao Wu's avatar
      [Enhancement] Support to find Cython path more automatically (#418) · 3c740a1b
      Zhao Wu authored
      
      
      * Support to find Cython path more automatically
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      3c740a1b
    • Lei Wang's avatar
      [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint (#417) · 73a6cb8b
      Lei Wang authored
      * [Enhancement] Introduce thread range management in layout and operation handling
      
      * Added `SetThreadRange` method to `FragmentNode` for managing thread ranges.
      * Updated `LayoutNode::Inverse` to provide more informative error messages.
      * Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management.
      * Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios.
      
      * lint fix
      
      * [Refactor] Improve thread variable handling in layout inference and operation lowering
      
      * Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds.
      * Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management.
      * Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase.
      
      * [Refactor] Update thread variable handling in layout inference and operation lowering
      
      * Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity.
      * Initialized thread_var with a default range to prevent undefined behavior.
      * Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.
      73a6cb8b