  1. 21 May, 2025 1 commit
    • [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization (#507) · 41d4988b
      Lei Wang authored
      
      * [Refactor] Update reduce functions to support default dimension values and improve dimension handling
      
      * Added a helper function `_legalize_dim` to handle negative dimension values in reduction operations.
      * Updated `reduce_max`, `reduce_min`, `reduce_sum`, `reduce_abssum`, and `reduce_absmax` functions to accept a default dimension value of -1, enhancing usability and flexibility in buffer reduction operations.
      * Ensured consistent dimension handling across all reduction functions for improved clarity and correctness.
      
      * Update submodule `tvm` to latest commit c2921fd, ensuring compatibility with recent changes.
      
      * [Refactor] Enhance ReduceOp and JITKernel for improved dimension handling and initialization
      
      * Updated ReduceOp to handle 1D reduction cases and ensure correct dimension checks, improving robustness in reduction operations.
      * Initialized prim_func in JITKernel to enhance clarity and prevent potential null reference issues.
      * Added whitespace for better code readability in reduce.py.
      41d4988b
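
The negative-dimension handling described in this commit can be sketched as follows. This is a minimal illustration, not the repository's code: the log does not show the signature of `_legalize_dim`, so the function name `legalize_dim` and its `(ndim, dim)` parameters are assumptions.

```python
def legalize_dim(ndim: int, dim: int = -1) -> int:
    """Map a possibly negative dimension index onto the range [0, ndim).

    Hypothetical stand-in for the `_legalize_dim` helper named in the log;
    the real signature is not shown there.
    """
    if not -ndim <= dim < ndim:
        raise ValueError(f"dim {dim} is out of range for a {ndim}-D buffer")
    # Negative indices count from the last dimension, as in NumPy or PyTorch.
    return dim + ndim if dim < 0 else dim


if __name__ == "__main__":
    print(legalize_dim(2))       # default -1 resolves to the last axis
    print(legalize_dim(3, -3))   # resolves to the first axis
```

With a default of -1, a reduction over a 2-D buffer reduces the last axis unless the caller says otherwise.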
  2. 20 May, 2025 4 commits
    • [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically (#505) · 84ddb9e1
      Lei Wang authored
      * [Refactor] Update GlobalMemChecker to use IRVisitorWithAnalyzer for improved analysis (#505)
      
      * Refactored GlobalMemChecker to inherit from IRVisitorWithAnalyzer, enhancing its capabilities for expression analysis.
      * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
      * Added additional lower bound condition checks to ensure comprehensive validation of memory access indices.
      
      * [Refactor] Update GlobalMemChecker to use StmtExprVisitor for improved memory access validation
      
      * Refactored GlobalMemChecker to inherit from StmtExprVisitor, enhancing its capabilities for expression analysis.
      * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
      * Ensured that the analyzer is passed correctly during instantiation, maintaining consistency in condition checks.
      84ddb9e1
    • [Refactor] Adjust GEMM fragment layout for improved clarity and performance (#504) · c59e1aab
      Lei Wang authored
      * Modified the layout creation in makeGemmFragmentB to enhance the order of operations, ensuring the Replicate method is called before Repeat for better readability and performance.
      * This change improves the logical flow of fragment creation, aligning with best practices for GEMM layout management.
      c59e1aab
    • [Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` (#502) · 8c8d8ca2
      Lei Wang authored
      * [Refactor] Rename `jit` class to `_JitImplementation` and improve debug path handling
      
      * Refactored the `jit` class to `_JitImplementation` for clarity and encapsulation.
      * Enhanced handling of `debug_root_path` to ensure it is correctly set as an absolute path when provided.
      * Updated the public `jit` function to serve as a decorator interface, allowing for both default and configured usage.
      * Added validation to ensure input tensors are contiguous in the Cython wrapper, improving error handling.
      
      * [Refactor] Improve formatting and handling in `_JitImplementation` and `jit` function
      
      * Refactored the `_JitImplementation` class to enhance readability by adjusting comment formatting and consolidating conditions for setting `debug_root_path`.
      * Updated the `jit` function signature for better alignment and clarity in parameter definitions.
      * Ensured consistent spacing and comments throughout the code for improved maintainability.
      
      * [Refactor] Update GEMM test parameters for performance optimization
      
      * Set num_stages to 0 and adjusted matrix dimensions in the GEMM test function to enhance performance and consistency across tests in test_tilelang_jit_gemm.py.
      * Reduced the number of threads used in the test to align with the updated configuration, improving overall test efficiency.
      
      * [Refactor] Enhance buffer error logging in layout inference
      
      * Updated the warning message in layout inference to provide clearer context when a buffer cannot be inferred due to its absence in the use list. This change improves the clarity of error reporting during layout inference operations.
      * Refactored tensor handling in the Cython wrapper to ensure input tensors are checked for contiguity before processing, enhancing error handling and robustness in tensor management.
      
      * bugfix
      8c8d8ca2
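
A decorator that works both bare and with arguments, as this commit describes for the public `jit` function, is commonly written as below. This is a sketch under stated assumptions: `jit_sketch` and its wrapper are hypothetical, no compilation happens, and only the `debug_root_path` name and its normalization to an absolute path are taken from the log.

```python
import functools
import os


def jit_sketch(func=None, *, debug_root_path=None):
    """Hypothetical dual-mode decorator: `@jit_sketch` or `@jit_sketch(...)`."""
    # Normalize the debug path to an absolute path when one is provided.
    if debug_root_path is not None:
        debug_root_path = os.path.abspath(debug_root_path)

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # A real implementation would compile and cache a kernel here.
            return fn(*args, **kwargs)

        wrapper.debug_root_path = debug_root_path
        return wrapper

    # Bare usage passes the function directly; configured usage passes None.
    return decorator(func) if func is not None else decorator


@jit_sketch
def add_one(x):
    return x + 1


@jit_sketch(debug_root_path="debug_out")
def double(x):
    return 2 * x
```

Making the configuration arguments keyword-only is what lets the decorator tell the two call styles apart.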
    • Zhiwen Mo's avatar
  3. 18 May, 2025 1 commit
    • [Refactor] refactor `tilelang.jit` to support a faster and more flexible kernel cache (#501) · 25a50f1a
      Lei Wang authored
      * [Refactor] Update JIT kernel functions and streamline GEMM tests
      
      * Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
      * Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
      * Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
      * Cleaned up commented-out code and unnecessary imports to improve overall code quality.
      
      * Update main function call in GEMM test to use tilelang testing framework
      
      * Update README and example scripts to include JIT decorator comments
      
      * Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
      * Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.
      
      * Update GEMM test parameters for improved performance
      
      * Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.
      25a50f1a
  4. 17 May, 2025 3 commits
    • [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility (#500) · 33937683
      Lei Wang authored
      * [Enhancement] Improve GEMM layout function and documentation
      
      * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
      * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
      * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.
      
      * lint fix
      
      * [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility
      
      * Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices.
      * Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations.
      * Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture.
      
      * [Refactor] Clean up formatting in GEMM implementation and CUDA templates
      
      * Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability.
      * Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.
      33937683
    • [Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      
      * Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
      2837878f
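
The signal-based timeout handling mentioned in the commit body above might look roughly like this. It is a sketch, not the autotuner's code: only the names `TimeoutException` and `run_with_timeout` come from the log, while the argument order, the use of `signal.setitimer`, and the `slow_task` helper are assumptions. The approach works only on Unix and only in the main thread.

```python
import signal
import time


class TimeoutException(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def run_with_timeout(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs); raise TimeoutException after `timeout` seconds."""

    def handler(signum, frame):
        raise TimeoutException(f"call exceeded {timeout} seconds")

    # Install the handler and remember the previous one so it can be restored.
    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        return func(*args, **kwargs)
    finally:
        # Always cancel the timer and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def slow_task():
    """Illustrative long-running task."""
    time.sleep(0.5)
    return "done"
```

Restoring the previous handler in `finally` keeps a timed-out trial from leaving the alarm armed for the next one.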
    • [Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      68a3c4f3
  5. 16 May, 2025 3 commits
    • [Bugfix] Fix Hopper GEMM layout for small tile size (#497) · c93e8695
      Lei Wang authored
      * [Enhancement] Improve GEMM layout function and documentation
      
      * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
      * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
      * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.
      
      * lint fix
      c93e8695
    • [Refactor] Update main function structure in example scripts and add tests (#475) · 73ae8087
      Yu Cheng authored
      * [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang
      
      * Refactor example_mla_decode.py to define a main function for better structure and clarity.
      * Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode.
      * Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing.
      * Ensure all new test files are integrated with tilelang testing framework.
      
      * [Test] Enhance test_example_mla_decode with argument mocking
      
      * Update test_example_mla_decode.py to mock sys.argv for better test isolation.
      * Ensure the main function of example_mla_decode is called with the correct arguments during testing.
      73ae8087
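
Mocking `sys.argv` so that an example's `main` function can be tested in isolation, as described above, can be done with the standard library's `unittest.mock`. The sketch below uses a hypothetical `main` with a made-up `--batch` flag. The real arguments of `example_mla_decode` are not listed in the log.

```python
import argparse
import sys
from unittest import mock


def main():
    """Hypothetical example entry point that reads its options from sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch", type=int, default=1)  # illustrative flag
    args = parser.parse_args()
    return args.batch


def test_main_with_mocked_argv():
    # Replace sys.argv only for the duration of the with-block.
    with mock.patch.object(sys, "argv", ["example.py", "--batch", "8"]):
        assert main() == 8
```

Because the patch is undone when the block exits, arguments passed to the test runner itself never leak into the example's parser.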
    • [Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      dca2fb48
  6. 14 May, 2025 1 commit
    • [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example (#494) · cde1886f
      Lei Wang authored
      
      * Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py
      
      * lint fix
      
      * lint fix
      
      * Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities
      
      * lint fix
      cde1886f
  7. 13 May, 2025 3 commits
    • [CI] Add Reminder Bot for pull request contributions (#491) · 31dbb471
      Wenhao Xie authored
      * [CI] Add Reminder Bot for pull request contributions
      
      * upd
      31dbb471
    • [CI] Add flash_decoding example to CI (#487) · 7b66fb19
      徐畅 authored
      * [CI] Add flash_decoding example to CI
      
      * Add output of ref latency
      
      * format example_gqa_decode.py
      7b66fb19
    • [Enhancement] Support register input for gemm when trans_a or trans_b is true (#490) · d4f096ef
      Lei Wang authored
      * [Refactor] Enhance makeGemmFragmentB to support transposition
      
      * Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
      * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
      * Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
      * Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support.
      
      * [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support
      
      * Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
      * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
      * Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
      * Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support.
      
      * Improve error messaging in layout equality checks
      
      * Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output.
      * This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.
      d4f096ef
  8. 12 May, 2025 2 commits
  9. 11 May, 2025 2 commits
  10. 10 May, 2025 7 commits
    • Wenhao Xie's avatar
    • [Refactor] Improve layout equality checks and error messaging (#471) · c2480907
      Lei Wang authored
      * [Refactor] Simplify buffer_region_to_tile_region function in copy.py
      
      * Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
      * Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.
      
      * [Refactor] Improve layout equality checks and error messaging
      
      * Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
      * Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
      * Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.
      
      * lintfix
      
      * [Refactor] Rename SetThreadRange to BindThreadRange for clarity
      
      * Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
      * Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
      * Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
      * Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.
      
      * [Refactor] Update BindThreadRange usage across layout inference methods
      
      * Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
      * Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
      * Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.
      
      * lint fix
      c2480907
    • [Refactor] Skip patchelf if not installed (#477) · 273be768
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * Enhance logging in setup.py and refactor TMA architecture checks in phase.py
      
      * Added logging configuration to setup.py, replacing print statements with logger for better traceability.
      * Updated download and extraction functions to use logger for status messages.
      * Refactored TMA architecture checks in phase.py to utilize the new `have_tma` function for improved clarity and maintainability.
      * Introduced support for additional compute capabilities in nvcc.py, including TMA support checks.
      
      * Update documentation for get_target_compute_version to reflect correct GPU compute capability range
      
      * Refactor have_tma function to accept tvm.target.Target instead of compute_version
      
      * Updated the `have_tma` function in nvcc.py to take a `target` parameter, improving clarity and usability.
      * Adjusted calls to `have_tma` in phase.py to pass the target directly, enhancing maintainability and consistency in TMA support checks.
      273be768
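
An architecture check in the spirit of `have_tma` could be sketched as below. This is an illustration only: the real function takes a `tvm.target.Target`, whereas this hypothetical `have_tma_for_arch` takes a plain `sm_XX` string. The threshold of compute capability 9.0 is an assumption based on the "sm90+" wording elsewhere in this log. The actual contents of `SUPPORTED_TMA_ARCHS` are not shown.

```python
import re

# Assumed minimum compute capability (major * 10 + minor) with TMA support.
MIN_TMA_COMPUTE_VERSION = 90


def have_tma_for_arch(arch: str) -> bool:
    """Return True if an `sm_XX` architecture string is assumed to support TMA."""
    match = re.fullmatch(r"sm_(\d+)a?", arch)
    if match is None:
        # Not a CUDA `sm_XX` string, so treat it as unsupported.
        return False
    return int(match.group(1)) >= MIN_TMA_COMPUTE_VERSION
```

Keeping the check in one helper means a new architecture needs only one change instead of edits to every call site.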
    • [CI] Add Analyzer and blocksparse_attention examples to CI (#472) · 8dec14e0
      yyttt6 authored
      * yes
      
      * [Bugfix] fix the unexpected keyword error of autotune
      
      * format
      
      * test
      
      * [CI] Add Analyzer and blocksparse_attention examples to CI
      
      * format
      
      * try
      
      * try
      
      * try
      
      * try
      
      * t
      
      * format
      
      * d
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      8dec14e0
    • [Refactor] set USE_LLVM to optional. (#476) · 66dba763
      Yuxuan Hu authored
      66dba763
    • [BugFix] Correct argparse for example_convolution test (#474) · 3f25bd1b
      Wenhao Xie authored
      * add convolution example to CI
      
      * lint fix
      
      * Update test_example_convolution.py
      
      * fix bug
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      3f25bd1b
    • [CI] Add Convolution example to CI (#473) · abe170a6
      Wenhao Xie authored
      * add convolution example to CI
      
      * lint fix
      
      * Update test_example_convolution.py
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      abe170a6
  11. 09 May, 2025 10 commits
    • [Refactor] Simplify buffer_region_to_tile_region function in copy.py (#470) · c5a989f5
      Lei Wang authored
      * Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
      * Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.
      c5a989f5
    • [Refactor] Update set_compile_args to allow None for out_idx parameter (#469) · 1f2f1554
      Lei Wang authored
      * Modified the `set_compile_args` method in `AutoTuner` to accept `None` as a valid input for the `out_idx` parameter, enhancing flexibility in argument handling.
      1f2f1554
    • [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI (#467) · 46eb4589
      Zhengju Tang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI.
      
      * Lint
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      46eb4589
    • [Typo] Rename `power_of_int` with `pow_of_int` for consistency (#468) · c99b7056
      Lei Wang authored
      * typo fix
      
      * Rename `power_of_int` to `pow_of_int` in math operations and update corresponding Python API reference. Adjusted registration attributes to reflect the new naming convention.
      c99b7056
    • [Feature] Implement fast integer power operation and related API (#466) · 1f5eb492
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * [Feature] Implement fast integer power operation and related API
      
      * Added a new math operation `tl.power_of_int` in `math.cc` for efficient integer exponentiation.
      * Introduced a corresponding Python API `pow_of_int` in `tir/op.py` to facilitate usage in TileLang.
      * Enhanced `common.h` with a template function for integer power calculations.
      * Updated documentation to reflect the new functionality and usage examples.
      1f5eb492
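
One standard way to compute integer powers efficiently is exponentiation by squaring, sketched below in Python. The log does not say which algorithm the C++ template in `common.h` uses, so treating it as this technique is an assumption, and the function name `int_pow_by_squaring` is hypothetical rather than the `pow_of_int` API itself.

```python
def int_pow_by_squaring(base: int, exponent: int) -> int:
    """Compute base ** exponent with O(log exponent) multiplications."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1
    while exponent > 0:
        # Multiply in the current power of `base` when the low bit is set.
        if exponent & 1:
            result *= base
        base *= base      # square for the next bit
        exponent >>= 1    # move on to the next bit
    return result
```

For a small compile-time-constant exponent, the same idea unrolls into a short chain of multiplications.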
    • [Bugfix] Fix copy region automation for dynamic extent (#465) · 2ffbd369
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * [Refactor] Improve buffer region validation in copy.py
      
      * Added handling for variable extents in buffer_region_to_tile_region function to enhance type checking and error handling.
      * Introduced debug print statements to trace values of region extents and temporary extents during validation.
      * Updated logic to account for variable extent counts when determining alignment of extents.
      
      * [Refactor] Remove debug print statements in buffer_region_to_tile_region function
      
      * Eliminated unnecessary print statements that were used for debugging temporary extents and region extents.
      * Streamlined the code for better readability while maintaining the existing functionality of buffer region validation.
      
      * [Refactor] Clean up whitespace in buffer_region_to_tile_region function
      
      * Removed an unnecessary blank line in the buffer_region_to_tile_region function to improve code readability and maintain consistency in formatting.
      2ffbd369
    • [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) · f41c467c
      Lei Wang authored
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      f41c467c
    • [Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4
      Lei Wang authored
      * [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
      
      * [Refactor] Update warp_specialized_rewriter with license change and code cleanup
      
      * Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
      * Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
      * Added `#include` for `common/collector.h` to support new functionality.
      * Updated file documentation to reflect the correct filename and purpose.
      * Improved overall code readability by removing unnecessary comments and sections.
      
      * [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py
      
      * Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
      * Enhanced documentation for new synchronization functions to clarify usage and parameters.
      * Updated buffer region validation in `copy.py` to ensure type checking for integer values, improving error handling for region extents.
      
      * lint fix
      
      * [Feature] Introduce TMA barrier injection and related utilities
      
      * Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
      * Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
      * Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
      * Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
      * Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.
      
      * lint fix
      d946d1d4
    • [CI] Add elementwise and gemv examples to CI. (#458) · dd7eb488
      Cunxiao Ni authored
      * [CI] Add elementwise and gemv examples to CI.
      
      * fix lint
      
      * test
      
      * fix gemv lint
      
      * fix lint
      dd7eb488
    • Rinne's avatar
      8d5e803e
  12. 08 May, 2025 2 commits
    • [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457) · b0122d74
      Lei Wang authored
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
      b0122d74
    • [Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9
      Lei Wang authored
      * Add example for warp specialization with flash attention
      
      * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
      * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
      * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
      * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
      
      * Add memory barrier functions to builtin.py
      
      * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
      * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
      * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
      * Improved code organization and readability by adding blank lines for better separation of logical sections.
      
      * Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
      
      * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
      * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
      
      * [Refactor] Update barrier functions and add new example for GEMM with warp specialization
      
      * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
      * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
      * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
      * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
      * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
      
      * lint fix
      
      * Update loop variable checks in SIMT loop and buffer region validation
      
      * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
      * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
      
      * lint fix
      
      * test fix
      a91bc2a9
  13. 07 May, 2025 1 commit