"...composable_kernel_rocm.git" did not exist on "a8efb3f09fb3ad22be36824ebd8824a8894ab088"
  1. 22 Jun, 2025 1 commit
    • [Enhancement] Improve memory access condition checks in GlobalMemChecker (#591) · 41ec2bc6
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
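The upper and lower bound checks described in this commit can be illustrated with a small interval-arithmetic sketch. This is not the `GlobalMemChecker` implementation: `Interval` and `needs_guard` are hypothetical names introduced here, and the real pass relies on the compiler's arithmetic analyzer rather than this simplified model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed integer interval [lo, hi] that an index expression may take."""
    lo: int
    hi: int

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, k: int) -> "Interval":
        # Multiplying by a negative constant swaps the bounds.
        a, b = self.lo * k, self.hi * k
        return Interval(min(a, b), max(a, b))


def needs_guard(index: Interval, extent: int) -> bool:
    """Return True unless both 0 <= index and index < extent are provable."""
    lower_ok = index.lo >= 0
    upper_ok = index.hi < extent
    return not (lower_ok and upper_ok)


# index = block * 128 + thread, with block in [0, 7] and thread in [0, 127]
index = Interval(0, 7).scale(128) + Interval(0, 127)
print(needs_guard(index, 1024))  # False: the access is provably in bounds
print(needs_guard(index, 1000))  # True: the upper bound cannot be proven
```

An access only needs a runtime guard when one of the two conditions cannot be proven from the known ranges of the loop and thread variables.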
  2. 20 Jun, 2025 2 commits
    • [Bugfix] fix missing node in ws role maker (#587) · c15d35e4
      botbw authored
    • [Enhancement] align shared memory allocations (#583) · fecc8336
      Lei Wang authored
      * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper
      
      - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
      - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
      - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
      
      * [Feature] Implement dynamic shared memory allocation alignment
      
      - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
      - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
      - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
      
      * [Enhancement] Update dtype and argument defaults in GEMM autotuning example
      
      - Changed data type from `float16` to `bfloat16`, trading mantissa precision for a wider dynamic range.
      - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process.
      
      * [Enhancement] Improve thread range computation in storage access
      
      - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
      - Updated `AccessEntry` structure to include `thread_range`.
      - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
      - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency.
      
      * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment
      
      - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
      - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
      - Added a blank line for improved readability in `thread_storage_sync.cc`.
      
      * [Refactor] Enhance storage access analysis and thread range computation
      
      - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
      - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
      - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
      
      * [Enhancement] Update default matrix dimensions in GEMM autotune example
      
      - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.
      
      * typo fix
      
      * enhancement
      
      * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync
      
      - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
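Aligning dynamic shared memory allocations to a byte boundary amounts to rounding each buffer offset up before placing the buffer. The sketch below shows only that arithmetic; `align_up` and `plan_offsets` are hypothetical helpers, not part of the `AlignDynamicSharedMemoryAllocations` pass, and the 128-byte alignment is an assumed example value.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def plan_offsets(sizes_in_bytes, alignment=16):
    """Place buffers back to back, aligning the start of each one."""
    offsets = []
    cursor = 0
    for size in sizes_in_bytes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    # The total is aligned as well so a following allocation starts aligned.
    return offsets, align_up(cursor, alignment)


offsets, total = plan_offsets([100, 36, 512], alignment=128)
print(offsets, total)  # [0, 128, 256] 768
```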
  3. 18 Jun, 2025 1 commit
    • [Enhancement] Update warp specialization checking (#580) · 6cede73d
      Lei Wang authored
      * Fix L2 cache size calculation to handle symbolic expressions and ensure float conversion of hit ratios in annotation
      
      * [Enhancement] Update warp specialization check in phase.py
      
      * lint fix
      
      * [Enhancement] Add ContainsSeqStmt method to improve statement handling in merge_shared_memory_allocations.cc
      
      * [Refactor] Simplify memory copy operations in GEMM kernel tests
      
      - Updated memory copy operations in `test_tilelang_kernel_gemm.py` to use shared memory allocations for both A and B matrices, improving clarity and performance.
      - Adjusted the main execution block to include a new `run_gemm_rs` function call for testing, enhancing the test structure.
      
      * revert memory reuse pass.
      
      * revert the memory reuse and thread sync pass.
      
      * Update test_tilelang_kernel_gemm.py
      
      * Update test_tilelang_kernel_mha_bwd.py
  4. 11 Jun, 2025 1 commit
    • [Feature] Introduce Persistent Loop and Update GEMM Example (#563) · e7b97be2
      Yu Cheng authored
      * [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation
      
      - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
      - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
      - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
      - Updated relevant documentation and comments to reflect the addition of new features and usage instructions.
      
      * [Example] Add MLA Decode With Persistent Threadblock Example
      
      * [Feature] Introduce Persistent Loop and Update GEMM Example
      
      - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
      - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
      - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
      - Updated relevant files to support the new features, including changes in code generation and language interface.
      
      * lint fix
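The idea behind a persistent loop is that a fixed number of thread blocks stay resident and each one walks over several output tiles. The following host-side simulation sketches one common assignment, a strided walk over tile indices. The function name and the strided order are assumptions for illustration; the commit does not state which tile order the new primitive uses.

```python
def persistent_tiles(block_id: int, num_blocks: int, num_tiles: int):
    """Yield the tile indices processed by one persistent block."""
    tile = block_id
    while tile < num_tiles:
        yield tile
        tile += num_blocks  # jump ahead by the number of resident blocks


num_blocks, num_tiles = 4, 10
schedule = {
    block: list(persistent_tiles(block, num_blocks, num_tiles))
    for block in range(num_blocks)
}
print(schedule)  # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```

Every tile is visited exactly once, and no block launches after the first wave, which is the property a persistent kernel is after.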
  5. 07 Jun, 2025 2 commits
    • [Feature] Support persistent kernels and add persistent GEMM examples (#559) · 225aca61
      Yu Cheng authored
      * [Enhancement] Fix multi-version buffer index in nested-loop
      
      * [Feature] Support persistent kernels and add persistent GEMM example
      
      * lint fix
      
      * lint fix
      
      * [CI] Remove test_tilelang_transform_annotate_device_regions.py
    • [Bugfix] Add tf32 casting to GEMM templates (#556) · 8cc8db52
      Lei Wang authored
      * Add tf32 casting functionality to GEMM templates
      
      - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates.
      - Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
      - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.
      
      * lint fix
      
      * Refactor tfloat32 casting logic in GEMM templates
      
      - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
      - Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types.
      - Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations.
      
      * Refactor GEMM template functions for improved readability
      
      - Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity.
      - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability.
      
      * Enhance tf32 casting logic in GEMM templates
      
      - Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness.
      - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability.
      
      * Refactor GEMM template functions and layout inference logic
      
      - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability.
      - Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations.
      - Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling.
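To make the float32 to tfloat32 conversion concrete, here is a bit-level sketch in Python. TF32 keeps the float32 sign and 8-bit exponent but only 10 explicit mantissa bits, so the low 13 bits are dropped. The rounding rule used below (add half of the dropped range, then mask) is an assumption, and the function is a stand-in rather than the `cast_float_to_tf32` code from the templates. Only the "skip non-finite inputs" behaviour is taken from the commit message.

```python
import math
import struct


def cast_float_to_tf32_sketch(x: float) -> float:
    """Round a float32 value to 10 explicit mantissa bits.

    Non-finite inputs (inf, nan) are returned unchanged.
    """
    if not math.isfinite(x):
        return x
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the discarded range, then clear the low 13 mantissa bits.
    # Values very close to the float32 maximum may round up to infinity.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(cast_float_to_tf32_sketch(1.0 + 2.0**-12))  # 1.0
print(cast_float_to_tf32_sketch(1.0 + 2.0**-11))  # 1.0009765625
```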
  6. 04 Jun, 2025 2 commits
    • [Bugfix] Enhance layout inference pass for flexibility (#550) · 444b7c4e
      Lei Wang authored
      * Enhance Layout
      
      * strict update
      
      * lint fix
      
      * Refactor layout inference by removing unnecessary logging statements in `parallel.cc` and `layout_inference.cc`. This cleanup enhances code readability and reduces log clutter during layout inference steps.
      
      * lint fix
      
      * Refactor file copying logic in setup.py to simplify directory creation and file copying process. Removed unnecessary existence check before copying source files to the target directory.
    • [Refactor] Include several examples into ci (#531) · 3ca3a8af
      Lei Wang authored
      * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
      
      * lint fix
      
      * fix split k example
      
      * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
      
      * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
  7. 29 May, 2025 1 commit
    • [Language] Support `T.annotate_l2_hit_ratio` via `cudaStreamSetAttribute` (#539) · a65f481e
      Lei Wang authored
      * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code
      
      * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
      * Cleaned up unnecessary whitespace in the function for improved readability.
      * Enhanced the overall clarity and maintainability of the code.
      
      * Refactor AllocateNode handling in vectorize_loop.cc
      
      * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
      * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
      * Improved overall readability by eliminating unnecessary comments and code related to extent handling.
      
      * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.
      
      * Add buffer allocation planning step in OptimizeForTarget function
      
      * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
      * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.
      
      * Update submodule TVM to latest commit db50d4e, ensuring alignment with upstream changes.
      
      * Add L2 persistent annotation support and related functionality
      
      * Introduced a new file `lower_l2_persistent_annotation.cc` to handle the lowering of L2 persistent annotations.
      * Added functions to annotate L2 hit ratios for buffers, ensuring compatibility with global buffer requirements.
      * Updated the `LowerAndLegalize` function to include the new L2 persistent map lowering step.
      * Enhanced CUDA driver with a function to retrieve the maximum size of the persisting L2 cache.
      * Modified the `TLCUDASourceWrapper` class to integrate L2 persistent map handling during kernel launches.
      
      These changes improve the framework's ability to manage L2 cache optimizations, enhancing performance for CUDA applications.
      
      * lint fix
  8. 28 May, 2025 1 commit
    • [Refactor] Disable legacy vectorization for buffer allocation (#535) · e71c7a17
      Lei Wang authored
      * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code
      
      * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
      * Cleaned up unnecessary whitespace in the function for improved readability.
      * Enhanced the overall clarity and maintainability of the code.
      
      * Refactor AllocateNode handling in vectorize_loop.cc
      
      * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
      * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
      * Improved overall readability by eliminating unnecessary comments and code related to extent handling.
      
      * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.
      
      * Add buffer allocation planning step in OptimizeForTarget function
      
      * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
      * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.
  9. 27 May, 2025 1 commit
    • [Enhancement] Add warp specialization attribute handling in IR and rewriter (#518) · 41bc15cb
      Yu Cheng authored
      * Introduced an `AttrFrame` for warp specialization in the IR, enhancing the handling of warp-specific optimizations.
      * Refactored the `VisitStmt_` method in `warp_specialized_rewriter.cc` to check for the new warp specialization attribute, improving the detection of warp specialization conditions.
      * Removed outdated code related to condition checks in `IfThenElseNode`, streamlining the specialization logic.
  10. 26 May, 2025 1 commit
    • [Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0
      Lei Wang authored
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
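Appending a commit ID to a version string might look roughly like the sketch below. The helper names `read_git_commit_id` and `format_version`, the `+<hash>` suffix, and the 8-character truncation are all assumptions made for illustration; the commit does not spell out the exact format written to the VERSION file.

```python
import subprocess
from typing import Optional


def read_git_commit_id(repo_dir: str = ".") -> Optional[str]:
    """Return the current commit hash, or None outside a usable git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory is invalid, or this is not a repository.
        return None
    return result.stdout.strip() or None


def format_version(base: str, commit_id: Optional[str]) -> str:
    """Append a shortened commit ID as a local version suffix when available."""
    return f"{base}+{commit_id[:8]}" if commit_id else base


print(format_version("1.2.3", read_git_commit_id()))
```

Falling back to the bare version keeps builds from source archives, which have no `.git` directory, working.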
  11. 25 May, 2025 1 commit
    • [Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
  12. 24 May, 2025 1 commit
    • [Refactor] Support auto index bitwidth casting (#517) · 6ad73f6f
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
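The auto-check behaviour described for the index bitwidth can be sketched as a simple decision: use the configured width if one is given, otherwise widen to 64 bits only when a flattened index could overflow a signed 32-bit integer. The overflow criterion and the `pick_index_bits` helper are assumptions; the commit only says that integer types are upgraded to 64 bits "when necessary".

```python
import math

INT32_MAX = 2**31 - 1


def pick_index_bits(shape, configured_bits=None):
    """Choose an index width for a buffer of the given shape."""
    if configured_bits is not None:
        # An explicit configuration always wins over the automatic check.
        return configured_bits
    max_flat_index = math.prod(shape) - 1
    return 64 if max_flat_index > INT32_MAX else 32


print(pick_index_bits((4096, 4096)))    # 32
print(pick_index_bits((65536, 65536)))  # 64
```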
  13. 23 May, 2025 1 commit
    • [Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management (#508) · 0fdefe2b
      Lei Wang authored
      
      * Introduced a new StmtAttr structure to track the scope level of statements, enhancing the liveness analysis process.
      * Updated the UpdateStmtAttr function to manage statement attributes effectively during memory allocation visits.
      * Modified the VisitStmt_ methods to utilize the new scope level tracking, ensuring accurate memory access patterns.
      * Refactored the LivenessAnalysis and PlanMemory functions to incorporate statement attributes, improving the handling of gen and kill points in memory management.
      * Added a new helper function allow_warp_specialized in phase.py to conditionally enable warp specialization based on pass context and target, addressing potential bugs in the MergeSharedMemoryAllocations pass.
      * Enhanced the OptimizeForTarget function to conditionally apply the MergeSharedMemoryAllocations pass based on warp specialization settings, improving robustness in memory allocation strategies.
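The gen and kill points mentioned above decide when two shared buffers may occupy the same memory. A toy version of that liveness analysis, with statements modelled as sets of buffer names, is sketched here. The data model and function names are hypothetical, and the scope-level tracking that this commit adds is deliberately left out.

```python
def liveness(accesses):
    """Map each buffer to (gen, kill): first and last statement touching it."""
    spans = {}
    for stmt_index, buffers in enumerate(accesses):
        for name in buffers:
            gen, _ = spans.get(name, (stmt_index, stmt_index))
            spans[name] = (gen, stmt_index)
    return spans


def peak_live_bytes(accesses, sizes):
    """Largest number of bytes that are live at any single statement."""
    spans = liveness(accesses)
    peak = 0
    for stmt_index in range(len(accesses)):
        live = sum(
            sizes[name]
            for name, (gen, kill) in spans.items()
            if gen <= stmt_index <= kill
        )
        peak = max(peak, live)
    return peak


accesses = [{"A"}, {"A", "B"}, {"B"}, {"C"}, {"C"}]
sizes = {"A": 1024, "B": 2048, "C": 4096}
print(liveness(accesses))                # A: (0, 1), B: (1, 2), C: (3, 4)
print(peak_live_bytes(accesses, sizes))  # 4096, versus 7168 without reuse
```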
  14. 22 May, 2025 2 commits
    • [Enhancement] Introduce padding annotation and improve memory access validation (#511) · f23c4d30
      Lei Wang authored
      * Added a new attribute `kPaddingMap` in `builtin.h` for managing padding annotations.
      * Enhanced `SafeMemorysRewriter` to utilize an annotated padding map for buffer stores, improving memory access safety.
      * Implemented checks in `layout_inference.cc` to ensure buffers are correctly referenced during layout mapping.
      * Introduced a new test file for validating the padding annotation functionality in TileLang.
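A padding annotation gives out-of-bounds accesses a defined value instead of leaving them undefined. The sketch below models that idea for loads on plain Python lists; `safe_load` and the dictionary-based `padding_map` are hypothetical stand-ins and do not reflect how `kPaddingMap` is stored or consumed in the compiler.

```python
def safe_load(buffers, padding_map, name, index):
    """Load buffers[name][index], falling back to the annotated pad value."""
    data = buffers[name]
    if 0 <= index < len(data):
        return data[index]
    # Buffers without an annotation are padded with zero in this sketch.
    return padding_map.get(name, 0)


buffers = {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0]}
padding_map = {"A": float("-inf")}  # e.g. a neutral element for a max-reduction

print(safe_load(buffers, padding_map, "A", 1))   # 2.0
print(safe_load(buffers, padding_map, "A", 7))   # -inf
print(safe_load(buffers, padding_map, "B", -1))  # 0
```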
    • [Refactor] Update buffer handling in layout transformation functions (#509) · 094796b6
      Lei Wang authored
      * Modified `makeBufferWithLayout` to include a `var_remap` parameter for improved variable remapping during buffer creation.
      * Enhanced buffer load and store operations to utilize the new variable remapping logic, ensuring correct buffer references.
      * Commented out a check in `ThreadExtent` for clarity, maintaining functionality while improving code readability.
  15. 20 May, 2025 2 commits
    • [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically (#505) · 84ddb9e1
      Lei Wang authored
      * [Refactor] Update GlobalMemChecker to use IRVisitorWithAnalyzer for improved analysis (#505)
      
      * Refactored GlobalMemChecker to inherit from IRVisitorWithAnalyzer, enhancing its capabilities for expression analysis.
      * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
      * Added additional lower bound condition checks to ensure comprehensive validation of memory access indices.
      
      * [Refactor] Update GlobalMemChecker to use StmtExprVisitor for improved memory access validation
      
      * Refactored GlobalMemChecker to inherit from StmtExprVisitor, enhancing its capabilities for expression analysis.
      * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
      * Ensured that the analyzer is passed correctly during instantiation, maintaining consistency in condition checks.
    • [Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` (#502) · 8c8d8ca2
      Lei Wang authored
      * [Refactor] Rename `jit` class to `_JitImplementation` and improve debug path handling
      
      * Refactored the `jit` class to `_JitImplementation` for clarity and encapsulation.
      * Enhanced handling of `debug_root_path` to ensure it is correctly set as an absolute path when provided.
      * Updated the public `jit` function to serve as a decorator interface, allowing for both default and configured usage.
      * Added validation to ensure input tensors are contiguous in the Cython wrapper, improving error handling.
      
      * [Refactor] Improve formatting and handling in `_JitImplementation` and `jit` function
      
      * Refactored the `_JitImplementation` class to enhance readability by adjusting comment formatting and consolidating conditions for setting `debug_root_path`.
      * Updated the `jit` function signature for better alignment and clarity in parameter definitions.
      * Ensured consistent spacing and comments throughout the code for improved maintainability.
      
      * [Refactor] Update GEMM test parameters for performance optimization
      
      * Set num_stages to 0 and adjusted matrix dimensions in the GEMM test function to enhance performance and consistency across tests in test_tilelang_jit_gemm.py.
      * Reduced the number of threads used in the test to align with the updated configuration, improving overall test efficiency.
      
      * [Refactor] Enhance buffer error logging in layout inference
      
      * Updated the warning message in layout inference to provide clearer context when a buffer cannot be inferred due to its absence in the use list. This change improves the clarity of error reporting during layout inference operations.
      * Refactored tensor handling in the Cython wrapper to ensure input tensors are checked for contiguity before processing, enhancing error handling and robustness in tensor management.
      
      * bugfix
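A decorator that works both bare and with arguments, backed by an implementation class, can be written with a standard Python pattern. The sketch below shows only the pattern: `_JitSketch` is a stand-in for `_JitImplementation`, the wrapper does no compilation, and the `out_idx` parameter is a placeholder. Normalising `debug_root_path` to an absolute path follows the commit description.

```python
import functools
import os


class _JitSketch:
    """Holds the decorator configuration; stands in for the real implementation."""

    def __init__(self, out_idx=None, debug_root_path=None):
        self.out_idx = out_idx
        # Store the debug path as an absolute path when one is provided.
        self.debug_root_path = (
            os.path.abspath(debug_root_path) if debug_root_path else None
        )

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        wrapper.jit_config = {
            "out_idx": self.out_idx,
            "debug_root_path": self.debug_root_path,
        }
        return wrapper


def jit(func=None, *, out_idx=None, debug_root_path=None):
    """Usable as `@jit` or as `@jit(out_idx=..., debug_root_path=...)`."""
    impl = _JitSketch(out_idx=out_idx, debug_root_path=debug_root_path)
    if callable(func):
        return impl(func)  # bare form: the function arrives directly
    return impl  # configured form: return the actual decorator


@jit
def add(a, b):
    return a + b


@jit(out_idx=-1, debug_root_path="debug")
def mul(a, b):
    return a * b


print(add(1, 2), mul(3, 4))  # 3 12
```

Keyword-only configuration parameters keep the two call forms unambiguous, because the only positional argument is ever the decorated function.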
  16. 16 May, 2025 1 commit
    • [Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
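A signal-based timeout handler of the kind described for the autotuner could be sketched as follows. The names `TimeoutException` and `run_with_timeout` come from the commit message, but the body is an assumption, not the project's code. It relies on `SIGALRM`, so it works only on Unix-like systems and only in the main thread.

```python
import signal
import time


class TimeoutException(Exception):
    """Raised when the wrapped call runs longer than allowed."""


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Call func(*args, **kwargs), raising TimeoutException after timeout_s seconds."""

    def _on_alarm(signum, frame):
        raise TimeoutException(f"call exceeded {timeout_s} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        return func(*args, **kwargs)
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


try:
    run_with_timeout(time.sleep, 0.05, 1.0)
except TimeoutException as exc:
    print("timed out:", exc)
```

Restoring the previous handler in `finally` matters when timeouts are nested or when the host application installs its own `SIGALRM` handler.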
  17. 13 May, 2025 1 commit
    • [Enhancement] Support register input for gemm when trans_a or trans_b is true (#490) · d4f096ef
      Lei Wang authored
      * [Refactor] Enhance makeGemmFragmentB to support transposition
      
      * Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
      * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
      * Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
      * Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support.
      
      * [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support
      
      * Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
      * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
      * Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
      * Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support.
      
      * Improve error messaging in layout equality checks
      
      * Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output.
      * This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.
  18. 10 May, 2025 1 commit
    • [Refactor] Improve layout equality checks and error messaging (#471) · c2480907
      Lei Wang authored
      * [Refactor] Simplify buffer_region_to_tile_region function in copy.py
      
      * Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
      * Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.
      
      * [Refactor] Improve layout equality checks and error messaging
      
      * Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
      * Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
      * Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.
      
      * lint fix
      
      * [Refactor] Rename SetThreadRange to BindThreadRange for clarity
      
      * Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
      * Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
      * Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
      * Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.
      
      * [Refactor] Update BindThreadRange usage across layout inference methods
      
      * Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
      * Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
      * Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.
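
The "create a new object instance" behaviour described above can be pictured with a small Python analogy. This is only a sketch: the `Fragment` class, its fields, and `bind_thread_range` are hypothetical stand-ins, not the C++ `FragmentNode` API, and the `(min, extent)` tuple is an assumed representation of a thread range.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical, immutable stand-in for a fragment layout."""
    name: str
    thread_range: Optional[Tuple[int, int]] = None  # assumed (min, extent)

    def bind_thread_range(self, thread_range: Tuple[int, int]) -> "Fragment":
        # Return a copy with the range bound; the original stays untouched.
        return replace(self, thread_range=thread_range)


base = Fragment("C_local")
bound = base.bind_thread_range((0, 128))
print(base.thread_range, bound.thread_range)
```

Because binding returns a fresh object, callers have to use the returned value, which is consistent with the note above about adjusting return statements in the layout inference functions.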
      
      * lint fix
      c2480907
  19. 09 May, 2025 2 commits
    • [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) · f41c467c
      Lei Wang authored
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      f41c467c
    • [Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4
      Lei Wang authored
      * [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
      
      * [Refactor] Update warp_specialized_rewriter with license change and code cleanup
      
      * Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
      * Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
      * Added `#include` for `common/collector.h` to support new functionality.
      * Updated file documentation to reflect the correct filename and purpose.
      * Improved overall code readability by removing unnecessary comments and sections.
      
      * [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py
      
      * Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
      * Enhanced documentation for new synchronization functions to clarify usage and parameters.
      * Updated buffer region validation in `copy.py` to ensure type checking for integer values, improving error handling for region extents.
      
      * lint fix
      
      * [Feature] Introduce TMA barrier injection and related utilities
      
      * Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
      * Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
      * Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
      * Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
      * Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.
      
      * lint fix
      d946d1d4
  20. 08 May, 2025 1 commit
    • [Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9
      Lei Wang authored
      * Add example for warp specialization with flash attention
      
      * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
      * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
      * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
      * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
      
      * Add memory barrier functions to builtin.py
      
      * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
      * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
      * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
      * Improved code organization and readability by adding blank lines for better separation of logical sections.
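
To make the parity values 0 and 1 concrete, here is a minimal sketch of how a stage index and parity bit are commonly derived in a multi-stage pipeline. The indexing scheme and the helper name `stage_and_parity` are assumptions for illustration and are not taken from this commit; the sketch is plain Python and does not call any TileLang function.

```python
def stage_and_parity(k: int, num_stages: int):
    """Map pipeline iteration k to (stage slot, parity bit).

    The slot cycles through the stages, and the parity flips each time
    the same slot is reused.
    """
    return k % num_stages, (k // num_stages) % 2


# Six iterations over a two-stage pipeline.
schedule = [stage_and_parity(k, 2) for k in range(6)]
print(schedule)
```

With two stages, iterations 0 and 1 use parity 0, iterations 2 and 3 use parity 1, and the pattern then repeats, so a single bit is enough to tell consecutive uses of the same barrier apart.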
      
      * Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
      
      * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
      * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
      
      * [Refactor] Update barrier functions and add new example for GEMM with warp specialization
      
      * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
      * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
      * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
      * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
      * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
      
      * lint fix
      
      * Update loop variable checks in SIMT loop and buffer region validation
      
      * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
      * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
      
      * lint fix
      
      * test fix
      a91bc2a9
  21. 06 May, 2025 1 commit
    • [Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      
      * [Feature] Add new examples for warp specialization and TMA integration
      
      * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
      * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
      * Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
      * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.
      
      * lint fix
      
      * Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
      b5faf25a
  22. 03 May, 2025 1 commit
    • [Refactor] Separate warp specialize rewriter and tma barrier injector pass (#447) · fce16b00
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      fce16b00
  23. 01 May, 2025 1 commit
    • [Bugfix] Fix safe memory legalization for fragment store (#446) · e46653ac
      Lei Wang authored
      * [Enhancement] Improve layout inference accuracy in ParallelOp (#441)
      
      * Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
      * Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
      
      * [Enhancement] Add error handling macros and refactor loop partitioning logic
      
      * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
      * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
      * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
      * Updated pass configuration management to streamline vectorization control in the optimization process.
      
      * lint fix
      
      * remove debug print
      
      * [Refactor] Update legalize_safe_memory_access.cc to improve memory access handling
      
      * Replaced Apache License header with MIT License.
      * Added logic to handle local buffer conditions in memory access.
      * Introduced IsLocalBuffer function to check buffer scope.
      * Enhanced comments for clarity on memory access operations.
      e46653ac
  24. 30 Apr, 2025 1 commit
    • [Language] Support explicit programming for identified warp groups (#445) · 6972aed7
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      6972aed7
  25. 29 Apr, 2025 1 commit
    • [Bugfix] Fix layout inference for free fragment buffer (#443) · 2ea45ae9
      Lei Wang authored
      * [Enhancement] Improve layout inference accuracy in ParallelOp (#441)
      
      * Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
      * Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
      
      * [Enhancement] Add error handling macros and refactor loop partitioning logic
      
      * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
      * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
      * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
      * Updated pass configuration management to streamline vectorization control in the optimization process.
      
      * lint fix
      
      * remove debug print
      2ea45ae9
  26. 26 Apr, 2025 2 commits
    • [Enhancement] Simplify vectorization process in loop_vectorize.cc and add atomic add test (#436) (#439) · 3c5190e0
      Lei Wang authored
      * Removed redundant simplification step in vectorization logic to streamline performance.
      * Introduced a new test for atomic addition in TileLang, validating functionality with a reference implementation using PyTorch.
      3c5190e0
    • [Language] Support accumulative `T.reduce_sum` (#436) · 6c737768
      Lei Wang authored
      * [Enhancement] Update reduce operations to support clear option in sum and abs sum (#436)
      
      * Modified reduce_sum and reduce_absmax functions to include a clear parameter, allowing for accumulation on existing values.
      * Updated ReduceOp::Lower method to handle initialization and buffer duplication based on the clear flag for sum and abs sum operations.
      * Added new tests for reduce_sum and reduce_max with clear functionality to ensure correctness in various scenarios.
      * Enhanced documentation for reduce functions to clarify the behavior of the clear parameter.
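
The effect of a `clear` flag on a sum reduction can be sketched in plain Python. This is an illustration of the semantics described above, not the TileLang implementation: `reduce_sum_rows` is a hypothetical helper working on nested lists, and it assumes that `clear=True` overwrites the output while `clear=False` accumulates onto it.

```python
def reduce_sum_rows(src, out, clear=True):
    """Row-wise sum of a 2D list into `out` (hypothetical helper)."""
    for i, row in enumerate(src):
        total = sum(row)
        if clear:
            out[i] = total   # discard whatever was stored in out[i]
        else:
            out[i] += total  # accumulate onto the existing value
    return out


src = [[1.0, 2.0], [3.0, 4.0]]

acc = [10.0, 20.0]
reduce_sum_rows(src, acc, clear=False)    # acc becomes [13.0, 27.0]

fresh = [10.0, 20.0]
reduce_sum_rows(src, fresh, clear=True)   # fresh becomes [3.0, 7.0]
print(acc, fresh)
```

Accumulating with `clear=False` lets a caller fold several partial reductions into one output buffer without a separate add step.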
      
      * lint fix
      
      * Update tensor type annotations in test_tilelang_transform_annotate_device_regions.py from Buffer to Tensor
      
      * Update tensor type in reduce sum tests from float16 to float32 for improved precision
      6c737768
  27. 25 Apr, 2025 2 commits
    • [Bugfix] Removed the behavior that treated global -> local as a copy operation. (#435) · 181267c7
      Lei Wang authored
      * [Enhancement] Improve error handling in layout inference and update profiler type in tests
      
      * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B.
      * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy.
      
      * lint fix
      
      * [Refactor] Update OperandTraits to include num_warp_n parameter
      
      * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations.
      * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios.
      
      * lint fix
      
      * [Refactor] Update DispatchInstruction templates to include N parameter
      
      * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations.
      * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
      
      * [Refactor] Simplify store buffer scope checks in pipeline planning
      
      * Removed redundant condition for 'local' scope in the store buffer checks, streamlining the logic for identifying global copy patterns.
      * Enhanced code clarity by reducing complexity in the conditional statements.
      181267c7
    • [Bugfix] Fix the test data distribution of cumsum (#432) · 3d206235
      Lei Wang authored
      * [Refactor] Adjust layout inference calculations in Gemm and ParallelOp
      
      * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference.
      * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience.
      * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code.
      
      * [Refactor] Clean up layout inference logic in Gemm and ParallelOp
      
      * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process.
      * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality.
      * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts.
      
      * lint fix
      
      * [Enhancement] Improve cumulative sum functionality and annotations handling
      
      * Updated the `cumsum` function to include detailed documentation and error handling for dimension bounds.
      * Modified the `run_cumsum` test to utilize a random tensor supply type for profiling, enhancing test robustness.
      * Added annotations to the fused loop in `loop_fusion_utils.h`, ensuring proper metadata is preserved during loop fusion.
      
      * lint fix
      3d206235
  28. 22 Apr, 2025 2 commits
    • [Enhancement] Add TMA+WS support in pipeline planning logic (#422) · ae1e7399
      Yu Cheng authored
      * Introduced logic to check for TMA+WS enablement based on annotations in the pipeline planning stage.
      * Enhanced the handling of `order_anno` and `stage_anno` to determine if TMA+WS is activated, improving flexibility in loop processing.
      * Refactored the existing code to maintain clarity while integrating the new feature.
      ae1e7399
    • [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint (#417) · 73a6cb8b
      Lei Wang authored
      * [Enhancement] Introduce thread range management in layout and operation handling
      
      - Added `SetThreadRange` method to `FragmentNode` for managing thread ranges.
      - Updated `LayoutNode::Inverse` to provide more informative error messages.
      - Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management.
      - Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios.
      
      * lint fix
      
      * [Refactor] Improve thread variable handling in layout inference and operation lowering
      
      - Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds.
      - Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management.
      - Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase.
      
      * [Refactor] Update thread variable handling in layout inference and operation lowering
      
      - Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity.
      - Initialized thread_var with a default range to prevent undefined behavior.
      - Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.
  29. 19 Apr, 2025 1 commit
    • [Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier (#408) · e8c2e794
      Lei Wang authored
      * Update TVM submodule and enhance vectorization logic in loop_vectorize.cc
      
      - Updated the TVM submodule to the latest commit.
      - Simplified the vectorized expression after vectorization to improve expression handling.
      - Added checks in loop_fusion_utils.h to prevent fusion of loops with non-power-of-2 extents, enhancing robustness in loop transformations.
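
A power-of-2 extent check of the kind described above can be sketched with the usual bit trick. The actual check lives in C++ in `loop_fusion_utils.h`; the Python below only illustrates the idea, and `is_power_of_two` and `can_fuse` are hypothetical names.

```python
def is_power_of_two(extent: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and everything else."""
    # A power of two has exactly one bit set, so clearing the lowest set bit
    # with extent & (extent - 1) leaves zero.
    return extent > 0 and (extent & (extent - 1)) == 0


def can_fuse(extents) -> bool:
    """Allow fusion only when every loop extent is a power of two."""
    return all(is_power_of_two(extent) for extent in extents)
```

For example, `can_fuse([4, 8])` allows fusion, while `can_fuse([4, 6])` rejects it because 6 is not a power of two.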
      
      * lint fix
  30. 17 Apr, 2025 2 commits
    • [CI] Update CI configuration to run pytest with automatic parallelization (#393) · 6d3d4743
      Lei Wang authored
      * Update CI configuration to run pytest with automatic parallelization using the '-n auto' option.
      
      * Enhance Cython JIT Adapter Compilation Logic
      
      - Improved the locking mechanism during the compilation of the Cython JIT adapter to prevent race conditions.
      - Added checks to determine if another process has already compiled the library, reducing unnecessary recompilation.
      - Cleaned up the code by removing redundant imports and ensuring proper handling of temporary files during compilation failures.
      - Updated vectorization logic in loop_vectorize.cc to allow optional simplification of vectorized expressions.
      
      This update enhances performance and reliability in the JIT compilation process.
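
The pattern described here (take a lock, check whether another process already built the library, and clean up temporary files on failure) can be sketched as follows. This is a minimal illustration and not the adapter's actual code: `build_once`, the `.lock` and `.tmp` suffixes, and the `compile_fn` callback are all assumptions, and `fcntl` limits the sketch to POSIX systems.

```python
import contextlib
import fcntl
import os
import tempfile
from pathlib import Path


def build_once(lib_path: Path, compile_fn) -> bool:
    """Build lib_path at most once across processes.

    Returns True if this call compiled the library, False if it already existed.
    """
    lock_path = lib_path.with_suffix(lib_path.suffix + ".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until we own the lock
        try:
            # Re-check under the lock: another process may have finished first.
            if lib_path.exists():
                return False
            fd, tmp_name = tempfile.mkstemp(dir=lib_path.parent, suffix=".tmp")
            os.close(fd)
            try:
                compile_fn(Path(tmp_name))      # write the artifact to a temp file
                os.replace(tmp_name, lib_path)  # publish it atomically
            except BaseException:
                # Do not leave a half-written temp file behind on failure.
                with contextlib.suppress(FileNotFoundError):
                    os.unlink(tmp_name)
                raise
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Writing to a temporary file and renaming it means readers never see a partially written library, which is why the existence check under the lock is sufficient.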
      
      * lint fix
      
      * Update CI configuration to run pytest with 4 parallel jobs instead of auto-detection
      
      * Add pytest markers for serial execution in MHA tests
      
      - Added @pytest.mark.serial to multiple MHA test functions to ensure they run sequentially.
      - This change improves test reliability by preventing potential race conditions during execution.
      
      * Update TVM submodule and enhance vectorization logic in loop_vectorize.cc
      
      - Updated the TVM submodule to the latest commit.
      - Made simplification of vectorized expressions optional and added a check that the vectorized variable is actually used, improving performance and reliability in expression handling.
      
      * Remove @pytest.mark.serial from multiple MHA test functions to allow parallel execution. This change enhances test performance by enabling concurrent test runs while maintaining reliability.
      
      * Remove tvm_simplify_test.py file, eliminating the test for expression simplification in TVM. This cleanup helps streamline the codebase by removing unused test cases.
      
      * Remove unused pytest import from test_tilelang_kernel_mha.py to streamline the test file.
      
      * lint fix
      
      * Update TVM submodule and refine vectorization logic in loop_vectorize.cc
      
      - Updated the TVM submodule to the latest commit.
      - Adjusted the return statements in loop_vectorize.cc to improve expression handling and ensure consistency in the visitor pattern.
      
      * Refactor vectorization logic in loop_vectorize.cc
      
      - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling.
      - This change enhances the clarity and efficiency of the vectorization process.
      
      * Enhance vectorization checks in loop_vectorize.cc
      
      - Added a check to ensure the vectorized expression uses the vectorized variable, improving the robustness of the vectorization logic.
      - This change refines the expression handling and ensures that only valid vectorized expressions are processed.
      
      * Implement non-local buffer checks for loop vectorization in layout_inference.cc
      
      - Added logic to check for non-local buffer loads and stores before applying vectorization to loops. This enhancement ensures that vectorization is only applied when appropriate, improving the correctness of the loop transformations.
      
      * Refactor buffer handling in pipeline planning and layout inference
      
      - Renamed GlobalCopyPatternDetector to BufferRegionCollector for clarity and updated its logic to collect buffer read/write regions.
      - Enhanced the handling of conditional expressions in pipeline planning, allowing for better management of stages related to conditional statements.
      - Improved the processing of buffer regions during read/write operations, ensuring accurate tracking of buffer usage across different stages.
      
      * Refactor vectorization checks in loop_vectorize.cc
      
      - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling.
      - This change enhances the clarity and efficiency of the vectorization process, ensuring that valid vectorized expressions are processed without unnecessary checks.
    • Zhengju Tang's avatar