1. 16 Jul, 2025 1 commit
    • Lei Wang's avatar
      [Warp Specialize] Implicit Warp Specialize Programing Model (#605) · e2d25ba8
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
      
      * [Refactor] Update barrier and clear operations in warp specialization examples
      
      - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization.
      - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency.
      
      * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling
      
      - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management.
      - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements.
      - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior.
      - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output.
      - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements.
      
      * [Enhancement] Add support for disabling TMA in Copy operations
      
      - Introduced a new `disable_tma` parameter in the `Copy` class to control thread memory access behavior.
      - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag.
      - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations.
      - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations.
      
      * [Refactor] Clean up whitespace and formatting in multiple files
      
      - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`.
      - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase.
      
      * [Enhancement] Refactor flash attention implementation for improved performance and configurability
      
      - Split the shared memory allocations for query and key-value pairs to optimize memory usage.
      - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example.
      - Updated kernel execution parameters to improve thread management and synchronization.
      - Enhanced the overall structure of the flash attention function for better readability and maintainability.
      
      * fix
      
      * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget
      
      * Refactor barrier handling and update example configurations
      
      - Replaced commented-out barrier creation with new barrier allocation in GEMM example.
      - Updated kernel configuration in warp specialization example to include async copy settings.
      - Enhanced barrier management in the phase optimization process to improve synchronization handling.
      - Introduced new barrier allocation function for better memory management in shared contexts.
      
      * Refactor barrier handling in LowerAndLegalize and OptimizeForTarget
      
      - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization.
      - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code.
      - Added exit() call in OptimizeForTarget to halt execution after barrier lowering.
      
      * Enhance CMake configuration and clean up example scripts
      
      - Enabled compile command export in CMakeLists.txt for better build integration.
      - Removed unnecessary print statement in the warp specialization example.
      - Cleaned up commented-out code in GEMM example for improved readability.
      - Updated barrier handling in shared memory allocation transformations for better synchronization.
      
      * Refactor barrier handling in warp specialization examples
      
      - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization.
      - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts.
      - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management.
      
      * Update lower_shared_barrier.cc
      
      * Update phase.py
      
      * Update warp specialization example and Cython wrapper
      
      - Removed commented-out pass configuration options in the warp specialization example for clarity.
      - Added functionality to write the generated kernel source to a file named "kernel.cu".
      - Enhanced Cython wrapper to support boolean type conversion for improved type handling.
      
      * Add storage synchronization call in shared barrier transformation
      
      - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process.
      
      * remove debug files
      
      * Remove kernel source output to file in warp specialization example
      
      * remove comments
      
      * Refactor tensor handling and update test execution in TileLang
      
      - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency.
      - Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`.
      - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing.
      - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity.
      
      * Update test_tilelang_language_reshape.py
      e2d25ba8
  2. 09 May, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) · f41c467c
      Lei Wang authored
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      f41c467c
    • Lei Wang's avatar
      [Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4
      Lei Wang authored
      * [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
      
      * [Refactor] Update warp_specialized_rewriter with license change and code cleanup
      
      * Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
      * Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
      * Added `#include` for `common/collector.h` to support new functionality.
      * Updated file documentation to reflect the correct filename and purpose.
      * Improved overall code readability by removing unnecessary comments and sections.
      
      * [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py
      
      * Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
      * Enhanced documentation for new synchronization functions to clarify usage and parameters.
      * Updated buffer region validation in `copy.py` to ensure type checking for integer values, improving error handling for region extents.
      
      * lint fix
      
      * [Feature] Introduce TMA barrier injection and related utilities
      
      * Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
      * Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
      * Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
      * Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
      * Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.
      
      * lint fix
      d946d1d4
  3. 03 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Separate warp specialize rewriter and tma barrier injector pass (#447) · fce16b00
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated the `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      fce16b00