1. 09 May, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4
      Lei Wang authored
      * [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
      
      * [Refactor] Update warp_specialized_rewriter with license change and code cleanup
      
      * Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
      * Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
      * Added `#include` for `common/collector.h` to support new functionality.
      * Updated file documentation to reflect the correct filename and purpose.
      * Improved overall code readability by removing unnecessary comments and sections.
      
      * [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py
      
      * Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
      * Enhanced documentation for new synchronization functions to clarify usage and parameters.
      * Updated buffer region validation in `copy.py` to ensure type checking for integer values, improving error handling for region extents.
      
      * lint fix
      
      * [Feature] Introduce TMA barrier injection and related utilities
      
      * Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
      * Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
      * Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
      * Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
      * Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.
      
      * lint fix
      d946d1d4