    [Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4
    Lei Wang authored
    * [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py
    
    * Refactored barrier functions to use new signatures for improved clarity and consistency.
    * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
    * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
    
    * [Refactor] Update warp_specialized_rewriter with license change and code cleanup
    
    * Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
    * Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
    * Added `#include` for `common/collector.h` to support new functionality.
    * Updated file documentation to reflect the correct filename and purpose.
    * Improved overall code readability by removing unnecessary comments and sections.
    
    * [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py
    
    * Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
    * Enhanced documentation for new synchronization functions to clarify usage and parameters.
    * Updated buffer region validation in `copy.py` to type-check region extents as integers, improving error handling.
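
As a rough illustration of the extent type check mentioned above, the sketch below compares two regions only on the dimensions where both extents are plain Python integers and treats anything else as a dynamic (symbolic) extent to be resolved later. The function name `static_extents_compatible` and the rule for dynamic extents are assumptions made for this example; they are not taken from `copy.py`.

```python
def static_extents_compatible(src_extents, dst_extents):
    """Hypothetical helper: check two region extent lists for static mismatches.

    Only dimensions where both extents are Python ints are compared.
    Any non-int extent is assumed to be dynamic and is skipped here.
    """
    if len(src_extents) != len(dst_extents):
        return False
    for src, dst in zip(src_extents, dst_extents):
        both_static = isinstance(src, int) and isinstance(dst, int)
        # A mismatch can only be proven when both sides are known integers.
        if both_static and src != dst:
            return False
    return True
```

With this rule, a region whose extent is a symbolic value is not rejected up front; only a provable mismatch between two integer extents is reported.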
    
    * lint fix
    
    * [Feature] Introduce TMA barrier injection and related utilities
    
    * Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
    * Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
    * Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
    * Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
    * Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.
    
    * lint fix
inject_tma_barrier.cc 10.5 KB