• Lei Wang's avatar
    [Language] Support atomic add with ret (#870) · aa0b1090
    Lei Wang authored
    * Add atomic operations for CUDA templates in new atomic.h file
    
    - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
    - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
    - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
    - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
    - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
    
    * Refactor atomic operations in CUDA templates for improved readability
    
    - Reformatted atomic operation implementations in atomic.h for better code clarity.
    - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
    - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
    
    * Add thread storage synchronization configuration option
    
    - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
    - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
    - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
    
    * Refactor thread storage sync configuration retrieval
    
    - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
    - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
    
    * test fix
    
    * Update atomic operations and tests for improved functionality
    
    - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
    - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
    - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
    - Updated the TVM subproject to the latest commit for better compatibility.
    
    * Update attention sink examples to use 32 heads
    
    - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
    - Ensured consistency across example scripts for improved usability and testing.
    
    * Refactor atomic add handling in vectorization
    
    - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
    - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
    
    * Add loop break functionality and enhance thread synchronization
    
    - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
    - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
    - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
    
    * test fix
    aa0b1090
legalize_safe_memory_access.cc 12 KB