-
Lei Wang authored
* Add atomic operations for CUDA templates in new atomic.h file - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types. - Implemented support for half, bfloat16, and float types with appropriate memory ordering. - Moved atomic-related utilities from common.h to the new atomic.h file for better organization. - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts. - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability. * Refactor atomic operations in CUDA templates for improved readability - Reformatted atomic operation implementations in atomic.h for better code clarity. - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters. - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase. * Add thread storage synchronization configuration option - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access. - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled. - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option. * Refactor thread storage sync configuration retrieval - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables. - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file. * test fix * Update atomic operations and tests for improved functionality - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability. - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers. - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite. - Updated the TVM subproject to the latest commit for better compatibility. * Update attention sink examples to use 32 heads - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms. - Ensured consistency across example scripts for improved usability and testing. * Refactor atomic add handling in vectorization - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance. - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency. * Add loop break functionality and enhance thread synchronization - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic. - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization. - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility. * test fix
aa0b1090