1. 26 Sep, 2025 2 commits
    • [Example] Optimize sink attention forward via swizzled layout and report benchmark results (#885) · bf67fb19
      Tong WU authored
      
      
      * Enhance attention sink examples with swizzled layout and performance metrics
      
      - Added `make_swizzled_layout` annotations for shared tensors in the `flashattn` function across MHA and GQA examples to optimize memory access patterns.
      - Updated benchmark outputs to include speedup calculations comparing Triton and TileLang implementations.
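
To illustrate why a swizzled shared-memory layout can help, here is a minimal pure-Python sketch of a generic XOR swizzle. It is not the layout that `make_swizzled_layout` produces. The 32-bank model, the 32x32 tile, and all function names below are assumptions made for illustration only.

```python
NUM_BANKS = 32  # assumption: 32 shared-memory banks, one 4-byte word each
TILE = 32       # hypothetical 32x32 tile of 4-byte elements


def linear_offset(row: int, col: int) -> int:
    """Plain row-major offset (in words) inside the tile."""
    return row * TILE + col


def swizzled_offset(row: int, col: int) -> int:
    """Row-major offset with the column XOR-ed by the row index."""
    return row * TILE + (col ^ row)


def max_bank_conflict(offset_fn, col: int) -> int:
    """Worst-case number of threads hitting one bank when 32 threads
    read the same logical column, one row per thread."""
    hits = [0] * NUM_BANKS
    for row in range(TILE):
        hits[offset_fn(row, col) % NUM_BANKS] += 1
    return max(hits)


linear_conflict = max_bank_conflict(linear_offset, col=5)
swizzled_conflict = max_bank_conflict(swizzled_offset, col=5)
print(linear_conflict, swizzled_conflict)
```

In this model a column read through the linear layout sends every thread to the same bank, while the XOR-ed layout spreads the same read across all banks without changing how many elements the tile holds.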
      
      * Add README for Attention Sink example with algorithm details and benchmark results
      
      - Introduced a new README.md file for the Attention Sink example, outlining the forward and backward algorithms, including the computation of `dsinks`.
      - Provided benchmark results comparing performance metrics of the optimized implementation against Triton, highlighting speedup across various configurations.
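
As a rough sketch of what the forward pass and the `dsinks` computation involve, the snippet below handles a single query row in pure Python. It assumes the common formulation in which each head has one scalar sink logit that enters only the softmax normalizer; the README may state the algorithm differently. The function names and sample numbers are hypothetical.

```python
import math


def sink_attention_row(scores, values, sink):
    """Forward pass for one query row.

    scores: pre-scaled attention logits, one per key
    values: value vectors (lists of floats), one per key
    sink:   scalar sink logit for this head
    Returns (output vector, probability mass absorbed by the sink).
    """
    m = max(max(scores), sink)                  # shared max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    sink_exp = math.exp(sink - m)
    denom = sum(exps) + sink_exp                # the sink only enters the normalizer
    probs = [e / denom for e in exps]
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return out, sink_exp / denom


def dsink_row(out, d_out, p_sink):
    """Gradient of the loss with respect to the sink logit for one query row."""
    delta = sum(o * g for o, g in zip(out, d_out))  # delta = <dO, O>
    return -p_sink * delta


scores = [0.3, -1.2, 0.8]
values = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.5]]
sink = 0.1
d_out = [0.7, -0.4]  # upstream gradient dL/dO

out, p_sink = sink_attention_row(scores, values, sink)
analytic = dsink_row(out, d_out, p_sink)
print(out, p_sink, analytic)
```

Under this formulation the gradient for a whole head would be the sum of the per-row values over all batches and query positions.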
      
      * Update README.md for Attention Sink example to include link to Triton implementation
      
      * Update examples/attention_sink/README.md
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * typo
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
    • [Example] Add efficient attention sink backward implementations and tests (#877) · ec24561a
      Tong WU authored
      * [Example] Add a new example to support attention sink for MHA
      
      - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Added a reference attention function to validate the implementation against PyTorch.
      - Included argument parsing for command-line execution of the example.
      
      * [Example] Replace MHA sink forward example with updated implementation
      
      - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
      - Updated argument parsing and reference functions to align with the new implementation.
      
      * Enhance MHA sink example with sliding window support
      
      - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
      - Implemented assertions to ensure `window_size` is compatible with `block_N`.
      - Updated the main function to include a `tune` option for performance tuning.
      - Introduced a new test file to validate both full attention and sliding window scenarios.
      - Adjusted FLOPS calculation to account for the sliding window configuration.
      
      * lint
      
      * [Fix] Add checkinf process to fix the bug of swa
      
      * Migrate to BSHD layout to align with triton baselines
      
      * lint
      
      * fix typo
      
      * Refactor MHA sink example to take `seq_q` and `seq_kv` sequence-length parameters.
      
      * Add GQA sink example for optimized attention mechanism & lint fix
      
      * fix several typos and bugs
      
      * lint
      
      * fix speed issues of swa
      
      * Add flash attention example with backward pass for BHSD layout and corresponding test cases
      
      * Add backward pass implementation for flash attention with sinks and corresponding test case
      
      * fix lint and typo
      
      * Optimize the calculation of `dsinks`
      
      * Add support for swa backward and update examples
      
      * fix previous typos
      
      * Add example for GQA sink backward pass and update tests for both MHA and GQA sinks
      
      * fix lint
      
      * fix previous typos
      
      * typo
  2. 25 Sep, 2025 1 commit
    • [Language] Support atomic add with ret (#870) · aa0b1090
      Lei Wang authored
      * Add atomic operations for CUDA templates in new atomic.h file
      
      - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
      - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
      - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
      - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
      - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
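
The value of a return variant is easiest to see in a fetch-and-add pattern. The sketch below is a host-side Python stand-in, not the tilelang API: it assumes the return variants hand back the value held before the update, as CUDA's `atomicAdd` does. `AtomicCounter`, `atomic_add_ret`, and `reserve_slots` are hypothetical names.

```python
import threading


class AtomicCounter:
    """Host-side stand-in for a device integer updated atomically."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()

    def atomic_add_ret(self, delta: int) -> int:
        """Add delta and return the value held before the update."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def load(self) -> int:
        with self._lock:
            return self._value


def reserve_slots(num_workers: int, per_worker: int):
    """Each worker reserves a disjoint output range via fetch-and-add."""
    cursor = AtomicCounter()
    starts = [None] * num_workers

    def worker(i: int) -> None:
        starts[i] = cursor.atomic_add_ret(per_worker)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(starts), cursor.load()


starts, total = reserve_slots(num_workers=8, per_worker=4)
print(starts, total)
```

Because every caller receives a distinct previous value, the returned offsets never overlap, which a plain atomic add without a return value cannot guarantee.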
      
      * Refactor atomic operations in CUDA templates for improved readability
      
      - Reformatted atomic operation implementations in atomic.h for better code clarity.
      - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
      - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
      
      * Add thread storage synchronization configuration option
      
      - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
      - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
      - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
      
      * Refactor thread storage sync configuration retrieval
      
      - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
      - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
      
      * test fix
      
      * Update atomic operations and tests for improved functionality
      
      - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
      - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
      - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
      - Updated the TVM subproject to the latest commit for better compatibility.
      
      * Update attention sink examples to use 32 heads
      
      - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
      - Ensured consistency across example scripts for improved usability and testing.
      
      * Refactor atomic add handling in vectorization
      
      - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
      - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
      
      * Add loop break functionality and enhance thread synchronization
      
      - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
      - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
      - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
      
      * test fix
  3. 23 Sep, 2025 1 commit
    • [Example] Add examples to support efficient attention sink forward process (#853) · d9a171ce
      Tong WU authored
      
      
      * [Example] Add a new example to support attention sink for MHA
      
      - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Added a reference attention function to validate the implementation against PyTorch.
      - Included argument parsing for command-line execution of the example.
      
      * [Example] Replace MHA sink forward example with updated implementation
      
      - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
      - Updated argument parsing and reference functions to align with the new implementation.
      
      * Enhance MHA sink example with sliding window support
      
      - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
      - Implemented assertions to ensure `window_size` is compatible with `block_N`.
      - Updated the main function to include a `tune` option for performance tuning.
      - Introduced a new test file to validate both full attention and sliding window scenarios.
      - Adjusted FLOPS calculation to account for the sliding window configuration.
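
The sliding-window changes above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the example's code: the window is taken to include the current token, `seq_q` equals `seq_kv`, and "compatible with `block_N`" is read as divisibility. All function names are hypothetical.

```python
def check_window(window_size: int, block_N: int) -> None:
    # Assumption: "compatible" means the window spans a whole number of key blocks.
    assert window_size % block_N == 0, "window_size must be a multiple of block_N"


def visible(q_idx: int, k_idx: int, window_size=None) -> bool:
    """Causal mask, optionally restricted to the last `window_size` keys."""
    if k_idx > q_idx:
        return False  # causal: a query never looks ahead
    if window_size is not None and k_idx <= q_idx - window_size:
        return False  # key fell out of the sliding window
    return True


def attention_flops(batch, heads, seq_len, dim, window_size=None) -> int:
    """FLOPs of QK^T plus PV, counting only unmasked (query, key) pairs."""
    pairs = sum(
        visible(q, k, window_size)
        for q in range(seq_len)
        for k in range(seq_len)
    )
    # Two matmuls, each costing 2 * dim FLOPs per (query, key) pair.
    return 4 * batch * heads * dim * pairs


check_window(window_size=4, block_N=2)
full = attention_flops(batch=1, heads=1, seq_len=8, dim=1)
windowed = attention_flops(batch=1, heads=1, seq_len=8, dim=1, window_size=4)
print(full, windowed)
```

Counting pairs exactly is slow for long sequences, so a real benchmark would more likely use a closed-form estimate; the explicit loop is only there to make the effect of the window on the FLOP count visible.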
      
      * lint
      
      * [Fix] Add checkinf process to fix the bug of swa
      
      * Migrate to BSHD layout to align with triton baselines
      
      * lint
      
      * fix typo
      
      * Refactor MHA sink example to take `seq_q` and `seq_kv` sequence-length parameters.
      
      * Add GQA sink example for optimized attention mechanism & lint fix
      
      * fix several typos and bugs
      
      * lint
      
      * fix speed issues of swa
      
      * Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      * Update examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>