1. 17 Dec, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a
      Lei Wang authored
      * [Enhancement] Update examples and tests for improved type handling and functionality
      
      - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
      - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
      - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
      - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
      
      * [Refactor] Update accumulation data type to float32 across examples
      
      - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
      - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
      
      * [Refactor] Standardize data type usage across benchmark scripts
      
      - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
      - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
      - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
      
      * [Refactor] Standardize data type usage in templates and scripts
      
      - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
      - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
      - This change aims to streamline type handling and improve compatibility with existing workflows.
      
      * [Refactor] Standardize data type usage in examples and benchmarks
      
      - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
      - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
      
      * [Refactor] Import dtypes from language.v2 module
      
      - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
      - This change aims to streamline data type management and improve overall code clarity.
      
      * fix
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
      - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
      - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
      
      * [Refactor] Update data type handling for consistency and clarity
      
      - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
      - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
      - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
      - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
      
      * [Enhancement] Improve data type handling and error messaging
      
      - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
      - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
      - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
      
      * [Fix] Correct boolean flag in GEMM SP test case
      
      - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
      - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
      
      * [Refactor] Standardize data type usage in various modules
      
      - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
      
      * [Refactor] Update argument parsing for data types in benchmarks
      
      - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
      - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
      
      * [Refactor] Update data type handling in benchmark and example scripts
      
      - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
      - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
      - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
      
      * [Refactor] Fix data type conversion in multiple scripts
      
      - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
      - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
      
      * [Refactor] Update float8 data type usage across multiple scripts
      
      - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
      - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
      
      * [Refactor] Enhance float8 data type handling in CUDA code generation
      
      - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
      - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
      - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Streamline float8 data type handling in CUDA and related modules
      
      - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
      - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Remove unnecessary cache disabling in float8 example script
      
      - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
      - This change enhances clarity and maintainability of the example script without affecting its functionality.
      
      * [Refactor] Update data type usage in debug print tests
      
      - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
      - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
      
      * lint fix
      
      * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
      
      * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
      c750fb8a
  2. 12 Dec, 2025 1 commit
  3. 24 Nov, 2025 1 commit
  4. 20 Nov, 2025 1 commit
  5. 16 Nov, 2025 1 commit
  6. 25 Sep, 2025 1 commit
    • Lei Wang's avatar
      [Language] Support atomic add with ret (#870) · aa0b1090
      Lei Wang authored
      * Add atomic operations for CUDA templates in new atomic.h file
      
      - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
      - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
      - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
      - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
      - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
      
      * Refactor atomic operations in CUDA templates for improved readability
      
      - Reformatted atomic operation implementations in atomic.h for better code clarity.
      - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
      - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
      
      * Add thread storage synchronization configuration option
      
      - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
      - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
      - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
      
      * Refactor thread storage sync configuration retrieval
      
      - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
      - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
      
      * test fix
      
      * Update atomic operations and tests for improved functionality
      
      - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
      - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
      - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
      - Updated the TVM subproject to the latest commit for better compatibility.
      
      * Update attention sink examples to use 32 heads
      
      - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
      - Ensured consistency across example scripts for improved usability and testing.
      
      * Refactor atomic add handling in vectorization
      
      - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
      - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
      
      * Add loop break functionality and enhance thread synchronization
      
      - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
      - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
      - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
      
      * test fix
      aa0b1090
  7. 26 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Simplify vectorization process in loop_vectorize.cc and add... · 3c5190e0
      Lei Wang authored
      [Enhancement] Simplify vectorization process in loop_vectorize.cc and add atomic add test (#436) (#439)
      
      * Removed redundant simplification step in vectorization logic to streamline performance.
      * Introduced a new test for atomic addition in TileLang, validating functionality with a reference implementation using PyTorch.
      3c5190e0