1. 12 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a
      Lei Wang authored
      * Add kernel selection option for GEMM v1 in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
      - Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
      - Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.
      
      * bug fix
      
      * Add kernel selection option for GEMM in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
      - Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
      - Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.
      
      * Refactor GEMM macro generator to use BufferRegion instead of Buffer
      
      - Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
      - Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
      - Simplified buffer access logic for better clarity and maintainability.
      
      * Refactor GEMM functions to utilize BufferRegion for improved memory handling
      
      - Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
      - Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
      - Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
      - Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.
      
      * Refactor GEMM code for improved readability and consistency
      
      - Cleaned up formatting and spacing in GEMM-related files for better readability.
      - Standardized comments and code structure across various GEMM functions and macros.
      - Enhanced error messages for clarity in buffer region checks.
      - Removed redundant lines and improved overall code maintainability.
      
      * Update GEMM correctness evaluation and macro generator for improved functionality
      
      - Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
      - Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
      - Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
      - Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.
      
      * Refactor GEMM and intrinsic files for improved clarity and functionality
      
      - Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
      - Adjusted function signature formatting in `swizzle.py` for better readability.
      - Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
      - Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
      - Improved function signature formatting in `language.py` for consistency.
      
      * Enhance GEMM and MMA functionality for FP64 support
      
      - Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
      - Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
      - Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
      - Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
      - Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.
      
      * lint fix
      
      * Refactor GEMM correctness evaluation and shared memory alignment handling
      
      - Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
      - Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
      - Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
      - Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.
      
      * Enhance GEMM and MMA implementations with region-based memory handling
      
      - Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
      - Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
      - Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.
      
      * Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls
      
      - Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
      - Modified function calls to use the new module structure for better organization and maintainability in testing examples.
      
      * Enhance OnArrayDeclaration method to handle repeated buffer declarations
      
      - Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
      - Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.
      
      * Add abbreviation for bfloat16 data type in mfma_macro_generator.py
      
      - Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.
      
      * Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation
      
      - Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
      - Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
      - Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.
      
      * lint fix
      
      * Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP
      
      - Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
      - Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
      - Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
      - Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.
      
      * Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py
      
      - Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
      - Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.
      
      * Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.
      
      * Refactor attention sink examples to simplify index calculations
      
      - Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
      - Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.
      
      * Refactor attention sink examples to streamline index calculations
      
      - Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
      - Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.
      
      * lint fix
      
      * bugfix
      
      * Refactor reduce operation handling in CUDA and Python
      
      - Removed outdated shared memory reduction logic from `reduce.cc`.
      - Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
      - Updated CUDA header to define a wider accumulator type for better numerical accuracy.
      - Enhanced error handling for buffer scope validation in the reduction process.
      
      * Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs
      
      * Enhance unit loop handling by refining annotation checks
      
      - Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
      - Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.
      
      * clean clode
      8fbe1b3a
  2. 04 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Improve Python3.9 compatibility for ParamSpec and Self (#1190) · 7d961892
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
      
      * [Refactor] Improve Python compatibility for ParamSpec and Self
      
      - Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
      - Updated type annotations across multiple files to ensure consistent usage of typing features.
      
      * [Update] Require Python 3.9 and enhance type annotations
      
      - Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
      - Removed references to Python 3.8 in classifiers.
      - Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
      - Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.
      
      * [Refactor] Update import statements to enhance type annotations
      
      - Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
      - Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
      - Adjusted import statements in the language proxy file to maintain consistency in type annotations.
      
      * disable rocm rs nt test.
      
      * lint fix
      7d961892
  3. 23 Oct, 2025 1 commit
    • Tong WU's avatar
      [Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a
      Tong WU authored
      * [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen
      
      * Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
      * Enhanced the existing code to support both directions of conversion based on the lane count.
      * Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.
      
      * [Feature] Add float32 to float8 conversion support in CUDA codegen
      
      * Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
      * Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
      * Enhanced type handling for better compatibility with TileLang, particularly for float8 types.
      
      * lint
      
      * fix a bug
      
      * [Enhancement] Support lanes=4 cases and add unit test for vectorized cast
      
      * lint
      
      * [Feature] Refactor bf16 convertion operations and remove legacy compile flags
      
      * lint
      a148d62a
  4. 10 Oct, 2025 1 commit
    • Tong WU's avatar
      [Example] Add support for `bfloat16` and user-defined `sm_scale` in attention sink examples (#924) · 7cd0da99
      Tong WU authored
      
      
      * revert split+sum template for MHA backward
      
      * lint
      
      * Update example_mha_bwd.py
      
      * Update example_mha_bwd_wgmma_pipelined.py
      
      * Refactor attention sink examples to support bf16 and user-defined softmax scale
      
      * fix typos
      
      * Adding compile flags for fast math optimizations and enabling BF16 support in both GQA and MHA backward implementations.
      
      * Update backward configuration for GQA and MHA examples to align with flash attention
      
      * Refactor GQA backward implementation to improve atomic add performance
      
      * Allow for slightly larger numerical error for bf16
      
      * upd readme to show bf16 benchmark results
      
      * lint
      
      * fix ci and lint
      
      * fix comments and lint
      
      * refactor atomic add
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      7cd0da99
  5. 26 Sep, 2025 1 commit
    • Tong WU's avatar
      [Example] Add efficient attention sink backward implementations and tests (#877) · ec24561a
      Tong WU authored
      * [Example] Add a new example to support attention sink for MHA
      
      - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Added a reference attention function to validate the implementation against PyTorch.
      - Included argument parsing for command-line execution of the example.
      
      * [Example] Replace MHA sink forward example with updated implementation
      
      - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
      - Updated argument parsing and reference functions to align with the new implementation.
      
      * Enhance MHA sink example with sliding window support
      
      - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
      - Implemented assertions to ensure `window_size` is compatible with `block_N`.
      - Updated the main function to include a `tune` option for performance tuning.
      - Introduced a new test file to validate both full attention and sliding window scenarios.
      - Adjusted FLOPS calculation to account for the sliding window configuration.
      
      * lint
      
      * [Fix] Add checkinf process to fix the bug of swa
      
      * Migrate to BSHD layout to align with triton baselines
      
      * lint
      
      * fix typo
      
      * Refactor MHA sink example to use seq_q and seq_kv parameters to accommodate the new sequence length parameters.
      
      * Add GQA sink example for optimized attention mechanism & lint fix
      
      * fix several typos and bugs
      
      * lint
      
      * fix speed issues of swa
      
      * Add flash attention example with backward pass for BHSD layout and corresponding test cases
      
      * Add backward pass implementation for flash attention with sinks and corresponding test case
      
      * fix lint and typo
      
      * Optimze the calculation of `dsinks`
      
      * Add support for swa backward and update examples
      
      * fix previous typos
      
      * Add example for GQA sink backward pass and update tests for both MHA and GQA sinks
      
      * fix lint
      
      * fix previous typos
      
      * typo
      ec24561a