1. 15 Nov, 2025 1 commit
    • [BugFix] Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators. (#1222) · 0af3fd7c
      Tong WU authored
      
      * Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators.
      
      * lint
      
      * pre-commit
      
      * Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.
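      
      The numerical idea behind the fix, in a minimal PyTorch sketch (not the kernel code itself): out-of-bounds key positions are filled with `-inf` before the softmax, so they receive exactly zero weight, rather than zeroing the accumulator after the fact.
      
      ```python
      import torch
      
      # scores for a tile whose key range extends past the end of the sequence
      scores = torch.randn(4, 8)
      valid = torch.arange(8) < 5                        # last 3 columns are out of bounds
      masked = scores.masked_fill(~valid, float("-inf"))
      probs = torch.softmax(masked, dim=-1)
      assert torch.all(probs[:, 5:] == 0)                # OOB positions contribute nothing
      ```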
2. 15 Oct, 2025 1 commit
    • Fix bug & add AMD examples (#966) · 80665cd1
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
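      
      A sketch of what such a sweep might look like (the function name and value ranges here are illustrative, not the script's actual ones):
      
      ```python
      from itertools import product
      
      def get_configs():
          # hypothetical sweep over the options named above
          return [
              dict(block_M=bm, block_N=bn, num_stages=ns,
                   enable_rasterization=rast, k_pack=kp)
              for bm, bn, ns, rast, kp in product(
                  [64, 128], [64, 128], [1, 2], [True, False], [1, 2])
          ]
      ```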
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
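      
      An illustrative TileLang fragment of the annotation pattern; `make_mma_swizzle_layout` is the helper name used in the NVIDIA-side examples and stands in here for whatever the AMD example imports — a sketch, not the example's code:
      
      ```python
      import tilelang
      import tilelang.language as T
      from tilelang.intrinsics import make_mma_swizzle_layout as make_swizzle_layout
      
      def copy_with_swizzle(M=128, N=128, block=64, dtype="float16"):
          @T.prim_func
          def main(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
              with T.Kernel(T.ceildiv(M, block), T.ceildiv(N, block), threads=128) as (bx, by):
                  A_shared = T.alloc_shared((block, block), dtype)
                  # attach a swizzled layout so strided accesses avoid bank conflicts
                  T.annotate_layout({A_shared: make_swizzle_layout(A_shared)})
                  T.copy(A[bx * block, by * block], A_shared)
                  T.copy(A_shared, B[bx * block, by * block])
          return main
      ```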
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      * Add new AMD FlashAttention example and test script
      
      - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
      - Added `test.sh` script to facilitate running the new example with specified parameters.
      - Enhanced the overall structure and organization of the example for better clarity and usability.
      
      * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
      
      - Reduced the number of threads and `num_split_q` options for improved performance.
      - Adjusted `panel_size` options to streamline configuration settings.
      
      * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
      
      * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
      
      * Add example for AMD Flash Attention backward pass implementation
      
      - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
      - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
      - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
      - Included reference implementation for validation against PyTorch's attention mechanism.
      
      This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
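      
      In outline, the example wires JIT-compiled forward and backward kernels into a `torch.autograd.Function`. Below is a plain-PyTorch stand-in for that structure (the real TileLang kernels replace the tensor ops; `delta` corresponds to the preprocessing step, and the saved LSE avoids re-running the softmax reduction):
      
      ```python
      import torch
      
      class FlashAttnFn(torch.autograd.Function):
          @staticmethod
          def forward(ctx, q, k, v, scale):
              s = (q @ k.transpose(-1, -2)) * scale
              lse = torch.logsumexp(s, dim=-1)        # saved for the backward pass
              p = torch.softmax(s, dim=-1)
              o = p @ v
              ctx.save_for_backward(q, k, v, o, lse)
              ctx.scale = scale
              return o
      
          @staticmethod
          def backward(ctx, do):
              q, k, v, o, lse = ctx.saved_tensors
              # rebuild the probabilities from the saved LSE instead of re-reducing
              p = torch.exp((q @ k.transpose(-1, -2)) * ctx.scale - lse.unsqueeze(-1))
              dv = p.transpose(-1, -2) @ do
              dp = do @ v.transpose(-1, -2)
              delta = (o * do).sum(-1, keepdim=True)  # the "preprocess" term
              ds = p * (dp - delta) * ctx.scale
              return ds @ k, ds.transpose(-1, -2) @ q, dv, None
      
      q, k, v = (torch.randn(1, 2, 64, 32, requires_grad=True) for _ in range(3))
      FlashAttnFn.apply(q, k, v, 32 ** -0.5).sum().backward()
      ```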
      
      * Enhance AMD Flash Attention example with additional testing capabilities
      
      - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
      - Improved the main function to allow for better parameter configuration and benchmarking.
      - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
      
      This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
      
      * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * Refactor HIP intrinsic rules to CUDA
      
      - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
      - Adjusted include paths for better organization and clarity in the code structure.
      
      * Update AMD CI workflow to uninstall specific PyTorch packages before installation
      
      - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
      - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
      
      * Remove unused shared memory allocations in AMD Flash Attention backward example
      
      - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
      - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
      
      * Remove unnecessary pip uninstall command from AMD CI workflow
      
      - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
      - This change simplifies the CI process and reduces potential overhead during package management.
      
      * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
      
      - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
      - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
      
      * Refactor formatting of HIP intrinsic rule registrations
      
      - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
      - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
      
      * Update file name and documentation for HIP intrinsic rules
      
      - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
      - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
      
      * Enhance DispatchHIPShuffle function with clang-analyzer comments
      
      - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
      - This change improves code clarity and maintains compliance with static analysis tools.
      
      * lint fix
      
      * fix
      
      * Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.
      
      * Add backward attention example to test script
      
      - Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
      - Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
      - Updated the command for the backward tile example to match the new configuration.
      
      * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
      
      - Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
      - Updated the FlashAttention forward and backward functions to improve performance and maintainability.
      - Adjusted test script parameters for consistency and clarity, including the addition of group handling.
      - Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
      - Updated the main function to reflect changes in parameter names and types for better usability.
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
      - Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
      - Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
      - Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
      - Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.
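      
      The LSE check amounts to comparing the kernel's returned log-sum-exp row statistics against a PyTorch reference; a sketch with stand-in shapes (`lse` would come from the TileLang kernel):
      
      ```python
      import torch
      
      q = torch.randn(2, 4, 128, 64)    # (batch, heads, seq_len, dim) stand-ins
      k = torch.randn(2, 4, 128, 64)
      scale = 64 ** -0.5
      ref_lse = torch.logsumexp((q @ k.transpose(-1, -2)) * scale, dim=-1)
      # torch.testing.assert_close(lse, ref_lse, rtol=1e-2, atol=1e-2)
      ```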
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Introduced a scaling factor for improved numerical stability in gradient calculations.
      - Optimized shared memory usage by adding new shared buffers for intermediate calculations.
      - Refined the handling of tensor fragments to improve performance and maintainability.
      - Updated the main function to ensure compatibility with the new output parameters for backward operations.
      - Removed unnecessary parameters from the test script to streamline execution.
      
      * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py
      
      - Updated the forward and backward functions to improve numerical stability and performance.
      - Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
      - Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
      - Added debugging and benchmarking functions for improved correctness verification and performance analysis.
      - Updated the main function to reflect changes in parameter handling and ensure consistency across examples.
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Updated scaling factor application for improved numerical stability in gradient calculations.
      - Refined tensor handling to ensure consistency with forward pass operations.
      - Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
      - Adjusted comments for clarity and alignment with standard implementation practices.
      
      * Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh
      
      - Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
      - Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
      - Improved comments for clarity and alignment with the updated configurations.
      
      * Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py
      
      - Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
      - Modified benchmarking functions to evaluate the complete forward and backward performance of both the reference and TileLang implementations.
      - Improved comments for better understanding of the performance metrics and implementation details.
      - Removed unnecessary parameter from test.sh to streamline execution.
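      
      One common accounting, used by the upstream flash-attention benchmarks (the example's exact bookkeeping may differ): the forward pass is two matmuls and the backward is roughly five, so —
      
      ```python
      def attention_flops(batch, heads, seq_len, dim, causal=False):
          # forward: QK^T and PV, 2 FLOPs per multiply-add
          fwd = 4 * batch * heads * seq_len * seq_len * dim
          bwd = 2.5 * fwd                     # ~5 backward matmuls vs 2 forward
          if causal:
              fwd, bwd = fwd / 2, bwd / 2     # only the lower triangle is computed
          return fwd, bwd
      ```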
      
      * Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.
      
      * Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
      
      - Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
      - Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
      - Refined scaling factor calculations for numerical stability in both forward and backward passes.
      - Improved comments and documentation for clarity and consistency across implementations.
      - Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.
      
      * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py
      
      - Removed outdated comments and improved clarity in the code.
      - Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
      - Updated autotuner configurations to include new parameters for better performance tuning.
      - Refined tensor handling and scaling factor calculations for improved numerical stability.
      - Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
      - Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
      - Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
      - Enhanced GEMM operations with k_pack for improved computational efficiency.
      - Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.
      
      * Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py
      
      - Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
      - Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
      - Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
      - Adjusted GEMM operations to improve computational efficiency without the use of k_pack.
      
      * Enhance HIP code generation and FP8 type support
      
      - Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
      - Updated error logging to include unsupported FP8 type details for better debugging.
      - Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
      - Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
      - Added overloads for AtomicAdd in common.h to support both pointer and value arguments.
      
      * Enhance FP8 type support and clarify accumulator handling in HIP
      
      - Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
      - Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
      - Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.
      
      * Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py
      
      * Update print statement formatting for clarity in example_amd_flash_attn_bwd.py
      
      * Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.
      
      * Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.
      
      * Refactor and enhance HIP code generation for improved FP8 support
      
      - Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
      - Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
      - Updated AtomicAdd function in common.h to streamline its implementation.
      - Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
      - Added function to manage the addition of new functions in the code generation process.
      
      * Fix formatting issue in HIP code generation for MFMA call
      
      - Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.
      
      * Refactor HIP code generation and enhance FP8 type handling
      
      - Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
      - Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
      - Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
      - Refined the AddFunction method to streamline function addition in the code generation process.
      
      * Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
      
      * Refactor backward attention implementation in example_amd_flash_attn_bwd.py
      
      - Updated the GEMM operation to use shared memory for improved performance.
      - Adjusted parallelization parameters to enhance efficiency in the backward pass.
      
      * Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
      
      * Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations
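      
      A hedged sketch of what such a case exercises — fp8 inputs against an fp32 reference; `float8_e4m3fnuz` is the ROCm-flavored fp8 format, and its availability depends on the PyTorch build:
      
      ```python
      import torch
      
      a = torch.randn(128, 64).to(torch.float8_e4m3fnuz)
      b = torch.randn(64, 128).to(torch.float8_e4m3fnuz)
      ref = a.to(torch.float32) @ b.to(torch.float32)   # fp32 reference accumulation
      ```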
      
      * Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
3. 18 Sep, 2025 1 commit
    • [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
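      
      In usage terms, fast math is now opt-in per kernel through `pass_configs`; a minimal sketch, assuming the option is exposed under the key `tl.enable_fast_math` (the Python-level key may differ from the `TL_ENABLE_FAST_MATH` constant):
      
      ```python
      import tilelang
      import tilelang.language as T
      
      def exp_kernel(N=1024, block=256, dtype="float32"):
          @T.prim_func
          def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
              with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
                  for i in T.Parallel(block):
                      B[bx * block + i] = T.exp(A[bx * block + i])
          return main
      
      # fast math is off by default; enable it where the accuracy trade-off is acceptable
      kernel = tilelang.compile(exp_kernel(),
                                pass_configs={"tl.enable_fast_math": True})  # assumed key
      ```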
4. 26 Mar, 2025 1 commit
    • [Refactor] Deprecate `T.Buffer` as an argument type and rename related calls to `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
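      
      A sketch of the new pointer-based idiom (the `make_tensor` signature is assumed from the description above; the real test lives in `test_tilelang_language_ptr.py`):
      
      ```python
      import tilelang
      import tilelang.language as T
      
      def matmul_from_ptrs(M, N, K, block_M=128, block_N=128, block_K=32, dtype="float16"):
          @T.prim_func
          def main(a_ptr: T.handle, b_ptr: T.handle, c_ptr: T.handle):
              # make_tensor builds Tensor views over raw pointers (assumed signature)
              A = T.make_tensor(a_ptr, (M, K), dtype)
              B = T.make_tensor(b_ptr, (K, N), dtype)
              C = T.make_tensor(c_ptr, (M, N), dtype)
              with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
                  A_shared = T.alloc_shared((block_M, block_K), dtype)
                  B_shared = T.alloc_shared((block_K, block_N), dtype)
                  C_local = T.alloc_fragment((block_M, block_N), "float")
                  T.clear(C_local)
                  for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                      T.copy(A[by * block_M, k * block_K], A_shared)
                      T.copy(B[k * block_K, bx * block_N], B_shared)
                      T.gemm(A_shared, B_shared, C_local)
                  T.copy(C_local, C[by * block_M, bx * block_N])
          return main
      ```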
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
5. 22 Mar, 2025 1 commit
    • [Example] Implement kernel example: cumsum (#258) · cd9ec62e
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
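      
      The core of the kernel is a log-step inclusive scan, which is also where the power-of-two restriction comes from; a PyTorch sketch of the idea (the TileLang version performs the same scan per block in shared memory):
      
      ```python
      import torch
      
      def inclusive_scan(x):
          # Hillis-Steele log-step scan over the last dimension
          n = x.shape[-1]
          assert n & (n - 1) == 0, "length must be a power of two"
          y = x.clone()
          step = 1
          while step < n:
              y[..., step:] = y[..., step:] + y[..., :-step]
              step *= 2
          return y
      
      x = torch.randn(4, 256)
      torch.testing.assert_close(inclusive_scan(x), torch.cumsum(x, dim=-1),
                                 rtol=1e-3, atol=1e-3)
      ```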
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
6. 20 Mar, 2025 1 commit
    • [Refactor] Phase out the LLVM dependency by making it optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
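      
      The migration pattern, as a minimal sketch (profiler method names as used in the examples; details may vary by version):
      
      ```python
      import tilelang
      import tilelang.language as T
      
      def add_one(N=1024, block=256, dtype="float32"):
          @T.prim_func
          def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
              with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
                  for i in T.Parallel(block):
                      B[bx * block + i] = A[bx * block + i] + 1.0
          return main
      
      kernel = tilelang.compile(add_one())   # replaces the old tilelang.lower(...)
      profiler = kernel.get_profiler()       # new unified profiling entry point
      print(profiler.do_bench())             # average latency in milliseconds
      ```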
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
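      
      Concretely (a sketch; the variable is read once at initialization, so it must be set before import):
      
      ```python
      import os
      
      os.environ["TILELANG_CLEAR_CACHE"] = "1"   # truthy value requests a cache clear
      import tilelang                            # cache is cleared during initialization
      ```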
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
7. 09 Mar, 2025 1 commit
    • [Feat] Introduce new caching mechanism for compiled kernels (#176) · 7bde63d5
      Lei Wang authored
      * Add kernel caching mechanism to TileLang
      
      - Implement a new `cached` function in `tilelang/cache/__init__.py` to cache and reuse compiled kernels
      - Expose the `cached` function in the main `tilelang/__init__.py`
      - Add a test case for cached matrix multiplication in `testing/python/cache/test_tilelang_cache_matmul.py`
      - Provide a `clear_cache()` function to reset the kernel cache when needed
      
      * Refactor kernel caching test and implementation
      
      - Simplify the `cached` function in `tilelang/cache/__init__.py`
      - Update test script `test_tilelang_cache_matmul.py` to use `tilelang.testing.main()`
      - Remove unnecessary whitespace and improve code formatting
      
      * Update import for `cached` function in MHA examples
      
      - Modify import statement in `example_mha_bwd.py` and `test_tilelang_kernel_mha_bwd.py`
      - Change import from `tilelang.profiler import cached` to `tilelang import cached`
      - Align with recent refactoring of kernel caching mechanism
      
      * Refactor `cached` function signature in kernel caching
      
      - Update function signature to use keyword-only arguments for `target` and `target_host`
      - Improve parameter order and readability of the `cached` decorator
      - Maintain existing functionality while enhancing function definition
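      
      Putting the pieces together, usage looks roughly like this (argument layout assumed from the commits above; the `matmul` program factory is hypothetical):
      
      ```python
      import torch
      from tilelang import cached, clear_cache
      from my_kernels import matmul   # hypothetical factory returning a T.prim_func
      
      # out_idx=[2] marks the third argument as the output tensor (assumed convention)
      kernel = cached(matmul, [2], 1024, 1024, 1024, target="cuda")
      
      a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
      b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
      c = kernel(a, b)
      
      clear_cache()   # reset the kernel cache when needed
      ```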
8. 11 Feb, 2025 1 commit
    • [Dev] Add MHA backward example (#77) · a6fe61e2
      Yu Cheng authored
      * [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized
      
      * Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests
      
      * [Dev] Add mha backward example