Commits · 557589ffd7af10f2740d4bbf5f4f0ce70305ea3c · OpenDAS / tilelang

18 Sep, 2025 1 commit

[Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355

Lei Wang authored Sep 18, 2025

* [Enhancement] Enable fast math optimization in tilelang JIT configurations

- Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
- Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
- Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
- Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
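
As an illustration of the behaviour described above, here is a minimal sketch of how a new opt-in flag and a deprecated opt-out flag might be reconciled. The key strings are copied from the option names in this message. The function name `resolve_fast_math` and the precedence rule (the new key wins when both are present) are assumptions, not the actual `LibraryGenerator` logic.

```python
import warnings

# Key names copied from the commit message; the real config keys may differ.
ENABLE_KEY = "TL_ENABLE_FAST_MATH"
DISABLE_KEY = "TL_DISABLE_FAST_MATH"  # deprecated option


def resolve_fast_math(pass_configs=None):
    """Return True if fast math should be enabled for this compilation."""
    cfg = pass_configs or {}
    if DISABLE_KEY in cfg:
        warnings.warn(
            f"{DISABLE_KEY} is deprecated; use {ENABLE_KEY} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the legacy key is honoured only when the new key is absent.
        if ENABLE_KEY not in cfg:
            return not cfg[DISABLE_KEY]
    # Fast math stays off unless it is requested explicitly.
    return bool(cfg.get(ENABLE_KEY, False))
```

With this sketch, `resolve_fast_math()` returns `False`, matching the "off by default" behaviour in the commit title, while `resolve_fast_math({"TL_ENABLE_FAST_MATH": True})` opts in.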

* lint fix

* [Refactor] Introduce deprecated_warning utility for improved deprecation handling

- Added a new `deprecated_warning` function to streamline deprecation messages.
- Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
- Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
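
A deprecation helper of this kind could look roughly like the sketch below. The signatures, the message format, and the example version string are all hypothetical, and the real `deprecated_warning` and `deprecated` utilities may differ.

```python
import functools
import warnings


def deprecated_warning(name, replacement=None, phaseout_version=None):
    """Emit a DeprecationWarning with a consistently formatted message."""
    msg = f"{name} is deprecated"
    if phaseout_version:
        msg += f" and will be removed in version {phaseout_version}"
    if replacement:
        msg += f"; use {replacement} instead"
    # stacklevel=3 attributes the warning to the caller of the wrapped function
    # when this helper is invoked from the decorator's wrapper below.
    warnings.warn(msg, DeprecationWarning, stacklevel=3)
    return msg


def deprecated(replacement=None, phaseout_version=None):
    """Decorator that warns on every call to the wrapped function."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deprecated_warning(func.__name__, replacement, phaseout_version)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical usage: names and version are placeholders.
@deprecated(replacement="new_api", phaseout_version="0.2.0")
def old_api(x):
    return x + 1
```

Keeping the message formatting in one function means the decorator and any hand-written warnings (such as one for a deprecated config key) produce the same wording.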


16 Jun, 2025 1 commit

[BugFix] Fix import error in NSA examples when `fla.__version__ >= 0.2.1` (#579) · 18ab72c9

Tong WU authored Jun 16, 2025

* Update FLA import path for `prepare_token_indices`


* Compare versions via packaging.version.parse
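
The reason for parsing versions instead of comparing raw strings can be sketched as follows. The helper name `uses_new_layout` is hypothetical, and the fallback parser is a simplified stand-in that is used only when `packaging` is not installed. The actual FLA import paths are deliberately not shown.

```python
import re

try:
    from packaging.version import parse as parse_version
except ImportError:
    # Simplified stand-in: compares up to three numeric release components
    # and ignores pre-release and local version tags.
    def parse_version(text):
        nums = [int(p) for p in re.findall(r"\d+", text)[:3]]
        return tuple(nums + [0] * (3 - len(nums)))


def uses_new_layout(installed, threshold="0.2.1"):
    """Return True if `installed` is at least `threshold`."""
    return parse_version(installed) >= parse_version(threshold)


# Plain string comparison is lexicographic, so it misorders "0.10.0" and "0.2.1".
string_compare_is_wrong = not ("0.10.0" >= "0.2.1")
```

A gate like `uses_new_layout(fla.__version__)` could then select which import path to use for `prepare_token_indices`.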


18 May, 2025 1 commit

[Refactor] refactor `tilelang.jit` to support a faster and more flexible kernel cache (#501) · 25a50f1a

Lei Wang authored May 19, 2025

* [Refactor] Update JIT kernel functions and streamline GEMM tests

* Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
* Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
* Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
* Cleaned up commented-out code and unnecessary imports to improve overall code quality.

* Update main function call in GEMM test to use tilelang testing framework

* Update README and example scripts to include JIT decorator comments

* Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
* Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.

* Update GEMM test parameters for improved performance

* Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.


12 Apr, 2025 1 commit

[Enhancement][Pipeline] More precise copy code block detection in pipeline (#384) · abaacde5

Lei Wang authored Apr 12, 2025

* Update legalize_safe_memory_access.cc

* Add cache path handling and file locking in Cython adapter

- Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
- Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
- These changes aim to optimize the compilation process and prevent race conditions during library loading.
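
The combination of a hash-derived cache path and a lock file can be sketched in plain Python as below. Every name here (`cache_dir_for`, `file_lock`, `get_or_build`, `lib.bin`, `build.lock`) is hypothetical, and the real Cython adapter may use a different hash, directory layout, or locking primitive.

```python
import hashlib
import os
import time
from contextlib import contextmanager


def cache_dir_for(source_code, root):
    """Derive a per-kernel cache directory from a hash of the source code."""
    digest = hashlib.sha256(source_code.encode("utf-8")).hexdigest()
    return os.path.join(root, digest)


@contextmanager
def file_lock(lock_path, timeout=10.0, poll=0.05):
    """Hold an exclusive lock by atomically creating `lock_path`."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another process holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def get_or_build(source_code, root, build):
    """Return the cached artifact path, building it at most once."""
    path = cache_dir_for(source_code, root)
    os.makedirs(path, exist_ok=True)
    artifact = os.path.join(path, "lib.bin")
    with file_lock(os.path.join(path, "build.lock")):
        if not os.path.exists(artifact):
            tmp = artifact + ".tmp"
            with open(tmp, "wb") as f:
                f.write(build(source_code))
            os.replace(tmp, artifact)  # publish the finished file atomically
    return artifact
```

Checking for the artifact only after the lock is held is what prevents two processes from compiling the same kernel concurrently. One limitation of this simple scheme is that a crashed process can leave a stale lock file behind, which is why the sketch includes a timeout.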

* lint fix

* refactor


* Add GlobalCopyPatternDetector to identify global memory copy patterns

- Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements.
- Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
- Improved code clarity and maintainability by encapsulating detection logic within the new class.

* Refactor copy stage detection logic in pipeline planning

- Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
- Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.

* lint fix


26 Mar, 2025 1 commit

[Refactor] Deprecate `T.Buffer` as an argument type and rename related calls to `T.Tensor` (#281) · bf8a6fc1

Lei Wang authored Mar 26, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated imports in `__init__.py` to include `make_tensor`.
- Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
- This change enhances clarity and consistency in tensor usage across the codebase.


10 Mar, 2025 1 commit

[Examples] Implement NSA Backward kernels (#180) · 6891d3ec

Lei Wang authored Mar 10, 2025


* Update native sparse attention example with scale parameter handling

- Add scale parameter processing in native_sparse_attention function
- Modify example script to include custom scale value
- Update function calls to pass scale parameter
- Enhance flexibility of sparse attention implementation

* Refactor Triton Native Sparse Attention Example

- Improve code formatting and readability in example_triton_nsa_bwd.py
- Standardize function and parameter alignment
- Remove unnecessary whitespace and optimize imports
- Enhance code style consistency with previous commits
