Commits · 1873dc007bc8147aa255f4d01b24d0ab11121c44 · OpenDAS / tilelang

30 Mar, 2025 1 commit

[Example] Add autotune to conv example (#301) · 1873dc00

yyttt6 authored Mar 30, 2025



* add autotune to example_gemm.py

* add autotune to conv

* still coding ...

* version 0

* version 0

* version 0

* refactor autotune

* refactor autotune

* add autotune to conv example

* add conv template to carver

* add conv template to carver

* add conv template to carver

* Update num_stages configuration values

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

1873dc00

29 Mar, 2025 1 commit

[Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely (#302) · d3bf4fe1

Zhengju Tang authored Mar 29, 2025



* [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely

* lint fix

* update license

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

d3bf4fe1

28 Mar, 2025 5 commits

[Bugfix] Correct method call for block reduction check when analyzing memory footprint (#299) · 55051166
NaOHCC authored Mar 28, 2025

55051166

[Refactor] Improve documentation and add detailed docstrings across multiple modules (#298) · 3f294650

Lei Wang authored Mar 28, 2025

* [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h

- Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
- Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.

* [Enhancement] Improve documentation and add detailed docstrings across multiple modules

- Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities.
- Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods.
- Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function.
- Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values.
- Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods.
- Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.

3f294650

[Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd

Lei Wang authored Mar 28, 2025

- Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
- Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.

9ad9d9cd

[Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061

Lei Wang authored Mar 28, 2025

* [Feature] Implement ParallelLoopTransformer for enhanced loop analysis

- Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
- Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
- Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
- Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.

* test fix

* Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.

* Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.

5c8de061

[doc/example] add gemv doc and examples (#293) · ff3cfa59

botbw authored Mar 28, 2025

* [doc/example] init gemv doc and examples

* [example] add vectorized read

* [example] use local register instead of smem

* [example] add bench

* [doc] update doc

* [doc] refine doc

* [lint] format code

* [doc] add tips

* [doc/example] fix typo

* [example] use tmv_all_reduce

* [doc] update doc accordingly

* [doc] add benchmark table

* [lint] format code

ff3cfa59

27 Mar, 2025 5 commits

[Dev] Correcting cxx compiler (#294) · 304b4465
penguin_wwy authored Mar 28, 2025

304b4465

Remove citation page (#292) · 5079e2a5
Lei Wang authored Mar 27, 2025

5079e2a5

[Doc] Python API docs generation (#278) · 5501b31c

Wenhao Xie authored Mar 27, 2025

* fix bug

* update performance.py

* update python api docs

* test workflow

* fix dependency

* fix bug

* fix

* update correct git config

* test workflow

* clear cache

* lint fix

* fix exclude path

5501b31c

[Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 (#291) · 83412458

Lei Wang authored Mar 27, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated imports in `__init__.py` to include `make_tensor`.
- Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
- This change enhances clarity and consistency in tensor usage across the codebase.

* [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py

- Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
- Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
- Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.

* [Refactor] Update imports in __init__.py for tir compatibility

- Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
- Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.

* lint fix

* [Refactor] Update imports in tir.ir.py for improved compatibility

- Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
- Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.

* [Refactor] Update function calls in tir.ir.py to return values

- Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.

* bugfix

* [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper

- Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.

* bugfix

* [Update] Sync subproject commit and modify CUDA atomic add functions

- Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
- Commented out the CUDA capability check in the example convolution script to prevent execution errors.
- Refactored atomic add functions for BFLOAT16 in common.h to include a conditional compilation directive for improved compatibility with CUDA architectures.

83412458
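The `CacheState` item in the commit message above (replacing global cache functions with a class) can be pictured roughly as follows. This is a minimal sketch and not the code in `env.py`: the attribute `_enabled`, its default, and the method names `enable`, `disable`, and `is_enabled` are assumptions made for illustration. Only the function names `enable_cache`, `disable_cache`, and `is_cache_enabled` come from this log (see #261).

```python
class CacheState:
    """Keeps the kernel-cache switch in one place instead of a bare module global."""

    _enabled = True  # assumed default for this sketch

    @classmethod
    def enable(cls) -> None:
        cls._enabled = True

    @classmethod
    def disable(cls) -> None:
        cls._enabled = False

    @classmethod
    def is_enabled(cls) -> bool:
        return cls._enabled


# Thin module-level names keep the function-style interface available.
enable_cache = CacheState.enable
disable_cache = CacheState.disable
is_cache_enabled = CacheState.is_enabled

disable_cache()
print(is_cache_enabled())
enable_cache()
print(is_cache_enabled())
```

Grouping the flag and its accessors in a class leaves a single owner for the state, which is one plausible reading of the "encapsulation" the commit message mentions.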

[Language] Proxy tvm ir to make linter happy (#287) · be0bf36d

Lei Wang authored Mar 27, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated imports in `__init__.py` to include `make_tensor`.
- Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
- This change enhances clarity and consistency in tensor usage across the codebase.

* [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py

- Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
- Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
- Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.

* [Refactor] Update imports in __init__.py for tir compatibility

- Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
- Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.

* lint fix

* [Refactor] Update imports in tir.ir.py for improved compatibility

- Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
- Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.

* [Refactor] Update function calls in tir.ir.py to return values

- Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.

* bugfix

* [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper

- Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.

* bugfix

* Uncomment main function call

be0bf36d
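The item above about `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` returning the results of their `_ir` calls describes a common wrapper pitfall. The sketch below illustrates it under stated assumptions: `_ir_serial` is a hypothetical stand-in for the wrapped constructor, and the "before" version is a guess at what the wrappers presumably looked like, not code taken from `tir.ir.py`.

```python
def _ir_serial(start, stop):
    """Hypothetical stand-in for the underlying `_ir` loop constructor."""
    return range(start, stop)


def serial_without_return(start, stop):
    # The wrapped call runs, but its result is dropped, so callers get None.
    _ir_serial(start, stop)


def serial(start, stop):
    # Returning the wrapped call's result propagates the value to the caller.
    return _ir_serial(start, stop)


print(serial_without_return(0, 4))
print(list(serial(0, 4)))
```

A wrapper that omits `return` still type-checks and lints cleanly in many setups, which is why this kind of bug tends to surface only when the result is first used.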

26 Mar, 2025 4 commits

[Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8

Yu Cheng authored Mar 26, 2025

- Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
- Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
- Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.

76435ca8

[Doc] Update README.md to correct documentation link for TileLang debug tools (#286) · eee45f17
Yu Cheng authored Mar 26, 2025

eee45f17

[Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1

Lei Wang authored Mar 26, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated imports in `__init__.py` to include `make_tensor`.
- Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
- This change enhances clarity and consistency in tensor usage across the codebase.

bf8a6fc1

add autotune to example_gemm.py (#285) · 73d2c62e
yyttt6 authored Mar 26, 2025

73d2c62e

25 Mar, 2025 5 commits

[Refactor] Update cache key generation in KernelCache (#283) · 7bd59f21

Lei Wang authored Mar 25, 2025

- Changed the cache key generation to use the serialized script of the function instead of the function object itself, improving the uniqueness of cache keys.

7bd59f21
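The idea behind #283, keying the cache on the function's serialized script rather than on the function object, can be sketched as below. This is an illustration under assumptions, not the `KernelCache` implementation: the helper `make_cache_key`, the use of SHA-256 over a JSON payload, and the `target` option are all hypothetical.

```python
import hashlib
import json


def make_cache_key(script: str, **options) -> str:
    """Hash the serialized function text together with the compile options."""
    payload = json.dumps({"script": script, "options": options}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two hypothetical serialized kernels that differ in a single constant.
script_a = "def main(A, B):\n    B[0] = A[0] + 1\n"
script_b = "def main(A, B):\n    B[0] = A[0] + 2\n"

key_1 = make_cache_key(script_a, target="cuda")
key_2 = make_cache_key(script_a, target="cuda")
key_3 = make_cache_key(script_b, target="cuda")

print(key_1 == key_2)  # same text and options give the same key
print(key_1 != key_3)  # different text gives a different key
```

A key derived from the text depends only on what the function says, so two separately constructed but identical functions share a cache entry, while any edit to the body produces a new key.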

[Refactor] Enhance Autotune (#266) · 541e1685

yyttt6 authored Mar 25, 2025

* add autotune to example_gemm.py

* format init.py

541e1685

[Language] Introduce `T.ptr` and `T.Tensor` (#276) · 8ad53855

Lei Wang authored Mar 25, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

8ad53855

[CI] Add gemm performance test (#274) · 18f29277

Wenhao Xie authored Mar 25, 2025

* [Typo] Fix formatting in installation instructions in README.md

* [Enhancement] Improve CUDA path detection and update configuration handling

* fix typo

* remove IS_WINDOWS constant

* lint fix

* Improve error messages for CUDA detection failure

* lint fix

* lint fix

* Fix .gitignore to correctly include venv directory

* [Doc] Add instructions for installing nightly version of TileLang

* update installation instructions

* update install instruction

* update performance ci

* update

* update

* update

* update ci workflow

* delete test.yml

* lint fix

* update bot.yml

* update bot.yml

* remove changes in ci.yml

18f29277

[Bugfix]Add CUDA availability check in CtypesKernelAdapter (#267) · 29b7d374

Xiaochuan Ye authored Mar 25, 2025

* fix: Add CUDA availability check in CtypesKernelAdapter

* fix: Add CUDA availability check in CythonKernelWrapper

29b7d374

24 Mar, 2025 3 commits

[Refactor] Improve flash attention example and layout comparison logic (#270) · 5f5bf53c

Lei Wang authored Mar 24, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

5f5bf53c

[Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter (#269) · 2abd6ab7

Yu Cheng authored Mar 24, 2025

- Introduced TMAFinder and ProducerUsedBufferFinder classes to analyze TMA loads and identify buffers used in producer conditions.
- Enhanced WarpSpecializedRoleMarker to prepare and utilize the identified buffers during role marking.
- Updated VisitStmt methods to incorporate new analysis logic for IfThenElse and For nodes, improving the handling of TMA loads in the warp specialization process.

2abd6ab7

[Bugfix] Support `T.clear` for let binding (#268) · 47caf219

Lei Wang authored Mar 24, 2025

* Fix indentation in JIT adapter wrapper to ensure consistent formatting of return statement in generated C code.

* Enhance Fill Operation in TileLang

- Updated the Fill constructor to support BufferLoad instances, adding checks for ramp indices and ensuring only stride 1 ramps are processed.
- Introduced a region array to manage the bounds of the fill operation, improving error checking for static regions.
- Modified the MakeSIMTLoop method to utilize the new region array for loop variable bounds, enhancing flexibility in kernel generation.
- Updated the fill and clear functions in fill.py to accept both tir.Buffer and tir.BufferRegion types, improving usability and type handling.

* Refactor Fill Operation and Improve Readability

- Simplified the Fill constructor by enhancing the handling of BufferLoad instances and ensuring proper checks for ramp indices.
- Improved error messages for region size checks to enhance clarity.
- Cleaned up formatting in the Fill method for better readability.
- Added a blank line in the matmul function test to improve code organization.
- Introduced a blank line in the fill function to enhance readability in fill.py.

* Add matrix multiplication functionality and test in TileLang

- Introduced a new test file `test_tilelang_language_clear.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `__init__.py` in the utils module to include `map_torch_type`, enhancing type handling for tensor operations.

* lint fix

47caf219

23 Mar, 2025 3 commits

[Release] Bump version to 0.1.3 (#264) · 9981ac59

Lei Wang authored Mar 23, 2025

* Bump version to 0.1.3

* Refactor Docker script to streamline installation commands

- Removed the installation of the Python environment and CMake from the Docker run command, simplifying the execution process.
- Updated the command to focus on pip installation and running tox for testing across multiple Python versions.

9981ac59

Refactor matrix multiplication benchmark and autotuner logging (#263) · 8c94de32

Lei Wang authored Mar 23, 2025

- Updated `ref_program` in `benchmark_matmul.py` to remove the unused parameter `C`, simplifying the function signature.
- Changed logging level in `autotuner/__init__.py` from `INFO` to `DEBUG` for more detailed logging during autotuning.
- Modified the error handling in the autotuner to provide clearer messages and log errors at the debug level.
- Enhanced error reporting in the JIT adapter by adding detailed context to error messages in `cython_wrapper.pyx` when kernel calls fail.

8c94de32

[Language] Enhance alias to support blockwise memory load (#261) · 927e50d9

Lei Wang authored Mar 23, 2025

* [Enhancement] Introduce caching control and frame management in TileLang

- Added cache control functions (`enable_cache`, `disable_cache`, `is_cache_enabled`) in `env.py` to manage kernel caching behavior.
- Updated `kernel_cache.py` to utilize the cache state, preventing unnecessary kernel compilation when caching is disabled.
- Introduced a new `frame.py` module to manage LetFrame instances, including a stack for variable-value mapping and enhanced frame management.
- Updated imports in various modules to accommodate new caching and frame functionalities, improving overall organization and clarity.

* [Refactor] Clean up and enhance caching and frame management in TileLang

- Added spacing for improved readability in `env.py` and `frame.py`.
- Refactored `LetFrame` class to enhance clarity in buffer region assignment.
- Ensured consistent formatting and organization across caching control and frame management functions.

* [Feature] Add matrix multiplication functionality in TileLang

- Introduced a new test file `test_tilelang_language_alias.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul` function defines a kernel for performing tile-level GEMM operations, with support for customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated `gemm.py` to allow `tir.Buffer` or `tir.Var` as valid argument types for the `gemm` function, enhancing flexibility in argument handling.

* [Refactor] Improve formatting and readability in test_tilelang_language_alias.py

- Adjusted spacing and alignment in the `matmul` and `run_matmul` functions for better readability.
- Cleaned up unnecessary blank lines and ensured consistent formatting throughout the file.
- Enhanced overall code clarity without altering functionality.

927e50d9

22 Mar, 2025 5 commits

[Bugfix] Fix Benchmark/Example Code for Autotuning (#254) · 0430cfe7

Chaofan Lin authored Mar 23, 2025



* fix tune args

* lint

* Refactor gemm example and autotuner logging

- Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
- Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
- Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
- Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.

* Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, and remove the unused parameter `C`.
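
The `result_idx` bounds check mentioned in this commit can be illustrated roughly as follows. This is a sketch under assumptions: the function name `normalize_result_idx` is hypothetical, and accepting negative indices (counted from the end of the parameter list) is an assumption rather than something the commit message states.

```python
from typing import List, Optional, Sequence, Union


def normalize_result_idx(
    result_idx: Optional[Union[int, Sequence[int]]],
    num_params: int,
) -> List[int]:
    """Validate output indices against the number of kernel parameters."""
    if result_idx is None:
        return []
    indices = [result_idx] if isinstance(result_idx, int) else list(result_idx)
    normalized = []
    for idx in indices:
        # Reject anything outside [-num_params, num_params).
        if not -num_params <= idx < num_params:
            raise ValueError(
                f"result index {idx} is out of bounds for {num_params} parameters"
            )
        # Map negative indices onto their non-negative equivalents.
        normalized.append(idx % num_params)
    return normalized
```

For a kernel with three parameters, `normalize_result_idx(-1, 3)` maps to `[2]`, while an index of `3` raises `ValueError`.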

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

0430cfe7

[CI] Use auditwheel to generate manylinux wheels (#251) · 60923344

Yichen Yan authored Mar 23, 2025



* use auditwheel to get correct manylinux wheels

* fix

* make py3.8 happy

* trivial updates

* Add typing.Tuple import and update annotations

* fmt

* Remove unused import and update type hints

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

60923344

[Refactor] Move compilation outside critical section (#260) · 001e7b2a

You Jiacheng authored Mar 23, 2025



* move compilation outside critical section

* lint fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

001e7b2a

[Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.

* Refactor CUDA post-processing callback registration in TileLang

- Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
- Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
- Added debug prints to the CUDA code generation process for better traceability during development.
- Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
- Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.

* lint fix

* Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters

- Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
- Implemented a function to generate various kernel configurations for autotuning.
- Refactored the main execution block to support both autotuned and default configurations.
- Improved the block mask generation to accommodate specified sparsity levels.
- Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
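
Decorator-based callback registration, as introduced by `register_cuda_postproc_callback` in this commit, generally follows the shape sketched below. The names `register_postproc_sketch` and `generate_code` are hypothetical, and the real decorator's signature and storage mechanism may differ.

```python
from typing import Callable, Optional

# Single registered post-processing hook; None means "no hook installed".
_postproc: Optional[Callable[[str], str]] = None


def register_postproc_sketch(func: Callable[[str], str]) -> Callable[[str], str]:
    """Decorator that installs `func` as the post-processing callback."""
    global _postproc
    _postproc = func
    return func  # return the function unchanged so it stays callable by name


def generate_code(raw_source: str) -> str:
    """Hypothetical code generator that applies the hook when one is set."""
    return _postproc(raw_source) if _postproc is not None else raw_source


@register_postproc_sketch
def add_banner(code: str) -> str:
    # Example hook: prepend a comment line to the generated source.
    return "// post-processed\n" + code
```

Registering through a decorator keeps the hook definition and its registration in one place, which is the usability gain the commit message points to.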

f47b43c5

[Example] Implement Kernel Example cumsum (#258) · cd9ec62e

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
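
For readers unfamiliar with inclusive scans, the following pure-Python sketch shows a Hillis-Steele style scan applied to each row of a 2D input. It is not the TileLang kernel from `example_tilelang_cumsum.py`: scanning row by row and requiring a power-of-two row length are assumptions made here to echo the checks the commit message mentions.

```python
from typing import List


def inclusive_scan(row: List[float]) -> List[float]:
    """Inclusive prefix sum using doubling strides (Hillis-Steele)."""
    n = len(row)
    # Assumption for this sketch: reject lengths that are not powers of two.
    if n == 0 or n & (n - 1):
        raise ValueError("row length must be a power of two in this sketch")
    out = list(row)
    stride = 1
    while stride < n:
        prev = list(out)  # snapshot, as a parallel step would read old values
        for i in range(stride, n):
            out[i] = prev[i] + prev[i - stride]
        stride *= 2
    return out


def cumsum_2d(matrix: List[List[float]]) -> List[List[float]]:
    """Apply the inclusive scan independently to every row."""
    return [inclusive_scan(row) for row in matrix]
```

Each pass doubles the stride, so a row of length `n` needs `log2(n)` passes, which is what makes the scheme attractive for parallel hardware.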

cd9ec62e

21 Mar, 2025 2 commits

[Language] Introduce `T.alloc_var` to define a variable like `int var;` (#255) · c770a58f

Lei Wang authored Mar 22, 2025

* [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT

- Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters.
- Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing.
- Updated test cases to validate the new matrix multiplication implementations.
- Enhanced error handling in library initialization and compilation processes across various modules.
- Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting.

* lint fix

* optimize

* Support var define

* lint fix

* Update TVM submodule and add alloc_variable function to allocate local variables in TileLang

- Updated the TVM submodule to the latest commit.
- Introduced `alloc_variable` function in `allocate.py` to support local variable allocation with specified data types and scopes.

* lint fix

* Refactor variable allocation functions for consistency

- Renamed `alloc_variable` to `alloc_var` across multiple files for improved consistency.
- Updated corresponding test functions to reflect the new naming convention.
- Adjusted imports in `__init__.py` to align with the changes.

c770a58f

add autotune to example_gemm.py (#252) · 316d3b97

yyttt6 authored Mar 21, 2025

* add autotune to example_gemm.py

* add autotune to example_gemm.py

* add autotune to example_gemm.py

* add autotune to example_gemm.py

316d3b97

20 Mar, 2025 3 commits

[Enhancement] Support float variable as arguments (#250) · 2d0c4169

Lei Wang authored Mar 20, 2025

* [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT

- Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters.
- Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing.
- Updated test cases to validate the new matrix multiplication implementations.
- Enhanced error handling in library initialization and compilation processes across various modules.
- Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting.

* lint fix

* optimize

2d0c4169

Update bib citation (#249) · 4fcf6abe

Lei Wang authored Mar 20, 2025

4fcf6abe

[Refactor] Phase out LLVM Dependency by Making it Optional (#247) · f2e99180

Lei Wang authored Mar 20, 2025

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
- Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.

* lint fix

* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
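
The auto-clear behavior controlled by `TILELANG_CLEAR_CACHE` in this commit could look roughly like the sketch below. The helper names `should_clear_cache` and `init_cache` are hypothetical, and the set of accepted truthy strings is an assumption; the commit message only says the cache is cleared when the variable is set to true.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def should_clear_cache(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the flag from the environment (or from a mapping, for testing)."""
    env = os.environ if env is None else env
    value = env.get("TILELANG_CLEAR_CACHE", "0")
    # Assumed truthy spellings; the real parser may accept a different set.
    return value.strip().lower() in ("1", "true", "yes", "on")


def init_cache(cache_dir: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Create the cache directory, wiping it first when the flag is set."""
    if should_clear_cache(env) and cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


# Usage: a stale artifact survives a normal init but not a clearing init.
demo_dir = Path(tempfile.mkdtemp()) / "cache"
init_cache(demo_dir, {})
(demo_dir / "kernel.so").write_text("stale")
init_cache(demo_dir, {})
survived = (demo_dir / "kernel.so").exists()
init_cache(demo_dir, {"TILELANG_CLEAR_CACHE": "true"})
cleared = not (demo_dir / "kernel.so").exists()
```

Setting the variable in CI, as the commit does, ensures each test run starts from an empty cache instead of reusing kernels built by an earlier run.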

f2e99180

19 Mar, 2025 3 commits

[Examples] Implement elementwise add kernel (#219) · 43bd9d3e

Chenghua authored Mar 19, 2025

* [Example] Modify tuning configurations for FlashAttention example

* [Examples] formatting example_gqa_fwd_bshd.py

* [Examples] Implement elementwise add kernel

* [Doc] Update ElementWise Operators document

* [Examples] Replace the example of elementwise add.

43bd9d3e

[Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters (#213) · e789808b

alex_xiao authored Mar 19, 2025



* [Dev] Add database mechanism to cache

* [Dev] Fix database cache and test for it

* [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment.

* [Refactor] Improve code formatting and readability in multiple files

* [Enhancement] Add execution backend options and improve kernel adapter initialization

* [Refactor] Rename cached function to cached_kernel and update related references

* [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case

* [Enhancement] Update kernel compilation to specify execution backend as "cython"

* [Refactor] Rename cached_kernel to cached and update references in the codebase

* [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code

* [Refactor] Clean up code formatting and improve readability in cache and adapter modules

* [Refactor] Remove unused imports

* [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety

* [Refactor] Update cached function calls to use PrimFunc and improve parameter handling

* [Refactor] Clean up import statements and improve code formatting in cache and kernel test files

* Update tilelang/jit/kernel.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

e789808b

[Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to `minBlocksPerMultiprocesor` (#248) · efceb6ed

Yuxi Chi authored Mar 19, 2025

efceb6ed