Commits · 19c8590767fb066386adb590b9b3745de02e7041 · OpenDAS / tilelang

01 Apr, 2025 2 commits

[Bugfix] Fix logic error in ReduceOp when handling CUDA architecture (#316) · 19c85907

Yu Cheng authored Apr 01, 2025

* [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding

* [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture

- Added a check for the existence of the target attribute "arch" to ensure that there is no undefined behavior when handling the specific architecture "sm_90". This change improves the robustness and compatibility of the code.

19c85907

[Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding (#315) · 3e859e67
Yu Cheng authored Apr 01, 2025

3e859e67

31 Mar, 2025 2 commits

[Bugfix] Fix layout conflict issue for gqa decoding examples (#314) · 0fd82ed5

Lei Wang authored Apr 01, 2025

* Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.

* Refactor flashattn example to improve CUDA configuration handling

- Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
- Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
- Adjusted the `do_bench` method call to streamline performance profiling.

* lint fix

0fd82ed5

[Bugfix] Fix dynamic axis with variable extent (#311) · c30904ea

Lei Wang authored Mar 31, 2025

* [Enhancement] Improve error message for RampNode in CUDA codegen

- Updated the error message in the VisitExpr_ method for RampNode to include the specific Ramp node and lane count when the lane count exceeds the limit of 4. This change enhances debugging by providing clearer context for the error.
- Refactored the loop vectorization logic in loop_vectorize_dynamic.cc to improve readability and maintainability, ensuring that dynamic vectorization checks are performed correctly and efficiently.

* lint fix

c30904ea

30 Mar, 2025 1 commit

[Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d

Lei Wang authored Mar 31, 2025

* [Enhancement] Add support for CUDA architecture 8.9 in GEMM template

- Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
- This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.

* lintfix

* [Refactor] Clean up includes in gemm_sm89.h

- Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
- This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.

edbb9b6d

29 Mar, 2025 1 commit

[Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely (#302) · d3bf4fe1

Zhengju Tang authored Mar 29, 2025



* [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely

* lint fix

* update license

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

d3bf4fe1

28 Mar, 2025 2 commits

[Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd

Lei Wang authored Mar 28, 2025

- Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
- Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.

9ad9d9cd

[Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061

Lei Wang authored Mar 28, 2025

* [Feature] Implement ParallelLoopTransformer for enhanced loop analysis

- Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
- Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
- Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
- Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.

* test fix

* Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.

* Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.

5c8de061

27 Mar, 2025 1 commit

[Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 (#291) · 83412458

Lei Wang authored Mar 27, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated imports in `__init__.py` to include `make_tensor`.
- Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
- This change enhances clarity and consistency in tensor usage across the codebase.

* [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py

- Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
- Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
- Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.

* [Refactor] Update imports in __init__.py for tir compatibility

- Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
- Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.

* lint fix

* [Refactor] Update imports in tir.ir.py for improved compatibility

- Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
- Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.

* [Refactor] Update function calls in tir.ir.py to return values

- Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.

* bugfix

* [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper

- Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.

* bugfix

* [Update] Sync subproject commit and modify CUDA atomic add functions

- Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
- Commented out the CUDA capability check in the example convolution script to prevent execution errors.
- Refactored atomic add functions for BFLOAT16 in common.h to include a conditional compilation directive for improved compatibility with CUDA architectures.

83412458

26 Mar, 2025 1 commit

[Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8

Yu Cheng authored Mar 26, 2025

- Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
- Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
- Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.

76435ca8

24 Mar, 2025 3 commits

[Refactor] Improve flash attention example and layout comparison logic (#270) · 5f5bf53c

Lei Wang authored Mar 24, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

5f5bf53c

[Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter (#269) · 2abd6ab7

Yu Cheng authored Mar 24, 2025

- Introduced TMAFinder and ProducerUsedBufferFinder classes to analyze TMA loads and identify buffers used in producer conditions.
- Enhanced WarpSpecializedRoleMarker to prepare and utilize the identified buffers during role marking.
- Updated VisitStmt methods to incorporate new analysis logic for IfThenElse and For nodes, improving the handling of TMA loads in the warp specialization process.

2abd6ab7

[Bugfix] Support `T.clear` for let binding (#268) · 47caf219

Lei Wang authored Mar 24, 2025

* Fix indentation in JIT adapter wrapper to ensure consistent formatting of return statement in generated C code.

* Enhance Fill Operation in TileLang

- Updated the Fill constructor to support BufferLoad instances, adding checks for ramp indices and ensuring only stride 1 ramps are processed.
- Introduced a region array to manage the bounds of the fill operation, improving error checking for static regions.
- Modified the MakeSIMTLoop method to utilize the new region array for loop variable bounds, enhancing flexibility in kernel generation.
- Updated the fill and clear functions in fill.py to accept both tir.Buffer and tir.BufferRegion types, improving usability and type handling.

* Refactor Fill Operation and Improve Readability

- Simplified the Fill constructor by enhancing the handling of BufferLoad instances and ensuring proper checks for ramp indices.
- Improved error messages for region size checks to enhance clarity.
- Cleaned up formatting in the Fill method for better readability.
- Added a blank line in the matmul function test to improve code organization.
- Introduced a blank line in the fill function to enhance readability in fill.py.

* Add matrix multiplication functionality and test in TileLang

- Introduced a new test file `test_tilelang_language_clear.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `__init__.py` in the utils module to include `map_torch_type`, enhancing type handling for tensor operations.

* lint fix

47caf219

22 Mar, 2025 2 commits

[Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.

* Refactor CUDA post-processing callback registration in TileLang

- Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
- Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
- Added debug prints to the CUDA code generation process for better traceability during development.
- Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
- Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.

* lint fix

* Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters

- Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
- Implemented a function to generate various kernel configurations for autotuning.
- Refactored the main execution block to support both autotuned and default configurations.
- Improved the block mask generation to accommodate specified sparsity levels.
- Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.

f47b43c5

[Example] Implement Kernel Example cumsum (#258) · cd9ec62e

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.

cd9ec62e

21 Mar, 2025 1 commit

[Language] Introduce `T.alloc_var` to define a variable like `int var;` (#255) · c770a58f

Lei Wang authored Mar 22, 2025

* [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT

- Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters.
- Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing.
- Updated test cases to validate the new matrix multiplication implementations.
- Enhanced error handling in library initialization and compilation processes across various modules.
- Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting.

* lint fix

* optimize

* Support var defiine

* lint fix

* Update TVM submodule and add alloc_variable function to allocate local variables in TileLang

- Updated the TVM submodule to the latest commit.
- Introduced `alloc_variable` function in `allocate.py` to support local variable allocation with specified data types and scopes.

* lint fix

* Refactor variable allocation functions for consistency

- Renamed `alloc_variable` to `alloc_var` across multiple files for improved consistency.
- Updated corresponding test functions to reflect the new naming convention.
- Adjusted imports in `__init__.py` to align with the changes.

c770a58f

20 Mar, 2025 1 commit

[Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180

Lei Wang authored Mar 20, 2025

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
- Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.

* lint fix

* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage

f2e99180

19 Mar, 2025 2 commits

[Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value... · efceb6ed
Yuxi Chi authored Mar 19, 2025
```
[Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to `minBlocksPerMultiprocesor ` (#248)
```
efceb6ed

[Enhancement] Add zero initialization option to GEMM operations (#246) · 701e9234

Yu Cheng authored Mar 19, 2025

* [Enhancement] Add zero initialization option to GEMM operations

- Introduced a new `zero_init` parameter to the GEMM function, allowing for optional zero initialization of the accumulator.
- Updated the GEMM implementation across various CUDA architectures to support the new parameter.
- Modified the Python interface for GEMM to include the `zero_init` argument, enhancing flexibility in kernel execution.
- Ensured compatibility with existing functionality while improving initialization control for performance optimization.

* rename zero_init to clear_accum

* lint

701e9234

18 Mar, 2025 3 commits

[Dev] Implement FlashAttention3 Backward (#244) · c264f37f

Yu Cheng authored Mar 18, 2025

* [BugFix] Fix bug of missing MBarrierExpectTX

* [Dev] Implement FlashAttention3 Backward

- Added a new example for Flash Attention using pipelined WGMMA, including forward and backward pass implementations.
- Introduced functions for forward and backward processing, leveraging tilelang for optimized tensor operations.
- Enhanced the attention mechanism with support for both causal and non-causal configurations.
- Included command-line arguments for batch size, number of heads, context size, and head dimension for flexibility in testing.
- Updated GEMM operations to support a new `wg_wait` parameter for improved synchronization in kernel execution.

c264f37f

[Refactor] Refactor for Better Layout Conflict Handling (#240) · 2a286ae6

Lei Wang authored Mar 18, 2025

* [Feature] Add reduce_max functionality and corresponding tests

* Introduced a new test file for the reduce_max operation in the tilelang language module.
* Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying.
* Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation.
* Enhanced profiling assertions to validate the output against reference implementations.

* Fix whitespace issues in reduce_max test file for improved readability

* [Refactor] Update DebugOutput methods to return strings instead of void

* Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging.
* Updated corresponding header files to reflect the new return types.
* Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts.

* lint fix

* Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests.

* [Enhancement] Improve layout inference conflict handling in ParallelOp

* Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers.
* Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages.
* Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise.

* [Feature] Add IsEqual methods for layout comparison

* Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison.
* Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts.
* Improved error messages for layout conflicts to provide clearer guidance on potential issues.houm

* [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc

* Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement.
* Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity.
* Improved error messages in layout conflict checks to provide better guidance on potential issues.

* [Refactor] Clean up pointer formatting in layout inference files

* Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability.
* Minor adjustments to error message formatting in layout conflict checks for better clarity.

2a286ae6

[BugFix] Fix bug of missing MBarrierExpectTX (#241) · 45534789
Yu Cheng authored Mar 18, 2025

45534789

17 Mar, 2025 1 commit

[Bugfix] Disable force inline for ldmatrix (#227) · a1da26f2

Lei Wang authored Mar 17, 2025

* Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture

- Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
- Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
- Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
- Include necessary header files for built-in operations and layout inference improvements.

* Refactor parameter formatting in CUDA matrix load functions for consistency

- Adjusted parameter alignment in `ptx_ldmatrix_x1`, `ptx_ldmatrix_x2`, `ptx_ldmatrix_x4`, and their transposed counterparts for improved readability.
- Added a blank line in `get_tensor_supply` function in `tensor.py` to enhance code clarity.

* Enhance tensor supply generation in `get_tensor_supply` function

- Introduced handling for unsigned integer and float8 tensor types, allowing for specific random tensor generation based on data type.
- Updated logic to return appropriate random tensors for different data types, improving flexibility and functionality of tensor supply generation.
- Refactored existing conditions for clarity and maintainability.

* Fix tensor supply generation logic in `get_tensor_supply` function

- Updated the variable reference from `tensor` to `param` to ensure correct handling of tensor data types.
- Improved the accuracy of unsigned integer and float8 checks for tensor supply generation, enhancing functionality and reliability.

* Enhance tensor supply checks in `get_tensor_supply` function

- Updated the logic for identifying unsigned integers and float8 types by using `removeprefix` on the dtype string, improving accuracy in tensor supply generation.
- Ensured better handling of tensor data types for more reliable random tensor generation based on the updated checks.

* Enhance KernelParam functionality and improve tensor supply checks

- Added methods `is_unsigned` and `is_float8` to the `KernelParam` class for better type identification of parameters.
- Updated the `get_tensor_supply` function to utilize the new methods, improving clarity and accuracy in tensor supply generation based on parameter types.

a1da26f2

16 Mar, 2025 1 commit

[Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper (#224) · c5bbc608

zqh-wz authored Mar 16, 2025



* add test for issue 101

* use ss_smem_selector from cutlass

* fix mismatch between smem layout and mma

* only fix for sm90

* Add CUDA requirements to GEMM thread tests

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

c5bbc608

14 Mar, 2025 3 commits

[Dev] Implement IfStmtBinding and MergeIfStmt transformations (#211) · 86f96f8a

Yu Cheng authored Mar 14, 2025



* [Dev] Implement IfStmtBinding and MergeIfStmt transformations

- Add IfStmtBinding to bind If statements to each statement in SeqStmt, enhancing the handling of conditional statements.
- Introduce MergeIfStmt to merge consecutive If statements within SeqStmt, optimizing the structure of conditional logic.
- Update phase.py to apply IfStmtBinding and MergeIfStmt transformations for the "sm_90" target.
- Enhance __init__.py with new functions for IfStmtBinding and MergeIfStmt, providing a clear interface for these transformations.

* Update license header in if_stmt_binding.cc

* Update license header in merge_if_stmt.cc

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

86f96f8a

[Enhancement] Avoid tvm ffi handling when out_idx is specified (#209) · 227ed7ec

Lei Wang authored Mar 14, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

* Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc

- Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
- Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.

* Refactor condition handling in inject_pipeline.cc

- Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
- Update condition comparison logic to use StructuralEqual for better accuracy.
- Enhance logging to provide detailed insights into condition changes and statement processing.
- Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.

* Improve logging and formatting in inject_pipeline.cc

- Enhance logging statements for better clarity on condition changes and statement processing.
- Adjust formatting for improved readability, including line breaks and consistent spacing.
- Ensure accurate condition comparison and handling in the pipeline logic.

* Refactor logging and clean up inject_pipeline.cc

- Remove excessive logging statements to streamline the code and improve performance.
- Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
- Maintain the core functionality while enhancing code readability and maintainability.

* Update Dockerfiles to specify exact version of libstdcxx-ng

- Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution.
- Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds.

* Refactor and enhance examples and kernel handling

- Adjusted the pipeline stages in `example_blocksparse_gemm.py` from 2 to 1 for improved performance.
- Added kernel source printing in `benchmark_nsa_fwd.py` for better debugging and profiling insights.
- Updated tensor allocation and parameter handling in `CtypesKernelAdapter` and `CythonKernelWrapper` to cache parameter dtypes and shapes, improving efficiency and clarity.
- Enhanced the handling of dynamic shapes in the Cython JIT kernel compilation process.
- Modified the benchmark script to accommodate new tensor output parameters and improved batch size defaults for testing.

* Update copyright header in Cython wrapper to reflect Tile-AI Corporation

* revert change

227ed7ec

[Enhancement] Allow mma fallback when wgmma is not supported (#206) · 45559a1f

Lei Wang authored Mar 14, 2025

* Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging.

* Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture

- Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
- Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
- Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
- Include necessary header files for built-in operations and layout inference improvements.

* lint fix

* Remove unused builtin.h include directive

* Update include path for builtin.h

45559a1f

13 Mar, 2025 2 commits

[Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5

zqh-wz authored Mar 13, 2025



* upgrade cutlass to upstream v3.8.0

* Implement fp8 gemm and add example script

* Fix dtype retrieval with map_torch_type for fp8 inputs

* Disable vectorization of fp8 values

* Make MMA declaration compatible with cutlass 3.4.0+

* Add test for fp8 T.gemm

* fix indent

* fix indent

* Add copyright and license header

* Add copyright and license header

* lint fix

* Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.

* clang format lint

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

2cccf1f5

[Enhancement] Enhancing the handling of conditional statements in the pipeline (#201) · dda8ebff

Lei Wang authored Mar 13, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

* Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc

- Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
- Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.

* Refactor condition handling in inject_pipeline.cc

- Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
- Update condition comparison logic to use StructuralEqual for better accuracy.
- Enhance logging to provide detailed insights into condition changes and statement processing.
- Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.

* Improve logging and formatting in inject_pipeline.cc

- Enhance logging statements for better clarity on condition changes and statement processing.
- Adjust formatting for improved readability, including line breaks and consistent spacing.
- Ensure accurate condition comparison and handling in the pipeline logic.

* Refactor logging and clean up inject_pipeline.cc

- Remove excessive logging statements to streamline the code and improve performance.
- Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
- Maintain the core functionality while enhancing code readability and maintainability.

dda8ebff

12 Mar, 2025 6 commits

[Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b

Lei Wang authored Mar 13, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

7ccec53b

[CMake] Add CUDA Major Version Detection for Conditional Compilation (#197) · 20f19611

Yu Cheng authored Mar 12, 2025

* [Feature] Add TMA Store Synchronization Support

- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
- Add new builtin operations in op/builtin.cc and op/builtin.h
- Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
- Update CUDA codegen to support new TMA store synchronization intrinsics
- Add Python language bindings for new TMA store synchronization operations

* [CMake] Add CUDA Major Version Detection for Conditional Compilation

- Introduce CUDA_MAJOR_VERSION CMake variable to dynamically detect CUDA toolkit version
- Update runtime and transform files to use CUDA_MAJOR_VERSION for version-specific code paths
- Replace hardcoded __CUDACC_VER_MAJOR__ with dynamically set CUDA_MAJOR_VERSION
- Improve cross-version compatibility for CUDA-dependent code sections

20f19611

[Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a

Yu Cheng authored Mar 12, 2025

- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
- Add new builtin operations in op/builtin.cc and op/builtin.h
- Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
- Update CUDA codegen to support new TMA store synchronization intrinsics
- Add Python language bindings for new TMA store synchronization operations

eba7dd5a

[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp... · 94c758ad

Yu Cheng authored Mar 12, 2025

[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter (#194)

* [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter

- Introduce `SetMaxNRegCollector` to collect register hints from SetMaxNReg calls
- Modify `WarpSpecializedRewriter` to use collected register hints for producer and consumer code
- Add validation checks for register hint values in the collector
- Remove SetMaxNReg calls during code transformation
- Enhance flexibility of register allocation in warp specialized rewriting

* temporary remove check in lower_hopper_intrin

94c758ad

[Bugfix] Make quickstart work properly on cu118 (#193) · efb2b1d5
penguin_wwy authored Mar 12, 2025

efb2b1d5

[Bugfix] Fix `T.copy` for scalar datatypes (#190) · 454248c7

Lei Wang authored Mar 12, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

454248c7

11 Mar, 2025 1 commit

[Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug (#188) · fe0de672

Yu Cheng authored Mar 12, 2025

* [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug

- Implement two RMS normalization implementations in TileLang:
  * `rms_norm_splitk`: Split-K reduction approach for large matrices
  * `rms_norm`: Full reduction kernel with simplified implementation
- Add reference implementation using PyTorch for validation
- Include performance benchmarking for both kernel variants
- Demonstrate flexible block size and matrix size configurations

* [Examples] Simplify RMS Normalization Kernel Compilation

- Remove commented-out code for split-K RMS normalization
- Simplify kernel compilation by removing explicit TMA lowering configuration
- Update copyright header to Tile-AI Corporation
- Streamline main script for RMS normalization example

fe0de672

10 Mar, 2025 2 commits

[Bugfix] Improve Thread Variable Handling in Layout Inference (#179) · c39e540a

Lei Wang authored Mar 10, 2025

* [Refactor] Improve Thread Variable Handling in Layout Inference

- Update layout inference to handle thread variables more robustly
- Add explicit size check between infer_list_ and thread_var_vec_
- Modify thread variable access to use per-iteration thread variable
- Simplify thread predicate retrieval logic
- Add minor code cleanup and return variable assignment

* [Refactor] Update Layout Inference Copyright and Simplify Return Logic

- Replace Apache License header with Microsoft Corporation copyright notice
- Simplify LayoutInference function by directly returning substituted function
- Remove unnecessary variable assignment in return statement

* [Refactor] Update Layout Inference Copyright to Tile-AI Corporation

- Change copyright notice from Microsoft Corporation to Tile-AI Corporation
- Maintain existing file structure and licensing header

c39e540a

[Refactor] Enhance GPU Kernel Launch with Environment Thread Creation (#178) · 8ccf6ea2

Lei Wang authored Mar 10, 2025

- Introduce `CreateEnvThread` function to generate environment threads for GPU kernel launches
- Modify `KernelLaunch` to use `CreateEnvThread` for block and thread indices
- Improve thread variable naming with shorter, more descriptive identifiers (bx, by, bz, tx, ty, tz)
- Ensure proper thread environment setup within PrimFunc context

8ccf6ea2

09 Mar, 2025 2 commits

[Feat] Append Pass Context and TMA lowering configuration option (#175) · fb6b101c

Lei Wang authored Mar 09, 2025

* Add TMA lowering configuration option and update copyright notices

This commit introduces a new configuration option to disable TMA (Tensor Memory Access) lowering and updates copyright notices across multiple files. Key changes include:

- Add `kDisableTMALower` configuration option in builtin.h and builtin.cc
- Update copyright notices from Microsoft Corporation to Tile-AI Corporation
- Modify `LowerArgs` struct to include `disable_tma_lower` flag
- Update JIT compilation interfaces to support pass configuration
- Enhance error reporting in bulk copy lowering
- Propagate pass configuration through various adapter layers

* lint fix

fb6b101c

[Bugfix] Implement boundary check for the buffer shape with dynamic symbolic (#173) · 8344af52

Lei Wang authored Mar 09, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

* Add Native Sparse Attention Examples with Tilelang and Triton Implementations

- Introduce new example scripts for native sparse attention:
  * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
  * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
  * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
- Update reference.py with naive implementations for sparse attention
- Support different sparse attention scenarios including forward pass and inference
- Add comprehensive testing and validation against reference implementations

* lint fix

* Add Variable-Length Native Sparse Attention Examples for TileLang and Triton

- Introduce new example scripts for variable-length native sparse attention:
  * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
  * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
- Update reference.py to support variable-length sparse attention scenarios
- Enhance existing sparse attention implementations to handle variable-length inputs
- Add comprehensive testing and validation for variable-length sparse attention

* Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements

- Standardize function and parameter formatting across NSA example files
- Improve code readability by adjusting indentation and line breaks
- Enhance type hints and parameter alignment
- Remove unnecessary whitespaces and optimize imports
- Maintain consistent code style across TileLang and Triton implementations

* Add debug logging and extend execution backend in JIT and loop vectorization

- Add detailed logging in loop vectorization to help diagnose buffer shape handling
- Extend JIT execution backend to include 'cython' option
- Improve boundary condition checks in BufferLoadNode visit method

* Remove debug logging in loop vectorization BufferLoadNode visit method

- Remove unnecessary INFO log statements in VisitExpr_ method
- Simplify code by eliminating redundant logging
- Maintain core logic for handling buffer load node visits

8344af52