Commits · 137dab675f06540cca956b80fe913f7a4deb22de · OpenDAS / tilelang

11 Apr, 2025 2 commits

[Typo] Remove debug print (#373) · 137dab67

Lei Wang authored Apr 11, 2025

* [Enhancement] Add variable check in GlobalMemChecker for safe memory access validation

- Introduced a check in the GlobalMemChecker to determine if the index used in memory access has any variable components, enhancing the safety of memory access validation.
- Updated the condition handling in store operations to ensure that only boolean conditions are processed, improving type safety and error handling in memory operations.

* [Refactor] Rename VecAllocAccess to TLVecAllocAccess and enhance buffer access handling

- Renamed the `VecAllocAccess` class to `TLVecAllocAccess` for clarity in its purpose.
- Improved the handling of buffer access by mutating extents and rewriting access in the body, ensuring compatibility with vectorized operations.
- Added a TODO comment to suggest moving this pass to occur before StorageFlatten/FlattenBuffer for better optimization.
- Introduced a print statement in the phase optimization process for debugging purposes.

* lint fix

137dab67

[Language] Introduce `T.any_of` and `T.all_of` to reduce a bool arrary (#371) · c4638d65

Lei Wang authored Apr 11, 2025



* [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks

- Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
- Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework.
- Updated the `allocate.py` to handle boolean types correctly in shared memory allocations.
- Introduced tests for the new logical operations to ensure correctness and performance.
Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>

* lint fix

---------
Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>

c4638d65

08 Apr, 2025 1 commit

[Enhancement] Support pass config `disable_warp_specialize` to disable auto... · 7fdcedd0

Lei Wang authored Apr 08, 2025

[Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on hopper (#357)

* [Enhancement] Add warp specialization configuration option and update related functionality

* [Add] Introduced a new pass configuration option `kDisableWarpSpecialized` to control warp specialization behavior.
* [Refactor] Updated `WarpSpecializedRewriter` and `WSCodeEmitter` to utilize the new configuration option, allowing for more flexible optimization strategies.
* [Update] Modified the optimization pipeline in `phase.py` to include pipeline planning when warp specialization is disabled, enhancing performance with async copy.
* [Documentation] Updated JIT compilation parameters to reflect the new configuration option for better clarity.

* lint fix

* [Add] Implement test for GEMM with warp specialization configuration

* Introduced a new test file `test_tilelang_pass_config_disable_warp_specialized.py` to validate the functionality of the warp specialization configuration option.
* Added a `run_gemm` function to execute matrix multiplication tests with and without warp specialization, ensuring correctness through profiling against reference results.
* Included a specific test case for GEMM with float16 data types, enhancing test coverage for the new configuration feature.

* [Refactor] Improve formatting in test_tilelang_pass_config_disable_warp_specialized.py

* Reformatted the `tilelang.compile` call in the `run_gemm` function for better readability by breaking it into multiple lines.
* Added a blank line for improved code structure and clarity in the `test_gemm_f16f16f16_nn` function.

7fdcedd0

06 Apr, 2025 1 commit

[Enhancement] Support index bit width configuration (#343) · 70546adc

Lei Wang authored Apr 06, 2025

* [Refactor] Clean up whitespace in CUDA-related files

- Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
- This change enhances the overall organization of the codebase without altering functionality.

* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script

- Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
- The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
- Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
- The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.

* lint fix

* Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation

- Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
- Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
- Integrated the new transformation into the optimization pipeline in `phase.py`.
- Documented the new transformation in `__init__.py` for clarity.

* lint fix

* Add configuration option for index bitwidth and remove ExpandIndexDataType transformation

- Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
- Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
- Documented the new configuration option in the JIT compilation function's parameters for clarity.
- Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.

* lint fix

* Refactor index bitwidth configuration handling

- Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
- Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.

* lint fix

---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>

70546adc

04 Apr, 2025 1 commit

[Dynamic Symbolic] Adaptively vectorize with different condition expressions (#326) · 5ee58ec7

Zhengju Tang authored Apr 04, 2025



* [Dynamic Symbolic] Adaptively vectorize with different condition expressions

* Format

* Format

* Format

* Format

* Add MIT License headers to Python files

* Simplify return statement in loop vectorization

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

5ee58ec7

01 Apr, 2025 1 commit
- [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding (#315) · 3e859e67
  Yu Cheng authored Apr 01, 2025
  
  3e859e67
31 Mar, 2025 2 commits

[Bugfix] Fix layout conflict issue for gqa decoding examples (#314) · 0fd82ed5

Lei Wang authored Apr 01, 2025

* Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.

* Refactor flashattn example to improve CUDA configuration handling

- Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
- Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
- Adjusted the `do_bench` method call to streamline performance profiling.

* lint fix

0fd82ed5

[Bugfix] Fix dynamic axis with variable extent (#311) · c30904ea

Lei Wang authored Mar 31, 2025

* [Enhancement] Improve error message for RampNode in CUDA codegen

- Updated the error message in the VisitExpr_ method for RampNode to include the specific Ramp node and lane count when the lane count exceeds the limit of 4. This change enhances debugging by providing clearer context for the error.
- Refactored the loop vectorization logic in loop_vectorize_dynamic.cc to improve readability and maintainability, ensuring that dynamic vectorization checks are performed correctly and efficiently.

* lint fix

c30904ea

29 Mar, 2025 1 commit

[Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely (#302) · d3bf4fe1

Zhengju Tang authored Mar 29, 2025



* [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely

* lint fix

* update license

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

d3bf4fe1

28 Mar, 2025 1 commit

[Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061

Lei Wang authored Mar 28, 2025

* [Feature] Implement ParallelLoopTransformer for enhanced loop analysis

- Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
- Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
- Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
- Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.

* test fix

* Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.

* Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.

5c8de061

26 Mar, 2025 1 commit

[Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8

Yu Cheng authored Mar 26, 2025

- Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
- Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
- Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.

76435ca8

24 Mar, 2025 1 commit

[Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter (#269) · 2abd6ab7

Yu Cheng authored Mar 24, 2025

- Introduced TMAFinder and ProducerUsedBufferFinder classes to analyze TMA loads and identify buffers used in producer conditions.
- Enhanced WarpSpecializedRoleMarker to prepare and utilize the identified buffers during role marking.
- Updated VisitStmt methods to incorporate new analysis logic for IfThenElse and For nodes, improving the handling of TMA loads in the warp specialization process.

2abd6ab7

22 Mar, 2025 2 commits

[Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.

* Refactor CUDA post-processing callback registration in TileLang

- Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
- Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
- Added debug prints to the CUDA code generation process for better traceability during development.
- Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
- Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.

* lint fix

* Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters

- Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
- Implemented a function to generate various kernel configurations for autotuning.
- Refactored the main execution block to support both autotuned and default configurations.
- Improved the block mask generation to accommodate specified sparsity levels.
- Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.

f47b43c5

[Example] Implement Kernel Example cumsum (#258) · cd9ec62e

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.

cd9ec62e

21 Mar, 2025 1 commit

[Language] Introduce `T.alloc_var` to define a variable like `int var;` (#255) · c770a58f

Lei Wang authored Mar 22, 2025

* [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT

- Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters.
- Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing.
- Updated test cases to validate the new matrix multiplication implementations.
- Enhanced error handling in library initialization and compilation processes across various modules.
- Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting.

* lint fix

* optimize

* Support var defiine

* lint fix

* Update TVM submodule and add alloc_variable function to allocate local variables in TileLang

- Updated the TVM submodule to the latest commit.
- Introduced `alloc_variable` function in `allocate.py` to support local variable allocation with specified data types and scopes.

* lint fix

* Refactor variable allocation functions for consistency

- Renamed `alloc_variable` to `alloc_var` across multiple files for improved consistency.
- Updated corresponding test functions to reflect the new naming convention.
- Adjusted imports in `__init__.py` to align with the changes.

c770a58f

20 Mar, 2025 1 commit

[Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180

Lei Wang authored Mar 20, 2025

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
- Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.

* lint fix

* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage

f2e99180

18 Mar, 2025 2 commits

[Refactor] Refactor for Better Layout Conflict Handling (#240) · 2a286ae6

Lei Wang authored Mar 18, 2025

* [Feature] Add reduce_max functionality and corresponding tests

* Introduced a new test file for the reduce_max operation in the tilelang language module.
* Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying.
* Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation.
* Enhanced profiling assertions to validate the output against reference implementations.

* Fix whitespace issues in reduce_max test file for improved readability

* [Refactor] Update DebugOutput methods to return strings instead of void

* Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging.
* Updated corresponding header files to reflect the new return types.
* Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts.

* lint fix

* Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests.

* [Enhancement] Improve layout inference conflict handling in ParallelOp

* Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers.
* Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages.
* Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise.

* [Feature] Add IsEqual methods for layout comparison

* Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison.
* Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts.
* Improved error messages for layout conflicts to provide clearer guidance on potential issues.houm

* [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc

* Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement.
* Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity.
* Improved error messages in layout conflict checks to provide better guidance on potential issues.

* [Refactor] Clean up pointer formatting in layout inference files

* Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability.
* Minor adjustments to error message formatting in layout conflict checks for better clarity.

2a286ae6

[BugFix] Fix bug of missing MBarrierExpectTX (#241) · 45534789
Yu Cheng authored Mar 18, 2025

45534789

17 Mar, 2025 1 commit

[Bugfix] Disable force inline for ldmatrix (#227) · a1da26f2

Lei Wang authored Mar 17, 2025

* Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture

- Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
- Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
- Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
- Include necessary header files for built-in operations and layout inference improvements.

* Refactor parameter formatting in CUDA matrix load functions for consistency

- Adjusted parameter alignment in `ptx_ldmatrix_x1`, `ptx_ldmatrix_x2`, `ptx_ldmatrix_x4`, and their transposed counterparts for improved readability.
- Added a blank line in `get_tensor_supply` function in `tensor.py` to enhance code clarity.

* Enhance tensor supply generation in `get_tensor_supply` function

- Introduced handling for unsigned integer and float8 tensor types, allowing for specific random tensor generation based on data type.
- Updated logic to return appropriate random tensors for different data types, improving flexibility and functionality of tensor supply generation.
- Refactored existing conditions for clarity and maintainability.

* Fix tensor supply generation logic in `get_tensor_supply` function

- Updated the variable reference from `tensor` to `param` to ensure correct handling of tensor data types.
- Improved the accuracy of unsigned integer and float8 checks for tensor supply generation, enhancing functionality and reliability.

* Enhance tensor supply checks in `get_tensor_supply` function

- Updated the logic for identifying unsigned integers and float8 types by using `removeprefix` on the dtype string, improving accuracy in tensor supply generation.
- Ensured better handling of tensor data types for more reliable random tensor generation based on the updated checks.

* Enhance KernelParam functionality and improve tensor supply checks

- Added methods `is_unsigned` and `is_float8` to the `KernelParam` class for better type identification of parameters.
- Updated the `get_tensor_supply` function to utilize the new methods, improving clarity and accuracy in tensor supply generation based on parameter types.

a1da26f2

14 Mar, 2025 2 commits

[Dev] Implement IfStmtBinding and MergeIfStmt transformations (#211) · 86f96f8a

Yu Cheng authored Mar 14, 2025



* [Dev] Implement IfStmtBinding and MergeIfStmt transformations

- Add IfStmtBinding to bind If statements to each statement in SeqStmt, enhancing the handling of conditional statements.
- Introduce MergeIfStmt to merge consecutive If statements within SeqStmt, optimizing the structure of conditional logic.
- Update phase.py to apply IfStmtBinding and MergeIfStmt transformations for the "sm_90" target.
- Enhance __init__.py with new functions for IfStmtBinding and MergeIfStmt, providing a clear interface for these transformations.

* Update license header in if_stmt_binding.cc

* Update license header in merge_if_stmt.cc

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

86f96f8a

[Enhancement] Avoid tvm ffi handling when out_idx is specified (#209) · 227ed7ec

Lei Wang authored Mar 14, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

* Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc

- Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
- Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.

* Refactor condition handling in inject_pipeline.cc

- Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
- Update condition comparison logic to use StructuralEqual for better accuracy.
- Enhance logging to provide detailed insights into condition changes and statement processing.
- Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.

* Improve logging and formatting in inject_pipeline.cc

- Enhance logging statements for better clarity on condition changes and statement processing.
- Adjust formatting for improved readability, including line breaks and consistent spacing.
- Ensure accurate condition comparison and handling in the pipeline logic.

* Refactor logging and clean up inject_pipeline.cc

- Remove excessive logging statements to streamline the code and improve performance.
- Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
- Maintain the core functionality while enhancing code readability and maintainability.

* Update Dockerfiles to specify exact version of libstdcxx-ng

- Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution.
- Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds.

* Refactor and enhance examples and kernel handling

- Adjusted the pipeline stages in `example_blocksparse_gemm.py` from 2 to 1 for improved performance.
- Added kernel source printing in `benchmark_nsa_fwd.py` for better debugging and profiling insights.
- Updated tensor allocation and parameter handling in `CtypesKernelAdapter` and `CythonKernelWrapper` to cache parameter dtypes and shapes, improving efficiency and clarity.
- Enhanced the handling of dynamic shapes in the Cython JIT kernel compilation process.
- Modified the benchmark script to accommodate new tensor output parameters and improved batch size defaults for testing.

* Update copyright header in Cython wrapper to reflect Tile-AI Corporation

* revert change

227ed7ec

13 Mar, 2025 2 commits

[Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5

zqh-wz authored Mar 13, 2025



* upgrade cutlass to upstream v3.8.0

* Implement fp8 gemm and add example script

* Fix dtype retrieval with map_torch_type for fp8 inputs

* Disable vectorization of fp8 values

* Make MMA declaration compatible with cutlass 3.4.0+

* Add test for fp8 T.gemm

* fix indent

* fix indent

* Add copyright and license header

* Add copyright and license header

* lint fix

* Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.

* clang format lint

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

2cccf1f5

[Enhancement] Enhancing the handling of conditional statements in the pipeline (#201) · dda8ebff

Lei Wang authored Mar 13, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

* Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc

- Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes.
- Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline.

* Refactor condition handling in inject_pipeline.cc

- Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity.
- Update condition comparison logic to use StructuralEqual for better accuracy.
- Enhance logging to provide detailed insights into condition changes and statement processing.
- Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements.

* Improve logging and formatting in inject_pipeline.cc

- Enhance logging statements for better clarity on condition changes and statement processing.
- Adjust formatting for improved readability, including line breaks and consistent spacing.
- Ensure accurate condition comparison and handling in the pipeline logic.

* Refactor logging and clean up inject_pipeline.cc

- Remove excessive logging statements to streamline the code and improve performance.
- Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing.
- Maintain the core functionality while enhancing code readability and maintainability.

dda8ebff

12 Mar, 2025 5 commits

[Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b

Lei Wang authored Mar 13, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

* Enhance block sparse attention implementation

- Update `blocksparse_flashattn` to use 2 stages for improved performance.
- Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
- Modify condition checks in the kernel to utilize boolean values.
- Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
- Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.

* Refactor and clean up code formatting across multiple files

- Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
- Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
- Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
- Ensured consistent formatting and style across the codebase.

7ccec53b

[CMake] Add CUDA Major Version Detection for Conditional Compilation (#197) · 20f19611

Yu Cheng authored Mar 12, 2025

* [Feature] Add TMA Store Synchronization Support

- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
- Add new builtin operations in op/builtin.cc and op/builtin.h
- Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
- Update CUDA codegen to support new TMA store synchronization intrinsics
- Add Python language bindings for new TMA store synchronization operations

* [CMake] Add CUDA Major Version Detection for Conditional Compilation

- Introduce CUDA_MAJOR_VERSION CMake variable to dynamically detect CUDA toolkit version
- Update runtime and transform files to use CUDA_MAJOR_VERSION for version-specific code paths
- Replace hardcoded __CUDACC_VER_MAJOR__ with dynamically set CUDA_MAJOR_VERSION
- Improve cross-version compatibility for CUDA-dependent code sections

20f19611

[Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a

Yu Cheng authored Mar 12, 2025

- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
- Add new builtin operations in op/builtin.cc and op/builtin.h
- Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
- Update CUDA codegen to support new TMA store synchronization intrinsics
- Add Python language bindings for new TMA store synchronization operations

eba7dd5a

[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp... · 94c758ad

Yu Cheng authored Mar 12, 2025

[Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter (#194)

* [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter

- Introduce `SetMaxNRegCollector` to collect register hints from SetMaxNReg calls
- Modify `WarpSpecializedRewriter` to use collected register hints for producer and consumer code
- Add validation checks for register hint values in the collector
- Remove SetMaxNReg calls during code transformation
- Enhance flexibility of register allocation in warp specialized rewriting

* temporary remove check in lower_hopper_intrin

94c758ad

[Bugfix] Make quickstart work properly on cu118 (#193) · efb2b1d5
penguin_wwy authored Mar 12, 2025

efb2b1d5

10 Mar, 2025 1 commit

[Bugfix] Improve Thread Variable Handling in Layout Inference (#179) · c39e540a

Lei Wang authored Mar 10, 2025

* [Refactor] Improve Thread Variable Handling in Layout Inference

- Update layout inference to handle thread variables more robustly
- Add explicit size check between infer_list_ and thread_var_vec_
- Modify thread variable access to use per-iteration thread variable
- Simplify thread predicate retrieval logic
- Add minor code cleanup and return variable assignment

* [Refactor] Update Layout Inference Copyright and Simplify Return Logic

- Replace Apache License header with Microsoft Corporation copyright notice
- Simplify LayoutInference function by directly returning substituted function
- Remove unnecessary variable assignment in return statement

* [Refactor] Update Layout Inference Copyright to Tile-AI Corporation

- Change copyright notice from Microsoft Corporation to Tile-AI Corporation
- Maintain existing file structure and licensing header

c39e540a

09 Mar, 2025 2 commits

[Feat] Append Pass Context and TMA lowering configuration option (#175) · fb6b101c

Lei Wang authored Mar 09, 2025

* Add TMA lowering configuration option and update copyright notices

This commit introduces a new configuration option to disable TMA (Tensor Memory Access) lowering and updates copyright notices across multiple files. Key changes include:

- Add `kDisableTMALower` configuration option in builtin.h and builtin.cc
- Update copyright notices from Microsoft Corporation to Tile-AI Corporation
- Modify `LowerArgs` struct to include `disable_tma_lower` flag
- Update JIT compilation interfaces to support pass configuration
- Enhance error reporting in bulk copy lowering
- Propagate pass configuration through various adapter layers

* lint fix

fb6b101c

[Bugfix] Implement boundary check for the buffer shape with dynamic symbolic (#173) · 8344af52

Lei Wang authored Mar 09, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

* Add Native Sparse Attention Examples with Tilelang and Triton Implementations

- Introduce new example scripts for native sparse attention:
  * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
  * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
  * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
- Update reference.py with naive implementations for sparse attention
- Support different sparse attention scenarios including forward pass and inference
- Add comprehensive testing and validation against reference implementations

* lint fix

* Add Variable-Length Native Sparse Attention Examples for TileLang and Triton

- Introduce new example scripts for variable-length native sparse attention:
  * example_tilelang_nsa_fwd_varlen.py: TileLang implementation with variable sequence lengths
  * example_triton_nsa_fwd_varlen.py: Triton implementation with variable sequence lengths
- Update reference.py to support variable-length sparse attention scenarios
- Enhance existing sparse attention implementations to handle variable-length inputs
- Add comprehensive testing and validation for variable-length sparse attention

* Refactor Native Sparse Attention Examples: Code Style and Formatting Improvements

- Standardize function and parameter formatting across NSA example files
- Improve code readability by adjusting indentation and line breaks
- Enhance type hints and parameter alignment
- Remove unnecessary whitespaces and optimize imports
- Maintain consistent code style across TileLang and Triton implementations

* Add debug logging and extend execution backend in JIT and loop vectorization

- Add detailed logging in loop vectorization to help diagnose buffer shape handling
- Extend JIT execution backend to include 'cython' option
- Improve boundary condition checks in BufferLoadNode visit method

* Remove debug logging in loop vectorization BufferLoadNode visit method

- Remove unnecessary INFO log statements in VisitExpr_ method
- Simplify code by eliminating redundant logging
- Maintain core logic for handling buffer load node visits

8344af52

28 Feb, 2025 2 commits

[Dev] Remove buffer flatten when debug print a shared buffer (#129) · 20bbb91a

Lei Wang authored Feb 28, 2025

* Add DeepSeek MLA decode example with Flash Attention implementation

* Add GEMM SplitK and StreamK example implementations

This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
- `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
- `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang

Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.

* Refactor GEMM SplitK and StreamK example implementations

Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
- Remove unused import (Profiler) in splitk example
- Simplify line breaks and improve code readability
- Standardize indentation and remove unnecessary whitespace
- Optimize atomic add and copy operations for better clarity

* Add block sparse attention benchmarks for multiple libraries

This commit introduces comprehensive block sparse attention benchmarks for different libraries:
- TileLang block sparse FMHA implementation
- Triton block sparse FMHA implementation
- PyTorch reference block sparse FMHA implementation
- FlashAttention dense FMHA reference implementation

The benchmarks include:
- Configurable benchmark parameters (batch size, heads, sequence length, etc.)
- Sparse mask generation using top-k and threshold methods
- Performance measurement for different sparse attention configurations
- Utility functions for mask generation and benchmarking

* Refactor block sparse attention benchmarks with code style improvements

- Add Ruff linter ignore comments to benchmark files
- Improve code formatting and line breaks
- Remove unused imports
- Standardize print statement formatting
- Enhance code readability across multiple library benchmarks

* lint fix

* Add CUDA atomic operations for BFLOAT16 and update function naming

- Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
- Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
- Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
- Update kernel and language customization to use new function names
- Add return type annotations in profiler module

* lint fix

* Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang

This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
- Group Query Attention (GQA) implementation
- Flash Attention forward pass
- Performance benchmarking
- Configurable parameters for batch, heads, sequence length, and dimension
- Autotuning support
- Reference implementation comparison

* Refactor IR lowering pipeline into modular phases

This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
- `LowerAndLegalize`: Handles initial IR legalization and transformation
- `OptimizeForTarget`: Applies target-specific optimizations

The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.

* lintfix

* nas kernel

* Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates

- Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
- Increased default number of heads and selected blocks in TileLang NSA example
- Added Ruff linter ignore comments to reference.py
- Standardized function signatures and improved code readability across NSA implementations

* Add utility math functions for integer operations

- Implement `next_power_of_2()` to calculate the next power of 2 for an integer
- Add `cdiv()` function for ceiling division of integers

* Add utility math functions for integer operations

- Implement `next_power_of_2()` to calculate the next power of 2 for an integer
- Add `cdiv()` function for ceiling division of integers

* Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation

- Update flash attention kernel to support positional embeddings (PE)
- Modify reference implementation to handle PE and group query attention
- Increase default batch size and adjust benchmarking parameters
- Improve kernel performance and readability
- Add einops and torch operations for more flexible tensor manipulation

* Update README.md with corrected Flash MLA Decoding example path

- Modify the example link for Flash MLA Decoding to point to the correct directory
- Ensure accurate navigation to the DeepSeek MLA decoding example

* Refactor Native Sparse Attention Kernel and Improve Utility Functions

This commit introduces several improvements:
- Simplified native sparse attention kernel by inlining macro functions in example_tilelang_nsa.py
- Enhanced error handling in loop_partition.cc with more informative error messages
- Updated print.py to support multi-dimensional buffer printing
- Improved torch_assert_close in testing/__init__.py with more detailed mismatch reporting
- Reduced default absolute tolerance in torch comparison from 1e-3 to 1e-2
- Added shape validation and detailed mismatch information in tensor comparison

* Refactor Code Formatting and Improve Utility Functions

This commit introduces several code formatting and utility improvements:
- Add Ruff linter ignore comment in example_tilelang_nsa.py
- Enhance code readability in loop_partition.cc and lower_tile_op.cc with improved line breaks
- Simplify print_flat_buffer_with_condition in print.py
- Refactor torch_assert_close in testing/__init__.py with improved line formatting

20bbb91a

[Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA... · 0d873fcf

Yu Cheng authored Feb 28, 2025

[Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example (#128)

* [Dev] Add RetNet Linear Attention example

* [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)

This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:

- Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
- Added pass registration for `RewriteWgmmaSync`
- Updated `tilelang/engine/phase.py` to include the new transformation pass
- Updated `tilelang/transform/__init__.py` to expose the new pass

The rewriter intelligently manages synchronization and dependencies between WGMMA operations, improving pipeline efficiency for complex matrix multiplication kernels.

* [Bugfix] Fix bug in ThreadTagChecker for warp specialization

Improve thread tag validation in warp specialized rewriter to prevent unintended transformations:
- Add more precise checks for threadIdx.y and threadIdx.z
- Validate thread extent to ensure only single-extent thread bindings are allowed
- Prevent warp specialization for multi-extent thread bindings in y and z dimensions

* lint

* [CI] Add TMA descriptor attribute to transformed module in test case

0d873fcf

27 Feb, 2025 1 commit

[JIT] Enhance cython/ctypes wrapper for tma descriptor (#126) · 7b74bb01

Lei Wang authored Feb 27, 2025



* refactor code

* enhance tutorial

* Enhance error handling and code generation in CUDA and TileLang components

This commit introduces several improvements across multiple files:
- Added more informative error messages in GEMM layout checks
- Updated CUDA codegen to support more flexible function signature generation
- Improved TMA descriptor initialization and kernel dispatch logic
- Refined library generation and source code parsing utilities
- Enhanced error handling in various adapter and wrapper classes

* Add thread tag validation for warp specialization

Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization.

* Update TileLang Profiling and Compilation in Flash Decoding Examples

Refactor the profiling and compilation workflow in two flash decoding example scripts:
- Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()`
- Simplify profiler initialization using `get_profiler()`
- Update method calls to use the new profiler and compiled kernel objects
- Maintain existing performance benchmarking and validation logic

* Refactor and clean up code formatting in TileLang testing and adapter modules

This commit includes several code style and formatting improvements:
- Adjust whitespace and line breaks in test files
- Improve code formatting in CUDA source wrapper and adapter utilities
- Enhance readability of function calls and argument handling
- Remove unnecessary whitespace and standardize indentation
- Simplify function signatures and argument parsing

* Refactor CUDA codegen and improve code formatting

This commit includes several improvements to CUDA code generation and formatting:
- Enhance function signature generation in CodeGenTileLangCUDA
- Improve code formatting and readability in CUDA-related files
- Simplify parameter handling and type annotations
- Clean up whitespace and line breaks in codegen and layout files

---------
Co-authored-by: Ubuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>

7b74bb01

15 Feb, 2025 1 commit

[Backend][WebGPU] Support WebGPU WGSL code generation (#86) · c8fc0cbb

Lei Wang authored Feb 15, 2025

* bump version into v0.1.0

* [Enhancement] Add custom develop command for editable installs and update .gitignore

* [Documentation] Update README to include system dependencies installation instructions

* [Build] Update setup.py to support library file copying for both release and develop modes

* [Build] Refactor library file copying logic in setup.py

* [Documentation] Remove unnecessary install section header in Installation.md

* [Build] Add tox configuration and local distribution script for multi-Python version support

* [Build] Improve git submodule update function with better error handling

* [Build] Update LLVM configuration path in ROCm installation script

* [Build] Add .tox/ to .gitignore for tox testing environment

* [Build] Add support for TVM prebuild path configuration in CMakeLists.txt

* [Cleanup] Remove unused TVM runtime error codes header

* [Cleanup] Fix TVM grid constant type reference in CUDA module

* [Cleanup] Remove unused customized_code function from IR module

* [Feature] Add TileLang thread synchronization and storage access analysis passes

* [Build] Reorder DLL search path directories for more flexible library loading

* [Refactor] Improve thread synchronization and library path handling

- Rename ThreadSync and TileLangThreadSync functions in C++ code
- Update Python docstring for ThreadSync with more detailed description
- Reorder library path detection in tilelang environment setup
- Minor comment and code cleanup in CUDA and warp specialization modules

* [Refactor] Improve thread synchronization code style and formatting

- Standardize pointer type spacing in storage_access.h and storage_access.cc
- Update whitespace and indentation in thread_storage_sync.cc
- Reorder include statements in thread_partial_sync.cc
- Minor code formatting improvements across thread synchronization files

* [Refactor] Fix global function registration for ThreadSync

- Correct global function registration to use ThreadSync instead of TileLangThreadSync
- Update TVM global registration to match recent refactoring efforts

* [Refactor] Simplify ThreadSync global function registration

- Remove unnecessary whitespace in global function registration
- Compact the TVM global registration line for ThreadSync

* [Feature] Add WebGPU code generation support in TileLang

- Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
- Add WebGPU target support in lower.py and target.py
- Update CMakeLists.txt to include WebGPU codegen source files
- Introduce WebGPU-specific code generation for WGSL shader language

* [Refactor] Improve WebGPU code generation formatting and readability

- Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
- Standardize pointer type spacing and indentation
- Improve line breaks and reduce line length for better readability
- Minor code style improvements in WebGPU code generation

* [Test] Add WebGPU matrix multiplication code generation test

- Implement test_webgpu_codegen.py for WebGPU matrix multiplication
- Add assert_gemm_codegen function to validate WebGPU code generation
- Include basic matrix multiplication kernel test case

* Update README with WebGPU codegen support announcement

c8fc0cbb

14 Feb, 2025 1 commit

[Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm (#85) · ec84188f

Lei Wang authored Feb 14, 2025

* bump version into v0.1.0

* [Enhancement] Add custom develop command for editable installs and update .gitignore

* [Documentation] Update README to include system dependencies installation instructions

* [Build] Update setup.py to support library file copying for both release and develop modes

* [Build] Refactor library file copying logic in setup.py

* [Documentation] Remove unnecessary install section header in Installation.md

* [Build] Add tox configuration and local distribution script for multi-Python version support

* [Build] Improve git submodule update function with better error handling

* [Build] Update LLVM configuration path in ROCm installation script

* [Build] Add .tox/ to .gitignore for tox testing environment

* [Build] Add support for TVM prebuild path configuration in CMakeLists.txt

* [Cleanup] Remove unused TVM runtime error codes header

* [Cleanup] Fix TVM grid constant type reference in CUDA module

* [Cleanup] Remove unused customized_code function from IR module

* [Feature] Add TileLang thread synchronization and storage access analysis passes

* [Build] Reorder DLL search path directories for more flexible library loading

* [Refactor] Improve thread synchronization and library path handling

- Rename ThreadSync and TileLangThreadSync functions in C++ code
- Update Python docstring for ThreadSync with more detailed description
- Reorder library path detection in tilelang environment setup
- Minor comment and code cleanup in CUDA and warp specialization modules

* [Refactor] Improve thread synchronization code style and formatting

- Standardize pointer type spacing in storage_access.h and storage_access.cc
- Update whitespace and indentation in thread_storage_sync.cc
- Reorder include statements in thread_partial_sync.cc
- Minor code formatting improvements across thread synchronization files

* [Refactor] Fix global function registration for ThreadSync

- Correct global function registration to use ThreadSync instead of TileLangThreadSync
- Update TVM global registration to match recent refactoring efforts

* [Refactor] Simplify ThreadSync global function registration

- Remove unnecessary whitespace in global function registration
- Compact the TVM global registration line for ThreadSync

ec84188f

03 Feb, 2025 1 commit

[Dev] Separate `LoopVectorize` Pass from upstream tvm (#62) · 7111239d

Lei Wang authored Feb 04, 2025

* [Enhancement] Add VectorizeLoop function and update imports for compatibility

* [CI][Test] Improve test cases for vectorization and fix typos in parser comments

* lint fix

* Fix incorrect module reference for VectorizeLoop transformation

* Refactor vectorize_loop transformation by removing unused extent mutation logic

7111239d

23 Jan, 2025 1 commit

[Bugfix] Replace thread binding detector in LayoutInference Pass (#31) · 34e0883d

Lei Wang authored Jan 23, 2025

* [Refactor] Rename AllocateCollector to ThreadBindingCollector and streamline thread binding logic

* [Refactor] Adjust formatting in ThreadBindingCollector for consistency

* [Refactor] Enhance clang-tidy check to handle cases with no changed C/C++ files

* [Refactor] Remove clang-tidy checks from format script to streamline formatting process

34e0883d

17 Jan, 2025 1 commit
- [CPU] Support CPU Code generation (#17) · 913d14f2
  Lei Wang authored Jan 18, 2025
```
* README.md fixed

* test fix

* cpu backend update

* cpu test case
```
  913d14f2
11 Jan, 2025 1 commit
- [Lint] Overall Typo and Linting Fixes (#13) · fa511857
  Lei Wang authored Jan 11, 2025
```
* README.md fixed

* update test ci

* Lint and Typo Fix

* Clang Format Lint Fix
```
  fa511857