Commits · 7e8d1f827369040b2c65bed65227a03333e018be · OpenDAS / tilelang

20 Dec, 2025 1 commit

[Enhancement] Enhance let binding handling in layout inference and warp specialized pass (#1484) · 7e8d1f82

Lei Wang authored Dec 21, 2025

* [Feature] Add FullyReplicated Fragment Layout and Enhance Layout Inference

* Introduced a new static method `FullyReplicated` in the `Fragment` class to create fully replicated fragment layouts, ensuring all threads hold identical copies of the buffer.
* Updated `CopyNode` to collect fragment layouts and mark them as fully replicated during layout inference.
* Enhanced `ParallelOpNode` to expand let bindings for fragment buffer accesses, improving layout inference accuracy.
* Added documentation for new methods and updated existing methods to support the new layout features.
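
As a rough mental model of what "fully replicated" means here, the sketch below treats a fragment layout as a function from an element index to the set of threads holding a copy. This is plain Python, not the tilelang API: `fully_replicated`, `striped`, and `is_fully_replicated` are hypothetical helpers introduced only for illustration.

```python
from typing import Callable, Set

# Hypothetical model: a layout maps a flat element index to the set of
# thread ids that hold a copy of that element.
Layout = Callable[[int], Set[int]]


def fully_replicated(num_threads: int) -> Layout:
    """Every thread owns every element (illustrative stand-in only)."""
    all_threads = set(range(num_threads))
    return lambda index: set(all_threads)


def striped(num_threads: int) -> Layout:
    """Contrast case: element i lives on exactly one thread."""
    return lambda index: {index % num_threads}


def is_fully_replicated(layout: Layout, num_elems: int, num_threads: int) -> bool:
    """Check that each element is held by all threads."""
    everyone = set(range(num_threads))
    return all(layout(i) == everyone for i in range(num_elems))
```

With four threads and eight elements, the replicated layout passes the check while the striped one does not.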

* lint fix

* Remove debug logging statements from layout inference process to streamline output and improve performance.


10 Dec, 2025 1 commit

[Enhancement] Add debug output methods for Layout and Fragment classes (#1392) · e7e4e65b

Kuris authored Dec 10, 2025

05 Dec, 2025 1 commit

[Layout] Enhance Free Layout Inference (#1375) · 6654064d

Lei Wang authored Dec 05, 2025

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py

* [Enhancement] Extend support for float8 data types in GEMM operations

- Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
- Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
- Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.

* lint fix

* [Enhancement] Add injective layout detection and exception handling

- Introduced `DetectInjective` method in `FragmentNode` to check for injective layouts.
- Added `LoopLayoutInjectiveException` to handle errors related to non-injective layouts.
- Updated `InferLayout` methods in `ParallelOpNode` to utilize injective checks and log relevant information.
- Refactored layout inference queue management to use `std::deque` for improved performance and added prioritization logic for buffer layouts.
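
To make the idea of an injective layout concrete, here is a minimal brute-force sketch. It assumes a layout is a Python function from a loop index tuple to a `(thread, local_slot)` pair and that all extents are constants; it is not how `DetectInjective` is implemented, and `is_injective` is a hypothetical name.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Index = Tuple[int, ...]


def is_injective(layout: Callable[[Index], Tuple[int, int]],
                 shape: Sequence[int]) -> bool:
    """Return True if no two loop indices map to the same (thread, slot)."""
    seen = set()
    for idx in product(*(range(extent) for extent in shape)):
        target = layout(idx)
        if target in seen:
            return False  # two iterations would write the same location
        seen.add(target)
    return True


# Injective: each (i, j) gets its own (thread=j, slot=i) pair.
good = lambda idx: (idx[1], idx[0])
# Not injective: columns j and j + 4 collide on the same thread and slot.
bad = lambda idx: (idx[1] % 4, idx[0])
```

A brute-force check like this only works when every extent is a known integer.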

* remove debug print

* minor layout fix

* fix for T.view

* [Enhancement] Improve injective layout detection in FragmentNode

- Updated the `DetectInjective` method to handle symbolic dimensions more effectively by introducing a mechanism to collect symbolic shapes and adjust the detection level accordingly.
- Added logging for cases where the layout detection falls back to NoCheck due to symbolic dimensions.
- Minor update to the test file to include the tilelang testing module.

* [Refactor] Simplify layout inference for bulk copy operations

- Removed unnecessary conditions for bulk load/store operations in the layout inference logic.
- Streamlined the handling of layout application for bulk copy instances to enhance clarity and maintainability.

* remove debug print

* [Enhancement] Introduce layout-related exceptions and improve error handling

- Added `LayoutConflictException` and `LoopLayoutInjectiveException` classes for better exception management in layout operations.
- Updated `InferLayout` method in `ParallelOpNode` to throw `LoopLayoutInjectiveException` with detailed error information when injective layout checks fail.
- Removed redundant exception class definitions from `parallel.h` to streamline code organization.


13 Nov, 2025 1 commit

[Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape (#1248) · d7164abf

Lei Wang authored Nov 13, 2025

* fix

* Refactor tensor reshaping in fp8_lighting_indexer.py

- Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
- Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.

* Refactor analyzer usage in Layout and Fragment reshaping

- Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
- Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.


12 Nov, 2025 1 commit

[Enhancement] Support Layout/Fragment Reshape (#1241) · 4370309b

Lei Wang authored Nov 12, 2025



* Update layout handling and introduce reshape functionality

- Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes.
- Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents.
- Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations.
- Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios.
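
One way to picture reshaping a layout is to linearise the new index in row-major order, recover the index in the old shape, and then apply the original layout. The sketch below follows that idea in plain Python under the assumption of constant shapes; `reshape_layout`, `flatten`, and `unflatten` are hypothetical helpers, not the `Reshape` method itself.

```python
from math import prod
from typing import Callable, Sequence, Tuple

Index = Tuple[int, ...]


def flatten(idx: Index, shape: Sequence[int]) -> int:
    """Row-major linear offset of idx within shape."""
    linear = 0
    for i, extent in zip(idx, shape):
        linear = linear * extent + i
    return linear


def unflatten(linear: int, shape: Sequence[int]) -> Index:
    """Inverse of flatten for the given shape."""
    out = []
    for extent in reversed(shape):
        out.append(linear % extent)
        linear //= extent
    return tuple(reversed(out))


def reshape_layout(layout: Callable[[Index], Index],
                   old_shape: Sequence[int],
                   new_shape: Sequence[int]) -> Callable[[Index], Index]:
    """Build a layout over new_shape that reuses layout over old_shape."""
    if prod(old_shape) != prod(new_shape):
        raise ValueError("reshape must preserve the number of elements")

    def reshaped(new_idx: Index) -> Index:
        return layout(unflatten(flatten(new_idx, new_shape), old_shape))

    return reshaped


# Example: a transposing layout on a (4, 8) buffer viewed as a flat (32,) buffer.
transpose = lambda idx: (idx[1], idx[0])
flat_view = reshape_layout(transpose, (4, 8), (32,))
```

Here `flat_view((13,))` first recovers the old index `(1, 5)` and then applies the transpose.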

* lint fix

* Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged

* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1

* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1

* remove useless prove

* remove comment

---------
Co-authored-by: tilelang-bot <bot@tilelang>


11 Nov, 2025 1 commit

[Refactor] Simplify logic in the `CompleteBufferFragment` (#1226) · 7045f1d6

Lei Wang authored Nov 11, 2025



* fix

* Fix logging level in LayoutNode::InverseWithLevel method from WARNING to DLOG for symbolic layout fallback.

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>


10 Nov, 2025 1 commit

[Bugfix] Improve error handling in LayoutNode::InverseWithLevel (#1215) (#1220) · cf46b7bd

Lei Wang authored Nov 10, 2025

* Added logging and exception handling for layout errors in InverseWithLevel method.
* Replaced direct error check with a throw statement to enhance error reporting and debugging capabilities.


05 Nov, 2025 1 commit

[SM70] Refactor and minor fix for SM70 (#1195) · 4a9cb470

Lei Wang authored Nov 06, 2025

* [Feature] Add support for SM70 tensor core MMA instructions

- Introduced new intrinsic `ptx_mma_sm70` for Volta GPUs, enabling m16n16k4 shape with FP16 inputs and FP16/FP32 accumulation.
- Added `GemmMMASm70` class for handling GEMM operations specific to SM70 architecture.
- Implemented layout functions for Volta swizzled layouts and updated existing GEMM layout inference logic.
- Updated `requirements-dev.txt` to include `apache-tvm-ffi` dependency.
- Added correctness evaluation script for testing GEMM operations on SM70.

* [Refactor] Update formatting and installation commands in scripts

- Modified `format.sh` to install `pre-commit` and `clang-tidy` with the `--user` flag for user-specific installations.
- Improved readability in `correctness_evaluation_sm70.py` by adjusting the formatting of pytest parameters.
- Cleaned up spacing and formatting in various C++ source files for better consistency and readability.
- Removed unnecessary comments and improved layout function definitions in `mma_sm70_layout.py` and `mma_sm70_macro_generator.py` for clarity.
- Ensured consistent formatting in layout initialization and swizzle functions.

* typo fix


02 Nov, 2025 1 commit

[Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb

Lei Wang authored Nov 03, 2025



* remove debug print

* pipeline fix

* use the correct buffer access scope

* rs support

* warp warpgroup_fence_operand

* fix

* fp8 dtype ptx enhance

* mma fix

* TCGEN05 Interface

* tcgen05 support

* rebase

* update

* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.

* lint fix

* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.

* wgmma fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>


31 Oct, 2025 1 commit

[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28

Lei Wang authored Oct 31, 2025



* 3rdparty tvm bump

* bump tvm into v0.22.0

* lint fix

* rebase tvm

* Update submodule tvm to latest commit 3085bc4

* Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang

* test fix

* add requirement

* atomic_fix

* atomic_fix

* phaseout py39

* optimize

* optimize

* lint fix

* do not clean cache

* do not clean cache

* [Minor] Minor update for Python versions and dependencies

* [Lint] fix lint for py39

* [Lint] fix lint for ROCm

* [Build][CI] Sync CI changes from upstream/sdist

* [Lint] fix lint for ROCm

* [Build][CI] Update `repair-wheel-command`

* [Minor] update abi3audit result format

* [Lint] fix lint for ROCm

* [BugFix] fix build

* [Lint] fix lint for ROCm

* [BugFix] set rpath for libtvm and libtvm_runtime

* [Deps] pin apache-tvm-ffi version

* [Build] set Python 3.9 Limited API for Cython target

* [Build] set Python 3.9 Limited API for Cython target

* [Deps] Restore Python 3.8 support

* [Build] use `apache-tvm-ffi`'s `libtvm_ffi`

* [BugFix] use `;` as delimiter for RPATH on macOS

* [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`

* [Build] support `sccache` if available

* [Build] add CIBW import test

* [Build][CI] enable ccache for CIBW on Linux

* [BugFix] set rpath for libtvm and libtvm_runtime

* Revert "[Build][CI] enable ccache for CIBW on Linux"

This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.

* [CI] fix perfbench bot

* [BugFix] use Python 3.9 to build wheel

* [Minor] update perfbench bot envs

* [BugFix] fix CIBW environment on Linux

* [CI] skip import test on CentOS 7

* [CI] use Python urllib to download file instead of Wget

---------
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>


20 Oct, 2025 1 commit

[Parallel] Support `T.Parallel` with dynamic extents (#990) · 27701c3d

Lei Wang authored Oct 20, 2025

* Allow dynamic extents in loop partition; warn when layout inversion falls back to NoCheck

* add test and introduce predicate

* test fix

* fix

* enhance

* inverse with level

* test fix

* bug fix
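
The dynamic-extent loop partition with a predicate mentioned in this commit can be pictured as follows. When the extent is only known at run time, each thread runs a fixed number of rounds and a predicate masks off iterations past the end. This is a conceptual Python sketch under that assumption, with the hypothetical name `partition_loop`; it does not reproduce the actual pass.

```python
from typing import Dict, List


def partition_loop(extent: int, num_threads: int) -> Dict[int, List[int]]:
    """Return, per thread, the loop iterations it executes."""
    rounds = -(-extent // num_threads)  # ceiling division
    work: Dict[int, List[int]] = {t: [] for t in range(num_threads)}
    for t in range(num_threads):
        for k in range(rounds):
            i = k * num_threads + t
            if i < extent:  # predicate guarding the ragged tail
                work[t].append(i)
    return work
```

With `extent=10` and four threads, threads 0 and 1 run three iterations each while threads 2 and 3 run two, and every iteration is covered exactly once.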


09 Oct, 2025 1 commit

[TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28

Lei Wang authored Oct 10, 2025

* [Feature] Introduce WGMMA support and enhance GEMM layout handling

- Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
- Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
- Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
- Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
- Improved documentation for layout functions and GEMM operations to clarify usage and parameters.

These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.

* [Refactor] Clean up code formatting and enhance layout function readability

- Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
- Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
- Enhanced comments and documentation in layout-related files to clarify usage and parameters.

These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.

* [Feature] Add descriptor initialization and offset manipulation for WGMMA

- Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
- Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
- Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
- Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
- Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.

These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.

* [Refactor] Improve code formatting and readability in various files

- Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
- Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
- Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.

These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.

* [Update] Update subproject commit and refactor layout function call

- Updated the subproject commit for `cutlass` to indicate a dirty state.
- Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.

These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.

* support more data types

* gemm_rs support

* lint fix

* wgmma wrapper

* Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.

* Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.

* Comprehensively support WGMMA GEMM SS

* remove debug print

* lint fix

* remove debug print

* reduce bwd test shape

* lint fix

* clear cache for pytest

* lint fix

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* test fix

* adjust test case

* test fix

* skip some test currently


02 Oct, 2025 1 commit

[Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452

Lei Wang authored Oct 02, 2025

* [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode

- Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
- Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
- Updated error handling for non-zero index access in fragment buffers to improve robustness.

* [Layout] Improve code formatting and readability in layout.cc and parallel.cc

- Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
- Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
- Ensured consistent formatting across conditional statements and comments for improved maintainability.

* updt

* optimize const index related op

* bug fix

* reduce gdn test

* test fix

* lintfix

* lint fix

* test fix


23 Sep, 2025 1 commit

[Layout] Support layout forward with multi dimension (#867) · 9cbbbbc6

Lei Wang authored Sep 23, 2025

* Enhance LayoutNode::Forward method to handle variable transformations more robustly

- Updated the method to check for a minimum number of input dimensions.
- Introduced a mechanism to transform the last InputDim() elements of the input variables.
- Concatenated transformed variables with the remaining input variables for a comprehensive output.
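
The behaviour described in these bullets can be sketched in a few lines of Python. The sketch assumes the untouched leading indices come first in the result, followed by the transformed trailing ones; `forward`, `input_dim`, and `transform` are hypothetical names standing in for `LayoutNode::Forward`, `InputDim()`, and the layout's index map.

```python
from typing import Callable, Sequence, Tuple

Index = Tuple[int, ...]


def forward(indices: Sequence[int],
            input_dim: int,
            transform: Callable[[Index], Index]) -> Index:
    """Apply transform to the last input_dim indices, keep the rest as-is."""
    if len(indices) < input_dim:
        raise ValueError("need at least input_dim indices")
    split = len(indices) - input_dim
    leading = tuple(indices[:split])                   # passed through unchanged
    mapped = tuple(transform(tuple(indices[split:])))  # transformed tail
    return leading + mapped


# Example: a 2-D layout that linearises (i, j) over a row of width 8.
linearise = lambda ij: (ij[0] * 8 + ij[1],)
```

Calling `forward((3, 1, 5), 2, linearise)` keeps the leading `3` and maps `(1, 5)` through the layout.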

* Refactor LayoutNode::Forward method for improved readability

- Removed unnecessary whitespace to enhance code clarity.
- Maintained existing functionality while streamlining the transformation process of input variables.


30 Jul, 2025 1 commit

Refactor to support upstream tvm (#595) · a7c9a8b9

Siyuan Feng authored Jul 30, 2025



**Summary of part of the rebase PR:**

1. **Support T.thread_return() → CUDA return syntax**  
   Added support for translating `T.thread_return()` to CUDA's native `return` statement.

2. **Dynamic type support for function inputs**  
   Functions now accept dynamically typed parameters using `typing`:
   ```python
   # Sketch: `dyn_type` is bound to one concrete type when the function is defined.
   dyn_type = T.int32  # or T.float
   @T.prim_func
   def main(
       a: dyn_type,
   ):
       ...
   ```

3. **Device Function Codegen**  
   Added support for generating `__device__` functions in CUDA:
   ```python
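   # Assumed TVMScript imports; adjust if the project re-exports them elsewhere.
   from tvm.script import ir as I, tir as T

   @I.ir_module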
   class Module:
       @T.prim_func(private=True)
       def add(a: T.int32, b: T.int32) -> T.int32:
           return a + b

       @T.prim_func
       def main(
           A: T.Buffer((128, 128), "int32"),
           B: T.Buffer((128, 128), "int32"),
           C: T.Buffer((128, 128), "int32"),
       ):
           T.func_attr({"global_symbol": "main"})
           length: T.int32 = Module.add(64, 64)  # Host call
           for bx in T.thread_binding(length, "blockIdx.x"):
               for tx in T.thread_binding(length, "threadIdx.x"):
                   C[bx, tx] = Module.add(A[bx, tx], B[bx, tx])  # Device call
   ```
   After compilation, `add` becomes a CUDA `__device__` function.

4. **Cython-based Python/C++ interop**  
   Replaced ctypes with Cython for all Python/C++ interactions:
   - Python → C++ calls
   - C++ → Cython calls  
   This improves performance by around 100x and reduces CPU overhead during compile/runtime.

5. **FP8 data type standardization**  
   Migrated `e5m2_float8` and similar types to their Torch-standardized variants, such as `float8_e5m2`.



* Refactor CMakeLists.txt to set default build type and manage dependencies for tvm_cython modules

* Update default value of `check_well_formed` parameter in `prim_func` to False for improved flexibility in TIR function parsing.

* Add StorageRewrite function to transform module

Introduced the StorageRewrite function in the tilelang.transform module, which returns a TVM transform pass. This addition enhances the functionality of the module by providing a new transformation option for users.

* Refactor null option handling in IR and layout inference

- Updated instances of `NullOpt` to `std::nullopt` in `ir.cc` and `parallel.cc` for consistency with modern C++ practices.
- Enhanced layout inference logic in `layout_inference.cc` to improve type safety by replacing `as<Fragment>().get()` with `as<FragmentNode>()`.
- Adjusted error handling in `multi_version_buffer_rewriter.cc` and `persist_threadblock.cc` to use more concise null checks.
- Cleaned up test files by commenting out `tilelang.testing.main()` and replacing it with specific test function calls for better clarity.
- Removed unused test file `test_tilelang_kernel_deepseek_nsa.py` to streamline the testing suite.

* Update TVM subproject and refactor cluster planning and tile operation handling

- Updated the TVM subproject to a dirty commit state.
- Refactored copyright headers in `cluster_planning.cc` to reflect the new licensing.
- Enhanced error handling in `lower_tile_op.cc` to check for missing padding map annotations.
- Modified test files to improve clarity and functionality, including adjustments to kernel compilation and test assertions.
- Updated various test cases to ensure proper handling of annotations and configurations in the TileLang testing framework.

* Update annotation type in warp specialized test for consistency

- Changed the annotation type in the `test_warp_specialized` function from a literal integer to `T.int32(3)` for improved type safety and consistency with the TileLang framework.

* Refactor test execution in warp specialized test

- Replaced the direct call to `test_warp_specialized()` with `tilelang.testing.main()` in the test file to standardize test execution and improve integration with the TileLang testing framework.

* refactor

* [Enhancement] Add strict layout map for improved buffer layout inference (#594)

- Introduced a `strict_layout_map` to enhance layout inference by ensuring that buffers with strict layout requirements are properly accounted for during the inference process.
- Updated the inference logic to check for the presence of buffers in the `strict_layout_map` before applying layout changes, improving the accuracy of layout assignments.
- Refactored the layout inference steps to include the copying of layouts into the new strict map, ensuring a clear separation of layout handling based on inference levels.
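
The log does not show the code in `layout_inference.cc`, so the following is only a guess at the shape of the idea: layouts fixed at the strict level are copied into a separate map, and later inference steps skip any buffer found there. All names (`infer_layouts`, `proposals`) are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def infer_layouts(proposals: Iterable[Tuple[str, str]],
                  strict_layout_map: Dict[str, str]) -> Dict[str, str]:
    """Merge proposed layouts without overriding strict ones.

    proposals: (buffer_name, layout_name) pairs in inference order.
    strict_layout_map: layouts that later steps must not change.
    """
    layout_map = dict(strict_layout_map)  # start from the strict layouts
    for buffer, layout in proposals:
        if buffer in strict_layout_map:
            continue  # strict requirement wins; ignore the proposal
        layout_map[buffer] = layout
    return layout_map
```

In this model a buffer listed in `strict_layout_map` keeps its layout no matter what later proposals say.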

* [Example] Update examples to use @tilelang.jit (#597)

* [Example] Update kernel compilation in examples to use @tilelang.jit

- Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
- Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
- Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.

* format

* Update example_tilelang_sparse_gqa_decode_varlen_indice.py

* Update example_dequant_gemm_fine_grained.py

* Update example_gemm_autotune.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks (#599)

* [Enhancement] Improve error messaging for global and shared range legality checks in LowerBulkCopy

- Updated error messages in the LowerBulkCopy function to provide clearer context when global and shared ranges are illegal.
- Enhanced the readability of the error output by including tensor names, improving debugging and validation processes during bulk copy operations.

* [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks

- Improved the clarity of error messages in the LowerBulkCopy function by enhancing the output format.
- Included additional context in error messages to aid debugging when global and shared ranges are found to be illegal, ensuring better traceability during bulk copy operations.

* [Enhancement] Introduce PassConfig `TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE` to enable aggressive shared memory reuse (#602)

* [Enhancement] Add aggressive shared memory merge option in memory allocation

- Introduced a new configuration option `tl.enable_aggressive_shared_memory_merge` to enable aggressive merging of shared memory allocations.
- Updated the `SharedMemLinearAccessPatternFinder` class to support an aggressive merge strategy, allowing for improved memory reuse.
- Modified the `MergeSharedMemoryAllocations` function to incorporate the new merging strategy based on the configuration.
- Enhanced the `PassConfigKey` enumeration to include the new aggressive merge option, ensuring it can be configured appropriately.

* lint fix

* [Enhancement] Add aggressive shared memory merge configuration option

- Introduced a new configuration option `kEnableAggressiveSharedMemoryMerge` to enable aggressive merging of shared memory allocations, enhancing memory management capabilities.

* [Enhancement] Update MergeSharedMemoryAllocations to support aggressive merge option

- Modified the `MergeSharedMemoryAllocations` function to accept an `enable_aggressive_merge` parameter, allowing for more flexible memory management.
- Introduced a new helper function `should_enable_aggressive_merge` to determine the aggressive merge configuration based on the pass context and target.
- Updated the relevant calls in the `phase.py` and `__init__.py` files to utilize the new aggressive merge functionality, enhancing the overall memory allocation strategy.

* [Refactor] Update accumulation handling in gemm_sm90.h (#603)

- Replaced the use of `tiled_mma.accumulate_ = GMMA::ScaleOut::Zero` with a call to `clear(acc)` for better clarity and maintainability in the accumulation logic.
- This change enhances the readability of the code by standardizing the approach to clearing accumulation values across multiple sections of the file.

* [Enhancement] Add tma bulk copy. (#600)

* [Bugfix] Fixed mha_bwd shape inconsistency error (#604)

* lint fix

* Update requirements-lint.txt to maintain clang-format version consistency

* [Bugfix] Avoid duplicate data access when cross thread buffer meet replicate register (#606)

* [Enhancement] Improve debug output formatting in layout and fragment nodes

- Updated the `DebugOutput` methods in `LayoutNode` and `FragmentNode` to provide more structured and informative output, including transformation details and thread range information.
- Enhanced layout inference logic in `ParallelOp` to add predicates for cross-thread shared memory access, improving layout handling in parallel operations.
- Minor adjustment in `layout_inference.cc` to ensure clarity in parallel loop handling.

* lint fix

* [Enhancement] Support tf32 gemm_rs (#607)

- Added a line break in `quickstart.py` for better readability.
- Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option.
- Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance.
- Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability.

* [Enhancement] Introduce option `TL_DISABLE_FAST_MATH` and `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` (#609)

* [Enhancement] Introduce new PassConfig options for fast math and PTXAS verbosity

- Added `kDisableFastMath` and `kEnablePTXASVerboseOutput` configuration options to enhance control over compilation settings.
- Updated `LibraryGenerator` to utilize these new pass configurations, allowing for more flexible compilation behavior based on user preferences.
- Enhanced `PassConfigKey` enumeration to include the new options, ensuring they can be configured appropriately in the pass context.

* [Refactor] Update PTXAS verbosity configuration key in LibraryGenerator

- Changed the configuration key for PTXAS verbosity from `TL_VERBOSE_PTXAS_OUTPUT` to `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` to align with the new naming convention introduced in recent enhancements.
- This update ensures consistency in the configuration options used within the `LibraryGenerator` class, improving clarity and maintainability of the code.

* lint fix

* fix build

* [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526)

* [experimental] add a draft gemm_sp

* [3rdparty] bump cutlass to v3.9.3

* [lint] run format.sh

* [chore] rebase

* [chore] use abs path

* [gemm_sp] add metadata layout

* [ci] add more example

* [lint] run format.sh

* [chore] polish

* [chore] move gemm_sp to experimental

* [chore] polish

* [lint] run format.sh

* [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test

* Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
* Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
* Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.

* Implement Test

* [Enhancement] Update GEMM SP and SM89 templates for improved functionality

* Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
* Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
* Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
* Added conditional inclusion for GEMM SP header based on CUDA architecture version.

* lint fix

* [gemm_sp] support more layout and data types

* Enhancement: sync T.gemm_sp's layout inference with T.gemm

* Enhancement: support more block_k in compress util

* [Enhancement] enable block_k=64

* [Lint] run format.sh

* [Enhancement] compressor support more dtype

* Enhancement: enable block_K=32

* [Lint] format.sh

* [Fixbug] fix shape

* Refactor: sync gemm

* [Enhancement] enable transpose

* [Enhancement] enable fp8_e4m3

* [Enhancement] enable int8

* [Lint] run format.sh

* [Benchmark] add gemm_sp benchmark

* [Example] fix 256 threads hang

* [CI] fix ci

* [Chore] resolve gemini feedback

* [Benchmark] increase search space

* [Lint] format

* [CI] skip sparse tensor core related tests as only sm90 is supported

* [CI] pass local run

* Update gemm_sm89.h

* lint fix

* lint fix

* [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags

- Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
- Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
- Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
- Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
- Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
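
For the `TORCH_CUDA_ARCH_LIST` step, a minimal sketch might look like the following. It assumes the target compute version is available as a `(major, minor)` pair and that the variable is simply overwritten; the function name and the `env` parameter are hypothetical, and the actual helper in the sparse utilities module may differ.

```python
import os
from typing import MutableMapping, Optional


def set_torch_cuda_arch_list(major: int, minor: int,
                             env: Optional[MutableMapping[str, str]] = None) -> str:
    """Set TORCH_CUDA_ARCH_LIST to a single 'major.minor' entry."""
    target_env = os.environ if env is None else env
    arch = f"{major}.{minor}"
    target_env["TORCH_CUDA_ARCH_LIST"] = arch
    return arch
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real process environment.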

* Update test_compress_utils.py

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Doc] Phaseout Legacy documentations (#610)

- Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core.
- Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.
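
For readers unfamiliar with 2:4 sparsity: in every group of four consecutive values at most two are non-zero, so a row can be stored as two kept values per group plus their positions. The Python sketch below illustrates that idea only. The metadata encoding used by the hardware and by tilelang's compress utility is different, and `compress_2_4` and `decompress_2_4` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def compress_2_4(row: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Keep two values per group of four, plus their in-group positions."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values: List[float] = []
    meta: List[int] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        if len(kept) > 2:
            raise ValueError("group is not 2:4 sparse")
        # Pad with zero-valued positions so exactly two slots are kept.
        for i in range(4):
            if len(kept) == 2:
                break
            if i not in kept:
                kept.append(i)
        kept.sort()
        values.extend(group[i] for i in kept)
        meta.extend(kept)
    return values, meta


def decompress_2_4(values: Sequence[float], meta: Sequence[int]) -> List[float]:
    """Rebuild the dense row from kept values and their positions."""
    row = [0] * (len(values) * 2)
    for k, (v, pos) in enumerate(zip(values, meta)):
        row[(k // 2) * 4 + pos] = v
    return row
```

Compressing halves the number of stored values, and decompressing with the position list recovers the original row.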

* [Refactor] Phaseout Pass ParallelLoopTransformer (#611)

* Refactor layout inference by removing the ParallelLoopTransformer class. Updated layout inference logic to streamline buffer access collection and condition handling in parallel loops. This change simplifies the code structure and enhances maintainability.

* Update MHA backward test cases to use reduced dimensions for batch size and context length

* fix build

* [Enhancement] Update ReduceOp initialization values for integer types (#614)

* [Enhancement] Update ReduceOp initialization values for integer types

- Modified the `MakeInitValue` method in `ReduceOp` to handle integer data types correctly by returning appropriate minimum and maximum values based on the bit width.
- Added checks for integer types to ensure correct initialization for `kMax` and `kMin` reduction types, enhancing the robustness of the reduction operations.

* [Enhancement] Update ReduceOp to handle unsigned integer initialization values

- Enhanced the `MakeInitValue` method in `ReduceOp` to include support for unsigned integer data types.
- Added conditions to return appropriate initialization values for `kMax` and `kMin` reduction types based on the data type, improving the robustness of reduction operations.
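
The idea behind these initialization values can be sketched in plain Python: a max reduction starts from the smallest representable value of the type and a min reduction from the largest, both determined by bit width and signedness. The function name `reduce_init_value` and its string arguments are illustrative assumptions; the real `MakeInitValue` is a C++ method on `ReduceOp` and is not reproduced here.

```python
def reduce_init_value(kind: str, bits: int, signed: bool) -> int:
    """Identity element for an integer max/min reduction (illustrative only).

    A max reduction starts from the smallest representable value and a min
    reduction from the largest, so the first real element always wins.
    """
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    if kind == "max":
        return lo
    if kind == "min":
        return hi
    raise ValueError(f"unsupported reduction kind: {kind!r}")
```

For example, `reduce_init_value("max", 8, True)` gives the `int8` minimum, while `reduce_init_value("min", 32, False)` gives the `uint32` maximum.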

* Bump transformers from 4.50.0 to 4.51.0 in /examples/bitnet-1.58b (#615)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.50.0 to 4.51.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.50.0...v4.51.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.51.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Refactor] refactor autotune examples (#617)

* [Refactor] Update tilelang kernel functions and remove unused imports

- Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
- Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
- Removed unused import of `cached` from `tilelang`, streamlining the code.
- Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.

* [Refactor] Simplify configuration generation in benchmark and example scripts

- Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
- Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
- Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
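
A dictionary-based `get_configs` of the kind described above might look roughly like this. The parameter names (`block_M`, `block_N`, `num_stages`) and the keyword-argument signature are assumptions chosen for illustration, not the exact code in the example scripts.

```python
import itertools


def get_configs(**params):
    """Expand a dict of candidate values into a list of config dicts."""
    keys = list(params)
    return [dict(zip(keys, values)) for values in itertools.product(*params.values())]


# Hypothetical tuning space: every combination becomes one config dict.
configs = get_configs(block_M=[64, 128], block_N=[64, 128], num_stages=[2, 3])
```

Each resulting dict can be passed to a kernel factory as keyword arguments, which keeps the tuning space declared in one place.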

* [Refactor] Update configuration handling in benchmark scripts

- Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
- Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
- Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
- Cleaned up redundant code and improved overall readability in the affected files.

* [Refactor] Clean up formatting and update subproject commit

- Updated the subproject commit reference in the TVM directory to indicate a dirty state.
- Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
- Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.

* [Refactor] Update AutoTuner configuration handling

- Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
- Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
- Updated comments for clarity and maintainability.

* lint fix

* Update TVM subproject commit to indicate dirty state and modify MHA backward test cases

- Updated the subproject commit reference in the TVM directory to reflect a dirty state.
- Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
- Uncommented the main testing function call for potential execution.

* lint fix

* Bump transformers from 4.51.0 to 4.52.1 in /examples/bitnet-1.58b (#619)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.51.0 to 4.52.1.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.51.0...v4.52.1)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.52.1
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix PTXAS options flag in LibraryGenerator for consistency (#620)

* Refactor FP8 type handling across multiple files to standardize usage of "float8_e4m3" and "float8_e5m2" instead of "e4m3_float8" and "e5m2_float8". This includes updates in benchmarks, examples, tests, and internal utilities.

* [Refactor] Add parallel loop transform pass for condition extraction (#618)

* [Refactor] Add parallel loop transform

* done format check

* pull 3rdparty repo

* Refactor loop variable handling in transformation utilities

- Updated the logic in `loop_parallel_transform_utils.h` to simplify the handling of related loop variables.
- Removed the check that enforced a single related loop variable, replacing it with a return statement when multiple variables are detected, enhancing clarity and maintainability of the transformation process.

* Update loop_parallel_transform_utils.h

* Refactor loop variable handling in transformation utilities

- Enhanced the logic in `loop_parallel_transform_utils.h` to improve clarity and maintainability by simplifying the handling of related loop variables.
- Replaced the previous enforcement of a single related loop variable with a return statement for multiple variables detected.

* remove disable cache flag as commit id has been key component

* lint fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Dev] Update linear attention examples to enhance performance on Hopper GPUs (#621)

* Tune linear attention examples on H100

* Add retnet fwd kernel

* fix lint

* [Enhancement] Add ahead of time cython compilation in setup.py (#622)

* [Enhancement] Add Cython support and compiler detection in setup.py

- Introduced a new `CythonExtension` class for building Cython-based extensions, enhancing the build process for Cython projects.
- Implemented functions to detect the Cython compiler and C++ compiler, improving compatibility and user experience.
- Updated the build process to handle Cython extensions alongside CMake extensions, ensuring a seamless integration for users.
- Added caching mechanisms for Cython compilation to optimize build times and reduce unnecessary recompilation.

* [Enhancement] Add Cython dependency and enable CMake extension building

- Added Cython as a required dependency in `pyproject.toml` to support Cython-based extensions.
- Updated `setup.py` to enable building CMake extensions, improving the build process for projects utilizing both Cython and CMake.
- Modified the Cython compiler detection logic to streamline installation instructions for users.

* [Enhancement] Support more flexible layout host pythonic expr (#623)

* [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr

- Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
- Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.

* [Refactor] Simplify expression handling in pythonic_expr function

- Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
- Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.

* [Enhancement] Improve expression representation in pythonic_expr function

- Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
- Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
- Included a docstring for better clarity on the function's purpose and usage.
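
The precedence-aware parenthesization described above can be illustrated with a toy expression tree. This sketch does not use TVM: the `Var` and `BinOp` classes and the `to_python` function are stand-ins invented for the example, and the real `pythonic_expr` operates on TVM `PrimExpr` nodes instead.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "-", "*", "//", "%"
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Var, int, BinOp]

# Higher number binds tighter, mirroring Python's own precedence rules.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "//": 2, "%": 2}


def to_python(expr: Expr) -> str:
    """Render the toy expression tree as a Python-style string."""

    def visit(node, parent_prec, is_rhs):
        if isinstance(node, Var):
            return node.name
        if isinstance(node, int):
            return str(node)
        prec = PRECEDENCE[node.op]
        text = f"{visit(node.lhs, prec, False)} {node.op} {visit(node.rhs, prec, True)}"
        # Parenthesize when the child binds looser than its parent, or binds
        # equally but sits on the right of a left-associative operator.
        if prec < parent_prec or (prec == parent_prec and is_rhs):
            return f"({text})"
        return text

    return visit(expr, 0, False)
```

With this rule, a ceiling-division style shape such as `BinOp("//", BinOp("+", Var("n"), 127), 128)` keeps the parentheses it needs, while `m + n * 2` is emitted without redundant ones.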

* test fix

* [Enhancement] support composable expression for shape with symbolic vars (#624)

* [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr

- Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
- Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.

* [Refactor] Simplify expression handling in pythonic_expr function

- Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
- Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.

* [Enhancement] Improve expression representation in pythonic_expr function

- Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
- Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
- Included a docstring for better clarity on the function's purpose and usage.

* test fix

* minor update

* 🐍 Fix the file name "test_exmaple_tilelang_nsa" (#629)

* [Enhancement] Add CPU utilization and count settings for Auto-Tuning (#630)

* [Enhancement] Add CPU utilization and count settings for Auto-Tuning

- Introduced environment variables for CPU utilization, counts, and maximum CPU count for auto-tuning.
- Updated the AutoTuner class to utilize these new settings, improving flexibility and performance in multi-threaded environments.
- Enhanced logging to provide better insights into the auto-tuning process based on the configured CPU settings.

* typo fix

* [AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632)

* [Refactor] Simplify and modularize autotuner implementation

- Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability.
- Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure.
- Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic.
- Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes.

* [Refactor] Clean up and enhance capture and tuner modules

- Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`.
- Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning.
- Streamlined the `CaptureStack` class for better thread-local context management.

* lint fix

* [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example

- Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity.
- Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning.
- Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability.
- Deleted obsolete test file for autotuning decorator, cleaning up the codebase.

* [Refactor] Improve code formatting and readability in autotune test file

- Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation.
- Fixed a typo in the `enable_rasteration` parameter name to ensure consistency.
- Cleaned up unnecessary blank lines to enhance overall code clarity.

* Update example_blocksparse_gemm.py

* Update capture.py

* [Pass] Introduce flag to disable cp async lowering (#633)

* [Enhancement] Update PipelinePlanner to support async copy configuration

- Modified the `Substitute` method in `PipelinePlanner` to accept a `use_async_copy` parameter, allowing for more flexible pipeline planning based on async copy requirements.
- Updated the constructor of `PipelinePlanner` to initialize the `use_async_copy_` member variable.
- Adjusted the logic in the pipeline planning process to conditionally apply async copy annotations based on the new parameter.
- Commented out the `LoopVectorizeDynamic` call in `LowerAndLegalize` to prevent unintended modifications during the legalizing phase.

* Refactor PipelinePlanning function for improved readability

- Adjusted the formatting of the `use_async_copy` variable assignment in the `PipelinePlanning` function to enhance code clarity and maintainability.

* fix typo (#635)

* [Pass][Simplify] Introduce symbolic level simplify for condition expression (#634)

* [Enhancement] Add argument simplification option to StmtSimplifier

- Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior.
- Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process.
- Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase.
- Added comments to clarify the purpose of the new flag and its impact on simplification logic.

* lint fix

* [Enhancement] Improve layout inference and reduce operation handling

- Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic.
- Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures.
- Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization.

* lint fix

* [Enhancement] Add input validation for GEMM parameters

- Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method.
- Added informative error messages to assist in debugging when the input parameters do not meet the required conditions.
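
The divisibility check can be sketched as follows. This is a Python approximation of logic that lives in C++ (`Gemm::ComputeWarpPartition`); the function name `check_warp_tiles`, the error wording, and passing the per-warp sizes as arguments are all assumptions made for illustration.

```python
def check_warp_tiles(M: int, N: int, m_per_warp: int, n_per_warp: int):
    """Validate that M and N split evenly into per-warp tiles.

    Returns the number of warp tiles along M and N (illustrative only).
    """
    if M % m_per_warp != 0:
        raise ValueError(f"M must be divisible by {m_per_warp}, but got {M}")
    if N % n_per_warp != 0:
        raise ValueError(f"N must be divisible by {n_per_warp}, but got {N}")
    return M // m_per_warp, N // n_per_warp
```

Failing early with the offending dimension in the message makes a bad tile shape much easier to trace than a later partitioning error.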

* bug fix

* Enhance test coverage by adding LLVM requirement decorator to multiple function call tests. This ensures that tests for argument count, type code, null data pointer, and dimensionality checks are only executed when LLVM is available, improving test reliability and clarity.

* lint fix

* Fix software pipeline stage annotation and update optional config handling in StmtSimplifier

* Add Python executable detection in CMake configuration and update TVM submodule reference. Remove unused vectorization tests for improved clarity.

* Update TVM submodule reference and refactor FFI registration to use static initialization blocks for improved organization and clarity.

* Refactor attribute handling in layout and IR nodes to use reflection registration. This change replaces the VisitAttrs method with a RegisterReflection method for improved clarity and organization across multiple classes, including KernelLaunchFrameNode, WarpSpecializeFrameNode, LayoutNode, FragmentNode, and SwizzledLayoutNode.

* finish rebase

* tvm update

* Refactor FFI registration across tilelang modules to use the updated `tvm.ffi` namespace. This includes changes in various files to replace `tvm._ffi` with `tvm.ffi`, enhancing consistency and clarity in the codebase.

* lint fix

* Update TVM submodule reference and modify CUDA runtime argument handling to use the new runtime constants for improved clarity and consistency.

* lint fix

* Refactor tensor data type references from "e4m3_float8" and "e5m2_float8" to "float8_e4m3" and "float8_e5m2" across multiple files for consistency and clarity.

* lint fix

* Refactor forward_index initialization in Fragment class to default to an empty array instead of None, ensuring consistent handling of optional outputs.

* test fix

* lint fix

* bugfix

* lint fix

* reduce fix

* lint fix

* carver fix

* cast fix

* Update submodule and enhance kernel launch functionality with optional block size parameter; add device kernel launch transformation.

* lint fix

* bugfix

* Refactor test execution in test_tilelang_cpu_gemm.py and enhance device call checks in lower.py to exclude C packed functions from kernel launch conditions.

* lint fix

* Update runtime.cc

* phase out license

* Update subproject commit for TVM to 555cc71

* Update subproject commit for TVM to d39953fa

* Update subproject commit for TVM to 9574805f

* Update subproject commit for TVM to a08b7c3

* fix ci

* ci fix

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Cunxiao Ni <85601223+Cunxiao2002@users.noreply.github.com>
Co-authored-by: Yuxi Chi <cherichy@outlook.com>
Co-authored-by: Nathan Chen <120630832+Nathancgy@users.noreply.github.com>
Co-authored-by: botbw <wang1570@e.ntu.edu.sg>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xs-keju <93414213+xs-keju@users.noreply.github.com>
Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Co-authored-by: Kadir Nar <kadir.nar@hotmail.com>
Co-authored-by: Yuqing Xia <35415939+xiayuqing0622@users.noreply.github.com>
Co-authored-by: xwhzz <wh.xie@outlook.com>

a7c9a8b9

30 Jun, 2025 1 commit

[Bugfix] Avoid duplicate data access when a cross-thread buffer meets a replicated register (#606) · 8df45c9d

Lei Wang authored Jun 30, 2025

* [Enhancement] Improve debug output formatting in layout and fragment nodes

- Updated the `DebugOutput` methods in `LayoutNode` and `FragmentNode` to provide more structured and informative output, including transformation details and thread range information.
- Enhanced layout inference logic in `ParallelOp` to add predicates for cross-thread shared memory access, improving layout handling in parallel operations.
- Minor adjustment in `layout_inference.cc` to ensure clarity in parallel loop handling.

* lint fix

8df45c9d

22 May, 2025 1 commit

[Refactor] Update buffer handling in layout transformation functions (#509) · 094796b6

Lei Wang authored May 22, 2025

* Modified `makeBufferWithLayout` to include a `var_remap` parameter for improved variable remapping during buffer creation.
* Enhanced buffer load and store operations to utilize the new variable remapping logic, ensuring correct buffer references.
* Commented out a check in `ThreadExtent` for clarity, maintaining functionality while improving code readability.

094796b6

10 May, 2025 1 commit

[Refactor] Improve layout equality checks and error messaging (#471) · c2480907

Lei Wang authored May 10, 2025

* [Refactor] Simplify buffer_region_to_tile_region function in copy.py

* Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
* Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.

* [Refactor] Improve layout equality checks and error messaging

* Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
* Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
* Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.

* lintfix

* [Refactor] Rename SetThreadRange to BindThreadRange for clarity

* Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
* Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
* Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
* Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.

* [Refactor] Update BindThreadRange usage across layout inference methods

* Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
* Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
* Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.

* lint fix

c2480907

08 May, 2025 1 commit

[Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9

Lei Wang authored May 08, 2025

* Add example for warp specialization with flash attention

* Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
* Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
* Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
* Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.

* Add memory barrier functions to builtin.py

* Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
* Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
* The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
* Improved code organization and readability by adding blank lines for better separation of logical sections.

* Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py

* Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
* Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.

* [Refactor] Update barrier functions and add new example for GEMM with warp specialization

* Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
* Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
* Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
* Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
* Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.

* lint fix

* Update loop variable checks in SIMT loop and buffer region validation

* Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
* Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.

* lint fix

* test fix

a91bc2a9

06 May, 2025 1 commit

[Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a

Lei Wang authored May 06, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.

* [Feature] Add examples for warp specialization and TMA barrier integration

* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated the `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.

* [Feature] Add example for warp specialization in GEMM with TMA barriers

* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.

* lint fix

* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection

* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.

* lint fix

* [Feature] Add new examples for warp specialization and TMA integration

* Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
* Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
* Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
* Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
* Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.

* lint fix

* Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.

b5faf25a

22 Apr, 2025 1 commit

[Enhancement] Support Auto Layout Inference and Parallelism with variable constraint (#417) · 73a6cb8b

Lei Wang authored Apr 22, 2025

* [Enhancement] Introduce thread range management in layout and operation handling

* Added `SetThreadRange` method to `FragmentNode` for managing thread ranges.
* Updated `LayoutNode::Inverse` to provide more informative error messages.
* Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management.
* Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios.

* lint fix

* [Refactor] Improve thread variable handling in layout inference and operation lowering

* Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds.
* Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management.
* Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase.

* [Refactor] Update thread variable handling in layout inference and operation lowering

* Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity.
* Initialized thread_var with a default range to prevent undefined behavior.
* Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.
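
The thread-range handling described in this commit can be pictured with a small Python model. This is an illustrative sketch only: the actual logic lives in the C++ passes, and `ThreadRange`, `resolve_thread_bounds`, and the one-thread default range are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThreadRange:
    """Half-open range [start, start + extent) of thread indices (hypothetical type)."""
    start: int
    extent: int


# Assumed default for the case where no thread variable is bound. The commit
# message only mentions "a default range", so this concrete value is a guess.
DEFAULT_RANGE = ThreadRange(0, 1)


def resolve_thread_bounds(thread_var_range: Optional[ThreadRange]) -> ThreadRange:
    """Return the thread bounds to hand to layout inference.

    The range of the thread variable is read only when it is defined;
    otherwise the default range is returned.
    """
    if thread_var_range is not None:
        return thread_var_range
    return DEFAULT_RANGE
```

Passing bounds instead of a single block size lets a caller describe a subset of threads, for example `ThreadRange(128, 128)` for the second group of 128 threads.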

73a6cb8b

24 Mar, 2025 1 commit

[Refactor] Improve flash attention example and layout comparison logic (#270) · 5f5bf53c

Lei Wang authored Mar 24, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

5f5bf53c

20 Mar, 2025 1 commit

[Refactor] Phase out LLVM Dependency by Making it Optional (#247) · f2e99180

Lei Wang authored Mar 20, 2025

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
- Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.
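
The copy step described in the two `setup.py` items above might look roughly like the following. This is a minimal sketch, not the code from `setup.py`: the function name `copy_shared_libs` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def copy_shared_libs(build_dir: Path, target_dir: Path, remove_original: bool = False) -> list:
    """Copy every .so file from build_dir into target_dir.

    The target directory is created first if it does not exist. When
    remove_original is True, each source file is deleted once it has been copied.
    """
    # Make sure the destination exists before any copy is attempted.
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for lib in sorted(build_dir.glob("*.so")):
        dest = target_dir / lib.name
        shutil.copy2(lib, dest)
        if remove_original:
            lib.unlink()
        copied.append(dest)
    return copied
```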

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.

* lint fix

* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
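
An environment-controlled cache reset of this kind can be sketched as follows. Only the variable name `TILELANG_CLEAR_CACHE` comes from the commit message; the helper names, the set of accepted truthy spellings, and the directory handling are assumptions made for illustration.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping

# Assumed spellings treated as "true"; the commit only says "set to true".
TRUTHY = {"1", "true", "yes", "on"}


def should_clear_cache(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when TILELANG_CLEAR_CACHE holds a truthy value."""
    value = environ.get("TILELANG_CLEAR_CACHE", "")
    return value.strip().lower() in TRUTHY


def init_cache(cache_dir: Path, environ: Mapping[str, str] = os.environ) -> Path:
    """Create the cache directory, wiping any existing contents first if requested."""
    if should_clear_cache(environ) and cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```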

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage

f2e99180

18 Mar, 2025 1 commit

[Refactor] Refactor for Better Layout Conflict Handling (#240) · 2a286ae6

Lei Wang authored Mar 18, 2025

* [Feature] Add reduce_max functionality and corresponding tests

* Introduced a new test file for the reduce_max operation in the tilelang language module.
* Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying.
* Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation.
* Enhanced profiling assertions to validate the output against reference implementations.

* Fix whitespace issues in reduce_max test file for improved readability

* [Refactor] Update DebugOutput methods to return strings instead of void

* Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging.
* Updated corresponding header files to reflect the new return types.
* Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts.

* lint fix

* Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests.

* [Enhancement] Improve layout inference conflict handling in ParallelOp

* Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers.
* Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages.
* Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise.

* [Feature] Add IsEqual methods for layout comparison

* Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison.
* Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts.
* Improved error messages for layout conflicts to provide clearer guidance on potential issues.
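
A structural equality check with optional index comparison can be modelled in a few lines of Python. This is a toy sketch, not the C++ `IsEqual` implementation: the `Layout` class, its fields, and the `skip_index` flag name are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Layout:
    """Toy layout: an input shape plus symbolic forward-index expressions."""
    input_shape: Tuple[int, ...]
    forward_index: Tuple[str, ...]

    def is_equal(self, other: "Layout", skip_index: bool = False) -> bool:
        """Compare two layouts structurally.

        Shapes must always match. The forward index is compared only
        when skip_index is False.
        """
        if self.input_shape != other.input_shape:
            return False
        if skip_index:
            return True
        return self.forward_index == other.forward_index


row_major = Layout((8, 8), ("i * 8 + j",))
col_major = Layout((8, 8), ("j * 8 + i",))
```

With these two layouts, `row_major.is_equal(col_major)` is false because the index expressions differ, while `row_major.is_equal(col_major, skip_index=True)` is true because only the shapes are compared.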

* [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc

* Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement.
* Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity.
* Improved error messages in layout conflict checks to provide better guidance on potential issues.

* [Refactor] Clean up pointer formatting in layout inference files

* Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability.
* Minor adjustments to error message formatting in layout conflict checks for better clarity.

2a286ae6

16 Mar, 2025 1 commit

[Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper (#224) · c5bbc608

zqh-wz authored Mar 16, 2025



* add test for issue 101

* use ss_smem_selector from cutlass

* fix mismatch between smem layout and mma

* only fix for sm90

* Add CUDA requirements to GEMM thread tests

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

c5bbc608

09 Feb, 2025 1 commit

[Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e

Lei Wang authored Feb 10, 2025

* [Enhancement] Add VectorizeLoop function and update imports for compatibility

* [CI][Test] Improve test cases for vectorization and fix typos in parser comments

* lint fix

* Fix incorrect module reference for VectorizeLoop transformation

* Refactor vectorize_loop transformation by removing unused extent mutation logic

* [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen

* Fix formatting in CUDA FP8 header file for consistency

* Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity

* Update submodule 'tvm' to latest commit for improved functionality

* Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.

* Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.

* Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency

* Add CUDA requirements to FP8 test cases and update references for clarity

* Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py

* Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py

* Add CUDA requirements and FP8 test cases for matmul and gemv simulations

* Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py

* Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py

* Add BF16 support to matrix multiplication and introduce corresponding test cases

* Add a blank line for improved readability in BF16 GEMM test

* Update acknowledgements in README to include supervision by Zhi Yang at Peking University

* enhance acknowledgement

* Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives

* Update subproject commit for TVM dependency

* Update subproject commit for TVM dependency

* Add int4_t type and functions for packing char values in CUDA common header

* Add plot_layout example and implement GetForwardVars method in layout classes

* Refactor code for improved readability by adjusting line breaks and formatting in layout and test files

* Fix formatting by removing unnecessary line break in layout.h

* Refactor make_int4 function for improved readability by adjusting parameter formatting

f9b6a92e

25 Jan, 2025 1 commit

[Dev] Add FlashDecoding example (#46) · cc08ba50

Yu Cheng authored Jan 25, 2025

cc08ba50

11 Jan, 2025 2 commits

[Lint] Overall Typo and Linting Fixes (#13) · fa511857

Lei Wang authored Jan 11, 2025

* README.md fixed

* update test ci

* Lint and Typo Fix

* Clang Format Lint Fix

fa511857

[Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c

Lei Wang authored Jan 11, 2025



* Add format.sh script for code formatting and linting

* docs update

* center align the title

* lint fix

* add ignore

* Add .gitignore for 3rdparty directory

* Add requirements-dev.txt, requirements-test.txt, and requirements.txt

* 3rdparty

* Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h

* Refactor CMakeLists.txt and include statements

- Update CMakeLists.txt to use a newer version of CMake and add project name
- Remove unnecessary include directories

* Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc

- Update include paths to use relative paths instead of absolute paths

* Update submodule for 3rdparty/tvm

* update

* load dll first

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* git keep update

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* refactor code structure

* Update Readme

* CMakeLists Customized

* update readme

* update README

* update readme

* update usage

* with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import

* annotate lower transform global func with `transform` prefix

* Migrate Simplify Pass from tilelang tvm branch

* enhance system environment handling with __init__ and CMake

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* README.md committed

* SECURITY.md committed

* SUPPORT.md committed

* CODE_OF_CONDUCT Commit

* LICENSE Commit

* SECURITY Commit

* SUPPORT Commit

* Modify Support

* Update README.md

* security ci update

* remove examples

* Update and implement clang-format

* add composable kernel components

* Migrate from latest update

* submodule update

* Test update

* Update License

* Spell check

* lint fix

* add clang-tidy to apply static analysis for c source

* update tilelang examples

* Update Install Docs

* Refactor filetree

* Enhance Install

* conflict resolved

* annotate_version

* Initial Update

* test fix

* install

* Implement setup.py

* lint fix

* Separate Init

* Separate test

* docker file commit

* add logo

* Update Readme and Examples

* update readme

* update logo

* Implement AMD Installation

* Add License

* Update AMD MI300x Benchmark

* update README

* update mi300 benchmark scripts

* update ignore

* enhance build script

* update image

* enhance setup.py to remove duplicated libraries

* remove debug files

* update readme

* update image

* update gemm examples

* update flashattention README

* readme update

* add cmake into requirements

* libinfo fix

* auto update submodule

* lint fix

* Fix AMD Build and Test

* Update check for transpose attribute for CDNA Arch

* typo fix for amd

* Implement Matmul Benchmark

* Refactor Code

* [TypoFix] Fix GEMM Example

* [Docs] Init Linear Attention README

* [TYPO] Typo fix

* [Lint] Lint Fix

* enhance example with intrinsics

* [Enhancement] Improve Buffer Collection during IR Parser

* [Dev] Introduce Current classmethod to get current frame

* submodule update

* fake test pass update

* support thread_extent_api

* code optimize

* Add GEMM function implementation for matrix multiplication

* Update logging format to reflect TileLang in logger messages

* Refactor CMakeLists.txt for improved readability and set default build type to Release

* Support Gemm SS Primitives Implementation

* [README] Upload Tile Language Logo (#5)

* update logo

* Update README.md to enhance formatting and center the title

---------
Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>

57ab687c