Commits · 5683e6a6804ca13cbecb93dd8c182eb0fd8a479c · OpenDAS / tilelang

22 Oct, 2025 1 commit

[CI][Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow (#1044) · 5683e6a6

Xuehai Pan authored Oct 22, 2025

* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow

* chore: update clang-tidy settings

* chore: upgrade clang-format and clang-tidy version

* lint: resolve clang-tidy errors

* [Maint] restore format.sh

* [CI] pre-commit autoupdate

* [Minor] fix `command -v` usage

5683e6a6

21 Oct, 2025 4 commits

[Bugfix] Fix missing host cuTensorMapEncodeIm2col call (#1094) · 5cb5c068
Yu Cheng authored Oct 22, 2025

5cb5c068

[Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e

Lei Wang authored Oct 21, 2025

* - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
    generated Allocates and the PrimFunc attrs
  - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
    allocations keep their tl.local_var_init annotations
  - teach annotation handling to accept scalar initializers, resolve buffers, and merge
    with existing stat

* lint fix

* enhance

* lint fix

* lint fix

bddb125e

[PassConfig] Introduce PassConfig `TL_STORAGE_REWRITE_DETECT_INPLACE` (#1089) · cdc67fc4

Lei Wang authored Oct 21, 2025

* • Enable configurable StorageRewrite inplace detection

  - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
  - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
  - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.

* lint fix

* add test

* lint fix

cdc67fc4

[BugFix] Add memory order argument for non-vectorized atomic add (#1081) · 1d4b7180

Zhengju Tang authored Oct 21, 2025

* [BugFix] Add memory order argument for non-vectorized atomic add

* [Lint]

* [BugFix] Memory order

* [Lint]

* [BugFix] Argument in cuda template

* [Lint]

1d4b7180

20 Oct, 2025 4 commits

[Enhancement] Update async intrinsic handling in inject_fence_proxy (#1068) · bb8b3cd7

Tong WU authored Oct 21, 2025



* [Enhancement] Update async intrinsic handling in inject_fence_proxy

* Added support for wgmma async intrinsics in IsAsyncIntrinsic function.
* Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.

* test fix

* Update testing/python/transform/test_tilelang_transform_inject_fence_proxy.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

bb8b3cd7

[Bugfix] Fix missing reg alloc in custom warp specialization (#1084) · f8d3e73e
Yu Cheng authored Oct 21, 2025

f8d3e73e
[Layout] Utilizing IsEqual instead of StructuralEqual (#1073) · 6a388c0e
Lei Wang authored Oct 20, 2025

6a388c0e

[Parallel] Support `T.Parallel` with dynamic extents (#990) · 27701c3d

Lei Wang authored Oct 20, 2025

* Allow dynamic extents in loop partition; warn when layout inversion falls back to NoCheck

* add test and introduce predicate

* test fix

* fix

* enhance

* inverse with level

* test fix

* bug fix

27701c3d

17 Oct, 2025 2 commits

[Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (#1050) · 72111642

Chaofan Lin authored Oct 17, 2025



* [Refactor] Refactor Pass  to support recursive load/store rewrite

* lint

* recursive collect conds for call_extern

* fix name

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* address comment

* rename pad_value to safe_value

* lint

* add oob store test

* [Lint]: [pre-commit.ci] auto fixes [...]

* fix

* fix

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

72111642

[Enhancement] Introduce a workaround for layout inference for local buffer store (#1055) · 278c0fbf

Lei Wang authored Oct 17, 2025



* [Enhancement] Improve layout inference for local buffer handling in parallel operations

* Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
* Updated the condition for determining parallel loop execution to account for local buffer stores.
* Cleaned up comments for clarity and future considerations.

* [Refactor] Clean up parallel loop condition formatting in layout inference

* Reformatted the condition for determining parallel loop execution for better readability.
* Maintained existing logic while enhancing code clarity for future modifications.

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

278c0fbf

16 Oct, 2025 1 commit
- [Feature]: Add test for atomicadd auto vectorize and remove useless code (#1019) · 0ff4f427
  Yuqi Dong authored Oct 16, 2025
```
* update

* format

* rabbit
```
  0ff4f427
15 Oct, 2025 2 commits
- [Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` (#982) · bd1c7b39
  Yu Cheng authored Oct 16, 2025
  
  bd1c7b39
- [TIR] Revert some changes of Pass `LowerIntrin` (#1035) · e5399527
  Lei Wang authored Oct 15, 2025
```
* keep >> instead of /

* re think replicate

* lint fix

* handle const int buffers

* rep fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
  e5399527
14 Oct, 2025 1 commit
- [Transform] Migrate `LowerIntrin` from tvm into tilelang (#999) · 7a5077e4
  Lei Wang authored Oct 14, 2025
```
* Donot lower ceildiv to >>

* lint fix

* test fix

* fallback ceildiv changes
```
  7a5077e4
13 Oct, 2025 1 commit
- [Bugfix] Fix atomicadd auto vectorize identify var error (#883) · 340bfc50
  Yuqi Dong authored Oct 13, 2025
```
* update

* update

* update

* update
```
  340bfc50
11 Oct, 2025 1 commit

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36

Lei Wang authored Oct 11, 2025

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)

* • InjectFenceProxy docs and tests

  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic

* do not consider call_extern as async.

* doc update.

* reduce test size for sparse mla

ddfaac36

10 Oct, 2025 3 commits

[Bugfix] Fix dummy kernel compliation (#962) · 7913fb1d

Chaofan Lin authored Oct 10, 2025



* [Bugfix] Fix visit EvaluateNode in BufferGemmCollector

* address comment

* lint

* fix

* Add TileLang SplitHostDevice pass and tighten issue 830 test names

* lint fix

* enhance for kernel value unpacking.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7913fb1d

[CI] add `pre-commit` integration (#955) · 8fe35402

Xuehai Pan authored Oct 10, 2025



* chore: misc cleanup

* feat: add pre-commit config

* chore: update lint dependencies

* style: fix lint issues

* feat: add pre-commit hooks

* fix: fix typos

* chore: update .gitattributes

* [Lint]: [pre-commit.ci] auto fixes [...]

* docs: update CONTRIBUTING.md

* chore: update default venv name

* chore: revert and exclude CUDA files

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8fe35402

[Bugfix] Do not force inline let stmt (#947) · f8ae600c

Lei Wang authored Oct 10, 2025

* remove debug print

* Remove inline let expressions from the LowerAndLegalize function in phase.py

* add test

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* reduce test shape

* Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py

- Added a new section for compiler internals in the documentation.
- Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.

* Update buffer access checks in merge_shared_memory_allocations.cc

- Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
- Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.

* lint fix

* Support pipeline with LetStmt

* lint fix

* • Fix LowerTileOp let handling to avoid LetInline dependency

  - inline let-bound BufferLoad nodes via resolver helpers and structured return
  - remap layouts/buffers using original data vars and only rewrite when needed
  - update pipeline planner to understand let-bound address_of buffers
  - document the new inline behaviour in docs/let_inline_fix.md

* fix for wgmma pipeline with let binding

* lint fix

* test fix

* reduce smem usage.

* let binding enhancement

* fix for dpgm

* fix simplify

* lint fix

* use tilelang.Simplify instead of tir.Simplify

* • Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage

  - register the new config in builtin headers/registration
  - add helper to pipeline enabling LetInline based on pass context
  - document LetStmt inlining controls and usage

f8ae600c

09 Oct, 2025 1 commit

[TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28

Lei Wang authored Oct 10, 2025

* [Feature] Introduce WGMMA support and enhance GEMM layout handling

- Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
- Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
- Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
- Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
- Improved documentation for layout functions and GEMM operations to clarify usage and parameters.

These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.

* [Refactor] Clean up code formatting and enhance layout function readability

- Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
- Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
- Enhanced comments and documentation in layout-related files to clarify usage and parameters.

These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.

* [Feature] Add descriptor initialization and offset manipulation for WGMMA

- Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
- Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
- Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
- Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
- Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.

These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.

* [Refactor] Improve code formatting and readability in various files

- Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
- Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
- Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.

These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.

* [Update] Update subproject commit and refactor layout function call

- Updated the subproject commit for `cutlass` to indicate a dirty state.
- Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.

These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.

* support more data types

* gemm_rs support

* lint fix

* wgmma wrapper

* Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.

* Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.

* Comprehensively support WGMMA GEMM SS

* remove debug print

* lint fix

* remove debug print

* reduce bwd test shape

* lint fix

* clear cache for pytest

* lint fix

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* test fix

* adjust test case

* test fix

* skip some test currently

a13cde28

05 Oct, 2025 1 commit
- [Example] Disable TMA and enable FastMath for NSA Examples (#941) · 3aecab8f
  Lei Wang authored Oct 05, 2025
```
* tma disable

* int64 cast fix.
```
  3aecab8f
02 Oct, 2025 1 commit

[Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa

Zhiwen Mo authored Oct 03, 2025

* Implements tcgen05.ld instruction support for copying from shared.tmem
  to local.fragment on SM100/Blackwell architecture. Adds layout inference
  and lowering logic for tensor memory operations with proper physical
  coordinate range analysis and warpgroup alignment checks.

  Changes:
  - Add kTMemLoad and kTMemStore to CopyInst enumeration
  - Implement CheckTMemLoad() and CheckTMemStore() validation functions
  - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
  - Add tmem layout inference in InferLayout() using expandTcgen05Layout
  - Support multiple instruction variants (32dp32b/64b/128b/256b)
  - Add physical layout bounds analysis for tmem coordinates
  - Change clear_accum from bool to PrimExpr in GEMM operations
  - Fix std::optional access checks in layout_inference.cc
  - Add tmem_allocate/deallocate PTX intrinsic support
  - Fix cooperative_groups grid.sync() code generation

* fix

* pipeline fix

* bug fix

* bool fix

5ccac4fa

28 Sep, 2025 1 commit

[SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43

Zhiwen Mo authored Sep 28, 2025

* update sm100 related utcmma, tmem, ld/st256 in src
* update sm100 related utcmma, tmem, ld/st256 in tilelang
* Remove deprecated GEMM examples and related README documentation for SM100 architecture support
* Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
* Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
* Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
* Update README and source files to reflect TCGEN5.MMA terminology changes
* Refactor CUDA GEMM header for improved readability

f58bcd43

26 Sep, 2025 1 commit

[Layout] Introduce Flexible Parallel to Support T.serial and local buffers... · c382dcbc

Lei Wang authored Sep 27, 2025


[Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop (#844)

* Support T.serial and local buffers inside T.Parallel loop.

* Fix reducer layout in T.Parallel nested inside other loops

* Debug output with LOG(INFO)

* Add disable option for WGMMA.

* fix

* Use DLOG; fix missing registration for new pass config

* bug fix

* lint fix

* Enhance GEMM instruction set with UTCMMA and improve local buffer handling in casting example

* Update format.sh shebang, improve logging in layout inference, and enhance buffer store wrapper with detailed comments

* Enhance GEMM instantiation logic and improve layout inference for local buffer detection

- Updated the GEMM instantiation logic to include a check for WGMMA compatibility, ensuring that the conditions for using WGMMA are more robust.
- Refined the layout inference process to better identify when loops manipulate only local buffers, improving the accuracy of thread binding decisions in parallel loops.

---------
Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>

c382dcbc

25 Sep, 2025 2 commits

[Language] Support atomic add with ret (#870) · aa0b1090

Lei Wang authored Sep 26, 2025

* Add atomic operations for CUDA templates in new atomic.h file

- Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
- Implemented support for half, bfloat16, and float types with appropriate memory ordering.
- Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
- Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
- Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.

* Refactor atomic operations in CUDA templates for improved readability

- Reformatted atomic operation implementations in atomic.h for better code clarity.
- Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
- Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.

* Add thread storage synchronization configuration option

- Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
- Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
- Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.

* Refactor thread storage sync configuration retrieval

- Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
- Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.

* test fix

* Update atomic operations and tests for improved functionality

- Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
- Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
- Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
- Updated the TVM subproject to the latest commit for better compatibility.

* Update attention sink examples to use 32 heads

- Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
- Ensured consistency across example scripts for improved usability and testing.

* Refactor atomic add handling in vectorization

- Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
- Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.

* Add loop break functionality and enhance thread synchronization

- Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
- Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
- Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.

* test fix

aa0b1090

[Bugfix] Use `ExprDeepEqual` instead of `StructuralEqual` when merge consecutive If stmt (#876) · 1dfac2e8

Lei Wang authored Sep 25, 2025

* Update submodule TVM to latest commit and fix condition comparison in merge_if_stmt.cc

* Update submodule TVM to latest commit 0524f760

* lint fix

1dfac2e8

24 Sep, 2025 1 commit

[Fix] tilelang can now vectorize `B[i,j] = c[i] + A[i,j]` (#798) · 2d4b848f

Kurisu authored Sep 24, 2025

* Fix bug 0905: vectorize with broadcasted value

* fix lint error

* [Refactor] Use `tvm::tir::UseVar` and use Vectorizer

* Add loop size check in vectorize planner

* fix lint error

2d4b848f

22 Sep, 2025 1 commit

[TMA] Bugfix when a shared buffer is both issued with tma store and tma load (#857) · b9a51c43

Lei Wang authored Sep 22, 2025

- Updated `init_desc_arg_map` to use `Var` as the key instead of `String` in `lower_hopper_intrin.cc`.
- Enhanced `func_call_args` method in `TLCUDASourceWrapper` to accept additional parameters for better argument mapping.
- Added assertions to ensure consistency between function parameters and arguments during kernel launches.
- Modified `generate_tma_descriptor_args` to utilize a mapping of variable names for TMA descriptor initialization.

b9a51c43

19 Sep, 2025 1 commit

[Refactor] Enhance buffer store transformation in TIR pass (#851) · 094e2298

Lei Wang authored Sep 19, 2025

- Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used.
- Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability.
- Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.

094e2298

15 Sep, 2025 2 commits

[Refactor] Update TVM subproject and streamline buffer store handling (#816) · 85d1a6b3

Yu Cheng authored Sep 16, 2025

- Updated the TVM subproject to the latest commit for improved functionality.
- Refactored `warp_specialized_rewriter.cc` to replace placeholder implementations for `BlockNode` and `BlockRealizeNode` with proper role filtering, enhancing code clarity and maintainability.
- Ensured consistent handling of the `cp_async_barrier_noinc` function in `builtin.py` by adding a newline at the end of the file.

85d1a6b3

[Refactor] Update TVM subproject and refactor BlockNode handling in... · 8b005226

Yu Cheng authored Sep 16, 2025

[Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc (#812)

* [Feature] Introduce custom warp specialization attribute and enhance warp group register allocation

- Added a new attribute `kCustomWarpSpecialization` to support custom warp specialization in the TileLang framework.
- Updated the `Collect` method in `SetMaxNRegCollector` to handle cases where warp specialization is detected, returning an empty array accordingly.
- Enhanced the `SetMaxNRegInjector` to skip processing when no registers are needed, improving efficiency.
- Modified the `WarpSpecialized` pass to include the new attribute in the function body when warp specialization is enabled, ensuring proper handling in transformations.

* lint

* lint

8b005226

14 Sep, 2025 1 commit

[Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality (#809) · ae9b7063

Yu Cheng authored Sep 14, 2025

- Introduced a new intrinsic `ptx_cp_async_barrier_noinc` for handling the `cp.async.mbarrier.arrive.noinc` operation in TileLang.
- Updated the CUDA code generation to support the new barrier operation.
- Added a corresponding function in the TileLang Python API for ease of use.
- Enhanced the barrier handling in CUDA templates to include the new no-increment operation, improving synchronization capabilities in parallel execution contexts.

ae9b7063

10 Sep, 2025 1 commit

[TileOp] Introduce a experimental python defined `T.gemm_v2` (#793) · 91a7bb2b

Lei Wang authored Sep 11, 2025

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Removed deprecated prime factorization functions from `gemm.cc` and `gemm_sp.cc`.
- Introduced a new `GemmWarpPolicy` class to manage warp policy attributes and methods, improving encapsulation.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
- Enhanced `GetArchInt` function in `utils.cc` for better readability and type safety.
- Added new `gemm_v2` function in `gemm.py` for improved GEMM operation with additional parameters and checks.

* Refactor GEMM and frontend legalize operations for improved clarity and functionality

- Updated `gemm_py.h` to include the correct header for GEMM operations.
- Renamed `FrontendLegalizer` class to `LetInliner` and updated related methods to reflect this change, enhancing code clarity.
- Modified the pass function from `FrontendLegalize` to `LetInline` for better alignment with its purpose.
- Updated test cases to utilize the new `gemm_v2` function and adjusted the testing framework for improved output and clarity.
- Removed obsolete test file `test_tilelang_transform_frontend_legalize.py` to streamline the test suite.
- Enhanced the `LowerAndLegalize` function to utilize the new `LetInline` pass, improving the overall transformation process.

* Enhance CUDA code generation and testing for GEMM operations

- Added indentation printing in `codegen_cuda.cc` for improved assembly code formatting.
- Updated `test_tilelang_tilelibrary_gemm.py` to include additional GEMM test cases and shared memory allocation with specified scope.
- Introduced new `matmul_sr` and `run_gemm_sr` functions for GEMM operations with shared and fragment memory layouts.
- Refactored layout inference in `mma_macro_generator.py` to improve clarity and correctness in shared memory handling.
- Enhanced `gemm/__init__.py` to support new GEMM operation combinations and layout inference logic.

These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.

* Refactor GEMM layout and testing for improved clarity and functionality

- Updated `gemm_layouts.cc` to enhance the layout generation logic for transposed and non-transposed GEMM operations.
- Renamed and modified functions in `test_tilelang_tilelibrary_gemm.py` to reflect changes in GEMM function signatures and improve test coverage.
- Introduced new GEMM operation combinations in `gemm/__init__.py` to support additional layouts and configurations.
- Enhanced layout inference in `mma_layout.py` and `mma_macro_generator.py` for better handling of shared memory layouts.

These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.

* Refactor GEMM layout and Python integration for improved functionality

- Updated `gemm_layouts.cc` to correct the order of layout replication and repetition for transposed and non-transposed GEMM operations.
- Enhanced `gemm_py.cc` to handle block realization more robustly, ensuring correct assignment of global symbols and block attributes.
- Refactored `inject_pipeline.cc` to streamline buffer read/write region handling, improving clarity and maintainability.
- Cleaned up test cases in `test_tilelang_tilelibrary_gemm.py` by removing unnecessary print statements and adjusting function calls for better test execution flow.

These changes enhance the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.

* Refactor GEMM layout and testing for improved clarity and functionality

- Updated `gemm_layouts.cc` to enhance layout generation logic for transposed and non-transposed GEMM operations.
- Improved block realization handling in `gemm_py.cc` for better assignment of global symbols.
- Streamlined buffer read/write region handling in `inject_pipeline.cc` for clarity.
- Enhanced test cases in `test_tilelang_tilelibrary_gemm.py` by adjusting function calls and adding new GEMM operation combinations.

These changes improve the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.

* tfloat32 support.

* lint fix

* lint fix

* Refactor shared memory allocation in GEMM tests

- Removed unnecessary scope specification in shared memory allocation for matrices A and B in `test_tilelang_tilelibrary_gemm.py`.
- This change simplifies the allocation process and aligns with the updated GEMM function signatures.

91a7bb2b

09 Sep, 2025 1 commit

Refactor index handling in BufferStore and BufferLoad to promote 64-bit integers (#796) · 54aaec98

Lei Wang authored Sep 09, 2025

- Updated index processing in `BufferStore` and `BufferLoad` to ensure that integer indices with less than 64 bits are promoted to 64-bit integers.
- Introduced a new array to store the modified indices before updating the original indices, enhancing clarity and maintainability of the code.

54aaec98

06 Sep, 2025 1 commit

[TMA] Automatically lower 1d tma in appropriate cases (#788) · 9d7d45be

Lei Wang authored Sep 06, 2025

* Enhance layout inference and copy operations with 1D TMA support

- Updated `CopyNode` to introduce separate handling for 1D bulk load/store operations, including new methods for checking and lowering these operations.
- Modified `InferLayout` and `GetCopyInst` to accommodate additional parameters for layout maps and analyzers.
- Enhanced `AtomicAddNode` and `FillNode` to utilize the updated layout inference logic.
- Improved buffer out-of-bounds checks during layout inference to ensure safe memory access.

This update improves the efficiency and correctness of memory operations in the TileLang framework.

* Refactor layout inference calls for improved readability

- Updated `InferLayout` calls in `AtomicAddNode`, `CopyNode`, and `FillNode` to enhance code clarity by formatting parameters across multiple lines.
- Cleaned up whitespace and formatting in `copy.h` and `layout_inference.cc` to adhere to coding standards and improve maintainability.

This refactor aims to streamline the layout inference logic and improve overall code organization.

* Fix shared tensor check in CopyNode for bulk copy operations

- Updated the condition in `CheckBulkCopy1D` to verify contiguity of `shared_tensor` instead of `dst`, ensuring correct handling of shared memory layouts during bulk copy operations.
- This change enhances the accuracy of memory operations in the TileLang framework.

* Update test_example_gdn_compilation.py to invoke test function directly

- Commented out the call to `tilelang.testing.main()` in `test_example_gdn_compilation.py` and replaced it with a direct call to `test_example_chunk_delta_bwd_compilation()`. This change simplifies the test execution flow and focuses on the specific test case.

* Enhance bulk load/store checks in CopyNode with last dimension validation

- Updated `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to include an optional parameter for validating the last dimension during bulk copy operations.
- Adjusted related methods `CheckBulkLoad1D` and `CheckBulkStore1D` to pass the new parameter, improving the accuracy of bulk copy checks.
- This change enhances the robustness of memory operations in the TileLang framework by ensuring compliance with dimensional requirements.

* Refactor CheckBulkLoad and CheckBulkStore methods for improved readability

- Reformatted the parameter lists of `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to enhance code clarity by aligning parameters across multiple lines.
- This change improves the maintainability of the code and adheres to coding standards.

9d7d45be

05 Sep, 2025 1 commit

[Feat] Add tilelang T.assume support and assume injection for buffer shapes (#787) · e5b61e9b

Kurisu authored Sep 05, 2025

* Add InjectAssumes pass to speedup tvm prover

* Fix lint errors

* remove debug statements

* [Feat] add assume attr and assume support in tilelang

* Add convertion from tir.assume to tilelang assume

* [Fix] Add missing With constraint in IRMutator

* Fix typo in ir mutator

e5b61e9b

04 Sep, 2025 1 commit

[Refactor] Support python reflection for tile operators (#783) · 3cfefc8e

Lei Wang authored Sep 04, 2025

* Implement Fill operator and related reflection methods in TileLang

- Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
- Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
- Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
- Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
- Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.

* Refactor operator reflection methods and improve code clarity

- Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
- Consolidated static initialization blocks for various operators to a single line for improved consistency.
- Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
- Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.

* Refactor GEMM and AtomicAdd operations for improved clarity

- Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
- Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
- Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
- Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.

* Remove deprecated operator files from the tilelang IR module

- Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
- This cleanup enhances maintainability by removing unused code and improving overall organization of the module.

* Refactor imports in tilelang IR module for improved organization

- Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.

* lint fix

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
- Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
- Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
- Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.

* Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability

- Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure.
- Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
- Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
- Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.

* comment

* Refactor operator header files for improved readability

- Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
- Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.

* Refactor MakeReduce method in ReduceOpNode for clarity

- Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
- This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.

* Update Reduce operation type checks for consistency

- Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
- This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.

3cfefc8e

02 Sep, 2025 1 commit

[Lint] Introduce clang-tidy into format.sh (#777) · cdc5d8d3

Lei Wang authored Sep 02, 2025

* [Refactor] Update Clang-Tidy Checks and Improve Code Consistency

- Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
- Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
- Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
- Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
- General code cleanup and adherence to best practices for better maintainability.

* [Refactor] Enhance Code Consistency and Clang-Tidy Configuration

- Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
- Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
- Replaced size checks with `empty()` method calls in various locations for clearer intent.
- Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency

- Added clang-tidy checks to the format script for improved code quality assurance.
- Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
- Updated the requirements-lint.txt file to include clang-tidy as a dependency.
- General code cleanup to adhere to best practices and improve maintainability.

* [CI] Update AMD CI Workflow to Include Build Directory Creation

- Added steps to create a build directory and configure CMake with ROCm support during the format check process.
- Ensured cleanup of the build directory after the format check to maintain a clean workspace.

* [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode

- Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
- This change enhances code clarity and maintainability by focusing on relevant attributes for each class.

* [Refactor] Update Clang-Tidy Integration and Code Improvements

- Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
- Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize

- Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
- Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
- General code cleanup to maintain consistency and adhere to best practices.

* [CI] Add Git Submodule Initialization to CI Workflow

- Included a step to initialize and update git submodules recursively in the CI workflow.
- This change ensures that all necessary submodules are available during the format check process, improving build reliability.

* [CI] Add Git Submodule Update Step to Format Check

- Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
- This enhancement ensures that all required submodules are available, contributing to improved build reliability.

* [Refactor] Update Function Signatures in AtomicAddVectorize

- Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
- This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.

cdc5d8d3

01 Sep, 2025 1 commit

[BugFix] Refactor the op check in LowerTileOp pass using the member function... · 68af2159

Zhengju Tang authored Sep 01, 2025

[BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771)

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match

* [Lint]

68af2159